Evaluate generated model responses against ground truth or baseline responses. Supports various evaluation modes including personalization comparison.
python -m wisent evaluate-responses --input FILE --output FILE [OPTIONS]
python -m wisent evaluate-responses \ --input ./responses/truthfulqa_steered.json \ --output ./evaluation/truthfulqa_results.json
python -m wisent evaluate-responses \ --input ./responses/personality_steered.json \ --baseline ./responses/personality_baseline.json \ --trait "British personality" \ --trait-description "Responds with British expressions and cultural references" \ --output ./evaluation/personality_results.json
| Argument | Default | Description |
|---|---|---|
| --input | required | Input JSON file with generated responses |
| --output | required | Output JSON file for evaluation results |
| --baseline | - | Baseline responses JSON file (for comparison) |
| --task | from input | Task name (overrides task from input JSON) |
| --trait | - | Personality trait to evaluate |
| --trait-description | - | Description of the personality trait |
| --verbose | false | Verbose output |
Stay in the loop. Never miss out.
Subscribe to our newsletter and unlock Wisent insights.