evaluate-responses

Evaluate generated model responses against ground truth or baseline responses. Supports several evaluation modes, including personalization comparison against a baseline response set.

Basic Usage
python -m wisent evaluate-responses --input FILE --output FILE [OPTIONS]

Examples

Basic Evaluation
python -m wisent evaluate-responses \
  --input ./responses/truthfulqa_steered.json \
  --output ./evaluation/truthfulqa_results.json
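
The input is a JSON file of generated responses. The exact schema depends on how the responses were produced; the sketch below writes a minimal, hypothetical input file, where the task, prompt, and response field names are illustrative assumptions rather than a documented schema.

import json

# Hypothetical input layout: a task name plus a list of prompt/response pairs.
# The field names here are illustrative assumptions, not a documented schema.
responses = {
    "task": "truthfulqa",
    "responses": [
        {
            "prompt": "What happens if you crack your knuckles a lot?",
            "response": "Nothing harmful; research has found no link to arthritis.",
        },
    ],
}

with open("./responses/truthfulqa_steered.json", "w") as f:
    json.dump(responses, f, indent=2)
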
Personalization Evaluation
python -m wisent evaluate-responses \
  --input ./responses/personality_steered.json \
  --baseline ./responses/personality_baseline.json \
  --trait "British personality" \
  --trait-description "Responds with British expressions and cultural references" \
  --output ./evaluation/personality_results.json
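
Once the evaluation finishes, the results file can be inspected programmatically. A minimal sketch, assuming only that the output is a JSON object; the actual keys depend on the evaluation mode:

import json

# Load the evaluation results written by evaluate-responses.
with open("./evaluation/personality_results.json") as f:
    results = json.load(f)

# Print a flat summary of the top-level fields. This assumes only that the
# output is a JSON object; the actual keys depend on the evaluation mode.
for key, value in results.items():
    print(f"{key}: {value}")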

Arguments

Argument             Default      Description
--input              required     Input JSON file with generated responses
--output             required     Output JSON file for evaluation results
--baseline           -            Baseline responses JSON file (for comparison)
--task               from input   Task name (overrides the task from the input JSON)
--trait              -            Personality trait to evaluate
--trait-description  -            Description of the personality trait
--verbose            false        Verbose output
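
The optional flags compose with the required ones. For example, --task overrides a wrong or missing task name in the input file, and --verbose (assumed here to be a boolean flag, consistent with its false default) prints progress detail while the evaluation runs:

python -m wisent evaluate-responses \
  --input ./responses/truthfulqa_steered.json \
  --task truthfulqa \
  --verbose \
  --output ./evaluation/truthfulqa_results.json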
