Run evaluations in classification or steering mode. This is the primary command for evaluating model performance.
python -m wisent tasks [TASK_NAMES] [OPTIONS]
# Run MMLU benchmark with logistic classifier
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --limit 100 \
  --classifier-type logistic \
  --verbose

# Run TruthfulQA with steering
python -m wisent tasks truthfulqa_mc1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --steering-mode \
  --steering-strength 1.5 \
  --steering-method CAA

# Run all 37 available benchmarks
python -m wisent tasks --all \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15

# Train on one task, evaluate on another
python -m wisent tasks \
  --train-task mmlu \
  --eval-task truthfulqa_mc1 \
  --cross-benchmark \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15

# Generate synthetic pairs for a custom trait
python -m wisent tasks \
  --synthetic \
  --trait "responds more helpfully" \
  --num-synthetic-pairs 30 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
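
The invocations above can also be assembled from a script, which helps when sweeping over tasks or layers. The following is a minimal sketch, not part of Wisent itself: `build_tasks_command` is a hypothetical helper, and it only uses flags that appear in the examples above.

```python
import sys


def build_tasks_command(task_names, model="meta-llama/Llama-3.1-8B-Instruct",
                        layer="15", limit=None, extra=()):
    """Assemble an argv list for `python -m wisent tasks` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "wisent", "tasks",
        ",".join(task_names),          # task names are passed comma-separated
        "--model", model,
        "--layer", str(layer),
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    cmd += list(extra)                 # any further flags, passed through as-is
    return cmd


cmd = build_tasks_command(
    ["mmlu"], limit=100, extra=["--classifier-type", "logistic", "--verbose"]
)
print(" ".join(cmd))

# To actually run it (requires wisent to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems with values such as the `--trait` text.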

Task selection arguments:

| Argument | Description |
|---|---|
| task_names | Comma-separated list of task names or path to CSV/JSON file |
| --list-tasks | List all 37 available benchmark tasks |
| --task-info TASK | Show detailed information about a specific task |
| --all | Run all 37 available benchmarks |
| --skills | Select tasks by skill categories (coding, mathematics, reasoning) |
| --risks | Select tasks by risk categories (harmfulness, toxicity, hallucination) |

Model arguments:

| Argument | Default | Description |
|---|---|---|
| --model | Llama-3.1-8B-Instruct | Model name or path |
| --layer | 15 | Layer(s) for activations (15, 14-16, or 14,15,16) |
| --device | auto | Device (cuda, cpu, mps) |
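
The `--layer` argument accepts three shapes: a single layer (`15`), an inclusive range (`14-16`), or a comma-separated list (`14,15,16`). As an illustration of how such a value could expand into layer indices, here is a small sketch. `parse_layers` is a hypothetical function written for this page, and treating the range as inclusive is an assumption; Wisent's own parsing may differ.

```python
from typing import List


def parse_layers(spec: str) -> List[int]:
    """Expand a layer spec such as '15', '14-16', or '14,15,16' (hypothetical)."""
    spec = spec.strip()
    if "-" in spec:
        # Range form: assumed inclusive on both ends.
        start, end = (int(part) for part in spec.split("-", 1))
        if start > end:
            raise ValueError(f"invalid layer range: {spec!r}")
        return list(range(start, end + 1))
    # Single value or comma-separated list.
    return [int(part) for part in spec.split(",") if part.strip()]


for example in ("15", "14-16", "14,15,16"):
    print(example, "->", parse_layers(example))
```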

Classification arguments:

| Argument | Default | Description |
|---|---|---|
| --classifier-type | logistic | Classifier type (logistic, mlp) |
| --detection-threshold | 0.6 | Classification threshold |
| --token-aggregation | average | Token aggregation (average, final, first, max, min) |
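
To make the interaction between `--token-aggregation` and `--detection-threshold` concrete, the sketch below reduces a list of per-token scores to one value and compares it with the threshold. This is an assumption about the general idea, not Wisent's implementation: the function names are hypothetical, the scores are made up for illustration, and whether the comparison is `>=` or `>` is a guess.

```python
from statistics import mean

# One reducer per documented --token-aggregation choice.
AGGREGATORS = {
    "average": mean,
    "final": lambda scores: scores[-1],
    "first": lambda scores: scores[0],
    "max": max,
    "min": min,
}


def aggregate(token_scores, method="average"):
    """Collapse per-token scores into a single score (hypothetical helper)."""
    if not token_scores:
        raise ValueError("token_scores must not be empty")
    return AGGREGATORS[method](token_scores)


def is_flagged(token_scores, method="average", threshold=0.6):
    """Assumed decision rule: flag when the aggregated score reaches the threshold."""
    return aggregate(token_scores, method) >= threshold


scores = [0.2, 0.5, 0.9]  # illustrative per-token classifier scores
for method in AGGREGATORS:
    print(method, aggregate(scores, method), is_flagged(scores, method))
```

With the same scores, `average` and `max` can disagree about whether the threshold is crossed, which is why the aggregation choice matters as much as the threshold itself.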

Steering arguments:

| Argument | Default | Description |
|---|---|---|
| --steering-mode | false | Enable steering mode |
| --steering-strength | 1.0 | Steering vector strength |
| --steering-method | CAA | Steering method |
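
The role of `--steering-strength` is easiest to see in a toy example. The sketch below reads CAA as contrastive activation addition: a steering vector is the difference between the mean activation on positive examples and the mean on negative examples, and it is added to a hidden state after scaling by the strength. Both that reading and every name in the code are assumptions made for illustration; the vectors are tiny made-up lists, not real model activations.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def caa_vector(positive, negative):
    """Steering vector as mean(positive) - mean(negative) (assumed CAA form)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering vector, scaled by strength."""
    return [h + strength * v for h, v in zip(hidden, vector)]


positive = [[1.0, 2.0], [3.0, 4.0]]   # toy activations for the desired behavior
negative = [[0.0, 1.0], [2.0, 1.0]]   # toy activations for the opposite behavior

vector = caa_vector(positive, negative)
print(vector)
print(steer([0.0, 0.0], vector, strength=1.5))
```

Under this reading, a strength of 1.0 adds the vector unchanged, and larger values push the activations further in the same direction.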

Data arguments:

| Argument | Default | Description |
|---|---|---|
| --limit | None | Limit total samples |
| --training-limit | None | Limit training samples |
| --testing-limit | None | Limit testing samples |
| --split-ratio | 0.8 | Train/test split ratio |
| --seed | 42 | Random seed |
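
The data arguments combine into a seeded shuffle, an optional cap, and a train/test split. The sketch below shows one plausible ordering of those steps. It is an assumption rather than Wisent's actual behavior: `split_samples` is a hypothetical function, and the order in which the limits are applied is a guess.

```python
import random


def split_samples(samples, split_ratio=0.8, seed=42, limit=None,
                  training_limit=None, testing_limit=None):
    """Shuffle with a fixed seed, cap the total, then split (hypothetical)."""
    items = list(samples)
    random.Random(seed).shuffle(items)      # same seed gives the same order
    if limit is not None:
        items = items[:limit]               # assumed: --limit applies before the split
    cut = round(len(items) * split_ratio)
    train, test = items[:cut], items[cut:]
    if training_limit is not None:
        train = train[:training_limit]
    if testing_limit is not None:
        test = test[:testing_limit]
    return train, test


train, test = split_samples(range(200), limit=100)
print(len(train), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.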

Output arguments:

| Argument | Description |
|---|---|
| --save-classifier | Save trained classifier to path |
| --load-classifier | Load classifier from path |
| --save-steering-vector | Save steering vector to path |
| --load-steering-vector | Load steering vector from path |
| --output | Output directory for results (default: ./results) |

Run python -m wisent tasks --list-tasks to see all 37 available benchmarks.