tasks

Run evaluations on benchmark tasks using either classification or steering mode.

Basic Usage
```
python -m wisent tasks [TASK_NAMES] [OPTIONS]
```

Examples

Classification Mode
```
# Run MMLU benchmark with logistic classifier
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --limit 100 \
  --classifier-type logistic \
  --verbose
```

Steering Mode
```
# Run TruthfulQA with steering
python -m wisent tasks truthfulqa_mc1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --steering-mode \
  --steering-strength 1.5 \
  --steering-method CAA
```

Run All Benchmarks
```
# Run all 37 available benchmarks
python -m wisent tasks --all \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```

Cross-Benchmark Evaluation
```
# Train on one task, evaluate on another
python -m wisent tasks \
  --train-task mmlu \
  --eval-task truthfulqa_mc1 \
  --cross-benchmark \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```

Synthetic Pairs Mode
```
# Generate synthetic pairs for a custom trait
python -m wisent tasks \
  --synthetic \
  --trait "responds more helpfully" \
  --num-synthetic-pairs 30 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```

Arguments

Task Selection

| Argument | Description |
| --- | --- |
| `task_names` | Comma-separated list of task names or path to CSV/JSON file |
| `--list-tasks` | List all 37 available benchmark tasks |
| `--task-info TASK` | Show detailed information about a specific task |
| `--all` | Run all 37 available benchmarks |
| `--skills` | Select tasks by skill categories (coding, mathematics, reasoning) |
| `--risks` | Select tasks by risk categories (harmfulness, toxicity, hallucination) |
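Skill- and risk-based selection can be combined with the usual model flags. The sketch below assumes `--skills` accepts a comma-separated list of category names, analogous to `task_names`; check `--list-tasks` and `--help` for the exact syntax in your version.

```shell
# Run only the coding and reasoning benchmark subsets
# (comma-separated category list is an assumption)
python -m wisent tasks \
  --skills coding,reasoning \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```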

Model Configuration

| Argument | Default | Description |
| --- | --- | --- |
| `--model` | `Llama-3.1-8B-Instruct` | Model name or path |
| `--layer` | `15` | Layer(s) for activations (`15`, `14-16`, or `14,15,16`) |
| `--device` | `auto` | Device (`cuda`, `cpu`, `mps`) |

Classification Options

| Argument | Default | Description |
| --- | --- | --- |
| `--classifier-type` | `logistic` | Classifier type (`logistic`, `mlp`) |
| `--detection-threshold` | `0.6` | Classification threshold |
| `--token-aggregation` | `average` | Token aggregation (`average`, `final`, `first`, `max`, `min`) |
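These flags can be combined with any classification run. A sketch combining the options above (the specific flag values are illustrative, not recommendations):

```shell
# Use an MLP classifier with a stricter threshold,
# aggregating over the final token's activation
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --classifier-type mlp \
  --detection-threshold 0.8 \
  --token-aggregation final
```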

Steering Options

| Argument | Default | Description |
| --- | --- | --- |
| `--steering-mode` | `false` | Enable steering mode |
| `--steering-strength` | `1.0` | Steering vector strength |
| `--steering-method` | `CAA` | Steering method |

Data Options

| Argument | Default | Description |
| --- | --- | --- |
| `--limit` | `None` | Limit total samples |
| `--training-limit` | `None` | Limit training samples |
| `--testing-limit` | `None` | Limit testing samples |
| `--split-ratio` | `0.8` | Train/test split ratio |
| `--seed` | `42` | Random seed |
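For quick, reproducible experiments, the data flags above can cap the sample counts explicitly instead of relying on `--split-ratio`. A sketch (the sample counts and seed are arbitrary):

```shell
# Train on at most 200 samples, test on 50, with a fixed seed
python -m wisent tasks truthfulqa_mc1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --training-limit 200 \
  --testing-limit 50 \
  --seed 123
```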

Save/Load Options

| Argument | Description |
| --- | --- |
| `--save-classifier` | Save trained classifier to path |
| `--load-classifier` | Load classifier from path |
| `--save-steering-vector` | Save steering vector to path |
| `--load-steering-vector` | Load steering vector from path |
| `--output` | Output directory for results (default: `./results`) |
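A typical save/load workflow is to train a classifier once, then reuse it in later runs instead of retraining. A sketch assuming the flags above take a file path (the path and filename are illustrative):

```shell
# Train once and save the classifier to disk...
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --save-classifier ./models/mmlu_classifier.pt

# ...then reload it for a later evaluation run
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --load-classifier ./models/mmlu_classifier.pt
```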

Available Benchmarks

Run `python -m wisent tasks --list-tasks` to see all 37 available benchmarks, including:

  • truthfulqa_mc1, truthfulqa_mc2 - Truthfulness evaluation
  • mmlu - Massive Multitask Language Understanding
  • hellaswag - Commonsense reasoning
  • arc_challenge, arc_easy - AI2 Reasoning Challenge
  • winogrande - Commonsense reasoning
  • gsm8k - Grade school math
  • humaneval - Code generation
  • And many more...
