tasks

Run evaluations on benchmark tasks using either classification or steering mode.

Basic Usage
```
python -m wisent tasks [TASK_NAMES] [OPTIONS]
```

Examples

Classification Mode
```
# Run MMLU benchmark with logistic classifier
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --limit 100 \
  --classifier-type logistic \
  --verbose
```

Steering Mode
```
# Run TruthfulQA with steering
python -m wisent tasks truthfulqa_mc1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --steering-mode \
  --steering-strength 1.5 \
  --steering-method CAA
```

Run All Benchmarks
```
# Run all 37 available benchmarks
python -m wisent tasks --all \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```

Cross-Benchmark Evaluation
```
# Train on one task, evaluate on another
python -m wisent tasks \
  --train-task mmlu \
  --eval-task truthfulqa_mc1 \
  --cross-benchmark \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```

Synthetic Pairs Mode
```
# Generate synthetic pairs for a custom trait
python -m wisent tasks \
  --synthetic \
  --trait "responds more helpfully" \
  --num-synthetic-pairs 30 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```

Arguments

Task Selection

| Argument | Description |
| --- | --- |
| `task_names` | Comma-separated list of task names or path to CSV/JSON file |
| `--list-tasks` | List all 37 available benchmark tasks |
| `--task-info TASK` | Show detailed information about a specific task |
| `--all` | Run all 37 available benchmarks |
| `--skills` | Select tasks by skill categories (coding, mathematics, reasoning) |
| `--risks` | Select tasks by risk categories (harmfulness, toxicity, hallucination) |
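Skill- and risk-based selection can be combined with the usual model flags. The sketch below assumes `--skills` accepts a comma-separated list of category names, analogous to `task_names`; check `--list-tasks` and `--help` for the exact syntax in your version.

```shell
# Run only the coding and reasoning benchmark subsets
# (comma-separated category list is an assumption)
python -m wisent tasks \
  --skills coding,reasoning \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15
```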

Model Configuration

| Argument | Default | Description |
| --- | --- | --- |
| `--model` | `Llama-3.1-8B-Instruct` | Model name or path |
| `--layer` | `15` | Layer(s) for activations (`15`, `14-16`, or `14,15,16`) |
| `--device` | `auto` | Device (`cuda`, `cpu`, `mps`) |

Classification Options

| Argument | Default | Description |
| --- | --- | --- |
| `--classifier-type` | `logistic` | Classifier type (`logistic`, `mlp`) |
| `--detection-threshold` | `0.6` | Classification threshold |
| `--token-aggregation` | `average` | Token aggregation (`average`, `final`, `first`, `max`, `min`) |
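These flags can be combined with any classification run. A sketch combining the options above (the specific flag values are illustrative, not recommendations):

```shell
# Use an MLP classifier with a stricter threshold,
# aggregating over the final token's activation
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --classifier-type mlp \
  --detection-threshold 0.8 \
  --token-aggregation final
```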

Steering Options

| Argument | Default | Description |
| --- | --- | --- |
| `--steering-mode` | `false` | Enable steering mode |
| `--steering-strength` | `1.0` | Steering vector strength |
| `--steering-method` | `CAA` | Steering method |

Data Options

| Argument | Default | Description |
| --- | --- | --- |
| `--limit` | `None` | Limit total samples |
| `--training-limit` | `None` | Limit training samples |
| `--testing-limit` | `None` | Limit testing samples |
| `--split-ratio` | `0.8` | Train/test split ratio |
| `--seed` | `42` | Random seed |
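For quick, reproducible experiments, the data flags above can cap the sample counts explicitly instead of relying on `--split-ratio`. A sketch (the sample counts and seed are arbitrary):

```shell
# Train on at most 200 samples, test on 50, with a fixed seed
python -m wisent tasks truthfulqa_mc1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --training-limit 200 \
  --testing-limit 50 \
  --seed 123
```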

Save/Load Options

| Argument | Description |
| --- | --- |
| `--save-classifier` | Save trained classifier to path |
| `--load-classifier` | Load classifier from path |
| `--save-steering-vector` | Save steering vector to path |
| `--load-steering-vector` | Load steering vector from path |
| `--output` | Output directory for results (default: `./results`) |
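A typical save/load workflow is to train a classifier once, then reuse it in later runs instead of retraining. A sketch assuming the flags above take a file path (the path and filename are illustrative):

```shell
# Train once and save the classifier to disk...
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --save-classifier ./models/mmlu_classifier.pt

# ...then reload it for a later evaluation run
python -m wisent tasks mmlu \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --layer 15 \
  --load-classifier ./models/mmlu_classifier.pt
```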

Available Benchmarks

Run `python -m wisent tasks --list-tasks` to see all 37 available benchmarks, including:

  • truthfulqa_mc1, truthfulqa_mc2 - Truthfulness evaluation
  • mmlu - Massive Multitask Language Understanding
  • hellaswag - Commonsense reasoning
  • arc_challenge, arc_easy - AI2 Reasoning Challenge
  • winogrande - Commonsense reasoning
  • gsm8k - Grade school math
  • humaneval - Code generation
  • And many more...
