Wisent evaluates steering quality across 168 benchmark families drawn from the LM Evaluation Harness. These benchmarks cover reasoning tasks, domain-expertise assessments, coding challenges, math problems, and safety-related evaluations, and together they give a broad picture of how steering affects downstream task performance. The benchmarks fall into the following categories:
General reasoning and world knowledge benchmarks
Mathematical reasoning and computation
Code generation and understanding
Safety evaluation and alignment testing
Natural language understanding tasks
Domain-specific benchmarks
# Run MMLU on 100 samples
python -m wisent.cli evaluate --benchmark mmlu --num-samples 100
# Run GSM8K on 50 samples
python -m wisent.cli evaluate --benchmark gsm8k --num-samples 50
# Run TruthfulQA with a steering vector applied at strength 1.5
python -m wisent.cli evaluate --benchmark truthfulqa --steering-vector honesty.pt --steering-strength 1.5
# List every available benchmark
python -m wisent.cli evaluate --list-benchmarks
# Run several benchmarks in a single invocation
python -m wisent.cli evaluate --benchmark mmlu,hellaswag,arc_challenge --num-samples 100
from wisent.core.benchmark_registry import (
    get_lm_eval_tasks,
    get_all_benchmarks,
    get_benchmark_config
)
# Get all available benchmarks
benchmarks = get_all_benchmarks()
print(f"Total benchmarks: {len(benchmarks)}")
# Get lm-eval-harness tasks
lm_eval_tasks = get_lm_eval_tasks()
print(f"LM-eval tasks: {len(lm_eval_tasks)}")
# Get config for specific benchmark
config = get_benchmark_config("mmlu")
print(f"MMLU config: {config}")from wisent import WisentModel
from wisent.core.evaluators import TaskEvaluator
# Load model
model = WisentModel("Qwen/Qwen3-4B")
# Create evaluator
evaluator = TaskEvaluator(
    model=model.model,
    tokenizer=model.tokenizer,
    benchmark="mmlu"
)
# Run evaluation
results = evaluator.evaluate(num_samples=100)
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Per-subject breakdown:")
for subject, score in results['subjects'].items():
print(f" {subject}: {score:.2%}")from wisent import WisentModel
import torch
# Load model and steering vector
model = WisentModel("Qwen/Qwen3-4B")
vector = torch.load("honesty.pt")
# Evaluate without steering
baseline = model.evaluate_benchmark("truthfulqa", num_samples=100)
print(f"Baseline TruthfulQA: {baseline['accuracy']:.2%}")
# Evaluate with steering
model.set_steering_vector(vector, layer=15, strength=1.5)
steered = model.evaluate_benchmark("truthfulqa", num_samples=100)
print(f"Steered TruthfulQA: {steered['accuracy']:.2%}")
print(f"Improvement: {steered['accuracy'] - baseline['accuracy']:.2%}")Here is the complete list of all 168 recognized benchmark tasks. Note that individual categories often include multiple related subtasks underneath them. MMLU alone, for instance, spans 57 distinct subjects.
For the complete benchmark configuration and registry, see: