Supported Benchmarks

Wisent evaluates steering quality across 168 benchmark families drawn from the LM Evaluation Harness (lm-eval-harness). These benchmarks cover general reasoning, domain expertise, coding, mathematics, and safety-related evaluations. Together they give a broad picture of how well a steering intervention holds up in practice.

Benchmark Categories

Reasoning & Knowledge

General reasoning and world knowledge benchmarks

mmlu, hellaswag, winogrande, arc_challenge, arc_easy, piqa, boolq, openbookqa, triviaqa, naturalqa

Math & Quantitative

Mathematical reasoning and computation

gsm8k, math, minerva_math, mathqa, asdiv, svamp, mawps, aqua_rat

Coding & Programming

Code generation and understanding

humaneval, mbpp, livecodebench, codex_eval, apps, ds1000

Safety & Alignment

Safety evaluation and alignment testing

truthfulqa, toxigen, bbq, winobias, bold, realtoxicityprompts

Language & NLU

Natural language understanding tasks

squad, race, quac, coqa, drop, lambada, wsc, copa, rte

Specialized Domains

Domain-specific benchmarks

medqa, pubmedqa, medmcqa, legalqa, finqa, sciq, commonsenseqa, strategyqa
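
Checking category task names against the registry

The task names listed in these categories can be verified programmatically before launching a run. The sketch below assumes that get_all_benchmarks() from wisent.core.benchmark_registry (documented in the Python API section further down) returns a collection of task-name strings; if it returns richer configuration objects instead, adapt the membership check accordingly.

from wisent.core.benchmark_registry import get_all_benchmarks

# Task names copied from the "Safety & Alignment" category above
safety_tasks = [
    "truthfulqa", "toxigen", "bbq",
    "winobias", "bold", "realtoxicityprompts",
]

# Assumption: get_all_benchmarks() yields task-name strings
available = set(get_all_benchmarks())

for task in safety_tasks:
    status = "available" if task in available else "NOT FOUND"
    print(f"{task}: {status}")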

CLI Examples

Evaluate on MMLU
python -m wisent.cli evaluate --benchmark mmlu --num-samples 100

Evaluate on GSM8K (math)
python -m wisent.cli evaluate --benchmark gsm8k --num-samples 50

Evaluate with steering vector
python -m wisent.cli evaluate --benchmark truthfulqa --steering-vector honesty.pt --steering-strength 1.5

List all available benchmarks
python -m wisent.cli evaluate --list-benchmarks

Evaluate on multiple benchmarks
python -m wisent.cli evaluate --benchmark mmlu,hellaswag,arc_challenge --num-samples 100

Python API

Using benchmark registry
from wisent.core.benchmark_registry import (
    get_lm_eval_tasks,
    get_all_benchmarks,
    get_benchmark_config
)

# Get all available benchmarks
benchmarks = get_all_benchmarks()
print(f"Total benchmarks: {len(benchmarks)}")

# Get lm-eval-harness tasks
lm_eval_tasks = get_lm_eval_tasks()
print(f"LM-eval tasks: {len(lm_eval_tasks)}")

# Get config for specific benchmark
config = get_benchmark_config("mmlu")
print(f"MMLU config: {config}")
Running benchmark evaluation
from wisent import WisentModel
from wisent.core.evaluators import TaskEvaluator

# Load model
model = WisentModel("Qwen/Qwen3-4B")

# Create evaluator
evaluator = TaskEvaluator(
    model=model.model,
    tokenizer=model.tokenizer,
    benchmark="mmlu"
)

# Run evaluation
results = evaluator.evaluate(num_samples=100)

print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Per-subject breakdown:")
for subject, score in results['subjects'].items():
    print(f"  {subject}: {score:.2%}")
Evaluating steering effectiveness on benchmarks
from wisent import WisentModel
import torch

# Load model and steering vector
model = WisentModel("Qwen/Qwen3-4B")
vector = torch.load("honesty.pt")

# Evaluate without steering
baseline = model.evaluate_benchmark("truthfulqa", num_samples=100)
print(f"Baseline TruthfulQA: {baseline['accuracy']:.2%}")

# Evaluate with steering
model.set_steering_vector(vector, layer=15, strength=1.5)
steered = model.evaluate_benchmark("truthfulqa", num_samples=100)
print(f"Steered TruthfulQA: {steered['accuracy']:.2%}")
print(f"Improvement: {steered['accuracy'] - baseline['accuracy']:.2%}")

Complete Benchmark List

Below is the complete list of recognized benchmark families. Note that many entries expand into multiple related subtasks; MMLU alone, for instance, spans 57 distinct subjects.

aclue, aexams, agieval, ai2_arc, algebra_lin, anli, apps, arabic_exams, arc_challenge, arc_easy, asdiv, basque_trivia, bbh, bbq, belebele, bigbench, blimp, bold, boolq, cb, ceval, cmmlu, cnn_dailymail, code_x_glue, codex_eval, commonsenseqa, copa, coqa, csqa, crows_pairs, csatqa, dataset_qa, drop, ds1000, eq_bench, ethicai, europarl, fda, finqa, flores, gem, glue, gpqa, gsm8k, haerae, hellaswag, hendrycks, hle, humaneval, ifeval, iwslt, japanese_mt, jcommonsenseqa, klue, kobest, kold, korean_commongen, korean_wmt, lambada, legalbench, legalqa, logiqa, lsat_qa, math, mathqa, mawps, mbpp, medmcqa, medqa, mgsm, minerva_math, mmlupro, mmlu, model_written_evals, multilegalpile, mutual, naturalqa, newsqa, nq_open, okapi, openbookqa, paws, pawsx, pile, piqa, polemo, pubmedqa, qa4mre, qasper, quac, quality, race, realms, realtoxicityprompts, record, rte, sciq, scrolls, siqa, squad, storycloze, strategyqa, super_glue, super_gpqa, svamp, swag, tinygsm, toxigen, triviaqa, truthfulqa, tydiqa, unscramble, webqs, wic, wikitext, winogrande, winobias, wnli, wsc, xcopa, xnli, xquad, xstorycloze, xsum, xwinograd
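
Listing subtasks for a benchmark family

Because many entries above are families that expand into subtasks, it can help to see exactly which task names the registry exposes for one of them. A minimal sketch, assuming get_lm_eval_tasks() returns an iterable of task-name strings:

from wisent.core.benchmark_registry import get_lm_eval_tasks

# Assumption: get_lm_eval_tasks() returns task-name strings
tasks = list(get_lm_eval_tasks())

# Collect tasks that share the mmlu prefix (the 57 MMLU subjects, if present)
mmlu_tasks = sorted(t for t in tasks if t.startswith("mmlu"))
print(f"MMLU-related tasks: {len(mmlu_tasks)}")
for name in mmlu_tasks[:10]:
    print(f"  {name}")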

For the complete benchmark configuration and registry, see the wisent.core.benchmark_registry module.
