Supported Benchmarks

Wisent evaluates steering quality across 168 benchmark families drawn from the LM Evaluation Harness (lm-eval-harness). These benchmarks cover general reasoning, domain expertise, coding, mathematics, and safety-related evaluations. Together they give a broad picture of how well a steering intervention holds up in practice.

Benchmark Categories

Reasoning & Knowledge

General reasoning and world knowledge benchmarks

mmlu, hellaswag, winogrande, arc_challenge, arc_easy, piqa, boolq, openbookqa, triviaqa, naturalqa

Math & Quantitative

Mathematical reasoning and computation

gsm8k, math, minerva_math, mathqa, asdiv, svamp, mawps, aqua_rat

Coding & Programming

Code generation and understanding

humaneval, mbpp, livecodebench, codex_eval, apps, ds1000

Safety & Alignment

Safety evaluation and alignment testing

truthfulqa, toxigen, bbq, winobias, bold, realtoxicityprompts

Language & NLU

Natural language understanding tasks

squad, race, quac, coqa, drop, lambada, wsc, copa, rte

Specialized Domains

Domain-specific benchmarks

medqa, pubmedqa, medmcqa, legalqa, finqa, sciq, commonsenseqa, strategyqa
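
Checking category task names against the registry

The task names listed in these categories can be verified programmatically before launching a run. The sketch below assumes that get_all_benchmarks() from wisent.core.benchmark_registry (documented in the Python API section further down) returns a collection of task-name strings; if it returns richer configuration objects instead, adapt the membership check accordingly.

from wisent.core.benchmark_registry import get_all_benchmarks

# Task names copied from the "Safety & Alignment" category above
safety_tasks = [
    "truthfulqa", "toxigen", "bbq",
    "winobias", "bold", "realtoxicityprompts",
]

# Assumption: get_all_benchmarks() yields task-name strings
available = set(get_all_benchmarks())

for task in safety_tasks:
    status = "available" if task in available else "NOT FOUND"
    print(f"{task}: {status}")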

CLI Examples

Evaluate on MMLU
python -m wisent.cli evaluate --benchmark mmlu --num-samples 100

Evaluate on GSM8K (math)
python -m wisent.cli evaluate --benchmark gsm8k --num-samples 50

Evaluate with steering vector
python -m wisent.cli evaluate --benchmark truthfulqa --steering-vector honesty.pt --steering-strength 1.5

List all available benchmarks
python -m wisent.cli evaluate --list-benchmarks

Evaluate on multiple benchmarks
python -m wisent.cli evaluate --benchmark mmlu,hellaswag,arc_challenge --num-samples 100

Python API

Using benchmark registry
from wisent.core.benchmark_registry import (
    get_lm_eval_tasks,
    get_all_benchmarks,
    get_benchmark_config
)

# Get all available benchmarks
benchmarks = get_all_benchmarks()
print(f"Total benchmarks: {len(benchmarks)}")

# Get lm-eval-harness tasks
lm_eval_tasks = get_lm_eval_tasks()
print(f"LM-eval tasks: {len(lm_eval_tasks)}")

# Get config for specific benchmark
config = get_benchmark_config("mmlu")
print(f"MMLU config: {config}")
Running benchmark evaluation
from wisent import WisentModel
from wisent.core.evaluators import TaskEvaluator

# Load model
model = WisentModel("Qwen/Qwen3-4B")

# Create evaluator
evaluator = TaskEvaluator(
    model=model.model,
    tokenizer=model.tokenizer,
    benchmark="mmlu"
)

# Run evaluation
results = evaluator.evaluate(num_samples=100)

print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Per-subject breakdown:")
for subject, score in results['subjects'].items():
    print(f"  {subject}: {score:.2%}")
Evaluating steering effectiveness on benchmarks
from wisent import WisentModel
import torch

# Load model and steering vector
model = WisentModel("Qwen/Qwen3-4B")
vector = torch.load("honesty.pt")

# Evaluate without steering
baseline = model.evaluate_benchmark("truthfulqa", num_samples=100)
print(f"Baseline TruthfulQA: {baseline['accuracy']:.2%}")

# Evaluate with steering
model.set_steering_vector(vector, layer=15, strength=1.5)
steered = model.evaluate_benchmark("truthfulqa", num_samples=100)
print(f"Steered TruthfulQA: {steered['accuracy']:.2%}")
print(f"Improvement: {steered['accuracy'] - baseline['accuracy']:.2%}")

Complete Benchmark List

Below is the complete list of recognized benchmark families. Note that many entries expand into multiple related subtasks; MMLU alone, for instance, spans 57 distinct subjects.

aclue, aexams, agieval, ai2_arc, algebra_lin, anli, apps, arabic_exams, arc_challenge, arc_easy, asdiv, basque_trivia, bbh, bbq, belebele, bigbench, blimp, bold, boolq, cb, ceval, cmmlu, cnn_dailymail, code_x_glue, codex_eval, commonsenseqa, copa, coqa, csqa, crows_pairs, csatqa, dataset_qa, drop, ds1000, eq_bench, ethicai, europarl, fda, finqa, flores, gem, glue, gpqa, gsm8k, haerae, hellaswag, hendrycks, hle, humaneval, ifeval, iwslt, japanese_mt, jcommonsenseqa, klue, kobest, kold, korean_commongen, korean_wmt, lambada, legalbench, legalqa, logiqa, lsat_qa, math, mathqa, mawps, mbpp, medmcqa, medqa, mgsm, minerva_math, mmlupro, mmlu, model_written_evals, multilegalpile, mutual, naturalqa, newsqa, nq_open, okapi, openbookqa, paws, pawsx, pile, piqa, polemo, pubmedqa, qa4mre, qasper, quac, quality, race, realms, realtoxicityprompts, record, rte, sciq, scrolls, siqa, squad, storycloze, strategyqa, super_glue, super_gpqa, svamp, swag, tinygsm, toxigen, triviaqa, truthfulqa, tydiqa, unscramble, webqs, wic, wikitext, winogrande, winobias, wnli, wsc, xcopa, xnli, xquad, xstorycloze, xsum, xwinograd
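
Listing subtasks for a benchmark family

Because many entries above are families that expand into subtasks, it can help to see exactly which task names the registry exposes for one of them. A minimal sketch, assuming get_lm_eval_tasks() returns an iterable of task-name strings:

from wisent.core.benchmark_registry import get_lm_eval_tasks

# Assumption: get_lm_eval_tasks() returns task-name strings
tasks = list(get_lm_eval_tasks())

# Collect tasks that share the mmlu prefix (the 57 MMLU subjects, if present)
mmlu_tasks = sorted(t for t in tasks if t.startswith("mmlu"))
print(f"MMLU-related tasks: {len(mmlu_tasks)}")
for name in mmlu_tasks[:10]:
    print(f"  {name}")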

For the complete benchmark configuration and registry, see the wisent.core.benchmark_registry module.
