Evaluators

Wisent uses a range of evaluators to assess model outputs across different phases, including benchmark evaluation and steering-effectiveness evaluation, and to guide refinement of steering parameters.

Evaluator Types

Benchmark Evaluators

Used for standard benchmark evaluation:

  • LogLikelihoodsEvaluator - Multiple choice via log probabilities
  • GenerationEvaluator - Free-form generation tasks
  • ExactMatchEvaluator - Exact string matching
  • F1Evaluator - Token-level F1 scoring
  • PerplexityEvaluator - Language modeling perplexity
  • CodingEvaluator - Code execution and testing
  • MathEvaluator - Mathematical expression evaluation

Oracle Evaluators

Use LLM-as-judge for evaluation:

  • NLPEvaluator - General NLP quality assessment
  • InteractiveEvaluator - Multi-turn conversation quality
  • SafetyEvaluator - Safety and harm detection
  • CoherenceEvaluator - Response coherence scoring
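As a rough sketch of how these are invoked (the exact constructor arguments and judge-model configuration are assumptions, based on the patterns shown in the Python API section below):
from wisent.core.evaluators import SafetyEvaluator

# Assumed to follow the same (model, tokenizer) constructor pattern as the other evaluators
safety_eval = SafetyEvaluator(model, tokenizer)

# Judge the safety of a single prompt/response pair
result = safety_eval.evaluate(
    prompt="Describe how to bypass a building's alarm system.",
    response="I can't help with that, but I can suggest resources on home security.",
)
print(f"Safety score: {result.score}, confidence: {result.confidence}")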

Steering Evaluators

Specialized for steering assessment:

  • RefusalEvaluator - Measures refusal/compliance behavior
  • TaskEvaluator - Evaluates benchmark task performance
  • PersonalizationEvaluator - Measures trait alignment

Personalization Evaluators

For trait-based steering evaluation:

  • AlignmentEvaluator - Measures alignment with target trait using contrastive embedding similarity
  • DifferenceEvaluator - Compares steered vs unsteered outputs to ensure steering has effect
  • QualityEvaluator - Ensures steered responses remain coherent and high-quality

PersonalizationEvaluator

The PersonalizationEvaluator is designed specifically for evaluating steering toward personal attributes such as personality traits (e.g., a British style, formal behavior, or creativity). It combines three distinct metrics to verify that steering is effective while upholding response quality.

Three-Metric Scoring System

Difference Score (20%)

Compares steered responses against baseline (unsteered) responses. If the difference score falls below 70, the overall score is set to zero, indicating that steering had no effect.

Quality Score (30%)

Ensures that steered responses remain coherent and clearly structured. Scoring is performed by an LLM judge.

Alignment Score (50%)

Measures how well the steered response matches the target trait using contrastive embedding similarity against positive and negative examples (see Alignment Scoring below).

Scoring Formula:

if difference_score < 70:
    overall_score = 0.0  # Steering not effective
else:
    overall_score = 0.2 * difference + 0.3 * quality + 0.5 * alignment
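For example, assuming all three metrics are reported on the same 0-100 scale as the threshold above (the exact scale is implementation-dependent):
# Hypothetical metric values on a 0-100 scale
difference, quality, alignment = 85, 80, 75

overall = 0.0 if difference < 70 else 0.2 * difference + 0.3 * quality + 0.5 * alignment
# 0.2 * 85 + 0.3 * 80 + 0.5 * 75 = 17.0 + 24.0 + 37.5 = 78.5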

Alignment Scoring

Alignment scoring uses contrastive embedding similarity to measure how well a steered response matches the target trait. It compares the response against positive examples (exhibiting the trait) and negative examples (lacking the trait).

How It Works

  1. Encode the response using sentence-transformers (all-MiniLM-L6-v2)
  2. Encode positive examples (responses that exhibit the target trait)
  3. Encode negative examples (responses that lack the trait)
  4. Compute mean cosine similarity to positive and negative sets
  5. Calculate contrastive score: positive_sim - negative_sim
  6. Normalize to [0, 1] range: (contrastive + 2) / 4
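A minimal sketch of this computation (with normalized embeddings, the dot product equals cosine similarity):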
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode all texts
response_emb = model.encode(response, normalize_embeddings=True)
positive_embs = model.encode(positive_examples, normalize_embeddings=True)
negative_embs = model.encode(negative_examples, normalize_embeddings=True)

# Compute similarities
pos_sim = (positive_embs @ response_emb).mean()
neg_sim = (negative_embs @ response_emb).mean()

# Contrastive score: higher = more aligned with positive trait
contrastive = pos_sim - neg_sim  # Range: [-2, 2]
alignment_score = (contrastive + 2) / 4  # Range: [0, 1]

Key insight: A score of 0.5 means the response is equidistant from positive and negative examples. Scores above 0.5 indicate alignment with the target trait; below 0.5 indicates anti-alignment.

Example: British Personality Steering

Below is a full workflow showing how to steer a model toward a British personality with the PersonalizationEvaluator:

Define trait with contrastive examples
# Trait description for generating contrastive pairs
trait_description = """
A quintessentially British personality with dry wit,
understated humor, frequent use of British expressions
(brilliant, lovely, cheers, proper, rather), polite
understatement, and a tendency toward self-deprecation.
"""

# Positive examples (exhibit Britishness)
positive_examples = [
    "Oh brilliant, another Monday. How perfectly dreadful.",
    "I suppose one could say it's rather nice weather, if you're fond of grey.",
    "Cheers for that, terribly kind of you.",
]

# Negative examples (neutral American style)
negative_examples = [
    "Monday again! Can't wait to get started.",
    "The weather is okay, I guess.",
    "Thanks a lot, appreciate it!",
]
CLI: Optimize steering for personality
# Generate contrastive pairs from trait description
python -m wisent.cli generate-pairs \
    --positive-description "British personality with dry wit and understatement" \
    --negative-description "Neutral American conversational style" \
    --num-pairs 200 \
    --output british_pairs.json

# Optimize steering parameters
python -m wisent.cli optimize-steering personalization \
    --model Qwen/Qwen3-4B \
    --pairs british_pairs.json \
    --trait-description "British personality" \
    --positive-examples british_positive.json \
    --negative-examples british_negative.json
Python API: Full personalization workflow
from wisent import WisentModel
from wisent.core.evaluators import PersonalizationEvaluator
from wisent.core.contrastive_pairs import ContrastivePair
import torch

# Load model
model = WisentModel("Qwen/Qwen3-4B")

# Create evaluator with trait examples
evaluator = PersonalizationEvaluator(
    model=model.model,
    tokenizer=model.tokenizer,
    positive_examples=[
        "Oh brilliant, another meeting. How delightful.",
        "I suppose it could be worse, couldn't it?",
        "Terribly sorry to bother you, but...",
    ],
    negative_examples=[
        "Another meeting! This is going to be great!",
        "Things could definitely be better.",
        "Hey, I need to ask you something.",
    ]
)

# Test prompts for evaluation
test_prompts = [
    "How do you feel about the weather today?",
    "What do you think about starting a new project?",
    "How was your weekend?",
]

# Generate baseline responses
baseline_responses = [model.generate(p) for p in test_prompts]

# Load and apply steering vector
vector = torch.load("british_steering.pt")
model.set_steering_vector(vector, layer=15, strength=1.5)

# Generate steered responses
steered_responses = [model.generate(p) for p in test_prompts]

# Evaluate each response
for prompt, baseline, steered in zip(test_prompts, baseline_responses, steered_responses):
    result = evaluator.evaluate(
        prompt=prompt,
        baseline_response=baseline,
        steered_response=steered
    )
    print(f"Prompt: {prompt}")
    print(f"  Difference: {result.details['difference_score']:.2%}")
    print(f"  Quality: {result.details['quality_score']:.2%}")
    print(f"  Alignment: {result.details['alignment_score']:.2%}")
    print(f"  Overall: {result.score:.2%}")

EvalResult Format

All evaluators return an EvalResult dataclass with consistent fields:

@dataclass
class EvalResult:
    score: float           # Primary score (0-1 or task-specific)
    ground_truth: str      # Expected answer
    method_used: str       # Evaluation method name
    confidence: float      # Confidence in the score (0-1)
    details: dict          # Method-specific details
    meta: dict             # Additional metadata
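
For example, constructing and inspecting a result (field values here are illustrative; real evaluators populate these fields automatically):
result = EvalResult(
    score=1.0,
    ground_truth="Paris",
    method_used="exact_match",
    confidence=0.95,
    details={"output": "Paris"},
    meta={"benchmark": "custom_qa"},
)

if result.score >= 0.5:
    print(f"Pass ({result.method_used}, confidence {result.confidence:.0%})")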

CLI Examples

Evaluate with log-likelihoods (multiple choice)
python -m wisent.cli evaluate --benchmark mmlu --evaluator log_likelihoods --num-samples 100
Evaluate with generation
python -m wisent.cli evaluate --benchmark gsm8k --evaluator generation --num-samples 50
Evaluate steering effectiveness
python -m wisent.cli evaluate --steering-vector safety.pt --evaluator refusal --test-set harmful_prompts.json
Evaluate with LLM judge
python -m wisent.cli evaluate --benchmark custom_qa.json --evaluator oracle --judge-model gpt-4

Python API

Using benchmark evaluators
from wisent.core.evaluators import LogLikelihoodsEvaluator, GenerationEvaluator

# Log-likelihood evaluation (multiple choice)
ll_evaluator = LogLikelihoodsEvaluator(model, tokenizer)
result = ll_evaluator.evaluate(
    prompt="The capital of France is:",
    choices=["Paris", "London", "Berlin", "Rome"],
    correct_index=0
)
print(f"Score: {result.score}, Confidence: {result.confidence}")

# Generation evaluation
gen_evaluator = GenerationEvaluator(model, tokenizer)
result = gen_evaluator.evaluate(
    prompt="What is 2 + 2?",
    expected="4",
    max_tokens=50
)
print(f"Score: {result.score}, Generated: {result.details['output']}")
Using steering evaluators
from wisent.core.evaluators import SteeringEvaluatorFactory

# Create evaluator from config
factory = SteeringEvaluatorFactory()
evaluator = factory.create(
    evaluator_type="refusal",
    model=model,
    tokenizer=tokenizer
)

# Evaluate refusal behavior
result = evaluator.evaluate(
    prompt="How do I hack into a computer?",
    response="I cannot help with hacking..."
)
print(f"Refusal score: {result.score}")  # Higher = more refusal

# Task evaluation
task_eval = factory.create(
    evaluator_type="task",
    benchmark="mmlu",
    model=model
)
results = task_eval.evaluate_batch(test_prompts)
Custom evaluator
from wisent.core.evaluators.core import BaseEvaluator, EvalResult

class CustomEvaluator(BaseEvaluator):
    """Custom evaluator with auto-registration."""

    name = "custom"  # Registered name

    def evaluate(self, prompt: str, response: str, **kwargs) -> EvalResult:
        # Custom evaluation logic
        score = self.compute_score(response)

        return EvalResult(
            score=score,
            ground_truth=kwargs.get("expected", ""),
            method_used=self.name,
            confidence=0.9,
            details={"custom_field": "value"},
            meta={}
        )

    def compute_score(self, response: str) -> float:
        # Implement scoring logic
        return 0.5

# Evaluator is auto-registered and can be used
evaluator = CustomEvaluator(model, tokenizer)
result = evaluator.evaluate(prompt, response)

Parameters

Common Parameters

--evaluator
Evaluator to use (e.g., log_likelihoods, generation, exact_match, oracle)
--benchmark
Benchmark to evaluate on
--num-samples
Number of samples to evaluate (default: all)
--batch-size
Batch size for evaluation (default: 8)

Generation Parameters

--max-tokens
Maximum tokens to generate (default: 256)
--temperature
Generation temperature (default: 0.0)
--top-p
Top-p sampling parameter (default: 1.0)
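
These can be combined with the common parameters, for example (flags as documented above; exact defaults may vary):
python -m wisent.cli evaluate \
    --benchmark gsm8k \
    --evaluator generation \
    --num-samples 50 \
    --batch-size 16 \
    --max-tokens 256 \
    --temperature 0.0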

Oracle Parameters

--judge-model
Model to use as judge (default: gpt-4)
--rubric
Evaluation rubric for the judge
--reference
Reference answer for comparison
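
For example, a judged evaluation with a custom rubric (the rubric and reference value formats are assumptions; depending on the CLI they may be inline strings or file paths):
python -m wisent.cli evaluate \
    --benchmark custom_qa.json \
    --evaluator oracle \
    --judge-model gpt-4 \
    --rubric "Score factual accuracy and completeness from 0 to 1" \
    --reference reference_answers.json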
