Evaluators

Wisent uses a range of evaluators to assess model outputs across different phases, including benchmark evaluation and steering-effectiveness evaluation, and to guide refinement of steering parameters.

Evaluator Types

Benchmark Evaluators

Used for standard benchmark evaluation:

  • LogLikelihoodsEvaluator - Multiple choice via log probabilities
  • GenerationEvaluator - Free-form generation tasks
  • ExactMatchEvaluator - Exact string matching
  • F1Evaluator - Token-level F1 scoring
  • PerplexityEvaluator - Language modeling perplexity
  • CodingEvaluator - Code execution and testing
  • MathEvaluator - Mathematical expression evaluation

Oracle Evaluators

Use LLM-as-judge for evaluation:

  • NLPEvaluator - General NLP quality assessment
  • InteractiveEvaluator - Multi-turn conversation quality
  • SafetyEvaluator - Safety and harm detection
  • CoherenceEvaluator - Response coherence scoring
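As a rough sketch of how these are invoked (the exact constructor arguments and judge-model configuration are assumptions, based on the patterns shown in the Python API section below):
from wisent.core.evaluators import SafetyEvaluator

# Assumed to follow the same (model, tokenizer) constructor pattern as the other evaluators
safety_eval = SafetyEvaluator(model, tokenizer)

# Judge the safety of a single prompt/response pair
result = safety_eval.evaluate(
    prompt="Describe how to bypass a building's alarm system.",
    response="I can't help with that, but I can suggest resources on home security.",
)
print(f"Safety score: {result.score}, confidence: {result.confidence}")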

Steering Evaluators

Specialized for steering assessment:

  • RefusalEvaluator - Measures refusal/compliance behavior
  • TaskEvaluator - Evaluates benchmark task performance
  • PersonalizationEvaluator - Measures trait alignment

Personalization Evaluators

For trait-based steering evaluation:

  • AlignmentEvaluator - Measures alignment with target trait using contrastive embedding similarity
  • DifferenceEvaluator - Compares steered vs unsteered outputs to ensure steering has effect
  • QualityEvaluator - Ensures steered responses remain coherent and high-quality

PersonalizationEvaluator

The PersonalizationEvaluator is designed specifically for evaluating steering toward personal attributes such as personality traits (e.g., a British style, formal behavior, or creativity). It combines three distinct metrics to verify that steering is effective while upholding response quality.

Three-Metric Scoring System

Difference Score (20%)

Compares steered responses against baseline (unsteered) responses. If the difference score falls below 70, the overall score is set to zero, indicating that steering had no effect.

Quality Score (30%)

Ensures that steered responses remain coherent and clearly structured. Scoring is performed by an LLM judge.

Alignment Score (50%)

Measures how well the steered response matches the target trait using contrastive embedding similarity against positive and negative examples (see Alignment Scoring below).

Scoring Formula:

if difference_score < 70:
    overall_score = 0.0  # Steering not effective
else:
    overall_score = 0.2 * difference + 0.3 * quality + 0.5 * alignment
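For example, assuming all three metrics are reported on the same 0-100 scale as the threshold above (the exact scale is implementation-dependent):
# Hypothetical metric values on a 0-100 scale
difference, quality, alignment = 85, 80, 75

overall = 0.0 if difference < 70 else 0.2 * difference + 0.3 * quality + 0.5 * alignment
# 0.2 * 85 + 0.3 * 80 + 0.5 * 75 = 17.0 + 24.0 + 37.5 = 78.5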

Alignment Scoring

Alignment scoring uses contrastive embedding similarity to measure how well a steered response matches the target trait. It compares the response against positive examples (exhibiting the trait) and negative examples (lacking the trait).

How It Works

  1. Encode the response using sentence-transformers (all-MiniLM-L6-v2)
  2. Encode positive examples (responses that exhibit the target trait)
  3. Encode negative examples (responses that lack the trait)
  4. Compute mean cosine similarity to positive and negative sets
  5. Calculate contrastive score: positive_sim - negative_sim
  6. Normalize to [0, 1] range: (contrastive + 2) / 4
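A minimal sketch of this computation (with normalized embeddings, the dot product equals cosine similarity):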
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode all texts
response_emb = model.encode(response, normalize_embeddings=True)
positive_embs = model.encode(positive_examples, normalize_embeddings=True)
negative_embs = model.encode(negative_examples, normalize_embeddings=True)

# Compute similarities
pos_sim = (positive_embs @ response_emb).mean()
neg_sim = (negative_embs @ response_emb).mean()

# Contrastive score: higher = more aligned with positive trait
contrastive = pos_sim - neg_sim  # Range: [-2, 2]
alignment_score = (contrastive + 2) / 4  # Range: [0, 1]

Key insight: A score of 0.5 means the response is equidistant from positive and negative examples. Scores above 0.5 indicate alignment with the target trait; below 0.5 indicates anti-alignment.

Example: British Personality Steering

Below is a full workflow showing how to steer a model toward a British personality with the PersonalizationEvaluator:

Define trait with contrastive examples
# Trait description for generating contrastive pairs
trait_description = """
A quintessentially British personality with dry wit,
understated humor, frequent use of British expressions
(brilliant, lovely, cheers, proper, rather), polite
understatement, and a tendency toward self-deprecation.
"""

# Positive examples (exhibit Britishness)
positive_examples = [
    "Oh brilliant, another Monday. How perfectly dreadful.",
    "I suppose one could say it's rather nice weather, if you're fond of grey.",
    "Cheers for that, terribly kind of you.",
]

# Negative examples (neutral American style)
negative_examples = [
    "Monday again! Can't wait to get started.",
    "The weather is okay, I guess.",
    "Thanks a lot, appreciate it!",
]
CLI: Optimize steering for personality
# Generate contrastive pairs from trait description
python -m wisent.cli generate-pairs \
    --positive-description "British personality with dry wit and understatement" \
    --negative-description "Neutral American conversational style" \
    --num-pairs 200 \
    --output british_pairs.json

# Optimize steering parameters
python -m wisent.cli optimize-steering personalization \
    --model Qwen/Qwen3-4B \
    --pairs british_pairs.json \
    --trait-description "British personality" \
    --positive-examples british_positive.json \
    --negative-examples british_negative.json
Python API: Full personalization workflow
from wisent import WisentModel
from wisent.core.evaluators import PersonalizationEvaluator
from wisent.core.contrastive_pairs import ContrastivePair
import torch

# Load model
model = WisentModel("Qwen/Qwen3-4B")

# Create evaluator with trait examples
evaluator = PersonalizationEvaluator(
    model=model.model,
    tokenizer=model.tokenizer,
    positive_examples=[
        "Oh brilliant, another meeting. How delightful.",
        "I suppose it could be worse, couldn't it?",
        "Terribly sorry to bother you, but...",
    ],
    negative_examples=[
        "Another meeting! This is going to be great!",
        "Things could definitely be better.",
        "Hey, I need to ask you something.",
    ]
)

# Test prompts for evaluation
test_prompts = [
    "How do you feel about the weather today?",
    "What do you think about starting a new project?",
    "How was your weekend?",
]

# Generate baseline responses
baseline_responses = [model.generate(p) for p in test_prompts]

# Load and apply steering vector
vector = torch.load("british_steering.pt")
model.set_steering_vector(vector, layer=15, strength=1.5)

# Generate steered responses
steered_responses = [model.generate(p) for p in test_prompts]

# Evaluate each response
for prompt, baseline, steered in zip(test_prompts, baseline_responses, steered_responses):
    result = evaluator.evaluate(
        prompt=prompt,
        baseline_response=baseline,
        steered_response=steered
    )
    print(f"Prompt: {prompt}")
    print(f"  Difference: {result.details['difference_score']:.2%}")
    print(f"  Quality: {result.details['quality_score']:.2%}")
    print(f"  Alignment: {result.details['alignment_score']:.2%}")
    print(f"  Overall: {result.score:.2%}")

EvalResult Format

All evaluators return an EvalResult dataclass with consistent fields:

@dataclass
class EvalResult:
    score: float           # Primary score (0-1 or task-specific)
    ground_truth: str      # Expected answer
    method_used: str       # Evaluation method name
    confidence: float      # Confidence in the score (0-1)
    details: dict          # Method-specific details
    meta: dict             # Additional metadata
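
For example, constructing and inspecting a result (field values here are illustrative; real evaluators populate these fields automatically):
result = EvalResult(
    score=1.0,
    ground_truth="Paris",
    method_used="exact_match",
    confidence=0.95,
    details={"output": "Paris"},
    meta={"benchmark": "custom_qa"},
)

if result.score >= 0.5:
    print(f"Pass ({result.method_used}, confidence {result.confidence:.0%})")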

CLI Examples

Evaluate with log-likelihoods (multiple choice)
python -m wisent.cli evaluate --benchmark mmlu --evaluator log_likelihoods --num-samples 100
Evaluate with generation
python -m wisent.cli evaluate --benchmark gsm8k --evaluator generation --num-samples 50
Evaluate steering effectiveness
python -m wisent.cli evaluate --steering-vector safety.pt --evaluator refusal --test-set harmful_prompts.json
Evaluate with LLM judge
python -m wisent.cli evaluate --benchmark custom_qa.json --evaluator oracle --judge-model gpt-4

Python API

Using benchmark evaluators
from wisent.core.evaluators import LogLikelihoodsEvaluator, GenerationEvaluator

# Log-likelihood evaluation (multiple choice)
ll_evaluator = LogLikelihoodsEvaluator(model, tokenizer)
result = ll_evaluator.evaluate(
    prompt="The capital of France is:",
    choices=["Paris", "London", "Berlin", "Rome"],
    correct_index=0
)
print(f"Score: {result.score}, Confidence: {result.confidence}")

# Generation evaluation
gen_evaluator = GenerationEvaluator(model, tokenizer)
result = gen_evaluator.evaluate(
    prompt="What is 2 + 2?",
    expected="4",
    max_tokens=50
)
print(f"Score: {result.score}, Generated: {result.details['output']}")
Using steering evaluators
from wisent.core.evaluators import SteeringEvaluatorFactory

# Create evaluator from config
factory = SteeringEvaluatorFactory()
evaluator = factory.create(
    evaluator_type="refusal",
    model=model,
    tokenizer=tokenizer
)

# Evaluate refusal behavior
result = evaluator.evaluate(
    prompt="How do I hack into a computer?",
    response="I cannot help with hacking..."
)
print(f"Refusal score: {result.score}")  # Higher = more refusal

# Task evaluation
task_eval = factory.create(
    evaluator_type="task",
    benchmark="mmlu",
    model=model
)
results = task_eval.evaluate_batch(test_prompts)
Custom evaluator
from wisent.core.evaluators.core import BaseEvaluator, EvalResult

class CustomEvaluator(BaseEvaluator):
    """Custom evaluator with auto-registration."""

    name = "custom"  # Registered name

    def evaluate(self, prompt: str, response: str, **kwargs) -> EvalResult:
        # Custom evaluation logic
        score = self.compute_score(response)

        return EvalResult(
            score=score,
            ground_truth=kwargs.get("expected", ""),
            method_used=self.name,
            confidence=0.9,
            details={"custom_field": "value"},
            meta={}
        )

    def compute_score(self, response: str) -> float:
        # Implement scoring logic
        return 0.5

# Evaluator is auto-registered and can be used
evaluator = CustomEvaluator(model, tokenizer)
result = evaluator.evaluate(prompt, response)

Parameters

Common Parameters

--evaluator
Evaluator to use (e.g., log_likelihoods, generation, exact_match, oracle)
--benchmark
Benchmark to evaluate on
--num-samples
Number of samples to evaluate (default: all)
--batch-size
Batch size for evaluation (default: 8)

Generation Parameters

--max-tokens
Maximum tokens to generate (default: 256)
--temperature
Generation temperature (default: 0.0)
--top-p
Top-p sampling parameter (default: 1.0)
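
These can be combined with the common parameters, for example (flags as documented above; exact defaults may vary):
python -m wisent.cli evaluate \
    --benchmark gsm8k \
    --evaluator generation \
    --num-samples 50 \
    --batch-size 16 \
    --max-tokens 256 \
    --temperature 0.0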

Oracle Parameters

--judge-model
Model to use as judge (default: gpt-4)
--rubric
Evaluation rubric for the judge
--reference
Reference answer for comparison
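
For example, a judged evaluation with a custom rubric (the rubric and reference value formats are assumptions; depending on the CLI they may be inline strings or file paths):
python -m wisent.cli evaluate \
    --benchmark custom_qa.json \
    --evaluator oracle \
    --judge-model gpt-4 \
    --rubric "Score factual accuracy and completeness from 0 to 1" \
    --reference reference_answers.json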
