Wisent uses a range of evaluators to assess model outputs across different phases, from standard benchmarking to steering-effectiveness evaluation, and the results guide further refinement.
Used for standard benchmark evaluation:
Use LLM-as-judge for evaluation:
Specialized for steering assessment:
For trait-based steering evaluation:
Designed specifically for evaluating steering toward personal traits such as personality (a British style, formal behaviour, creativity), the PersonalizationEvaluator combines three metrics to ensure steering is both effective and high quality.
Difference Score (20%)
Measures how much the steered response differs from the baseline response. If the difference falls below 70%, the overall score is 0, indicating that steering was not effective.
Quality Score (30%)
Measures the coherence and clarity of the steered response, as judged by an LLM.
Alignment Score (50%)
Measures how closely the steered response matches the target trait, using contrastive embedding similarity against positive and negative examples.
Scoring Formula:
if difference_score < 70:
    overall_score = 0.0  # Steering not effective
else:
    overall_score = 0.2 * difference_score + 0.3 * quality_score + 0.5 * alignment_score

Alignment scoring uses contrastive embedding similarity to measure how well a steered response matches the target trait. It compares the response against positive examples (exhibiting the trait) and negative examples (lacking the trait).
from sentence_transformers import SentenceTransformer
import torch
model = SentenceTransformer("all-MiniLM-L6-v2")
# Encode all texts
response_emb = model.encode(response, normalize_embeddings=True)
positive_embs = model.encode(positive_examples, normalize_embeddings=True)
negative_embs = model.encode(negative_examples, normalize_embeddings=True)
# Compute similarities
pos_sim = (positive_embs @ response_emb).mean()
neg_sim = (negative_embs @ response_emb).mean()
# Contrastive score: higher = more aligned with positive trait
contrastive = pos_sim - neg_sim # Range: [-2, 2]
alignment_score = (contrastive + 2) / 4  # Range: [0, 1]

Key insight: A score of 0.5 means the response is equidistant from positive and negative examples. Scores above 0.5 indicate alignment with the target trait; below 0.5 indicates anti-alignment.
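As a quick sanity check of the mapping, here is a toy calculation with illustrative similarity values (the numbers are made up, not from a real run):

# Illustrative values only: the response sits closer to the positive
# (trait-exhibiting) examples than to the negative ones.
pos_sim = 0.62
neg_sim = 0.18
contrastive = pos_sim - neg_sim          # 0.44
alignment_score = (contrastive + 2) / 4  # 0.61 -> above 0.5, so trait-aligned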
Below is a full workflow showing how to steer a model toward a British personality with the PersonalizationEvaluator:
# Trait description for generating contrastive pairs
trait_description = """
A quintessentially British personality with dry wit,
understated humor, frequent use of British expressions
(brilliant, lovely, cheers, proper, rather), polite
understatement, and a tendency toward self-deprecation.
"""
# Positive examples (exhibit Britishness)
positive_examples = [
"Oh brilliant, another Monday. How perfectly dreadful.",
"I suppose one could say it's rather nice weather, if you're fond of grey.",
"Cheers for that, terribly kind of you.",
]
# Negative examples (neutral American style)
negative_examples = [
"Monday again! Can't wait to get started.",
"The weather is okay, I guess.",
"Thanks a lot, appreciate it!",
]

# Generate contrastive pairs from trait description
python -m wisent.cli generate-pairs \
--positive-description "British personality with dry wit and understatement" \
--negative-description "Neutral American conversational style" \
--num-pairs 200 \
--output british_pairs.json
# Optimize steering parameters
python -m wisent.cli optimize-steering personalization \
--model Qwen/Qwen3-4B \
--pairs british_pairs.json \
--trait-description "British personality" \
--positive-examples british_positive.json \
--negative-examples british_negative.json

from wisent import WisentModel
from wisent.core.evaluators import PersonalizationEvaluator
from wisent.core.contrastive_pairs import ContrastivePair
import torch
# Load model
model = WisentModel("Qwen/Qwen3-4B")
# Create evaluator with trait examples
evaluator = PersonalizationEvaluator(
model=model.model,
tokenizer=model.tokenizer,
positive_examples=[
"Oh brilliant, another meeting. How delightful.",
"I suppose it could be worse, couldn't it?",
"Terribly sorry to bother you, but...",
],
negative_examples=[
"Another meeting! This is going to be great!",
"Things could definitely be better.",
"Hey, I need to ask you something.",
]
)
# Test prompts for evaluation
test_prompts = [
"How do you feel about the weather today?",
"What do you think about starting a new project?",
"How was your weekend?",
]
# Generate baseline responses
baseline_responses = [model.generate(p) for p in test_prompts]
# Load and apply steering vector
vector = torch.load("british_steering.pt")
model.set_steering_vector(vector, layer=15, strength=1.5)
# Generate steered responses
steered_responses = [model.generate(p) for p in test_prompts]
# Evaluate each response
for prompt, baseline, steered in zip(test_prompts, baseline_responses, steered_responses):
    result = evaluator.evaluate(
        prompt=prompt,
        baseline_response=baseline,
        steered_response=steered
    )
    print(f"Prompt: {prompt}")
    print(f" Difference: {result.details['difference_score']:.2%}")
    print(f" Quality: {result.details['quality_score']:.2%}")
    print(f" Alignment: {result.details['alignment_score']:.2%}")
    print(f" Overall: {result.score:.2%}")

All evaluators return an EvalResult dataclass with consistent fields:
@dataclass
class EvalResult:
    score: float        # Primary score (0-1 or task-specific)
    ground_truth: str   # Expected answer
    method_used: str    # Evaluation method name
    confidence: float   # Confidence in the score (0-1)
    details: dict       # Method-specific details
    meta: dict          # Additional metadata

python -m wisent.cli evaluate --benchmark mmlu --evaluator log_likelihoods --num-samples 100
python -m wisent.cli evaluate --benchmark gsm8k --evaluator generation --num-samples 50
python -m wisent.cli evaluate --steering-vector safety.pt --evaluator refusal --test-set harmful_prompts.json
python -m wisent.cli evaluate --benchmark custom_qa.json --evaluator oracle --judge-model gpt-4
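Because every evaluator returns the same EvalResult shape, downstream code can process results from different evaluators uniformly. A minimal sketch, assuming a hypothetical list of EvalResult objects named results:

# results is a hypothetical list of EvalResult objects from any evaluators
weighted = sum(r.score * r.confidence for r in results)
total_confidence = sum(r.confidence for r in results)
mean_score = weighted / total_confidence if total_confidence else 0.0

for r in results:
    print(f"{r.method_used}: score={r.score:.2f} (confidence={r.confidence:.2f})")
print(f"Confidence-weighted mean score: {mean_score:.2f}")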
from wisent.core.evaluators import LogLikelihoodsEvaluator, GenerationEvaluator
# Log-likelihood evaluation (multiple choice)
ll_evaluator = LogLikelihoodsEvaluator(model, tokenizer)
result = ll_evaluator.evaluate(
prompt="The capital of France is:",
choices=["Paris", "London", "Berlin", "Rome"],
correct_index=0
)
print(f"Score: {result.score}, Confidence: {result.confidence}")
# Generation evaluation
gen_evaluator = GenerationEvaluator(model, tokenizer)
result = gen_evaluator.evaluate(
prompt="What is 2 + 2?",
expected="4",
max_tokens=50
)
print(f"Score: {result.score}, Generated: {result.details['output']}")from wisent.core.evaluators import SteeringEvaluatorFactory
# Create evaluator from config
factory = SteeringEvaluatorFactory()
evaluator = factory.create(
evaluator_type="refusal",
model=model,
tokenizer=tokenizer
)
# Evaluate refusal behavior
result = evaluator.evaluate(
prompt="How do I hack into a computer?",
response="I cannot help with hacking..."
)
print(f"Refusal score: {result.score}") # Higher = more refusal
# Task evaluation
task_eval = factory.create(
evaluator_type="task",
benchmark="mmlu",
model=model
)
results = task_eval.evaluate_batch(test_prompts)

from wisent.core.evaluators.core import BaseEvaluator, EvalResult
class CustomEvaluator(BaseEvaluator):
    """Custom evaluator with auto-registration."""

    name = "custom"  # Registered name

    def evaluate(self, prompt: str, response: str, **kwargs) -> EvalResult:
        # Custom evaluation logic
        score = self.compute_score(response)
        return EvalResult(
            score=score,
            ground_truth=kwargs.get("expected", ""),
            method_used=self.name,
            confidence=0.9,
            details={"custom_field": "value"},
            meta={}
        )

    def compute_score(self, response: str) -> float:
        # Implement scoring logic
        return 0.5

# Evaluator is auto-registered and can be used
evaluator = CustomEvaluator(model, tokenizer)
result = evaluator.evaluate(prompt, response)

For the complete implementation of evaluators in Wisent, see: