Wisent uses three different extraction systems: benchmark extractors that parse model outputs into final answers during evaluation, lm-eval pair extractors that select contrastive pairs from benchmarks supported by the LM Eval Harness, and HuggingFace pair extractors that select contrastive pairs from other data sources that do not go through lm-eval. More recently, synonym task names (aliases) are also supported.
Benchmark extractors: parse model outputs to extract final answers for evaluation scoring.
LM-eval pair extractors: extract contrastive pairs from benchmark data through the LM Eval Harness; the library ships over one hundred extractors tuned to specific harness tasks, so no additional code needs to be written.
HuggingFace pair extractors: extract contrastive pairs from datasets hosted on Hugging Face that are not part of the lm-eval harness, with more than 90 extractors dedicated solely to this purpose.
The lm-eval pair extractors build contrastive pairs from lm-eval harness evaluation data, systematically matching each task to the relevant extractor through a registry. Lookup relies on exact names with a prefix fallback, so a hierarchical task name such as `mmlu_anatomy` resolves to the `mmlu` extractor.
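For example (a usage sketch; it assumes the lm-eval registry helper `get_extractor` shown in the examples below, and that `mmlu_anatomy` resolves through its `mmlu` prefix rather than a dedicated entry):

from wisent.core.contrastive_pairs.lm_eval_pairs.lm_extractor_registry import get_extractor

# Resolved via prefix fallback when no exact "mmlu_anatomy" extractor is registered
extractor = get_extractor("mmlu_anatomy")
pairs = extractor.extract_contrastive_pairs(limit=20)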
The HuggingFace pair extractors handle datasets from Hugging Face that fall outside the coverage of the lm-eval harness. Supported datasets include:
Math and reasoning: AIME, HMMT, LiveMathBench, PolyMath, MATH500, SuperGPQA
Coding: LiveCodeBench, SWE-bench, HumanEval+, APPS, DS1000, Mercury
Safety: HarmBench, JailbreakBench, SorryBench, WildGuard, DoNotAnswer, AgentHarm
Hallucination and factuality: HaluEval, FaithBench, SimpleQA, FACTSGrounding
Agents and tool use: AgentBench, ToolBench, BFCL, TravelPlanner, TauBench
Instruction following: AlpacaEval, ArenaHard, IFEval
Evaluation relies on parsing model outputs into final answers for scoring; different extractors implement different extraction methods.
from wisent.core.benchmark_extractors import get_extractor, GSM8KExtractor
# Get extractor for a task
extractor = get_extractor("gsm8k")
# Extract answer from model output
output = """Let me solve this step by step.
5 + 3 = 8
#### 8"""
answer = extractor.extract_answer(output)
print(f"Extracted: {answer}") # "8"
# Check answer correctness
is_correct = extractor.check_answer(answer, "8")  # True

Extracting contrastive pairs from an lm-eval benchmark:

from wisent.core.contrastive_pairs.lm_eval_pairs.lm_extractor_registry import get_extractor
# Get extractor for an lm-eval task
extractor = get_extractor("truthfulqa_mc1")
# Extract contrastive pairs from benchmark
pairs = extractor.extract_contrastive_pairs(limit=100)
for pair in pairs[:3]:
    print(f"Prompt: {pair.prompt[:50]}...")
    print(f"Positive: {pair.positive_response.model_response[:50]}...")
    print(f"Negative: {pair.negative_response.model_response[:50]}...")
    print()

Extracting contrastive pairs from a HuggingFace dataset:

from wisent.core.contrastive_pairs.huggingface_pairs.hf_extractor_registry import get_extractor
# Get extractor for a HuggingFace dataset
extractor = get_extractor("harmbench")
# Extract contrastive pairs
pairs = extractor.extract_contrastive_pairs(limit=50)
print(f"Extracted {len(pairs)} contrastive pairs")
for pair in pairs[:2]:
    print(f"Prompt: {pair.prompt[:80]}...")
    print(f"Safe response: {pair.positive_response.model_response[:50]}...")
    print(f"Unsafe response: {pair.negative_response.model_response[:50]}...")
    print()

Writing and registering a custom extractor for a new benchmark:

from wisent.core.contrastive_pairs.lm_eval_pairs.atoms import LMEvalBenchmarkExtractor
from wisent.core.contrastive_pairs.lm_eval_pairs.lm_extractor_registry import register_extractor
class MyCustomExtractor(LMEvalBenchmarkExtractor):
    """Custom extractor for a new benchmark."""

    task_name = "my_benchmark"

    def extract_contrastive_pairs(self, limit=None):
        # Load your data
        docs = self.load_task_docs(limit=limit)
        pairs = []
        for doc in docs:
            pair = self._build_pair(
                question=doc["question"],
                correct=doc["correct_answer"],
                incorrect=doc["wrong_answer"],
            )
            pairs.append(pair)
        return pairs

# Register the extractor
register_extractor("my_benchmark", MyCustomExtractor)
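Once registered, the custom benchmark can be resolved through the same registry helper used in the earlier examples (a usage sketch; the exact lookup behaviour for freshly registered tasks is an assumption based on the unified registry described below):

# Usage sketch: fetch and run the newly registered extractor
extractor = get_extractor("my_benchmark")
pairs = extractor.extract_contrastive_pairs(limit=10)
print(f"Built {len(pairs)} contrastive pairs for my_benchmark")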
The lm-eval and Hugging Face extractors are combined into a single unified registry; exact task-name matches take priority over prefix fallbacks during lookup.

Lookup order:
1. Exact match (case-insensitive)
2. Prefix fallback for hierarchical names:
   - "mmlu_anatomy" → "mmlu"
   - "bigbench_causal_judgment" → "bigbench"
   - "aradice_arabicmmlu_high_history" → "aradice"
3. Raise UnsupportedBenchmarkError if no match
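A minimal sketch of this lookup order (illustrative only, not Wisent's internal registry code; the dictionary, function names, and the first-segment prefix rule are assumptions consistent with the examples above, while UnsupportedBenchmarkError comes from the documented behaviour):

# Illustrative sketch of the documented lookup order; not Wisent's implementation.
_REGISTRY: dict[str, type] = {}  # task name -> extractor class, filled by register_extractor()

class UnsupportedBenchmarkError(Exception):
    """Raised when no extractor matches a requested task name."""

def resolve_extractor(task_name: str) -> type:
    key = task_name.lower()
    # 1. Exact match (case-insensitive)
    if key in _REGISTRY:
        return _REGISTRY[key]
    # 2. Prefix fallback for hierarchical names, e.g. "mmlu_anatomy" -> "mmlu"
    prefix = key.split("_", 1)[0]
    if prefix in _REGISTRY:
        return _REGISTRY[prefix]
    # 3. No match
    raise UnsupportedBenchmarkError(f"No extractor registered for task '{task_name}'")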
For the complete implementation of extractors in Wisent, see: