Extractors

Wisent uses three extraction systems: one extracts final answers from model responses during evaluation, one builds contrastive pairs from benchmarks supported by the lm-eval-harness, and one builds contrastive pairs from other data sources that lm-eval does not cover.

Extractor Architecture

1. BenchmarkExtractor (Answer Extraction)

Parses model outputs to extract final answers for evaluation scoring.

  • GSM8KExtractor - Extracts numerical answers (#### format, JSON, text patterns)
  • LiveCodeBenchExtractor - Extracts code from markdown blocks
  • HLEExtractor - Extracts answers from HLE format responses
  • SuperGPQAExtractor - Extracts A/B/C/D multiple choice answers
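
As a rough sketch, the shared interface looks like the following; extract_answer and check_answer mirror the methods used in the Python API section below, while the method bodies here are only illustrative, not Wisent's actual implementation.

class BenchmarkExtractor:
    """Sketch of the answer-extractor interface (illustrative only)."""

    def extract_answer(self, model_output: str) -> str:
        """Parse the raw model output and return the final answer as a string."""
        raise NotImplementedError

    def check_answer(self, extracted: str, expected: str) -> bool:
        """Compare an extracted answer against the gold answer."""
        return extracted.strip() == expected.strip()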

2. LMEvalBenchmarkExtractor (lm-eval-harness Integration)

Extracts contrastive pairs from benchmark data via the lm-eval-harness. Over one hundred task-specific extractors are included, so supported tasks need no additional code; a task-loading sketch follows the list below.

  • Uses lm-eval's task loading infrastructure
  • Handles multiple choice, generation, and perplexity tasks
  • Supports task groups (mmlu_*, bigbench_*, etc.)
  • Prefix matching for hierarchical task names
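
As a rough sketch, task loading through lm-eval's public API (TaskManager and get_task_dict) looks like the following; how Wisent wires this up internally may differ.

from lm_eval.tasks import TaskManager, get_task_dict

# Resolve a task (or task group) through lm-eval's own registry
task_manager = TaskManager()
task = get_task_dict(["boolq"], task_manager)["boolq"]

# Pull documents and render question/target text with the task's own formatters
docs = list(task.test_docs()) if task.has_test_docs() else list(task.validation_docs())
for doc in docs[:3]:
    question = task.doc_to_text(doc)
    target = task.doc_to_target(doc)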

3. HuggingFaceBenchmarkExtractor (Direct HF Datasets)

Extracts contrastive pairs from HuggingFace datasets that are not part of the lm-eval-harness; 90+ dedicated extractors cover these datasets (a loading sketch follows the list below).

  • Loads datasets directly via HuggingFace datasets library
  • Handles custom dataset formats and splits
  • Supports safety benchmarks (HarmBench, JailbreakBench, SorryBench)
  • Supports coding benchmarks (LiveCodeBench, SWE-bench, AIME)
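
As a rough sketch, the direct-datasets path looks like the following; the dataset id, split, and field names are placeholders, not a real benchmark schema.

from datasets import load_dataset

# Load a dataset directly from the Hugging Face Hub (dataset id and split are placeholders)
dataset = load_dataset("some-org/some-safety-benchmark", split="test")

# Build (prompt, positive, negative) triples from the dataset's fields
# (the field names below are placeholders, not an actual schema)
pairs = [
    (row["prompt"], row["safe_response"], row["unsafe_response"])
    for row in dataset
]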

LM-Eval-Harness Extractors

These extractors build contrastive pairs from lm-eval-harness benchmark data; a registry maps each task name to the matching extractor.

Supported Task Families

mmlu, truthfulqa, hellaswag, winogrande, arc, piqa, boolq, openbookqa, lambada, wsc, copa, rte, cb, wic, multirc, record, squad, race, babi, hendrycks_math, hendrycks_ethics, bigbench, aexams, agieval, afrimgsm, afrimmlu

Task names are resolved by prefix matching, so subtasks fall back to their family extractor (for example, `mmlu_anatomy` resolves to the `mmlu` extractor).

HuggingFace Dataset Extractors

These extractors handle datasets hosted on Hugging Face that fall outside the scope of the lm-eval-harness.

Math/Reasoning

AIME, HMMT, LiveMathBench, PolyMath, MATH500, SuperGPQA

Coding

LiveCodeBench, SWE-bench, HumanEval+, APPS, DS1000, Mercury

Safety/Alignment

HarmBench, JailbreakBench, SorryBench, WildGuard, DoNotAnswer, AgentHarm

Hallucination

HaluEval, FaithBench, SimpleQA, FACTSGrounding

Agent/Tool Use

AgentBench, ToolBench, BFCL, TravelPlanner, TauBench

Instruction Following

AlpacaEval, ArenaHard, IFEval

Answer Extractors

Answer extractors parse model outputs for evaluation scoring. Each extractor tries a sequence of strategies in order and falls back when none match.

GSM8KExtractor

  • Strategy 1 - JSON format: {"final_answer": "123"}
  • Strategy 2 - Hash format: #### 123
  • Strategy 3 - Text patterns: "The answer is 123"
  • Fallback - Raises NumericalExtractionError if no match
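
A minimal sketch of this cascade using only the standard library; the function name is illustrative and the regexes are simplified, only NumericalExtractionError is taken from the table above.

import json
import re

class NumericalExtractionError(ValueError):
    """Raised when no numerical answer can be found."""

def extract_gsm8k_answer(output: str) -> str:
    # Strategy 1: JSON format, e.g. {"final_answer": "123"}
    try:
        parsed = json.loads(output)
        if isinstance(parsed, dict) and "final_answer" in parsed:
            return str(parsed["final_answer"])
    except json.JSONDecodeError:
        pass
    # Strategy 2: hash format, e.g. "#### 123"
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", output)
    if match:
        return match.group(1).replace(",", "")
    # Strategy 3: text patterns, e.g. "The answer is 123"
    match = re.search(r"[Tt]he answer is\s*(-?[\d,]+(?:\.\d+)?)", output)
    if match:
        return match.group(1).replace(",", "")
    # Fallback: nothing matched
    raise NumericalExtractionError(f"No numerical answer found in: {output[:80]!r}")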

LiveCodeBenchExtractor

  • Strategy 1 - Markdown code blocks: ```python ... ```
  • Strategy 2 - Function definitions: def func()...
  • Strategy 3 - Class definitions: class MyClass...
  • Fallback - Returns full text if it contains code keywords
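
A minimal regex-based sketch of the same strategy order; the function name and the keyword list used by the fallback are illustrative.

import re

def extract_code(output: str) -> str:
    # Strategy 1: fenced markdown blocks, e.g. ```python ... ```
    match = re.search(r"```(?:python)?\s*\n(.*?)```", output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Strategies 2 and 3: bare function or class definitions
    match = re.search(r"(?:^|\n)((?:def|class)\s+\w+.*)", output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fallback: return the full text if it contains code keywords
    if any(kw in output for kw in ("def ", "class ", "import ", "return ")):
        return output.strip()
    return ""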

SuperGPQAExtractor

  • Strategy 1 - Answer format: "Answer: A"
  • Strategy 2 - Bracket format: (A) or [A]
  • Strategy 3 - Standalone letter: "...so B is correct"
  • Fallback - First character if A-D
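
A minimal sketch of the letter-extraction cascade; the exact patterns Wisent uses may differ.

import re

def extract_choice(output: str):
    # Strategy 1: explicit answer format, e.g. "Answer: A"
    match = re.search(r"Answer:\s*([A-D])\b", output)
    if match:
        return match.group(1)
    # Strategy 2: bracketed letter, e.g. "(A)" or "[A]"
    match = re.search(r"[\(\[]([A-D])[\)\]]", output)
    if match:
        return match.group(1)
    # Strategy 3: standalone letter, e.g. "...so B is correct"
    match = re.search(r"(?<![A-Za-z])([A-D])(?![A-Za-z])", output)
    if match:
        return match.group(1)
    # Fallback: first character if it is a choice letter
    first = output.strip()[:1]
    return first if first and first in "ABCD" else None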

Python API

Using answer extractors
from wisent.core.benchmark_extractors import get_extractor, GSM8KExtractor

# Get extractor for a task
extractor = get_extractor("gsm8k")

# Extract answer from model output
output = """Let me solve this step by step.
5 + 3 = 8
#### 8"""
answer = extractor.extract_answer(output)
print(f"Extracted: {answer}")  # "8"

# Check answer correctness
is_correct = extractor.check_answer(answer, "8")  # True

Using LM-eval contrastive pair extractors
from wisent.core.contrastive_pairs.lm_eval_pairs.lm_extractor_registry import get_extractor

# Get extractor for an lm-eval task
extractor = get_extractor("truthfulqa_mc1")

# Extract contrastive pairs from benchmark
pairs = extractor.extract_contrastive_pairs(limit=100)

for pair in pairs[:3]:
    print(f"Prompt: {pair.prompt[:50]}...")
    print(f"Positive: {pair.positive_response.model_response[:50]}...")
    print(f"Negative: {pair.negative_response.model_response[:50]}...")
    print()

Using HuggingFace contrastive pair extractors
from wisent.core.contrastive_pairs.huggingface_pairs.hf_extractor_registry import get_extractor

# Get extractor for a HuggingFace dataset
extractor = get_extractor("harmbench")

# Extract contrastive pairs
pairs = extractor.extract_contrastive_pairs(limit=50)

print(f"Extracted {len(pairs)} contrastive pairs")
for pair in pairs[:2]:
    print(f"Prompt: {pair.prompt[:80]}...")
    print(f"Safe response: {pair.positive_response.model_response[:50]}...")
    print(f"Unsafe response: {pair.negative_response.model_response[:50]}...")
    print()

Registering a custom extractor
from wisent.core.contrastive_pairs.lm_eval_pairs.atoms import LMEvalBenchmarkExtractor
from wisent.core.contrastive_pairs.lm_eval_pairs.lm_extractor_registry import register_extractor

class MyCustomExtractor(LMEvalBenchmarkExtractor):
    """Custom extractor for a new benchmark."""

    task_name = "my_benchmark"

    def extract_contrastive_pairs(self, limit=None):
        # Load your data
        docs = self.load_task_docs(limit=limit)

        pairs = []
        for doc in docs:
            pair = self._build_pair(
                question=doc["question"],
                correct=doc["correct_answer"],
                incorrect=doc["wrong_answer"],
            )
            pairs.append(pair)

        return pairs

# Register the extractor
register_extractor("my_benchmark", MyCustomExtractor)

Registry Lookup

The lm-eval and HuggingFace extractor registries are merged into a unified lookup. Exact task-name matches take priority; hierarchical names fall back to prefix matches.

# Lookup order:
1. Exact match (case-insensitive)
2. Prefix fallback for hierarchical names:
   - "mmlu_anatomy" → "mmlu"
   - "bigbench_causal_judgment" → "bigbench"
   - "aradice_arabicmmlu_high_history" → "aradice"
3. Raise UnsupportedBenchmarkError if no match
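
A minimal sketch of this lookup order; the registry dict stands in for Wisent's actual registries, and only UnsupportedBenchmarkError is named in the steps above.

class UnsupportedBenchmarkError(KeyError):
    """Raised when no extractor matches the task name (step 3)."""

def lookup_extractor(task_name: str, registry: dict):
    name = task_name.lower()
    # 1. Exact match (case-insensitive)
    if name in registry:
        return registry[name]
    # 2. Prefix fallback for hierarchical names, e.g. "mmlu_anatomy" -> "mmlu"
    for prefix in sorted(registry, key=len, reverse=True):
        if name.startswith(prefix + "_"):
            return registry[prefix]
    # 3. No match
    raise UnsupportedBenchmarkError(task_name)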
