Ground Truth Evaluator

A Ground Truth Evaluator measures the effectiveness of classifiers and detection systems by comparing their outputs against known correct reference values (ground truth data).

The ground truth evaluator provides multiple methods for determining whether generated responses are truthful or represent hallucinations, serving as the benchmark against which classifier performance is measured. The system supports exact match evaluation that compares generated responses against expected answers using either the lm-eval harness metrics or simple string comparison, though this approach proves problematic for free-form text generation. Substring matching offers a more flexible alternative by checking whether the expected answer appears anywhere within the generated response, but still faces challenges with natural language variation.

For more nuanced evaluation, the interactive method presents each generated response to a human evaluator alongside the expected answer, prompting for manual classification as truthful or hallucination with real-time feedback. The user-specified method allows pre-labeling of responses with ground truth labels that can be provided as input, enabling batch evaluation scenarios. The manual review option marks responses for later human assessment without immediate classification, while the "good" debug mode labels everything as truthful for testing purposes, and the "none" method skips ground truth evaluation entirely when only classifier training is needed.

Evaluation Methods

Exact Match

Compares generated responses against expected answers using either lm-eval harness metrics or direct string comparison.
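A minimal sketch of the idea, assuming plain string comparison rather than the lm-eval harness metrics; the exact_match helper and its normalization steps are illustrative, not Wisent's actual implementation.

```python
def exact_match(generated: str, expected: str, normalize: bool = True) -> bool:
    """Return True when the generated response equals the expected answer.

    Illustrative only: the real evaluator may delegate to lm-eval harness
    metrics instead of comparing strings directly.
    """
    if normalize:
        generated = generated.strip().lower()
        expected = expected.strip().lower()
    return generated == expected


# Exact match is brittle for free-form generation:
print(exact_match("Paris", "Paris"))                            # True
print(exact_match("The capital of France is Paris.", "Paris"))  # False
```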

Substring Matching

Checks whether the expected answer appears anywhere within the generated response, offering more flexibility than exact match.
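A hedged sketch of the containment check under the same assumptions; the substring_match helper name is hypothetical.

```python
def substring_match(generated: str, expected: str) -> bool:
    """Return True when the expected answer appears anywhere in the response.

    Case-insensitive containment check: more tolerant of extra phrasing than
    exact match, but still misses paraphrases of the expected answer.
    """
    return expected.strip().lower() in generated.lower()


print(substring_match("The capital of France is Paris.", "Paris"))  # True
print(substring_match("It is the French capital.", "Paris"))        # False
```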

Interactive Evaluation

Presents each generated response to a human evaluator alongside the expected answer and records the truthful or hallucination verdict in real time.
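A sketch of what such a prompt loop could look like, assuming a simple command-line interaction; the function name and prompt wording are illustrative rather than Wisent's actual interface.

```python
def interactive_label(generated: str, expected: str) -> str:
    """Show the response next to the expected answer and ask for a verdict.

    Returns "truthful" or "hallucination" based on the reviewer's input.
    Illustrative sketch only; the real evaluator's prompts may differ.
    """
    print(f"Expected : {expected}")
    print(f"Generated: {generated}")
    while True:
        verdict = input("Truthful? [y/n]: ").strip().lower()
        if verdict in ("y", "n"):
            return "truthful" if verdict == "y" else "hallucination"
        print("Please answer 'y' or 'n'.")
```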

Batch & Debug Modes

Supports user-specified labels for batch evaluation, marking responses for later manual review, a debug mode that labels everything as truthful, and the option to skip ground truth evaluation entirely.
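A hypothetical dispatcher showing how these non-comparative modes could resolve a label; the method names, return values, and resolve_label function are assumptions for illustration, not Wisent's actual identifiers.

```python
from typing import Optional


def resolve_label(method: str, user_label: Optional[str] = None) -> Optional[str]:
    """Resolve a ground-truth label for the non-comparative methods.

    Hypothetical sketch: method names mirror the modes described above,
    but the real evaluator's identifiers and behavior may differ.
    """
    if method == "user-specified":
        # Pre-labeled input enables batch evaluation without prompting.
        return user_label
    if method == "manual-review":
        # Mark for later human assessment; no label assigned yet.
        return "pending-review"
    if method == "good":
        # Debug mode: treat every response as truthful.
        return "truthful"
    if method == "none":
        # Skip ground-truth evaluation entirely (classifier training only).
        return None
    raise ValueError(f"Unknown ground-truth method: {method}")
```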

To fully understand how the Ground Truth Evaluator works within Wisent, including the implementation details of scoring criteria, comparison logic, and report generation, study the source code.

View evaluators on GitHub

