A Ground Truth Evaluator is a component that measures the performance of classifiers and other detection systems by comparing their outputs against known correct answers (ground truth data).
The ground truth evaluator provides multiple methods for determining whether generated responses are truthful or represent hallucinations, serving as the benchmark against which classifier performance is measured. The system supports exact-match evaluation, which compares generated responses against expected answers using either lm-eval harness metrics or simple string comparison; this approach is brittle for free-form text generation, where a correct answer rarely matches the reference verbatim. Substring matching offers a more flexible alternative by checking whether the expected answer appears anywhere within the generated response, though it still struggles with natural language variation such as paraphrases or alternative spellings.
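The contrast between the two string-based checks is easy to see in a short sketch. The helper names below (`exact_match`, `substring_match`) and the case/whitespace normalization are illustrative assumptions, not Wisent-Guard's actual implementation:

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.lower().split())

def exact_match(generated: str, expected: str) -> bool:
    """Strict comparison: the whole response must equal the expected answer."""
    return _normalize(generated) == _normalize(expected)

def substring_match(generated: str, expected: str) -> bool:
    """Lenient comparison: the expected answer only has to appear somewhere in the response."""
    return _normalize(expected) in _normalize(generated)

# A verbose but correct answer fails exact match yet passes substring match.
generated = "The capital of France is Paris, of course."
expected = "Paris"
print(exact_match(generated, expected))      # False
print(substring_match(generated, expected))  # True
```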
For more nuanced evaluation, the interactive method presents each generated response to a human evaluator alongside the expected answer and prompts for a manual classification as truthful or hallucination, with real-time feedback. The user-specified method accepts pre-assigned ground truth labels supplied as input, enabling batch evaluation scenarios. The manual review option marks responses for later human assessment without immediate classification, the "good" debug mode labels everything as truthful for testing purposes, and the "none" method skips ground truth evaluation entirely when only classifier training is needed.
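A minimal dispatch over these methods might look like the sketch below, reusing the `exact_match` and `substring_match` helpers from the previous example. The `GroundTruthMethod` enum, the label strings, and the function signature are assumptions for illustration; the real names live in `ground_truth_evaluator.py`:

```python
from enum import Enum

class GroundTruthMethod(Enum):
    # Hypothetical method names; the real module may use different identifiers.
    EXACT = "exact"
    SUBSTRING = "substring"
    INTERACTIVE = "interactive"
    USER_SPECIFIED = "user_specified"
    MANUAL_REVIEW = "manual_review"
    GOOD = "good"    # debug mode: label everything truthful
    NONE = "none"    # skip ground-truth evaluation

def evaluate_response(method: GroundTruthMethod, generated: str, expected: str,
                      user_label: str | None = None) -> str:
    """Return 'truthful', 'hallucination', 'pending_review', or 'skipped' for one response."""
    if method is GroundTruthMethod.EXACT:
        return "truthful" if exact_match(generated, expected) else "hallucination"
    if method is GroundTruthMethod.SUBSTRING:
        return "truthful" if substring_match(generated, expected) else "hallucination"
    if method is GroundTruthMethod.INTERACTIVE:
        answer = input(f"Expected: {expected}\nGenerated: {generated}\nTruthful? [y/n] ")
        return "truthful" if answer.strip().lower().startswith("y") else "hallucination"
    if method is GroundTruthMethod.USER_SPECIFIED:
        return user_label or "pending_review"   # pre-labeled batch input
    if method is GroundTruthMethod.MANUAL_REVIEW:
        return "pending_review"                 # mark for later human assessment
    if method is GroundTruthMethod.GOOD:
        return "truthful"                       # debug mode
    return "skipped"                            # NONE: classifier training only
```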
- Exact match: compares generated responses against expected answers using lm-eval harness metrics or simple string comparison.
- Substring match: checks whether the expected answer appears anywhere within the generated response for more flexible evaluation.
- Interactive: presents responses to human evaluators for manual classification with real-time feedback.
- Additional modes: user-specified labels, manual review marking, debug modes, and the option to skip evaluation entirely.
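Putting the pieces together, a hypothetical batch run could compare a classifier's predictions against labels produced by one of the methods above. The sample data and the accuracy calculation here are illustrative only, not output of the real evaluator:

```python
# Hypothetical batch evaluation: compare a classifier's predictions against
# ground-truth labels produced by the substring method. Data is made up.
samples = [
    # (generated response, expected answer, classifier prediction)
    ("The capital of France is Paris.", "Paris", "truthful"),
    ("The capital of France is Lyon.", "Paris", "truthful"),
    ("Einstein was born in 1879.", "1879", "hallucination"),
]

correct = 0
for generated, expected, prediction in samples:
    label = evaluate_response(GroundTruthMethod.SUBSTRING, generated, expected)
    correct += int(label == prediction)

print(f"Classifier agreement with ground truth: {correct / len(samples):.2%}")
```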
For a complete understanding of how the Ground Truth Evaluator works in Wisent-Guard, including the full implementation of evaluation metrics, comparison logic, and reporting functionality, explore the source code:
View ground_truth_evaluator.py on GitHub