Classifier

A classifier is a function that determines whether a given representation is present in a particular residual stream.

Classifiers are the tools we use to detect the presence of a particular trait in model responses. This corresponds to the process known as representation reading (the process of extracting and interpreting internal model representations), which lets us assign a score to each token in a response. Activations collected from contrastive pair sets during training serve as features; at inference time, the trained classifier uses the same kind of features to score each token for a given representation. For example, we can identify tokens that correspond to a hallucination, to leaking personal information, or to outputting bad code.

Wisent-Guard supports a variety of classifiers. The core architecture centers on training neural networks on activation patterns extracted from transformer layers during text generation. When the system processes contrastive pairs, such as correct versus incorrect answers to the same question, it captures the internal activations at specific layers and uses these high-dimensional vectors as training features. The Classifier class implements both logistic regression (a statistical method for binary classification using linear decision boundaries) and multi-layer perceptron (a neural network with multiple hidden layers for complex pattern recognition) models, which learn to distinguish activation patterns corresponding to "harmful" content (labeled as 1) from those corresponding to "harmless" content (labeled as 0).
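To make the idea of training on contrastive activations concrete, here is a minimal sketch of a logistic regression classifier written in plain Python. It is not the Wisent-Guard Classifier class: the function names (train_logistic, predict_proba) and the toy 4-dimensional "activations" are assumptions made purely for illustration, and real activations would come from a transformer layer.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    # Probability that activation vector x belongs to the "harmful" class (label 1).
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, lr=0.5, epochs=200):
    # Batch gradient descent on the log loss (hypothetical helper, not the library API).
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy stand-ins for activations from contrastive pairs (assumed data, not real model output).
rng = random.Random(0)
harmful = [[rng.gauss(1.0, 0.3) for _ in range(4)] for _ in range(20)]
harmless = [[rng.gauss(-1.0, 0.3) for _ in range(4)] for _ in range(20)]

features = harmful + harmless
labels = [1] * len(harmful) + [0] * len(harmless)  # 1 = "harmful", 0 = "harmless"
weights, bias = train_logistic(features, labels)
```

Calling predict_proba(weights, bias, vector) then gives a score between 0 and 1 for a new activation vector. A multi-layer perceptron would replace the single linear layer with one or more hidden layers but follow the same train-then-score pattern.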

During training, the system extracts activations from specific tokens using one of several targeting strategies: choice tokens (tokens representing specific answer choices in multiple-choice questions), continuation tokens (tokens that continue or complete a given prompt or context), or last tokens (the final tokens in a sequence, often containing key information). It then flattens these activation tensors into feature vectors that feed into the classification models. The training process includes early stopping (a technique to prevent overfitting by stopping training when validation performance stops improving), cross-validation splits, and comprehensive metric tracking, including ROC-AUC scores. Once trained, these classifiers can analyze new activation patterns during inference, providing probability scores between 0 and 1 that indicate the likelihood of problematic content.
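The token targeting and flattening step can be sketched as follows. This is an illustration under assumptions: the strategy names ("last_token", "continuation_tokens"), the prompt_len parameter, and the nested-list layout [seq_len][hidden_dim] are hypothetical and are not taken from the Wisent-Guard source.

```python
from typing import List

Activations = List[List[float]]  # assumed layout: one hidden-state vector per token


def select_tokens(acts: Activations, strategy: str, prompt_len: int = 0) -> Activations:
    # Pick the token positions whose activations become classifier features.
    if strategy == "last_token":
        return acts[-1:]
    if strategy == "continuation_tokens":
        # Everything generated after the prompt.
        return acts[prompt_len:]
    raise ValueError(f"unknown strategy: {strategy}")


def flatten(acts: Activations) -> List[float]:
    # Concatenate the selected per-token vectors into a single feature vector.
    return [value for token_vec in acts for value in token_vec]


# Three tokens with a hidden size of 2 (toy numbers).
acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
last_features = flatten(select_tokens(acts, "last_token"))
continuation_features = flatten(select_tokens(acts, "continuation_tokens", prompt_len=1))
```

Note that strategies selecting a variable number of tokens produce feature vectors of varying length, so a practical pipeline would likely pool or pad them before training; that choice is outside what this sketch shows.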

At inference time, the system hooks into the model's forward pass to extract activations token-by-token as text generates, feeding each activation vector through the trained classifier to produce real-time scores. These scores can trigger various detection handling actions including simple pass-through, placeholder replacement, or regeneration attempts based on configurable thresholds.
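As a rough sketch of the token-by-token scoring loop, the function below runs a classifier over a stream of activation vectors and flags each token whose score reaches a threshold. The names score_stream and toy_classifier are hypothetical. In a PyTorch setup the activation vectors would typically be captured with a forward hook on the chosen layer, but here they are plain lists so the example stays self-contained.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def score_stream(
    activations: Iterable[Sequence[float]],
    classifier: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> List[Tuple[float, bool]]:
    # Score each token's activation vector as it arrives and flag detections.
    results = []
    for vec in activations:
        score = classifier(vec)
        results.append((score, score >= threshold))
    return results


def toy_classifier(vec: Sequence[float]) -> float:
    # Stand-in for a trained classifier: mean activation clipped to [0, 1].
    return min(1.0, max(0.0, sum(vec) / len(vec)))


token_scores = score_stream([[0.1, 0.1], [0.9, 0.9], [0.2, 0.4]], toy_classifier)
flags = [flagged for _, flagged in token_scores]
```

Because activations is any iterable, the same function works with a generator that yields vectors while text is still being generated.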

The pass-through detection action allows problematic content to proceed unchanged while still logging the detection for monitoring purposes. The replace-with-placeholder action substitutes detected harmful content with predefined safe messages such as "Information may be inaccurate" or "Cannot provide harmful content." The regenerate-until-safe action repeatedly attempts to generate a new response when problematic content is detected, stopping when a safe response is produced or a maximum number of attempts is reached.
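The three handling actions can be sketched as a single dispatch function. Everything here is an illustrative assumption rather than the Wisent-Guard API: the DetectionAction enum, the handle_detection signature, and in particular the fallback to the placeholder when the attempt limit is reached, which the description above does not specify.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class DetectionAction(Enum):
    PASS_THROUGH = "pass_through"
    REPLACE_WITH_PLACEHOLDER = "replace_with_placeholder"
    REGENERATE_UNTIL_SAFE = "regenerate_until_safe"


def handle_detection(
    response: str,
    score_fn: Callable[[str], float],
    action: DetectionAction,
    threshold: float = 0.5,
    placeholder: str = "Information may be inaccurate",
    regenerate: Optional[Callable[[], str]] = None,
    max_attempts: int = 3,
    detections: Optional[List[Tuple[str, float]]] = None,
) -> str:
    score = score_fn(response)
    if score < threshold:
        return response  # nothing detected
    if detections is not None:
        detections.append((response, score))  # log the detection for monitoring
    if action is DetectionAction.PASS_THROUGH:
        return response
    if action is DetectionAction.REPLACE_WITH_PLACEHOLDER:
        return placeholder
    if regenerate is None:
        raise ValueError("REGENERATE_UNTIL_SAFE requires a regenerate callable")
    for _ in range(max_attempts):
        candidate = regenerate()
        if score_fn(candidate) < threshold:
            return candidate
    # Assumption: fall back to the placeholder once the attempt limit is reached.
    return placeholder


def toy_score(text: str) -> float:
    # Stand-in scorer: treats any text containing "bad" as problematic.
    return 0.9 if "bad" in text else 0.1
```

For example, handle_detection("bad text", toy_score, DetectionAction.REPLACE_WITH_PLACEHOLDER) returns the placeholder message, while the same call with PASS_THROUGH returns the original text and records the detection if a list is passed as detections.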

Classifiers can be saved and loaded, so training is not necessary every time you want to use them. Wisent-Guard supports this through the model_persistence.py module, which handles the saving and loading of trained models and classifiers. You can train a classifier once and reuse it multiple times, significantly reducing computational overhead and setup time for repeated evaluations or deployments.
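The train-once, reuse-later pattern might look like the sketch below. It does not reproduce the interface of model_persistence.py: save_classifier, load_classifier, the JSON file format, and the "layer" metadata key are all assumptions chosen to keep the example self-contained, using a linear classifier's weights and bias as the saved state.

```python
import json
import tempfile
from pathlib import Path


def save_classifier(path, weights, bias, metadata=None):
    # Serialize a linear classifier's parameters to JSON (hypothetical format).
    payload = {"weights": list(weights), "bias": bias, "metadata": metadata or {}}
    Path(path).write_text(json.dumps(payload))


def load_classifier(path):
    # Restore the parameters written by save_classifier.
    payload = json.loads(Path(path).read_text())
    return payload["weights"], payload["bias"], payload["metadata"]


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.json"
    save_classifier(path, [0.5, -1.25], 0.1, {"layer": 15})
    restored = load_classifier(path)
```

Storing metadata such as the layer the activations came from alongside the weights helps ensure a loaded classifier is only applied to activations of the kind it was trained on.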

Performance Metrics

Classification Metrics

  • Accuracy: Overall correctness
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
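These four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels where 1 means "harmful" and 0 means "harmless", as described earlier; the function name classification_metrics is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def classification_metrics(y_true, y_pred):
    # Count confusion-matrix cells for binary labels (1 = harmful, 0 = harmless).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn

    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy predictions: 2 true positives, 1 false negative, 1 false positive, 2 true negatives.
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```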

Safety Metrics

  • False Positive Rate: Fraction of safe content that is incorrectly blocked
  • False Negative Rate: Fraction of harmful content that is missed
  • Robustness: Performance on adversarial examples
  • Calibration: How closely confidence scores match observed outcome frequencies
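As an illustration, the error rates and one common calibration measure can be computed as below. The function names are hypothetical, and expected calibration error (ECE) with equal-width bins is used here as one possible way to quantify calibration; the source does not state which measure Wisent-Guard reports. Robustness is not shown because it requires an adversarial evaluation set.

```python
def safety_rates(y_true, y_pred):
    # False positive rate = FP / (FP + TN); false negative rate = FN / (FN + TP).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


def expected_calibration_error(y_true, probs, n_bins=10):
    # Group predictions into equal-width probability bins, then take the
    # size-weighted average gap between mean predicted probability and the
    # observed fraction of positives in each bin.
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_prob = sum(p for _, p in b) / len(b)
        frac_pos = sum(y for y, _ in b) / len(b)
        ece += (len(b) / n) * abs(avg_prob - frac_pos)
    return ece


fpr, fnr = safety_rates([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
overconfident_gap = expected_calibration_error([1, 1], [0.5, 0.5])
```

A lower ECE means the scores are closer to true likelihoods, which matters when thresholds on those scores decide whether content is blocked.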

Implementation Details

For a complete understanding of how classifiers work in Wisent-Guard, including the full implementation of various classifier types, training logic, and prediction methods, explore the source code:

View classifier.py on GitHub