Detection Handling

Detection handling covers the strategies and actions applied once the classifier flags potentially harmful content in the model's activations.

Handling Strategies

Block Generation

Immediately stop generation and return a safety message.

Use case: Zero-tolerance policies for harmful content

Warn and Continue

Log the detection, then let generation proceed under continued monitoring.

Use case: Research environments or when false positives are costly

Retry with Steering

Retry generation with control vectors applied to steer the model away from harmful content.

Use case: When you want to provide helpful responses while avoiding harm

Log and Analyze

Record detailed information for later analysis and model improvement.

Use case: Continuous monitoring and safety research
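
The four strategies above can be sketched as a single dispatch function. This is a minimal illustration, not Wisent-Guard's actual implementation: the names Strategy, Detection, Outcome, handle_detection, and the regenerate callback are all hypothetical, and the steering step is represented only by a caller-supplied function.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Strategy(Enum):
    # Hypothetical names for the four strategies described above.
    BLOCK = "block_generation"
    WARN = "warn_and_continue"
    STEER = "retry_with_steering"
    LOG = "log_and_analyze"


@dataclass
class Detection:
    score: float    # classifier confidence in [0, 1]
    category: str   # label assigned by the classifier
    text: str       # text generated so far


@dataclass
class Outcome:
    text: str
    stopped: bool
    log: List[str] = field(default_factory=list)


SAFETY_MESSAGE = "I can't help with that request."


def handle_detection(
    detection: Detection,
    strategy: Strategy,
    regenerate: Optional[Callable[[], str]] = None,
) -> Outcome:
    """Apply one handling strategy to a single detection."""
    entry = f"{strategy.value}: {detection.category} score={detection.score:.2f}"

    if strategy is Strategy.BLOCK:
        # Stop immediately and replace the output with a safety message.
        return Outcome(SAFETY_MESSAGE, stopped=True, log=[entry])

    if strategy is Strategy.STEER:
        # The caller supplies a function that regenerates with steering applied.
        if regenerate is None:
            raise ValueError("retry_with_steering needs a regenerate callback")
        return Outcome(regenerate(), stopped=False, log=[entry])

    if strategy is Strategy.LOG:
        # Keep the output and record extra detail for later analysis.
        detail = f"chars_generated={len(detection.text)}"
        return Outcome(detection.text, stopped=False, log=[entry, detail])

    # WARN: keep the output and record a single log line.
    return Outcome(detection.text, stopped=False, log=[entry])
```

Keeping the strategy as an explicit argument lets the same detection be handled differently per deployment, for example blocking in production and logging in a research environment.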

Configuration Options

Threshold Settings

Confidence threshold for triggering actions
Different thresholds for different severity levels
Adaptive thresholds based on context
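
As a rough sketch of how these threshold settings could fit together, the snippet below keeps one confidence threshold per severity level and shifts it by context. The threshold values, the context names, and the functions effective_threshold and should_trigger are assumptions made for illustration, not Wisent-Guard configuration keys.

```python
from typing import Dict, Optional

# Hypothetical values: higher-severity categories trigger at lower confidence.
SEVERITY_THRESHOLDS: Dict[str, float] = {"low": 0.9, "medium": 0.75, "high": 0.5}

# Hypothetical context adjustments: positive is more lenient, negative is stricter.
CONTEXT_ADJUSTMENT: Dict[str, float] = {"research": 0.1, "children": -0.2}


def effective_threshold(severity: str, context: Optional[str] = None) -> float:
    """Return the severity threshold shifted by context, clamped to [0, 1]."""
    base = SEVERITY_THRESHOLDS[severity]
    shift = CONTEXT_ADJUSTMENT.get(context, 0.0) if context else 0.0
    return min(1.0, max(0.0, base + shift))


def should_trigger(score: float, severity: str, context: Optional[str] = None) -> bool:
    """True when the classifier score reaches the effective threshold."""
    return score >= effective_threshold(severity, context)
```

With these example numbers, a score of 0.6 triggers for a high-severity category but not for a medium-severity one.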

Custom Messages

Customizable safety messages
Context-aware responses
Suggested alternative topics

Logging Options

Detailed activation logging
Performance metrics tracking
False positive/negative analysis
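
One way to approach false positive/negative analysis is to compare logged classifier decisions against labels assigned in a later review. The sketch below assumes a hypothetical LoggedDetection record and an error_rates helper; neither is taken from the Wisent-Guard source.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class LoggedDetection:
    flagged: bool   # what the classifier decided at generation time
    harmful: bool   # label assigned afterwards, e.g. by a human reviewer


def error_rates(records: Iterable[LoggedDetection]) -> Dict[str, float]:
    """Compute false positive and false negative rates from logged records."""
    items = list(records)
    negatives = sum(1 for r in items if not r.harmful)
    positives = sum(1 for r in items if r.harmful)
    false_pos = sum(1 for r in items if r.flagged and not r.harmful)
    false_neg = sum(1 for r in items if not r.flagged and r.harmful)
    return {
        # Share of harmless outputs that were flagged anyway.
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        # Share of harmful outputs that were missed.
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }
```

Tracking both rates over time shows whether a threshold change trades missed detections for unnecessary blocks, or the reverse.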

Integration Points

Webhook notifications
External safety services
Human review queues
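
To illustrate two of these integration points, the sketch below routes a detection either to an in-process human review queue or to a webhook request. The URL, the event fields, the review_below cutoff, and the function names are hypothetical; the example only builds the HTTP request and does not send it.

```python
import json
import queue
import urllib.request
from typing import Optional

# Hypothetical endpoint; replace with a real webhook URL.
WEBHOOK_URL = "https://example.com/hooks/detections"

# In-process stand-in for a human review queue.
review_queue = queue.Queue()


def build_webhook_request(event: dict, url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing the detection."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def route_detection(event: dict, review_below: float = 0.8) -> Optional[urllib.request.Request]:
    """Queue uncertain detections for human review; notify on confident ones."""
    if event["score"] < review_below:
        review_queue.put(event)
        return None
    return build_webhook_request(event)
```

A returned request could then be sent with urllib.request.urlopen(request, timeout=5), ideally with error handling so that a failing webhook never interrupts generation.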

Implementation Details

For the full implementation of detection handling in Wisent-Guard, including the handling strategies, configuration options, and processing logic, see the source code:

View detection_handling.py on GitHub
