Detection Handling

Detection handling is the set of strategies and actions applied when the classifier flags potentially harmful content in the model's activations.

Handling Strategies

Block Generation

Immediately stop generation and return a safety message.

Use case: Zero-tolerance policies for harmful content

Warn and Continue

Log the detection and let generation proceed under continued monitoring.

Use case: Research environments, or deployments where false positives are costly

Retry with Steering

Apply control vectors to steer the model away from harmful content.

Use case: When you want to provide helpful responses while avoiding harm
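
As a rough illustration of what steering can look like, the sketch below subtracts a scaled control vector from a hidden-state vector. It assumes the control vector points toward the harmful direction; the name `apply_control_vector`, the `strength` parameter, and the plain-list representation are hypothetical and are not taken from Wisent-Guard. A real implementation would likely operate on tensors inside the model's forward pass.

```python
from typing import List


def apply_control_vector(
    hidden: List[float], control: List[float], strength: float = 1.0
) -> List[float]:
    """Hypothetical helper: return hidden - strength * control.

    Assumes `control` points toward the harmful direction, so subtracting
    it moves the activation away from that direction.
    """
    if len(hidden) != len(control):
        raise ValueError("hidden and control must have the same length")
    return [h - strength * c for h, c in zip(hidden, control)]


def projection(hidden: List[float], direction: List[float]) -> float:
    """Dot product of the activation with the control direction."""
    return sum(h * d for h, d in zip(hidden, direction))


hidden = [1.0, 2.0]
control = [1.0, 0.0]
steered = apply_control_vector(hidden, control, strength=0.5)
```

After steering, the activation's projection onto the control direction is smaller than before, which is the property a retry relies on.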

Log and Analyze

Record detailed information for later analysis and model improvement.

Use case: Continuous monitoring and safety research
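
The four strategies above can be thought of as branches of a single dispatcher. The following is a minimal sketch under stated assumptions: `Strategy`, `Outcome`, and `DetectionHandler` are hypothetical names invented for illustration, and the choice to record every detection regardless of strategy is an assumption, not documented Wisent-Guard behavior.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Strategy(Enum):
    BLOCK = "block"  # Block Generation
    WARN = "warn"    # Warn and Continue
    STEER = "steer"  # Retry with Steering
    LOG = "log"      # Log and Analyze


@dataclass
class Outcome:
    continue_generation: bool
    retry_with_steering: bool = False
    message: Optional[str] = None


@dataclass
class DetectionHandler:
    strategy: Strategy
    safety_message: str = "This response was stopped by a safety check."
    log: List[dict] = field(default_factory=list)

    def handle(self, score: float, text: str) -> Outcome:
        # Assumption: every strategy records the detection for later analysis.
        self.log.append(
            {"strategy": self.strategy.value, "score": score, "text": text}
        )
        if self.strategy is Strategy.BLOCK:
            # Stop immediately and return the safety message.
            return Outcome(continue_generation=False, message=self.safety_message)
        if self.strategy is Strategy.STEER:
            # Abandon the current attempt and ask the caller to retry with steering.
            return Outcome(continue_generation=False, retry_with_steering=True)
        # WARN and LOG both let generation proceed.
        return Outcome(continue_generation=True)
```

A caller would check `continue_generation` after each detection and, for the steering case, regenerate the response with a control vector applied.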

Configuration Options

Threshold Settings

  • Confidence threshold for triggering actions
  • Different thresholds for different severity levels
  • Adaptive thresholds based on context
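
One way to picture these threshold options is a per-severity lookup with a context-dependent offset. The sketch below is illustrative only: the severity names, the numeric values, and the `context_adjustment` parameter are assumptions, not Wisent-Guard defaults.

```python
from typing import Dict

# Hypothetical values: more severe categories trigger at lower confidence.
SEVERITY_THRESHOLDS: Dict[str, float] = {"low": 0.9, "medium": 0.75, "high": 0.5}


def should_trigger(
    confidence: float, severity: str, context_adjustment: float = 0.0
) -> bool:
    """Return True when the classifier confidence meets the threshold.

    `context_adjustment` shifts the base threshold up (more lenient) or
    down (stricter); the result is clamped to the range [0, 1].
    """
    base = SEVERITY_THRESHOLDS[severity]
    threshold = min(1.0, max(0.0, base + context_adjustment))
    return confidence >= threshold
```

With these example values, a confidence of 0.6 triggers an action for a high-severity category but not for a low-severity one, and raising the threshold by 0.2 for a trusted context suppresses the high-severity trigger as well.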

Custom Messages

  • Customizable safety messages
  • Context-aware responses
  • Suggested alternative topics
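
A small sketch of how these message options might combine: a default message, optional per-category overrides, and a list of suggested alternative topics. The message texts, the `violence` category, and the function name `build_safety_message` are all hypothetical.

```python
from typing import Dict, Sequence

DEFAULT_MESSAGE = "I can't help with that request."

# Hypothetical per-category overrides for context-aware responses.
CATEGORY_MESSAGES: Dict[str, str] = {
    "violence": "I can't help with content that describes violence.",
}


def build_safety_message(category: str, alternatives: Sequence[str] = ()) -> str:
    """Pick a message for the detected category and append suggested topics."""
    message = CATEGORY_MESSAGES.get(category, DEFAULT_MESSAGE)
    if alternatives:
        message += " You could ask about: " + ", ".join(alternatives) + "."
    return message
```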

Logging Options

  • Detailed activation logging
  • Performance metrics tracking
  • False positive/negative analysis
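
For the false positive/negative analysis, logged detections can be compared against human labels once those are available. The following sketch assumes each record is a `(flagged, actually_harmful)` pair; that record format and the function names are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple


def confusion_counts(records: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count outcomes from (flagged, actually_harmful) pairs."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for flagged, harmful in records:
        if flagged and harmful:
            counts["tp"] += 1
        elif flagged:
            counts["fp"] += 1
        elif harmful:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def error_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """False positive rate = fp / (fp + tn); false negative rate = fn / (fn + tp)."""
    benign = counts["fp"] + counts["tn"]
    harmful = counts["fn"] + counts["tp"]
    return {
        "false_positive_rate": counts["fp"] / benign if benign else 0.0,
        "false_negative_rate": counts["fn"] / harmful if harmful else 0.0,
    }
```

Tracking these two rates over time gives a concrete signal for tuning the confidence thresholds described earlier.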

Integration Points

  • Webhook notifications
  • External safety services
  • Human review queues
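
As an illustration of two of these integration points, the sketch below builds a JSON webhook request with the standard library and places a detection on an in-process review queue. The URL is a placeholder, the payload fields are hypothetical, and a production review queue would more likely be an external service than a `queue.Queue`.

```python
import json
import queue
import urllib.request

# In-process stand-in for a human review queue (assumption for illustration).
review_queue: "queue.Queue[dict]" = queue.Queue()


def build_webhook_request(url: str, detection: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the detection as JSON."""
    body = json.dumps(detection).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def enqueue_for_review(detection: dict) -> None:
    """Hand a detection to human reviewers."""
    review_queue.put(detection)


detection = {"score": 0.82, "strategy": "warn", "excerpt": "example text"}
request = build_webhook_request("https://example.com/hooks/safety", detection)
enqueue_for_review(detection)
# Sending would be: urllib.request.urlopen(request, timeout=5)
```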

Implementation Details

For the full implementation of the handling strategies, configuration options, and processing logic in Wisent-Guard, see the source code:

View detection_handling.py on GitHub