Detection Handling refers to the strategies and actions taken when the classifier identifies potentially harmful content in the model's activations.
Immediately stop generation and return a safety message.
Use case: Zero-tolerance policies for harmful content
Log the detection but allow generation to proceed with monitoring.
Use case: Research environments or when false positives are costly
Apply control vectors to steer the model away from harmful content.
Use case: When you want to provide helpful responses while avoiding harm
Record detailed information for later analysis and model improvement.
Use case: Continuous monitoring and safety research
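The four behaviours above can be pictured as one dispatch step that runs whenever the classifier fires. The following is a minimal sketch under stated assumptions, not Wisent-Guard's actual API: `Strategy`, `Detection`, `Outcome`, `handle_detection`, and `SAFETY_MESSAGE` are hypothetical names chosen for illustration, activations are modelled as a plain list of floats, and steering is assumed to mean adding a scaled control vector that points away from the harmful direction.

```python
import logging
from dataclasses import asdict, dataclass
from enum import Enum
from typing import List, Optional

logger = logging.getLogger("detection_handling_sketch")

# Hypothetical placeholder text; a real deployment would configure this.
SAFETY_MESSAGE = "I can't help with that request."


class Strategy(Enum):
    """Illustrative labels for the four behaviours described above."""
    BLOCK = "block"      # stop generation and return a safety message
    MONITOR = "monitor"  # log the detection, let generation proceed
    STEER = "steer"      # apply a control vector to the activations
    RECORD = "record"    # store details for later analysis


@dataclass
class Detection:
    score: float       # classifier score for the harmful class
    layer: int         # layer whose activations were classified
    token_index: int   # position in the generated sequence


@dataclass
class Outcome:
    continue_generation: bool
    message: Optional[str] = None
    activations: Optional[List[float]] = None


def handle_detection(
    detection: Detection,
    activations: List[float],
    strategy: Strategy,
    control_vector: Optional[List[float]] = None,
    strength: float = 1.0,
    audit_log: Optional[list] = None,
) -> Outcome:
    """Apply one handling strategy to a single detection event."""
    if strategy is Strategy.BLOCK:
        # Zero tolerance: halt and hand back a fixed message.
        return Outcome(continue_generation=False, message=SAFETY_MESSAGE)

    if strategy is Strategy.MONITOR:
        # Generation proceeds unchanged; the event is only logged.
        logger.warning("detection score=%.2f layer=%d token=%d",
                       detection.score, detection.layer, detection.token_index)
        return Outcome(continue_generation=True, activations=activations)

    if strategy is Strategy.STEER:
        if control_vector is None or len(control_vector) != len(activations):
            raise ValueError("STEER needs a control vector matching the activations")
        # Assumption: the control vector points away from the harmful direction.
        steered = [a + strength * c for a, c in zip(activations, control_vector)]
        return Outcome(continue_generation=True, activations=steered)

    # Strategy.RECORD: keep a structured copy of the event for offline analysis.
    if audit_log is not None:
        audit_log.append(asdict(detection))
    return Outcome(continue_generation=True, activations=activations)
```

A caller would pick the strategy from its own policy: for example, `handle_detection(d, acts, Strategy.BLOCK)` returns an outcome with `continue_generation` set to `False`, while `Strategy.STEER` returns modified activations for the generation loop to continue from.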
For the full implementation of these handling strategies in Wisent-Guard, including configuration options and processing logic, see the source code:
View detection_handling.py on GitHub