Methods and techniques for analyzing and interpreting model representations to detect specific traits and behaviors.
Machine learning models that analyze activation patterns to detect specific traits or behaviors in model representations.
Strategies and actions taken when classifiers identify potentially harmful or unwanted content in model activations.