Core building blocks and components of the Wisent-Guard system.
A set of weights used to generate responses. Each model has distinct parameters and special tokens for marking queries and responses.
A single processing block in a transformer model that updates the residual stream through attention mechanisms and MLPs.
A pair of examples demonstrating positive and negative instances for representation learning and behavior steering.
A collection of contrastive pairs used together for training classifiers and learning control directions.
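To make the two definitions above concrete, here is a minimal sketch of how a contrastive pair and a pair set could be represented. The class and field names (`ContrastivePair`, `ContrastivePairSet`, `prompt`, `positive`, `negative`, `labeled_texts`) are hypothetical and chosen for illustration; they are not taken from the Wisent-Guard codebase.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContrastivePair:
    """One prompt with a positive and a negative example response (hypothetical layout)."""
    prompt: str
    positive: str
    negative: str


@dataclass
class ContrastivePairSet:
    """A named collection of pairs that are used together."""
    name: str
    pairs: list = field(default_factory=list)

    def add(self, prompt: str, positive: str, negative: str) -> None:
        self.pairs.append(ContrastivePair(prompt, positive, negative))

    def labeled_texts(self) -> list:
        """Flatten the set into (text, label) rows: 1 = positive, 0 = negative."""
        rows = []
        for pair in self.pairs:
            rows.append((pair.prompt + " " + pair.positive, 1))
            rows.append((pair.prompt + " " + pair.negative, 0))
        return rows
```

Flattening into labeled rows is one possible design: it lets the same set feed both classifier training (which needs labels) and direction learning (which needs the positive and negative sides kept apart).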
All intermediate values computed during a forward pass, including the residual stream and the outputs of individual components such as attention and MLP blocks.
The method used to collect and process activations from multiple tokens or layers for analysis and monitoring.
The statistical methods used to combine multiple activation vectors into meaningful representations for detection and control.
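As an illustration of aggregation, the sketch below combines per-token activation vectors into a single vector using three common choices (mean, last token, element-wise max). The function name `aggregate` and the `method` values are assumptions made for this example, not the names used by Wisent-Guard, and plain Python lists stand in for tensors.

```python
def aggregate(token_activations, method="mean"):
    """Combine per-token vectors (a list of equal-length lists) into one vector."""
    if not token_activations:
        raise ValueError("need at least one token vector")
    if method == "last":
        # Use only the final token's activation.
        return list(token_activations[-1])
    # Transpose so each entry holds one dimension across all tokens.
    columns = list(zip(*token_activations))
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown method: {method}")
```

For example, `aggregate([[1.0, 2.0], [3.0, 6.0]])` averages the two token vectors dimension by dimension, while `method="last"` keeps only the second one.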
Machine learning models trained on activation patterns to detect unwanted representations and behaviors.
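A classifier of this kind can be as simple as a linear probe. The following sketch trains a logistic-regression probe on aggregated activation vectors with plain gradient descent. It is an assumption about one workable approach, not a description of the classifier Wisent-Guard ships; the function names and hyperparameters are illustrative.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, vector):
    """Probability that the activation vector belongs to the flagged class."""
    return _sigmoid(sum(w * x for w, x in zip(weights, vector)) + bias)


def train_logistic_probe(vectors, labels, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = predict_proba(weights, bias, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

In practice the input vectors would be activations gathered from a contrastive pair set, with label 1 for the side that should be detected.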
The system for processing and responding to detected unwanted patterns, including blocking and mitigation strategies.
A vector added to the activations at selected layers to steer the generation process.
The process of dynamically influencing model behavior during generation using control vectors and activation modification.
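The two definitions above can be sketched as follows. The example derives a control vector as the difference between mean positive and mean negative activations, which is one common technique and an assumption here rather than the method the source specifies, and then adds it to the activations of chosen layers. All function names, the `strength` parameter, and the dictionary layout are hypothetical.

```python
def mean_difference_vector(positive_acts, negative_acts):
    """Control direction = mean(positive activations) - mean(negative activations)."""
    def mean(vectors):
        return [sum(col) / len(col) for col in zip(*vectors)]

    pos, neg = mean(positive_acts), mean(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def apply_control_vector(layer_activations, control_vector, strength=1.0):
    """Add the scaled control vector to every token's activation at one layer."""
    return [
        [a + strength * c for a, c in zip(token, control_vector)]
        for token in layer_activations
    ]


def steer(activations_by_layer, control_vector, target_layers, strength=1.0):
    """Modify only the target layers; other layers pass through unchanged."""
    return {
        layer: apply_control_vector(acts, control_vector, strength)
        if layer in target_layers
        else acts
        for layer, acts in activations_by_layer.items()
    }
```

In a real model the addition would happen inside the forward pass (for instance through a hook on the chosen layers) so that later layers see the modified residual stream. The dictionary here only stands in for that mechanism.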
Tracks system performance metrics including latency and memory usage during model inference and representation engineering operations.
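A minimal sketch of such tracking, using only the Python standard library, is shown below. The `track` context manager and the result keys are hypothetical names. Note that `tracemalloc` only sees memory allocated by the Python interpreter, so accelerator memory would need a framework-specific counter instead.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track(label, results):
    """Record wall-clock latency and peak Python memory for the enclosed block."""
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # requires Python 3.9+
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        if not already_tracing:
            tracemalloc.stop()
        results[label] = {"latency_s": elapsed, "peak_bytes": peak}
```

Usage: create an empty `results = {}` dictionary, wrap an inference or steering call in `with track("generate", results): ...`, and read `results["generate"]` afterwards.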
Detects and prevents the generation of incoherent, illogical, or meaningless responses.