The Activation Aggregation Method defines how individual activation vectors are combined and processed to create meaningful representations for analysis and detection.
Activation aggregation applies only to representation control. In representation reading, a classifier is trained directly on the activations collected from the contrastive pair sets, so no aggregation step is involved. For control, the aggregation method specifies how to transform the collected activations (two per contrastive pair, one positive and one negative, so twice the size of the training pair set) into a single reproducible steering vector. This vector is added to the model's activations at inference time to shape its behaviour to be more consistent with the user's expectations.
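To make the inference-time step concrete, here is a minimal sketch of adding a steering vector to a layer's activations, assuming activations are PyTorch tensors. The function name, signature, and `strength` default are illustrative, not Wisent-Guard's actual API.

```python
import torch

def apply_steering(
    hidden_states: torch.Tensor,   # layer activations, e.g. (batch, seq_len, hidden_dim)
    control_vector: torch.Tensor,  # aggregated steering vector, (hidden_dim,)
    strength: float = 1.0,         # scales how strongly the behaviour is shifted
) -> torch.Tensor:
    # The control vector broadcasts across batch and sequence positions,
    # nudging every token's representation in the same direction.
    return hidden_states + strength * control_vector
```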
Several methods could be used for this task; Wisent-Guard currently implements one: Contrastive Activation Addition (CAA). CAA computes the element-wise mean of the differences between positive and negative activations across all contrastive pairs in the training set: each negative activation is subtracted from its corresponding positive activation, and the resulting difference vectors are averaged into a single control vector, a stable directional bias that can be applied consistently during inference. The control vector has the same dimensionality as the original transformer layer activations and can be scaled by a strength parameter to control the magnitude of steering applied to the model's internal representations.
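The following sketch shows the CAA computation as described above, assuming positive and negative activations are stacked into tensors with one row per contrastive pair. The function name and signature are hypothetical, not the library's actual interface.

```python
import torch

def caa_control_vector(
    positive_activations: torch.Tensor,  # (num_pairs, hidden_dim)
    negative_activations: torch.Tensor,  # (num_pairs, hidden_dim)
) -> torch.Tensor:
    """Compute a CAA control vector as the mean of per-pair differences."""
    # One difference vector per contrastive pair.
    differences = positive_activations - negative_activations
    # Averaging across pairs yields a single steering direction with the
    # same dimensionality as the layer activations.
    return differences.mean(dim=0)
```

Because the mean preserves dimensionality, the resulting vector can be passed directly to an inference-time addition step like the one sketched earlier and scaled by the strength parameter.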
For a complete understanding of how activation aggregation methods work in Wisent-Guard, explore the source code:
View aggregation.py on GitHub