CAA

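CAA (Contrastive Activation Addition) is a simple steering method: it averages the model's activations over positive and negative examples and uses the difference between the two averages as a control vector, which is added to the model's activations during inference.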

How CAA Works

CAA training runs pairs of positive and negative examples through the model and captures the internal activations at a specific layer. It then computes the average activation over all positive examples and the average over all negative examples. Subtracting the negative average from the positive average gives a "steering vector": the direction in activation space that separates the desired behavior from the undesired one.
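The averaging and subtraction described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the Wisent-Guard implementation: plain Python lists stand in for the activation tensors, the numbers are made up, and the helper names `mean_activation` and `caa_steering_vector` are hypothetical.

```python
from statistics import fmean


def mean_activation(activations):
    """Average a list of equal-length activation vectors, one dimension at a time."""
    return [fmean(dim_values) for dim_values in zip(*activations)]


def caa_steering_vector(positive, negative):
    """Return mean(positive) - mean(negative) as a list of floats."""
    pos_mean = mean_activation(positive)
    neg_mean = mean_activation(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Toy activations captured at one layer: 3 examples per class, hidden size 4.
# These values are invented for illustration only.
positive_acts = [[1.0, 0.0, 2.0, 1.0], [3.0, 0.0, 2.0, 1.0], [2.0, 0.0, 2.0, 1.0]]
negative_acts = [[0.0, 1.0, 2.0, 1.0], [0.0, 3.0, 2.0, 1.0], [0.0, 2.0, 2.0, 1.0]]

steering_vector = caa_steering_vector(positive_acts, negative_acts)
print(steering_vector)  # [2.0, -2.0, 0.0, 0.0]
```

Dimensions where the two classes agree cancel to zero, so only the directions that distinguish positive from negative examples survive in the vector.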

During inference, the steering vector is added to the model's activations at that same layer, scaled by a strength parameter. The addition targets the second-to-last token position in the sequence, which influences what the model generates next. The vector can optionally be L2-normalized to control its magnitude. Once trained, it is saved as a PyTorch tensor file that can be loaded and reused without any additional training.
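The inference step can be illustrated the same way. The sketch below is an assumption-laden toy rather than the library's code: hidden states are a list of per-token vectors, `l2_normalize` and `apply_steering` are hypothetical helper names, and the sequence is assumed to contain at least two tokens so that a second-to-last position exists.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def apply_steering(hidden_states, steering_vector, strength=1.0, position=-2):
    """Return a copy of hidden_states with strength * steering_vector
    added at a single token position (second-to-last by default)."""
    steered = [list(token) for token in hidden_states]
    steered[position] = [
        h + strength * v for h, v in zip(steered[position], steering_vector)
    ]
    return steered


# Toy hidden states at the steered layer: 3 tokens, hidden size 2.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

vector = l2_normalize([3.0, 4.0])  # unit-length direction, roughly [0.6, 0.8]
steered = apply_steering(hidden, vector, strength=1.5)
print(steered[-2])  # only the second-to-last token is shifted
```

In this toy, the strength parameter plays the role of `--steering-strength` and the optional normalization step plays the role of `--normalization l2`. With a normalized vector, the strength alone determines how far the activations are moved.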

CAA is conceptually simple. All it does is average activation patterns, subtract one average from the other, and add the result to a particular layer at inference time.

CLI Examples

# Basic CAA training with default settings

python -m wisent_guard.cli tasks honesty_pairs.json --from-json --steering-mode --steering-method CAA --layer 15 --save-steering-vector honesty_caa.pt

# CAA training with specific parameters

python -m wisent_guard.cli tasks politeness_pairs.json --from-json --steering-mode --steering-method CAA --layer 12 --max-new-tokens 50 --device cuda --save-steering-vector politeness_caa.pt --limit 100

# CAA inference using saved vector

python -m wisent_guard.cli tasks test_questions.json --from-json --steering-mode --steering-method CAA --layer 15 --load-steering-vector honesty_caa.pt --steering-strength 1.5

# CAA with normalization and memory monitoring

python -m wisent_guard.cli tasks safety_pairs.json --from-json --steering-mode --steering-method CAA --layer 18 --normalization l2 --show-memory-usage --allow-small-dataset

# CAA with token steering enabled

python -m wisent_guard.cli tasks helpfulness_pairs.json --from-json --steering-mode --steering-method CAA --layer 15 --enable-token-steering --token-steering-strategy second_to_last --save-steering-vector helpful_caa.pt

Parameters

CAA Specific Parameters

--normalization: none, l2 (default: none)

Token Steering Parameters

--enable-token-steering: Enable position-based steering
--token-steering-strategy: last_only, second_to_last, first_only, all_equal, exponential_decay, exponential_growth, linear_decay, linear_growth
--token-decay-rate: Decay rate for exponential strategies (default: 0.5)
--token-min-strength: Minimum strength for decay strategies (default: 0.1)
--token-max-strength: Maximum strength for growth strategies (default: 1.0)
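To make the strategy names more concrete, here is one possible reading of how a strategy could map to per-token steering weights. This is a guess for illustration, not the Wisent-Guard implementation. The function `token_weights` is hypothetical, the exponential decay formula (multiply by the decay rate at each position, floored at the minimum strength) is an assumption, and so is the fallback to the only token when the sequence is too short for `second_to_last`.

```python
def token_weights(strategy, seq_len, decay_rate=0.5, min_strength=0.1):
    """Return one steering weight per token position for a given strategy.

    Hypothetical sketch: the real formulas in the library may differ.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    if strategy == "all_equal":
        return [1.0] * seq_len

    weights = [0.0] * seq_len
    if strategy == "last_only":
        weights[-1] = 1.0
    elif strategy == "second_to_last":
        # Assumption: fall back to the only token when seq_len == 1.
        weights[max(seq_len - 2, 0)] = 1.0
    elif strategy == "first_only":
        weights[0] = 1.0
    elif strategy == "exponential_decay":
        # Assumption: weight shrinks by decay_rate per position from the
        # first token onward and never drops below min_strength.
        weights = [max(decay_rate ** i, min_strength) for i in range(seq_len)]
    else:
        raise ValueError(f"strategy not covered by this sketch: {strategy}")
    return weights


print(token_weights("second_to_last", 4))     # [0.0, 0.0, 1.0, 0.0]
print(token_weights("exponential_decay", 5))  # [1.0, 0.5, 0.25, 0.125, 0.1]
```

Under this reading, each weight would multiply the steering strength at its token position, so `second_to_last` steers one token and leaves the rest untouched.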

Implementation Details

For the complete implementation of the CAA steering method in Wisent-Guard, see:

caa.py
Original Paper
Original Implementation
