CAA (Contrastive Activation Addition) - Simple steering that takes the difference between the average positive and average negative activation patterns to create a control vector, which is added to model activations during inference.
CAA training runs positive and negative example pairs through the model and captures the internal activations at a specific layer. It then computes the average activation pattern over all positive examples and the average over all negative examples, and subtracts the negative average from the positive one. The result is a "steering vector" that represents the direction in activation space separating the desired behavior from the undesired one.
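The mean-difference step can be sketched in a few lines of PyTorch. This is a minimal illustration, not Wisent-Guard's implementation: the function name `caa_steering_vector` and the assumption that activations have already been captured as `(num_examples, hidden_size)` tensors are hypothetical.

```python
import torch


def caa_steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Return mean(positive activations) - mean(negative activations).

    Both inputs are assumed to have shape (num_examples, hidden_size) and to
    come from the same layer of the same model (hypothetical setup).
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)


# Toy activations standing in for values captured at one layer.
pos = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # positive examples
neg = torch.tensor([[0.0, 1.0], [2.0, 1.0]])  # negative examples
vec = caa_steering_vector(pos, neg)           # one vector of size hidden_size
```

The result has one entry per hidden dimension, so it can be added directly to any activation at the layer it was computed from.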
During inference, this steering vector is simply added to the model's activations at that same layer, scaled by a strength parameter, specifically targeting the second-to-last token position in the sequence to influence what the AI generates next. The vector can optionally be normalized using L2 normalization to control its magnitude, and once trained, it's saved as a simple PyTorch tensor file that can be loaded and reused instantly without any additional training.
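The inference step described above might look roughly like the following sketch. It assumes hidden states of shape `(batch, seq_len, hidden_size)`. The helper name `apply_steering`, the in-memory buffer, and saving the vector as a bare tensor are illustrative assumptions, not the file format or API that Wisent-Guard actually uses.

```python
import io

import torch


def apply_steering(hidden, vector, strength=1.0, normalize=False, position=-2):
    """Add a scaled steering vector at one token position (hypothetical helper).

    hidden: (batch, seq_len, hidden_size) activations at the steered layer.
    position=-2 targets the second-to-last token, as described in the text.
    """
    if normalize:
        # Optional L2 normalization so only `strength` controls the magnitude.
        vector = vector / vector.norm(p=2).clamp_min(1e-12)
    steered = hidden.clone()
    steered[:, position, :] += strength * vector
    return steered


hidden = torch.zeros(1, 4, 3)            # toy activations: 4 tokens, hidden size 3
vector = torch.tensor([3.0, 0.0, 4.0])   # toy steering vector with L2 norm 5
steered = apply_steering(hidden, vector, strength=1.5, normalize=True)

# Saving and reloading a tensor; a BytesIO buffer stands in for a .pt file here.
buffer = io.BytesIO()
torch.save(vector, buffer)
buffer.seek(0)
reloaded = torch.load(buffer)
```

In a real model this addition would typically run inside a forward hook on the chosen layer (for example via `torch.nn.Module.register_forward_hook`), so the steered activations flow into the remaining layers.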
CAA is conceptually simple: it averages activation patterns, subtracts one average from the other, and adds the resulting vector to a particular layer at inference time.
python -m wisent_guard.cli tasks honesty_pairs.json --from-json --steering-mode --steering-method CAA --layer 15 --save-steering-vector honesty_caa.pt
python -m wisent_guard.cli tasks politeness_pairs.json --from-json --steering-mode --steering-method CAA --layer 12 --max-new-tokens 50 --device cuda --save-steering-vector politeness_caa.pt --limit 100
python -m wisent_guard.cli tasks test_questions.json --from-json --steering-mode --steering-method CAA --layer 15 --load-steering-vector honesty_caa.pt --steering-strength 1.5
python -m wisent_guard.cli tasks safety_pairs.json --from-json --steering-mode --steering-method CAA --layer 18 --normalization l2 --show-memory-usage --allow-small-dataset
python -m wisent_guard.cli tasks helpfulness_pairs.json --from-json --steering-mode --steering-method CAA --layer 15 --enable-token-steering --token-steering-strategy second_to_last --save-steering-vector helpful_caa.pt
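One way to think about the position-based strategies is as a per-token weight that scales the steering vector before it is added. The sketch below covers only the four strategies whose names are unambiguous (`last_only`, `second_to_last`, `first_only`, `all_equal`). The function `position_weights` is hypothetical, and the decay and growth strategies are left out because their exact formulas are not given here.

```python
def position_weights(strategy: str, seq_len: int) -> list[float]:
    """Per-token multipliers for the steering vector (illustrative only)."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    if strategy == "all_equal":
        return [1.0] * seq_len

    weights = [0.0] * seq_len
    if strategy == "first_only":
        weights[0] = 1.0
    elif strategy == "last_only":
        weights[-1] = 1.0
    elif strategy == "second_to_last":
        # Assumption: fall back to the only token when the sequence has length 1.
        weights[-2 if seq_len >= 2 else -1] = 1.0
    else:
        raise ValueError(f"strategy not covered by this sketch: {strategy}")
    return weights


w = position_weights("second_to_last", 4)
```

Under this reading, the steered activation at token `i` would be `hidden[i] + strength * weights[i] * vector`, so a weight of zero leaves that position untouched.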
--normalization: none, l2 (default: none)
--enable-token-steering: Enable position-based steering
--token-steering-strategy: last_only, second_to_last, first_only, all_equal, exponential_decay, exponential_growth, linear_decay, linear_growth
--token-decay-rate: Decay rate for exponential strategies (default: 0.5)
--token-min-strength: Minimum strength for decay strategies (default: 0.1)
--token-max-strength: Maximum strength for growth strategies (default: 1.0)

For the complete implementation of the CAA steering method in Wisent-Guard, see: