K-Steering - Gradient-based steering using a trained neural network classifier to push activations away from negative patterns in multiple directions.
K-Steering trains a small neural network classifier (3 layers by default) to distinguish positive from negative activation patterns, rather than computing a control vector directly. During training, the activations from each pair are passed through the classifier, which is optimized with standard cross-entropy loss to predict whether an activation came from a positive or a negative example.
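The training step described above can be sketched in PyTorch as follows. This is a minimal illustration, not the Wisent-Guard implementation: the function names `build_classifier` and `train_classifier`, the ReLU activations, the Adam optimizer, the reading of "number of layers" as the count of linear layers, and the synthetic activations are all assumptions. The default values mirror the `--k-*` options.

```python
import torch
from torch import nn

def build_classifier(input_dim, hidden_dim=128, num_layers=3, dropout=0.1):
    """MLP with `num_layers` linear layers and 2 output logits (0=negative, 1=positive)."""
    layers, dim = [], input_dim
    for _ in range(num_layers - 1):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        dim = hidden_dim
    layers.append(nn.Linear(dim, 2))
    return nn.Sequential(*layers)

def train_classifier(pos, neg, epochs=20, lr=1e-3, batch_size=32):
    """Fit the classifier on positive/negative activations with cross-entropy loss."""
    x = torch.cat([pos, neg])
    y = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))]).long()
    clf = build_classifier(x.shape[1])
    optimizer = torch.optim.Adam(clf.parameters(), lr=lr)  # optimizer choice is an assumption
    loss_fn = nn.CrossEntropyLoss()
    clf.train()
    for _ in range(epochs):
        order = torch.randperm(len(x))  # reshuffle every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(clf(x[idx]), y[idx])
            loss.backward()
            optimizer.step()
    return clf

# Synthetic, well-separated stand-ins for real layer activations.
torch.manual_seed(0)
pos = torch.randn(64, 16) + 2.0
neg = torch.randn(64, 16) - 2.0
clf = train_classifier(pos, neg)

clf.eval()
with torch.no_grad():
    acc_pos = (clf(pos).argmax(dim=1) == 1).float().mean().item()
    acc_neg = (clf(neg).argmax(dim=1) == 0).float().mean().item()
```

In practice `pos` and `neg` would hold activations extracted from the chosen model layer, one row per example.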
At inference time, K-Steering computes the gradient of the classifier's loss with respect to the input activations and subtracts it, scaled by a step size α (set with --k-alpha): x' = x - α∇L(x). This pushes activations away from patterns the classifier associates with negative examples.
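A single steering step of this form might look like the sketch below. The function name `k_steer` is hypothetical, and the choice to compute the loss against the positive label (so that descending the loss moves away from the negative class) is an assumption, since the text does not say which target the loss uses. A hand-set linear classifier stands in for the trained network so the result is easy to check by hand.

```python
import torch
from torch import nn
import torch.nn.functional as F

def k_steer(clf, x, alpha=0.01, target_label=1):
    """Return x - alpha * dL/dx, where L is cross-entropy against `target_label`."""
    x = x.detach().clone().requires_grad_(True)
    with torch.enable_grad():  # gradients are needed even at inference time
        target = torch.full((x.shape[0],), target_label, dtype=torch.long)
        # reduction="sum" keeps each row's step independent of the batch size
        loss = F.cross_entropy(clf(x), target, reduction="sum")
        (grad,) = torch.autograd.grad(loss, x)
    return (x - alpha * grad).detach()

# Stand-in classifier: logits = (-x0, +x0), so feature 0 decides the class.
clf = nn.Linear(4, 2)
with torch.no_grad():
    clf.weight.copy_(torch.tensor([[-1.0, 0.0, 0.0, 0.0],
                                   [1.0, 0.0, 0.0, 0.0]]))
    clf.bias.zero_()

def positive_loss(v):
    """Cross-entropy of a single activation against the positive label."""
    return F.cross_entropy(clf(v), torch.tensor([1])).item()

x = torch.zeros(1, 4)
steered = k_steer(clf, x, alpha=0.1)
loss_before = positive_loss(x)
loss_after = positive_loss(steered)
```

Here the gradient at `x = 0` is -1 along feature 0, so the step moves that feature to 0.1 and the loss against the positive label drops. The sketch leaves the classifier's train/eval mode to the caller.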
The method requires gradients to be enabled during inference and keeps the classifier in training mode so that the gradients are meaningful. Like the other methods, it targets the second-to-last token position. Instead of saving a single vector, K-Steering saves the entire trained classifier network. As with other methods, the classifier is saved and loaded with the --save-steering-vector and --load-steering-vector flags, as in the commands below.
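To illustrate what saving a whole classifier rather than a vector involves, here is a generic PyTorch round trip. The checkpoint layout (a dict with `config` and `state_dict` keys) and the helper `make_clf` are assumptions for illustration; the actual format of the `.pt` files written by Wisent-Guard is not documented here.

```python
import os
import tempfile

import torch
from torch import nn

def make_clf(input_dim, hidden_dim):
    """Hypothetical stand-in for the trained K-Steering classifier."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, 2),
    )

config = {"input_dim": 16, "hidden_dim": 128}
clf = make_clf(**config)

# Store the architecture settings next to the weights so the network
# can be rebuilt before its parameters are restored.
path = os.path.join(tempfile.mkdtemp(), "classification_k.pt")
torch.save({"config": config, "state_dict": clf.state_dict()}, path)

checkpoint = torch.load(path, map_location="cpu")
restored = make_clf(**checkpoint["config"])
restored.load_state_dict(checkpoint["state_dict"])

# The restored network should reproduce the original outputs.
clf.eval()
restored.eval()
x = torch.randn(3, 16)
with torch.no_grad():
    same_outputs = torch.allclose(clf(x), restored(x))
```

Unlike a plain steering vector, the file has to carry enough information to rebuild the network, which is why the layer sizes are stored alongside the weights in this sketch.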
python -m wisent_guard.cli tasks classification_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 15 --save-steering-vector classification_k.pt
python -m wisent_guard.cli tasks detection_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 16 --k-hidden-dim 256 --k-num-layers 4 --save-steering-vector detection_k.pt
python -m wisent_guard.cli tasks analysis_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 14 --k-epochs 50 --k-lr 0.01 --k-batch-size 16 --save-steering-vector analysis_k.pt
python -m wisent_guard.cli tasks evaluation_tasks.json --from-json --steering-mode --steering-method K-Steering --layer 15 --load-steering-vector classification_k.pt --k-alpha 0.05
python -m wisent_guard.cli tasks robustness_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 13 --k-dropout 0.3 --k-epochs 100 --save-steering-vector robust_k.pt
python -m wisent_guard.cli tasks attention_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 18 --device cuda --enable-token-steering --token-steering-strategy first_only --save-steering-vector attention_k.pt
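The last command enables position-based steering. One plausible reading of the `--token-steering-strategy` names is a per-token weight that scales the steering strength at each position. The sketch below follows that reading; the formulas, the direction of decay (from the first token toward the last), and the function name `token_weights` are assumptions, not the Wisent-Guard implementation.

```python
def token_weights(seq_len, strategy, decay_rate=0.5,
                  min_strength=0.1, max_strength=1.0):
    """Return one steering weight per token position (hypothetical formulas)."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    last = seq_len - 1

    def one_hot(index):
        return [1.0 if i == index else 0.0 for i in range(seq_len)]

    if strategy == "last_only":
        return one_hot(last)
    if strategy == "second_to_last":
        return one_hot(max(seq_len - 2, 0))
    if strategy == "first_only":
        return one_hot(0)
    if strategy == "all_equal":
        return [1.0] * seq_len
    if strategy == "exponential_decay":
        # Geometric fall-off from the first token, floored at min_strength.
        return [max(min_strength, max_strength * decay_rate ** i)
                for i in range(seq_len)]
    if strategy == "linear_decay":
        # Straight line from max_strength (first token) to min_strength (last).
        if seq_len == 1:
            return [max_strength]
        step = (max_strength - min_strength) / last
        return [max_strength - step * i for i in range(seq_len)]
    if strategy == "exponential_growth":
        return token_weights(seq_len, "exponential_decay", decay_rate,
                             min_strength, max_strength)[::-1]
    if strategy == "linear_growth":
        return token_weights(seq_len, "linear_decay", decay_rate,
                             min_strength, max_strength)[::-1]
    raise ValueError(f"unknown strategy: {strategy}")

first = token_weights(4, "first_only")
decay = token_weights(5, "exponential_decay")
linear = token_weights(3, "linear_decay")
```

With the defaults, the exponential weights halve at each position until they reach the `min_strength` floor, and the growth strategies are simply the decay weights reversed.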
--k-hidden-dim: Hidden layer size (default 128)
--k-num-layers: Number of layers (default 3)
--k-epochs: Training epochs (default 20)
--k-lr: Learning rate (default 0.001)
--k-batch-size: Training batch size (default 32)
--k-dropout: Dropout rate (default 0.1)
--k-alpha: Gradient step size for steering (default 0.01)
--enable-token-steering: Enable position-based steering
--token-steering-strategy: last_only, second_to_last, first_only, all_equal, exponential_decay, exponential_growth, linear_decay, linear_growth
--token-decay-rate: Decay rate for exponential strategies (default 0.5)
--token-min-strength: Minimum strength for decay strategies (default 0.1)
--token-max-strength: Maximum strength for growth strategies (default 1.0)

For the complete implementation of the K-Steering method in Wisent-Guard, see: