K-Steering

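K-Steering is a gradient-based steering method. It uses a trained neural network classifier to push activations away from negative patterns, and because the gradient is recomputed for each activation, the steering direction can differ from input to input instead of being one fixed vector.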

How K-Steering Works

Rather than computing a control vector directly, K-Steering trains a neural network classifier (3 layers by default) to distinguish positive from negative activation patterns. During training, the activations from each positive/negative pair are passed through the classifier, which is optimized with standard cross-entropy loss to predict whether an activation came from a positive or a negative example.
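The training step described above can be sketched in PyTorch. This is a minimal illustration, not the Wisent-Guard implementation: the synthetic activations, the layer layout, and the choice of Adam are assumptions, while the hyperparameter values mirror the defaults listed under Parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-ins for layer activations (assumption: real activations
# would be captured from the model at the layer chosen with --layer).
dim = 16
pos = torch.randn(64, dim) + 1.5   # activations from positive examples
neg = torch.randn(64, dim) - 1.5   # activations from negative examples
x = torch.cat([pos, neg])
y = torch.cat([torch.ones(64), torch.zeros(64)]).long()  # 1 = positive, 0 = negative

# Three linear layers with hidden size 128 and dropout 0.1, matching the
# documented defaults. The exact layout is an assumption.
classifier = nn.Sequential(
    nn.Linear(dim, 128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 2),
)

optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)  # optimizer is an assumption
loss_fn = nn.CrossEntropyLoss()

classifier.train()
for epoch in range(20):                        # 20 epochs
    perm = torch.randperm(x.shape[0])          # reshuffle every epoch
    for start in range(0, x.shape[0], 32):     # batches of 32
        idx = perm[start:start + 32]
        optimizer.zero_grad()
        loss = loss_fn(classifier(x[idx]), y[idx])
        loss.backward()
        optimizer.step()

# Evaluate on the training data with dropout disabled.
classifier.eval()
with torch.no_grad():
    accuracy = (classifier(x).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```

The two classes here are deliberately well separated so the sketch converges quickly; real activation pairs are usually harder to tell apart and may need more epochs or a different learning rate.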

At inference time, K-Steering computes the gradient of the classifier's loss with respect to the input activations and subtracts it, scaled by a step size α (set with --k-alpha): x' = x - α∇L(x). This pushes the activations away from patterns the classifier associates with negative examples.
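The update x - α∇L(x) can be sketched in PyTorch as follows. This is an illustrative sketch under stated assumptions: the function name k_steer is hypothetical, the classifier is an untrained stand-in without dropout, and the loss is taken to be cross-entropy against the positive label (index 1), so that stepping against the gradient moves the activation toward the positive class.

```python
import torch
from torch import nn

torch.manual_seed(0)

dim = 16
# Stand-in classifier; in practice this would be the trained network.
classifier = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 2))


def k_steer(activation: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Return activation - alpha * dL/d(activation) (hypothetical helper)."""
    # Generation code usually runs under torch.no_grad(), so gradients
    # are re-enabled locally for the steering step.
    with torch.enable_grad():
        x = activation.detach().clone().requires_grad_(True)
        # Assumption: the loss targets the "positive" class (index 1).
        target = torch.ones(x.shape[0], dtype=torch.long)
        # reduction="sum" keeps each row's gradient independent of batch size.
        loss = nn.functional.cross_entropy(classifier(x), target, reduction="sum")
        (grad,) = torch.autograd.grad(loss, x)
    return (x - alpha * grad).detach()


with torch.no_grad():
    hidden = torch.randn(4, dim)          # stand-in for 4 activation vectors
    steered = k_steer(hidden, alpha=0.01)

print(steered.shape)
```

A larger alpha moves the activation further per step, at the cost of leaving the region where the local gradient is a good guide.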

The method requires gradients to be enabled during inference and runs the classifier in training mode so that the gradients are meaningful. Like other methods, it targets the second-to-last token position. Instead of a single vector, K-Steering saves the entire trained classifier network, which can be written with --save-steering-vector and reloaded with --load-steering-vector.
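Persisting a classifier rather than a vector might look like the sketch below. The on-disk format Wisent-Guard actually writes is not described here, so saving a state_dict is an assumption, and build_classifier is a hypothetical helper; an in-memory buffer stands in for a file such as classification_k.pt.

```python
import io

import torch
from torch import nn


def build_classifier(dim: int = 16, hidden: int = 128) -> nn.Sequential:
    """Hypothetical helper: rebuild the architecture before loading weights."""
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 2),
    )


torch.manual_seed(0)
trained = build_classifier().eval()   # stand-in for a trained classifier

# Save only the weights; the architecture is recreated in code on load.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)

# Load into a freshly constructed network with the same shape.
buffer.seek(0)
restored = build_classifier()
restored.load_state_dict(torch.load(buffer))
restored.eval()

probe = torch.randn(3, 16)
with torch.no_grad():
    same = torch.allclose(trained(probe), restored(probe))
print("outputs match after reload:", same)
```

Because only weights are stored in this sketch, the architecture settings used at training time (hidden size, number of layers) have to match when the classifier is rebuilt.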

CLI Examples

# Basic K-Steering training

python -m wisent_guard.cli tasks classification_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 15 --save-steering-vector classification_k.pt

# K-Steering with custom network architecture

python -m wisent_guard.cli tasks detection_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 16 --k-hidden-dim 256 --k-num-layers 4 --save-steering-vector detection_k.pt

# K-Steering with specific training parameters

python -m wisent_guard.cli tasks analysis_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 14 --k-epochs 50 --k-lr 0.01 --k-batch-size 16 --save-steering-vector analysis_k.pt

# K-Steering inference with custom alpha

python -m wisent_guard.cli tasks evaluation_tasks.json --from-json --steering-mode --steering-method K-Steering --layer 15 --load-steering-vector classification_k.pt --k-alpha 0.05

# K-Steering with dropout for regularization

python -m wisent_guard.cli tasks robustness_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 13 --k-dropout 0.3 --k-epochs 100 --save-steering-vector robust_k.pt

# K-Steering on GPU with first token targeting

python -m wisent_guard.cli tasks attention_pairs.json --from-json --steering-mode --steering-method K-Steering --layer 18 --device cuda --enable-token-steering --token-steering-strategy first_only --save-steering-vector attention_k.pt

Parameters

K-Steering Specific Parameters

  • --k-hidden-dim: Hidden layer size (default 128)
  • --k-num-layers: Number of layers (default 3)
  • --k-epochs: Training epochs (default 20)
  • --k-lr: Learning rate (default 0.001)
  • --k-batch-size: Training batch size (default 32)
  • --k-dropout: Dropout rate (default 0.1)
  • --k-alpha: Gradient step size for steering (default 0.01)

Token Steering Parameters

  • --enable-token-steering: Enable position-based steering
  • --token-steering-strategy: last_only, second_to_last, first_only, all_equal, exponential_decay, exponential_growth, linear_decay, linear_growth
  • --token-decay-rate: Decay rate for exponential strategies (default 0.5)
  • --token-min-strength: Minimum strength for decay strategies (default 0.1)
  • --token-max-strength: Maximum strength for growth strategies (default 1.0)
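To make the strategy names concrete, the sketch below turns a few of them into per-position weights. The function token_weights and its formulas are hypothetical readings of the names and parameters above, not the Wisent-Guard implementation; in particular, the exponential_decay form (strength multiplied by the decay rate at each position, floored at the minimum strength) is an assumption.

```python
def token_weights(num_tokens, strategy, decay_rate=0.5, min_strength=0.1):
    """Per-position steering weights (hypothetical interpretation)."""
    if num_tokens <= 0:
        return []

    def one_hot(index):
        # Full strength at one position, zero everywhere else.
        return [1.0 if i == index else 0.0 for i in range(num_tokens)]

    if strategy == "last_only":
        return one_hot(num_tokens - 1)
    if strategy == "second_to_last":
        # Falls back to the only token when the sequence has length 1.
        return one_hot(max(num_tokens - 2, 0))
    if strategy == "first_only":
        return one_hot(0)
    if strategy == "all_equal":
        return [1.0] * num_tokens
    if strategy == "exponential_decay":
        # Assumed form: decay_rate ** position, never below min_strength.
        return [max(min_strength, decay_rate ** i) for i in range(num_tokens)]
    raise ValueError(f"strategy not covered by this sketch: {strategy}")


for name in ("second_to_last", "first_only", "exponential_decay"):
    print(name, token_weights(5, name))
```

Under this reading, each weight would multiply the steering update applied at the corresponding token position.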

Implementation Details

For the complete implementation of the K-Steering method in Wisent-Guard, see: