BiPO - Bi-directional Preference Optimization using gradient descent to train learnable steering vectors through preference loss functions.
BiPO (Bi-directional Preference Optimization) starts with a zero-initialized steering vector whose parameters are learnable (requires_grad=True) and trains it by gradient descent rather than by simple averaging. During training, positive and negative activation pairs are fed through a preference loss function that encourages the steering vector to increase the model's preference for positive examples over negative ones. The vector is updated through backpropagation for a specified number of epochs (default 100).
The loss function computes preference scores by taking dot products between activations and the steering vector, then uses a margin-based loss to optimize the vector. During inference, the trained vector is applied exactly like CAA by adding it to activations at the target layer with a strength multiplier, targeting the second-to-last token position.
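The training loop described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the Wisent-Guard implementation: the function name `train_steering_vector`, the hinge form of the margin loss, the choice of plain SGD, and the synthetic activations are all assumptions made for the example. Only the zero initialization, the dot-product scores, and the default values (100 epochs, learning rate 0.01, margin 1.0, weight decay 0.0) come from the text.

```python
import torch


def train_steering_vector(pos_acts, neg_acts, epochs=100, lr=0.01,
                          margin=1.0, weight_decay=0.0):
    """Hypothetical helper: learn a steering vector from activation pairs.

    pos_acts, neg_acts: tensors of shape (num_pairs, hidden_dim).
    """
    hidden_dim = pos_acts.shape[1]
    # Zero-initialized, learnable steering vector.
    vector = torch.zeros(hidden_dim, requires_grad=True)
    # Assumption: plain SGD; weight_decay acts as L2 regularization.
    optimizer = torch.optim.SGD([vector], lr=lr, weight_decay=weight_decay)

    for _ in range(epochs):
        optimizer.zero_grad()
        # Preference scores are dot products with the steering vector.
        pos_scores = pos_acts @ vector
        neg_scores = neg_acts @ vector
        # Assumed hinge-style margin loss: penalize pairs whose positive
        # score does not exceed the negative score by at least `margin`.
        loss = torch.relu(margin - (pos_scores - neg_scores)).mean()
        loss.backward()
        optimizer.step()

    return vector.detach()


# Synthetic stand-in for layer activations (illustrative only).
torch.manual_seed(0)
direction = torch.zeros(16)
direction[0] = 1.0
pos = torch.randn(32, 16) + 2.0 * direction
neg = torch.randn(32, 16) - 2.0 * direction

vec = train_steering_vector(pos, neg)
```

Because the vector starts at zero, every pair initially violates the margin, so the first gradient step moves the vector along the mean difference between positive and negative activations. Later steps only draw on pairs that still violate the margin, which is where a learned vector can differ from a simple average.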
This learned approach may capture more complex patterns than simple averaging, but training the vector takes additional compute time.
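The inference step described above, adding the trained vector to the target layer's activations at the second-to-last token position, can be sketched as follows. The function name `apply_steering`, the `(batch, seq_len, hidden_dim)` tensor layout, and the toy tensors are assumptions for illustration. In a real model this logic would typically run inside a forward hook on the chosen layer.

```python
import torch


def apply_steering(hidden_states, steering_vector, strength=1.0, position=-2):
    """Hypothetical helper: add a scaled steering vector at one token position.

    hidden_states: tensor of shape (batch, seq_len, hidden_dim).
    position=-2 targets the second-to-last token.
    """
    steered = hidden_states.clone()  # leave the input tensor untouched
    steered[:, position, :] += strength * steering_vector
    return steered


# Toy example: one sequence of 5 tokens with hidden size 4.
hidden = torch.zeros(1, 5, 4)
vec = torch.ones(4)
out = apply_steering(hidden, vec, strength=0.8)
```

Only the second-to-last position changes. The strength argument plays the role of the --steering-strength multiplier used in the commands below.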
python -m wisent_guard.cli tasks preference_pairs.json --from-json --steering-mode --steering-method BiPO --layer 15 --save-steering-vector preference_bipo.pt
python -m wisent_guard.cli tasks quality_pairs.json --from-json --steering-mode --steering-method BiPO --layer 14 --bipo-epochs 200 --bipo-lr 0.001 --save-steering-vector quality_bipo.pt
python -m wisent_guard.cli tasks style_pairs.json --from-json --steering-mode --steering-method BiPO --layer 16 --bipo-margin 0.5 --bipo-weight-decay 0.01 --save-steering-vector style_bipo.pt
python -m wisent_guard.cli tasks writing_tasks.json --from-json --steering-mode --steering-method BiPO --layer 15 --load-steering-vector preference_bipo.pt --steering-strength 0.8
python -m wisent_guard.cli tasks tone_pairs.json --from-json --steering-mode --steering-method BiPO --layer 12 --device cpu --bipo-epochs 50 --show-memory-usage --save-steering-vector tone_bipo.pt
python -m wisent_guard.cli tasks consistency_pairs.json --from-json --steering-mode --steering-method BiPO --layer 17 --enable-token-steering --token-steering-strategy all_equal --bipo-epochs 150 --save-steering-vector consistent_bipo.pt
--bipo-epochs: Training epochs (default 100)
--bipo-lr: Learning rate (default 0.01)
--bipo-margin: Preference margin (default 1.0)
--bipo-weight-decay: L2 regularization (default 0.0)
--enable-token-steering: Enable position-based steering
--token-steering-strategy: last_only, second_to_last, first_only, all_equal, exponential_decay, exponential_growth, linear_decay, linear_growth
--token-decay-rate: Decay rate for exponential strategies (default 0.5)
--token-min-strength: Minimum strength for decay strategies (default 0.1)
--token-max-strength: Maximum strength for growth strategies (default 1.0)

For the complete implementation of the BiPO steering method in Wisent-Guard, see: