BiPO

BiPO (Bi-directional Preference Optimization) trains a learnable steering vector by gradient descent on a preference loss.

How BiPO Works

BiPO (Bi-directional Preference Optimization) starts with a zero-initialized steering vector with learnable parameters (requires_grad=True) and trains it by gradient descent rather than by simple averaging. During training, positive and negative activation pairs are fed through a preference loss that pushes the steering vector to increase the model's preference for positive examples over negative ones. The vector is updated through backpropagation for a specified number of epochs (default 100).
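The training loop described above can be sketched in plain Python. This is a minimal illustration, not the Wisent-Guard implementation: it assumes a hinge-style margin loss on dot-product scores with an optional L2 penalty, computes the gradient by hand instead of using autograd, and the function name train_steering_vector and its helper are hypothetical. The defaults mirror the documented CLI defaults (100 epochs, learning rate 0.01, margin 1.0, weight decay 0.0).

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def train_steering_vector(
    positives: Sequence[Sequence[float]],
    negatives: Sequence[Sequence[float]],
    epochs: int = 100,
    lr: float = 0.01,
    margin: float = 1.0,
    weight_decay: float = 0.0,
) -> Vector:
    """Fit a steering vector with a hinge-style margin loss (assumed form)."""
    dim = len(positives[0])
    n = len(positives)
    vector = [0.0] * dim  # zero-initialized steering vector
    for _ in range(epochs):
        grad = [0.0] * dim
        for pos, neg in zip(positives, negatives):
            # Preference gap: score of the positive minus score of the negative.
            gap = dot(pos, vector) - dot(neg, vector)
            if gap < margin:
                # Hinge is active: loss = margin - gap, so d(loss)/dv = -(pos - neg).
                for i in range(dim):
                    grad[i] -= (pos[i] - neg[i]) / n
        for i in range(dim):
            grad[i] += 2.0 * weight_decay * vector[i]  # gradient of the L2 penalty
            vector[i] -= lr * grad[i]  # plain gradient descent step
    return vector


# Toy activations: positives lean on dimension 0, negatives on dimension 1.
positives = [[1.0, 0.0], [0.8, 0.2]]
negatives = [[0.0, 1.0], [0.1, 0.9]]
vector = train_steering_vector(positives, negatives)
```

On this toy data the learned vector points toward dimension 0 and away from dimension 1, so every positive example scores higher than its paired negative. Once a pair clears the margin it stops contributing to the gradient, which is what distinguishes this from taking a simple mean difference.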

The loss function computes preference scores as dot products between the activations and the steering vector, then applies a margin-based loss to optimize the vector. At inference, the trained vector is applied exactly as in CAA: it is added to the activations at the target layer, scaled by a strength multiplier, at the second-to-last token position.
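The inference step can be illustrated with a small sketch. It assumes the layer's activations are given as one list of floats per token; the function name apply_steering is hypothetical, and leaving sequences shorter than two tokens unchanged is an assumption made here, not documented behavior.

```python
from typing import List, Sequence


def apply_steering(
    activations: Sequence[Sequence[float]],
    steering_vector: Sequence[float],
    strength: float = 1.0,
) -> List[List[float]]:
    """Add strength * steering_vector at the second-to-last token position."""
    steered = [list(token) for token in activations]  # copy; the input is not modified
    if len(steered) < 2:
        return steered  # assumption: nothing to steer without a second-to-last token
    target = steered[-2]
    for i, value in enumerate(steering_vector):
        target[i] += strength * value
    return steered


# Three tokens, hidden size 2; only the middle (second-to-last) token changes.
layer_output = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
steered = apply_steering(layer_output, [1.0, -1.0], strength=0.5)
```

The strength argument plays the role of --steering-strength: a value of 0 leaves the activations unchanged, and larger values push the selected token further along the learned direction.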

Because the vector is learned rather than averaged, it can potentially capture more complex patterns, at the cost of additional training time.

CLI Examples

# Basic BiPO training with default epochs

python -m wisent_guard.cli tasks preference_pairs.json --from-json --steering-mode --steering-method BiPO --layer 15 --save-steering-vector preference_bipo.pt

# BiPO with custom training parameters

python -m wisent_guard.cli tasks quality_pairs.json --from-json --steering-mode --steering-method BiPO --layer 14 --bipo-epochs 200 --bipo-lr 0.001 --save-steering-vector quality_bipo.pt

# BiPO with different margin and regularization

python -m wisent_guard.cli tasks style_pairs.json --from-json --steering-mode --steering-method BiPO --layer 16 --bipo-margin 0.5 --bipo-weight-decay 0.01 --save-steering-vector style_bipo.pt

# BiPO inference with moderate strength

python -m wisent_guard.cli tasks writing_tasks.json --from-json --steering-mode --steering-method BiPO --layer 15 --load-steering-vector preference_bipo.pt --steering-strength 0.8

# BiPO training on CPU with memory monitoring

python -m wisent_guard.cli tasks tone_pairs.json --from-json --steering-mode --steering-method BiPO --layer 12 --device cpu --bipo-epochs 50 --show-memory-usage --save-steering-vector tone_bipo.pt

# BiPO with all_equal token steering

python -m wisent_guard.cli tasks consistency_pairs.json --from-json --steering-mode --steering-method BiPO --layer 17 --enable-token-steering --token-steering-strategy all_equal --bipo-epochs 150 --save-steering-vector consistent_bipo.pt

Parameters

BiPO Specific Parameters

--bipo-epochs: Training epochs (default 100)
--bipo-lr: Learning rate (default 0.01)
--bipo-margin: Preference margin (default 1.0)
--bipo-weight-decay: L2 regularization (default 0.0)

Token Steering Parameters

--enable-token-steering: Enable position-based steering
--token-steering-strategy: Position weighting strategy, one of last_only, second_to_last, first_only, all_equal, exponential_decay, exponential_growth, linear_decay, linear_growth
--token-decay-rate: Decay rate for exponential strategies (default 0.5)
--token-min-strength: Minimum strength for decay strategies (default 0.1)
--token-max-strength: Maximum strength for growth strategies (default 1.0)
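
One way to picture the token steering strategies is as a list of per-position weights that scale the steering strength. The sketch below is a guess at plausible semantics, not the Wisent-Guard implementation: the function name token_weights is hypothetical, and the exact formulas (decay running from the first token to the last, growth as its mirror image, min_strength used as a floor) are assumptions.

```python
from typing import List


def token_weights(
    strategy: str,
    seq_len: int,
    decay_rate: float = 0.5,
    min_strength: float = 0.1,
    max_strength: float = 1.0,
) -> List[float]:
    """Return one weight per token position for the named strategy (assumed semantics)."""
    if seq_len <= 0:
        return []
    last = seq_len - 1
    if strategy == "all_equal":
        return [1.0] * seq_len
    if strategy in ("last_only", "second_to_last", "first_only"):
        index = {
            "last_only": last,
            "second_to_last": max(last - 1, 0),  # falls back to position 0 for one token
            "first_only": 0,
        }[strategy]
        return [1.0 if i == index else 0.0 for i in range(seq_len)]
    if strategy in ("exponential_decay", "exponential_growth"):
        # Start at max_strength, shrink by decay_rate per position, never below min_strength.
        weights = [max(min_strength, max_strength * decay_rate ** i) for i in range(seq_len)]
    elif strategy in ("linear_decay", "linear_growth"):
        # Straight line from max_strength down to min_strength.
        step = (max_strength - min_strength) / last if last else 0.0
        weights = [max_strength - step * i for i in range(seq_len)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy.endswith("growth"):
        weights.reverse()  # growth is treated as the mirror image of decay
    return weights


weights = token_weights("exponential_decay", 4)
```

Under these assumptions, the steering vector added at position i would be scaled by the overall strength multiplied by weights[i], so all_equal steers every token identically while last_only and second_to_last touch a single position.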

Implementation Details

For the complete implementation of the BiPO steering method in Wisent-Guard, see:

bipo.py, Original Paper, Original Implementation
