Token Steered

Token Steered - Position-based steering control that applies different intervention strengths to specific token positions during generation.

How Token Steering Works

Token steering addresses a fundamental challenge in activation steering: as models generate longer sequences, the effects of steering interventions can compound and distort the intended behavior. Traditional steering applies the same intervention strength at every token position, which can lead to over-steering in longer responses.

Token steering solves this by giving fine-grained control over when and how strongly steering is applied. You can target specific token positions (such as only the first token or only the second-to-last token), apply different strengths to different positions, or use decay and growth patterns that adjust the intervention strength automatically as the sequence gets longer.

This approach enables more precise control over model behavior while preventing the accumulation effects that can make longer generations unstable or overly influenced by the steering vector. Token steering works with any underlying steering method (CAA, BiPO, DAC, HPR, K-Steering) as an additional layer of control.
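The idea of position-dependent strength can be illustrated with a minimal sketch. This is not the code in token_steered.py: the function name apply_token_steering, the base_strength argument, and the use of plain Python lists in place of model activations are all assumptions made for illustration.

```python
def apply_token_steering(hidden_states, steering_vector, weights, base_strength=1.0):
    """Add a position-weighted steering vector to each token's hidden state.

    hidden_states:   one activation vector per token position (list of lists)
    steering_vector: a single vector with the same dimensionality
    weights:         one multiplier per token position
    All names here are hypothetical, chosen only for this sketch.
    """
    if len(hidden_states) != len(weights):
        raise ValueError("need exactly one weight per token position")

    steered = []
    for state, weight in zip(hidden_states, weights):
        scale = base_strength * weight  # per-position intervention strength
        steered.append([h + scale * v for h, v in zip(state, steering_vector)])
    return steered


hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # three token positions, toy values
vector = [1.0, -1.0]

# Traditional steering: the same strength at every position.
uniform = apply_token_steering(hidden, vector, [1.0, 1.0, 1.0])

# Token steering: only the first position is modified.
first_only = apply_token_steering(hidden, vector, [1.0, 0.0, 0.0])
```

In this sketch the underlying steering method only supplies the vector; the per-position weights are a separate layer on top, which is why the same weighting scheme could sit on any method that produces a steering vector.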

CLI Examples

# Apply steering only to the first token

python -m wisent_guard.cli tasks questions.json --from-json --steering-mode --steering-method CAA --layer 15 --load-steering-vector honesty.pt --enable-token-steering --token-steering-strategy first_only

# Apply steering only to the second-to-last token position

python -m wisent_guard.cli tasks responses.json --from-json --steering-mode --steering-method BiPO --layer 14 --load-steering-vector style.pt --enable-token-steering --token-steering-strategy second_to_last

# Apply equal steering strength to all token positions

python -m wisent_guard.cli tasks dialogue.json --from-json --steering-mode --steering-method DAC --layer 16 --load-steering-vector empathy.pt --enable-token-steering --token-steering-strategy all_equal

# Exponential decay - strong at start, weaker over time

python -m wisent_guard.cli tasks stories.json --from-json --steering-mode --steering-method HPR --layer 15 --load-steering-vector creativity.pt --enable-token-steering --token-steering-strategy exponential_decay --token-decay-rate 0.8

# Linear growth - weak at start, stronger over time

python -m wisent_guard.cli tasks analysis.json --from-json --steering-mode --steering-method K-Steering --layer 17 --load-steering-vector logic.pt --enable-token-steering --token-steering-strategy linear_growth --token-max-strength 2.0

# Custom decay with minimum strength threshold

python -m wisent_guard.cli tasks conversations.json --from-json --steering-mode --steering-method CAA --layer 13 --load-steering-vector politeness.pt --enable-token-steering --token-steering-strategy exponential_decay --token-decay-rate 0.6 --token-min-strength 0.2

Token Steering Strategies

Position-Based Strategies

first_only: Apply steering only to the first generated token
last_only: Apply steering only to the last token position
second_to_last: Apply steering only to the second-to-last token position (the most commonly used strategy)
all_equal: Apply equal strength to all positions
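The position-based strategies above can be read as masks over token positions. The sketch below assumes a fixed-length sequence and a hypothetical helper named position_weights; how the library resolves "last" and "second-to-last" while generation is still in progress is not specified here, and the single-token fallback is likewise an assumption.

```python
def position_weights(strategy, num_tokens):
    """Return one weight per position for a position-based strategy.

    Hypothetical helper for illustration; not the wisent_guard API.
    """
    if num_tokens <= 0:
        return []

    weights = [0.0] * num_tokens
    if strategy == "first_only":
        weights[0] = 1.0
    elif strategy == "last_only":
        weights[-1] = 1.0
    elif strategy == "second_to_last":
        # Assumption: with a single token, fall back to that token.
        weights[-2 if num_tokens >= 2 else -1] = 1.0
    elif strategy == "all_equal":
        weights = [1.0] * num_tokens
    else:
        raise ValueError(f"unknown position-based strategy: {strategy}")
    return weights


for name in ("first_only", "last_only", "second_to_last", "all_equal"):
    print(name, position_weights(name, 4))
```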

Dynamic Strategies

exponential_decay: Strong initial steering that weakens exponentially over the sequence
linear_decay: Steering strength that decreases linearly over the sequence
exponential_growth: Weak initial steering that strengthens exponentially over the sequence
linear_growth: Steering strength that increases linearly over the sequence
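One plausible way to turn the dynamic strategies into per-position weights is sketched below. The exact formulas used by the library are not given in this page, so everything here is an assumption: the helper name dynamic_weights, treating the decay rate as the fraction of strength kept per token, interpolating linear strategies between the minimum and maximum strength, and defining exponential growth as the decay curve reversed in time.

```python
def dynamic_weights(strategy, num_tokens, decay_rate=0.5,
                    min_strength=0.1, max_strength=1.0):
    """Return one weight per position for a dynamic strategy (illustrative only)."""
    if num_tokens <= 0:
        return []

    # Exponential decay: multiply by decay_rate at each step, floored at min_strength.
    decay = [max(min_strength, max_strength * decay_rate ** t)
             for t in range(num_tokens)]

    def progress(t):
        # Fraction of the sequence covered so far, 0.0 at the start, 1.0 at the end.
        return t / (num_tokens - 1) if num_tokens > 1 else 0.0

    if strategy == "exponential_decay":
        return decay
    if strategy == "exponential_growth":
        return decay[::-1]  # assumption: mirror image of the decay curve
    if strategy == "linear_decay":
        return [max_strength + (min_strength - max_strength) * progress(t)
                for t in range(num_tokens)]
    if strategy == "linear_growth":
        return [min_strength + (max_strength - min_strength) * progress(t)
                for t in range(num_tokens)]
    raise ValueError(f"unknown dynamic strategy: {strategy}")


print(dynamic_weights("exponential_decay", 5))
print(dynamic_weights("linear_growth", 3, min_strength=0.0, max_strength=2.0))
```

Under these assumptions the minimum strength acts as a floor, so a decaying schedule never switches steering off entirely in long generations.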

Parameters

Token Steering Parameters

--enable-token-steering: Enable position-based steering control
--token-steering-strategy: Strategy for applying steering (first_only, last_only, second_to_last, all_equal, exponential_decay, linear_decay, exponential_growth, linear_growth)
--token-decay-rate: Rate of decay for exponential strategies (0.0-1.0, default 0.5)
--token-min-strength: Minimum strength threshold for decay strategies (default 0.1)
--token-max-strength: Maximum strength ceiling for growth strategies (default 1.0)
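For readers who want to experiment with these flags in their own scripts, the parameter list above can be mirrored with argparse. This is a stand-alone sketch, not the parser used by wisent_guard.cli: the build_parser name is hypothetical, and the strategy default of None is an assumption because no default strategy is documented here. The numeric defaults are the ones listed above.

```python
import argparse

STRATEGIES = [
    "first_only", "last_only", "second_to_last", "all_equal",
    "exponential_decay", "linear_decay", "exponential_growth", "linear_growth",
]


def build_parser():
    """Build a parser that mirrors the documented token steering flags."""
    parser = argparse.ArgumentParser(description="Token steering flags (sketch)")
    parser.add_argument("--enable-token-steering", action="store_true",
                        help="Enable position-based steering control")
    parser.add_argument("--token-steering-strategy", choices=STRATEGIES,
                        default=None,  # assumption: no documented default
                        help="Strategy for applying steering")
    parser.add_argument("--token-decay-rate", type=float, default=0.5,
                        help="Rate of decay for exponential strategies (0.0-1.0)")
    parser.add_argument("--token-min-strength", type=float, default=0.1,
                        help="Minimum strength threshold for decay strategies")
    parser.add_argument("--token-max-strength", type=float, default=1.0,
                        help="Maximum strength ceiling for growth strategies")
    return parser


args = build_parser().parse_args([
    "--enable-token-steering",
    "--token-steering-strategy", "exponential_decay",
    "--token-decay-rate", "0.6",
    "--token-min-strength", "0.2",
])
print(args)
```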

Implementation Details

For the complete implementation of the Token Steered method, explore the source code:

View token_steered.py on GitHub
