The MLP method derives a steering direction from the gradients of a trained classifier. The classifier captures non-linear boundaries between activations, yet the result is still a practical linear steering vector.

How MLP Works

The MLP method trains a multilayer perceptron (MLP) as a classifier that distinguishes high from low activity levels. Whereas HYPERPLANE relies on a linear classifier, an MLP learns non-linear decision boundaries and can therefore pick up finer distinctions in the activations.

The key insight is that even when the decision boundary is non-linear, meaningful steering directions still exist. The gradient of the classifier score with respect to the activations gives the direction that changes the classification most. Averaging these gradients, weighted over the negative examples, then yields a robust direction pointing toward the positive class.
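The gradient-averaging idea can be sketched in a few lines of plain Python. This is a minimal illustration, not Wisent's implementation: it assumes a one-hidden-layer GELU network with fixed, hypothetical weights (no training and no layer normalization), computes the gradient of the score analytically, and averages it over the negative examples with optional weights. All function and variable names are made up for the example.

```python
import math

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_grad(z):
    # d/dz [z * Phi(z)] = Phi(z) + z * phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

def pre_activations(x, W1, b1):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]

def score(x, W1, b1, w2, b2):
    # Classifier logit: higher means "more positive".
    return sum(w * gelu(z) for w, z in zip(w2, pre_activations(x, W1, b1))) + b2

def score_grad(x, W1, b1, w2):
    # Gradient of the logit with respect to the input activation x.
    pre = pre_activations(x, W1, b1)
    return [
        sum(w2[k] * gelu_grad(pre[k]) * W1[k][j] for k in range(len(W1)))
        for j in range(len(x))
    ]

def steering_direction(negatives, W1, b1, w2, weights=None, normalize=True):
    # Weighted average of score gradients evaluated at the negative examples.
    if weights is None:
        weights = [1.0] * len(negatives)
    total = sum(weights)
    avg = [0.0] * len(negatives[0])
    for x, wt in zip(negatives, weights):
        for j, g in enumerate(score_grad(x, W1, b1, w2)):
            avg[j] += wt * g / total
    if normalize:
        norm = math.sqrt(sum(v * v for v in avg))
        if norm > 0:
            avg = [v / norm for v in avg]
    return avg

# Hypothetical "trained" weights and negative activations (2-D for readability).
W1 = [[1.0, -0.5], [0.3, 0.8], [-0.7, 0.2]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.9, 0.6, -0.4]
b2 = 0.0
negatives = [[-1.0, -0.5], [-0.8, 0.1], [-1.2, -0.9]]

direction = steering_direction(negatives, W1, b1, w2)
print(direction)
```

Nudging each negative activation a small step along `direction` raises the average classifier score, which is what "pointing toward the positive class" means here. The `normalize` argument plays the same role as an L2-normalization step on the final vector.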

The network uses layer normalization and GELU activations. It is trained with the AdamW optimizer and a cosine learning-rate schedule, and early stopping prevents overfitting.
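To make the schedule and stopping rule concrete, here is a small stdlib-only sketch. It is an assumption-laden illustration rather than the actual training loop: the cosine formula is the standard annealing form, the `patience` value and the simulated validation losses are invented, and the helper names `cosine_lr` and `EarlyStopping` are hypothetical. The epoch count and base learning rate match the documented defaults (100 and 0.001).

```python
import math

def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate for a 0-indexed epoch."""
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class EarlyStopping:
    """Stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses stand in for a real training loop.
val_losses = [0.70, 0.55, 0.48, 0.47, 0.49, 0.50, 0.51, 0.52]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    lr = cosine_lr(epoch, total_epochs=100, base_lr=0.001)
    # One epoch of AdamW updates at learning rate `lr` would run here.
    if stopper.step(loss):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "best val loss", stopper.best)
```

In this toy run the loss stops improving after epoch 3, so with a patience of 3 the loop ends at epoch 6, well before the maximum epoch count.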

When to Use MLP

Non-linear Boundaries: When positive and negative activations aren't linearly separable
Complex Patterns: When the behavior involves multiple interacting factors
More Data Available: MLP benefits from larger datasets compared to CAA/HYPERPLANE
Gradient-based Direction: When you want a direction derived from optimization rather than statistics

Works: MLP learns nonlinear decision boundaries, handling cases where a⁻ is surrounded by a⁺.

Fails: When activations have no separable structure, no classifier can learn a meaningful boundary.
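The "surrounded" case can be illustrated with toy data. The sketch below assumes that a statistics-based method such as CAA reduces to a difference of class means (an assumption about those methods, not something stated on this page) and uses invented 2-D points: negatives near the origin, positives on a ring around them.

```python
import math

# Hypothetical 2-D activations: positives on a ring, negatives clustered inside it.
positives = [
    [2.0 * math.cos(2 * math.pi * k / 8), 2.0 * math.sin(2 * math.pi * k / 8)]
    for k in range(8)
]
negatives = [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]]

def mean(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

# Difference of class means: both means sit at the origin, so it vanishes.
mean_diff = [p - n for p, n in zip(mean(positives), mean(negatives))]
mean_diff_norm = math.sqrt(sum(v * v for v in mean_diff))

def radius_sq(p):
    # A simple non-linear feature: squared distance from the origin.
    return p[0] ** 2 + p[1] ** 2

# The non-linear feature separates the classes cleanly.
separable = min(radius_sq(p) for p in positives) > max(radius_sq(n) for n in negatives)

print("mean-difference norm:", mean_diff_norm)
print("separable by radius:", separable)
```

The mean-difference vector is numerically zero even though the classes are perfectly separable by a non-linear feature, which is the kind of structure a non-linear classifier can exploit and a linear one cannot.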

CLI Examples

Basic MLP training

python -m wisent.cli tasks honesty_pairs.json --from-json --steering-mode --steering-method MLP --layer 15 --save-steering-vector honesty_mlp.pt

MLP with custom architecture

python -m wisent.cli tasks safety_pairs.json --from-json --steering-mode --steering-method MLP --layer 12 --mlp-hidden-dim 512 --mlp-num-layers 3 --mlp-dropout 0.2 --save-steering-vector safety_mlp.pt

MLP with training parameters

python -m wisent.cli tasks refusal_pairs.json --from-json --steering-mode --steering-method MLP --layer 18 --mlp-epochs 200 --mlp-learning-rate 0.0005 --mlp-weight-decay 0.02 --save-steering-vector refusal_mlp.pt

MLP inference using saved vector

python -m wisent.cli tasks test_questions.json --from-json --steering-mode --steering-method MLP --layer 15 --load-steering-vector honesty_mlp.pt --steering-strength 1.5

Parameters

MLP Architecture Parameters

--mlp-hidden-dim

Hidden layer dimension (default: min(256, hidden_dim // 4))

--mlp-num-layers

Number of hidden layers (default: 2)

--mlp-dropout

Dropout rate for regularization (default: 0.1)
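As a quick worked example of the documented default for --mlp-hidden-dim, the snippet below evaluates min(256, hidden_dim // 4) for a few sizes. It assumes that hidden_dim refers to the width of the activations being classified. The function name is hypothetical and the sizes are arbitrary.

```python
def default_mlp_hidden_dim(hidden_dim):
    """Documented default: min(256, hidden_dim // 4)."""
    return min(256, hidden_dim // 4)

for model_dim in (512, 768, 4096):
    # 512 -> 128, 768 -> 192, 4096 -> 256 (capped)
    print(model_dim, default_mlp_hidden_dim(model_dim))
```

The cap means the default never exceeds 256 units; for wider activations, pass --mlp-hidden-dim explicitly if a larger classifier is wanted.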

MLP Training Parameters

--mlp-epochs

Maximum training epochs (default: 100)

--mlp-learning-rate

Learning rate for AdamW optimizer (default: 0.001)

--mlp-weight-decay

Weight decay for regularization (default: 0.01)

--normalize

L2-normalize the resulting vector (default: true)

Common Steering Parameters

--layer

Layer index to apply steering

--steering-strength

Magnitude of steering effect during inference

--save-steering-vector

Path to save the trained steering vector

--load-steering-vector

Path to load a pre-trained steering vector
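For intuition about what --steering-strength scales, here is a sketch that assumes the common additive form of activation steering, where the vector is multiplied by the strength and added to the activation at the chosen layer. Whether Wisent applies the vector exactly this way is not stated here, so treat the function below as a hypothetical illustration.

```python
def apply_steering(activation, steering_vector, strength=1.0):
    # Assumed additive steering: h' = h + strength * v
    return [a + strength * v for a, v in zip(activation, steering_vector)]

# Toy values; real activations have the model's hidden size.
activation = [1.0, 2.0]
steering_vector = [0.5, -0.25]
steered = apply_steering(activation, steering_vector, strength=1.5)
print(steered)
```

Under this assumption, a strength of 0 leaves the activation unchanged and larger values push it further along the steering direction.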

For the complete implementation of the MLP steering method in Wisent, see:

View mlp.py on GitHub
