MLP

An MLP derives a steering direction from the gradients of a trained neural-network classifier: the classifier learns to separate positive from negative activations, and its gradients point toward the positive region, yielding a practical linear steering vector even when the decision boundary itself is non-linear.

How MLP Works

MLP trains a multilayer perceptron, a small neural network, as a classifier that distinguishes positive from negative activations. Whereas HYPERPLANE fits a linear classifier, MLP learns non-linear decision boundaries and can therefore capture subtler distinctions in activation space.

The key insight is that a useful steering direction can exist even when the decision boundary is not linear. The gradient of the classifier's score with respect to an activation is the direction that most increases that score; averaging these gradients over the negative examples yields a robust direction pointing toward activations the classifier confidently labels positive.
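
To make the extraction concrete, here is a minimal PyTorch sketch, assuming a trained classifier mlp that maps activations to a single positive-class logit (the function name and the unweighted average over negatives are illustrative simplifications, not Wisent's exact code):

import torch

def steering_direction(mlp, negative_acts):
    # Gradient of the positive-class score w.r.t. each negative activation:
    # the direction that most increases the classifier's score.
    acts = negative_acts.detach().clone().requires_grad_(True)  # (N, hidden_dim)
    score = mlp(acts).sum()                                     # sum of per-example logits
    grads = torch.autograd.grad(score, acts)[0]                 # (N, hidden_dim) ascent directions
    direction = grads.mean(dim=0)                               # average over negatives
    return direction / direction.norm()                         # L2-normalize (see --normalize)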

Training uses layer normalization and GELU activations, optimizes with AdamW under a cosine learning-rate schedule, and applies early stopping to prevent overfitting.
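
As a rough sketch of that setup, assuming PyTorch (make_mlp and train are hypothetical names, and the layer order mirrors the parameter defaults below rather than Wisent's internal code):

import torch
from torch import nn

def make_mlp(input_dim, hidden=256, num_layers=2, dropout=0.1):
    layers, d = [], input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(d, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout)]
        d = hidden
    layers.append(nn.Linear(d, 1))  # single logit: positive vs. negative activation
    return nn.Sequential(*layers)

def train(mlp, acts, labels, epochs=100, lr=1e-3, weight_decay=0.01, patience=10):
    opt = torch.optim.AdamW(mlp.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.BCEWithLogitsLoss()
    best, stale = float("inf"), 0
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(mlp(acts).squeeze(-1), labels.float())  # full-batch for brevity
        loss.backward()
        opt.step()
        sched.step()
        # Early stopping on the loss (a real run would monitor a held-out split).
        if loss.item() < best - 1e-4:
            best, stale = loss.item(), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return mlp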

When to Use MLP

  • Non-linear Boundaries: When positive and negative activations aren't linearly separable
  • Complex Patterns: When the behavior involves multiple interacting factors
  • More Data Available: MLP benefits from larger datasets compared to CAA/HYPERPLANE
  • Gradient-based Direction: When you want a direction derived from optimization rather than statistics
MLP works with nonlinear boundaries

Works: MLP learns nonlinear decision boundaries, handling cases where a⁻ is surrounded by a⁺.

MLP fails with no structure

Fails: When activations have no separable structure, no classifier can learn a meaningful boundary.
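
A toy illustration of the contrast above, assuming 2-D activations for readability: negatives cluster inside a ring of positives, so no hyperplane separates them and the CAA-style mean difference is nearly zero, yet an MLP classifier can still fit the boundary.

import torch

torch.manual_seed(0)
theta = torch.rand(500) * 2 * torch.pi
pos = 3.0 * torch.stack([theta.cos(), theta.sin()], dim=1)  # a⁺ on a ring
neg = 0.5 * torch.randn(500, 2)                             # a⁻ clustered inside it
print((pos.mean(0) - neg.mean(0)).norm())  # ~0: mean-difference carries no signal here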

CLI Examples

Basic MLP training
python -m wisent.cli tasks honesty_pairs.json --from-json --steering-mode --steering-method MLP --layer 15 --save-steering-vector honesty_mlp.pt
MLP with custom architecture
python -m wisent.cli tasks safety_pairs.json --from-json --steering-mode --steering-method MLP --layer 12 --mlp-hidden-dim 512 --mlp-num-layers 3 --mlp-dropout 0.2 --save-steering-vector safety_mlp.pt
MLP with training parameters
python -m wisent.cli tasks refusal_pairs.json --from-json --steering-mode --steering-method MLP --layer 18 --mlp-epochs 200 --mlp-learning-rate 0.0005 --mlp-weight-decay 0.02 --save-steering-vector refusal_mlp.pt
MLP inference using saved vector
python -m wisent.cli tasks test_questions.json --from-json --steering-mode --steering-method MLP --layer 15 --load-steering-vector honesty_mlp.pt --steering-strength 1.5
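
Conceptually, inference adds the loaded vector, scaled by --steering-strength, to the hidden state at --layer. A minimal sketch with a PyTorch forward hook; the additive rule, the module path, and the assumption that the saved .pt holds a plain tensor are guesses about Wisent's behavior, not its documented API:

import torch

vector = torch.load("honesty_mlp.pt")  # saved earlier via --save-steering-vector

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + 1.5 * vector.to(hidden.dtype)  # --steering-strength 1.5
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# For an HF-style decoder: model.model.layers[15].register_forward_hook(steer)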

Parameters

MLP Architecture Parameters

--mlp-hidden-dim
Hidden layer dimension (default: min(256, hidden_dim // 4))
--mlp-num-layers
Number of hidden layers (default: 2)
--mlp-dropout
Dropout rate for regularization (default: 0.1)

MLP Training Parameters

--mlp-epochs
Maximum training epochs (default: 100)
--mlp-learning-rate
Learning rate for AdamW optimizer (default: 0.001)
--mlp-weight-decay
Weight decay for regularization (default: 0.01)
--normalize
L2-normalize the resulting vector (default: true)

Common Steering Parameters

--layer
Layer index to apply steering
--steering-strength
Magnitude of steering effect during inference
--save-steering-vector
Path to save the trained steering vector
--load-steering-vector
Path to load a pre-trained steering vector

For the complete implementation of the MLP steering method in Wisent, see:
