A Layer is a single processing block in a transformer model that updates the residual stream. It typically consists of an attention mechanism and an MLP (feedforward network).
Model parameters are organized into layers. As information flows from the input to the output, it passes through the layers in sequence. For example, Llama 3.1-8B Instruct has 32 layers.
Each layer processes the information it receives from the previous layer and passes the result to the next layer. Think of it like an assembly line: each stage performs a specific transformation on the data before handing it off to the next.
In transformer models like those supported by Wisent, each layer typically contains attention mechanisms (which help the model focus on relevant parts of the input) and feedforward networks (which process and transform the information).
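As a quick check, the layer count is exposed on the model config. A minimal sketch, assuming the Hugging Face transformers library and access to the gated Llama repo (any causal LM you have access to works the same way):

from transformers import AutoConfig

# Reads only the architecture metadata (config.json), not the weights
cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(cfg.num_hidden_layers)  # 32 for Llama 3.1-8B Instruct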

Layer depth guide: 0-25% (early), 25-75% (optimal for most tasks), 75-100% (late).
We recommend using automated layer optimization for most use cases. The examples below cover the supported ways of specifying layers:
# Automatic layer selection (recommended)
python -m wisent.cli tasks mmlu --layer -1 --model meta-llama/Llama-3.1-8B-Instruct --limit 10

# Single layer
python -m wisent.cli tasks mmlu --layer 15 --model meta-llama/Llama-3.1-8B-Instruct --limit 10

# Multiple specific layers (comma-separated)
python -m wisent.cli tasks hellaswag --layer 10,15,20 --model meta-llama/Llama-3.1-8B-Instruct --limit 10

# Layer range
python -m wisent.cli tasks truthfulqa --layer 14-16 --model meta-llama/Llama-3.1-8B-Instruct --limit 10
Different layers show distinct activation profiles: early layers concentrate on syntax, while deeper layers capture semantics and reasoning.
Middle layers typically contain the richest representations for most tasks, balancing low-level features against high-level abstractions.
Different architectures and model sizes place their most informative layers in different locations, so experimentation is necessary to find the best layers.
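One way to experiment is to pull out per-layer hidden states and compare them directly. A minimal sketch, assuming the Hugging Face transformers API (the model name and prompt are placeholders; a smaller open model works the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states holds the embedding output plus one tensor per layer,
# so index i is the residual stream after layer i
for i, h in enumerate(outputs.hidden_states):
    print(f"layer {i}: activation norm {h.float().norm().item():.1f}")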
- Layer 15 is often a good starting point for 32-layer models.
- Test the middle 30-50% of your model's layers to find optimal performance (see the sweep example after this list).
- Larger models may need deeper layers for best results.
- Different tasks may benefit from different layer choices.
- Testing more layers increases processing overhead.
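Putting that guidance into practice with the range syntax shown above, a sweep of a middle band of layers might look like this (the band 12-20 is an illustrative choice for a 32-layer model, not a Wisent default):

python -m wisent.cli tasks mmlu --layer 12-20 --model meta-llama/Llama-3.1-8B-Instruct --limit 10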