A Model is a set of weights used to generate responses. At the moment, Wisent only works with open-source large language models. Each model has special tokens that mark the beginning of the user query and of the model's response.
The model's parameters are organized into layers, and each model has a fixed number of them.
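As a quick illustration, the layer count can be read from a model's configuration without downloading its weights. This is a minimal sketch; the model name is an illustrative example:

from transformers import AutoConfig

# Read structural details from the config alone; no weights are downloaded.
# "Qwen/Qwen2.5-7B-Instruct" is an illustrative example, not a requirement.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(f"Transformer layers: {config.num_hidden_layers}")
print(f"Hidden size: {config.hidden_size}")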
Parameters are the numerical values (weights and biases) that the model learned during training. They determine how the model processes information and generates responses.
Special tokens are specific text markers that models use to understand conversation structure and roles. Each model family uses different tokens to identify who is speaking, for example:
- <|user|> and <|assistant|>
- <|im_start|>user and <|im_start|>assistant
- [INST] and [/INST]
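In practice you rarely insert these markers by hand: HuggingFace tokenizers ship a chat template that adds the right tokens for the model family. A minimal sketch (the model name is an illustrative example):

from transformers import AutoTokenizer

# The chat template inserts the model family's special tokens automatically.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "What is representation engineering?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the <|im_start|>user ... <|im_start|>assistant markers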
Open-source models are models with publicly available weights that can be downloaded, inspected, and modified. Unlike proprietary models (GPT-4, Claude), they give you full access to the model's internals.
Wisent-Guard is optimized for models hosted on HuggingFace. However, you can also adapt the existing code to load an internal model, or a model in any other format, by modifying the model.py file to load it into the existing Wisent-Guard pipeline (a sketch follows the example below).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a model and tokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # place weights across available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Model characteristics
print(f"Model parameters: {model.num_parameters():,}")
print(f"Special tokens: {tokenizer.special_tokens_map}")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
User tags are special tokens that mark the beginning of user input in conversations. Different models use different tag formats, and specifying the correct tags is crucial for proper activation extraction.
- <|user|> - LLaMA models
- <|im_start|>user - Qwen models
- [INST] - Mistral models

Note: You'll need to configure these manually.
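A minimal sketch of what manual configuration might look like (the mapping and helper below are illustrative, not part of the Wisent-Guard API):

# Illustrative mapping of model families to their conversation tags.
USER_TAGS = {
    "llama": {"user": "<|user|>", "assistant": "<|assistant|>"},
    "qwen": {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"},
    "mistral": {"user": "[INST]", "assistant": "[/INST]"},
}

def format_prompt(family: str, question: str) -> str:
    # Wrap the question in the family's tags so activations can be extracted
    # at the correct token positions.
    tags = USER_TAGS[family]
    return f"{tags['user']}\n{question}\n{tags['assistant']}\n"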
For detailed implementation and configuration options, check the model core file: wisent_guard/core/model.py

The model serves as the foundation for all representation engineering techniques. Its internal activations contain the representations we aim to detect and manipulate.
Every layer in the model produces activations that can be monitored, analyzed, and potentially modified to achieve desired behaviors.
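A minimal sketch of monitoring those activations with HuggingFace's output_hidden_states flag (the model name and layer index are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative example
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, sequence_length, hidden_size).
print(len(outputs.hidden_states))
print(outputs.hidden_states[15].shape)  # activations after layer 15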
Models can be modified through techniques like control vectors and steering to influence their output generation process.
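Continuing from the previous sketch, the steering idea can be illustrated with a forward hook that adds a control vector to one layer's output during generation. The random vector stands in for a learned direction, and the layer index and scale are illustrative choices, not Wisent-Guard's defaults:

layer_idx, scale = 15, 4.0
control_vector = torch.randn(model.config.hidden_size, dtype=model.dtype, device=model.device)

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # shift them along the control direction before they reach the next layer.
    if isinstance(output, tuple):
        return (output[0] + scale * control_vector,) + output[1:]
    return output + scale * control_vector

# The attribute path model.model.layers is typical of LLaMA/Qwen-style
# architectures; other model families may name their layer stack differently.
handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)
inputs = tokenizer("Tell me about safety.", return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model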