This page covers your most common questions about Wisent technology.
Traditional guardrails require you to specify filters for your particular use case, essentially hoping the AI does exactly as intended. You cannot see what happens inside the model. When it replies "Reject this request", you do not know whether it came close to saying "Here is the content you asked for, even though I deem it harmful". Traditional safety protections are also fragile: regex detection tends to miss anything outside the expected distribution, and it is unclear how notions such as hallucination, or harmfulness across different languages, fit into these mechanisms at all. If you would rather intervene at the level of the AI's brain itself than engineer prompts, Wisent may be worth considering.
This method beats every alternative we tested, cutting hallucination rates by 43% on Llama 3.1 8B as measured on TruthfulQA. The results are even stronger for harmfulness evaluation. Feel free to verify this yourself with the code in the evaluation folder. Here is an example command:
python -m wisent.cli tasks truthfulqa --model meta-llama/Llama-3.1-8B-Instruct --layer 15 --classifier-type logistic --save-classifier ./models/hallucination_classifier.pt --verbose
...and analyze the results in the guard_results.csv file with a human evaluator. A Claude evaluator does not perform well on the hallucination task, so we recommend going through the results by hand. You should get something like this: results spreadsheet. For out-of-distribution hallucinations that the classifier has not been trained on, the guard catches 12 out of 28 hallucinations, i.e. 43%. This comes at a cost of 5 false positives, so without further detection the net impact is 7 additional correct responses.
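The arithmetic behind those numbers can be reproduced in a few lines (the variable names below are illustrative, not part of the Wisent codebase):

```python
# Reproduce the reported guard metrics on out-of-distribution hallucinations.
caught = 12          # hallucinations flagged by the guard
total = 28           # hallucinations present in the evaluation set
false_positives = 5  # correct responses wrongly flagged

catch_rate = caught / total                 # 12 / 28 ~= 0.4286
net_improvement = caught - false_positives  # flagged hallucinations minus collateral blocks

print(f"catch rate: {catch_rate:.0%}")        # -> catch rate: 43%
print(f"net improvement: {net_improvement}")  # -> net improvement: 7
```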
Wisent uses a unique approach to identifying unwelcome representations in the activation space, distinct from both circuit breakers and SAE-based mechanistic interpretability, in order to balance accuracy and speed. You can read about SAE-based approaches, which achieve roughly a 6% hallucination reduction, here: LessWrong post.
Wisent is an experimental technology. Representation engineering requires a careful choice of hyperparameters: for every model, you need to set up the right tokens, activation reading patterns, and layers to read activations from. You can read more about it here: arxiv paper, or talk to me directly and I can help you set up Wisent for your specific use case: book a call. If you are struggling with latency or compute, we can help!
Our guardrails provide another level of safety assurance alongside the methods you typically employ; we develop tools powered by representation engineering for tracking thoughts inside models and blocking the harmful ones.
It begins with activation readings that identify the activation patterns corresponding to a specific harmful behavior; think of these as monitoring the LLM's thought process. This is preferable to relying on a benchmark score or evaluating only the final output.
This lets you monitor unusual behaviors in your LLM that trigger related patterns in its internal processing.
Our method for activation fingerprinting follows that of CAA: we convert pairs exhibiting positive and negative behavior into pairs from which activation information can be extracted through a single token. For instance, suppose you submit pairings such as...
With the raw pairs it is difficult to decide which token corresponds to the differing levels of harm; once each pair is converted so that the contrast is carried by a single token, it quickly becomes evident which activations to read. The difference in activations between the A and B sides of a pair should highlight the distinction in activation patterns, which lets us identify the specific layers where that difference appears. We can either rely on results from other research to select the layer that provides the relevant information, or conduct an exhaustive search over layers and choose the highest performer.
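The exhaustive layer search can be sketched as follows. This is a minimal illustration, not Wisent's actual implementation: the function name, array shapes, and the norm-of-mean-difference scoring rule are assumptions for the sake of the example.

```python
import numpy as np

def pick_layer(acts_pos, acts_neg):
    """Pick the layer whose mean A-vs-B activation difference is largest.

    acts_pos / acts_neg: arrays of shape (n_layers, n_pairs, hidden_dim)
    holding the single-token activations for the positive (A) and
    negative (B) side of each contrastive pair.
    """
    diffs = acts_pos - acts_neg                 # per-pair difference, per layer
    mean_diff = diffs.mean(axis=1)              # (n_layers, hidden_dim)
    scores = np.linalg.norm(mean_diff, axis=1)  # how separable each layer looks
    return int(scores.argmax())

# Toy example: only layer 1 carries a consistent offset between A and B.
rng = np.random.default_rng(0)
pos = rng.normal(size=(3, 16, 8))
neg = pos.copy()
neg[1] -= 2.0  # inject a large, consistent difference at layer 1
print(pick_layer(pos, neg))  # -> 1
```

Averaging the per-pair differences before taking the norm rewards layers where the difference points in a consistent direction, rather than layers that are merely noisy.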
Once these activations are extracted, the first important decision is how to use them. One option is to compute a vector from the average activation difference, normalize it, and compare cosine similarity against token activations at inference time. Another is to train a classifier directly on the activation data and use it to monitor activations. The choice between these alternatives depends on whether you prioritize speed or the specific cases where one method outperforms the other. Optimization algorithms can be used to ensure all features are aligned correctly.
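The first option, a normalized mean-difference vector scored by cosine similarity, can be sketched like this. Function names and thresholds are illustrative assumptions, not Wisent's API:

```python
import numpy as np

def build_direction(acts_pos, acts_neg):
    """Average activation difference between paired examples, normalized."""
    v = (acts_pos - acts_neg).mean(axis=0)  # (hidden_dim,)
    return v / np.linalg.norm(v)

def score(token_activation, direction):
    """Cosine similarity between one token's activation and the direction."""
    a = token_activation
    return float(a @ direction / (np.linalg.norm(a) * np.linalg.norm(direction)))

# Toy data: the two behaviors are offset in opposite directions.
rng = np.random.default_rng(1)
pos = rng.normal(size=(32, 8)) + 1.5  # activations on the flagged side
neg = rng.normal(size=(32, 8)) - 1.5
d = build_direction(pos, neg)

# At inference time, a high score would suggest the flagged behavior.
print(score(pos.mean(axis=0), d) > score(neg.mean(axis=0), d))  # -> True
```

A trained classifier (e.g. the logistic classifier selected by --classifier-type in the command above) trades some of this speed for flexibility, since it can learn decision boundaries that a single direction cannot capture.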
We can also track successive iterations of our model against harmful examples that we receive from you, so that the guard more closely matches your requirements.
Schedule a call with our team or reach out through our community channels.