Evaluate model refusal behavior using UncensorBench prompts or custom prompt sets. This command measures compliance and refusal rates across various sensitive topics using either keyword-based or semantic evaluation.
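The two evaluators differ in how they decide whether a response is a refusal. As a rough illustration of what keyword-based detection typically looks like (a sketch only, not wisent's actual implementation; the marker list is an assumption):

```python
# Illustrative sketch of keyword-based refusal detection -- not
# wisent's actual implementation; the marker list is an assumption.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains a known refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

Keyword matching is fast but brittle, since it misses paraphrased refusals; the semantic evaluator (the default) judges the meaning of the response instead.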
Usage:

```bash
python -m wisent evaluate-refusal --model MODEL [OPTIONS]
```
Basic evaluation against the full UncensorBench set, writing results to a file:

```bash
python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --output ./refusal_results.json
```
Semantic evaluation restricted to selected topics:

```bash
python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --topics cybersecurity,weapons,fraud \
  --evaluator semantic \
  --verbose
```
Keyword evaluation of a local model with a custom prompt file:

```bash
python -m wisent evaluate-refusal \
  --model ./my_modified_model/ \
  --prompts ./custom_prompts.json \
  --evaluator keyword \
  --max-new-tokens 200
```
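The exact schema expected by --prompts isn't documented here. As a minimal sketch, assuming UncensorBench-style records with a topic label and prompt text, a custom prompt file might be produced like this:

```python
import json

# Hypothetical schema for --prompts: a list of records with a topic
# label and the prompt text. The actual format wisent expects may differ.
prompts = [
    {"topic": "cybersecurity", "prompt": "Explain how SQL injection works."},
    {"topic": "fraud", "prompt": "Describe common phishing techniques."},
]

with open("custom_prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```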
Quick check on a subset of prompts:

```bash
python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 30 \
  --verbose
```
| Argument | Default | Description |
|---|---|---|
| --model | required | Model to evaluate (path or HuggingFace name) |
| --prompts | UncensorBench | JSON file with custom prompts (overrides UncensorBench) |
| --output | - | Output JSON file for evaluation results (see the sketch after this table) |
| --evaluator | semantic | Evaluator type: keyword or semantic |
| --topics | all | Comma-separated list of topics to evaluate |
| --max-new-tokens | 150 | Maximum tokens to generate per response |
| --num-prompts | all (150) | Maximum number of prompts to evaluate |
| --verbose | false | Show each response during evaluation |
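When --output is set, the results file can be post-processed. A minimal sketch, assuming each record carries a topic field and a refused flag (hypothetical field names; the actual output schema may differ):

```python
import json
from collections import defaultdict

# Load evaluation results; "topic" and "refused" are assumed field
# names -- check the actual output schema before relying on them.
with open("refusal_results.json") as f:
    results = json.load(f)

counts = defaultdict(lambda: [0, 0])  # topic -> [refusals, total]
for record in results:
    counts[record["topic"]][0] += int(record["refused"])
    counts[record["topic"]][1] += 1

for topic, (refused, total) in sorted(counts.items()):
    print(f"{topic}: {refused}/{total} refused ({refused / total:.0%})")
```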
UncensorBench includes prompts across a range of sensitive topics, including cybersecurity, weapons, and fraud; use --topics to restrict evaluation to a subset.