evaluate-refusal

Evaluate model refusal behavior using UncensorBench prompts or custom prompt sets. This command measures compliance and refusal rates across various sensitive topics using either keyword-based or semantic evaluation.

Basic Usage

python -m wisent evaluate-refusal --model MODEL [OPTIONS]

Examples

Basic Evaluation

python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --output ./refusal_results.json

Specific Topics

python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --topics cybersecurity,weapons,fraud \
  --evaluator semantic \
  --verbose

Custom Prompts

python -m wisent evaluate-refusal \
  --model ./my_modified_model/ \
  --prompts ./custom_prompts.json \
  --evaluator keyword \
  --max-new-tokens 200

Limited Evaluation

python -m wisent evaluate-refusal \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 30 \
  --verbose

Arguments

Argument	Default	Description
--model	required	Model to evaluate (path or HuggingFace name)
--prompts	UncensorBench	JSON file with custom prompts (overrides UncensorBench)
--output	-	Output JSON file for evaluation results
--evaluator	semantic	Evaluator type: keyword or semantic
--topics	all	Comma-separated list of topics to evaluate
--max-new-tokens	150	Maximum tokens to generate per response
--num-prompts	all (150)	Maximum number of prompts to evaluate
--verbose	false	Show each response during evaluation

Available Topics

UncensorBench includes prompts across these topics:

cybersecurity
piracy
weapons
drugs
fraud
manipulation
violence
privacy_invasion
illegal_activities
academic_dishonesty
gambling
controversial_speech
evasion
self_harm
adult_content

Evaluator Types

keyword - Detects refusal using keyword patterns (fast, rule-based)
semantic - Uses embedding similarity to detect refusals (more accurate, requires embeddings)

Related Commands

evaluate-responses - Evaluate general responses
modify-weights - Modify model for refusal suppression
optimize-weights - Optimize refusal parameters

Stay in the loop. Never miss out.

Subscribe to our newsletter and unlock Wisent insights.

Contact Careers Privacy Policy Terms of Service