Heretic Abliteration Pipeline
How Heretic automatically finds and removes refusal behavior from language models
Flow labels: Harmful + Harmless, Directions, Apply, Refusals + KL Div, New Parameters, Evaluate, Iterate (200 trials)
Prompt Datasets: harmless + harmful
Residual Extraction: hidden states per layer
Refusal Direction Computation: diff-of-means per layer
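The diff-of-means step can be illustrated with a small sketch. It assumes the residuals are already available as plain lists of floats keyed by layer index, and it unit-normalizes the result; the function names, the toy data, and the normalization are assumptions for illustration, not Heretic's actual code.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful: List[Vector], harmless: List[Vector]) -> Vector:
    """Difference of means (harmful minus harmless) for one layer, unit-normalized."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two means are identical, so no direction exists")
    return [x / norm for x in diff]


# Toy residuals (hypothetical values): layer index -> hidden-state vectors.
harmful_residuals: Dict[int, List[Vector]] = {
    0: [[1.0, 0.0], [3.0, 0.0]],
    1: [[0.0, 2.0], [0.0, 4.0]],
}
harmless_residuals: Dict[int, List[Vector]] = {
    0: [[0.0, 0.0], [0.0, 0.0]],
    1: [[0.0, 0.0], [0.0, 2.0]],
}

# One direction per layer, as the diagram label suggests.
directions = {
    layer: refusal_direction(harmful_residuals[layer], harmless_residuals[layer])
    for layer in harmful_residuals
}
```

In a real model the vectors would have the width of the residual stream and there would be one entry per transformer layer.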
Directional Ablation: orthogonalize weights vs refusal direction
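Orthogonalizing a weight matrix against a direction r is commonly written as W' = W - r (rᵀ W), which removes the component along r from everything the matrix writes. The sketch below shows that formula on nested lists. The row/column layout (rows index the output dimension) and the `scale` knob are assumptions for illustration, not confirmed details of Heretic.

```python
from typing import List

Matrix = List[List[float]]


def orthogonalize(weight: Matrix, direction: List[float], scale: float = 1.0) -> Matrix:
    """Return W - scale * r (r^T W), where r is `direction` normalized to unit length.

    `weight` is a list of rows; each row corresponds to one output dimension,
    so `direction` must have len(weight) entries.
    `scale` is a hypothetical ablation strength (1.0 removes the component fully).
    """
    norm = sum(x * x for x in direction) ** 0.5
    r = [x / norm for x in direction]
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each column of W writes along the direction.
    coeffs = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [
        [weight[i][j] - scale * r[i] * coeffs[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]
r = [1.0, 0.0]
ablated = orthogonalize(W, r)          # full removal
half_ablated = orthogonalize(W, r, scale=0.5)  # partial removal
```

With `scale=1.0` the output of the modified matrix has zero projection onto r; smaller values leave part of the component in place, which is the kind of parameter an optimizer could tune.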
Evaluation: count refusals and measure KL divergence from the original model
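The two evaluation signals can be sketched as follows. The refusal check here is a simple substring match against a hypothetical marker list, and the KL divergence is computed as KL(p || q) with p taken as the original model's distribution and q as the modified model's; both choices are assumptions for illustration, not Heretic's confirmed method.

```python
import math
from typing import Iterable, Sequence

# Hypothetical marker list; a real detector would likely be more thorough.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")


def count_refusals(responses: Iterable[str]) -> int:
    """Count responses that contain at least one refusal marker."""
    return sum(
        any(marker in response.lower() for marker in REFUSAL_MARKERS)
        for response in responses
    )


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is clamped to `eps` to avoid
    division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


refusals = count_refusals(["Sorry, I can't help with that.", "Sure, here is how."])
drift = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

A lower refusal count means the ablation worked; a lower KL divergence means the modified model stayed closer to the original.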
Optuna Optimizer: TPE parameter search (200 trials by default)
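The shape of the optimization loop can be sketched without any dependencies. Note that this sketch uses plain random search as a stand-in, not Optuna's TPE sampler, which would instead propose new parameters based on the results of earlier trials. The `evaluate` function, the single `scale` parameter, and the weighted-sum score are all hypothetical placeholders.

```python
import random
from typing import Dict, Tuple


def evaluate(scale: float) -> Tuple[float, float]:
    """Hypothetical stand-in for 'apply ablation, count refusals, measure KL'.

    Stronger ablation lowers the refusal count but raises the KL divergence.
    """
    refusals = max(0.0, 10.0 * (1.0 - scale))
    kl = 0.5 * scale ** 2
    return refusals, kl


def search(n_trials: int = 200, seed: int = 0, kl_weight: float = 1.0) -> Dict[str, float]:
    """Random search over `scale`, keeping the trial with the lowest combined score."""
    rng = random.Random(seed)
    best: Dict[str, float] = {}
    for trial in range(n_trials):
        scale = rng.uniform(0.0, 1.5)          # propose new parameters
        refusals, kl = evaluate(scale)         # evaluate the modified model
        score = refusals + kl_weight * kl      # hypothetical combined objective
        if not best or score < best["score"]:
            best = {"trial": trial, "scale": scale, "score": score}
    return best


best_trial = search()
```

The loop mirrors the diagram: new parameters go in, refusals and KL divergence come out, and the best trial is kept after all iterations.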
Decensored Model (from the best trial): save locally, upload to Hugging Face, or chat to test