Heretic Abliteration Pipeline

How Heretic automatically finds and removes refusal behavior from language models

Pipeline stages (from the diagram):

1. Prompt Datasets: harmless and harmful prompts.
2. Residual Extraction: hidden states per layer for the harmful and harmless prompts.
3. Refusal Direction Computation: difference of means per layer.
4. Directional Ablation: orthogonalize weights against the refusal direction.
5. Evaluation: count refusals and measure KL divergence from the original model.
6. Optuna Optimizer: TPE parameter search, 200 trials by default. Refusal counts and KL divergence go to the optimizer, which proposes new parameters to apply and evaluate in the next iteration.
7. Decensored Model: taken from the best trial. Save locally, upload to HuggingFace, or chat to test.
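The refusal direction computation, directional ablation, and KL divergence steps named in the pipeline can be illustrated with a minimal sketch in plain Python. This is not Heretic's code: the function names, the `scale` parameter, and the list-of-lists weight layout are assumptions made for illustration, and a real implementation would operate on model tensors rather than small lists.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_states, harmless_states):
    """Unit-normalized difference of means for one layer.

    Each argument is a list of hidden-state vectors, one per prompt.
    """
    mu_harmful = mean_vector(harmful_states)
    mu_harmless = mean_vector(harmless_states)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means are identical")
    return [x / norm for x in diff]


def ablate(weight, direction, scale=1.0):
    """Return W - scale * r r^T W for a matrix W whose rows index the
    residual-stream dimension.

    `scale` is a hypothetical ablation strength; with scale=1.0 the
    output of W has no component left along `direction`.
    """
    n_cols = len(weight[0])
    # proj[j] = sum_i r[i] * W[i][j], i.e. the row vector r^T W
    proj = [
        sum(r * row[j] for r, row in zip(direction, weight))
        for j in range(n_cols)
    ]
    return [
        [w - scale * r * p for w, p in zip(row, proj)]
        for r, row in zip(direction, weight)
    ]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In this sketch, a lower KL divergence between the original and ablated model's next-token distributions means the ablation changed the model's behavior on harmless prompts less, which is the quantity the evaluation stage weighs against the refusal count.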