Mechanistic interpretability for AI safety RL

VeriGrad RL

An open-source lab where RL policies learn activation-level safety interventions, then get audited for safety, utility retention, mechanistic faithfulness, over-refusal, and jailbreak robustness.

Built for safety research, not just reward curves.

The main environment is a synthetic residual-stream circuit. Actions are interventions: block harmful features, preserve helpful features, detect jailbreak pressure, or ask clarifying questions. The verifier checks whether the intervention is safe and causally targeted.

Mechanistic rewards

Rewards combine safety, utility, causal targeting, sparsity, and off-target activation damage.

Activation patching

Named residual-stream features make every intervention inspectable and reproducible.

Causal attribution

Counterfactual feature ablations identify which internal features drive unsafe behavior.

Safety evals

Eval reports track safety, utility, over-refusal, jailbreak success, and mechanistic alignment.

Reward-hacking checks

Verifier probes catch malformed actions, false negatives, and over-broad safety interventions.

Extensible core

The circuit can be swapped for real model activation caches without changing rollouts and evals.

System loop

VeriGrad RL keeps the boundaries explicit: environments expose latent safety circuits, policies choose interventions, verifiers score behavior and causal faithfulness, and evals distinguish safety from blanket refusal.

VeriGrad RL system diagram
Mechanistic safety verifier pipeline

Mechanistic safety demo

A 3,000-episode run learns targeted activation interventions that preserve helpfulness, avoid over-refusal, and block jailbreak-style unsafe behavior.

python3 -m verigrad_rl.cli train --env safety-circuit --episodes 3000 --temperature 1.5 --learning-rate 0.12 --run-dir runs/safety-demo
python3 -m verigrad_rl.cli eval --env safety-circuit --checkpoint runs/safety-demo/policy.json --tasks 200
python3 -m unittest discover -s tests
1.00Safety rate
1.00Mechanistic alignment
0.00Jailbreak success
12Unit tests
Safety, utility, mechanistic alignment, over-refusal, and jailbreak success during training

Mechanistic inspection figures

The notebook and asset generator produce interpretable figures for causal attribution, intervention comparison, and behavior-logit shifts before and after a patch.

Causal attribution for unsafe-completion logit
Intervention reward comparison on a jailbreak prompt
Behavior logits before and after targeted intervention

Interactive biosafety playground

The biosafety demo turns the same verifier logic into a browser playground. Adjust synthetic sequence-risk, dual-use capability, synthesis-scale, utility, and uncertainty sliders, then watch the recommended intervention, logits, attribution, and verifier reward update live.

Notebook walkthrough

The notebook gives a GitHub-renderable tour with figures, reproducible commands, activation patches, safety metrics, and the core training flow.