Mechanistic framing
Uses named residual-stream features, activation patching, and causal attribution.
VeriGrad RL is an open-source mechanistic interpretability and AI safety project: a reinforcement learning (RL) policy learns activation-level interventions, and evaluations then test whether those interventions are safe, useful, and causally faithful.
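As a rough illustration of what an activation-patching check for causal faithfulness might look like, here is a minimal sketch on a toy linear readout. The feature names, the weights, and the functions `unsafe_logit` and `patch_effects` are hypothetical and are not taken from the project's code.

```python
# Hypothetical toy readout: each named feature contributes linearly to an "unsafe" logit.
WEIGHTS = {"harm_intent": 2.0, "refusal": -1.5, "topic_chemistry": 0.0}


def unsafe_logit(acts):
    """Score a dict of named feature activations with the toy linear readout."""
    return sum(WEIGHTS[name] * value for name, value in acts.items())


def patch_effects(clean, corrupted):
    """Patch one feature at a time from the clean run into the corrupted run.

    Returns, per feature, the change in the unsafe logit caused by that patch.
    A large magnitude suggests the feature is causally relevant in this toy model.
    """
    baseline = unsafe_logit(corrupted)
    effects = {}
    for name in corrupted:
        patched = dict(corrupted)      # copy so the corrupted run is not mutated
        patched[name] = clean[name]    # overwrite a single feature activation
        effects[name] = unsafe_logit(patched) - baseline
    return effects


clean_run = {"harm_intent": 0.0, "refusal": 0.0, "topic_chemistry": 1.0}
corrupted_run = {"harm_intent": 1.0, "refusal": 0.0, "topic_chemistry": 1.0}
effects = patch_effects(clean_run, corrupted_run)
```

In this toy setup only `harm_intent` differs between the two runs, so it is the only feature whose patch changes the logit.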
Optimizes for safe behavior without losing utility or over-refusing benign tasks.
Penalizes interventions that are behaviorally safe but mechanistically over-broad.
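One way such a reward could be shaped is sketched below. This is an assumption for illustration only: the `RolloutOutcome` fields, the `shaped_reward` function, and the `breadth_weight` value are hypothetical, and the project's actual reward may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class RolloutOutcome:
    safe: bool             # verifier judged the final behavior safe
    useful: bool           # benign task was still completed (no over-refusal)
    features_touched: int  # features the intervention modified
    features_needed: int   # features that causal attribution implicates


def shaped_reward(outcome, breadth_weight=0.1):
    """Reward safety and utility, minus a penalty for over-broad interventions."""
    base = (1.0 if outcome.safe else 0.0) + (1.0 if outcome.useful else 0.0)
    # Only features beyond the causally implicated set count as excess breadth.
    excess = max(0, outcome.features_touched - outcome.features_needed)
    return base - breadth_weight * excess


targeted = shaped_reward(RolloutOutcome(True, True, 2, 2))
over_broad = shaped_reward(RolloutOutcome(True, True, 12, 2))
```

Both rollouts are behaviorally safe and useful, but the second touches ten more features than needed and therefore scores lower.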
Tracks jailbreak success on held-out attack styles separately from ordinary harmful prompts.
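A minimal sketch of reporting the two rates separately follows. The record fields `category` and `attack_succeeded`, the category labels, and the function `success_rate_by_category` are hypothetical and do not reflect the project's actual log schema.

```python
from collections import defaultdict


def success_rate_by_category(records):
    """Return the attack success rate for each prompt category, kept separate."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        category = record["category"]
        totals[category] += 1
        if record["attack_succeeded"]:
            successes[category] += 1
    return {category: successes[category] / totals[category] for category in totals}


records = [
    {"category": "held_out_jailbreak", "attack_succeeded": True},
    {"category": "held_out_jailbreak", "attack_succeeded": False},
    {"category": "ordinary_harmful", "attack_succeeded": False},
    {"category": "ordinary_harmful", "attack_succeeded": False},
]
rates = success_rate_by_category(records)
```

Keeping the categories apart avoids a low rate on ordinary harmful prompts masking a higher rate on unseen attack styles.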
Rollouts, verifiers, metrics, checkpoints, JSONL logs, and CI all work out of the box.
The synthetic circuit can be replaced by real activation caches or SAE features.
Connect the environment to real model activations: cache residual streams, decompose them with sparse autoencoders, and train intervention policies that steer unsafe directions while preserving helpful capabilities.
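The steering step can be sketched as removing the component of a residual-stream vector that lies along an unsafe direction, for example an SAE decoder direction. The sketch below uses plain Python lists in place of tensors so that it has no dependencies; the function `steer_away` and its `strength` parameter are hypothetical, and a real pipeline would apply the same arithmetic to cached model activations.

```python
import math


def steer_away(resid, direction, strength=1.0):
    """Subtract `strength` times the projection of `resid` onto `direction`.

    With strength=1.0 the component along the direction is fully ablated,
    while everything orthogonal to it is left unchanged.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    coeff = sum(h * u for h, u in zip(resid, unit))  # projection coefficient
    return [h - strength * coeff * u for h, u in zip(resid, unit)]


resid = [3.0, 4.0]
unsafe_direction = [2.0, 0.0]  # need not be unit length; it is normalized above
fully_ablated = steer_away(resid, unsafe_direction)
half_ablated = steer_away(resid, unsafe_direction, strength=0.5)
```

Because only the projection onto the chosen direction changes, the second coordinate, which stands in here for helpful capabilities, is preserved at either strength.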