Project Pitch

VeriGrad RL is an open-source mechanistic interpretability and AI safety project: a reinforcement learning (RL) policy learns activation-level interventions, and evaluations then test whether those interventions are safe, useful, and causally faithful.

What it showcases

Mechanistic framing

Uses named residual-stream features, activation patching, and causal attribution.
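
As a rough illustration of activation patching over named features, the sketch below uses a toy linear "model" whose logit is a weighted sum of residual-stream features. The feature names, weights, and the `patch_effect` helper are hypothetical and are not taken from the project code.

```python
# Toy "model": the logit is a weighted sum over named residual-stream features.
# Feature names and weights are hypothetical, chosen only for illustration.
WEIGHTS = {"refusal": 2.0, "harm_intent": -3.0, "politeness": 0.5}


def logit(acts):
    """Output logit for a dict of feature activations."""
    return sum(WEIGHTS[name] * value for name, value in acts.items())


def patch_effect(clean, corrupted, feature):
    """Fraction of the clean-vs-corrupted logit gap restored by patching one feature."""
    patched = dict(corrupted)
    patched[feature] = clean[feature]  # copy the clean activation into the corrupted run
    gap = logit(clean) - logit(corrupted)
    if gap == 0:
        return 0.0
    return (logit(patched) - logit(corrupted)) / gap


clean = {"refusal": 1.0, "harm_intent": 0.0, "politeness": 1.0}
corrupted = {"refusal": 0.0, "harm_intent": 1.0, "politeness": 1.0}
effects = {name: patch_effect(clean, corrupted, name) for name in WEIGHTS}
```

Because this toy model is linear, the per-feature effects sum to 1; in a real network they generally would not, which is one reason causal attribution needs explicit checks.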

Safety objective

Optimizes for safe behavior without losing utility or over-refusing benign tasks.
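
One way such an objective could be scored per rollout is sketched below. The function name, its arguments, and the specific reward values are assumptions made for illustration; the project's actual reward may differ.

```python
def safety_reward(is_harmful_prompt, refused, task_score, refusal_penalty=1.0):
    """Hypothetical scalar reward for a single rollout.

    Harmful prompt: +1 for refusing, -1 for complying.
    Benign prompt:  task_score (assumed in [0, 1]) if answered,
                    -refusal_penalty if refused (over-refusal).
    """
    if is_harmful_prompt:
        return 1.0 if refused else -1.0
    if refused:
        return -refusal_penalty
    return task_score


rewards = [
    safety_reward(True, True, 0.0),    # refused a harmful prompt
    safety_reward(True, False, 0.0),   # complied with a harmful prompt
    safety_reward(False, True, 0.0),   # over-refused a benign prompt
    safety_reward(False, False, 0.8),  # answered a benign prompt well
]
```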

Faithfulness checks

Penalizes interventions that are behaviorally safe but mechanistically over-broad.
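
A minimal sketch of such a penalty, assuming an intervention is represented as a dict of per-feature activation changes and that a set of target features is known. The names `breadth_penalty`, `faithful_reward`, and `lam` are hypothetical.

```python
def breadth_penalty(delta, target_features):
    """Total absolute activation change applied outside the target features."""
    return sum(abs(v) for name, v in delta.items() if name not in target_features)


def faithful_reward(behavior_reward, delta, target_features, lam=0.5):
    """Behavioral reward minus a weighted penalty for off-target edits."""
    return behavior_reward - lam * breadth_penalty(delta, target_features)


target = {"harm_intent"}
narrow = {"harm_intent": -1.0}
broad = {"harm_intent": -1.0, "politeness": -0.8, "refusal": 0.6}

# Both interventions are assumed to achieve the same behavioral reward of 1.0.
narrow_score = faithful_reward(1.0, narrow, target)
broad_score = faithful_reward(1.0, broad, target)
```

Under these assumptions the narrow intervention keeps its full reward, while the broad one is marked down even though its behavior is equally safe.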

Jailbreak evals

Tracks jailbreak success on held-out attack styles separately from ordinary harmful prompts.
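
The bookkeeping for this could look like the sketch below, which computes attack success rates per split and attack style. The record fields (`split`, `style`, `success`) and the example style names are assumptions, not the project's actual log schema.

```python
from collections import defaultdict


def success_rates(records):
    """Attack success rate keyed by (split, style).

    record["success"] is True when the attack elicited unsafe output.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["split"], r["style"])
        totals[key] += 1
        hits[key] += int(r["success"])
    return {key: hits[key] / totals[key] for key in totals}


records = [
    {"split": "train", "style": "plain_harmful", "success": False},
    {"split": "train", "style": "plain_harmful", "success": False},
    {"split": "heldout", "style": "roleplay", "success": True},
    {"split": "heldout", "style": "roleplay", "success": False},
]
rates = success_rates(records)
```

Keeping the held-out styles under their own keys makes it harder for a low rate on ordinary harmful prompts to hide weak generalization to unseen attacks.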

Research infrastructure

Rollouts, verifiers, metrics, checkpoints, JSONL logs, and CI all work out of the box.

Extension path

The synthetic circuit can be replaced by real activation caches or SAE features.

Next impressive extension

Connect the environment to real model activations: cache residual streams, decompose them with sparse autoencoders, and train intervention policies that steer unsafe directions while preserving helpful capabilities.
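
As a sketch of what steering away from an unsafe direction could mean, the snippet below subtracts the projection of a residual vector onto a single direction. In a real setup that direction might come from a sparse autoencoder decoder vector; here both vectors are made up, and the `steer` helper and its `strength` parameter are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(resid, unsafe_dir, strength=1.0):
    """Subtract `strength` times the projection of resid onto unsafe_dir."""
    norm_sq = dot(unsafe_dir, unsafe_dir)
    if norm_sq == 0:
        return list(resid)
    coeff = strength * dot(resid, unsafe_dir) / norm_sq
    return [r - coeff * u for r, u in zip(resid, unsafe_dir)]


resid = [3.0, 4.0, 1.0]       # made-up residual-stream vector
unsafe_dir = [0.0, 1.0, 0.0]  # made-up "unsafe" feature direction
steered = steer(resid, unsafe_dir)
```

With `strength=1.0` the component along the unsafe direction is removed entirely and the orthogonal components are left unchanged, which is the simplest version of steering while preserving other capabilities.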