Architecture

VeriGrad RL separates the moving pieces that tend to get tangled in safety-oriented RL post-training systems: behavior, reward, internal mechanisms, and evals.

Figure: VeriGrad RL system diagram

Core flow

  1. A safety environment samples a task with hidden residual-stream features.
  2. A policy chooses an activation intervention.
  3. The synthetic circuit applies the intervention as an activation patch and predicts the resulting behavior.
  4. A verifier scores safety, utility, mechanistic targeting, sparsity, and off-target damage.
  5. Rollout collection stores the transition.
  6. The trainer updates the policy using the reward advantage (the reward relative to a baseline).
  7. Evaluators run deterministic checks on train, OOD benign, and jailbreak splits.
  8. Monitors probe the verifier for false positives, false negatives, and reward hacking.
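
Steps 1 through 6 above can be sketched as a single training loop. This is a minimal illustration, not VeriGrad RL's actual code: the function names (`sample_task`, `circuit`, `verify`, `train`), the toy circuit formula, the scalar reward, and the REINFORCE-style update with a running-mean baseline are all assumptions made for the example. Evaluators and monitors (steps 7 and 8) are left out.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical feature names, taken from the "Named internals" list.
FEATURES = ["harmful_intent", "helpful_intent", "jailbreak_pressure",
            "refusal_prior", "uncertainty"]

@dataclass
class Transition:
    action: int    # index of the feature that was patched
    reward: float  # scalar score from the verifier

def sample_task(rng):
    # Step 1: a task is a set of hidden feature activations in [0, 1).
    return {name: rng.random() for name in FEATURES}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def circuit(task, action):
    # Step 3: patch the chosen feature to zero, then predict behavior.
    # Toy assumption: unsafe compliance = harmful intent not caught by refusal.
    patched = dict(task)
    patched[FEATURES[action]] = 0.0
    return patched["harmful_intent"] * (1.0 - patched["refusal_prior"])

def verify(unsafe_score):
    # Step 4: toy scalar reward; a real verifier scores several components.
    return 1.0 - unsafe_score

def train(steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(FEATURES)
    baseline = 0.0
    buffer = []
    for _ in range(steps):
        task = sample_task(rng)
        probs = softmax(logits)
        # Step 2: the policy samples an intervention during training.
        action = rng.choices(range(len(FEATURES)), weights=probs)[0]
        reward = verify(circuit(task, action))
        buffer.append(Transition(action, reward))  # Step 5
        # Step 6: policy-gradient update weighted by the advantage.
        advantage = reward - baseline
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
        baseline += 0.1 * (reward - baseline)  # running-mean baseline
    return logits, buffer
```

Calling `train()` returns the learned policy logits and the stored transitions. Keeping each stage in its own function mirrors the separation described above: the verifier can be swapped or probed without touching the circuit or the trainer.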

Design choices

Named internals

The synthetic circuit exposes interpretable features such as harmful intent, helpful intent, jailbreak pressure, refusal prior, and uncertainty.
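
One way to represent named features is a small immutable record, so that an intervention refers to a feature by name rather than by an index into an opaque vector. The class name `CircuitFeatures` and the helper `patch` are hypothetical, as is the choice of floats for activations; only the five feature names come from the description above.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class CircuitFeatures:
    # Field names follow the features listed above; float values are an assumption.
    harmful_intent: float
    helpful_intent: float
    jailbreak_pressure: float
    refusal_prior: float
    uncertainty: float

def patch(features, name, value):
    """Return a copy of `features` with one named activation overwritten."""
    valid = {f.name for f in fields(features)}
    if name not in valid:
        raise KeyError(f"unknown feature: {name}")
    return replace(features, **{name: value})

# Usage: suppress jailbreak pressure without touching anything else.
before = CircuitFeatures(0.9, 0.1, 0.7, 0.2, 0.3)
after = patch(before, "jailbreak_pressure", 0.0)
```

Because the record is frozen, a patch never mutates the original task state, which makes it straightforward to compare behavior before and after an intervention.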

Structured rewards

Verifiers explain failures with machine-readable reason codes and metric components.
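
A structured verifier result might look like the sketch below. The component names follow the scoring dimensions listed in the core flow; the `VerifierResult` class, the weights, the thresholds, and the reason-code strings are illustrative assumptions rather than the project's real schema.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class VerifierResult:
    reward: float
    components: dict                       # per-metric scores
    reason_codes: list = field(default_factory=list)

# Assumed weights and thresholds, chosen only for illustration.
WEIGHTS = {"safety": 0.4, "utility": 0.3, "targeting": 0.2,
           "sparsity": 0.1, "off_target_damage": -0.5}
MIN_OK = {"safety": 0.5, "utility": 0.3}   # below this, emit LOW_<METRIC>
MAX_OK = {"off_target_damage": 0.25}       # above this, emit HIGH_<METRIC>

def score(components):
    """Combine metric components into a reward plus failure reason codes."""
    reward = sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
    codes = [f"LOW_{k.upper()}" for k, t in MIN_OK.items() if components[k] < t]
    codes += [f"HIGH_{k.upper()}" for k, t in MAX_OK.items() if components[k] > t]
    return VerifierResult(reward=reward, components=dict(components),
                          reason_codes=codes)

result = score({"safety": 0.2, "utility": 0.9, "targeting": 0.8,
                "sparsity": 0.7, "off_target_damage": 0.4})
payload = json.dumps(asdict(result))       # machine-readable form for logs
```

Keeping the components alongside the scalar reward lets a monitor ask why a rollout scored poorly, instead of only observing that it did.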

Deterministic evals

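Evals use the policy's greedy (highest-probability) intervention, whereas training samples interventions from the policy distribution. Eval results are therefore reproducible across runs.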
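
The difference between the two modes can be shown with a single selection function. This is a sketch under the assumption that the policy outputs a probability for each candidate intervention; `select_intervention` and its parameters are hypothetical names.

```python
import random

def select_intervention(probs, greedy, rng=None):
    """Pick an intervention index from a list of policy probabilities."""
    if greedy:
        # Eval mode: always the most probable intervention (first one on ties).
        return max(range(len(probs)), key=probs.__getitem__)
    # Training mode: sample in proportion to the policy probabilities.
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs)[0]

probs = [0.1, 0.7, 0.2]
eval_choice = select_intervention(probs, greedy=True)
train_choice = select_intervention(probs, greedy=False, rng=random.Random(0))
```

Greedy selection has no random state, so two eval runs over the same policy and split produce identical choices. Sampling keeps exploration in training.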