VeriGrad RL: propensity evaluation for frontier models

Get started · 60 seconds

Install and run it

A pip-installable Python package (verigrad CLI + library) plus this Next.js app. The RL baseline runs offline; the propensity benchmark calls real models and reads ANTHROPIC_API_KEY from your environment.

install.sh

# clone + install (editable) for Python 3.10+
git clone https://github.com/aravinds-kannappan/VeriGrad-RL
cd VeriGrad-RL
pip install -e ".[llm,inspect]"

run.sh

# runs offline in seconds: verifiable-reward RL, no API key
verigrad train --episodes 400 --env arithmetic
 
# the propensity benchmark on real frontier models
export ANTHROPIC_API_KEY=sk-ant-...
verigrad propensity --smoke      # 3 problems, ~$0.01 (sanity check)
verigrad propensity --tasks 150  # the full leaderboard run

Use it as a library

The probe templates and deterministic detectors are importable. Render a prompt, call any model yourself, and score the response reproducibly.

probe.py

from verigrad_rl.integrations import (
    build_pressure_samples, label_response, safe_behavior,
)

# real GSM8K problems rendered under a confident, wrong reviewer
samples = build_pressure_samples("authority_wrong", n_tasks=150, seed=7)

# call any model on samples[0].prompt and store its reply in model_text,
# then score deterministically:
labels = label_response("authority_wrong", model_text,
                        samples[0].gold, samples[0].anchor)
held = safe_behavior("authority_wrong", labels)   # True == did not cave

What’s in the box

A benchmark, a baseline, and the plumbing to trust both

🎯

Propensity benchmark

Answer Under Pressure: deference, spec-gaming, and robustness on real GSM8K + CommonsenseQA. Measures what a model will do, not just what it can.

🔌

Inspect AI adapter

Run every probe through UK AISI's Inspect against Anthropic, OpenAI, Google, or a local model behind vLLM/Ollama. One flag swaps the vendor.

📈

Scales to a program

Content-addressed, resumable runs with a hard cost ceiling, item-clustered CIs, and Benjamini–Hochberg FDR correction across domains.

🔬

Automated circuit discovery

A dependency-free ACDC + path-patching implementation (Conmy 2023; Goldowsky-Dill 2023) that finds the safety circuit on a transparent substrate, validated against a known answer key.

✅

We test the ruler

Cohen's κ dual-labeling on every grader. The cross-check already caught a real bug in our own detector before it reached the report.

🧪

RL-from-verifier baseline

A transparent policy-gradient loop on verifiable rewards. Runs offline in seconds, no GPU, every number reproducible from a seed.

Interactive · live model calls

Probe a model under pressure

Pick a real GSM8K problem (or edit it), choose a model and how hard a wrong “authority” pushes back, then run it against a real model. The verdict (did it hold or cave?) is computed from the live response.

Controls: real GSM8K problem (editable) · gold answer · examples · model · pressure · intensity 3

Calls run server-side via a Next.js route; enable them by setting ANTHROPIC_API_KEY in the deployment. The rest of the site works without a key.

The core finding

Capability is nearly tied. Trustworthiness is not.

Three frontier models, 150 real GSM8K problems each, under three framings.

Capability vs. propensity. Control accuracy is plotted against propensity (sycophancy or deference). All three models cluster near 96% accuracy (a near-vertical band), yet spread up to ~9× on sycophancy. The two dimensions are nearly orthogonal.

Model	Control accuracy	Deference ↓	Sycophancy on solved ↓	Spec-gaming ↓	Cost
`opus-4.8`	96.7%	2.7%	2.1% [0.7, 5.9]	0.0%	$1.78
`sonnet-4.6`	96.7%	20.0%	17.9% [12.5, 25.0]	0.0%	$1.56
`haiku-4.5`	95.3%	10.7%	9.1% [5.4, 14.9]	0.0%	$0.66

Headline. Sonnet 4.6 and Opus 4.8 are statistically tied on capability, but Sonnet abandons a correct answer under a wrong reviewer 17.9% of the time versus Opus’s 2.1%: a ~9× gap with non-overlapping confidence intervals.
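The bracketed intervals in the table can be reproduced with a Wilson score interval. That is a plausible guess at the method, not something the text states. The sketch below assumes counts back-computed from the rounded percentages (for example, 3 caves out of 145 solved problems for opus-4.8). `wilson_interval` is an illustrative helper, not part of the verigrad API.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% at z=1.96)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# ASSUMED counts (caved, solved), inferred from the table's rounded percentages.
assumed_counts = {
    "opus-4.8": (3, 145),
    "sonnet-4.6": (26, 145),
    "haiku-4.5": (13, 143),
}

for model, (caved, solved) in assumed_counts.items():
    lo, hi = wilson_interval(caved, solved)
    print(f"{model}: {caved / solved:.1%} [{lo * 100:.1f}, {hi * 100:.1f}]")
```

Under those assumed counts, the Opus and Sonnet intervals do not overlap, which is the comparison the headline rests on.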

Open source · interoperable

Built to live in the safety-eval ecosystem

The probes are tiny and provider-neutral on purpose, so other people’s harnesses can drive them. The flagship is a real Inspect AI adapter: the same deterministic detectors, now against any model Inspect can reach.

inspect.sh

pip install "verigrad-rl[inspect]"
 
# the SAME probes, against any Inspect provider; one flag swaps the vendor
inspect eval verigrad_rl/integrations/inspect_task.py@deference \
  --model anthropic/claude-sonnet-4-6
inspect eval verigrad_rl/integrations/inspect_task.py@deference \
  --model openai/gpt-4o

Shipped integrations

Shipped · Inspect AI · UK AI Safety Institute

Probes run as real Inspect Tasks against any provider it supports.

Shipped · GSM8K + CommonsenseQA · task source

Real public datasets, downloaded and cached. Never synthetic, never modified.

Shipped · Anthropic API · models + judge

Models under test and the independent reliability judge; cost is measured, not estimated.

Patterned after · compatible by design

Conventions VeriGrad deliberately mirrors so it slots into a real safety-eval workflow. Listed honestly as influences and interop targets, not bundled adapters.

Influence · garak · NVIDIA

Probe/detector taxonomy for LLM vulnerability scanning.

Influence · lm-evaluation-harness · EleutherAI

Capability-baseline + results-table conventions.

Influence · HELM · Stanford CRFM

Scenario/metric separation and CI-first reporting.

Influence · TransformerLens · open source

White-box, mechanistic-interpretability workflow.

Influence · petri · Anthropic

Auditing-agent philosophy: probe behavior under pressure.

Full details in the Integrations docs.

Interactive · runs in your browser

Train a model on the data, live

No server, no pre-baked numbers. These widgets run real computation in your browser on the 648 logged samples from the cross-domain run.

What drives deference? Logistic regression, trained in-browser

Gradient descent fits P(defer) from four features of the real run (648 samples, 47 deferred). Watch it train, read the learned coefficients, then predict on any combination.
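As a rough illustration of what such a fit involves, here is a minimal batch gradient-descent logistic regression in plain Python. The rows are hypothetical toy data with two made-up features, not the 648 logged samples. The function names are illustrative, and the in-browser widget may differ in its details.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """P(defer) for one feature row x under weights w and bias b."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def log_loss(X, y, w, b):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, b, xi)
        total += yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return -total / len(X)


def fit(X, y, lr=0.5, epochs=300):
    """Full-batch gradient descent on the mean log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errs = [predict(w, b, xi) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / len(X)
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / len(X)
    return w, b


# HYPOTHETICAL rows: (standardized authority intensity, is_csqa) -> deferred?
X = [(-1, 0), (-1, 1), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (1, 0)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

w, b = fit(X, y)
print("coefficients:", [round(v, 2) for v in w], "bias:", round(b, 2))
print("log-loss:", round(log_loss(X, y, w, b), 3))
```

With all weights at zero the model predicts exactly 0.5 for every input, so an untrained fit carries no information. Training should push the intensity coefficient positive on this toy data, because deferrals only occur at the highest intensity.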

learning rate 0.50 · epochs 300

converged log-loss n/a · trained on 648 real samples

Learned coefficients standardized · + = more deference

Feature	Coefficient
authority intensity	+0.00
is Sonnet 4.6	+0.00
is Haiku 4.5	+0.00
is CommonsenseQA	+0.00

Predict P(defer) for any combination

model · domain · authority intensity 3

50.0%

The κ paradox: why raw agreement lies for rare behaviors

Two independent graders label items where the true behavior occurs at a chosen base rate. Drag the rate down and watch Cohen’s κ collapse while raw agreement stays high. That is the trap that hid our spec-gaming detector bug.

behavior base rate 0.7% · per-grader accuracy 98.0%

raw agreement 96.1% · Cohen’s κ 0.25

Raw agreement looks great, but κ has collapsed. The behavior is too rare to validate. This is exactly the trap that hid the spec-gaming detector bug.
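The collapse follows from the definition κ = (p_o − p_e) / (1 − p_e). The sketch below computes the expected values under two assumptions: the graders err independently, and each grader is equally accurate on positives and negatives. `agreement_and_kappa` is an illustrative name, and the widget's exact model is not specified in the text.

```python
def agreement_and_kappa(base_rate, accuracy):
    """Expected raw agreement and Cohen's kappa for two independent graders
    that each label an item correctly with probability `accuracy`."""
    # The graders agree when both are right or both are wrong.
    p_observed = accuracy ** 2 + (1 - accuracy) ** 2
    # Marginal probability that a single grader outputs "positive".
    p_pos = base_rate * accuracy + (1 - base_rate) * (1 - accuracy)
    # Agreement expected by chance from the marginals alone.
    p_chance = p_pos ** 2 + (1 - p_pos) ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


for rate in (0.5, 0.1, 0.007):
    agree, kappa = agreement_and_kappa(rate, accuracy=0.98)
    print(f"base rate {rate:.1%}: agreement {agree:.1%}, kappa {kappa:.2f}")
```

Raw agreement does not depend on the base rate at all in this model, while chance agreement climbs toward 1 as the behavior gets rarer. At a 0.7% base rate and 98% accuracy this gives 96.1% agreement and κ ≈ 0.25.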

Scales to a research program

Across domains, providers, and an elicitation gradient

Elicitation gradient. Deference under escalating authority pressure, per model, across two domains. Cross-domain run: 648 samples, $1.74.

Model	GSM8K · L1	GSM8K · L3	CSQA · L1	CSQA · L3
`opus-4.8`	0.0%	13.9%	2.8%	2.8%
`sonnet-4.6`	0.0%	8.3%	8.3%	22.2%
`haiku-4.5`	0.0%	16.7%	8.3%	47.2%

Provider-agnostic by design. The native runner targets Anthropic; the Inspect adapter lifts that ceiling so the same probes produce a cross-vendor leaderboard. Runs are content-addressed and resumable with a hard cost ceiling. See the Scaling docs.

FDR correction changes a conclusion. On CommonsenseQA, Haiku-vs-Sonnet (47% vs 22%) is significant at raw p = 0.026 but not after Benjamini–Hochberg (q = 0.052). And the model ranking differs across domains: a propensity measured on math doesn’t cleanly transfer.
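For readers unfamiliar with the correction, here is a minimal sketch of the Benjamini–Hochberg adjustment. The text does not say how many comparisons are in the family, so the example assumes two tests. Only the 0.026 comes from the run, and the second p-value is invented purely for illustration.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# ASSUMED family of two tests; 0.40 is a made-up placeholder.
raw = [0.026, 0.40]
for p, q in zip(raw, benjamini_hochberg(raw)):
    verdict = "significant" if q < 0.05 else "not significant"
    print(f"p = {p:.3f} -> q = {q:.3f} ({verdict} at 0.05)")
```

In this assumed configuration the smallest of two p-values is doubled, which is one way a raw p of 0.026 becomes q = 0.052 and crosses the 0.05 line.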

Why models cave

Social, not cognitive

When a model abandons the correct answer, did its reasoning already compute it and then cave (override), or did the pressure corrupt the computation (anchored)? Across the lineup, ~90% is override. The model knew, and threw it away.

Override dominates. Two independent signals classify each case and agree at 94%.

Model	Deference cases	Override	Anchored	Override share
`opus-4.8`	4	3	1	75.0%
`sonnet-4.6`	30	28	2	93.3%
`haiku-4.5`	16	14	2	87.5%

White-box · automated circuit discovery

Find the safety circuit automatically

Behavior tells you a model caved; mechanism tells you which components did it. VeriGrad ships a dependency-free implementation of ACDC (Conmy et al., NeurIPS 2023) and path patching (Goldowsky-Dill et al., 2023), run on a transparent safety circuit with a known answer key, so the discovery is validated, not just asserted.

Discover the safety circuit, live

This runs the real ACDC + path-patching algorithm in your browser on the transparent safety DAG. Drag the threshold: a higher tau prunes more aggressively, the paper’s precision/recall knob. Solid teal edges are kept; dashed are pruned.

threshold tau 0.020

edges kept 8/13 · faithfulness KL 0.009 · sparsity 38%

Same code path as verigrad circuit: the harm-detection then guard then output pathway survives, while edges out of the constant inputs (refusal_cue, noise) carry no information and get pruned.
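The pruning rule can be sketched on a much smaller graph. What follows is a simplified, ACDC-style greedy pass over a hypothetical 6-edge toy circuit. It is not the 13-edge safety DAG and not VeriGrad's implementation. A pruned edge carries its parent's activation from a corrupted run, and an edge is removed when doing so raises the output KL by less than tau. The weights, node wiring, and inputs are all made up for illustration.

```python
import math

# HYPOTHETICAL toy circuit: (parent, child, weight), ordered by child depth.
EDGES = [
    ("harm", "detect", 3.0),
    ("detect", "guard", 3.0),
    ("noise", "guard", 0.2),
    ("guard", "out", 4.0),
    ("harm", "out", 0.05),
    ("refusal_cue", "out", -2.0),
]
HIDDEN = ["detect", "guard"]  # tanh units, in topological order
CLEAN = {"harm": 1.0, "noise": 1.0, "refusal_cue": 1.0}
CORRUPT = {"harm": 0.0, "noise": 1.0, "refusal_cue": 1.0}  # only harm differs


def run(inputs, pruned=frozenset(), patch=None):
    """Forward pass; a pruned edge reads its parent's value from `patch`."""
    val = dict(inputs)
    for node in HIDDEN + ["out"]:
        total = 0.0
        for parent, child, weight in EDGES:
            if child == node:
                source = patch if (parent, child) in pruned else val
                total += weight * source[parent]
        val[node] = math.tanh(total) if node in HIDDEN else total
    val["p"] = 1.0 / (1.0 + math.exp(-val["out"]))  # Bernoulli output
    return val


def bernoulli_kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def acdc(tau):
    """Greedy edge pruning, visiting edges from the output side backwards."""
    corrupt = run(CORRUPT)
    p_full = run(CLEAN)["p"]
    pruned, kl_now = set(), 0.0
    for parent, child, _ in reversed(EDGES):
        trial = pruned | {(parent, child)}
        kl_trial = bernoulli_kl(p_full, run(CLEAN, trial, corrupt)["p"])
        if kl_trial - kl_now < tau:  # edge barely matters: drop it
            pruned, kl_now = trial, kl_trial
    kept = [(p, c) for p, c, _ in EDGES if (p, c) not in pruned]
    return kept, kl_now


kept, kl = acdc(tau=0.02)
print("kept:", kept)
print(f"faithfulness KL: {kl:.4f}")
```

On this toy graph with tau = 0.02, the pass keeps only harm → detect → guard → out. The edges out of the constant inputs are pruned because patching them changes nothing, and the weak direct harm → out edge falls below the threshold. A very large tau prunes every edge.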

circuit.sh

# automated circuit discovery on a transparent safety circuit
verigrad circuit --target safety-dag --tau 0.02
#  -> benchmark/circuits/{REPORT.md, fig_circuit.svg}
 
# run the same method on the RL environment's actual reward model
verigrad circuit --target toy-circuit

Validated, not asserted. Because the circuit is white-box it has a known answer key, so tests/test_acdc.py checks the core pathway is recovered, that information-free edges are pruned, and that precision/recall move with the threshold exactly as the ACDC paper predicts. Method in the Circuit discovery docs; every paper behind VeriGrad is mapped in References.

Is the ruler trustworthy?

We test our graders

Label	Cohen’s κ	Raw agreement	n	Verdict
Correctness	0.95	99%	450	validated
Deference	0.97	99%	150	validated
Spec-gaming	n/a	n/a	150	0 positives (nothing to validate)

The cross-check caught a bug in our own ruler. An earlier spec-gaming detector flagged 3 “positives” the judge unanimously rejected: all three the same clock-time answer, “2:00 PM”, misread as two numbers (κ = 0.00 despite 98% raw agreement). After the fix, true spec-gaming is 0/150.
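The κ = 0.00 figure can be checked by hand from the 2×2 agreement table. The sketch below takes the detector as rater A and the judge as rater B over the 150 items: 3 items flagged only by the detector, none flagged by the judge. `cohens_kappa` is an illustrative helper, not the project's grader code.

```python
def cohens_kappa(both_pos, only_a, only_b, both_neg):
    """Raw agreement and Cohen's kappa from a 2x2 table for raters A and B."""
    n = both_pos + only_a + only_b + both_neg
    a_pos = both_pos + only_a
    b_pos = both_pos + only_b
    agree = both_pos + both_neg
    # Chance agreement scaled by n*n, so the arithmetic stays in integers.
    chance = a_pos * b_pos + (n - a_pos) * (n - b_pos)
    if chance == n * n:
        # Neither rater ever varies: kappa is undefined (nothing to validate).
        return agree / n, float("nan")
    return agree / n, (agree * n - chance) / (n * n - chance)


# Buggy detector: 3 flags, all rejected by the judge.
print(cohens_kappa(both_pos=0, only_a=3, only_b=0, both_neg=147))

# After the fix: no positives from either rater.
print(cohens_kappa(both_pos=0, only_a=0, only_b=0, both_neg=150))
```

The first call gives 98% agreement with κ exactly 0, because the judge never says "positive" and chance alone already explains all the agreement. The second returns NaN for κ, which corresponds to the n/a row in the table.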

Measure what a frontier model will do under pressure.
