deepseek-r1 | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for deepseek-r1, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.57	0.53	0.73	-0.04 calibrated
MMLU-Pro (knowledge)	0.65	0.80	0.80	+0.16 overconfident
LegalBench (legal reasoning)	0.87	0.82	0.85	-0.05 calibrated
MathBench (competition math)	0.98	1.00	0.99	+0.02 calibrated
OmniMath (advanced math)	0.70	0.97	0.84	+0.27 overconfident
SciCode (scientific code)	0.56	0.75	0.68	+0.20 overconfident
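
The Leans column reads as confidence minus task accuracy, followed by a label. The sketch below recomputes it from the rounded values in the table. The 0.10 cutoff between "calibrated" and "overconfident" is an assumption chosen to reproduce the labels shown; the page does not state its threshold. The "underconfident" label is likewise hypothetical, since no row in the table uses it. Two recomputed gaps (MMLU-Pro and SciCode) come out 0.01 lower than the table, presumably because the page works from unrounded inputs.

```python
# Rounded (task accuracy, confidence) pairs copied from the table above.
ROWS = {
    "SQuAD": (0.57, 0.53),
    "MMLU-Pro": (0.65, 0.80),
    "LegalBench": (0.87, 0.82),
    "MathBench": (0.98, 1.00),
    "OmniMath": (0.70, 0.97),
    "SciCode": (0.56, 0.75),
}

# Hypothetical cutoff: the page does not say where "calibrated" ends.
THRESHOLD = 0.10


def lean(acc, conf, threshold=THRESHOLD):
    """Return (confidence - accuracy, label) for one benchmark row."""
    gap = round(conf - acc, 2)
    if gap > threshold:
        label = "overconfident"
    elif gap < -threshold:
        label = "underconfident"  # assumed label, not shown on the page
    else:
        label = "calibrated"
    return gap, label


for name, (acc, conf) in ROWS.items():
    gap, label = lean(acc, conf)
    print(f"{name:<10} {gap:+.2f} {label}")
```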

Figure legend: deepseek-r1 conditions; all other model/condition points; line of equal relative confidence and pass rate.

Match accuracy controls for the performance base-rate gap

deepseek-r1 pairs: 18 / 171
deepseek-r1 mean tau: +0.021
All-pairs mean: +0.037
deepseek-r1 pairs with p<0.05: 5 (27.8%)
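
The counts above are internally consistent, which a few lines of arithmetic can confirm. This is only a sanity check and assumes that the 171 pairs are all unordered pairs among n compared items. The page does not state n; the value is inferred here. The variable names are illustrative.

```python
from math import comb

TOTAL_PAIRS = 171      # all pairs reported on the page
TARGET_PAIRS = 18      # pairs that include deepseek-r1
SIGNIFICANT = 5        # deepseek-r1 pairs with p < 0.05

# Assumption: TOTAL_PAIRS counts every unordered pair among n items,
# so n is the smallest integer with C(n, 2) >= TOTAL_PAIRS.
n_models = next(n for n in range(2, 1000) if comb(n, 2) >= TOTAL_PAIRS)

# Any single item takes part in n - 1 of those pairs.
pairs_with_one_model = n_models - 1

share = SIGNIFICANT / TARGET_PAIRS

print(comb(n_models, 2) == TOTAL_PAIRS)      # exact match under the assumption
print(pairs_with_one_model == TARGET_PAIRS)  # agrees with the 18 reported
print(f"{share:.1%}")                        # share of significant pairs
```

Under that assumption n comes out at 19, which gives 18 pairs per item and matches the 18 / 171 figure; 5 of 18 is the 27.8% shown.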

Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; deepseek-r1 pair (filled = p<0.05); deepseek-r1 mean; all-pairs mean.

No curated cases are available for this selection yet; outcome-matrix extraction currently covers a sample of MMLU-Pro trials.