gpt-4o-mini | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for gpt-4o-mini, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.45	0.71	0.69	+0.25 overconfident
MMLU-Pro (knowledge)	0.40	0.87	0.59	+0.47 overconfident
LegalBench (legal reasoning)	0.85	0.45	0.60	-0.39 cautious
MathBench (competition math)	0.78	1.00	0.88	+0.22 overconfident
OmniMath (advanced math)	0.30	0.96	0.47	+0.66 overconfident
SciCode (scientific code)	0.37	0.39	0.56	+0.02 calibrated
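
The Leans column appears to be the signed gap between confidence and task accuracy. The sketch below recomputes it from the rounded values in the table, as a rough check rather than the dashboard's own method. The 0.05 band used to label a row "calibrated" is a hypothetical cutoff, since the page does not state one. Because the inputs are already rounded, the recomputed gaps for SQuAD and LegalBench differ from the listed values by 0.01.

```python
# Rounded values copied from the table: (task accuracy, confidence).
ROWS = {
    "SQuAD": (0.45, 0.71),
    "MMLU-Pro": (0.40, 0.87),
    "LegalBench": (0.85, 0.45),
    "MathBench": (0.78, 1.00),
    "OmniMath": (0.30, 0.96),
    "SciCode": (0.37, 0.39),
}

# Hypothetical cutoff: the page does not say where "calibrated" ends.
CALIBRATED_BAND = 0.05


def lean(task_acc, confidence, band=CALIBRATED_BAND):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if gap > band:
        label = "overconfident"
    elif gap < -band:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


for name, (acc, conf) in ROWS.items():
    gap, label = lean(acc, conf)
    print(f"{name:<11} {gap:+.2f} {label}")
```

With this band, every recomputed label matches the label listed in the table.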

[Figure: relative confidence vs. pass rate. Legend: gpt-4o-mini conditions; all other model/condition points; equal relative confidence and pass rate.]

Match accuracy controls for the performance base-rate gap

gpt-4o-mini pairs: 18 / 171
gpt-4o-mini mean tau: +0.007
All-pairs mean: +0.037
gpt-4o-mini p<0.05: 3 (16.7%)
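
The pair counts are consistent with 19 models compared pairwise, although the page does not state the model count, so treat that as an inference. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
from math import comb


def models_for_pairs(total_pairs):
    """Smallest n with C(n, 2) == total_pairs, or None if no such n exists."""
    n = 2
    while comb(n, 2) < total_pairs:
        n += 1
    return n if comb(n, 2) == total_pairs else None


def pairs_with_one_model(n_models):
    """Number of unordered pairs that include one fixed model."""
    return n_models - 1


n = models_for_pairs(171)                # 171 is the all-pairs count shown above
own_pairs = pairs_with_one_model(n)      # pairs that involve gpt-4o-mini
significant_share = round(100 * 3 / own_pairs, 1)  # 3 pairs reported at p<0.05

print(n, own_pairs, significant_share)
```

Under that assumption the sketch gives 19 models, 18 pairs involving gpt-4o-mini, and a 16.7% share at p<0.05, which agrees with the figures shown.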

[Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gpt-4o-mini pair (filled = p<0.05); gpt-4o-mini mean; all-pairs mean.]

No curated cases for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.