gpt-5.2 | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for gpt-5.2, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.62	0.41	0.67	-0.21 cautious
MMLU-Pro (knowledge)	0.73	0.81	0.86	+0.08 overconfident
LegalBench (legal reasoning)	0.85	0.85	0.85	+0.00 calibrated
MathBench (competition math)	1.00	1.00	1.00	+0.00 calibrated
OmniMath (advanced math)	0.85	0.94	0.92	+0.09 overconfident
SciCode (scientific code)	0.58	0.65	0.74	+0.07 overconfident
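
The Leans column in the table above matches confidence minus task accuracy in every row. A minimal sketch of that arithmetic follows. The function name `lean` and the tolerance `tol` are hypothetical; the page does not say what threshold separates "calibrated" from the other labels, so the value here is an assumption.

```python
def lean(task_acc: float, confidence: float, tol: float = 0.005) -> tuple[float, str]:
    """Return (confidence - accuracy, label).

    `tol` is an assumed cutoff for "calibrated"; the source page does not state one.
    """
    gap = round(confidence - task_acc, 2)
    if abs(gap) < tol:
        return gap, "calibrated"
    return gap, "overconfident" if gap > 0 else "cautious"


# (task accuracy, confidence) pairs copied from the table
rows = {
    "SQuAD": (0.62, 0.41),
    "MMLU-Pro": (0.73, 0.81),
    "LegalBench": (0.85, 0.85),
    "MathBench": (1.00, 1.00),
    "OmniMath": (0.85, 0.94),
    "SciCode": (0.58, 0.65),
}

for name, (acc, conf) in rows.items():
    gap, label = lean(acc, conf)
    print(f"{name:<12} {gap:+.2f} {label}")
```

A positive gap means stated confidence exceeds measured accuracy; a negative gap means the model undersold itself.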

In the full cloud

Legend: gpt-5.2 conditions · all other model/condition points · equal relative confidence and pass rate

Pairwise signal: pairs involving gpt-5.2

Match accuracy controls for the performance base-rate gap

gpt-5.2 pairs

18 / 171

gpt-5.2 mean tau

+0.022

All-pairs mean

+0.037

gpt-5.2 p<0.05

5 (27.8%)
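
The pair counts above are consistent with each other if 171 is the number of unordered model pairs. A short check of the arithmetic, assuming 19 models (an inference from 171 = C(19, 2); the page does not state the model count):

```python
from math import comb

n_models = 19  # assumption: inferred from 171 = C(19, 2), not stated on the page

all_pairs = comb(n_models, 2)        # unordered pairs among all models
pairs_with_one_model = n_models - 1  # pairs that include one given model
significant = 5                      # gpt-5.2 pairs with p<0.05, from the page

share = significant / pairs_with_one_model
print(all_pairs, pairs_with_one_model, f"{share:.1%}")  # 171 18 27.8%
```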

Legend: all model pairs (observed) · base-rate-matched null · calibration-preserving null · gpt-5.2 pair (filled = p<0.05) · gpt-5.2 mean · all-pairs mean

The four metacognitive outcomes

↳ Turned out correct

↳ Turned out wrong

Competent: claimed it could, and could

No example in this selection.

Overconfident: claimed it could, but couldn’t

No example in this selection.

Underconfident: declined, but could

gpt-5.2 · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it couldn't answer: “The worldwide prevalence of obesity depends on the definition (adult BMI≥30 vs. children, age-standardized vs. crude), the year, and the data source (e.g., WHO vs. IHME). Without multiple-choice options or a specified reference, any single numeric answer could be mismatched.”

→ But it could. It answered “About 13% of adults worldwide are obese (roughly 1 in 8 people).”; expected “13%”.

Well-declined: declined, and couldn’t

No example in this selection.

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
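
The two-by-two grid described above can be sketched as a small lookup. The function and parameter names are hypothetical; only the four outcome labels come from the page.

```python
def outcome(claimed_it_could: bool, was_correct: bool) -> str:
    """Map a prospective claim and the eventual result to one of the four outcomes."""
    if claimed_it_could:
        return "Competent" if was_correct else "Overconfident"
    return "Underconfident" if was_correct else "Well-declined"


# The MMLU-Pro obesity example: the model declined, then answered correctly.
print(outcome(claimed_it_could=False, was_correct=True))  # Underconfident
```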
