OpenAI · GPT: individual model view

gpt-5.2

Appears in 6 benchmarks
Mean lean (confidence − pass rate): +0.01
5/6 benchmarks lean overconfident (prospective probe)

Positioning spread: every benchmark, one model

Figure: performance vs. confidence for gpt-5.2, per benchmark (prospective probe). Both are plotted on a 0.00 to 1.00 scale; a red gap marks overconfidence.
Benchmark                      Task acc  Confidence  F₁    Leans
SQuAD (factual recall)         0.62      0.41        0.67  -0.21 cautious
MMLU-Pro (knowledge)           0.73      0.81        0.86  +0.08 overconfident
LegalBench (legal reasoning)   0.85      0.85        0.85  +0.00 calibrated
MathBench (competition math)   1.00      1.00        1.00  +0.00 calibrated
OmniMath (advanced math)       0.85      0.94        0.92  +0.09 overconfident
SciCode (scientific code)      0.58      0.65        0.74  +0.07 overconfident
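
The "Leans" column can be reproduced from the table as confidence minus task accuracy. The sketch below is only an illustration: the sign-based labels and the rounding to two decimals are assumptions, since the page does not state its thresholds, and the names `ROWS`, `lean`, and `label` are hypothetical.

```python
# (task accuracy, confidence) pairs copied from the table above.
ROWS = {
    "SQuAD": (0.62, 0.41),
    "MMLU-Pro": (0.73, 0.81),
    "LegalBench": (0.85, 0.85),
    "MathBench": (1.00, 1.00),
    "OmniMath": (0.85, 0.94),
    "SciCode": (0.58, 0.65),
}


def lean(task_acc: float, confidence: float) -> float:
    """Lean = confidence - task accuracy, rounded to the table's precision."""
    return round(confidence - task_acc, 2)


def label(value: float) -> str:
    """Assumption: the label simply follows the sign of the rounded lean."""
    if value > 0:
        return "overconfident"
    if value < 0:
        return "cautious"
    return "calibrated"


leans = {name: lean(acc, conf) for name, (acc, conf) in ROWS.items()}
mean_lean = sum(leans.values()) / len(leans)

for name, value in leans.items():
    print(f"{name:<12} {value:+.2f} {label(value)}")
print(f"mean lean from rounded table values: {mean_lean:+.3f}")
```

Because the table values are already rounded, the mean computed here only approximates the headline figure, which is presumably derived from unrounded data.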

In the full cloud

Figure: scatter of all model/condition points. x-axis: performance z-score within benchmark/probe (-3 to 3); y-axis: confidence z-score (-3 to 3). Legend: gpt-5.2 conditions; all other model/condition points; equal relative confidence and pass rate.
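
The axes above are z-scores computed within each benchmark/probe, so a point shows how a model compares with the others on the same benchmark. A minimal sketch of that standardization follows; the use of the population standard deviation is an assumption (the page does not say which variant it uses), and the input values are made-up placeholders, not data from this page.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize values within one benchmark: (v - mean) / std."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std, not sample std
    if sigma == 0:
        # All models identical on this benchmark: no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical pass rates of three models on one benchmark.
toy_pass_rates = [1.0, 2.0, 3.0]
print(z_scores(toy_pass_rates))
```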

Pairwise signal: pairs involving gpt-5.2

Match accuracy controls for the performance base-rate gap
gpt-5.2 pairs: 18 / 171
gpt-5.2 mean tau: +0.022
All-pairs mean: +0.037
gpt-5.2 p<0.05: 5 (27.8%)
Figure: pair signal, asking whether confidence gaps rank performance gaps (Kendall tau-b, axis -1.0 to 1.0). Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gpt-5.2 pair (filled = p<0.05); gpt-5.2 mean; all-pairs mean.
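
The pair signal is reported as Kendall tau-b, a rank correlation that corrects for ties. The sketch below is a plain O(n²) implementation of the standard tau-b formula, not the page's own code; how the page builds the two sequences for a model pair is not stated, so the inputs here are made-up placeholders.

```python
from math import sqrt


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C/D count concordant/discordant pairs, n0 is the number of pairs,
    n1 and n2 count pairs tied in x and in y respectively.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if (dx > 0) == (dy > 0):
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    denom = sqrt((n0 - ties_x) * (n0 - ties_y))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Hypothetical confidence gaps and performance gaps for one model pair.
toy_conf_gaps = [0.10, -0.20, 0.05, 0.30]
toy_perf_gaps = [0.08, -0.10, 0.02, 0.25]
print(kendall_tau_b(toy_conf_gaps, toy_perf_gaps))
```

A value near +1 means larger confidence gaps go with larger performance gaps; values near 0, like the means reported above, indicate little ranking signal.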

The four metacognitive outcomes

Competent: claimed it could, and could

No example in this selection.

Overconfident: claimed it could, but couldn't

No example in this selection.

Underconfident: declined, but could
gpt-5.2 · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it couldn't answer: “The worldwide prevalence of obesity depends on the definition (adult BMI≥30 vs. children, age-standardized vs. crude), the year, and the data source (e.g., WHO vs. IHME). Without multiple-choice options or a specified reference, any single numeric answer could be mismatched.”

→ But it could. It answered “About 13% of adults worldwide are obese (roughly 1 in 8 people).”; expected “13%”.

Well-declined: declined, and couldn't

No example in this selection.

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
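
The 2×2 grid described above maps directly onto two booleans. As a small illustration (the function name `outcome` is hypothetical, not part of the page):

```python
def outcome(claimed_it_could: bool, was_correct: bool) -> str:
    """Map the (claim, correctness) pair to one of the four outcomes."""
    if claimed_it_could:
        return "Competent" if was_correct else "Overconfident"
    return "Underconfident" if was_correct else "Well-declined"


# The MMLU-Pro example above: the model declined, yet answered correctly.
print(outcome(claimed_it_could=False, was_correct=True))
```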