Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans (confidence − task acc) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.62 | 0.41 | 0.67 | -0.21 cautious |
| MMLU-Pro (knowledge) | 0.73 | 0.81 | 0.86 | +0.08 overconfident |
| LegalBench (legal reasoning) | 0.85 | 0.85 | 0.85 | +0.00 calibrated |
| MathBench (competition math) | 1.00 | 1.00 | 1.00 | +0.00 calibrated |
| OmniMath (advanced math) | 0.85 | 0.94 | 0.92 | +0.09 overconfident |
| SciCode (scientific code) | 0.58 | 0.65 | 0.74 | +0.07 overconfident |
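The "Leans" values in the table are consistent with confidence minus task accuracy (for SQuAD, 0.41 − 0.62 = −0.21). The sketch below reproduces the column under that reading. The function name `lean` and the `tol` dead band used to call a row "calibrated" are hypothetical; the source does not state how the labels are assigned.

```python
def lean(task_acc: float, confidence: float, tol: float = 0.005) -> tuple[float, str]:
    """Return (confidence - task_acc, label).

    `tol` is an assumed dead band around zero, not a value from the source.
    """
    gap = round(confidence - task_acc, 2)
    if gap > tol:
        label = "overconfident"
    elif gap < -tol:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


# (task accuracy, confidence) copied from the table above
rows = {
    "SQuAD": (0.62, 0.41),
    "MMLU-Pro": (0.73, 0.81),
    "LegalBench": (0.85, 0.85),
    "MathBench": (1.00, 1.00),
    "OmniMath": (0.85, 0.94),
    "SciCode": (0.58, 0.65),
}

for name, (acc, conf) in rows.items():
    gap, label = lean(acc, conf)
    print(f"{name}: {gap:+.2f} {label}")
```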
In the full cloud
Figure legend: gpt-5.2 conditions · all other model/condition points · equal relative confidence and pass rate
Pairwise signal: pairs involving gpt-5.2
Match accuracy controls for the performance base-rate gap
gpt-5.2 pairs: 18 / 171
gpt-5.2 mean tau: +0.022
All-pairs mean: +0.037
gpt-5.2 p<0.05: 5 (27.8%)
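The counts above fit together under one reading: 171 unordered pairs is what 19 models would produce (19 × 18 / 2), 18 of those pairs involve any single model, and 5 of 18 is 27.8%. The model count of 19 is inferred from the totals and is not stated in the source, so the following is only an arithmetic check under that assumption.

```python
from math import comb

n_models = 19  # assumption: inferred from the 171 total pairs, not stated in the source

total_pairs = comb(n_models, 2)      # all unordered model pairs
pairs_with_one_model = n_models - 1  # pairs that include one fixed model
significant = 5                      # gpt-5.2 pairs with p<0.05, from the text
share = significant / pairs_with_one_model

print(total_pairs, pairs_with_one_model, f"{share:.1%}")  # 171 18 27.8%
```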
Figure legend: all model pairs (observed) · base-rate-matched null · calibration-preserving null · gpt-5.2 pair (filled = p<0.05) · gpt-5.2 mean · all-pairs mean
The four metacognitive outcomes
↳ Turned out correct
↳ Turned out wrong
Competent: claimed it could, and could
No example in this selection.
Overconfident: claimed it could, but couldn't
No example in this selection.
Underconfident: declined, but could
gpt-5.2 · MMLU-Pro
What is the worldwide prevalence of obesity?
It said it couldn't answer: “The worldwide prevalence of obesity depends on the definition (adult BMI≥30 vs. children, age-standardized vs. crude), the year, and the data source (e.g., WHO vs. IHME). Without multiple-choice options or a specified reference, any single numeric answer could be mismatched.”
→ But it could. It answered “About 13% of adults worldwide are obese (roughly 1 in 8 people).”; expected “13%”.
Well-declined: declined, and couldn't
No example in this selection.
Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
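The row and column legend above amounts to a two-bit lookup: whether the model claimed it could answer, and whether its answer turned out correct. A minimal sketch of that mapping follows; the function and parameter names are hypothetical, and only the four outcome labels come from the page.

```python
def outcome(claimed_can_answer: bool, was_correct: bool) -> str:
    """Map (claimed, correct) to one of the four metacognitive outcomes."""
    if claimed_can_answer:
        # Top row: the model said it could answer.
        return "Competent" if was_correct else "Overconfident"
    # Bottom row: the model declined.
    return "Underconfident" if was_correct else "Well-declined"


# The obesity-prevalence example: the model declined, yet its answer matched.
print(outcome(claimed_can_answer=False, was_correct=True))  # Underconfident
```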