Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans (confidence − task acc) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.45 | 0.71 | 0.69 | +0.25 overconfident |
| MMLU-Pro (knowledge) | 0.40 | 0.87 | 0.59 | +0.47 overconfident |
| LegalBench (legal reasoning) | 0.85 | 0.45 | 0.60 | -0.39 cautious |
| MathBench (competition math) | 0.78 | 1.00 | 0.88 | +0.22 overconfident |
| OmniMath (advanced math) | 0.30 | 0.96 | 0.47 | +0.66 overconfident |
| SciCode (scientific code) | 0.37 | 0.39 | 0.56 | +0.02 calibrated |
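The Leans values appear to be confidence minus task accuracy. The sketch below recomputes them from the rounded table entries under that assumption. The `CALIBRATED_BAND` cutoff of 0.05 is hypothetical: the page does not say where "calibrated" ends, only that +0.02 counts as calibrated and +0.22 does not. Because the inputs are already rounded, a recomputed gap can differ from the displayed one by 0.01 (SQuAD and LegalBench do).

```python
# Rounded values copied from the table: (task accuracy, confidence).
ROWS = {
    "SQuAD": (0.45, 0.71),
    "MMLU-Pro": (0.40, 0.87),
    "LegalBench": (0.85, 0.45),
    "MathBench": (0.78, 1.00),
    "OmniMath": (0.30, 0.96),
    "SciCode": (0.37, 0.39),
}

# Hypothetical cutoff; the source does not state the actual threshold.
CALIBRATED_BAND = 0.05


def lean(task_acc, confidence):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if abs(gap) <= CALIBRATED_BAND:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        label = "cautious"
    return gap, label


for name, (acc, conf) in ROWS.items():
    gap, label = lean(acc, conf)
    print(f"{name}: {gap:+.2f} {label}")
```

With this cutoff the labels match the table for all six benchmarks. Any threshold between 0.02 and 0.22 would give the same labels for these rows.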
In the full cloud
Legend: gpt-4o-mini conditions; all other model/condition points; equal relative confidence and pass rate
Pairwise signal: pairs involving gpt-4o-mini
Matching on accuracy controls for the performance base-rate gap
| Statistic | Value |
|---|---|
| gpt-4o-mini pairs | 18 / 171 |
| gpt-4o-mini mean tau | +0.007 |
| All-pairs mean tau | +0.037 |
| gpt-4o-mini pairs with p<0.05 | 3 (16.7%) |
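The pair counts above are consistent with each other if the 171 counts unordered pairs of distinct models. That reading is an assumption, as is the inferred total of 19 models; neither is stated on the page. A minimal check of the arithmetic:

```python
from math import comb

# Inferred, not stated in the source: 19 is the only n with comb(n, 2) == 171.
n_models = 19

total_pairs = comb(n_models, 2)        # all unordered model pairs
pairs_with_one_model = n_models - 1    # pairs that include gpt-4o-mini
significant = 3                        # gpt-4o-mini pairs with p<0.05, from the page
share = significant / pairs_with_one_model

print(total_pairs, pairs_with_one_model, f"{share:.1%}")
# prints: 171 18 16.7%
```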
Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gpt-4o-mini pair (filled = p<0.05); gpt-4o-mini mean; all-pairs mean
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.