Positioning spread: every benchmark, one model
| Benchmark | Task accuracy | Confidence | F₁ | Lean (confidence − task accuracy) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.58 | 0.68 | 0.76 | +0.10 overconfident |
| MMLU-Pro (knowledge) | 0.67 | 0.87 | 0.82 | +0.20 overconfident |
| LegalBench (legal reasoning) | 0.87 | 0.79 | 0.82 | -0.09 cautious |
| MathBench (competition math) | 0.91 | 1.00 | 0.95 | +0.09 overconfident |
| OmniMath (advanced math) | 0.70 | 0.99 | 0.83 | +0.29 overconfident |
| SciCode (scientific code) | 0.61 | 0.38 | 0.54 | -0.23 cautious |
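
The last column appears to be the signed gap between the confidence and task-accuracy columns. The sketch below recomputes it from the rounded values in the table, on that assumption. The function name `lean` and the "calibrated" label for a zero gap are hypothetical, added only for illustration. Because the inputs are already rounded to two decimals, a recomputed gap can differ from the published one by 0.01 (LegalBench comes out at -0.08 rather than -0.09).

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.58, 0.68),
    ("MMLU-Pro", 0.67, 0.87),
    ("LegalBench", 0.87, 0.79),
    ("MathBench", 0.91, 1.00),
    ("OmniMath", 0.70, 0.99),
    ("SciCode", 0.61, 0.38),
]


def lean(accuracy, confidence):
    """Return the signed gap (confidence minus accuracy) and a direction label.

    Assumption: this mirrors the table's last column; the 'calibrated' label
    for an exact zero gap is hypothetical and does not appear in the source.
    """
    gap = round(confidence - accuracy, 2)
    if gap > 0:
        label = "overconfident"
    elif gap < 0:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name:<11} {gap:+.2f} {label}")
```

A positive gap means the stated confidence exceeds the measured accuracy; a negative gap means the model is more accurate than it claims to be.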
In the full cloud
Scatter plot legend: gemini-2.5-pro conditions; all other model/condition points; line of equal relative confidence and pass rate.
Pairwise signal: pairs involving gemini-2.5-pro
Match accuracy controls for the performance base-rate gap
gemini-2.5-pro pairs: 18 / 171
gemini-2.5-pro mean tau: +0.053
All-pairs mean: +0.037
gemini-2.5-pro p<0.05: 12 (66.7%)
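
The pair counts above are consistent with a pool of 19 models: one model takes part in 18 pairs, and 19 models form 171 unordered pairs in total. The model count is an inference from those two figures, not something the page states, and `pair_counts` is a hypothetical helper. The sketch also checks that 12 significant pairs out of 18 gives the reported 66.7%.

```python
from math import comb


def pair_counts(n_models):
    """Return (pairs involving one given model, all unordered model pairs)."""
    return n_models - 1, comb(n_models, 2)


# Assumption: 19 models, inferred from the reported "18 / 171".
n_models = 19
involving, total = pair_counts(n_models)

significant = 12  # gemini-2.5-pro pairs with p<0.05, as reported
share = 100 * significant / involving

print(f"{involving} / {total} pairs involve the model")
print(f"{significant} of {involving} significant = {share:.1f}%")
```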
Chart legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gemini-2.5-pro pair (filled = p<0.05); gemini-2.5-pro mean; all-pairs mean.
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.