Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| MMLU-Pro (knowledge) | 0.66 | 0.79 | 0.81 | +0.13 overconfident |
| LegalBench (legal reasoning) | 0.87 | 0.96 | 0.93 | +0.10 overconfident |
| MathBench (competition math) | 1.00 | 1.00 | 1.00 | +0.00 calibrated |
| OmniMath (advanced math) | 0.88 | 0.98 | 0.94 | +0.10 overconfident |
| SciCode (scientific code) | 0.64 | 0.48 | 0.62 | -0.15 cautious |
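
The "Leans" column appears to be confidence minus task accuracy, labeled by sign. The sketch below recomputes it from the rounded columns above, under that assumption. The `lean` helper and its `tolerance` parameter are hypothetical names introduced for illustration and are not taken from the dashboard.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("MMLU-Pro", 0.66, 0.79),
    ("LegalBench", 0.87, 0.96),
    ("MathBench", 1.00, 1.00),
    ("OmniMath", 0.88, 0.98),
    ("SciCode", 0.64, 0.48),
]


def lean(accuracy, confidence, tolerance=0.005):
    """Assumed definition: gap = confidence - accuracy, labeled by its sign.

    `tolerance` is a hypothetical dead band around zero for "calibrated".
    """
    gap = round(confidence - accuracy, 2)
    if gap > tolerance:
        label = "overconfident"
    elif gap < -tolerance:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name}: {gap:+.2f} {label}")
```

Recomputed from the rounded columns, the labels agree with the table, while the LegalBench and SciCode gaps come out 0.01 away from the displayed values (+0.09 and -0.16). That would be consistent with the dashboard computing the gap from unrounded inputs, though the source does not confirm this.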
In the full cloud
[Scatter plot legend: gemini-3.1-pro-preview conditions; all other model/condition points; reference line where relative confidence equals pass rate]
Pairwise signal: pairs involving gemini-3.1-pro-preview
Matching on accuracy controls for the base-rate gap in performance.
| Statistic | Value |
|---|---|
| gemini-3.1-pro-preview pairs | 19 / 190 |
| gemini-3.1-pro-preview mean tau | +0.083 |
| All-pairs mean | +0.041 |
| gemini-3.1-pro-preview pairs with p<0.05 | 12 (63.2%) |
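
The counts in the summary above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes 190 is the number of unordered pairs among all compared entries, so that one entry takes part in 19 of them. The variable names are illustrative only.

```python
from math import comb

# Values copied from the pairwise summary above.
pairs_with_model = 19
total_pairs = 190
significant_pairs = 12

# Assumption: each entry is paired once with every other entry,
# so an entry in 19 pairs implies 20 entries overall.
n_entries = pairs_with_model + 1
assert comb(n_entries, 2) == total_pairs

# Share of this model's pairs that reach p<0.05.
share_significant = significant_pairs / pairs_with_model
print(f"{share_significant:.1%}")
```

Under that assumption the figures are internally consistent: 20 entries give 190 unordered pairs, and 12 of 19 matches the reported percentage.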
[Plot legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gemini-3.1-pro-preview pair (filled = p<0.05); gemini-3.1-pro-preview mean; all-pairs mean]
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.