Operating-point board
MathBench (competition math) · Prospective · Fβ β
| # | Model | Fβ | Prec | Rec | Task acc |
|---|---|---|---|---|---|
| 1 | gemini-3.1-pro-preview | 1.000 | 1.00 | 1.00 | 1.00 |
| 2 | gpt-5.2 | 0.998 | 1.00 | 1.00 | 1.00 |
| 3 | gemini-3-flash-preview | 0.991 | 0.98 | 1.00 | 0.98 |
| 4 | deepseek-r1 | 0.988 | 0.98 | 1.00 | 0.98 |
| 5 | gemini-2.5-flash | 0.973 | 0.95 | 1.00 | 0.95 |
| 6 | claude-haiku-4.5 | 0.972 | 0.95 | 1.00 | 0.95 |
| 7 | claude-sonnet-4.5 | 0.967 | 0.94 | 1.00 | 0.93 |
| 8 | deepseek-chat | 0.962 | 0.95 | 0.97 | 0.95 |
| 9 | gemini-2.5-pro | 0.955 | 0.91 | 1.00 | 0.91 |
| 10 | gemini-2.0-flash-001 | 0.945 | 0.90 | 1.00 | 0.89 |
| 11 | mistral-small-3.2-24b-instruct | 0.923 | 0.87 | 0.98 | 0.87 |
| 12 | mistral-medium-3.1 | 0.916 | 0.95 | 0.88 | 0.94 |
| 13 | qwen-2.5-72b-instruct | 0.910 | 0.87 | 0.95 | 0.83 |
| 14 | gpt-4o | 0.894 | 0.81 | 1.00 | 0.80 |
| 15 | gpt-4o-mini | 0.878 | 0.78 | 1.00 | 0.78 |
| 16 | llama-3.3-70b-instruct | 0.853 | 0.75 | 0.99 | 0.74 |
| 17 | llama-3.1-70b-instruct | 0.782 | 0.65 | 0.99 | 0.63 |
| 18 | claude-3-haiku | 0.595 | 0.43 | 0.98 | 0.41 |
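The Fβ column combines the Prec and Rec columns as a weighted harmonic mean. A minimal sketch of the standard formula follows; the β used by the board is not stated in this extract, so `beta=1.0` is an assumption, and `f_beta` is an illustrative helper rather than the board's own code. Because Prec and Rec are shown rounded to two decimals, recomputed values will only approximate the three-decimal Fβ column.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    beta=1.0 is an assumed default; the board's actual beta is not given.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention: avoid division by zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Rounded Prec/Rec taken from the gpt-4o row of the table above.
score = f_beta(0.81, 1.00, beta=1.0)
print(round(score, 3))
```

A model that commits to "confident" on every item has recall 1.00 and precision equal to its task accuracy, which is the pattern visible in many rows above.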
Pairwise signal on MathBench
| Metric | Value | Note |
|---|---|---|
| Observed correlation | 0.072 | mean τ-b · 0 = no signal |
| Baseline correlation | 0.000 | permutation null · ≈ 0 |
| Model pairs | 135 | unit = pair, not model |
| Significant pairs | 28.9% | p<0.05 after FDR · not effect size |
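The "Significant pairs" figure counts pairs whose p-value survives a false-discovery-rate correction. The specific FDR procedure is not named here; the sketch below assumes the common Benjamini-Hochberg step-up rule, and `benjamini_hochberg` and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumed procedure (Benjamini-Hochberg step-up): sort the p-values,
    find the largest rank k with p_(k) <= (k / m) * alpha, and reject
    every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest passing rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Made-up p-values, one per model pair (not data from the board).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
flags = benjamini_hochberg(example_p)
share_significant = sum(flags) / len(flags)
```

As the note in the table says, the resulting share measures how many pairs clear the threshold, not how large the correlation is in any of them.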
Figure legend: observed model pairs · base-rate-matched null · calibration-preserving null. 5%-95% range of observed values: -0.188 to 0.280.
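The observed and baseline correlations above are Kendall τ-b values compared against a permutation null. A minimal sketch of that comparison for a single pair follows, under stated assumptions: `x` and `y` stand for whatever two per-item vectors are correlated for a model pair (not specified in this extract), and the null is a plain shuffle of `y`, which is simpler than the base-rate-matched and calibration-preserving nulls named in the legend.

```python
import math
import random


def kendall_tau_b(x, y):
    """Kendall tau-b with tie correction, O(n^2) pairwise version."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from every term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    cd = concordant + discordant
    denom = math.sqrt((cd + tied_x_only) * (cd + tied_y_only))
    if denom == 0:
        return 0.0  # undefined when a vector is constant; 0.0 is a convention here
    return (concordant - discordant) / denom


def permutation_null(x, y, n_perm=1000, seed=0):
    """Return (observed tau-b, mean tau-b under shuffling, two-sided p-value)."""
    rng = random.Random(seed)
    observed = kendall_tau_b(x, y)
    shuffled = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any item-level alignment
        null.append(kendall_tau_b(x, shuffled))
    extreme = sum(abs(t) >= abs(observed) for t in null)
    p_value = (1 + extreme) / (n_perm + 1)
    return observed, sum(null) / n_perm, p_value
```

Averaging the observed τ-b over all pairs would give a single summary figure, and the shuffled mean sits near zero by construction, which is why the baseline reads ≈ 0.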
Confidence structure: the shared-difficulty factor
Information the model doesn’t use
| Benchmark | Model | Internal | Population | Same-context text | Post-hoc text |
|---|---|---|---|---|---|
| MathBench | GPT-4o | 0.630 | 0.742 | 0.703 | 0.773 |
Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
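For readers unfamiliar with the metric, an ROC AUC is the probability that a randomly chosen correct trial receives a higher score than a randomly chosen incorrect one, with ties counted as one half. A minimal sketch of that definition follows; `roc_auc`, `scores`, and `labels` are illustrative names, and the example values are made up rather than taken from the paper.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison (Mann-Whitney formulation).

    scores: confidence-like values, higher = more likely correct.
    labels: truthy where the answer was actually correct.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: two correct and two incorrect trials.
auc = roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

A value of 0.5 means the score carries no information about correctness, so each column in the table can be read as distance above that floor.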
The four metacognitive outcomes on MathBench
No curated cases for this selection yet. Outcome-matrix extraction currently covers only a sample of MMLU-Pro trials.