Cross-benchmark trait map
First signal: confidence is not a private performance readout
All points: r = 0.08 (aggregate fit over pooled model/condition points)
Unaided means the model answers without seeing the multiple-choice options.
Forecast probe
Prospective: asked before attempting.
Counterfactual: asked after.
β < 1: precision matters, so overconfidence is weighted more heavily.
β > 1: recall matters, so underconfidence is weighted more heavily.
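The effect of β can be made concrete with the standard Fβ definition, Fβ = (1 + β²)·P·R / (β²·P + R). The sketch below is illustrative only: the function name `f_beta` is hypothetical, and the precision and recall values are copied from the first row of the ranking table. With β = 1 it reproduces that row's listed score of 0.861, which suggests the table is displayed at β = 1.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall taken from the gpt-5.2 row of the ranking table.
precision, recall = 0.819, 0.908

score_precision_heavy = f_beta(precision, recall, beta=0.5)  # beta < 1
score_balanced = f_beta(precision, recall, beta=1.0)         # beta = 1
score_recall_heavy = f_beta(precision, recall, beta=2.0)     # beta > 1

for label, value in [
    ("beta=0.5", score_precision_heavy),
    ("beta=1.0", score_balanced),
    ("beta=2.0", score_recall_heavy),
]:
    print(f"{label}: {value:.3f}")
```

Because this row has recall above precision, moving β below 1 lowers its score and moving β above 1 raises it. A model with precision above recall would move in the opposite direction.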
Task-accuracy band: 0.00–1.00 (showing 20 of 20 models in band)
Operating-point ranking
| # | Model | Fβ | Precision | Recall | Task acc |
|---|---|---|---|---|---|
| 01 | gpt-5.2 | 0.861 | 0.819 | 0.908 | 0.728 |
| 02 | gemini-2.5-pro | 0.822 | 0.727 | 0.945 | 0.672 |
| 03 | claude-sonnet-4.5 | 0.819 | 0.775 | 0.869 | 0.702 |
| 04 | gemini-3.1-pro-preview | 0.814 | 0.748 | 0.893 | 0.662 |
| 05 | deepseek-r1 | 0.799 | 0.722 | 0.895 | 0.648 |
| 06 | gemini-3-flash-preview | 0.778 | 0.646 | 0.977 | 0.609 |
| 07 | claude-haiku-4.5 | 0.744 | 0.728 | 0.760 | 0.646 |
| 08 | claude-3.5-sonnet | 0.705 | 0.583 | 0.891 | 0.541 |
| 09 | deepseek-chat | 0.693 | 0.547 | 0.947 | 0.514 |
| 10 | gpt-4o | 0.659 | 0.508 | 0.938 | 0.470 |
| 11 | mistral-medium-3.1 | 0.648 | 0.517 | 0.868 | 0.477 |
| 12 | gemini-2.5-flash | 0.636 | 0.476 | 0.960 | 0.455 |
| 13 | gemini-2.0-flash-001 | 0.623 | 0.453 | 0.996 | 0.449 |
| 14 | gpt-4o-mini | 0.595 | 0.433 | 0.949 | 0.396 |
| 15 | mistral-small-3.2-24b-instruct | 0.573 | 0.410 | 0.951 | 0.388 |
| 16 | qwen-2.5-72b-instruct | 0.570 | 0.455 | 0.763 | 0.414 |
| 17 | llama-3.1-70b-instruct | 0.549 | 0.385 | 0.955 | 0.364 |
| 18 | llama-3.3-70b-instruct | 0.525 | 0.364 | 0.944 | 0.343 |
| 19 | claude-3-haiku | 0.512 | 0.354 | 0.926 | 0.352 |
| 20 | qwen-2.5-coder-32b-instruct | 0.405 | 0.254 | 1.000 | 0.254 |
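As a rough illustration of how the Precision, Recall and Task acc columns could relate, the sketch below assumes the positive class is "the model answered correctly" and that each forecast is a yes/no prediction of that outcome. This reading is an assumption, not something the page states, and `forecast_metrics` is a hypothetical helper. Under that assumption, a model that always forecasts success gets recall 1.0 and precision equal to its task accuracy, which matches the pattern in the last row of the table (precision 0.254, recall 1.000, task accuracy 0.254).

```python
def forecast_metrics(predicted_correct, actually_correct):
    """Return (precision, recall, task_accuracy) for yes/no success forecasts.

    Assumption: the positive class is "the model answered correctly".
    """
    pairs = list(zip(predicted_correct, actually_correct))
    tp = sum(1 for p, a in pairs if p and a)        # forecast yes, was right
    fp = sum(1 for p, a in pairs if p and not a)    # forecast yes, was wrong
    fn = sum(1 for p, a in pairs if not p and a)    # forecast no, was right
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    task_acc = sum(1 for _, a in pairs if a) / len(pairs)
    return precision, recall, task_acc


# Toy data: four items, one answered correctly, and a forecaster that
# always says "I will get this right".
outcomes = [True, False, False, False]
always_yes = [True] * len(outcomes)

precision, recall, task_acc = forecast_metrics(always_yes, outcomes)
print(precision, recall, task_acc)
```

A high recall paired with precision close to task accuracy therefore indicates a forecaster that rarely predicts failure, rather than one that discriminates well.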
Observed correlation: 0.037
Baseline correlation: 0.000
Model pairs: 171
Significant pairs: 46.2%
Distributions shown: observed model pairs, base-rate-matched null, calibration-preserving null
5%–95% observed range: -0.026 to 0.127
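The page does not say how its null distributions are built. One plausible reading of a base-rate-matched null is sketched below as an assumption, not as the site's method: shuffle one model's per-item yes/no signals so its overall rate is preserved while any item-level alignment with the other model is destroyed, then recompute the pairwise correlation. All names and the synthetic data are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (vx * vy) ** 0.5


def shuffled_null(x, y, n_draws=500, seed=0):
    """Correlations after shuffling y, which keeps y's base rate fixed."""
    rng = random.Random(seed)
    y_perm = list(y)
    draws = []
    for _ in range(n_draws):
        rng.shuffle(y_perm)
        draws.append(pearson(x, y_perm))
    return draws


# Synthetic per-item binary signals for two hypothetical models.
rng = random.Random(1)
model_a = [1 if rng.random() < 0.7 else 0 for _ in range(400)]
model_b = [1 if rng.random() < 0.5 else 0 for _ in range(400)]

observed = pearson(model_a, model_b)
null = sorted(shuffled_null(model_a, model_b))
null_mean = sum(null) / len(null)
lo, hi = null[int(0.05 * len(null))], null[int(0.95 * len(null))]

print(f"observed r = {observed:.3f}")
print(f"null mean = {null_mean:.3f}, 5%-95% null range = {lo:.3f} to {hi:.3f}")
```

Under this reading, a pair would count as significant when its observed correlation falls outside the range that the shuffled draws produce. The actual test and threshold used on the page may differ.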
Per-benchmark AUCs (internal vs population vs text classifier) are on each benchmark page (linked from the leaderboard above).