Cross-benchmark trait map

First signal: confidence is not a private performance readout

Figure: confidence z-score vs. performance z-score, each standardized within benchmark/probe (both axes run from -3 to 3). Points are pooled model/condition points with an aggregate fit; the reference line marks equal relative confidence and pass rate. All points: overall r = 0.08.
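As a rough illustration of the quantity plotted above, the sketch below standardizes confidence and performance within each benchmark and then computes a pooled Pearson r. The row layout, the field names (`benchmark`, `confidence`, `performance`), and the sample values are assumptions made for the example, not the page's actual data or pipeline.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def zscore_within_groups(rows, key, group_key="benchmark"):
    """Standardize rows[i][key] against the mean and SD of its own group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[key])
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}
    scores = []
    for row in rows:
        mu, sd = stats[row[group_key]]
        # A group with no spread carries no relative signal; map it to 0.
        scores.append((row[key] - mu) / sd if sd > 0 else 0.0)
    return scores


def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical points: confidence tracks performance in B but not in A.
rows = [
    {"benchmark": "A", "confidence": 0.9, "performance": 0.5},
    {"benchmark": "A", "confidence": 0.7, "performance": 0.7},
    {"benchmark": "B", "confidence": 0.6, "performance": 0.2},
    {"benchmark": "B", "confidence": 0.8, "performance": 0.4},
]
conf_z = zscore_within_groups(rows, "confidence")
perf_z = zscore_within_groups(rows, "performance")
r = pearson_r(conf_z, perf_z)
```

In this toy data the two benchmarks cancel out, so the pooled r is approximately zero even though each benchmark shows a perfect (positive or negative) relationship on its own.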
Unaided means: answer without seeing the multiple-choice options.
Open the MMLU-Pro page →
Forecast probe
Prospective: asked before attempting. Counterfactual: asked after.
β < 1 weights overconfidence (precision matters); β > 1 weights underconfidence (recall matters).
Task-accuracy band: 0.00–1.00. Showing 20 / 20 models in band.

Operating-point ranking

#   Model                           Fβ     Precision  Recall  Task acc
01  gpt-5.2                         0.861  0.819      0.908   0.728
02  gemini-2.5-pro                  0.822  0.727      0.945   0.672
03  claude-sonnet-4.5               0.819  0.775      0.869   0.702
04  gemini-3.1-pro-preview          0.814  0.748      0.893   0.662
05  deepseek-r1                     0.799  0.722      0.895   0.648
06  gemini-3-flash-preview          0.778  0.646      0.977   0.609
07  claude-haiku-4.5                0.744  0.728      0.760   0.646
08  claude-3.5-sonnet               0.705  0.583      0.891   0.541
09  deepseek-chat                   0.693  0.547      0.947   0.514
10  gpt-4o                          0.659  0.508      0.938   0.470
11  mistral-medium-3.1              0.648  0.517      0.868   0.477
12  gemini-2.5-flash                0.636  0.476      0.960   0.455
13  gemini-2.0-flash-001            0.623  0.453      0.996   0.449
14  gpt-4o-mini                     0.595  0.433      0.949   0.396
15  mistral-small-3.2-24b-instruct  0.573  0.410      0.951   0.388
16  qwen-2.5-72b-instruct           0.570  0.455      0.763   0.414
17  llama-3.1-70b-instruct          0.549  0.385      0.955   0.364
18  llama-3.3-70b-instruct          0.525  0.364      0.944   0.343
19  claude-3-haiku                  0.512  0.354      0.926   0.352
20  qwen-2.5-coder-32b-instruct     0.405  0.254      1.000   0.254
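The Fβ column can be reproduced from the precision and recall columns with the standard F-beta formula. The sketch below uses that textbook definition; the function name `f_beta` is chosen here for illustration. The rows spot-checked against it match β = 1, which suggests (but does not confirm) that the table was captured at the slider's midpoint.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: weighted harmonic mean of precision and recall.

    beta < 1 puts more weight on precision, beta > 1 on recall.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Values copied from the ranking above.
top_row = round(f_beta(0.819, 0.908), 3)      # gpt-5.2
bottom_row = round(f_beta(0.254, 1.000), 3)   # qwen-2.5-coder-32b-instruct

# Moving beta away from 1 changes how the same two numbers are scored.
precision_weighted = f_beta(0.254, 1.000, beta=0.5)
recall_weighted = f_beta(0.254, 1.000, beta=2.0)
```

The last row is a useful sanity case: recall of 1.000 with precision equal to task accuracy (0.254) is what one would expect from a model that predicts success on every item, so its score rises as β moves toward recall and falls as β moves toward precision.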
Observed correlation: 0.037
Baseline correlation: 0.000
Model pairs: 171
Significant pairs: 46.2%
Figure: pair signal: do confidence gaps rank performance gaps? (Kendall tau-b, axis from -1.0 to 1.0). Bands: confidence anti-predicts performance, none, usable, strong. Markers: calibration null mean, base-rate null mean, observed mean. Series: observed model pairs, base-rate-matched null, calibration-preserving null. Observed 5%-95% range: -0.026 to 0.127.
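For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of Kendall tau-b, the tie-corrected rank correlation named in the figure. How the page builds the confidence-gap and performance-gap vectors for each model pair is not stated, so the sample inputs are hypothetical. The pair count of 171 is what 19 models would produce (19 choose 2), which is an inference rather than something the page states.

```python
from itertools import combinations
from math import comb, sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count pairs
    tied only in x or only in y. Pairs tied in both are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical gaps between two models across four items.
confidence_gap = [1, 2, 2, 3]
performance_gap = [1, 2, 3, 3]
tau = kendall_tau_b(confidence_gap, performance_gap)

n_pairs = comb(19, 2)  # unordered pairs among 19 models
```

A quadratic loop is fine at this scale; for long vectors a library routine such as `scipy.stats.kendalltau` (which defaults to the tau-b variant) would be the more practical choice.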

Per-benchmark AUCs (internal vs population vs text classifier) are on each benchmark page (linked from the leaderboard above).

Benchmark: SQuAD
Figure: eigenspectrum of model confidence (axes: eigenvalue rank, variance share). PC1 = 54.9% of variance; 19 models; 1 negative eigenvalue.
55% of the shared variance in models’ confidence sits on one common factor: models largely agree on which items are hard.

Per-benchmark eigenspectra (and the full structure) are on each benchmark page (linked from the leaderboard above).
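To make the "PC1 share" concrete, the sketch below estimates the largest eigenvalue of a model-by-model correlation matrix with power iteration and divides it by the trace (the sum of all eigenvalues). This is one common way to define the share, assumed here for illustration. The page does not say how it handles the negative eigenvalue it reports, and the 3x3 matrix is a toy example, not the page's data.

```python
from math import sqrt


def top_eigenvalue(matrix, iterations=500):
    """Largest-magnitude eigenvalue of a symmetric matrix via power iteration."""
    n = len(matrix)
    v = [1.0 / sqrt(n)] * n  # start vector; assumed not orthogonal to PC1
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(c * c for c in w))
        if norm == 0:
            return 0.0
        v = [c / norm for c in w]
    # Rayleigh quotient v^T M v for the converged unit vector.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))


def pc1_share(corr):
    """Share of total variance on the first principal component."""
    trace = sum(corr[i][i] for i in range(len(corr)))
    return top_eigenvalue(corr) / trace


# Toy case: three models whose confidences all correlate at 0.5.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
share = pc1_share(corr)
```

For this toy matrix the eigenvalues are 2.0, 0.5 and 0.5, so the first component carries two thirds of the variance. With real data a full eigendecomposition (for example `numpy.linalg.eigh`) would give the whole spectrum, including any negative eigenvalue.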