Positioning spread: every benchmark, one model
| Benchmark | Task accuracy | Confidence | F₁ | Lean (confidence - accuracy) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.40 | 0.42 | 0.55 | +0.01 calibrated |
| MMLU-Pro (knowledge) | 0.35 | 0.92 | 0.51 | +0.57 overconfident |
| LegalBench (legal reasoning) | 0.91 | 0.54 | 0.66 | -0.37 cautious |
| MathBench (competition math) | 0.41 | 0.95 | 0.59 | +0.54 overconfident |
| OmniMath (advanced math) | 0.13 | 0.33 | 0.29 | +0.20 overconfident |
| SciCode (scientific code) | 0.40 | 0.95 | 0.60 | +0.55 overconfident |
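
The last column of the table appears to be confidence minus task accuracy. The sketch below recomputes it from the two displayed columns as a rough check. The definition is inferred from the rows rather than stated in the source, and the `CALIBRATED_BAND` cutoff is a hypothetical value chosen only so that the labels reproduce the table. For SQuAD the recomputed gap is +0.02 rather than the listed +0.01, presumably because the displayed inputs are already rounded.

```python
# (task accuracy, confidence) copied from the table above.
ROWS = {
    "SQuAD": (0.40, 0.42),
    "MMLU-Pro": (0.35, 0.92),
    "LegalBench": (0.91, 0.54),
    "MathBench": (0.41, 0.95),
    "OmniMath": (0.13, 0.33),
    "SciCode": (0.40, 0.95),
}

# Hypothetical cutoff: the source does not say where "calibrated" ends.
CALIBRATED_BAND = 0.05


def lean(accuracy: float, confidence: float) -> float:
    """Signed gap between stated confidence and task accuracy (assumed definition)."""
    return round(confidence - accuracy, 2)


def label(gap: float, band: float = CALIBRATED_BAND) -> str:
    """Map a signed gap to the three labels used in the table."""
    if abs(gap) <= band:
        return "calibrated"
    return "overconfident" if gap > 0 else "cautious"


LEANS = {name: lean(acc, conf) for name, (acc, conf) in ROWS.items()}

for name, gap in LEANS.items():
    print(f"{name}: {gap:+.2f} {label(gap)}")
```

Under these assumptions, only LegalBench comes out negative (accuracy above confidence), and SQuAD is the only benchmark inside the calibrated band.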

In the full cloud
Figure legend: claude-3-haiku conditions; all other model/condition points; equal relative confidence and pass rate
Pairwise signal: pairs involving claude-3-haiku
Matching on accuracy controls for the performance base-rate gap.

| Statistic | Value |
|---|---|
| claude-3-haiku pairs | 18 / 171 |
| claude-3-haiku mean tau | +0.040 |
| All-pairs mean | +0.037 |
| claude-3-haiku p<0.05 | 9 (50%) |

Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; claude-3-haiku pair (filled = p<0.05); claude-3-haiku mean; all-pairs mean
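
The pair counts above are consistent with 19 models compared pairwise: C(19, 2) = 171 unordered pairs, of which each model takes part in 18. The sketch below checks that arithmetic. The model count of 19 is inferred from the totals, not stated in the source, and every model name other than `claude-3-haiku` is a placeholder.

```python
from itertools import combinations
from math import comb

# Assumption: 19 models, inferred from 171 = C(19, 2).
N_MODELS = 19

# Placeholder names for the other models; only claude-3-haiku is from the source.
models = ["claude-3-haiku"] + [f"model_{i}" for i in range(1, N_MODELS)]

# Every unordered pair of distinct models.
all_pairs = list(combinations(models, 2))

# Pairs in which claude-3-haiku is one of the two members.
haiku_pairs = [pair for pair in all_pairs if "claude-3-haiku" in pair]

# Count of claude-3-haiku pairs reported at p<0.05 in the summary above.
SIGNIFICANT = 9
share = SIGNIFICANT / len(haiku_pairs)

print(len(haiku_pairs), len(all_pairs), f"{share:.0%}")
```

The reported 50% is therefore a share of the 18 claude-3-haiku pairs, not of all 171 pairs.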
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.