Positioning spread: every benchmark, one model
| Benchmark | Task accuracy | Confidence | F₁ | Leans (confidence − accuracy) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.57 | 0.53 | 0.73 | -0.04 calibrated |
| MMLU-Pro (knowledge) | 0.65 | 0.80 | 0.80 | +0.16 overconfident |
| LegalBench (legal reasoning) | 0.87 | 0.82 | 0.85 | -0.05 calibrated |
| MathBench (competition math) | 0.98 | 1.00 | 0.99 | +0.02 calibrated |
| OmniMath (advanced math) | 0.70 | 0.97 | 0.84 | +0.27 overconfident |
| SciCode (scientific code) | 0.56 | 0.75 | 0.68 | +0.20 overconfident |
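
The Leans column appears to be confidence minus task accuracy, followed by a verdict. A minimal sketch of that reading is below, recomputed from the rounded columns. The `THRESHOLD` cutoff of 0.10 and the `lean` helper are assumptions for illustration; the source does not state where "calibrated" ends. The recomputed gaps for MMLU-Pro and SciCode come out 0.01 lower than the table, presumably because the page works from unrounded values.

```python
# Rounded (task accuracy, confidence) pairs copied from the table above.
ROWS = {
    "SQuAD": (0.57, 0.53),
    "MMLU-Pro": (0.65, 0.80),
    "LegalBench": (0.87, 0.82),
    "MathBench": (0.98, 1.00),
    "OmniMath": (0.70, 0.97),
    "SciCode": (0.56, 0.75),
}

# Hypothetical cutoff: the source does not say where "calibrated" ends.
THRESHOLD = 0.10


def lean(accuracy, confidence, threshold=THRESHOLD):
    """Return (gap, label), where gap = confidence - accuracy."""
    gap = round(confidence - accuracy, 2)
    if gap > threshold:
        label = "overconfident"
    elif gap < -threshold:
        label = "underconfident"
    else:
        label = "calibrated"
    return gap, label


leans = {name: lean(acc, conf) for name, (acc, conf) in ROWS.items()}
for name, (gap, label) in leans.items():
    print(f"{name:<10} {gap:+.2f} {label}")
```

With this cutoff the labels agree with the table for all six benchmarks, although any cutoff between the largest "calibrated" gap and the smallest "overconfident" gap would do the same.
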
In the full cloud (figure legend: deepseek-r1 conditions; all other model/condition points; equal relative confidence and pass rate)
Pairwise signal: pairs involving deepseek-r1
Match accuracy controls for the performance base-rate gap
| Statistic | Value |
|---|---|
| deepseek-r1 pairs | 18 / 171 |
| deepseek-r1 mean tau | +0.021 |
| All-pairs mean | +0.037 |
| deepseek-r1 p<0.05 | 5 (27.8%) |
Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; deepseek-r1 pair (filled = p<0.05); deepseek-r1 mean; all-pairs mean
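
The pair counts above can be sanity-checked with a little arithmetic. The sketch below assumes that pairs are unordered and that every model is paired with every other one; under that assumption, 18 pairs involving deepseek-r1 implies 19 models, which gives exactly 171 pairs in total. The variable names are illustrative only.

```python
from math import comb

# Counts reported in the summary above.
n_pairs_total = 171   # all model pairs
n_pairs_model = 18    # pairs involving deepseek-r1
n_significant = 5     # deepseek-r1 pairs with p < 0.05

# Assumption: each model is paired once with every other model,
# so one model takes part in (n_models - 1) pairs.
n_models = n_pairs_model + 1
pairs_match = comb(n_models, 2) == n_pairs_total

# Share of deepseek-r1 pairs that reach p < 0.05.
share_significant = n_significant / n_pairs_model

print(n_models, pairs_match, f"{share_significant:.1%}")
```

Both reported figures are consistent under this reading: C(19, 2) = 171, and 5 of 18 is 27.8%.
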
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.