Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.55 | 0.32 | 0.57 | -0.23 cautious |
| MMLU-Pro (knowledge) | 0.51 | 0.89 | 0.69 | +0.38 overconfident |
| LegalBench (legal reasoning) | 0.84 | 0.57 | 0.67 | -0.27 cautious |
| MathBench (competition math) | 0.95 | 0.97 | 0.96 | +0.03 calibrated |
| OmniMath (advanced math) | 0.56 | 0.84 | 0.71 | +0.28 overconfident |
| SciCode (scientific code) | 0.55 | 0.68 | 0.68 | +0.13 overconfident |
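
The "Leans" column appears to be confidence minus task accuracy, and the sketch below recomputes it from the rounded values in the table. The `lean` function and the `CALIBRATED_BAND` cutoff are hypothetical. The source does not state the tolerance it uses for "calibrated"; any cutoff of at least 0.03 and below 0.13 reproduces the labels shown.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.55, 0.32),
    ("MMLU-Pro", 0.51, 0.89),
    ("LegalBench", 0.84, 0.57),
    ("MathBench", 0.95, 0.97),
    ("OmniMath", 0.56, 0.84),
    ("SciCode", 0.55, 0.68),
]

# Hypothetical tolerance: the source does not say where "calibrated" ends.
CALIBRATED_BAND = 0.05


def lean(task_acc, confidence, band=CALIBRATED_BAND):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if abs(gap) <= band:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        label = "cautious"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name:<11} {gap:+.2f} {label}")
```

From the rounded inputs, MathBench comes out at +0.02 rather than the +0.03 shown in the table, presumably because the table was computed from unrounded values. The other five rows match.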
In the full cloud
[Figure legend: deepseek-chat conditions; all other model/condition points; equal relative confidence and pass rate]
Pairwise signal: pairs involving deepseek-chat
Match accuracy: controls for the performance base-rate gap
deepseek-chat pairs: 18/171
deepseek-chat mean tau: +0.017
All-pairs mean tau: +0.037
deepseek-chat pairs with p<0.05: 6 (33.3%)
[Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; deepseek-chat pair (filled = p<0.05); deepseek-chat mean; all-pairs mean]
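
The pair counts above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming the pairs are unordered and cover every combination of two models; the variable names are illustrative and not taken from the source.

```python
from math import comb

# Figures reported above.
n_pairs_total = 171   # all model pairs
n_pairs_model = 18    # pairs involving deepseek-chat
n_significant = 6     # deepseek-chat pairs with p<0.05

# Assumption: each model is paired once with every other model,
# so the number of models is the per-model pair count plus one.
n_models = n_pairs_model + 1

# With n models there are C(n, 2) unordered pairs in total.
assert comb(n_models, 2) == n_pairs_total

share_significant = n_significant / n_pairs_model
print(f"{n_models} models, {comb(n_models, 2)} unordered pairs")
print(f"{share_significant:.1%} of deepseek-chat pairs at p<0.05")
```

Under that assumption the figures are mutually consistent: 19 models give 171 pairs, 18 of which involve any one model, and 6 of 18 is 33.3%.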
The four metacognitive outcomes
No curated cases for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.