Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Lean (confidence − task acc) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.44 | 0.61 | 0.67 | +0.17 overconfident |
| MMLU-Pro (knowledge) | 0.39 | 0.90 | 0.57 | +0.51 overconfident |
| LegalBench (legal reasoning) | 0.85 | 0.86 | 0.85 | +0.01 calibrated |
| MathBench (competition math) | 0.87 | 0.98 | 0.92 | +0.11 overconfident |
| OmniMath (advanced math) | 0.44 | 0.82 | 0.63 | +0.38 overconfident |
| SciCode (scientific code) | 0.41 | 0.47 | 0.56 | +0.06 overconfident |
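
The last column of the table matches confidence minus task accuracy in every row. A minimal sketch of that arithmetic follows. The 0.05 cutoff for "calibrated" and the "underconfident" label are assumptions for illustration. The page does not state a cutoff, only that +0.01 is labelled calibrated and +0.06 overconfident.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.44, 0.61),
    ("MMLU-Pro", 0.39, 0.90),
    ("LegalBench", 0.85, 0.86),
    ("MathBench", 0.87, 0.98),
    ("OmniMath", 0.44, 0.82),
    ("SciCode", 0.41, 0.47),
]

# Hypothetical cutoff: the page does not say where "calibrated" ends.
CALIBRATED_BAND = 0.05


def lean(task_acc: float, confidence: float) -> tuple[float, str]:
    """Return the confidence-minus-accuracy gap and a label for it."""
    gap = round(confidence - task_acc, 2)
    if abs(gap) < CALIBRATED_BAND:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        label = "underconfident"  # assumed label; no such row appears in the table
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name}: {gap:+.2f} {label}")
```

With this cutoff the sketch reproduces the six labels shown in the table. Any cutoff above 0.01 and at or below 0.06 would do the same.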
In the full cloud
[Figure: relative confidence against pass rate for all model/condition points. Legend: mistral-small-3.2-24b-instruct conditions; all other model/condition points; line of equal relative confidence and pass rate.]
Pairwise signal: pairs involving mistral-small-3.2-24b-instruct
Match accuracy controls for the performance base-rate gap
mistral-small-3.2-24b-instruct pairs: 18 / 171
mistral-small-3.2-24b-instruct mean tau: +0.038
All-pairs mean tau: +0.037
mistral-small-3.2-24b-instruct pairs with p<0.05: 8 of 18 (44.4%)
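
The pair counts above can be checked with a few lines of arithmetic. This is a sketch under one assumption: that the 171 pairs are all unordered pairs among 19 models. The page does not state the model count. It is inferred here because C(19, 2) = 171 and any one model then appears in 18 pairs.

```python
from math import comb

# Assumption: 19 models, inferred from 171 = C(19, 2); not stated on the page.
N_MODELS = 19

total_pairs = comb(N_MODELS, 2)   # all unordered model pairs
pairs_with_model = N_MODELS - 1   # pairs that include one given model

# Reported on the page: 8 of this model's pairs reach p<0.05.
significant = 8
share = significant / pairs_with_model

print(total_pairs, pairs_with_model, f"{share:.1%}")  # 171 18 44.4%
```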
[Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; mistral-small-3.2-24b-instruct pair (filled = p<0.05); mistral-small-3.2-24b-instruct mean; all-pairs mean.]
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.