Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans (confidence minus task acc) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.52 | 0.61 | 0.70 | +0.09 overconfident |
| MMLU-Pro (knowledge) | 0.48 | 0.80 | 0.65 | +0.32 overconfident |
| LegalBench (legal reasoning) | 0.86 | 0.88 | 0.86 | +0.03 calibrated |
| MathBench (competition math) | 0.94 | 0.87 | 0.92 | -0.07 cautious |
| OmniMath (advanced math) | 0.55 | 0.31 | 0.60 | -0.24 cautious |
| SciCode (scientific code) | 0.48 | 0.21 | 0.34 | -0.27 cautious |
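
The "Leans" column appears to be the signed gap between mean confidence and task accuracy. A minimal sketch of that arithmetic follows, using the rounded values from the table. The 0.05 band that separates "calibrated" from the other two labels is an assumption chosen only because it reproduces the labels shown here; the source does not state the actual cutoff. Because the inputs are rounded, LegalBench comes out as +0.02 in this sketch rather than the +0.03 reported, which was presumably computed from unrounded values.

```python
# Values copied from the table above: (task accuracy, mean confidence).
ROWS = {
    "SQuAD": (0.52, 0.61),
    "MMLU-Pro": (0.48, 0.80),
    "LegalBench": (0.86, 0.88),
    "MathBench": (0.94, 0.87),
    "OmniMath": (0.55, 0.31),
    "SciCode": (0.48, 0.21),
}

# Hypothetical cutoff: the source does not say where "calibrated" ends.
CALIBRATED_BAND = 0.05


def lean(accuracy: float, confidence: float,
         band: float = CALIBRATED_BAND) -> tuple[float, str]:
    """Return the signed gap (confidence - accuracy) and a coarse label."""
    gap = round(confidence - accuracy, 2)
    if gap > band:
        label = "overconfident"
    elif gap < -band:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


leans = {name: lean(acc, conf) for name, (acc, conf) in ROWS.items()}
for name, (gap, label) in leans.items():
    print(f"{name:<12} {gap:+.2f} {label}")
```

A positive gap means the model claims more confidence than its accuracy supports; a negative gap means it is more accurate than it claims.
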
In the full cloud
[Figure: scatter plot of relative confidence and pass rate. Legend: mistral-medium-3.1 conditions; all other model/condition points; "equal" reference line.]
Pairwise signal: pairs involving mistral-medium-3.1
Match accuracy controls for the performance base-rate gap
| Statistic | Value |
|---|---|
| mistral-medium-3.1 pairs | 18 / 171 |
| mistral-medium-3.1 mean tau | +0.024 |
| All-pairs mean | +0.037 |
| mistral-medium-3.1 pairs with p<0.05 | 7 (38.9%) |

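
The pair counts above are internally consistent, which a short check makes explicit. The sketch below assumes the 171 pairs are all unordered pairs of distinct models; under that assumption it recovers the number of models and confirms that 7 significant pairs out of 18 is 38.9%. The variable names are illustrative and not taken from the source.

```python
from math import comb

N_PAIRS_TOTAL = 171       # from the panel: all model pairs
N_PAIRS_WITH_MODEL = 18   # pairs involving mistral-medium-3.1
N_SIGNIFICANT = 7         # of those, pairs with p < 0.05

# Find n such that n choose 2 equals the total pair count (assumes all
# unordered pairs of distinct models are included).
n_models = next(n for n in range(2, 1000) if comb(n, 2) == N_PAIRS_TOTAL)

# Each model is paired exactly once with every other model.
pairs_per_model = n_models - 1

share_significant = N_SIGNIFICANT / N_PAIRS_WITH_MODEL

print(n_models, pairs_per_model, f"{share_significant:.1%}")
# prints: 19 18 38.9%
```
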
[Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; mistral-medium-3.1 pair (filled = p<0.05); mistral-medium-3.1 mean; all-pairs mean.]
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.