Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.31 | 1.00 | 0.48 | +0.69 overconfident |
| MMLU-Pro (knowledge) | 0.25 | 1.00 | 0.41 | +0.75 overconfident |
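
The Leans column appears to be confidence minus task accuracy, and both rows are consistent with that reading. A minimal sketch under that assumption follows; the function name `lean`, the tolerance `tol`, and the "underconfident" and "calibrated" labels are illustrative and do not come from the source.

```python
from typing import Tuple


def lean(task_acc: float, confidence: float, tol: float = 0.05) -> Tuple[float, str]:
    """Return the signed gap (confidence - accuracy) and a direction label.

    Assumption: 'Leans' is defined as confidence minus task accuracy.
    The tolerance band for 'calibrated' is a hypothetical choice.
    """
    gap = round(confidence - task_acc, 2)
    if gap > tol:
        label = "overconfident"
    elif gap < -tol:
        label = "underconfident"
    else:
        label = "calibrated"
    return gap, label


# (task accuracy, confidence) pairs taken from the table above
rows = {
    "SQuAD": (0.31, 1.00),
    "MMLU-Pro": (0.25, 1.00),
}

for name, (acc, conf) in rows.items():
    gap, label = lean(acc, conf)
    print(f"{name}: {gap:+.2f} {label}")
```

Run on the two rows, this gives +0.69 and +0.75, both labeled overconfident, matching the table.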
[Figure: In the full cloud. Legend: qwen-2.5-coder-32b-instruct conditions; all other model/condition points; equal relative confidence and pass rate.]
Pairwise signal: pairs involving qwen-2.5-coder-32b-instruct
Match accuracy controls for the performance base-rate gap

| Statistic | Value |
|---|---|
| qwen-2.5-coder-32b-instruct pairs | 19 / 190 |
| qwen-2.5-coder-32b-instruct mean tau | +0.127 |
| All-pairs mean tau | +0.041 |
| qwen-2.5-coder-32b-instruct pairs with p<0.05 | 17 (89.5%) |
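
The counts above are consistent with a pool of 20 models: 20 choose 2 gives 190 unordered pairs, 19 of which involve any single model, and 17 of 19 is 89.5%. A quick arithmetic check, assuming 20 models (inferred from the 190 total, not stated in the source):

```python
from math import comb

n_models = 20  # assumption: inferred from 190 = C(20, 2)

total_pairs = comb(n_models, 2)      # all unordered model pairs
pairs_with_one_model = n_models - 1  # pairs that include one fixed model
significant = 17                     # pairs with p<0.05, from the figures above

share = significant / pairs_with_one_model
print(total_pairs, pairs_with_one_model, f"{share:.1%}")
```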
[Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; qwen-2.5-coder-32b-instruct pair (filled = p<0.05); qwen-2.5-coder-32b-instruct mean; all-pairs mean.]
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.