Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.45 | 0.39 | 0.59 | -0.06 cautious |
| MMLU-Pro (knowledge) | 0.34 | 0.89 | 0.53 | +0.55 overconfident |
| LegalBench (legal reasoning) | 0.88 | 0.68 | 0.76 | -0.20 cautious |
| MathBench (competition math) | 0.74 | 0.98 | 0.85 | +0.24 overconfident |
| OmniMath (advanced math) | 0.33 | 0.51 | 0.54 | +0.18 overconfident |
| SciCode (scientific code) | 0.46 | 0.44 | 0.51 | -0.02 calibrated |
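
The "Leans" column matches confidence minus task accuracy in every row, so it can be recomputed from the other two columns. The sketch below does that in Python. The `lean` helper and the `CALIBRATED_BAND` cutoff of 0.05 are assumptions made for illustration: the page does not state the threshold it uses for "calibrated", and 0.05 is only one value consistent with the six rows shown.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.45, 0.39),
    ("MMLU-Pro", 0.34, 0.89),
    ("LegalBench", 0.88, 0.68),
    ("MathBench", 0.74, 0.98),
    ("OmniMath", 0.33, 0.51),
    ("SciCode", 0.46, 0.44),
]

# Assumption: the real cutoff is not stated on the page.
# 0.05 is merely consistent with the six rows above.
CALIBRATED_BAND = 0.05


def lean(task_acc, confidence, band=CALIBRATED_BAND):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if abs(gap) <= band:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        label = "cautious"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name:<11} {gap:+.2f} {label}")
```

Under these assumptions the loop reproduces the six labels in the table; a different cutoff between 0.02 and 0.06 would give the same result on this data.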
In the full cloud
[Scatter plot: relative confidence vs. pass rate. Legend: llama-3.3-70b-instruct conditions; all other model/condition points; reference line where relative confidence equals pass rate.]
Pairwise signal: pairs involving llama-3.3-70b-instruct
Match accuracy controls for the performance base-rate gap
| Statistic | Value |
|---|---|
| llama-3.3-70b-instruct pairs | 18 / 171 |
| llama-3.3-70b-instruct mean tau | +0.034 |
| All-pairs mean | +0.037 |
| llama-3.3-70b-instruct p<0.05 | 5 (27.8%) |

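
The pair counts above are internally consistent, which a short check makes explicit. This is a minimal sketch that assumes each model is paired exactly once with every other model; the page does not say so directly, and the variable names are illustrative.

```python
from math import comb

pairs_with_model = 18   # pairs involving llama-3.3-70b-instruct
all_pairs = 171         # total model pairs reported
significant = 5         # llama-3.3-70b-instruct pairs with p<0.05

# Assumption: unordered pairs over all models, so one model appears in n - 1 pairs.
n_models = pairs_with_model + 1
assert comb(n_models, 2) == all_pairs  # C(19, 2) = 171

share = significant / pairs_with_model
print(f"{n_models} models, {share:.1%} of pairs significant at p<0.05")
# prints "19 models, 27.8% of pairs significant at p<0.05"
```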
[Chart legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; llama-3.3-70b-instruct pair (filled = p<0.05); llama-3.3-70b-instruct mean; all-pairs mean.]
The four metacognitive outcomes
No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.