mistral-small-3.2-24b-instruct | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for mistral-small-3.2-24b-instruct, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans (confidence − task acc)
SQuAD (factual recall)	0.44	0.61	0.67	+0.17 overconfident
MMLU-Pro (knowledge)	0.39	0.90	0.57	+0.51 overconfident
LegalBench (legal reasoning)	0.85	0.86	0.85	+0.01 calibrated
MathBench (competition math)	0.87	0.98	0.92	+0.11 overconfident
OmniMath (advanced math)	0.44	0.82	0.63	+0.38 overconfident
SciCode (scientific code)	0.41	0.47	0.56	+0.06 overconfident
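Every value in the Leans column equals confidence minus task accuracy, which can be checked directly from the rows above. The sketch below recomputes the gap and the label. The page does not state its cutoff for "calibrated", so the 0.05 tolerance, the `lean` helper, and the "underconfident" label are assumptions made for illustration only.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.44, 0.61),
    ("MMLU-Pro", 0.39, 0.90),
    ("LegalBench", 0.85, 0.86),
    ("MathBench", 0.87, 0.98),
    ("OmniMath", 0.44, 0.82),
    ("SciCode", 0.41, 0.47),
]

# Assumption: the source does not give its threshold. 0.05 is a hypothetical
# value that happens to reproduce the labels shown in the table.
CALIBRATION_TOLERANCE = 0.05


def lean(task_acc, confidence, tolerance=CALIBRATION_TOLERANCE):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if gap > tolerance:
        label = "overconfident"
    elif gap < -tolerance:
        label = "underconfident"  # hypothetical label, not shown on the page
    else:
        label = "calibrated"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name:<12}{gap:+.2f} {label}")
```

With this tolerance, LegalBench is the only row that falls inside the calibrated band; the other five lean overconfident, as in the table.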

In the full cloud

Figure legend: mistral-small-3.2-24b-instruct conditions; all other model/condition points; equal relative confidence and pass rate.

Pairwise signal: pairs involving mistral-small-3.2-24b-instruct

Match accuracy controls for the performance base-rate gap

mistral-small-3.2-24b-instruct pairs: 18 / 171

mistral-small-3.2-24b-instruct mean tau: +0.038

All-pairs mean: +0.037

mistral-small-3.2-24b-instruct p<0.05: 8 (44.4%)
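The pair counts above are internally consistent if the pairs are unordered pairs of distinct models, which is an assumption since the page does not define them. Under that reading, 18 pairs involving one model implies 19 models, and 19 models give 171 pairs in total. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
from math import comb

# Figures copied from the summary above.
PAIRS_WITH_MODEL = 18
TOTAL_PAIRS = 171
SIGNIFICANT_PAIRS = 8


def implied_model_count(pairs_with_model):
    """Assumption: each model is paired exactly once with every other model."""
    return pairs_with_model + 1


def significant_share(significant, pairs):
    """Share of pairs with p<0.05, as a percentage rounded to one decimal."""
    return round(100 * significant / pairs, 1)


n_models = implied_model_count(PAIRS_WITH_MODEL)
print(n_models, comb(n_models, 2))
print(significant_share(SIGNIFICANT_PAIRS, PAIRS_WITH_MODEL))
```

The computed total matches the reported 171 and the share matches the reported 44.4%.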

Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; mistral-small-3.2-24b-instruct pair (filled = p<0.05); mistral-small-3.2-24b-instruct mean; all-pairs mean.

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.