qwen-2.5-coder-32b-instruct | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for qwen-2.5-coder-32b-instruct, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.31	1.00	0.48	+0.69 overconfident
MMLU-Pro (knowledge)	0.25	1.00	0.41	+0.75 overconfident
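
The Leans column appears to be the signed gap between stated confidence and task accuracy. The sketch below recomputes it from the two rows above. The definition (confidence minus accuracy) and the helper names `lean` and `lean_label` are assumptions made for illustration, not the benchmark's documented method.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
rows = [
    ("SQuAD (factual recall)", 0.31, 1.00),
    ("MMLU-Pro (knowledge)", 0.25, 1.00),
]


def lean(task_acc: float, confidence: float) -> float:
    """Signed gap. Assumed definition: confidence minus task accuracy."""
    return round(confidence - task_acc, 2)


def lean_label(gap: float) -> str:
    """Hypothetical labelling rule: a positive gap reads as overconfident."""
    if gap > 0:
        return f"{gap:+.2f} overconfident"
    if gap < 0:
        return f"{gap:+.2f} underconfident"
    return "+0.00 calibrated"


leans = {name: lean_label(lean(acc, conf)) for name, acc, conf in rows}
for name, label in leans.items():
    print(f"{name}: {label}")
```

Under that assumed definition, both rows reproduce the published values (+0.69 and +0.75).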

In the full cloud

[Figure: relative confidence vs. pass rate. Legend: qwen-2.5-coder-32b-instruct conditions; all other model/condition points; line of equal relative confidence and pass rate.]

Pairwise signal: pairs involving qwen-2.5-coder-32b-instruct

Match accuracy controls for the performance base-rate gap

Statistic	Value
qwen-2.5-coder-32b-instruct pairs	19 / 190
qwen-2.5-coder-32b-instruct mean tau	+0.127
All-pairs mean tau	+0.041
qwen-2.5-coder-32b-instruct pairs with p<0.05	17 (89.5%)
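
The pair counts above can be sanity-checked with a few lines of arithmetic. This sketch assumes every model is paired once with every other model, which would put 20 models in the comparison. That model count is inferred from the figures on this page and is not stated by the source.

```python
from math import comb

# Figures copied from the summary statistics above.
n_pairs_total = 190   # all model pairs
n_pairs_model = 19    # pairs involving qwen-2.5-coder-32b-instruct
n_significant = 17    # of those, pairs with p<0.05

# Assumption: each model is paired once with every other model,
# so a model appearing in 19 pairs implies 20 models overall.
n_models = n_pairs_model + 1
assert comb(n_models, 2) == n_pairs_total

share_significant = n_significant / n_pairs_model
print(f"{n_models} models, {share_significant:.1%} of pairs at p<0.05")
```

The check passes: 20 models give 190 unordered pairs, and 17 of 19 is 89.5%.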

[Figure: pairwise tau. Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; qwen-2.5-coder-32b-instruct pair (filled = p<0.05); qwen-2.5-coder-32b-instruct mean; all-pairs mean.]

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.