qwen-2.5-72b-instruct | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for qwen-2.5-72b-instruct, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.47	0.30	0.57	-0.18 cautious
MMLU-Pro (knowledge)	0.41	0.69	0.57	+0.28 overconfident
LegalBench (legal reasoning)	0.84	0.51	0.63	-0.33 cautious
MathBench (competition math)	0.83	0.91	0.91	+0.08 overconfident
OmniMath (advanced math)	0.34	0.43	0.57	+0.10 overconfident
SciCode (scientific code)	0.43	0.42	0.56	-0.01 calibrated
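
The Leans column appears consistent with confidence minus task accuracy; the SQuAD and OmniMath rows differ by 0.01, presumably because the page computes the gap from unrounded values. A minimal sketch of that reading follows. The `Row` type, the `lean` and `label` helpers, and the `TOLERANCE` cutoff of 0.05 are all hypothetical: the page does not state the threshold it uses for "calibrated", and 0.05 is simply one value that reproduces the six labels shown.

```python
from dataclasses import dataclass


@dataclass
class Row:
    benchmark: str
    task_acc: float
    confidence: float


# Values copied from the table above (rounded as displayed).
ROWS = [
    Row("SQuAD", 0.47, 0.30),
    Row("MMLU-Pro", 0.41, 0.69),
    Row("LegalBench", 0.84, 0.51),
    Row("MathBench", 0.83, 0.91),
    Row("OmniMath", 0.34, 0.43),
    Row("SciCode", 0.43, 0.42),
]

# Assumption: gaps within this band count as "calibrated".
# The real cutoff used by the page is not given in the source.
TOLERANCE = 0.05


def lean(row: Row) -> float:
    """Signed gap: positive means confidence exceeds accuracy."""
    return row.confidence - row.task_acc


def label(gap: float, tol: float = TOLERANCE) -> str:
    """Map a signed gap to the wording used in the Leans column."""
    if abs(gap) <= tol:
        return "calibrated"
    return "overconfident" if gap > 0 else "cautious"


for row in ROWS:
    gap = lean(row)
    print(f"{row.benchmark}\t{gap:+.2f} {label(gap)}")
```

Because the inputs are the rounded table values, the printed gaps can differ from the published ones in the last digit.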

In the full cloud

Figure legend: qwen-2.5-72b-instruct conditions; all other model/condition points; equal relative confidence and pass rate.

Pairwise signal: pairs involving qwen-2.5-72b-instruct

Match accuracy: controls for the performance base-rate gap

qwen-2.5-72b-instruct pairs: 18 / 171
qwen-2.5-72b-instruct mean tau: +0.052
All-pairs mean: +0.037
qwen-2.5-72b-instruct p<0.05: 11 (61.1%)
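
The counts above can be cross-checked with a little arithmetic. The sketch below assumes the 171 pairs are all unordered pairs of distinct models, so a model that appears in 18 pairs implies 19 models in total; the page does not state this, and the helper names are hypothetical.

```python
from math import comb

# Figures copied from the summary above.
PAIRS_WITH_MODEL = 18
TOTAL_PAIRS = 171
SIGNIFICANT_PAIRS = 11


def implied_model_count(pairs_with_one_model: int) -> int:
    """Assumption: each model is paired once with every other model."""
    return pairs_with_one_model + 1


def significant_share(significant: int, pairs: int) -> float:
    """Fraction of this model's pairs with p < 0.05."""
    return significant / pairs


n_models = implied_model_count(PAIRS_WITH_MODEL)
consistent = comb(n_models, 2) == TOTAL_PAIRS
share = significant_share(SIGNIFICANT_PAIRS, PAIRS_WITH_MODEL)

print(f"implied models: {n_models}, matches total pairs: {consistent}")
print(f"significant share: {share:.1%}")
```

Under that assumption the totals agree, and the 61.1% figure is 11 out of 18.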

Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; qwen-2.5-72b-instruct pair (filled = p<0.05); qwen-2.5-72b-instruct mean; all-pairs mean.

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.