
qwen-2.5-coder-32b-instruct

Appears in 2 benchmarks
Mean lean (confidence − pass rate): +0.72
2/2 benchmarks lean overconfident (prospective probe)

Positioning spread: every benchmark, one model

[Figure: performance and confidence on a 0.00 to 1.00 axis for SQuAD (factual recall) and MMLU-Pro (knowledge); a red gap marks overconfidence.]
Performance vs. confidence for qwen-2.5-coder-32b-instruct, per benchmark (prospective probe).
Benchmark                 Task acc  Confidence  F₁    Leans
SQuAD (factual recall)    0.31      1.00        0.48  +0.69 overconfident
MMLU-Pro (knowledge)      0.25      1.00        0.41  +0.75 overconfident
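The lean figures can be reproduced from the table with a few lines of arithmetic. The sketch below assumes that "lean" is simply confidence minus task accuracy, as the summary line suggests; the names `results` and `lean` are illustrative and not taken from the page.

```python
# Values copied from the table above: (task accuracy, confidence) per benchmark.
results = {
    "SQuAD (factual recall)": (0.31, 1.00),
    "MMLU-Pro (knowledge)": (0.25, 1.00),
}


def lean(acc: float, conf: float) -> float:
    """Assumed definition: positive means overconfident, negative underconfident."""
    return conf - acc


# Per-benchmark lean, rounded to the two decimals shown on the page.
leans = {name: round(lean(acc, conf), 2) for name, (acc, conf) in results.items()}

# Unweighted mean across benchmarks, and how many benchmarks lean overconfident.
mean_lean = round(sum(leans.values()) / len(leans), 2)
overconfident = sum(1 for value in leans.values() if value > 0)

print(leans)
print(f"mean lean: {mean_lean:+.2f}, overconfident: {overconfident}/{len(leans)}")
```

Under that assumption the per-benchmark values match the "Leans" column, and their unweighted mean matches the headline mean lean.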

In the full cloud

[Figure: scatter plot. x-axis: performance z-score within benchmark/probe (-3 to 3); y-axis: confidence z-score (-3 to 3).]
Legend: qwen-2.5-coder-32b-instruct conditions; all other model/condition points; line of equal relative confidence and pass rate.

Pairwise signal: pairs involving qwen-2.5-coder-32b-instruct

Match accuracy controls for the performance base-rate gap.
qwen-2.5-coder-32b-instruct pairs: 19 / 190
qwen-2.5-coder-32b-instruct mean tau: +0.127
All-pairs mean: +0.041
qwen-2.5-coder-32b-instruct p<0.05: 17 (89.5%)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b, axis from -1.0 to 1.0).]
Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; qwen-2.5-coder-32b-instruct pair (filled = p<0.05); qwen-2.5-coder-32b-instruct mean; all-pairs mean.
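For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of Kendall tau-b. The page does not say how the per-pair tau is computed or what the underlying observations are, so this is only the generic textbook formula; the function name `kendall_tau_b` and the toy gap values are hypothetical and not taken from the page.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b: rank agreement between xs and ys, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Pairs untied in x times pairs untied in y, under the square root.
    untied_in_x = concordant + discordant + only_y_tied
    untied_in_y = concordant + discordant + only_x_tied
    denom = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up illustrative values, NOT data from the page.
confidence_gaps = [0.10, 0.20, 0.20, 0.40]
performance_gaps = [0.05, 0.10, 0.30, 0.30]
print(kendall_tau_b(confidence_gaps, performance_gaps))
```

A value near +1 would mean that larger confidence gaps tend to go with larger performance gaps, a value near 0 would mean no consistent ordering, and a value near -1 would mean the ordering is reversed.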

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.