deepseek-chat | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for deepseek-chat, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.55	0.32	0.57	-0.23 cautious
MMLU-Pro (knowledge)	0.51	0.89	0.69	+0.38 overconfident
LegalBench (legal reasoning)	0.84	0.57	0.67	-0.27 cautious
MathBench (competition math)	0.95	0.97	0.96	+0.03 calibrated
OmniMath (advanced math)	0.56	0.84	0.71	+0.28 overconfident
SciCode (scientific code)	0.55	0.68	0.68	+0.13 overconfident
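
The page does not define the Leans column, but the values are consistent with confidence minus task accuracy. The sketch below recomputes it from the rounded figures in the table under that assumption. The `lean` helper and the `TOLERANCE` band of 0.05 are hypothetical and chosen only for illustration; the dashboard's actual cutoff for "calibrated" is not stated.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.55, 0.32),
    ("MMLU-Pro", 0.51, 0.89),
    ("LegalBench", 0.84, 0.57),
    ("MathBench", 0.95, 0.97),
    ("OmniMath", 0.56, 0.84),
    ("SciCode", 0.55, 0.68),
]

# Hypothetical band: gaps within +/-0.05 are labelled "calibrated".
# The page does not state the cutoff it uses.
TOLERANCE = 0.05


def lean(task_acc, confidence, tol=TOLERANCE):
    """Return (gap, label), where gap = confidence - task accuracy (assumed definition)."""
    gap = round(confidence - task_acc, 2)
    if gap > tol:
        label = "overconfident"
    elif gap < -tol:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


if __name__ == "__main__":
    for name, acc, conf in ROWS:
        gap, label = lean(acc, conf)
        print(f"{name:<12}{gap:+.2f} {label}")
```

Under this assumption five of the six rows reproduce the table exactly. MathBench comes out at +0.02 rather than the listed +0.03, which would be expected if the page computes the gap from unrounded values.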

Legend: deepseek-chat conditions; all other model/condition points; equal relative confidence and pass rate.

Match accuracy controls for the performance base-rate gap

deepseek-chat pairs: 18 / 171
deepseek-chat mean tau: +0.017
All-pairs mean: +0.037
deepseek-chat p<0.05: 6 (33.3%)

Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; deepseek-chat pair (filled = p<0.05); deepseek-chat mean; all-pairs mean.

No curated cases for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.