SciCode (scientific code) | The Metacognition Bench

Operating-point board

SciCode (scientific code) · Prospective · F_β β

Rank	Model	F_β	Precision	Recall	Task accuracy
1	gpt-5.2	0.739	0.70	0.79	0.58
2	claude-haiku-4.5	0.730	0.68	0.79	0.56
3	deepseek-chat	0.684	0.62	0.77	0.55
4	deepseek-r1	0.677	0.59	0.80	0.56
5	gemini-2.0-flash-001	0.653	0.50	0.94	0.51
6	gemini-3-flash-preview	0.635	0.68	0.60	0.59
7	gemini-3.1-pro-preview	0.623	0.72	0.55	0.64
8	claude-3-haiku	0.596	0.42	1.00	0.40
9	llama-3.1-70b-instruct	0.594	0.51	0.71	0.44
10	gpt-4o	0.565	0.65	0.50	0.51
11	mistral-small-3.2-24b-instruct	0.558	0.52	0.60	0.41
12	qwen-2.5-72b-instruct	0.557	0.56	0.55	0.43
13	gpt-4o-mini	0.556	0.54	0.57	0.37
14	gemini-2.5-pro	0.542	0.70	0.44	0.61
15	gemini-2.5-flash	0.520	0.56	0.48	0.56
16	llama-3.3-70b-instruct	0.512	0.52	0.50	0.46
17	claude-sonnet-4.5	0.428	0.64	0.32	0.57
18	mistral-medium-3.1	0.338	0.55	0.24	0.48
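
As a reading aid, the F_β column can be related to the precision and recall columns with the standard weighted harmonic mean. The sketch below is a minimal illustration, not the benchmark's own scoring code. The value of β is not stated on this page, so `beta=1.0` is an assumption, and `f_beta` is a hypothetical helper name. Because the board shows precision and recall rounded to two decimals, recomputed scores can differ from the listed F_β in the third decimal.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F_beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention: avoid division by zero when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded precision/recall from the gemini-2.0-flash-001 row, with beta = 1 assumed.
print(round(f_beta(0.50, 0.94, beta=1.0), 3))  # 0.653
```

With β > 1 the score weights recall more heavily, and with β < 1 it weights precision more heavily, which changes the ordering of high-recall rows such as claude-3-haiku.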

Pairwise signal on SciCode

Observed correlation

0.049

mean τ-b · 0 = no signal

Baseline correlation

-0.000

permutation null · ≈ 0
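
To make the two correlation cards concrete, here is a minimal sketch of Kendall's τ-b for one model pair, together with a plain shuffle-based permutation baseline. The page does not say which per-item series are correlated for each pair, nor how its named nulls are constructed, so the binary per-item vectors and the simple shuffle below are assumptions for illustration. `kendall_tau_b` and `permutation_null_mean` are hypothetical helper names.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = tied_x = tied_y = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    # If one sequence is constant, tau-b is undefined; return 0.0 by convention here.
    return (concordant - discordant) / denom if denom else 0.0


def permutation_null_mean(x, y, n_perm=200, seed=0):
    """Mean tau-b after shuffling y, which breaks item-level alignment."""
    rng = random.Random(seed)
    y_shuffled = list(y)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        total += kendall_tau_b(x, y_shuffled)
    return total / n_perm


# Hypothetical per-item binary commits (1 = "will solve") for two models.
model_a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
observed = kendall_tau_b(model_a, model_b)
baseline = permutation_null_mean(model_a, model_b)
print(f"observed tau-b = {observed:.3f}, permutation mean = {baseline:.3f}")
```

A shuffled baseline near zero is what the "≈ 0" caption describes: once item alignment is destroyed, no systematic agreement should remain.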

Model pairs

153

unit = pair, not model

Significant pairs

11.1%

p<0.05 after FDR · not effect size

[Figure: distribution of pairwise τ-b for observed model pairs, compared with the base-rate-matched null and the calibration-preserving null. Observed 5%-95% range: -0.100 to 0.205.]
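
The pair count and the significant-pair share can be checked with a little arithmetic: 18 models give C(18, 2) = 153 unordered pairs, and 11.1% of 153 is about 17 pairs. The sketch below also shows one common way to apply an FDR correction across pairwise p-values. The page does not name its FDR procedure, so the use of Benjamini-Hochberg here is an assumption, and the p-values are made up for illustration.

```python
import math


def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


n_models = 18
n_pairs = math.comb(n_models, 2)
print(n_pairs)                 # 153
print(round(0.111 * n_pairs))  # 17

# Hypothetical p-values for four model pairs (not taken from the benchmark).
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.5]))  # [True, True, False, False]
```

As the card caption notes, the share of significant pairs says how often a pairwise correlation is distinguishable from the null, not how large it is.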

Confidence structure: the shared-difficulty factor

Information the model doesn’t use

Benchmark	Model	Internal	Population	Same-context text	Post-hoc text
SciCode	Llama 3.1 70B	0.590	0.657	0.539	0.858

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt, so it serves as an upper-bound comparison.
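
For readers less familiar with the metric, an ROC AUC can be read as the probability that a randomly chosen correct attempt receives a higher score than a randomly chosen incorrect one, with ties counted as half. The sketch below is a minimal pairwise implementation under that definition. It assumes each column scores one signal against per-item correctness, which the page does not spell out. The scores and labels are hypothetical, and `roc_auc` is an illustrative helper, not the paper's code.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels (1 = task solved).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 0.75
```

On this scale 0.5 means the signal does not separate correct from incorrect attempts at all, which is the reference point for comparing the four columns.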

The four metacognitive outcomes on SciCode

No curated cases for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.