
SciCode (scientific code)

Unaided means: implement the function without the extra background facts.

Probes: Prospective

Operating-point board

SciCode (scientific code) · Prospective · Fβ

 #  Model                            Fβ     Prec  Rec   Task acc
 1  gpt-5.2                          0.739  0.70  0.79  0.58
 2  claude-haiku-4.5                 0.730  0.68  0.79  0.56
 3  deepseek-chat                    0.684  0.62  0.77  0.55
 4  deepseek-r1                      0.677  0.59  0.80  0.56
 5  gemini-2.0-flash-001             0.653  0.50  0.94  0.51
 6  gemini-3-flash-preview           0.635  0.68  0.60  0.59
 7  gemini-3.1-pro-preview           0.623  0.72  0.55  0.64
 8  claude-3-haiku                   0.596  0.42  1.00  0.40
 9  llama-3.1-70b-instruct           0.594  0.51  0.71  0.44
10  gpt-4o                           0.565  0.65  0.50  0.51
11  mistral-small-3.2-24b-instruct   0.558  0.52  0.60  0.41
12  qwen-2.5-72b-instruct            0.557  0.56  0.55  0.43
13  gpt-4o-mini                      0.556  0.54  0.57  0.37
14  gemini-2.5-pro                   0.542  0.70  0.44  0.61
15  gemini-2.5-flash                 0.520  0.56  0.48  0.56
16  llama-3.3-70b-instruct           0.512  0.52  0.50  0.46
17  claude-sonnet-4.5                0.428  0.64  0.32  0.57
18  mistral-medium-3.1               0.338  0.55  0.24  0.48
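The Fβ column combines precision and recall into one score. The page does not state which β the board uses; the sketch below assumes β = 1 (the ordinary F1), which appears consistent with the tabulated values up to rounding of the two-decimal precision and recall. The function name `f_beta` is a hypothetical helper, not something taken from the site.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily. beta = 1 is an assumption about the board, not a
    setting confirmed by the page.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + b2) * precision * recall / denom


if __name__ == "__main__":
    # Rounded precision and recall for claude-haiku-4.5 from the board.
    # The result will differ slightly from the listed 0.730 because the
    # inputs are already rounded to two decimals.
    print(f_beta(0.68, 0.79))
```

Because the board shows precision and recall to two decimals only, recomputed scores can differ from the listed Fβ in the third decimal.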

Pairwise signal on SciCode

Observed correlation: 0.049 (mean τ-b · 0 = no signal)
Baseline correlation: -0.000 (permutation null · ≈ 0)
Model pairs: 153 (unit = pair, not model)
Significant pairs: 11.1% (p<0.05 after FDR · not effect size)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b), horizontal axis from -1.0 to 1.0. Series: observed model pairs, base-rate-matched null, calibration-preserving null. 5%-95% observed: -0.100 to 0.205.]
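The pairwise statistic above is a Kendall tau-b compared against a null distribution. As a rough illustration only, the sketch below computes tau-b for one hypothetical model pair and builds a plain shuffle-based null. The gap vectors `conf_gap` and `perf_gap` are invented toy data, the page does not say how gaps are defined per pair, and the simple shuffle here is a swapped-in stand-in for the base-rate-matched and calibration-preserving nulls, which are not reproduced.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((C + D + Ty) * (C + D + Tx)).

    C and D count concordant and discordant pairs, Tx counts pairs tied
    only in x, Ty pairs tied only in y. Pairs tied in both are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_y) * (untied + tied_x))
    # Returning 0.0 for a degenerate input is a convention of this sketch.
    return (concordant - discordant) / denom if denom else 0.0


def shuffle_null(x, y, n_perm=200, seed=0):
    """Tau-b values after randomly permuting y (a plain permutation null)."""
    rng = random.Random(seed)
    shuffled = list(y)
    taus = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        taus.append(kendall_tau_b(x, shuffled))
    return taus


if __name__ == "__main__":
    # Hypothetical per-item gaps between two models (toy data).
    conf_gap = [0.3, -0.1, 0.0, 0.2, -0.4]
    perf_gap = [1, 0, 0, 1, -1]
    observed = kendall_tau_b(conf_gap, perf_gap)
    null = shuffle_null(conf_gap, perf_gap)
    print(observed, sum(null) / len(null))
```

Averaging the observed tau-b over all pairs and comparing it with the mean of the null values mirrors the "observed" versus "baseline" cards in spirit, though the site's exact aggregation is not documented here.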

Confidence structure: the shared-difficulty factor

[Figure: variance share by eigenvalue rank. PC1 = 37.3% of variance. 19 models · 4 negative eigenvalues.]

Information the model doesn’t use

Benchmark  Model          Internal  Population  Same-context text  Post-hoc text
SciCode    Llama 3.1 70B  0.590     0.657       0.539              0.858

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
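For readers unfamiliar with the metric, a ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way. It assumes the scores are confidence-like signals and the labels mark whether the attempt was correct; the paper's actual procedure is not described on this page, and `roc_auc` is a hypothetical helper with toy inputs.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores.

    Equals the fraction of (positive, negative) pairs in which the
    positive case has the higher score, counting ties as 0.5.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: confidence-like scores and correctness labels (assumed).
    scores = [0.9, 0.4, 0.7, 0.3]
    labels = [1, 0, 0, 1]
    print(roc_auc(scores, labels))  # 0.5: no better than chance here
```

On this reading, 0.5 corresponds to a signal that carries no information about correctness and 1.0 to a signal that separates correct from incorrect attempts perfectly.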

The four metacognitive outcomes on SciCode

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.