Operating-point board
SciCode (scientific code) · Prospective · Fβ
| # | Model | Fβ | Prec | Rec | Task acc |
|---|---|---|---|---|---|
| 1 | gpt-5.2 | 0.739 | 0.70 | 0.79 | 0.58 |
| 2 | claude-haiku-4.5 | 0.730 | 0.68 | 0.79 | 0.56 |
| 3 | deepseek-chat | 0.684 | 0.62 | 0.77 | 0.55 |
| 4 | deepseek-r1 | 0.677 | 0.59 | 0.80 | 0.56 |
| 5 | gemini-2.0-flash-001 | 0.653 | 0.50 | 0.94 | 0.51 |
| 6 | gemini-3-flash-preview | 0.635 | 0.68 | 0.60 | 0.59 |
| 7 | gemini-3.1-pro-preview | 0.623 | 0.72 | 0.55 | 0.64 |
| 8 | claude-3-haiku | 0.596 | 0.42 | 1.00 | 0.40 |
| 9 | llama-3.1-70b-instruct | 0.594 | 0.51 | 0.71 | 0.44 |
| 10 | gpt-4o | 0.565 | 0.65 | 0.50 | 0.51 |
| 11 | mistral-small-3.2-24b-instruct | 0.558 | 0.52 | 0.60 | 0.41 |
| 12 | qwen-2.5-72b-instruct | 0.557 | 0.56 | 0.55 | 0.43 |
| 13 | gpt-4o-mini | 0.556 | 0.54 | 0.57 | 0.37 |
| 14 | gemini-2.5-pro | 0.542 | 0.70 | 0.44 | 0.61 |
| 15 | gemini-2.5-flash | 0.520 | 0.56 | 0.48 | 0.56 |
| 16 | llama-3.3-70b-instruct | 0.512 | 0.52 | 0.50 | 0.46 |
| 17 | claude-sonnet-4.5 | 0.428 | 0.64 | 0.32 | 0.57 |
| 18 | mistral-medium-3.1 | 0.338 | 0.55 | 0.24 | 0.48 |
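The Fβ column combines the Prec and Rec columns. The board does not state the value of β, so the sketch below uses the standard Fβ definition with β as a parameter. The function name `f_beta` and the default β = 1 are assumptions for illustration. With β = 1, the rounded precision and recall in the rows checked reproduce the reported Fβ to within rounding, which suggests but does not confirm that setting.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    The default beta = 1 is an assumption, not stated on the board.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Rows copied from the table: (precision, recall, reported F-beta).
rows = {
    "gemini-2.0-flash-001": (0.50, 0.94, 0.653),
    "gpt-4o": (0.65, 0.50, 0.565),
}
for name, (p, r, reported) in rows.items():
    print(f"{name}: computed={f_beta(p, r):.3f} reported={reported:.3f}")
```

Because Prec and Rec are shown to two decimals, small mismatches in the third decimal of Fβ are expected for other rows.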
Pairwise signal on SciCode
| Statistic | Value | Note |
|---|---|---|
| Observed correlation | 0.049 | mean τ-b · 0 = no signal |
| Baseline correlation | -0.000 | permutation null · ≈ 0 |
| Model pairs | 153 | unit = pair, not model |
| Significant pairs | 11.1% | p < 0.05 after FDR · not effect size |
[Figure: observed model pairs vs. base-rate-matched null and calibration-preserving null. Observed 5%-95% range: -0.100 to 0.205.]
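The pairwise statistics above can be illustrated with a small sketch. The 18 models on the board give C(18, 2) = 153 unordered pairs, and 11.1% of 153 corresponds to 17 pairs. The page does not say which per-item quantities are correlated or which FDR procedure is used. The input vectors below are hypothetical, and Benjamini-Hochberg is an assumption chosen as the most common FDR control.

```python
from itertools import combinations
from math import comb, sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from all counts
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumption: the board's "FDR" is Benjamini-Hochberg; the page does not say.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank passing the step-up threshold
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


n_models = 18
print("model pairs:", comb(n_models, 2))

# Hypothetical per-item signals for two models (illustrative only).
model_a = [1, 1, 0, 0, 1, 0]
model_b = [1, 0, 0, 1, 1, 0]
print("tau-b:", kendall_tau_b(model_a, model_b))
```

A permutation null of the kind named in the table could be approximated by shuffling one vector many times and recomputing τ-b, though the exact null construction used here is not described on the page.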
Confidence structure: the shared-difficulty factor
Information the model doesn’t use
| Benchmark | Model | Internal | Population | Same-context text | Post-hoc text |
|---|---|---|---|---|---|
| SciCode | Llama 3.1 70B | 0.590 | 0.657 | 0.539 | 0.858 |
Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
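As a reading aid, a ROC AUC of this kind can be computed as the probability that a correctly answered item receives a higher signal score than an incorrectly answered one, with ties counted as one half. The sketch below assumes each column corresponds to a per-item score paired with a binary correctness label. The function name and the sample data are hypothetical and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison (equivalent to the Mann-Whitney U form).

    scores: per-item signal values (e.g. a confidence estimate).
    labels: per-item truthy/falsy correctness flags.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: higher score should indicate a correct answer.
scores = [0.9, 0.7, 0.6, 0.4, 0.3]
correct = [1, 1, 0, 1, 0]
print("AUC:", roc_auc(scores, correct))
```

An AUC of 0.5 means the signal does not separate correct from incorrect items, which gives a reference point for the values in the table.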
The four metacognitive outcomes on SciCode
No curated cases for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.