
LegalBench (legal reasoning)

Unaided means: answer without the legal definition provided
Probes: Prospective

Operating-point board

LegalBench (legal reasoning) · Prospective · Fβ

 #  Model                           Fβ     Prec  Rec   Task acc
 1  gemini-3.1-pro-preview          0.927  0.88  0.98  0.87
 2  mistral-medium-3.1              0.863  0.85  0.88  0.86
 3  llama-3.1-70b-instruct          0.861  0.88  0.84  0.88
 4  gemini-3-flash-preview          0.859  0.89  0.83  0.88
 5  mistral-small-3.2-24b-instruct  0.854  0.85  0.86  0.85
 6  gpt-5.2                         0.852  0.85  0.85  0.85
 7  deepseek-r1                     0.851  0.88  0.83  0.87
 8  gemini-2.5-pro                  0.822  0.87  0.78  0.87
 9  claude-3.5-sonnet               0.821  0.89  0.76  0.86
10  gemini-2.5-flash                0.810  0.84  0.78  0.84
11  claude-sonnet-4.5               0.795  0.87  0.73  0.86
12  llama-3.3-70b-instruct          0.763  0.88  0.68  0.88
13  claude-haiku-4.5                0.742  0.87  0.65  0.86
14  gemini-2.0-flash-001            0.734  0.81  0.67  0.85
15  deepseek-chat                   0.674  0.83  0.57  0.84
16  claude-3-haiku                  0.660  0.89  0.53  0.91
17  qwen-2.5-72b-instruct           0.626  0.83  0.50  0.84
18  gpt-4o-mini                     0.600  0.86  0.46  0.85
19  gpt-4o                          0.576  0.82  0.44  0.83
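
The Fβ column combines the precision and recall columns. As a minimal sketch, the standard Fβ formula is shown below. The board's β setting is not stated in this view; assuming β = 1, the formula reproduces the listed values to within the rounding of the two-decimal precision and recall columns. The function name `f_beta` is illustrative, not taken from the site.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: a weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta must be > 0")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention: score is 0 when both precision and recall are 0
    return (1 + b2) * precision * recall / denom


# Row 1 of the board (precision 0.88, recall 0.98), assuming beta = 1.
row1 = f_beta(0.88, 0.98)
print(round(row1, 3))
```

Because precision is similar across models (0.81 to 0.89), most of the spread in Fβ on this board comes from recall, which ranges from 0.44 to 0.98.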

Pairwise signal on LegalBench

Observed correlation: 0.024 (mean τ-b; 0 = no signal)
Baseline correlation: -0.000 (permutation null; ≈ 0)
Model pairs: 171 (unit = pair, not model)
Significant pairs: 25.7% (p < 0.05 after FDR; not an effect size)
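
The pair count follows from the 19 models on the board (19 · 18 / 2 = 171), and 25.7% of 171 is about 44 pairs. The page says only "FDR" and does not name the correction procedure; the sketch below assumes Benjamini-Hochberg, which is a common choice, and uses made-up p-values purely for illustration.

```python
from math import comb


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Assumption: Benjamini-Hochberg step-up procedure (the page does not say
    which FDR method it uses).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


n_pairs = comb(19, 2)  # unordered pairs among 19 models

# Hypothetical p-values, not data from the page.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

As the card notes, the share of significant pairs says how often a pair clears the threshold, not how large the correlation is.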
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b). Horizontal axis from -1.0 to 1.0. Series: observed model pairs, base-rate-matched null, calibration-preserving null. 5%-95% of observed values: -0.060 to 0.127.]
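
For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of Kendall's tau-b, the tie-corrected rank correlation named in the figure. How the page builds the two sequences for each model pair (confidence gaps and performance gaps) is not specified here, so the inputs below are hypothetical.

```python
from math import nan, sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction.

    O(n^2) pairwise version, intended for clarity rather than speed.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    if denom == 0:
        return nan  # undefined when one sequence is constant
    return (concordant - discordant) / denom


# Hypothetical gaps, not data from the page.
confidence_gaps = [1, 2, 2, 3]
performance_gaps = [1, 2, 3, 3]
tau = kendall_tau_b(confidence_gaps, performance_gaps)
```

A value near 0, like the observed mean of 0.024, means the ordering of one sequence says almost nothing about the ordering of the other.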

Confidence structure: the shared-difficulty factor

[Figure: variance share by eigenvalue rank. PC1 = 36.4% of variance. 19 models · 1 negative eigenvalue.]
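
A rough sketch of how a PC1 variance share can be computed from a model-by-model correlation matrix: estimate the largest eigenvalue and divide by the trace, which equals the sum of all eigenvalues. The page does not say how its matrix is estimated or how the negative eigenvalue enters the normalization, so both the toy matrix and the trace normalization below are assumptions. A negative eigenvalue means the matrix is not positive semidefinite, which can happen when correlations are estimated pair by pair.

```python
from math import sqrt


def top_eigenvalue(matrix, iterations=200):
    """Estimate the dominant eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    # Slightly uneven start vector so it is unlikely to be orthogonal to PC1.
    v = [1.0 + 0.01 * i for i in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("matrix maps the iterate to the zero vector")
        v = [c / norm for c in w]
    # Rayleigh quotient; v has unit length after the loop.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))


def pc1_variance_share(matrix):
    """Largest eigenvalue divided by the trace (the sum of all eigenvalues)."""
    trace = sum(matrix[i][i] for i in range(len(matrix)))
    return top_eigenvalue(matrix) / trace


# Toy 3-model correlation matrix (hypothetical), eigenvalues 2.0, 0.5, 0.5.
toy = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
share = pc1_variance_share(toy)
```

For a 19-model correlation matrix the trace is 19, so under this normalization a 36.4% share would correspond to a first eigenvalue of roughly 6.9.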

Information the model doesn’t use

Benchmark   Model         Internal  Population  Same-context text  Post-hoc text
LegalBench  Gemini Flash  0.474     0.532       0.788              0.861

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
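
As a reading aid, ROC AUC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below uses that pairwise definition. Which outcome counts as "positive" in the paper's table is not stated on this page, so the labels and scores here are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC via the pairwise (Mann-Whitney) definition.

    labels: 1 for the positive class, 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

On this scale 0.5 is chance, so the internal and population values (0.474 and 0.532) sit close to chance, while the two text-based readouts (0.788 and 0.861) separate the classes much better.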

The four metacognitive outcomes on LegalBench

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.