LegalBench (legal reasoning) | The Metacognition Bench

Operating-point board

LegalBench (legal reasoning) · Prospective · F_β

Rank	Model	F_β	Precision	Recall	Task accuracy
1	gemini-3.1-pro-preview	0.927	0.88	0.98	0.87
2	mistral-medium-3.1	0.863	0.85	0.88	0.86
3	llama-3.1-70b-instruct	0.861	0.88	0.84	0.88
4	gemini-3-flash-preview	0.859	0.89	0.83	0.88
5	mistral-small-3.2-24b-instruct	0.854	0.85	0.86	0.85
6	gpt-5.2	0.852	0.85	0.85	0.85
7	deepseek-r1	0.851	0.88	0.83	0.87
8	gemini-2.5-pro	0.822	0.87	0.78	0.87
9	claude-3.5-sonnet	0.821	0.89	0.76	0.86
10	gemini-2.5-flash	0.810	0.84	0.78	0.84
11	claude-sonnet-4.5	0.795	0.87	0.73	0.86
12	llama-3.3-70b-instruct	0.763	0.88	0.68	0.88
13	claude-haiku-4.5	0.742	0.87	0.65	0.86
14	gemini-2.0-flash-001	0.734	0.81	0.67	0.85
15	deepseek-chat	0.674	0.83	0.57	0.84
16	claude-3-haiku	0.660	0.89	0.53	0.91
17	qwen-2.5-72b-instruct	0.626	0.83	0.50	0.84
18	gpt-4o-mini	0.600	0.86	0.46	0.85
19	gpt-4o	0.576	0.82	0.44	0.83
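As a reading aid for the F_β column, here is a minimal sketch of the standard F_β formula computed from a row's precision and recall. The page does not state which β it uses, so `beta=1.0` below is an assumption; with that value, the rounded precision and recall of the first row give roughly the listed score, while other rows can differ in the third decimal because the inputs are rounded. The function name `f_beta` is illustrative, not taken from the benchmark's code.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F_beta score: a weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta must be > 0")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Row 1 of the table (gemini-3.1-pro-preview): Precision 0.88, Recall 0.98.
# beta = 1 is an assumption; the page does not say which beta it reports.
row1_score = f_beta(0.88, 0.98, beta=1.0)  # about 0.927
```

Because recall varies far more than precision across the board (0.44 to 0.98 versus 0.81 to 0.89), the ranking is driven mostly by recall under any β near 1.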

Pairwise signal on LegalBench

Metric	Value	Note
Observed correlation	0.024	mean τ-b · 0 = no signal
Baseline correlation	-0.000	permutation null · ≈ 0
Model pairs	171	unit = pair, not model
Significant pairs	25.7%	p<0.05 after FDR · not effect size
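The quantities in this panel can be illustrated with a small, dependency-free sketch: the number of unordered pairs among the 19 listed models, Kendall's τ-b between two per-item vectors, a simple shuffle-based permutation p-value, and a Benjamini-Hochberg FDR step. This is a sketch under stated assumptions, not the benchmark's pipeline: the page does not say which per-item quantity is correlated, the plain shuffle here is a stand-in for the page's own null models, and Benjamini-Hochberg is assumed as the FDR procedure. All function names are hypothetical.

```python
import random
from itertools import combinations
from math import comb, sqrt

# 19 models in the table above give C(19, 2) unordered model pairs.
n_pairs = comb(19, 2)  # 171


def kendall_tau_b(x, y):
    """Kendall's tau-b with tie correction (simple O(n^2) pairwise version)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided p-value from shuffling y (an assumed, simple null model)."""
    rng = random.Random(seed)
    observed = abs(kendall_tau_b(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(kendall_tau_b(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected
```

In this setup each of the 171 pairs would contribute one τ-b and one p-value; the share of pairs flagged by the FDR step is a count of detectable pairs, which is why the page labels it "not effect size" next to a mean τ-b of only 0.024.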

[Figure legend: observed model pairs; base-rate-matched null; calibration-preserving null. 5%-95% range of observed values: -0.060 to 0.127.]

Confidence structure: the shared-difficulty factor

Information the model doesn’t use

Benchmark	Model	Internal	Population	Same-context text	Post-hoc text
LegalBench	Gemini Flash	0.474	0.532	0.788	0.861

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
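For readers unfamiliar with the metric, the sketch below computes a ROC AUC as the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one, counting ties as half. It assumes the label is whether the model's answer was correct and the score is some confidence signal; the page does not spell out either, and the function name `roc_auc` is illustrative.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels are 1 (positive) or 0 (negative); ties between a positive and a
    negative score count as half a win.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): scores that separate the two classes perfectly.
toy_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # 1.0
```

On this scale 0.5 is chance, so the table's Internal value of 0.474 sits slightly below chance, while the same-context and post-hoc text readouts (0.788 and 0.861) carry substantially more signal.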

The four metacognitive outcomes on LegalBench

No curated cases for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.