llama-3.1-70b-instruct | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for llama-3.1-70b-instruct, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.45	0.45	0.64	-0.01 calibrated
MMLU-Pro (knowledge)	0.36	0.90	0.55	+0.54 overconfident
LegalBench (legal reasoning)	0.88	0.84	0.86	-0.04 calibrated
MathBench (competition math)	0.63	0.96	0.78	+0.33 overconfident
OmniMath (advanced math)	0.21	0.37	0.46	+0.17 overconfident
SciCode (scientific code)	0.44	0.62	0.59	+0.18 overconfident
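
The Leans column appears to be mean confidence minus task accuracy. The sketch below recomputes it from the rounded values in the table. The 0.05 band that separates "calibrated" from "overconfident" and the "underconfident" label are assumptions made for illustration; the page states neither. Because the inputs are already rounded, the recomputed gap can differ from the published one by 0.01 (SQuAD and OmniMath).

```python
# Rounded values copied from the table above: (task accuracy, mean confidence).
ROWS = {
    "SQuAD": (0.45, 0.45),
    "MMLU-Pro": (0.36, 0.90),
    "LegalBench": (0.88, 0.84),
    "MathBench": (0.63, 0.96),
    "OmniMath": (0.21, 0.37),
    "SciCode": (0.44, 0.62),
}

# Hypothetical cutoff: the page does not say where "calibrated" ends.
CALIBRATED_BAND = 0.05


def lean(task_acc, confidence, band=CALIBRATED_BAND):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if abs(gap) <= band:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        # Label assumed by symmetry; it does not appear in the table.
        label = "underconfident"
    return gap, label


def lean_table(rows=ROWS):
    """Recompute the gap and label for every benchmark row."""
    return {name: lean(acc, conf) for name, (acc, conf) in rows.items()}


if __name__ == "__main__":
    for name, (gap, label) in lean_table().items():
        print(f"{name:<12}{gap:+.2f}  {label}")
```

With this assumed band, every recomputed label matches the one shown in the table.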

In the full cloud

[Scatter plot not reproduced. Legend: llama-3.1-70b-instruct conditions; all other model/condition points; line of equal relative confidence and pass rate.]

Pairwise signal: pairs involving llama-3.1-70b-instruct

Match accuracy controls for the performance base-rate gap

llama-3.1-70b-instruct pairs	18 / 171
llama-3.1-70b-instruct mean tau	+0.050
All-pairs mean	+0.037
llama-3.1-70b-instruct p<0.05	11 (61.1%)
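
The counts above can be cross-checked with a little arithmetic. This is a sketch that assumes the 171 pairs are all unordered pairs of distinct models, which the page does not state explicitly. Under that assumption there are 19 models, each model appears in 18 pairs, and 11 of 18 is 61.1%. The function name is hypothetical.

```python
import math


def models_from_pair_count(total_pairs):
    """Solve n * (n - 1) / 2 == total_pairs for a whole number of models n."""
    n = (1 + math.isqrt(1 + 8 * total_pairs)) // 2
    if math.comb(n, 2) != total_pairs:
        raise ValueError(f"{total_pairs} is not a count of unordered pairs")
    return n


n_models = models_from_pair_count(171)      # assumes all unordered model pairs
pairs_per_model = n_models - 1              # pairs that involve one given model
share_significant = 11 / pairs_per_model    # 11 pairs reported with p<0.05

if __name__ == "__main__":
    print(n_models, pairs_per_model, f"{share_significant:.1%}")
```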

[Plot not reproduced. Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; llama-3.1-70b-instruct pair (filled = p<0.05); llama-3.1-70b-instruct mean; all-pairs mean.]

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.