llama-3.3-70b-instruct | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for llama-3.3-70b-instruct, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans (confidence - task acc)
SQuAD (factual recall)	0.45	0.39	0.59	-0.06 cautious
MMLU-Pro (knowledge)	0.34	0.89	0.53	+0.55 overconfident
LegalBench (legal reasoning)	0.88	0.68	0.76	-0.20 cautious
MathBench (competition math)	0.74	0.98	0.85	+0.24 overconfident
OmniMath (advanced math)	0.33	0.51	0.54	+0.18 overconfident
SciCode (scientific code)	0.46	0.44	0.51	-0.02 calibrated
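
The "Leans" values in the table match confidence minus task accuracy, so the column can be reproduced with a few lines of Python. This is a minimal sketch, not the benchmark's own code: the `lean` helper and the `TOLERANCE` cutoff of 0.05 are assumptions chosen only because they agree with the labels shown (-0.02 reads as calibrated, -0.06 as cautious). The page does not state the actual threshold.

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.45, 0.39),
    ("MMLU-Pro", 0.34, 0.89),
    ("LegalBench", 0.88, 0.68),
    ("MathBench", 0.74, 0.98),
    ("OmniMath", 0.33, 0.51),
    ("SciCode", 0.46, 0.44),
]

# Hypothetical cutoff: gaps smaller than this are treated as "calibrated".
# The source page does not publish the real value.
TOLERANCE = 0.05


def lean(task_acc, confidence, tolerance=TOLERANCE):
    """Return (gap, label), where gap = confidence - task accuracy."""
    gap = round(confidence - task_acc, 2)
    if abs(gap) < tolerance:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        label = "cautious"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name}\t{gap:+.2f} {label}")
```

Under that assumed cutoff, the sketch gives the same sign and label for all six rows as the table does.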

In the full cloud

[Scatter plot legend: llama-3.3-70b-instruct conditions; all other model/condition points; equal relative confidence and pass rate.]

Pairwise signal: pairs involving llama-3.3-70b-instruct

Match accuracy controls for the performance base-rate gap

Metric	Value
llama-3.3-70b-instruct pairs	18 / 171
llama-3.3-70b-instruct mean tau	+0.034
All-pairs mean	+0.037
llama-3.3-70b-instruct p<0.05	5 (27.8%)
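
The pair counts above can be sanity-checked with simple arithmetic. The sketch below assumes that the 171 pairs are all unordered pairs of distinct models. That reading is an inference from the numbers, not something the page states. Under it, 171 pairs correspond to 19 models, each appearing in 18 pairs, and 5 significant pairs out of 18 is 27.8%.

```python
from math import comb

# Figures copied from the summary above.
total_pairs = 171   # all model pairs
model_pairs = 18    # pairs involving llama-3.3-70b-instruct
significant = 5     # of those, pairs with p<0.05

# Assumption: total_pairs counts every unordered pair of distinct models,
# so it equals comb(n_models, 2). Find the n that satisfies this.
n_models = next(n for n in range(2, 1000) if comb(n, 2) == total_pairs)

# Each model is paired once with every other model.
pairs_per_model = n_models - 1

share = significant / model_pairs
print(n_models, pairs_per_model, f"{share:.1%}")
```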

[Plot legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; llama-3.3-70b-instruct pair (filled = p<0.05); llama-3.3-70b-instruct mean; all-pairs mean.]

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.