Comparing frontier language models across composite intelligence indices
(Composite Index 1.0)
Composite Index 1.0 is a 50/50 weighted average of Artificial Analysis Intelligence Index 4.0 and Vals AI Index. Vals scores (0-1 scale) are scaled to 0-100 before averaging.
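A minimal sketch of the Composite Index 1.0 calculation, using hypothetical example scores (the function name and inputs are illustrative, not from the source):

```python
def composite_index_1(aa_score: float, vals_score: float) -> float:
    """50/50 weighted average of Artificial Analysis Intelligence
    Index 4.0 (already 0-100) and Vals AI Index (0-1, rescaled
    to 0-100 before averaging)."""
    return 0.5 * aa_score + 0.5 * (vals_score * 100)

print(composite_index_1(60.0, 0.55))  # -> 57.5
```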
Composite Index 2.0 is a weighted average of Artificial Analysis Intelligence Index 4.0, Vals Multimodal Index, ARC-AGI-2, and Epoch Capabilities Index (ECI). ECI receives an adaptive weight of 30% × (1 − CI_width / ECI_range), where CI_width is the width of the model's 90% confidence interval and ECI_range is the spread of ECI scores across Composite 2.0 models. Models with tighter confidence intervals (more reliable ECI estimates) therefore receive higher ECI weight, while models with wider intervals receive less. The remaining weight after ECI is distributed in a 40/40/20 ratio across AA, Vals Multimodal, and ARC-AGI-2; when a model has no ECI score, the composite falls back to a straight 40/40/20 (AA/Vals/ARC-AGI-2) weighting. ECI scores are z-score normalized to match the weighted (40/40/20) mean and standard deviation of the other three components across Composite 2.0 models.
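The adaptive weighting above can be sketched as follows. All inputs are assumed to be on a common 0-100 scale, with `eci_norm` being the ECI score after z-score normalization; names and inputs are illustrative, not from the source:

```python
def composite_index_2(aa, vals_mm, arc, eci_norm=None,
                      ci_width=None, eci_range=None):
    """Sketch of Composite Index 2.0 weighting.

    aa, vals_mm, arc: AA Index 4.0, Vals Multimodal, ARC-AGI-2 scores.
    eci_norm: ECI score after z-score normalization (None if missing).
    ci_width: width of the model's 90% confidence interval on ECI.
    eci_range: spread of ECI scores across Composite 2.0 models.
    """
    if eci_norm is None:
        # Fallback: no ECI score -> straight 40/40/20 weighting.
        return 0.4 * aa + 0.4 * vals_mm + 0.2 * arc
    # Tighter confidence interval -> larger ECI weight, capped at 30%.
    w_eci = 0.30 * (1 - ci_width / eci_range)
    rest = 1 - w_eci  # distributed 40/40/20 across the other three
    return rest * (0.4 * aa + 0.4 * vals_mm + 0.2 * arc) + w_eci * eci_norm
```

For example, a model with `ci_width=5` and `eci_range=20` gets an ECI weight of 0.30 × (1 − 0.25) = 22.5%, with the remaining 77.5% split 40/40/20 across the other components.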
Artificial Analysis Intelligence Index 4.0 combines 10 evaluations across four equally-weighted categories: Agents (GDPval-AA, τ²-Bench Telecom), Coding (Terminal-Bench Hard, SciCode), General (AA-LCR, AA-Omniscience, IFBench), and Scientific Reasoning (Humanity's Last Exam, GPQA Diamond, CritPt).
Vals AI Index aggregates performance across three economic sectors weighted by their contribution to U.S. GDP: Finance (~$2T; CorpFin, Finance Agent), Law (~$360B; CaseLaw), and Coding (~$1.4T; SWE-Bench, Terminal Bench 2). The Multimodal Index additionally includes Education (~$270B; SAGE for grading handwritten student work).
ARC-AGI-2 tests AI reasoning systems on tasks requiring symbolic interpretation, compositional reasoning, and contextual rule application. Pure LLMs score 0%, while AI reasoning systems achieve only single-digit percentages, yet humans can solve every task.
Epoch Capabilities Index (ECI) is a composite metric from Epoch AI that uses Item Response Theory (IRT) to synthesize scores from 37 benchmarks into a single capability scale. IRT accounts for benchmark difficulty and steepness, enabling fair comparisons even when models are evaluated on different subsets of benchmarks. The scale is anchored so that Claude 3.5 Sonnet = 130 and GPT-5 = 150. Models in our dataset have between 4 and 21 ECI benchmarks, with a median of 8. The 37 component benchmarks span knowledge, reasoning, coding, and agentic tasks.
Cost Index 1.0 combines cost per task from Vals AI Index and total cost to complete the benchmark from Artificial Analysis, using 50/50 weighting. Cost Index 2.0 adds cost per task from ARC-AGI-2, using 40/40/20 weighting. Values are log-normalized across models with available data, then combined as a weighted average. Higher values indicate more expensive models.
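The log-normalization step can be sketched as below. This is one plausible reading of "log-normalized" (min-max scaling of log costs to [0, 1]); the source does not give the exact formula, so treat the details as an assumption:

```python
import math

def log_normalize(costs):
    """Map raw per-model costs to [0, 1] via min-max scaling in
    log space, so order-of-magnitude differences dominate.
    NOTE: illustrative assumption, not the source's exact formula."""
    logs = [math.log(c) for c in costs]
    lo, hi = min(logs), max(logs)
    return [(x - lo) / (hi - lo) for x in logs]

def cost_index_2(vals_n, aa_n, arc_n):
    """Cost Index 2.0: 40/40/20 over the normalized cost components."""
    return 0.4 * vals_n + 0.4 * aa_n + 0.2 * arc_n
```

Higher values still indicate more expensive models, since min-max scaling preserves ordering.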
Speed Index 1.0 combines latency from Vals AI Index and end-to-end response time from Artificial Analysis, using 50/50 weighting. Speed Index 2.0 combines latency from Vals Multimodal Index and end-to-end response time from Artificial Analysis, also using 50/50 weighting (since ARC-AGI-2 does not report speed metrics). Values are log-normalized across models in each composite. Higher values indicate slower models.
Composite Index 1.0 includes all models present in both Vals AI Index and Artificial Analysis Index 4.0. Composite Index 2.0 requires Vals Multimodal Index, ARC-AGI-2, and Artificial Analysis Index 4.0, with ECI incorporated via adaptive weighting when available. Artificial Analysis has the broadest coverage and Vals covers most models; ECI covers 26 of 39 models, while ARC-AGI-2 is the limiting factor, as it only publishes results for frontier models and lab partners (primarily major American labs).
Note that Vals periodically updates their model roster—adding new models, removing older ones, and revising scores over time as they rerun benchmarks. Models that Vals drops retain their last known scores in our dataset. As a result, composite scores for older models may reflect Vals data from a previous snapshot rather than the current leaderboard.
Notable gaps include GPT 5.3 Codex (recently released). We hope to add this model shortly.
Special thank you to Artificial Analysis, Vals AI, ARC Prize, and Epoch AI for their work on benchmarking frontier AI systems.
In the spirit of transparency, you can download all of the data here.