📊 Live Benchmark Data

AI Model Benchmarks
2026 Comparison

MMLU, HumanEval, MATH, GPQA, SWE-bench, and more: sortable, filterable, and updated for May 2026. Click any column header to sort.

[Interactive comparison table: Model | Provider | MMLU | HumanEval | MATH | GPQA | SWE-bench | Input $/1M (USD per million input tokens)]

Last updated May 2026. Scores are taken from official model cards and papers. — = not reported.

MMLU
Massive Multitask Language Understanding. 57 academic subjects. Tests breadth of knowledge. Max: 100%
HumanEval
Python code generation. 164 hand-written coding problems. Tests functional correctness; a scoring sketch follows these definitions. Max: 100%
MATH
High-school competition math. 12,500 problems. Tests mathematical reasoning. Max: 100%
GPQA
Graduate-level science questions at PhD difficulty. 448 questions written by domain experts. Max: 100%
SWE-bench
Real-world GitHub issue resolution. Tests ability to navigate codebases and fix bugs autonomously. Max: 100%
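
How a HumanEval-style score is produced: each model completion is executed against the problem's hand-written unit tests, a problem counts as solved only if every test passes, and the reported score is the percentage of the 164 problems solved. Below is a minimal Python sketch of that check; the problem, tests, and completion are illustrative stand-ins, not real HumanEval data.

# Minimal sketch of HumanEval-style functional-correctness scoring.
# The task, tests, and model completion below are illustrative, NOT
# taken from the real HumanEval dataset.

PROBLEM = (
    "def running_max(nums):\n"
    '    """Return a list where element i is the max of nums[:i+1]."""\n'
)

# A hypothetical model completion: the function body the model writes.
COMPLETION = (
    "    out, best = [], float('-inf')\n"
    "    for x in nums:\n"
    "        best = max(best, x)\n"
    "        out.append(best)\n"
    "    return out\n"
)

def check(candidate):
    """Hand-written unit tests; all must pass for the problem to count."""
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([7]) == [7]
    assert candidate([]) == []

def solved(problem: str, completion: str) -> bool:
    """Execute prompt + completion, then run the tests; pass/fail only."""
    namespace = {}
    try:
        exec(problem + completion, namespace)  # real harnesses sandbox this step
        check(namespace["running_max"])
        return True
    except Exception:
        return False

if __name__ == "__main__":
    # A model's HumanEval score is the percentage of the 164 problems
    # whose completion passes every test.
    print("solved:", solved(PROBLEM, COMPLETION))

Real evaluation harnesses typically run each completion in a sandboxed process with a timeout, but the pass/fail decision per problem works as sketched above.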

Get the most out of top AI models

PromptChief gives you prompt templates optimized for each model — code prompts for Claude, creative prompts for GPT-5, research prompts for Gemini.

Browse Prompt Hub →