📊 Live Benchmark Data

AI Model Benchmarks
2026 Comparison

MMLU, HumanEval, MATH, GPQA, SWE-bench, and more: sortable, filterable, and updated for May 2026. Click any column header to sort.

[Interactive comparison table: Model | Provider | MMLU | HumanEval | MATH | GPQA | SWE-bench | Input $/1M (USD per million input tokens)]

Last updated May 2026. Scores are taken from official model cards and papers. — = not reported.

MMLU
Massive Multitask Language Understanding. 57 academic subjects. Tests breadth of knowledge. Max: 100%
HumanEval
Python code generation. 164 hand-written coding problems. Tests functional correctness; a scoring sketch follows these definitions. Max: 100%
MATH
High-school competition math. 12,500 problems. Tests mathematical reasoning. Max: 100%
GPQA
Graduate-level science questions at PhD difficulty. 448 questions written by domain experts. Max: 100%
SWE-bench
Real-world GitHub issue resolution. Tests ability to navigate codebases and fix bugs autonomously. Max: 100%
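
How a HumanEval-style score is produced: each model completion is executed against the problem's hand-written unit tests, a problem counts as solved only if every test passes, and the reported score is the percentage of the 164 problems solved. Below is a minimal Python sketch of that check; the problem, tests, and completion are illustrative stand-ins, not real HumanEval data.

# Minimal sketch of HumanEval-style functional-correctness scoring.
# The task, tests, and model completion below are illustrative, NOT
# taken from the real HumanEval dataset.

PROBLEM = (
    "def running_max(nums):\n"
    '    """Return a list where element i is the max of nums[:i+1]."""\n'
)

# A hypothetical model completion: the function body the model writes.
COMPLETION = (
    "    out, best = [], float('-inf')\n"
    "    for x in nums:\n"
    "        best = max(best, x)\n"
    "        out.append(best)\n"
    "    return out\n"
)

def check(candidate):
    """Hand-written unit tests; all must pass for the problem to count."""
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([7]) == [7]
    assert candidate([]) == []

def solved(problem: str, completion: str) -> bool:
    """Execute prompt + completion, then run the tests; pass/fail only."""
    namespace = {}
    try:
        exec(problem + completion, namespace)  # real harnesses sandbox this step
        check(namespace["running_max"])
        return True
    except Exception:
        return False

if __name__ == "__main__":
    # A model's HumanEval score is the percentage of the 164 problems
    # whose completion passes every test.
    print("solved:", solved(PROBLEM, COMPLETION))

Real evaluation harnesses typically run each completion in a sandboxed process with a timeout, but the pass/fail decision per problem works as sketched above.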

Get the most out of top AI models

PromptChief gives you prompt templates optimized for each model — code prompts for Claude, creative prompts for GPT-5, research prompts for Gemini.

Browse Prompt Hub →