Benchmark Directory
Explore leaderboards for 2 benchmarks. Each leaderboard includes a value ranking showing which models deliver the best performance per dollar.
MMLU
127 models scored
Massive Multitask Language Understanding: tests knowledge across 57 academic subjects.
GPQA
125 models scored
Graduate-level science questions vetted by domain experts for difficulty.
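The value ranking mentioned above could be computed along the following lines. This is only a minimal sketch: the page does not state how "performance per dollar" is defined, so the score-divided-by-cost formula, the `Entry` fields, and the `value_ranking` helper are assumptions, and the model names and numbers are made-up placeholders rather than leaderboard data.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    model: str        # placeholder model name (hypothetical)
    score: float      # benchmark score in percent (assumed metric)
    cost_usd: float   # assumed cost in dollars to obtain that score


def value_ranking(entries):
    """Order entries by score per dollar, best value first.

    Assumption: value = score / cost. Entries with a non-positive cost
    are skipped because the ratio is undefined for them.
    """
    priced = [e for e in entries if e.cost_usd > 0]
    return sorted(priced, key=lambda e: e.score / e.cost_usd, reverse=True)


# Made-up numbers for illustration only.
entries = [
    Entry("model-a", 80.0, 4.0),    # 20 points per dollar
    Entry("model-b", 70.0, 2.0),    # 35 points per dollar
    Entry("model-c", 90.0, 10.0),   # 9 points per dollar
]

ranked = [e.model for e in value_ranking(entries)]
print(ranked)  # ['model-b', 'model-a', 'model-c']
```

Under this assumed definition, the highest-scoring model (`model-c`) ranks last on value, which is why a value ranking can differ from the raw score ranking.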