Humanity's Last Exam
One of the most challenging benchmarks available
Test your knowledge with extremely challenging questions across diverse academic domains. Covers everything from chess puzzles to quantum physics.
GSM8K
Grade school math word problems
Math word problems that must be broken down into a sequence of logical steps. Designed to test multi-step mathematical reasoning.
MMLU
Massive multitask language understanding
Multiple-choice questions across 57 academic subjects from elementary to professional level. Tests broad knowledge and reasoning.
HumanEval
Programming problem solving
Python coding challenges that test algorithmic thinking and code generation. Each task asks you to complete a function from its signature and docstring.
ARC Challenge
Science reasoning questions
Grade-school-level science questions that require complex reasoning beyond simple fact recall. Tests scientific thinking.
HellaSwag
Commonsense reasoning
Tests commonsense reasoning about everyday situations. Given a context, pick the most plausible continuation from multiple-choice options.
WinoGrande
Pronoun resolution
Fill-in-the-blank tasks that require commonsense reasoning to resolve pronoun references correctly.
TruthfulQA
Truthful question answering
Questions designed to test whether answers are both truthful and informative, avoiding common misconceptions.
BoolQ
Yes/no reading comprehension
Answer yes/no questions based on Wikipedia passages, requiring reading comprehension and reasoning skills.
PIQA
Physical interaction reasoning
Questions about everyday physical reasoning. Given a goal, choose the most sensible solution from the options provided.
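
Several of the benchmarks above (MMLU, ARC Challenge, HellaSwag, PIQA) use a multiple-choice format, which is typically scored as plain accuracy. The following is a minimal sketch of that scoring, not the official evaluation code of any of these benchmarks. The `MultipleChoiceItem` class, the `accuracy` function, and the sample questions are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    """One hypothetical multiple-choice question (not a real benchmark record)."""
    question: str
    choices: List[str]
    answer_index: int  # position of the correct choice in `choices`


def accuracy(items: List[MultipleChoiceItem], predictions: List[int]) -> float:
    """Return the fraction of items whose predicted index matches the answer."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


# Made-up questions in the style of a physical-reasoning and a math item.
items = [
    MultipleChoiceItem(
        question="Which tool is the most sensible choice for driving a nail?",
        choices=["a hammer", "a sponge"],
        answer_index=0,
    ),
    MultipleChoiceItem(
        question="What is 7 * 8?",
        choices=["54", "56", "58", "64"],
        answer_index=1,
    ),
]

predictions = [0, 2]  # first answer correct, second wrong
score = accuracy(items, predictions)
print(f"{score:.2f}")  # prints 0.50
```

Open-ended benchmarks such as GSM8K and HumanEval are scored differently (by comparing a final numeric answer or by running unit tests), so this sketch applies only to the multiple-choice entries.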