Artificial intelligence is evolving at a staggering pace, and researchers are now putting it to what they call Humanity’s Last Exam (HLE)—a test designed to challenge AI models with the toughest academic questions ever compiled. Experts predict that within the next year, AI could dramatically improve its accuracy, bringing it closer to mastering knowledge at a human level.
The Exam Designed to Outsmart AI
HLE isn’t just another set of routine questions. It was created by specialists from the Center for AI Safety and Scale AI, a for-profit company that works with major tech firms to refine AI training data. Their goal? To design a test so challenging that even the most advanced large language models (LLMs), such as ChatGPT, Gemini, and DeepSeek, struggle to score above a failing grade.
HLE pulls from over 2,700 expert-submitted questions, spanning disciplines from mathematics and medicine to engineering and humanities. Any questions that today’s AI models could easily answer were discarded. Instead, the exam focuses on problems requiring deeper reasoning, specialized knowledge, and complex interpretations—things AI has traditionally struggled with.
The results so far? AI models have flunked spectacularly, scoring between 3 and 14 percent. But that may not last long.
AI Models Are Rapidly Improving
The latest study suggests that by the end of 2025, LLMs could achieve at least 50 percent accuracy on the test. That’s a massive leap, considering the difficulty of the questions.
According to the researchers:
“HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.”
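To make that description concrete, each exam item can be pictured as a small record: a question, an answer type (multiple-choice or short-answer), and a single verifiable reference answer. The Python sketch below is purely illustrative; the field names and the sample question are invented, not taken from the actual dataset.

```python
# Hypothetical shape of one exam item: a short-answer question with a single,
# unambiguous reference answer that an automated grader can verify.
# Field names and content are illustrative only, not from the real dataset.
sample_item = {
    "subject": "Mathematics",
    "answer_type": "short_answer",        # or "multiple_choice"
    "question": "What is the smallest prime number greater than 100?",
    "reference_answer": "101",
}

def auto_gradable(item: dict) -> bool:
    """An item qualifies if its format supports automated checking and it has a reference answer."""
    return item["answer_type"] in {"short_answer", "multiple_choice"} and bool(item["reference_answer"])

print(auto_gradable(sample_item))  # True
```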
The test is structured as follows:
- 41% Mathematics
- 11% Biology & Medicine
- 10% Computer Science
- 9% Physics
- 9% Humanities & Social Sciences
- 6% Chemistry
- 5% Engineering
- 9% Other topics
Examples of the kinds of challenges LLMs face include translating ancient Roman inscriptions, identifying missing links in chemical reactions, and solving highly advanced mathematical equations. One question even asks AI about itself—testing whether it truly understands its own limitations.
AI’s Next Step: Recognizing Uncertainty
One of AI’s biggest flaws is overconfidence—it often provides an answer even when it has no idea if it’s correct. To address this, researchers are training AI models to evaluate their own uncertainty, forcing them to assess confidence levels before responding.
In the next phase of AI development, models will not only give answers but will also provide a confidence score from 0 to 100 percent. The idea is to move away from blind guessing and towards an approach that mirrors human uncertainty—where admitting “I don’t know” is sometimes the best answer.
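As a rough illustration of how such confidence scores could be evaluated, the sketch below bins a model's stated confidences and measures how far each bin's average confidence drifts from its actual accuracy. The sample data, bin count, and function name are assumptions made for the example; this is not the benchmark's exact calibration metric.

```python
# Minimal sketch: checking whether a model's stated confidence (0-100%)
# matches how often it is actually right. The sample data is invented
# purely for illustration; HLE's own calibration metric may differ.

def calibration_error(records, n_bins=10):
    """records: list of (confidence_percent, was_correct) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf / 100 * n_bins), n_bins - 1)  # bin this confidence falls into
        bins[idx].append((conf, correct))

    total, gap = len(records), 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket) / 100   # mean stated confidence
        accuracy = sum(ok for _, ok in bucket) / len(bucket)       # fraction actually correct
        gap += len(bucket) / total * abs(avg_conf - accuracy)      # weighted confidence/accuracy gap
    return gap

# Hypothetical model outputs: (stated confidence %, answer was correct?)
sample = [(95, False), (90, True), (80, False), (60, True), (30, False), (99, False)]
print(f"Calibration error: {calibration_error(sample):.2f}")
```

A perfectly calibrated model would score close to zero here: when it says 80 percent, it should be right about 80 percent of the time.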
Responses to HLE are graded by another AI model, GPT-4o, which checks whether a slight variation of the correct answer still counts. This is similar to how a contestant on Jeopardy! might answer “T. rex” instead of “Tyrannosaurus rex” and still be awarded points.
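That grading step can be thought of as a yes/no equivalence check handed to a judge model. The following sketch shows one plausible shape for it; the prompt wording, the `call_judge_model` placeholder, and the example question are all invented for illustration and are not the benchmark's actual grading code.

```python
# Sketch of LLM-as-judge grading: ask a judge model whether a candidate answer
# is equivalent to the reference answer (e.g. "T. rex" vs "Tyrannosaurus rex").
# call_judge_model is a placeholder for a real call to a judge LLM such as GPT-4o.

JUDGE_TEMPLATE = """Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Does the candidate answer mean the same thing as the reference answer?
Reply with exactly one word: CORRECT or INCORRECT."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: in practice this would send `prompt` to a judge model's API.
    # Here we simply pretend the judge accepted the abbreviation.
    return "CORRECT"

def grade(question: str, reference: str, candidate: str) -> bool:
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
    verdict = call_judge_model(prompt).strip().upper()
    return verdict == "CORRECT"

print(grade("Which dinosaur is nicknamed the 'king of the tyrant lizards'?",
            "Tyrannosaurus rex", "T. rex"))  # True in this toy run
```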
History suggests that AI models rapidly overcome benchmarks, sometimes going from near-zero accuracy to near-perfect scores in just a few training cycles. While today’s LLMs are failing HLE, it may only be a matter of time before they crack the code.
What this means for the future is still up for debate. Will AI become the ultimate academic tool, capable of answering any question with near-perfect accuracy? Or will researchers keep raising the bar, ensuring that human intelligence remains ahead?