GPT-5.5 Surprises by Outperforming Claude Fable 5

# The Rise of Agents' Last Exam

The University of California, Berkeley has introduced a new benchmark called Agents' Last Exam (ALE), designed to evaluate how well AI systems perform economically valuable tasks. In a surprising upset, OpenAI's GPT-5.5 achieved a 24.0% pass rate, narrowly surpassing Anthropic's Claude Fable 5, which scored 22.0%. The benchmark marks a departure from traditional AI assessments: it focuses on real-world applications rather than isolated coding challenges.
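To put the two reported scores in perspective, the gap can be expressed both in percentage points and relative to the runner-up's score. The short sketch below uses only the two pass rates quoted above; the variable names are illustrative and not part of ALE.

```python
# Pass rates as reported in the article (percent of tasks passed).
pass_rates = {"GPT-5.5": 24.0, "Claude Fable 5": 22.0}

# Order the models from highest to lowest pass rate.
leader, runner_up = sorted(pass_rates, key=pass_rates.get, reverse=True)

# Absolute gap in percentage points, and gap relative to the runner-up's score.
point_gap = pass_rates[leader] - pass_rates[runner_up]
relative_gap = point_gap / pass_rates[runner_up] * 100

print(f"{leader} leads {runner_up} by {point_gap:.1f} percentage points "
      f"({relative_gap:.1f}% relative)")
# prints: GPT-5.5 leads Claude Fable 5 by 2.0 percentage points (9.1% relative)
```

Both models still fail more than three quarters of the tasks, so the two-point lead is a small difference at the low end of the scale.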

ALE requires AI models to demonstrate capabilities across five functional layers: Brain, Eyes, Body, Hands, and Feet. Because every layer is exercised, models cannot rely on static question-answering and must instead carry out complex, multi-step interactions. By minimizing reliance on subjective grading, ALE aims to give a more accurate picture of a model's practical abilities across professional workflows.

# Key Features of ALE

- 1,490 task instances, with plans to expand to 5,000.
- Focus on real-world tasks relevant to U.S. federal standards.
- Strict evaluation criteria to eliminate loopholes in grading.

As AI continues to evolve, benchmarks like ALE will play a crucial role in determining which models can truly deliver value in the workforce.