A test for AGI is closer to being solved — but it may be flawed

by Anika Shah - Technology

A controversial test often touted as a benchmark for artificial general intelligence (AGI) has seen a surprising jump in performance – but its creators are not celebrating just yet. Scores on the ARC-AGI benchmark, designed to measure an AI system's ability to learn new skills outside its training data, rose by 20% this year, yet the benchmark's creators admit the test itself may need serious revision.

What is the ARC-AGI Benchmark?

In 2019, Francois Chollet, a leading figure in AI, introduced the ARC-AGI benchmark, aiming to gauge how well AI systems could demonstrate general intelligence. This involves solving puzzle-like problems involving different colored squares, requiring AI to adapt to new, unseen patterns.

Tasks in the ARC-AGI benchmark. Models must solve the ‘problems’ in the top row; the bottom row shows solutions. Image Credits: ARC-AGI
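Each ARC-AGI task provides a handful of input/output grid pairs, and the solver must infer the underlying rule and apply it to a fresh test input. The public ARC dataset stores tasks as JSON, with grids encoded as nested lists of integers 0–9 standing for colors. The toy task and trivial "recoloring" rule below are invented for illustration – real ARC tasks require far richer reasoning than a fixed color substitution:

```python
# A toy task in the spirit of ARC-AGI: "train" pairs demonstrate a rule
# (here, a simple recoloring), and the solver applies it to "test".
# Grids are lists of lists of ints 0-9, matching the public ARC JSON format.
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 1]]}],
}

def learn_color_map(pairs):
    """Infer a per-cell color substitution from the training pairs."""
    mapping = {}
    for pair in pairs:
        for in_row, out_row in zip(pair["input"], pair["output"]):
            for a, b in zip(in_row, out_row):
                if a in mapping and mapping[a] != b:
                    return None  # the rule is not a simple recoloring
                mapping[a] = b
    return mapping

def solve(task):
    """Apply the learned color map to the first test input."""
    mapping = learn_color_map(task["train"])
    grid = task["test"][0]["input"]
    return [[mapping.get(c, c) for c in row] for row in grid]

print(solve(task))  # [[0, 2], [2, 2]]
```

A solver this naive handles exactly one family of transformations; Chollet's point is that a genuinely intelligent system must synthesize the right rule for patterns it has never seen before.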

The Rise and Fall (and Rise Again?) of ARC-AGI

Until this year, the highest-performing AI could only solve about a third of the tasks. This led Chollet to criticize the focus on large language models (LLMs), arguing they lacked true reasoning abilities and instead merely memorized patterns. He proposed that only AI capable of “generating new reasoning” based on novel situations could be considered truly intelligent.

Then, in June 2024, a $1 million competition was launched to incentivize research beyond LLMs. Out of 17,789 submissions, the best achieved a score of 55.5%, a significant jump but still short of the 85% “human-level” threshold.

This success, however, has exposed potential flaws in the ARC-AGI test: many submissions appear to have "brute forced" their way to solutions, raising questions about whether the tasks truly measure general intelligence.

Criticisms and Future Directions

The ARC-AGI benchmark has faced criticism for overstating its claim as a definitive test for AGI, particularly as the definition of AGI itself remains highly debated. Some experts argue that current AI systems, even if not perfect, already surpass humans in many tasks, meeting certain criteria for AGI.

Despite these challenges, Chollet and Mike Knoop, co-founder of the ARC Prize, remain committed to refining the ARC-AGI benchmark. They plan to release a second-generation version in 2025, along with another competition, to address the identified issues and continue pushing the boundaries of AI research.

“We will continue to direct the efforts of the research community towards what we see as the most important unsolved problems in AI, and accelerate the timeline to AGI,” Chollet wrote in an X post.

What’s your take? Do you think AI has already achieved general intelligence? Share your thoughts in the comments!
