Poetiq Beats ARC-AGI-2 – Cheaper & Better Than Gemini?

by Anika Shah - Technology

Poetiq AI Achieves Breakthrough on ARC-AGI-2 Leaderboard


AI lab Poetiq has officially topped the ARC-AGI-2 leaderboard with an approach that hints at a significant shift in how AI systems solve complex reasoning tasks. On November 20, 2025, the company announced preliminary results that have now been verified by the ARC Prize team. The Poetiq system achieved a score of 54% on the Semi-Private Test Set, considerably outperforming the previous state of the art held by Gemini 3 Deep Think, which scored 45%.

Poetiq Achieves Human-Level Performance on AI Reasoning Benchmark, Outperforming Google’s Gemini

A new AI system developed by Poetiq has achieved a significant milestone in artificial intelligence: reaching human-level performance on the ARC-AGI-1 benchmark, a challenging test of abstract reasoning. This achievement is particularly noteworthy not only for the performance itself but also for the cost-effectiveness of the system.

Poetiq’s system attained an accuracy of 83.2% on ARC-AGI-1, surpassing the average human score of 77.7% and exceeding the performance of Google’s Gemini 3 Deep Think. Beyond the accuracy gains, Poetiq’s system reached this milestone at a cost of $30.57 per problem, compared to the $77.16 per problem cost of Gemini 3 Deep Think. This result suggests that progress in AI reasoning is moving away from purely scaling model size and reasoning tokens and toward well-engineered systems that optimize performance at the request layer.

To understand the meaning of this achievement, one must look at the benchmark itself. ARC-AGI-1 (originally known simply as ARC) is based on the Abstraction and Reasoning Corpus introduced by François Chollet in 2019 to measure intelligence defined as efficient skill acquisition rather than the mastery of fixed tasks.

The benchmark consists of grid-based visual puzzles where the solver must infer an underlying rule from a few example input-output pairs and apply it to a new test grid. This format aims to test “core knowledge priors” and generalization, avoiding the pitfalls of benchmarks that can be solved through the memorization of vast training datasets.
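To make the format concrete, here is a toy ARC-style task sketched in Python. The mirroring rule and the grids are invented for illustration; they are not an actual ARC puzzle, only the same shape of problem: infer a rule from example input-output pairs, check it against them, and apply it to a held-out test grid.

```python
# A toy ARC-style task: training pairs demonstrate a hidden rule
# (here, horizontal mirroring); the solver must apply it to a test grid.
# The rule and grids are invented for illustration.
train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0, 0], [0, 4, 0]], [[0, 0, 3], [0, 4, 0]]),
]

def apply_rule(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule against all training pairs before
# trusting it on the held-out test input.
assert all(apply_rule(x) == y for x, y in train_pairs)

test_input = [[5, 0, 0]]
print(apply_rule(test_input))  # → [[0, 0, 5]]
```

Real ARC tasks use larger colored grids and far subtler rules, but the evaluation contract is the same: the rule must be recovered from only a handful of examples.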

## The Rise of ARC: A New Benchmark for Artificial General Intelligence

The pursuit of Artificial General Intelligence (AGI), AI that can perform any intellectual task a human being can, is driving the development of increasingly sophisticated benchmarks. Among these, the Abstraction and Reasoning Corpus (ARC) stands out. Introduced by AI scientist François Chollet, ARC isn’t about memorizing vast datasets; it’s about *reasoning* with limited information, a hallmark of human intelligence.

ARC presents visual puzzles requiring systems to identify patterns and apply abstract rules. Unlike many AI benchmarks focused on specific skills like image recognition or language translation, ARC emphasizes fluid intelligence – the ability to solve novel problems independent of prior knowledge. This makes it a particularly challenging test for current AI models, which often excel at pattern recognition within known parameters but struggle with true abstraction.

The original ARC, released in 2019, quickly became a popular, yet arduous, benchmark. Models that performed well on other tasks often faltered on ARC, highlighting the gap between narrow AI and genuine general intelligence.

The updated ARC-AGI-2, released in March 2025, increased the difficulty to challenge a new generation of hybrid reasoning systems. It includes 1,000 training tasks and targets more complex phenomena such as symbolic interpretation and compositional reasoning. The design explicitly resists brute-force methods. In technical reports from early 2025, leading AI models scored under 5% on ARC-AGI-2, reinforcing the series’ ethos of being easy for humans, hard for AI.


An example of an ARC task. These puzzles require identifying abstract patterns and applying them to new situations. (Source: ARC Prize)

Poetiq: An AI That Solves Unsolvable Puzzles

Poetiq, a new AI system developed by researchers at Stanford, has achieved a significant milestone in artificial intelligence: solving puzzles from the ARC Prize competition that were previously considered unsolvable. The ARC Prize, designed to challenge AI’s reasoning capabilities, features complex tasks requiring a blend of common sense, logical deduction, and creative problem-solving. Poetiq’s success isn’t due to sheer computational power, but rather a novel approach to how it interacts with large language models (LLMs).

Poetiq’s success relies on a move away from standard chain-of-thought (CoT) prompting toward an iterative process known as “refinement.” In this approach, the prompt acts as an interface rather than the sole driver of intelligence. The system does not simply ask a question and accept the output; instead, it generates a potential solution, receives feedback, analyzes that feedback, and uses the underlying large language model (LLM) to refine the answer. This creates a multi-step, self-improving loop that incrementally builds the correct solution.

A key component of Poetiq’s architecture is its “Self-Auditing” feature. The system monitors its own progress and decides when it has gathered enough information or produced a satisfactory solution. This capability allows the system to terminate the process at the optimal moment, preventing wasteful computation. As a result, the system makes fewer than two requests on average per problem, comfortably within the two attempts permitted by the ARC-AGI rules.
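The propose-audit-refine loop described above can be sketched in a few lines of Python. The function names, the feedback protocol, and the toy model below are illustrative assumptions, not Poetiq’s actual implementation; the point is only the control flow, where a verifier acts as the self-audit and stops the loop as soon as a candidate passes.

```python
def solve_with_refinement(task, llm, verify, max_steps=5):
    """Propose -> audit -> refine loop (illustrative sketch).

    llm:    callable prompt -> candidate answer
    verify: callable (task, candidate) -> feedback string, or None
            when the candidate is judged satisfactory (self-audit)
    """
    candidate = llm(f"Propose a solution for: {task}")
    for _ in range(max_steps):
        feedback = verify(task, candidate)   # self-auditing step
        if feedback is None:                 # satisfied: stop early
            return candidate
        candidate = llm(
            f"Task: {task}\nAttempt: {candidate}\n"
            f"Feedback: {feedback}\nRefine the attempt."
        )
    return candidate

# Toy stand-ins: the "model" numbers its attempts, and the verifier
# accepts only the second one, forcing exactly one refinement pass.
calls = []
toy_llm = lambda prompt: calls.append(prompt) or f"attempt-{len(calls)}"
toy_verify = lambda task, cand: None if cand == "attempt-2" else "wrong cells"

print(solve_with_refinement("demo-grid", toy_llm, toy_verify))  # → attempt-2
```

Because the loop exits the moment the audit passes, the number of model calls adapts to problem difficulty, which is what keeps the average request count low.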

## Autonomous Agents Achieve Breakthroughs in Cost and Performance with Gemini 3 and GPT-5.1

According to Poetiq, its autonomous agents integrated Gemini 3 and GPT-5.1 within hours of their release. By programmatically addressing problems using multiple model calls, the system redrew the Pareto frontier for cost versus performance, delivering higher accuracy at lower costs across the board.
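A cost-versus-accuracy Pareto frontier of the kind mentioned above is simply the set of configurations no rival beats on both axes at once. The sketch below computes one; the configuration names and numbers are made up for illustration and are not Poetiq’s reported figures.

```python
def pareto_frontier(points):
    """Keep configurations not dominated by any other: a point is
    dominated if some other point is no more expensive, no less
    accurate, and strictly better on at least one of the two.
    points: list of (name, cost_per_problem, accuracy)."""
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda p: p[1])  # order by cost

# Hypothetical (cost, accuracy) points, invented for this example.
configs = [
    ("model-A", 80.0, 0.45),   # expensive, dominated by system-B
    ("system-B", 30.0, 0.54),  # cheaper AND more accurate
    ("model-C", 5.0, 0.30),    # cheapest, so it stays on the frontier
]
print(pareto_frontier(configs))
```

In this toy example model-A falls off the frontier because system-B is both cheaper and more accurate, which is exactly what “redrawing the frontier” means: previously efficient points become dominated.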

Poetiq’s Meta-System Achieves Breakthroughs in AI Reasoning Through Output Refinement

A new meta-system developed by Poetiq is demonstrating significant advancements in AI reasoning capabilities, achieving state-of-the-art results on complex problem-solving tasks like the ARC-AGI-2 benchmark without relying on novel model training. This approach, focused on refining the outputs of existing Large Language Models (LLMs), is gaining traction as a key driver of progress in the field, with 2025 being dubbed the “Year of the Refinement Loop” by the ARC Prize team.


The core innovation lies in Poetiq’s model-agnostic framework, which enhances the performance of LLMs through a process of verification and refinement at the application layer. This means the system doesn’t require changes to the underlying LLM itself, but instead focuses on improving the quality of its responses. The framework’s effectiveness has been demonstrated across a range of models, including those from OpenAI, Anthropic, and xAI.

Notably, the “Poetiq (Grok-4-Fast)” configuration achieved accuracy comparable to significantly more expensive models, while “Poetiq (GPT-OSS-b)” delivered strong performance at a cost of less than one cent per problem. This adaptability highlights the system’s ability to tailor its approach to both the specific task and the characteristics of the LLM being used.

The success of Poetiq’s open-source refinement solution on Google’s Gemini 3 Pro is particularly compelling. The system improved Gemini 3 Pro’s performance on the ARC-AGI-2 benchmark from a baseline of 31% to 54%, demonstrating the potential to substantially enhance AI reasoning without requiring further model training. https://poetiq.ai/blog/gemini-3-pro-arc

Looking forward, Poetiq plans to extend its meta-system beyond abstract puzzles to tackle more complex, long-horizon tasks. The company is investigating how recursive architectures can leverage the existing world knowledge embedded within frontier models. A key focus is improving the mechanisms for knowledge extraction to make them more compatible with LLMs, potentially enabling complex reasoning and retrieval tasks to be solved without constant model updates.
