Poetiq Beats ARC-AGI-2 – Cheaper & Better Than Gemini?

by Anika Shah - Technology

Poetiq AI Achieves Breakthrough on ARC-AGI-2 Leaderboard


AI lab Poetiq has officially topped the ARC-AGI-2 leaderboard with an approach that hints at a significant shift in how AI systems solve complex reasoning tasks. On November 20, 2025, the company announced preliminary results that have now been verified by the ARC Prize team. The Poetiq system achieved a score of 54% on the Semi-Private Test Set, considerably outperforming the previous state of the art held by Gemini 3 Deep Think, which scored 45%.

Poetiq Achieves Human-Level Performance on AI Reasoning Benchmark, Outperforming Google’s Gemini

A new AI system developed by Poetiq has achieved a significant milestone in artificial intelligence: reaching human-level performance on the ARC-AGI-1 benchmark, a challenging test of abstract reasoning. This achievement is particularly noteworthy not only for the performance itself but also for the cost-effectiveness of the system.

Poetiq’s system attained an accuracy of 83.2% on ARC-AGI-1, surpassing the average human score of 77.7% and exceeding the performance of Google’s Gemini 3 Deep Think. Beyond the accuracy gains, Poetiq’s system reached this milestone at a cost of $30.57 per problem, compared to the $77.16 per problem cost of Gemini 3 Deep Think. This result suggests that progress in AI reasoning is moving away from purely scaling model size and reasoning tokens and toward well-engineered systems that optimize performance at the request layer.

To understand the meaning of this achievement, one must look at the benchmark itself. ARC-AGI-1 (originally known simply as ARC) is based on the Abstraction and Reasoning Corpus introduced by François Chollet in 2019 to measure intelligence defined as efficient skill acquisition rather than the mastery of fixed tasks.

The benchmark consists of grid-based visual puzzles where the solver must infer an underlying rule from a few example input-output pairs and apply it to a new test grid. This format aims to test “core knowledge priors” and generalization, avoiding the pitfalls of benchmarks that can be solved through the memorization of vast training datasets.
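To make the format concrete, here is a toy ARC-style task sketched in Python. The mirroring rule and the grids are invented for illustration; they are not an actual ARC puzzle, only the same shape of problem: infer a rule from example input-output pairs, check it against them, and apply it to a held-out test grid.

```python
# A toy ARC-style task: training pairs demonstrate a hidden rule
# (here, horizontal mirroring); the solver must apply it to a test grid.
# The rule and grids are invented for illustration.
train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0, 0], [0, 4, 0]], [[0, 0, 3], [0, 4, 0]]),
]

def apply_rule(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule against all training pairs before
# trusting it on the held-out test input.
assert all(apply_rule(x) == y for x, y in train_pairs)

test_input = [[5, 0, 0]]
print(apply_rule(test_input))  # → [[0, 0, 5]]
```

Real ARC tasks use larger colored grids and far subtler rules, but the evaluation contract is the same: the rule must be recovered from only a handful of examples.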

## The Rise of ARC: A New Benchmark for Artificial General Intelligence

The pursuit of Artificial General Intelligence (AGI), AI that can perform any intellectual task a human being can, is driving the development of increasingly sophisticated benchmarks. Among these, the Abstraction and Reasoning Corpus (ARC) stands out. Introduced by AI scientist François Chollet, ARC isn’t about memorizing vast datasets; it’s about *reasoning* with limited information, a hallmark of human intelligence.

ARC presents visual puzzles requiring systems to identify patterns and apply abstract rules. Unlike many AI benchmarks focused on specific skills like image recognition or language translation, ARC emphasizes fluid intelligence – the ability to solve novel problems independent of prior knowledge. This makes it a particularly challenging test for current AI models, which often excel at pattern recognition within known parameters but struggle with true abstraction.

The original ARC, released in 2019, quickly became a popular, yet arduous, benchmark. Models that performed well on other tasks often faltered on ARC, highlighting the gap between narrow AI and genuine general intelligence.

The updated ARC-AGI-2, released in March 2025, increased the difficulty to challenge a new generation of hybrid reasoning systems. It includes 1,000 training tasks and targets more complex phenomena such as symbolic interpretation and compositional reasoning. The design explicitly resists brute-force methods. In technical reports from early 2025, leading AI models scored under 5% on ARC-AGI-2, reinforcing the series’ ethos of being easy for humans, hard for AI.


An example of an ARC task. These puzzles require identifying abstract patterns and applying them to new situations. (Source: ARC Prize)

Poetiq: An AI That Solves Unsolvable Puzzles

Poetiq, a new AI system developed by researchers at Stanford, has achieved a significant milestone in artificial intelligence: solving puzzles from the ARC Prize competition that were previously considered unsolvable. The ARC Prize, designed to challenge AI’s reasoning capabilities, features complex tasks requiring a blend of common sense, logical deduction, and creative problem-solving. Poetiq’s success isn’t due to sheer computational power, but rather a novel approach to how it interacts with large language models (LLMs).

Poetiq’s success relies on a move away from standard chain-of-thought (CoT) prompting toward an iterative process known as “refinement.” In this approach, the prompt acts as an interface rather than the sole driver of intelligence. The system does not simply ask a question and accept the output; instead, it generates a potential solution, receives feedback, analyzes that feedback, and uses the underlying large language model (LLM) to refine the answer. This creates a multi-step, self-improving loop that incrementally builds the correct solution.

A key component of Poetiq’s architecture is its “Self-Auditing” feature. The system monitors its own progress and decides when it has gathered enough information or produced a satisfactory solution. This capability allows the system to terminate the process at the optimal moment, preventing wasteful computation. As a result, the system makes fewer than two requests on average per problem, comfortably within the two attempts permitted by the ARC-AGI rules.
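The propose-audit-refine loop described above can be sketched in a few lines of Python. The function names, the feedback protocol, and the toy model below are illustrative assumptions, not Poetiq’s actual implementation; the point is only the control flow, where a verifier acts as the self-audit and stops the loop as soon as a candidate passes.

```python
def solve_with_refinement(task, llm, verify, max_steps=5):
    """Propose -> audit -> refine loop (illustrative sketch).

    llm:    callable prompt -> candidate answer
    verify: callable (task, candidate) -> feedback string, or None
            when the candidate is judged satisfactory (self-audit)
    """
    candidate = llm(f"Propose a solution for: {task}")
    for _ in range(max_steps):
        feedback = verify(task, candidate)   # self-auditing step
        if feedback is None:                 # satisfied: stop early
            return candidate
        candidate = llm(
            f"Task: {task}\nAttempt: {candidate}\n"
            f"Feedback: {feedback}\nRefine the attempt."
        )
    return candidate

# Toy stand-ins: the "model" numbers its attempts, and the verifier
# accepts only the second one, forcing exactly one refinement pass.
calls = []
toy_llm = lambda prompt: calls.append(prompt) or f"attempt-{len(calls)}"
toy_verify = lambda task, cand: None if cand == "attempt-2" else "wrong cells"

print(solve_with_refinement("demo-grid", toy_llm, toy_verify))  # → attempt-2
```

Because the loop exits the moment the audit passes, the number of model calls adapts to problem difficulty, which is what keeps the average request count low.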

## Autonomous Agents Achieve Breakthroughs in Cost and Performance with Gemini 3 and GPT-5.1

According to Poetiq, its autonomous agents integrated Gemini 3 and GPT-5.1 within hours of their release. By programmatically addressing problems using multiple model calls, the system redrew the Pareto frontier for cost versus performance, delivering higher accuracy at lower costs across the board.
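A cost-versus-accuracy Pareto frontier of the kind mentioned above is simply the set of configurations no rival beats on both axes at once. The sketch below computes one; the configuration names and numbers are made up for illustration and are not Poetiq’s reported figures.

```python
def pareto_frontier(points):
    """Keep configurations not dominated by any other: a point is
    dominated if some other point is no more expensive, no less
    accurate, and strictly better on at least one of the two.
    points: list of (name, cost_per_problem, accuracy)."""
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda p: p[1])  # order by cost

# Hypothetical (cost, accuracy) points, invented for this example.
configs = [
    ("model-A", 80.0, 0.45),   # expensive, dominated by system-B
    ("system-B", 30.0, 0.54),  # cheaper AND more accurate
    ("model-C", 5.0, 0.30),    # cheapest, so it stays on the frontier
]
print(pareto_frontier(configs))
```

In this toy example model-A falls off the frontier because system-B is both cheaper and more accurate, which is exactly what “redrawing the frontier” means: previously efficient points become dominated.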

Poetiq’s Meta-System Achieves Breakthroughs in AI Reasoning Through Output Refinement

A new meta-system developed by Poetiq is demonstrating significant advancements in AI reasoning capabilities, achieving state-of-the-art results on complex problem-solving tasks like the ARC-AGI-2 benchmark without relying on novel model training. This approach, focused on refining the outputs of existing Large Language Models (LLMs), is gaining traction as a key driver of progress in the field, with 2025 being dubbed the “Year of the Refinement Loop” by the ARC Prize team.


The core innovation lies in Poetiq’s model-agnostic framework, which enhances the performance of LLMs through a process of verification and refinement at the application layer. This means the system doesn’t require changes to the underlying LLM itself, but instead focuses on improving the quality of its responses. The framework’s effectiveness has been demonstrated across a range of models, including those from OpenAI, Anthropic, and xAI.

Notably, the “Poetiq (Grok-4-Fast)” configuration achieved accuracy comparable to significantly more expensive models, while “Poetiq (GPT-OSS-b)” delivered strong performance at a cost of less than one cent per problem. This adaptability highlights the system’s ability to tailor its approach to both the specific task and the characteristics of the LLM being used.

The success of Poetiq’s open-source refinement solution on Google’s Gemini 3 Pro is particularly compelling. The system improved Gemini 3 Pro’s performance on the ARC-AGI-2 benchmark from a baseline of 31% to 54%, demonstrating the potential to substantially enhance AI reasoning without requiring further model training. https://poetiq.ai/blog/gemini-3-pro-arc

Looking forward, Poetiq plans to extend its meta-system beyond abstract puzzles to tackle more complex, long-horizon tasks. The company is investigating how recursive architectures can leverage the existing world knowledge embedded within frontier models. A key focus is improving the mechanisms for knowledge extraction to make them more compatible with LLMs, potentially enabling complex reasoning and retrieval tasks to be solved without constant model updates.
