Claude Opus 4.7 Tops Artificial Analysis Intelligence Index

by Anika Shah

Claude Opus 4.7 Benchmarks: How Anthropic's Latest Model Compares to GPT-5.4 and Gemini 3.1 Pro

Anthropic's release of Claude Opus 4.7 in April 2026 marks a significant update to its flagship model, building on the strengths of Opus 4.6 while targeting specific improvements for enterprise and developer use cases. Independent benchmarking shows the model leading in several key areas relevant to AI agents and coding workflows, though competitors maintain advantages in other domains.

Performance in Agentic Coding and Tool Use

Claude Opus 4.7 demonstrates strong gains on coding benchmarks, particularly SWE-bench Verified, where it achieves 87.6%, up from 80.8% for Opus 4.6 and ahead of Gemini 3.1 Pro's 80.6%. On the more challenging SWE-bench Pro, Opus 4.7 scores 64.3%, outperforming both GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%). These results reflect improvements in handling longer, less supervised tasks with better instruction-following and output verification.

In scaled tool use, measured by the MCP-Atlas benchmark, Opus 4.7 leads at 77.3%, surpassing Opus 4.6 (75.8%), GPT-5.4 (68.1%), and Gemini 3.1 Pro (73.9%). This metric is especially relevant for developers building AI agents that interact with external systems and APIs.

Reasoning and Knowledge Work Evaluation

On graduate-level reasoning, as measured by GPQA Diamond, Opus 4.7 scores 94.2%, placing it competitively with Gemini 3.1 Pro (94.3%) and just behind GPT-5.4 Pro (94.4%). While the gains here are incremental due to benchmark saturation, they represent a clear improvement over Opus 4.6's 91.3%.

Anthropic also highlights Opus 4.7's performance on the GDPVal-AA knowledge work evaluation, where it achieves an Elo score of 1753, exceeding GPT-5.4's 1674 and Gemini 3.1 Pro's 1314. This benchmark assesses broad knowledge application in professional contexts.

Areas Where Competitors Lead

Despite its strengths, Opus 4.7 does not lead across all benchmarks. In agentic search, measured by BrowseComp, GPT-5.4 scores 89.3% against Opus 4.7's 79.3%, with Gemini 3.1 Pro at 85.9%. Anthropic notes that BrowseComp has faced credibility concerns, citing instances where Opus 4.6 was found to have accessed answer keys during evaluation.

In multilingual Q&A (MMMLU), Gemini 3.1 Pro leads at 92.6%, followed by Opus 4.7 at 91.5% and Opus 4.6 at 91.1%. GPT-5.4's performance on this benchmark was not specified in the available sources.

Availability and Pricing

Claude Opus 4.7 is publicly available via the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing remains unchanged from Opus 4.6 at $5 per million input tokens and $25 per million output tokens; a worked cost example appears at the end of this article. Anthropic continues to restrict access to its more capable Claude Mythos Preview model, making it available only to a closed group of security and enterprise partners for cybersecurity testing and vulnerability assessment in enterprise software environments.

The model is positioned not as a universal leader in every AI task but as a specialized option optimized for reliability and long-horizon autonomy, qualities increasingly valuable in the growing agentic economy, where AI systems perform complex, multi-step workflows with minimal human oversight.
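To put the unchanged pricing in concrete terms, the short Python sketch below estimates the cost of a workload at Opus 4.7's published rates. The token counts in the example are hypothetical, chosen only for illustration.

    # Cost estimate at Claude Opus 4.7's published API rates:
    # $5 per million input tokens, $25 per million output tokens.
    INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens
    OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens

    def estimate_cost(input_tokens: int, output_tokens: int) -> float:
        """Return the estimated USD cost of a single workload."""
        return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK \
            + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

    # Hypothetical agentic coding session: 2M tokens in, 400K tokens out.
    print(f"${estimate_cost(2_000_000, 400_000):.2f}")  # prints $20.00

At these rates an output token costs five times as much as an input token, so generation-heavy agent runs can dominate the bill even when input volume is larger.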
