OpenAI’s GPT-5.3-Codex-Spark: A Leap Towards Real-Time AI Coding with Cerebras
In a significant move diversifying its hardware strategy, OpenAI has launched GPT-5.3-Codex-Spark, its first production AI model deployed on Cerebras Systems’ wafer-scale chips instead of traditional Nvidia GPUs. The new model is designed to deliver higher throughput and lower latency, enabling a real-time, interactive coding experience, according to OpenAI.
A New Era of Interactive Coding
Codex-Spark runs at over 1,000 tokens per second, representing a roughly 15x speed increase compared to earlier versions, making live coding assistance and rapid iteration significantly more responsive. OpenAI designed the model “specifically for working with Codex in real-time—making targeted edits, reshaping logic, or refining interfaces and seeing results immediately,” as stated in a Cerebras blog post.
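To put those figures in perspective, a short back-of-the-envelope calculation shows what a ~15x speedup means for a typical streamed response. The baseline rate below is inferred from the 15x claim, not an official OpenAI figure, and the 500-token edit size is an illustrative assumption:

```python
# Illustrative arithmetic only: the baseline decode rate is inferred
# from the ~15x speedup claim, not an official OpenAI number.
SPARK_TOKENS_PER_SEC = 1_000   # reported Codex-Spark throughput
SPEEDUP = 15                   # reported speedup vs. earlier versions
baseline_tokens_per_sec = SPARK_TOKENS_PER_SEC / SPEEDUP  # ~67 tokens/s

def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream num_tokens at a constant decode rate."""
    return num_tokens / tokens_per_sec

# A hypothetical 500-token code edit:
spark_time = generation_seconds(500, SPARK_TOKENS_PER_SEC)        # 0.5 s
baseline_time = generation_seconds(500, baseline_tokens_per_sec)  # 7.5 s
print(f"Spark: {spark_time:.1f}s, baseline: {baseline_time:.1f}s")
```

At these rates, an edit that previously took several seconds streams back in well under a second, which is the difference between a blocking wait and an interactive loop.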
Optimized for Speed and Responsiveness
To enable real-time coding, OpenAI optimized Codex-Spark for low latency and interactive coding workflows rather than deep reasoning or general-purpose tasks. Despite this focus on speed, the model retains the ability to handle long-running processes, operating for “hours, days, and weeks without intervention.”
Performance Benchmarks
GPT-5.3-Codex-Spark demonstrated its performance on SWE-Bench Pro and Terminal-Bench 2.0, benchmarks tailored for software engineering tasks. It achieved results comparable to GPT-5.1-Codex-mini and GPT-5.3-Codex, but in a fraction of the time. OpenAI also notes that the end-to-end improvements implemented to reduce latency across the full request-response pipeline will benefit all of its models.
Under the Hood: Technical Enhancements
OpenAI streamlined the process of streaming responses between client and server, rewrote key parts of its inference stack, and reworked session initialization so that the first token appears faster and responsiveness is sustained during iteration. These enhancements included the introduction of a persistent WebSocket connection and optimizations in the Responses API. Together, the changes reduced per-roundtrip client/server overhead by 80%, per-token processing time by 30%, and time-to-first-token by 50%.
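A simple latency model makes the combined effect of those three reductions concrete. The absolute millisecond figures below are assumptions chosen for illustration; only the percentage reductions (80% roundtrip overhead, 30% per-token processing, 50% time-to-first-token) come from the announcement:

```python
# Hypothetical latency model: the baseline millisecond values are
# illustrative assumptions, not measured OpenAI figures.

def response_latency_ms(ttft_ms: float, per_token_ms: float,
                        roundtrip_ms: float, n_tokens: int,
                        n_roundtrips: int) -> float:
    """End-to-end latency: first token + decoding + client/server roundtrips."""
    return ttft_ms + per_token_ms * n_tokens + roundtrip_ms * n_roundtrips

# Assumed baseline: 400 ms TTFT, 10 ms/token, 50 ms/roundtrip,
# for a 200-token response over 4 roundtrips.
before = response_latency_ms(400, 10, 50, n_tokens=200, n_roundtrips=4)

# Apply the reported reductions: TTFT -50%, per-token -30%, roundtrip -80%.
after = response_latency_ms(400 * 0.5, 10 * 0.7, 50 * 0.2,
                            n_tokens=200, n_roundtrips=4)

print(f"before: {before:.0f} ms, after: {after:.0f} ms")  # 2600 ms -> 1640 ms
```

Under these (assumed) baseline numbers, the stated reductions cut end-to-end latency by roughly a third; decode time dominates, which is why the per-token and time-to-first-token improvements matter most for interactive use.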
The Cerebras Partnership and Wafer Scale Engine
Codex-Spark runs on Cerebras’ Wafer Scale Engine 3 (WSE-3) accelerators, which are particularly suited to low-latency, high-speed inference. This marks OpenAI’s first production deployment on Cerebras hardware, signaling a strategic diversification from its long-standing reliance on Nvidia. However, OpenAI clarified that this does not represent a departure from GPUs as the core of its training and inference pipeline, and that Cerebras accelerators can be combined with GPUs to leverage the strengths of both architectures.
Community Response and Considerations
The announcement sparked discussion online. Some users emphasized a preference for accuracy over speed, noting that waiting for more reliable results can be preferable. Others pointed out that the cumulative cost of faster iterations could offset the benefits of speed. One user on X.com, Nicholas Van Landschoot, observed that in practical benchmarks the speed improvements may not be as dramatic as claimed.
Future Developments
Codex-Spark currently provides a 128k-token context window and supports text-only input. OpenAI plans to introduce faster models with larger context windows, informed by usage insights gathered from the developer community.
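For developers planning around that 128k limit, a rough token budget check is often enough at the prototyping stage. The sketch below uses the common 4-characters-per-token rule of thumb, which is an approximation, not the model's actual tokenizer; the reserved-output figure is likewise an assumption:

```python
# Rough token budgeting for a 128k-context model. The 4-characters-per-token
# heuristic is a crude estimate; exact counts require the model's tokenizer.
CONTEXT_WINDOW = 128_000
CHARS_PER_TOKEN = 4  # rule-of-thumb approximation

def fits_in_context(prompt: str, reserved_output_tokens: int = 4_000) -> bool:
    """Estimate whether a text prompt leaves room in the window for the reply."""
    estimated_prompt_tokens = len(prompt) / CHARS_PER_TOKEN
    return estimated_prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("x" * 100_000))  # ~25k tokens: fits
print(fits_in_context("x" * 600_000))  # ~150k tokens: does not fit
```

A check like this helps decide when a large codebase needs to be chunked or summarized before being sent to a fixed-context model.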