Vera Rubin: NVIDIA’s Engineering Miracle Supercomputer

by Marcus Liu - Business Editor

A bit like a gamer who has just finished assembling his latest "beast" of a rig, Jensen Huang looks with admiring eyes at his company's latest creation. On the Las Vegas keynote stage in front of him sit the components that, once inserted into the rack, make up the NVL72: a rack-scale NVIDIA platform that integrates 72 GPUs connected via NVLink so that they function as a single coherent computing system with very high bandwidth and very low latency.

When NVIDIA chose to name its new supercomputer Vera Rubin, it wasn't simply paying homage to the famous astronomer. It was also describing the nature of the problem it intends to solve: something invisible that holds the entire AI universe together. Rubin, the astronomer who first observed anomalies in the rotation of galaxies and inferred the existence of dark matter, had discovered that the universe we see is only a fraction of what exists.

Similarly, the power of today’s data centers is only a fraction of that needed to reach the next frontier of AI and power a world of robots.

The problem is simple: AI models grow by an order of magnitude every year, and token generation has increased fivefold year over year, but at the same time the cost per token of previous generations has collapsed tenfold per year.

This is a symptom not of a struggling AI market or a possible bubble, but of a race for AI computing power so intense that each new step makes the previous one obsolete. And while all this is happening, Moore’s Law has pretty much stopped working the way we’re used to.

Huang explains that this is where NVIDIA made a decision for Vera Rubin that goes against every established principle of semiconductor engineering: simultaneously redesign all six chips that make up the system.

Huang in front of the NVL72 racks

Within the company he founded, the CEO tells us, there was a strict rule: never change more than one or two chips per generation. It's a sensible rule, born from decades of experience in the semiconductor industry. "When you're faced with an exponentially widening gap between available transistor capabilities and required computation," explains Huang, "sensible rules become recipes for irrelevance."

The architecture of the impossible

Vera Rubin is a system of six new chips designed to operate as one. At its center is the Vera processor, a custom ARM CPU designed by NVIDIA that doubles the performance of the previous generation in a world, that of AI, now bound entirely by power.

To be precise, Vera does not simply double performance, which could be achieved by raising clocks and power consumption; it doubles performance per watt, in an era where every data center has a fixed limit on the power it can consume and where energy efficiency directly determines how much computation you can extract from a given energy budget. The real problem of AI today is data center power consumption.

Vera implements eighty-eight physical cores, but that's only half the story. Through a technology called "spatial multi-threading", each core is designed to handle two threads simultaneously, but not in the traditional way we are used to on consumer processors, where threads compete for the same execution resources and inevitably degrade each other. In conventional multi-threaded processors, when two threads run on the same core, each gets maybe sixty or seventy percent of the performance it would get if it had the core to itself, which is partly what pushed Intel to take a step back on recent generations. The overall gain is there, but two threads never give you double the performance.

Vera’s spatial multi-threading has been architected differently: core resources are spatially partitioned, physically separated so that each thread has dedicated access to the execution units it needs. Each thread can reach its full performance, as if it had a dedicated core.

The eighty-eight physical cores essentially behave like 176 independent cores when it comes to performance, doubling computational throughput without doubling silicon area or power consumption. "It's an extraordinary efficiency multiplier, only possible when you rethink the core architecture from scratch instead of adapting existing designs," Huang says with twinkling eyes.
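The throughput claim above reduces to simple arithmetic. A minimal sketch, where the 65% per-thread figure for conventional SMT is an illustrative assumption, not a measured value:

```python
# Back-of-envelope throughput: conventional SMT vs. spatially
# partitioned multi-threading as described in the text.
PHYSICAL_CORES = 88
THREADS_PER_CORE = 2

def total_throughput(per_thread_fraction: float) -> float:
    """Aggregate throughput in 'single-thread core equivalents'."""
    return PHYSICAL_CORES * THREADS_PER_CORE * per_thread_fraction

conventional = total_throughput(0.65)  # threads contend for shared units (assumed 65%)
spatial = total_throughput(1.0)        # each thread has dedicated execution units

print(f"conventional SMT: ~{conventional:.0f} core-equivalents")
print(f"spatial MT:       ~{spatial:.0f} core-equivalents")
# 88 cores x 2 fully isolated threads = 176 core-equivalents
```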

Having an extremely powerful CPU, however, creates a new problem: if the CPU can process data twice as fast but the channels that move data in and out of it remain slow, you've simply moved the bottleneck. It's like putting a Ferrari engine in a car with a city car's exhaust: the power is there, but it can't be expressed.

In modern AI systems the CPU must constantly exchange data with the GPUs that do the heavy computational work, and it must do so across the network that connects thousands of components in the rack. If these communication channels are not up to par, the CPU spends most of its time waiting instead of computing.

This is why NVIDIA designed Vera together with ConnectX-9 from the beginning: ConnectX-9 is the networking chip that manages how data flows between the CPU, the GPUs and all the other system components.

It provides 1.6 terabits per second of bandwidth for each GPU, enough to keep up with the speed at which Vera processes data. But it's not just about raw speed: Vera and ConnectX-9 have to speak the same language, use the same protocols, and coordinate how shared memory is managed. That's why the two chips were developed in parallel, with the engineering teams deciding together which features to implement in which chip, how to make the components communicate most efficiently, and which compromises to accept. If NVIDIA had designed Vera first and then tried to fit ConnectX-9 around it, or vice versa, it would have had to make compromises limiting the performance of both.

Instead, by designing them together, each chip can be optimized knowing exactly how the other works and what it expects to receive. NVIDIA didn't release Vera until ConnectX-9 was ready: one without the other would have been like half a bridge.

The Vera chip shown at the presentation is gigantic, one of the largest CPU dies NVIDIA has ever produced, a tile of 227 billion transistors.

Alongside Vera is Rubin, the GPU that represents perhaps the most impressive technical achievement of this generation. Rubin offers five times the floating point performance of Blackwell, the current generation, but one figure makes it clear just how advanced this supercomputer is: it contains only 1.6 times as many transistors as Blackwell.

In a world where semiconductor physics says more transistors should translate roughly linearly into more performance, NVIDIA has managed to extract three times more performance per transistor. Anyone who has followed semiconductors for years knows that the leaps enabled by new process nodes and reworked architectures are never dramatic: even assuming a twenty-five percent improvement in transistor performance and optimal production yields, year-on-year performance jumps of one hundred percent are mathematically out of reach.

Rubin succeeds, demonstrating once again that only tight integration between hardware and software can deliver performance that generic software on open platforms cannot. The novelty here is called the NVFP4 Tensor Core, with which NVIDIA has fundamentally rethought the way calculations are performed.

The name might suggest that NVIDIA has simply devised a new, more efficient 4-bit format, but that would be an understatement. NVFP4 is an entire computing ecosystem: Rubin's new Tensor Core, when faced with this format, is not a passive executor but an intelligent processor capable of "reading" the context. In traditional systems, AI calculations apply "static" precision regardless of the operation, but not all calculations carry the same weight: some operations require surgical precision to preserve the semantic meaning of the model, while others can be simplified without damage. The magic of NVFP4 is managing this distinction in real time: the Tensor Core analyzes each operation and decides, clock cycle after clock cycle, whether to hit the accelerator and reduce precision or to brake and ensure maximum accuracy. This is where the monstrous 5x increase in floating point performance comes from.

A decision of this kind cannot be made in software, because the choice must happen at the level of individual clock cycles. That's why it is implemented directly in hardware, with dedicated logic that analyzes the context of each operation and makes precision decisions millions of times per second. This dynamic adaptation allows Rubin to maintain accuracy comparable to higher-precision calculations while operating at a computational speed that traditional approaches could not reach.
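To make the idea concrete, here is a toy sketch of context-dependent precision selection. This is emphatically not NVIDIA's NVFP4 logic, just an illustration of the principle: quantize each block of values coarsely when its dynamic range tolerates it, and fall back to finer precision otherwise. The threshold and level counts are arbitrary assumptions.

```python
# Toy illustration of per-block adaptive precision (NOT NVFP4 itself).
def quantize_block(values, levels):
    """Uniformly quantize a block to `levels` levels over its own range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def adaptive_quantize(block, threshold=10.0):
    """Pick precision per block: wide-range blocks get more levels."""
    dynamic_range = max(block) - min(block)
    # assumption: 16 levels mimics a 4-bit format, 256 an 8-bit one
    levels = 256 if dynamic_range > threshold else 16
    return quantize_block(block, levels), levels

narrow = [0.10, 0.20, 0.15, 0.18]   # small range: coarse format suffices
wide = [0.0, 40.0, -12.0, 7.5]      # large range: keep finer precision
for b in (narrow, wide):
    quantized, levels = adaptive_quantize(b)
    print(levels, quantized)
```

The hardware version described above makes an analogous choice with dedicated logic at clock-cycle granularity, which is precisely why it cannot live in software.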

NVIDIA has already published several academic papers on this technology, and the precision it manages to maintain while pushing throughput to the limit is, Huang proudly says, "totally amazing".

The results show that quality degradation is minimal or non-existent for the majority of inference and training workloads while the performance gain is substantial.

"It would not be surprising if the industry decided to adopt this format and structure as a future standard," says the CEO of NVIDIA, making it clear that, much as it did with CUDA, the only way for NVIDIA to keep the reins of the AI market is to make sure everyone uses its tools and solutions, tools with which NVIDIA, having created them, will always have an advantage in both access and performance.

A powerful processor is useless if data can't reach it fast enough, and that's where ConnectX-9 and the new sixth-generation NVLink switch come into play. ConnectX-9 provides 1.6 terabits per second of bandwidth to each individual GPU, while the NVLink switch, at a time when the rest of the world is still struggling to reach 200 gigabit, moves data at an incredible 240 terabytes per second.

To understand how absurd NVIDIA's work is, consider that every single switch can move twice the entire capacity of the world's Internet every second. Every GPU must be able to communicate with every other GPU simultaneously, without waiting, without queues, without compromises, and NVIDIA has created what is effectively the fastest data transfer system ever built for a supercomputer.

The revolution in assembly: zero cables, five minutes

The most radical innovation lies not only in the individual chips but in the way they have been integrated. The previous generation required forty-three wires and six liquid cooling tubes to assemble. It took two hours per rack, and time in this case is money, given the enormous demand NVIDIA has to meet. "If you got something wrong, you had to test, disassemble, reassemble. It was an artisanal process in an era that requires mass industrial production," explains Huang.

Vera Rubin has zero wires, two tubes, and assembles in five minutes. It is one hundred percent liquid cooled, and that liquid enters at forty-five degrees Celsius: warm enough that data centers don't need dedicated cooling equipment, the so-called chillers. NVIDIA is literally cooling its supercomputers with hot water, a feat of thermal efficiency that alone could save approximately six percent of global data center energy consumption.

A single Vera Rubin NVLink72 rack contains seventy-two Rubin GPUs, where each "GPU" is actually two dies connected together. Eighteen compute trays, nine NVLink switch trays, 220 trillion transistors, almost two tons of weight. Inside are two miles of copper cables and five thousand individual connections, all shielded and structured, carrying data at 400 gigabits per second from the top to the bottom of the rack. It's the largest deployment of structured copper cabling the computing world has ever seen.

But NVIDIA didn't stop there: it also invented a new category of storage. The problem to be solved is the AI's working memory. When a user interacts with an AI model today, they want the model to remember everything that happened previously. To make that happen, every time a model generates a token it must load the entire model into memory, load the entire previous conversation, and process the full context, saving the result in what is called the "KV cache".

With conversations getting longer, models getting bigger, and users wanting AI to remember every interaction they've ever had, the KV cache becomes huge.
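A rough estimate shows why. The sketch below uses the standard transformer KV-cache formula; the model dimensions are illustrative assumptions, not the specs of any particular model:

```python
# Back-of-envelope KV-cache size for a transformer: one K and one V
# tensor per layer, per token, stored for the whole context.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # factor of 2 = keys + values; bytes_per_value=2 assumes FP16
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# hypothetical 80-layer model, 8 KV heads of dimension 128, 128k context
per_user = kv_cache_bytes(80, 8, 128, 128_000)
print(f"{per_user / 1e9:.1f} GB of KV cache per user at 128k context")
# ~42 GB for a single conversation, before serving anyone else
```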

Initially this memory was kept in the system's high-bandwidth memory, which is largely why AI is so hungry for RAM. To address the problem, NVIDIA had already added fast expanded memory on the current generation, Blackwell, but even that solution is no longer enough.

Enter BlueField-4, a dedicated processor that manages a new type of very fast storage that does not rely on RAM: behind every BlueField-4 processor sit one hundred and fifty terabytes of contextual memory, and for each GPU in the system this translates into sixteen additional terabytes of memory accessible at very high speed, two hundred gigabits per second, as if it were local memory.

Silicon photonics, the future of networking

And then there's Spectrum-X. Two years ago NVIDIA entered the Ethernet switch market for the first time with Spectrum-X. AI traffic is different from traditional traffic: there are instantaneous spikes in intensity that would make a normal Ethernet switch collapse, and latencies that need to be measured in nanoseconds, not milliseconds. Spectrum-X was so successful that NVIDIA has now become the largest networking company the world has ever seen.

The new generation of Spectrum-X marks a historic step in networking for artificial intelligence. In fact, NVIDIA brings into production a technology that, until very recently, was confined to research laboratories: silicon photonics integrated directly into the chip.

Traditionally, the worlds of electronics and optics have remained separate in data centers. The chips process the data electrically, while the conversion into light signals, which is necessary for transmitting information over long distances via fiber, takes place in external optical modules, so-called transceivers. This separation introduces clear limitations in terms of power consumption, latency, heat dissipation, and overall system scalability.

With Spectrum-X this pattern is overcome. What Huang shows us is the first chip produced with TSMC's COOP (Co-Optimized Optical Packaging) process, a technology co-developed to integrate electronic and photonic components in the same package. In practice, light is no longer generated and modulated in an external device: the lasers feed directly into the chip, where integrated optics takes care of modulating the optical signal.

The result is a co-packaged optics architecture that eliminates the need for traditional optical transceivers. Physically examining the chip, you can identify the laser entry points and the integrated photonic structures: everything is concentrated in a single package, designed to operate at industrial scale.

From a numerical perspective, the specs are impressive: 512 ports of 200 gigabits per second each, for an overall bandwidth well in excess of 100 terabits per second per chip.

A level of density and throughput that would be impractical with traditional architecture based on separate optical modules.
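The aggregate figure checks out with simple arithmetic:

```python
# Sanity check of the claimed aggregate bandwidth: 512 ports
# at 200 Gb/s each, as stated above.
ports = 512
gbps_per_port = 200
total_tbps = ports * gbps_per_port / 1000  # gigabits -> terabits
print(f"{total_tbps:.1f} Tb/s aggregate per chip")
# 512 x 200 Gb/s = 102.4 Tb/s, comfortably over the 100 Tb/s mark
```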

Performance that redefines what’s possible

Vera Rubin’s performance numbers are almost difficult to conceptualize. To train a ten trillion parameter model on one hundred trillion tokens, what NVIDIA projects to be near the next frontier of AI, Blackwell would require a certain number of systems to complete the training in a month. Rubin achieves the same result with a quarter of the systems. Throughput per watt, the crucial metric in a world where every data center has a fixed limit on how much power it can consume, is ten times higher than Blackwell, which was already ten times higher than Hopper. For a one gigawatt data center, which costs about fifty billion dollars, a ten percent improvement in throughput is worth five billion dollars.
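The economics cited above reduce to one line of arithmetic: in a power-capped facility, throughput gained per watt is equivalent to capacity you didn't have to build.

```python
# Value of a throughput-per-watt gain in a power-capped data center,
# using the figures from the text: ~$50B for one gigawatt of capacity.
datacenter_cost_usd = 50e9
throughput_gain = 0.10  # a 10% improvement at the same power budget
value_of_gain = datacenter_cost_usd * throughput_gain
print(f"${value_of_gain / 1e9:.0f}B of effective capacity per 10% gain")
```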


The cost per token generated, the metric that determines the affordability of AI, is one-tenth that of Blackwell.

This isn’t just an incremental improvement: It’s the kind of leap that makes possible entire new categories of AI applications that were previously economically unviable.

Beyond Moore’s Law

Vera Rubin represents something more than a new supercomputer: it is a demonstration that when the fundamental laws of physics stop working in your favor, the answer is not to give up but to rewrite the rules of the game.

NVIDIA violated its own principle of not changing more than two chips per generation because it understood a fundamental truth: in a world where demand grows exponentially and physics grows linearly, the only solution is vertical innovation across the entire hardware and software stack.

Jensen Huang tells us that each chip in Vera Rubin might deserve its own press conference, because each chip represents years of research, thousands of engineers, milestones that redefine the state of the art in their respective categories. Put together they create a system that is not simply the sum of the parts, but something qualitatively different: a supercomputer that makes its predecessors seem like they belong to a bygone technological era.

All this also makes us understand how small the consumer world, and the evolution of consumer technologies, is compared to what is being done today for AI. The level of technology present in Vera Rubin, and the engineering capabilities of NVIDIA, are impressive. To date there is nothing even remotely close to what NVIDIA is doing for AI.

It took NVIDIA fifteen thousand engineer-years from the moment design began; that is, a single engineer would have needed 15,000 years to design a similar system, equivalent to approximately 5,000 engineers working for three years. The first Vera Rubin NVLink72 rack is now in full production.

Vera Rubin was right about dark matter. NVIDIA is betting it's right too: seemingly impossible challenges require not just new tools, but new ways of thinking about what tools are possible. NVIDIA has pulled it off; who will be able to follow?

date:2026-01-06 18:00:00
