Comment As Jensen Huang is fond of saying, Moore’s Law is dead – and at Nvidia GTC this month, the GPU-slinger’s chief exec let slip just how deep in the ground the computational scaling law really is.
Standing on stage, Huang revealed not just the chip designer’s next-gen Blackwell Ultra processors, but a surprising amount of detail about its next two generations of accelerated computing platforms, including a 600kW rack-scale system packing 576 GPUs. We also learned that an upcoming GPU family, due to arrive in 2028, will be named after Richard Feynman. Surely you’re joking!
It’s not that unusual for chipmakers to tease their roadmaps from time to time, but we usually don’t get this much information all at once. And that’s because Nvidia is stuck. It’s run into not just one roadblock but several. Worse, apart from throwing money at the problem, they’re all largely out of Nvidia’s control.
These challenges won’t come as any great surprise to those paying attention. Distributed computing has always been a game of bottleneck whack-a-mole, and AI might just be the ultimate mole hunt.
It’s all up and out from here
The first and most obvious of these challenges revolves around scaling compute.
Advancements in process technology have slowed to a crawl in recent years. While there are still knobs to turn, they’re getting exponentially harder to budge.
Faced with these limitations, Nvidia’s strategy is simple: scale up the amount of silicon in each compute node as far as it can. Today, Nvidia’s densest systems (really racks) mesh 72 GPUs into a single compute domain using its high-speed 1.8TB/s NVLink fabric. Eight or more of these racks are then stitched together using InfiniBand or Ethernet to achieve the desired compute and memory capacity.
At GTC, Nvidia revealed its intention to boost this to 144 and eventually 576 GPUs per rack. However, scaling up isn’t limited to racks; it’s also happening on the chip package.
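If you want to put the scale-up versus scale-out split into numbers, the arithmetic is as simple as it gets. A minimal sketch, where the eight-rack scale-out factor is purely illustrative:

```python
# Cluster size is roughly (GPUs per NVLink scale-up domain) x (racks stitched
# together over InfiniBand or Ethernet). Eight racks is purely illustrative.
gpus_per_rack = [72, 144, 576]  # today's NVL72, then the roadmap's 144 and 576
racks = 8                       # hypothetical scale-out factor

for gpus in gpus_per_rack:
    print(f"{gpus} GPUs per rack x {racks} racks = {gpus * racks:,} GPUs total")
```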
That package-level scaling became obvious with the launch of Nvidia’s Blackwell accelerators a year ago. The chips boasted a 5x performance uplift over Hopper, which sounded great until you realized it took twice the die count, a new 4-bit datatype, and an extra 500 watts of power to get there.
The reality was that, normalized to FP16, Nvidia’s top-specced Blackwell dies are only about 1.25x faster than a GH100 (1,250 dense teraFLOPS versus 989); there just happened to be two of them per package.
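For those who like to check the math, here’s a rough back-of-the-envelope version of that comparison; the teraFLOPS figures are the publicly quoted dense FP16 numbers, so treat the result as approximate:

```python
# Back-of-the-envelope per-die and per-package uplift, normalized to dense FP16.
hopper_fp16_dense_tflops = 989       # GH100
blackwell_fp16_dense_tflops = 1250   # single top-spec Blackwell die
dies_per_blackwell_package = 2

per_die = blackwell_fp16_dense_tflops / hopper_fp16_dense_tflops
per_package = per_die * dies_per_blackwell_package

print(f"Per-die uplift:     ~{per_die:.2f}x")      # ~1.26x
print(f"Per-package uplift: ~{per_package:.2f}x")  # ~2.5x
```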
By 2027, Nvidia CEO Jensen Huang expects racks to surge to 600kW with the debut of the Rubin Ultra NVL576
We don’t yet know what process tech Nvidia plans to use for its next-gen chips, but what we do know is that Rubin Ultra will continue this trend, jumping from two reticle-limited dies to four. Even with the roughly 20 percent increase in efficiency Huang expects to get out of TSMC’s 2nm process, that’s still going to be one hot package.
It’s not just compute either; it’s memory too. The eagle-eyed among you might have noticed a rather sizable jump in capacity and bandwidth from Rubin to Rubin Ultra: 288GB per package versus 1TB. Roughly half of this comes from faster, higher-capacity memory modules, but the other half comes from doubling the amount of silicon dedicated to memory, from eight modules on Blackwell and Rubin to 16 on Rubin Ultra.
Higher capacity means Nvidia can cram more model parameters, around 2 trillion at FP4, into a single package, or 500 billion per “GPU” now that it’s counting individual dies rather than sockets. HBM4e also looks set to effectively double the memory bandwidth over HBM3e, jumping from around 4TB/s per Blackwell die today to around 8TB/s on Rubin Ultra.
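As a sanity check on those parameter figures, here’s the rough math, assuming 4-bit (half-byte) weights and ignoring KV cache, activations, and other runtime overheads:

```python
# Weight-only parameter capacity for Rubin Ultra's quoted 1TB of HBM per package.
hbm_bytes_per_package = 1e12   # ~1TB (vendor figure, approximate)
bytes_per_fp4_param = 0.5      # 4 bits per weight
dies_per_package = 4           # Rubin Ultra counts each die as a "GPU"

params_per_package = hbm_bytes_per_package / bytes_per_fp4_param
params_per_die = params_per_package / dies_per_package

print(f"~{params_per_package / 1e12:.0f} trillion params per package at FP4")
print(f"~{params_per_die / 1e9:.0f} billion params per 'GPU' (die)")
```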
Unfortunately, short of a major breakthrough in process tech, future Nvidia GPU packages are likely to pack on even more silicon.
The good news is that process advancements aren’t the only way to scale compute or memory. Generally speaking, dropping from, say, 16-bit to 8-bit precision effectively doubles the throughput while also halving the memory footprint of a given model. The problem is that Nvidia is running out of bits to drop to juice its performance gains. From Hopper to Blackwell, Nvidia dropped four bits, doubled the silicon, and claimed a 5x floating-point gain.
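Here’s a minimal sketch of why that lever is so tempting, under the idealized assumption that throughput and weight memory both scale linearly with bit width (real kernels and real accuracy, of course, don’t cooperate so neatly):

```python
# Idealized scaling from dropping precision: halve the bits, roughly double the
# throughput and halve the weight memory. Accuracy loss is conveniently ignored.
def precision_scaling(bits_from: int, bits_to: int) -> tuple[float, float]:
    throughput_multiplier = bits_from / bits_to
    memory_multiplier = bits_to / bits_from
    return throughput_multiplier, memory_multiplier

for src, dst in [(16, 8), (8, 4), (16, 4)]:
    tput, mem = precision_scaling(src, dst)
    print(f"FP{src} -> FP{dst}: ~{tput:.0f}x throughput, {mem:.2f}x weight memory")
```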
But below four-bit precision, LLM inference gets pretty rough, with perplexity scores climbing rapidly. That said, there’s some interesting research into super-low-precision quantization, at as little as 1.58 bits per weight, that claims to maintain accuracy.
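To see why anyone bothers, here’s a rough weight-only footprint for a hypothetical 70-billion-parameter model at various precisions; the 1.58-bit figure corresponds to ternary weights (log2(3) bits per weight), per the research mentioned above, and everything here ignores KV cache and runtime overhead:

```python
import math

# Weight-only memory footprint for a hypothetical 70B-parameter model.
params = 70e9
precisions = [("FP16", 16), ("FP8", 8), ("FP4", 4), ("ternary", math.log2(3))]

for name, bits in precisions:
    gib = params * bits / 8 / 2**30
    print(f"{name:>7} ({bits:5.2f} bits): ~{gib:,.0f} GiB of weights")
```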
Mind you, reduced precision isn’t the only way to pick up FLOPS. You can also dedicate less die area to the higher-precision datatypes that AI workloads don’t need.
We saw this with Blackwell Ultra. Ian Buck, VP of Nvidia’s accelerated computing business unit, told us in an interview that the company actually nerfed the chip’s double-precision (FP64) tensor core performance in exchange for 50 percent more 4-bit FLOPS.
Whether this is a sign that FP64 is on its way out at Nvidia remains to be seen, but if you really care about double-precision grunt, AMD’s GPUs and APUs probably should be at the top of your list anyway.
In any case, Nvidia’s path forward is clear: its compute platforms are only going to get bigger, denser, hotter, and more power-hungry from here on out. As a calorie-deprived Huang put it during his press Q&A last week, the practical limit for a rack is however much power you can feed it.
“A datacenter is now 250 megawatts. That’s kind of the limit per rack. I think the rest of it is just details,” Huang said. “If you said that a datacenter is a gigawatt, and I would say a gigawatt per rack sounds like a good limit.”
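However you read that quote, the rack-count math is straightforward. A quick sketch, using the facility sizes Huang mentioned and Nvidia’s quoted figures for today’s NVL72 and the upcoming NVL576 racks, and conveniently ignoring cooling overhead and PUE:

```python
# How many ultra-dense racks fit in a given facility's power budget (IT power
# only; cooling, PUE, and everything else are ignored for simplicity).
facility_mw = [250, 1000]                            # 250MW today, a gigawatt tomorrow
rack_kw = {"NVL72": 120, "Rubin Ultra NVL576": 600}  # Nvidia's quoted rack powers

for mw in facility_mw:
    for name, kw in rack_kw.items():
        racks = mw * 1000 // kw
        print(f"{mw}MW facility / {name} at {kw}kW: ~{racks:,} racks")
```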
No escaping the power problem
Naturally, 600kW racks pose one helluva headache for datacenter operators.
To be clear, chilling megawatts of ultra-dense compute isn’t a new problem. The folks at Cray, Eviden, and Lenovo have had that figured out for years. What’s changed is that we’re no longer talking about a handful of boutique compute clusters a year. We’re talking dozens of clusters, some large enough to dethrone the Top500’s most powerful supers, if tying up 200,000 Hopper GPUs to run Linpack made any money.
At these scales, highly specialized, low-volume thermal management and power delivery systems simply aren’t going to cut it. Unfortunately, the datacenter vendors (you know, the folks selling the not-so-sexy bits and bobs you need to make those multimillion-dollar NVL72 racks work) are only now catching up with demand.
We suspect this is why so many of the Blackwell deployments announced so far have been for the air-cooled HGX B200 rather than the NVL72 Huang keeps hyping. These eight-GPU HGX systems can be deployed in many existing H100 environments. Nvidia has been doing 30-40kW racks for years, so jumping to 60kW isn’t that much of a stretch, and if it is, dropping down to two or three servers per rack is still an option.
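The rack-fit math backs that up. A quick sketch, assuming roughly 14kW per air-cooled eight-GPU B200 box (an approximate figure; your SKU and load will vary):

```python
# Roughly how many air-cooled eight-GPU HGX B200 servers fit in a rack power budget.
server_kw = 14          # approximate per-system draw; adjust for your hardware
gpus_per_server = 8

for rack_budget_kw in (40, 60):
    servers = rack_budget_kw // server_kw
    print(f"{rack_budget_kw}kW rack: ~{servers} servers, ~{servers * gpus_per_server} GPUs")
```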
The NVL72 is a rack-scale design inspired heavily by the hyperscalers, with DC bus bars, power sleds, and networking out the front. At 120kW of liquid-cooled compute, deploying more than a few of these things in existing facilities gets problematic in a hurry, and it’s only going to get harder once Nvidia’s 600kW monster racks make their debut in late 2027.
This is where those “AI factories” Huang keeps rattling on about come into play: purpose-built datacenters designed in collaboration with partners like Schneider Electric to cope with the power and thermal demands of AI.
And surprise, surprise, a week after Nvidia detailed its GPU roadmap for the next three years, Schneider announced a $700 million expansion in the US to boost production of all the power and cooling kit needed to support those systems.
Of course, building the infrastructure needed to power and cool these ultra-dense systems isn’t the only problem. Getting power to the datacenter in the first place is another, and once again, it’s largely out of Nvidia’s control.
Anytime Meta, Oracle, Microsoft, or anyone else announces another AI bit barn, a juicy power purchase agreement usually follows. Meta’s mega DC being birthed in the bayou was announced alongside a 2.2GW gas generator plant — so much for those sustainability and carbon neutrality pledges.
And as much as we want to see nuclear make a comeback, it’s hard to take small modular reactors seriously when even the rosiest predictions put deployments somewhere in the 2030s.
Follow the leader
To be clear, none of these roadblocks are unique to Nvidia. AMD, Intel, and every other cloud provider and chip designer vying for a slice of Nvidia’s market share are bound to run into these same challenges before long. Nvidia just happens to be one of the first to run up against them.
While this certainly has its disadvantages, it also puts Nvidia in a somewhat unique position to shape the direction of future datacenter power and thermal designs.
As we mentioned earlier, the reason Huang was willing to reveal Nvidia’s next three generations of GPU tech and tease a fourth is so that its infrastructure partners are ready to support them when they finally arrive.
“The reason why I communicated to the world what Nvidia’s next three, four year roadmap is now everybody else can plan,” Huang said.
On the flip side, these efforts also serve to clear the way for competing chipmakers. If Nvidia designs a 120kW, or now 600kW, rack, and colocation providers and cloud operators are willing to support it, AMD or Intel now has the all-clear to pack just as much compute into their own rack-scale platforms without having to worry about where customers are going to put them. ®