
Million GPU clusters, gigawatts of power – the scale of AI defies logic

Comment Next year will see some truly monstrous compute projects get underway as the AI boom enters its third year. Among the largest disclosed so far is xAI’s plan to expand its Colossus AI supercomputer from an already impressive 100,000 GPUs to a cool million.

Such a figure seemingly defies logic. Even if you could source enough GPUs for this new Colossus, the power and cooling – not to mention capital – required to support it would be immense.

At $30,000 to $40,000 a pop, adding another 900,000 GPUs would set xAI back $27 billion to $36 billion. Even with a generous bulk discount, and even if the chips are deployed over the course of several years, it won’t come cheap. And that’s before you account for the buildings, cooling, and electrical infrastructure needed to support all those accelerators.
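If you fancy checking our homework, here’s a quick back-of-the-envelope sketch in Python. The per-GPU price range is just the street price cited above, not a figure xAI or Nvidia has confirmed.

```python
# Back-of-envelope estimate of the GPU bill alone, using the figures above.
# The $30k-$40k per-GPU price is the street-price range cited in the article,
# not a quoted price from xAI or Nvidia.
additional_gpus = 900_000

for unit_price in (30_000, 40_000):
    total = additional_gpus * unit_price
    print(f"${unit_price:,} per GPU -> ${total / 1e9:.0f} billion")

# $30,000 per GPU -> $27 billion
# $40,000 per GPU -> $36 billion
```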

Speaking of power, depending on which generation of accelerators xAI plans to deploy, the GPU nodes alone would require roughly 1.2 to 1.5 gigawatts of generating capacity. That’s more than a typical nuclear reactor puts out – and we mean the big ones, no less. And again, that’s just for the compute.
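That estimate assumes each GPU node pulls roughly 1.2 to 1.5 kilowatts once you count the GPU plus its share of CPUs, NICs, and fans (an assumption on our part, not a disclosed spec), as the rough math below shows.

```python
# Rough power math for one million GPU nodes. The watts-per-node figures are
# assumed node-level draws (GPU plus its share of CPU, NIC, and fans),
# not numbers disclosed by xAI or Nvidia.
gpus = 1_000_000

for watts_per_node in (1_200, 1_500):
    gigawatts = gpus * watts_per_node / 1e9
    print(f"{watts_per_node} W per node -> {gigawatts:.1f} GW")

# 1200 W per node -> 1.2 GW
# 1500 W per node -> 1.5 GW
```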

Your gut reaction might be to chalk these figures up to an eccentric billionaire whose off-the-cuff quip was taken as gospel and then parroted by the local Chamber of Commerce as fact. However, when you take into consideration what the competition is doing, the scale of this new Colossus starts to look a little less crazy.

A terminal case of AI fever
The same week the Greater Memphis Chamber dropped the details on xAI’s reported expansion plans, rival model dev and Xitter competitor Meta announced a massive datacenter campus of its own. The facility, slated for construction in Richland Parish, Louisiana, will span four million square feet and cost $10 billion.

Meta hasn’t revealed how many accelerators the plant might hold, but CEO Mark Zuckerberg has already committed to deploying some 600,000 GPUs this year alone. To put that number into perspective, it’s nearly as many H100-class GPUs as analysts believe Nvidia shipped in all of 2023.

From what we’re told, the site will likely be built in phases over the next few years, and it’ll consume a monumental amount of power.

For reference, it’s not unusual for a typical cloud datacenter campus with multiple data halls to have a rated capacity of around 50 megawatts. With power constraints in the US already proving problematic for datacenter operators, you’d think this would be a problem for all these AI-obsessed hyperscalers, cloud providers, and model builders – but instead, they’re just bankrolling their own generating plants.

As for Meta’s Louisiana campus, the company has partnered with Entergy to construct three gas turbines with a combined generating capacity of more than 2.2 gigawatts.

We’ll have to wait and see if the entire site is ever completed. We can only imagine an AI bubble burst might derail those plans in a hurry – assuming it is in fact a bubble. We’ll let you debate that in the comments.

In any case, with numbers this large, the idea of building a nuclear plant’s worth of power suddenly doesn’t sound so crazy after all. In fact, Meta seems so confident that its power demands will keep growing that it has started fishing for suppliers that can get it one to four gigawatts of nuclear energy by the early 2030s.


The AI fever the tech giants have collectively come down with has triggered something of a sea change for the nuclear industry as a whole, with cloud providers fronting the cash to reinstate retired reactors – and, in the case of AWS’ new Cumulus datacenter complex, even plopping their datacenters behind the meter.

Speaking of Amazon, it’s certainly not just Meta and xAI dreaming big. The e-commerce giant turned cloud provider last week cranked up the heat on its AI ambitions. At re:Invent, the hyperscaler revealed a litany of AI products, systems, and models – among them, an AI supercomputer built in collaboration with model builder Anthropic using “hundreds of thousands” of its custom Trainium2 accelerators, which we can only imagine will require a fair bit of power themselves.

Earlier this summer, we poked some fun at Oracle’s “zettascale” supercomputer which, with 4-bit precision and sparsity coming to its aid, will have a peak output of 2.4 zettaFLOPS.

While real-world training performance will be closer to 459 exaFLOPS at the FP16/BF16 precision most commonly used today, it’ll still take a serious number of GPUs – 131,072 in total – to get there. That’s not quite a million, but it’s still pretty huge compared to the clusters being deployed by CoreWeave and others.
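If you’re curious how much of that “zettascale” headline comes from dropping to 4-bit precision and leaning on sparsity, dividing the quoted figures by the GPU count gives a rough per-accelerator number. Nothing below is vendor-confirmed; it’s just the stated totals divided out.

```python
# Per-GPU throughput implied by Oracle's headline figures - simply the stated
# totals divided by the GPU count, nothing vendor-confirmed.
gpus = 131_072
peak_fp4_sparse = 2.4e21  # 2.4 zettaFLOPS at FP4 with sparsity
peak_bf16 = 459e18        # 459 exaFLOPS at FP16/BF16

print(f"FP4 sparse: {peak_fp4_sparse / gpus / 1e15:.1f} petaFLOPS per GPU")
print(f"FP16/BF16:  {peak_bf16 / gpus / 1e15:.1f} petaFLOPS per GPU")

# FP4 sparse: 18.3 petaFLOPS per GPU
# FP16/BF16:  3.5 petaFLOPS per GPU
```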

We could keep going – but you get the picture.

A new arms race
It seems that the hype surrounding generative AI hasn’t just changed the way we think about scaling compute.

In many respects, the mobilization of capital we’ve seen around AI is reminiscent of the space race, just with China playing the part of the Red Menace instead of the Soviet Union.

The sheer number of hurdles involved in putting a man in orbit, let alone on the Moon, forced scientists and engineers to overcome challenges and advance technologies that moved the world forward as a whole.

And while there’s certainly a nationalistic element to all of this, it’s not just one country racing against the next. Driving these investments are some of the largest and most powerful corporations in the world.

It seems that in this new AI arms race we may see a similar course of events as power, cooling, and economic constraints drive investments in things like nuclear power or sustainable computing. It won’t be because it’s the right thing to do, but because it’s the difference between winning and losing the race – and making money doing it. ®
