📰 YAHOO NEWS

Nvidia’s Delayed Blackwell AI Chips Overheating in Servers

Nvidia’s upcoming Blackwell GPUs for AI computing may face further delays because they’re prone to overheating when connected to each other on server racks, according to a new report from The Information.

The server rack Nvidia designed for Blackwell—which can connect up to 72 GPUs at a time—is reportedly the source of the overheating issue. Nvidia has repeatedly redesigned the racks, which could delay GPU server shipments and prevent new Google, Microsoft, or Meta data centers from opening on schedule.

Back in August, a previous report suggested that a “design flaw” had caused the Blackwell GPUs’ launch to be delayed by months. It’s unclear whether this flaw is the server rack design issue, though it’s possible. Nvidia had announced Blackwell back in March, and initially said the GPUs could ship as soon as Q2 2024 before it encountered challenges.

Nvidia indirectly addressed the server rack problem in a statement to Reuters. “Nvidia is working with leading cloud service providers as an integral part of our engineering team and process. The engineering iterations are normal and expected,” a company spokesperson said, suggesting a new server design could be on the horizon.

Overheating is a leading cause of performance issues for GPUs, which can consume a lot of energy to operate. The crypto mining industry, like AI, also uses a ton of energy, produces a lot of heat, and relies on large numbers of interconnected GPUs or mining rigs. Some crypto miners use immersion cooling, submerging the rigs in liquid to prevent overheating.

And the more powerful a GPU, the more heat it can produce. While tech advancements can sometimes improve energy efficiency, those gains typically aren’t enough to offset the increase in overall energy needs. The Blackwell AI chips can be up to 30 times faster than previous GPUs, according to Nvidia.

Training and running generative AI models at scale requires a ton of energy, too, as well as water to cool these systems. This has led some experts to predict that AI data centers may face power shortages as soon as next year. That’s because AI firms can’t add new power sources to the grid as quickly as they can add data centers—and they aren’t necessarily willing to wait, either.

Meta, Microsoft, and Google have recently turned to nuclear power to meet their rising energy needs. However, “power purchase agreements” don’t necessarily solve AI’s energy problems.

Nvidia has seen its stock soar over 180% in the past year amid the AI surge and the resulting spike in chip demand, while rival AMD recently announced mass layoffs.
