
Nvidia’s new DGX SuperPOD can handle trillion-parameter AI models


Nvidia is launching its most powerful systems yet with the new DGX SuperPOD, part of a broad rollout of hardware and software at the Nvidia GTC conference today.

In recent years, the DGX has become one of Nvidia’s primary server hardware and cloud offerings. The new DGX SuperPOD is powered by Nvidia’s next generation of GPUs for AI acceleration, known as Blackwell, which is being announced at GTC as the successor to the Hopper GPU. Blackwell is positioned to support and enable AI models with a trillion parameters.

The DGX SuperPOD integrates the GB200 superchip version of Blackwell, which includes both CPU and GPU resources. Nvidia’s Grace Hopper superchip is at the core of the prior generation of DGX systems, which are already widely deployed for numerous use cases including drug discovery, healthcare, fraud detection, financial services, recommender systems and consumer internet.

“It’s a world-class supercomputing platform and it’s turnkey,” Ian Buck, VP of Hyperscale and HPC at Nvidia said during a press briefing. “It supports Nvidia’s full AI software stack, providing unmatched reliability and scale.”

What’s inside a DGX SuperPOD?

While the term SuperPOD might seem like just a marketing superlative, the actual hardware that Nvidia is packing into its new DGX system is impressive.

A DGX SuperPOD isn’t just a single rack server; it’s a combination of multiple DGX GB200 systems. Each DGX GB200 system features 36 Nvidia GB200 Superchips, which include 36 Nvidia Grace CPUs and 72 Nvidia Blackwell GPUs, connected as a single supercomputer via fifth-generation Nvidia NVLink.

What makes the SuperPOD “super” is that it can be configured with eight or more DGX GB200 systems and can scale to tens of thousands of GB200 Superchips connected via Nvidia Quantum InfiniBand.
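The component counts above imply simple totals for any given configuration. As a rough sketch of that arithmetic (the function and its structure are illustrative, not an official Nvidia sizing tool):

```python
# Figures from the article: each DGX GB200 system contains 36 GB200 Superchips,
# and each Superchip pairs 1 Grace CPU with 2 Blackwell GPUs.
SUPERCHIPS_PER_SYSTEM = 36
CPUS_PER_SUPERCHIP = 1
GPUS_PER_SUPERCHIP = 2

def superpod_totals(num_systems: int) -> dict:
    """Total chip counts for a SuperPOD built from `num_systems` DGX GB200 systems."""
    superchips = num_systems * SUPERCHIPS_PER_SYSTEM
    return {
        "superchips": superchips,
        "grace_cpus": superchips * CPUS_PER_SUPERCHIP,
        "blackwell_gpus": superchips * GPUS_PER_SUPERCHIP,
    }

# Minimum SuperPOD configuration of eight systems:
print(superpod_totals(8))
# {'superchips': 288, 'grace_cpus': 288, 'blackwell_gpus': 576}
```

So even the entry configuration of eight systems lands at 576 Blackwell GPUs, before scaling out over Quantum InfiniBand.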


The system can deliver 240 terabytes of memory, which is critical for large language model (LLM) training and generative AI inference at a massive scale. Another impressive figure claimed by Nvidia is that the DGX SuperPOD has 11.5 exaflops of AI supercomputing power.

Advanced networking and data processing units enable gen AI SuperPOD fabric

A core element of what makes a DGX SuperPOD super is the fact that so many GB200 systems can be connected together with a unified compute fabric.

Powering that fabric is the newly announced Nvidia Quantum-X800 InfiniBand networking technology. This architecture provides up to 1,800 gigabytes per second of bandwidth to each GPU in the platform.

The DGX also integrates Nvidia BlueField-3 DPUs (data processing units) and the fifth-generation Nvidia NVLink interconnect.

Additionally, the new SuperPOD includes fourth-generation Nvidia Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology. The new version of SHARP delivers 14.4 teraflops of in-network computing, a 4x increase over the previous-generation DGX SuperPOD architecture.

Blackwell coming to Nvidia DGX Cloud

The new GB200-based DGX systems are also coming to the Nvidia DGX Cloud service.

The GB200 capabilities will be available first on Amazon Web Services (AWS), Google Cloud and Oracle Cloud.

“DGX Cloud is our cloud that we partnered deeply and co-designed with our cloud partners to provide the best Nvidia technology, for our own use in AI research and development and in our products, but also to make available to our customers,” Buck said.

The new GB200 will also help to advance the Project Ceiba supercomputer that Nvidia has been developing with AWS, first announced in November 2023. Project Ceiba is an effort to use DGX Cloud to create the world’s largest public cloud supercomputing platform.

“I’m pleased to announce that Project Ceiba has skipped ahead; we’ve now upgraded it to be Grace Blackwell, supporting 20,000 GPUs,” Buck said. “It will now deliver over 400 exaflops of AI.”

Author: Sean Michael Kerner
Source: Venturebeat
Reviewed By: Editorial Team
