NVIDIA DGX SuperPOD with NetApp - Design Guide
This NetApp Verified Architecture describes the design of the NVIDIA DGX SuperPOD with NetApp® BeeGFS® building blocks. This solution is a full-stack data center platform validated on a dedicated acceptance cluster at NVIDIA.
Amine Bennani, Christian Whiteside, David Arnette and Sathish Thyagarajan, NetApp
Executive summary
In today's rapidly evolving technological landscape, AI is revolutionizing consumer experiences and driving innovation across all industries. However, it also presents significant challenges for IT departments, which are under pressure to deploy high-performance computing (HPC) solutions capable of handling the intense demands of AI workloads. As organizations race to harness the power of AI, the urgency for a solution that is easy to deploy, scale, and manage grows.
NVIDIA DGX SuperPOD is an AI data center infrastructure platform delivered as a turnkey solution for IT to support the most complex AI workloads facing today’s enterprises. At the core of any accurate deep learning (DL) model are large volumes of data, requiring a high-throughput storage solution that can efficiently serve and re-serve this data. The NetApp BeeGFS solution, consisting of NetApp EF600 storage arrays with the BeeGFS parallel filesystem, enables the NVIDIA DGX SuperPOD to unleash its full capability. The NetApp BeeGFS solution has been validated by NVIDIA to integrate and scale with the SuperPOD architecture. The result is simplified AI data center deployment and management while delivering virtually limitless scalability for performance and capacity.
Solution overview
The NetApp BeeGFS solution, powered by the high-performance NetApp EF600 NVMe storage systems and the scalable BeeGFS parallel file system, offers a robust and efficient storage foundation for demanding AI workloads. Its shared-disk architecture ensures high availability, maintaining consistent performance and accessibility, even in the face of system challenges. This solution provides a scalable and flexible architecture that can be customized to meet diverse storage requirements. Customers can easily expand their storage performance and capacity by integrating additional storage building blocks to handle even the most demanding workloads.
Solution technology
-
NVIDIA DGX SuperPOD leverages DGX H100 and H200 systems with a validated externally attached shared storage:
-
Each DGX SuperPOD scalable unit (SU) consists of 32 DGX systems and is capable of 640 petaFLOPS of AI performance at FP8 precision. NetApp recommends sizing the NetApp BeeGFS storage solution with at least 2 building blocks for a single DGX SuperPOD configuration.
-
A high-level view of the solution
-
NetApp BeeGFS building blocks consists of two NetApp EF600 arrays and two x86 servers:
-
With NetApp EF600 all-flash arrays at the foundation of NVIDIA DGX SuperPOD, customers get a reliable storage foundation backed by six 9s of uptime.
-
The file system layer between the NetApp EF600 and the NVIDIA DGX systems is the BeeGFS parallel file system. BeeGFS was created by the Fraunhofer Center for High-Performance Computing in Germany to solve the pain points of legacy parallel file systems. The result is a file system with a modern, user space architecture that is now developed and delivered by ThinkParQ and used by many supercomputing environments.
-
NetApp support for BeeGFS aligns NetApp’s excellent support organization with customer requirements for performance and uptime. Customers get access to superior support resources, early access to BeeGFS releases, and access to select BeeGFS enterprise features such as quota enforcement and high availability (HA).
-
-
The combination of NVIDIA SuperPOD SUs and NetApp BeeGFS building blocks provides an agile AI solution in which compute or storage scales easily and seamlessly.
NetApp BeeGFS building block
Use Case Summary
This solution applies to the following use cases:
-
Artificial Intelligence (AI) including machine learning (ML), deep learning (DL), natural language processing (NLP), natural language understanding (NLU) and generative AI (GenAI).
-
Medium to large scale AI training
-
Computer vision, speech, audio, and language models
-
HPC including applications accelerated by message passing interface (MPI) and other distributed computing techniques
-
Application workloads characterized by the following:
-
Reading or writing to files larger than 1GB
-
Reading or writing to the same file by multiple clients (10s, 100s, and 1000s)
-
-
Multiterabyte or multipetabyte datasets
-
Environments that need a single storage namespace optimizable for a mix of large and small files
Technology Requirements
This section covers the technology requirements for the NVIDIA DGX SuperPOD with NetApp solution.
Hardware requirements
Table 1 below lists the hardware components that are required to implement the solution for a single SU. The solution sizing starts with 32 NVIDIA DGX H100 systems and two or three NetApp BeeGFS building blocks.
A single NetApp BeeGFS building block consists of two NetApp EF600 arrays and two x86 servers. Customers can add additional building blocks as the deployment size increases. For more information, see the NVIDIA DGX H100 SuperPOD reference architecture and NVA-1164-DESIGN: BeeGFS on NetApp NVA Design.
Hardware | Quantity |
---|---|
NVIDIA DGX H100 or H200 |
32 |
NVIDIA Quantum QM9700 switches |
8 leaf, 4 spine |
NetApp BeeGFS building blocks |
3 |
Software requirements
Table 2 below lists the software components required to implement the solution. The software components that are used in any particular implementation of the solution might vary based on customer requirements.
Software |
---|
NVIDIA DGX software stack |
NVIDIA Base Command Manager |
ThinkParQ BeeGFS parallel file system |
Solution verification
NVIDIA DGX SuperPOD with NetApp was validated on a dedicated acceptance cluster at NVIDIA by using NetApp BeeGFS building blocks. Acceptance criteria was based on a series of application, performance, and stress tests performed by NVIDIA. For more information, see the NVIDIA DGX SuperPOD: NetApp EF600 and BeeGFS Reference Architecture.
Conclusion
NetApp and NVIDIA have a long history of collaboration to deliver a portfolio of AI solutions to market. NVIDIA DGX SuperPOD with the NetApp EF600 all-flash array is a proven, validated solution that customers can deploy with confidence. This fully integrated, turnkey architecture takes the risk out of deployment and puts anyone on the path to winning the race to AI leadership.
Where to find additional information
To learn more about the information that is described in this document, review the following documents and/or websites: