Skip to main content
NetApp Solutions

Contributors kevin-hoke ntapdave

NVIDIA DGX SuperPOD with NetApp - Design Guide

200

Amine Bennani, David Arnette and Sathish Thyagarajan, NetApp

Executive summary

Although AI enhances consumers’ lives and helps organizations in all industries worldwide to innovate and to grow their businesses, it is a disrupter for IT. To support the business, IT departments are scrambling to deploy high-performance computing (HPC) solutions that can meet the extreme demands of AI workloads. As the race to win with AI intensifies, the need for an easy-to-deploy, easy-to-scale, and easy-to-manage solution becomes increasingly urgent.

The NVIDIA DGX SuperPOD makes supercomputing infrastructure easily accessible for any organization and delivers the extreme computational power needed to solve even the most complex AI problems. To help customers deploy at scale today, this NVIDIA and NetApp turnkey solution removes the complexity and guesswork from infrastructure design and delivers a complete, validated solution including best-in-class compute, networking, storage, and software.

Program summary

NVIDIA DGX SuperPOD with NVIDIA DGX H100 systems and NVIDIA Base Command brings together a design-optimized combination of AI computing, network fabric, storage, software, and support. The BeeGFS on NetApp architecture has been previously validated on a dedicated acceptance cluster at NVIDIA. The latest architecture extends that validation by maintaining the proven design while incorporating support for the latest hardware from NVIDIA.

Solution overview

NVIDIA DGX SuperPOD is an AI data center infrastructure platform delivered as a turnkey solution for IT to support the most complex AI workloads facing today’s enterprises. It simplifies deployment and management while delivering virtually limitless scalability for performance and capacity. In other words, DGX SuperPOD lets you focus on insights instead of infrastructure.
With NetApp EF600 all-flash arrays at the foundation of an NVIDIA DGX SuperPOD, customers get an agile AI solution that scales easily and seamlessly. The flexibility and scalability of the solution enable it to support and adapt to evolving workloads, making it a strong foundation to meet current and future storage requirements. Modular storage building blocks allow a granular approach to growth and scale seamlessly from terabytes to petabytes. By increasing the number of storage building blocks, customers can scale up the performance and capacity of the file system, enabling the solution to manage the most extreme workloads with ease.

Solution technology

  • NVIDIA DGX SuperPOD with NVIDIA DGX H100 systems leverates DGX H100 systems with validated externally attached shared storage:

    • Each DGX SuperPOD scalable unit (SU) consists of 32 DGX H100 systems and is capable of 640 petaFLOPS of AI performance at FP8 precision. It usually contains at least two NetApp BeeGFS building blocks depending on the performance and capacity requirements for a particular installation.

A high-level view of the solution
image::EF_SuperPOD_HighLevel.png[]

  • NetApp BeeGFS building blocks consists of two NetApp EF600 arrays and two x86 servers:

    • With NetApp EF600 all-flash arrays at the foundation of NVIDIA DGX SuperPOD, customers get a reliable storage foundation backed by six 9s of uptime.

    • The file system layer between the NetApp EF600 and the NVIDIA DGX H100 systems is the BeeGFS parallel file system. BeeGFS was created by the Fraunhofer Center for High-Performance Computing in Germany to solve the pain points of legacy parallel file systems. The result is a file system with a modern, user space architecture that is now developed and delivered by ThinkParQ and used by many supercomputing environments.

    • NetApp support for BeeGFS aligns NetApp’s excellent support organization with customer requirements for performance and uptime. Customers get access to superior support resources, early access to BeeGFS releases, and access to select BeeGFS enterprise features such as quota enforcement and high availability (HA).

  • The combination of NVIDIA SuperPOD SUs and NetApp BeeGFS building blocks provides an agile AI solution in which compute or storage scales easily and seamlessly.

NetApp BeeGFS building block
image::EF_SuperPOD_buildingblock.png[]

Use Case Summary

This solution applies to the following use cases:

  • Artificial Intelligence (AI) including machine learning (ML), deep learning (DL), natural language processing (NLP), natural language understanding (NLU) and g
    generative AI (GenAI).

  • Medium to large scale AI training

  • Computer vision, speech, audio, and language models

  • HPC including applications accelerated by message passing interface (MPI) and other distributed computing techniques

  • Application workloads characterized by the following:

    • Reading or writing to files larger than 1GB

    • Reading or writing to the same file by multiple clients (10s, 100s, and 1000s)

  • Multiterabyte or multipetabyte datasets

  • Environments that need a single storage namespace optimizable for a mix of large and small files

Technology Requirements

This section covers the technology requirements for the NVIDIA DGX SuperPOD with NetApp solution.

Hardware requirements

Table 1 below lists the hardware components that are required to implement the solution for a single SU. The solution sizing starts with 32 NVIDIA DGX H100 systems and two or three NetApp BeeGFS building blocks.
A single NetApp BeeGFS building block consists of two NetApp EF600 arrays and two x86 servers. Customers can add additional building blocks as the deployment size increases. For more information, see the NVIDIA DGX H100 SuperPOD reference architecture and NVA-1164-DESIGN: BeeGFS on NetApp NVA Design.

Table 1. Hardware requirements
Hardware Quantity

NVIDIA DGX H100

32

NVIDIA Quantum QM9700 switches

8 leaf, 4 spine

NetApp BeeGFS building blocks

3

Software requirements

Table 2 below lists the software components required to implement the solution. The software components that are used in any particular implementation of the solution might vary based on customer requirements.

Table 2. Software requirements
Software

NVIDIA DGX software stack

NVIDIA Base Command Manager

ThinkParQ BeeGFS parallel file system

Solution verification

NVIDIA DGX SuperPOD with NetApp was validated on a dedicated acceptance cluster at NVIDIA by using NetApp BeeGFS building blocks. Acceptance criteria was based on a series of application, performance, and stress tests performed by NVIDIA. For more information, see the NVIDIA DGX SuperPOD: NetApp EF600 and BeeGFS Reference Architecture.

Conclusion

NetApp and NVIDIA have a long history of collaboration to deliver a portfolio of AI solutions to market. NVIDIA DGX SuperPOD with the NetApp EF600 all-flash array is a proven, validated solution that customers can deploy with confidence. This fully integrated, turnkey architecture takes the risk out of deployment and puts anyone on the path to winning the race to AI leadership.

Where to find additional information

To learn more about the information that is described in this document, review the following documents and/or websites:
NVA-1164-DESIGN: BeeGFS on NetApp NVA Design
https://www.netapp.com/media/71123-nva-1164-design.pdf
NVA-1164-DEPLOY: BeeGFS on NetApp NVA Deployment
https://www.netapp.com/media/71124-nva-1164-deploy.pdf
NVIDIA DGX SuperPOD Reference Architecture
https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-h100/latest/index.html#
NVIDIA DGX SuperPOD Data Center Design Reference Guide
https://docs.nvidia.com/nvidia-dgx-superpod-data-center-design-dgx-h100.pdf
NVIDIA DGX SuperPOD: NetApp EF600 and BeeGFS
https://nvidiagpugenius.highspot.com/viewer/62915e2ef093f1a97b2d1fe6?iid=62913b14052a903cff46d054&source=email.62915e2ef093f1a97b2d1fe7.4