
NetApp AIPod with NVIDIA DGX Systems - Solution Validation and Sizing Guidance

Contributors kevin-hoke ntapdave

Solution Validation

The storage configuration in this solution was validated using a series of synthetic workloads generated with the open-source tool FIO. These tests included read and write I/O patterns intended to simulate the storage workload generated by DGX systems performing deep learning training jobs. The workloads were run concurrently from a cluster of 2-socket CPU servers to simulate a cluster of DGX systems. Each client used the same network configuration described previously, with the addition of the client settings detailed below.
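As an illustration, a sequential-read FIO job in the style of these synthetic workloads can be launched as shown below. The directory, block size, queue depth, and job counts are placeholders, not the exact job definitions used in the validation.

  fio --name=dl_read --directory=/mnt/aipod_data --ioengine=libaio --direct=1 \
      --rw=read --bs=1M --iodepth=64 --numjobs=8 --size=10G \
      --runtime=300 --time_based --group_reporting

Write phases can be simulated by changing --rw to write (or randrw for mixed patterns) while keeping the other parameters consistent.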

The following mount options were used for this validation (an example mount entry follows the list):
• vers=4.1 # enables pNFS for parallel access to multiple storage nodes
• proto=rdma # sets the transfer protocol to RDMA instead of the default TCP
• port=20049 # specifies the correct port for the RDMA NFS service
• max_connect=16 # enables NFS session trunking to aggregate storage port bandwidth
• write=eager # improves write performance of buffered writes
• rsize=262144,wsize=262144 # sets the I/O transfer size to 256k
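
For reference, these options can be combined in a single /etc/fstab entry such as the sketch below. The storage LIF address, export path, and mount point are placeholders and will differ per deployment.

  <storage LIF IP>:/aipod_data  /mnt/aipod_data  nfs  vers=4.1,proto=rdma,port=20049,max_connect=16,write=eager,rsize=262144,wsize=262144  0  0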

In addition, the clients were configured with an NFS max_session_slots value of 1024. Because the solution was tested using NFS over RDMA, the storage network ports were configured in an active/passive bond. The following bond parameters were used for this validation (a configuration sketch follows the list):
• mode=active-backup # sets the bond to active/passive mode
• primary=<interface name> # primary interfaces for all clients were distributed across the switches
• mii-monitor-interval=100 # specifies monitoring interval of 100ms
• fail-over-mac-policy=active # sets the bond's MAC address to that of the currently active interface, which is required for proper operation of RDMA over the bonded interface.
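
As a sketch, the max_session_slots value can be applied through a modprobe configuration file, and the bond parameters above map directly onto a netplan definition along the following lines. Interface names, file names, and addressing shown here are placeholders.

  # /etc/modprobe.d/nfsclient.conf
  options nfs max_session_slots=1024

  # /etc/netplan/storage-bond.yaml
  network:
    version: 2
    ethernets:
      <interface 1>: {}
      <interface 2>: {}
    bonds:
      bond0:
        interfaces: [<interface 1>, <interface 2>]
        addresses: [<storage network IP>/24]
        parameters:
          mode: active-backup
          primary: <interface 1>
          mii-monitor-interval: 100
          fail-over-mac-policy: active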

The storage system was configured as described previously, with two A900 HA pairs (four controllers) and two NS224 disk shelves of 24 1.9TB NVMe drives attached to each HA pair. As noted in the architecture section, storage capacity from all controllers was combined in a single FlexGroup volume, and data from all clients was distributed across all the controllers in the cluster.
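For illustration, a FlexGroup volume spanning aggregates on all four controllers can be created with an ONTAP command along these lines. The SVM name, aggregate names, constituent multiplier, size, and junction path are placeholders to be adapted to the actual cluster.

  volume create -vserver <svm_name> -volume aipod_data \
    -aggr-list <aggr_node1>,<aggr_node2>,<aggr_node3>,<aggr_node4> \
    -aggr-list-multiplier 8 -size 400TB -junction-path /aipod_data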

Storage System Sizing Guidance

NetApp has successfully completed the DGX BasePOD certification, and the two A900 HA pairs as tested can easily support a cluster of eight DGX H100 systems. For larger deployments with higher storage performance requirements, additional AFF systems can be added to the ONTAP cluster, up to 12 HA pairs (24 nodes) in a single cluster. Using the FlexGroup technology described in this solution, a 24-node cluster can provide over 40 PB of capacity and up to 300 GBps of throughput in a single namespace. Other NetApp storage systems such as the AFF A400, A250, and C800 offer lower-performance and/or higher-capacity options for smaller deployments at lower cost points. Because ONTAP 9 supports mixed-model clusters, customers can start with a smaller initial footprint and add more or larger storage systems to the cluster as capacity and performance requirements grow. The table below shows a rough estimate of the number of A100 and H100 GPUs supported on each AFF model.

[Table: NetApp storage system sizing guidance (image not available)]