NVA-1173 NetApp AIPod with NVIDIA DGX Systems - Solution Validation and Sizing Guidance
This section focuses on the solution validation and sizing guidance for the NetApp AIPod with NVIDIA DGX systems.
Solution Validation
The storage configuration in this solution was validated with a series of synthetic workloads generated by the open-source tool FIO. These tests included read and write I/O patterns intended to simulate the storage workload generated by DGX systems performing deep learning training jobs. The FIO workloads were run concurrently on a cluster of 2-socket CPU servers to simulate a cluster of DGX systems. Each client used the same network configuration described previously, with the addition of the following details.
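To illustrate how such a workload can be generated, the following is a minimal FIO job file for a sequential-read test against an NFS mount; the mount point (/mnt/aipod), block size, queue depth, job count, and file size shown here are illustrative assumptions, not the exact job profiles used in this validation.

    # example-read.fio -- illustrative sequential-read job
    [global]
    ioengine=libaio
    direct=1
    directory=/mnt/aipod
    bs=1m
    iodepth=16
    numjobs=8
    size=10g
    runtime=300
    time_based
    group_reporting

    [seq-read]
    rw=read

A corresponding write test can be defined by changing rw=read to rw=write, and random-access patterns with rw=randread or rw=randwrite; running the same job file concurrently on every client aggregates load across the simulated cluster.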
The following mount options were used for this validation:
vers=4.1 | enables pNFS for parallel access to multiple storage nodes
proto=rdma | sets the transfer protocol to RDMA instead of the default TCP
port=20049 | specifies the correct port for the RDMA NFS service
max_connect=16 | enables NFS session trunking to aggregate storage port bandwidth
write=eager | improves write performance of buffered writes
rsize=262144,wsize=262144 | sets the I/O transfer size to 256k
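As an example, the options above can be combined into a single mount command similar to the one below; the storage LIF address, export path, and mount point are placeholders to be replaced with site-specific values.

    sudo mount -t nfs -o vers=4.1,proto=rdma,port=20049,max_connect=16,write=eager,rsize=262144,wsize=262144 \
        <storage_lif_ip>:/<flexgroup_export> /mnt/aipod

The same option string can also be placed in /etc/fstab to make the mount persistent across reboots.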
In addition, the clients were configured with an NFS max_session_slots value of 1024. Because the solution was tested using NFS over RDMA, the storage network ports were configured in an active/passive bond. The following bond parameters were used for this validation:
mode=active-backup | sets the bond to active/passive mode
primary=<interface name> | primary interfaces for all clients were distributed across the switches
mii-monitor-interval=100 | specifies a monitoring interval of 100ms
fail-over-mac-policy=active | specifies that the MAC address of the active link is used as the MAC of the bond; this is required for proper operation of RDMA over the bonded interface
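On clients running a netplan-managed distribution such as Ubuntu or DGX OS (an assumption for this sketch), the bond parameters above can be expressed in a configuration similar to the following; the interface names and address are placeholders.

    # /etc/netplan/storage-bond.yaml (illustrative)
    network:
      version: 2
      ethernets:
        enp41s0f0: {}
        enp41s0f1: {}
      bonds:
        bond0:
          interfaces: [enp41s0f0, enp41s0f1]
          parameters:
            mode: active-backup
            primary: enp41s0f0
            mii-monitor-interval: 100
            fail-over-mac-policy: active
          addresses: [192.168.100.11/24]

The max_session_slots value mentioned earlier can be made persistent with an NFS module option, for example a line such as "options nfs max_session_slots=1024" in /etc/modprobe.d/nfsclient.conf, which takes effect after the nfs module is reloaded or the client is rebooted.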
The storage system was configured as described previously, with two AFF A900 HA pairs (four controllers) and two NS224 disk shelves of 24 1.9TB NVMe drives attached to each HA pair. As noted in the architecture section, storage capacity from all controllers was combined using a FlexGroup volume, and data from all clients was distributed across all the controllers in the cluster.
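For context, a FlexGroup volume spanning the aggregates on all four controllers can be created from the ONTAP CLI with a command along the lines of the following; the SVM name, volume name, size, and junction path are placeholder values, and the exact options available may vary by ONTAP release.

    cluster1::> volume create -vserver <svm_name> -volume aipod_flexgroup -size 800TB -auto-provision-as flexgroup -junction-path /aipod

With -auto-provision-as flexgroup, ONTAP distributes the volume's constituent members across aggregates in the cluster, which is what allows client I/O to be spread across all controllers as described above.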
Storage System Sizing Guidance
NetApp has successfully completed the DGX BasePOD certification, and the two A900 HA pairs as tested can easily support a cluster of sixteen DGX H100 systems. For larger deployments with higher storage performance requirements, additional AFF systems can be added to the ONTAP cluster, up to 12 HA pairs (24 nodes) in a single cluster. Using the FlexGroup technology described in this solution, a 24-node cluster can provide over 79PB of capacity and up to 552 GBps of throughput in a single namespace. Other NetApp storage systems such as the AFF A400, AFF A250, and AFF C800 offer lower-performance and/or higher-capacity options for smaller deployments at lower cost points. Because ONTAP 9 supports mixed-model clusters, customers can start with a smaller initial footprint and add larger or additional storage systems to the cluster as capacity and performance requirements grow. The table below shows a rough estimate of the number of A100 and H100 GPUs supported on each AFF model.
NetApp storage system sizing guidance