Solution Technology
This solution was implemented with one NetApp AFF A800 system, two DGX-1 servers, and two Cisco Nexus 3232C 100GbE switches. Each DGX-1 server is connected to the Nexus switches with four 100GbE connections that are used for inter-GPU communications by using RDMA over Converged Ethernet (RoCE). Traditional IP communications for NFS storage access also occur on these links. Each storage controller is connected to the network switches by using four 100GbE links. The following figure shows the ONTAP AI solution architecture used in this technical report for all testing scenarios.
Hardware Used in This Solution
This solution was validated using the ONTAP AI reference architecture with two DGX-1 nodes and one AFF A800 storage system. See NVA-1121 for more details about the infrastructure used in this validation.
The following table lists the hardware components that are required to implement the solution as tested.
Hardware | Quantity |
---|---|
DGX-1 systems | 2 |
AFF A800 system | 1 |
Cisco Nexus 3232C switches | 2 |
Software Requirements
This solution was validated using a basic Kubernetes deployment with the Run:AI operator installed. Kubernetes was deployed using the NVIDIA DeepOps deployment engine, which deploys all required components for a production-ready environment. DeepOps automatically deployed NetApp Trident for persistent storage integration with the Kubernetes environment, and default storage classes were created so that containers can leverage storage from the AFF A800 storage system. For more information on Trident with Kubernetes on ONTAP AI, see TR-4798.
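As an illustration of how containers consume that storage, the following minimal sketch uses the official Kubernetes Python client to request a persistent volume from a Trident-backed storage class. The storage class name `ontap-flexvol` and the claim name `dataset-pvc` are assumptions for this example, not values created by DeepOps; substitute the default storage class that exists in your environment.

```python
# Minimal sketch: request a Trident-provisioned persistent volume on the AFF A800.
# Assumes a kubeconfig with access to the DeepOps-deployed cluster and a
# Trident-backed StorageClass named "ontap-flexvol" (assumed name).
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="dataset-pvc"),  # assumed claim name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],           # NFS volumes support shared access
        storage_class_name="ontap-flexvol",       # assumed Trident storage class
        resources=client.V1ResourceRequirements(
            requests={"storage": "100Gi"}         # example capacity request
        ),
    ),
)

core_v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```

A pod or Run:AI job can then reference `dataset-pvc` as a volume; Trident provisions the corresponding volume on the AFF A800 and exports it to the containers over NFS.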
The following table lists the software components that are required to implement the solution as tested.
Software | Version or Other Information |
---|---|
NetApp ONTAP data management software | 9.6p4 |
Cisco NX-OS switch firmware | 7.0(3)I6(1) |
NVIDIA DGX OS | 4.0.4 - Ubuntu 18.04 LTS |
Kubernetes version | 1.17 |
Trident version | 20.04.0 |
Run:AI CLI | v2.1.13 |
Run:AI Orchestration Kubernetes Operator version | 1.0.39 |
Docker container platform | 18.06.1-ce [e68fc7a] |
Additional software requirements for Run:AI can be found at Run:AI GPU cluster prerequisites.