Solution Technology

Contributors: netapp-dorianh, kevin-hoke

This solution was implemented with one NetApp AFF A800 system, two DGX-1 servers, and two Cisco Nexus 3232C 100GbE switches. Each DGX-1 server is connected to the Nexus switches with four 100GbE connections that carry inter-GPU communications by using remote direct memory access (RDMA) over Converged Ethernet (RoCE). Traditional IP communications for NFS storage access also occur on these links. Each storage controller is connected to the network switches by using four 100GbE links. The following figure shows the ONTAP AI solution architecture used in this technical report for all testing scenarios.

[Figure: ONTAP AI solution architecture (image not available)]
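As a rough aid to reading the topology, the link counts above can be turned into aggregate line-rate figures. The sketch below is only illustrative: the link counts and speeds come from the text, while the controller count (two, assuming the AFF A800 is a single HA pair) and all names are assumptions. The results are nominal line rates, not measured throughput.

```python
# Nominal line-rate arithmetic for the topology described above.
# Link counts and speeds come from the text; STORAGE_CONTROLLERS is an assumption.
LINK_GBPS = 100            # each connection is 100GbE
DGX_SERVERS = 2
LINKS_PER_DGX = 4
LINKS_PER_CONTROLLER = 4
STORAGE_CONTROLLERS = 2    # assumption: one AFF A800 HA pair has two controllers


def aggregate_gbps(nodes, links_per_node, link_gbps=LINK_GBPS):
    """Return the summed nominal line rate in Gb/s for a group of nodes."""
    return nodes * links_per_node * link_gbps


compute_gbps = aggregate_gbps(DGX_SERVERS, LINKS_PER_DGX)
storage_gbps = aggregate_gbps(STORAGE_CONTROLLERS, LINKS_PER_CONTROLLER)

print(f"DGX-1 side:   {compute_gbps} Gb/s nominal")
print(f"Storage side: {storage_gbps} Gb/s nominal")
```

Because the DGX-1 links carry both RoCE and NFS traffic, the bandwidth actually available to either kind of traffic is lower than these nominal figures.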

Hardware Used in This Solution

This solution was validated using the ONTAP AI reference architecture with two DGX-1 nodes and one AFF A800 storage system. See NVA-1121 for more details about the infrastructure used in this validation.

The following table lists the hardware components that are required to implement the solution as tested.

| Hardware | Quantity |
|---|---|
| DGX-1 systems | 2 |
| AFF A800 | 1 |
| Nexus 3232C switches | 2 |

Software Requirements

This solution was validated using a basic Kubernetes deployment with the Run:AI operator installed. Kubernetes was deployed using the NVIDIA DeepOps deployment engine, which deploys all required components for a production-ready environment. DeepOps automatically deployed NetApp Trident for persistent storage integration with the Kubernetes environment, and default storage classes were created so that containers can use storage from the AFF A800 storage system. For more information on Trident with Kubernetes on ONTAP AI, see TR-4798.
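To make the idea of a Trident-backed default storage class more concrete, the following sketch builds a Kubernetes StorageClass manifest and prints it as JSON, which `kubectl apply -f` also accepts. It is not the manifest that DeepOps generated in this validation: the class name `ontap-ai-nas`, the `ontap-nas` backend type, and the helper function are assumptions chosen for illustration.

```python
import json


def trident_storage_class(name, backend_type, is_default=False):
    """Build a StorageClass manifest that uses the Trident CSI provisioner.

    name and backend_type are illustrative; use the values that match
    the Trident backend configured in your own cluster.
    """
    manifest = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "csi.trident.netapp.io",
        "parameters": {"backendType": backend_type},
    }
    if is_default:
        # Standard Kubernetes annotation that marks a class as the cluster default.
        manifest["metadata"]["annotations"] = {
            "storageclass.kubernetes.io/is-default-class": "true"
        }
    return manifest


# Hypothetical class name and backend type, for illustration only.
sc = trident_storage_class("ontap-ai-nas", "ontap-nas", is_default=True)
print(json.dumps(sc, indent=2))
```

With a default class in place, a PersistentVolumeClaim that omits `storageClassName` is provisioned by Trident without further configuration.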

The following table lists the software components that are required to implement the solution as tested.

| Software | Version or Other Information |
|---|---|
| NetApp ONTAP data management software | 9.6p4 |
| Cisco NX-OS switch firmware | 7.0(3)I6(1) |
| NVIDIA DGX OS | 4.0.4 - Ubuntu 18.04 LTS |
| Kubernetes version | 1.17 |
| Trident version | 20.04.0 |
| Run:AI CLI | v2.1.13 |
| Run:AI Orchestration Kubernetes Operator version | 1.0.39 |
| Docker container platform | 18.06.1-ce [e68fc7a] |

Additional software requirements for Run:AI can be found at Run:AI GPU cluster prerequisites.
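When reproducing this setup, it can help to compare the versions found in an environment against the tested versions listed in the software table. The sketch below shows one possible way to do that. The dictionary keys and the `version_mismatches` helper are hypothetical names, and collecting the installed versions from each component is left to the reader.

```python
# Tested versions, copied from the software table in this section.
# The dictionary keys are illustrative labels, not official component names.
VALIDATED_VERSIONS = {
    "ontap": "9.6p4",
    "nx-os": "7.0(3)I6(1)",
    "dgx-os": "4.0.4 - Ubuntu 18.04 LTS",
    "kubernetes": "1.17",
    "trident": "20.04.0",
    "runai-cli": "v2.1.13",
    "runai-operator": "1.0.39",
    "docker": "18.06.1-ce [e68fc7a]",
}


def version_mismatches(installed, validated=VALIDATED_VERSIONS):
    """Return {component: (validated, installed)} for every differing entry.

    Components missing from `installed` are reported with None.
    """
    return {
        name: (expected, installed.get(name))
        for name, expected in validated.items()
        if installed.get(name) != expected
    }


# Example: an environment identical to the tested one except for Kubernetes.
example = dict(VALIDATED_VERSIONS, kubernetes="1.18")
for name, (expected, found) in version_mismatches(example).items():
    print(f"{name}: tested with {expected}, found {found}")
```

A mismatch does not necessarily mean that a configuration is unsupported. It only indicates a difference from the environment that was tested here.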