Hardware and Software Requirements
This section covers the technology requirements for the ONTAP AI solution.
Hardware Requirements
Although hardware requirements depend on specific customer workloads, ONTAP AI can be deployed at any scale for data engineering, model training, and production inferencing, from a single GPU up to rack-scale configurations for large-scale ML/DL operations. For more information about ONTAP AI, see the ONTAP AI website.
This solution was validated using a DGX-1 system for compute, a NetApp AFF A800 storage system, and Cisco Nexus 3232C for network connectivity. The AFF A800 used in this validation can support as many as 10 DGX-1 systems for most ML/DL workloads. The following figure shows the ONTAP AI topology used for model training in this validation.
To extend this solution to the public cloud, Cloud Volumes ONTAP can be deployed alongside cloud GPU compute resources and integrated into a hybrid cloud data fabric, enabling customers to use whichever resources are appropriate for a given workload.
Software Requirements
The following table shows the specific software versions used in this solution validation.
Component | Version
---|---
Ubuntu | 18.04.4 LTS
NVIDIA DGX OS | 4.4.0
NVIDIA DeepOps | 20.02.1
Kubernetes | 1.15
Helm | 3.1.0
cnvrg.io | 3.0.0
NetApp ONTAP | 9.6P4
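The exact method for confirming installed versions varies by environment, but a minimal sketch along the following lines can record the versions present on a deployment host for comparison with the table above. The commands invoked (lsb_release, kubectl, helm) are assumed to be installed and on the PATH; this is an illustrative check, not part of the validated solution.

```python
# Sketch: capture versions of key stack components for comparison with the
# validated versions listed above. Assumes lsb_release, kubectl, and helm
# are installed and on the PATH of the host where this runs.
import subprocess

def run(cmd):
    """Run a command and return its trimmed output, or 'not found' on failure."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "not found"

if __name__ == "__main__":
    print("OS:        ", run(["lsb_release", "-ds"]))
    print("Kubernetes:", run(["kubectl", "version", "--short"]))
    print("Helm:      ", run(["helm", "version", "--short"]))
```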
For this solution validation, Kubernetes was deployed as a single-node cluster on the DGX-1 system. For large-scale deployments, independent Kubernetes master nodes should be deployed to provide high availability of management services and to reserve valuable DGX resources for ML and DL workloads.
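As a quick sanity check of the resulting cluster topology, the following sketch uses the official Kubernetes Python client to list nodes and report which ones carry the master role label. It assumes the `kubernetes` package is installed and a valid kubeconfig is available on the host; it is not part of the validated deployment procedure.

```python
# Sketch: list Kubernetes nodes and show which carry the master (control-plane) role.
# Assumes the official `kubernetes` Python client is installed (pip install kubernetes)
# and that a valid kubeconfig is available on the host running the check.
from kubernetes import client, config

def report_node_roles():
    config.load_kube_config()          # use the current kubeconfig context
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        # Kubernetes 1.15 labels master nodes with node-role.kubernetes.io/master
        role = "master" if "node-role.kubernetes.io/master" in labels else "worker"
        print(f"{node.metadata.name}: {role}")

if __name__ == "__main__":
    report_node_roles()
```

In a single-node validation setup such as this one, the DGX-1 system appears as both master and worker; in a larger deployment, the master role should appear only on the dedicated management nodes.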