Use Case Overview and Problem Statement
PDF of this doc site
- AI Converged Infrastructures
Data Pipelines, Data Lakes and Management
- NetApp AI Control Plane
- Distributed training in Azure - Click-Through Rate Prediction
- Conversational AI using NVIDIA
- Anthos with NetApp
- DevOps with NetApp Astra
Red Hat OpenShift with NetApp
- NetApp Storage Integrations Overview
Solution Validation and Use Cases
- Red Hat OpenShift Virtualization with NetApp ONTAP
- VMware Tanzu with NetApp
- Data Migration and Data Protection
- Oracle Database
- SnapCenter for Databases
- Modern Data Analytics
NetApp Hybrid Multicloud with VMware
- VMware for Public Cloud
- NetApp for GCP / GCVE
- NetApp Hybrid Multicloud with Red Hat OpenShift
- VMware vSphere Automation
- Virtual Desktops
- Artificial Intelligence
Datasets and dataset versions are typically located in a data lake, such as NetApp StorageGrid object-based storage, which offers reduced cost and other operational advantages. Data scientists pull these datasets and engineer them in multiple steps to prepare them for training with a specific model, often creating multiple versions along the way. As the next step, the data scientist must pick optimized compute resources (GPUs, high-end CPU instances, an on-premises cluster, and so on) to run the model. The following figure depicts the lack of dataset proximity in an ML compute environment.
However, multiple training experiments must run in parallel in different compute environments, each of which require a download of the dataset from the data lake, which is an expensive and time-consuming process. Proximity of the dataset to the compute environment (especially for a hybrid cloud) is not guaranteed. In addition, other team members that run their own experiments with the same dataset must go through the same arduous process. Beyond the obvious slow data access, challenges include difficulties tracking dataset versions, dataset sharing, collaboration, and reproducibility.
Customer requirements can vary in order to achieve high- performance ML runs while efficiently using resources; for example, customers might require the following:
Fast access to datasets from each compute instance executing the training model without incurring expensive downloads and data access complexities
The use any compute instance (GPU or CPU) in the cloud or on-premises without concern for the location of the datasets
Increased efficiency and productivity by running multiple training experiments in parallel with different compute resources on the same dataset without unnecessary delays and data latency
Minimized compute instance costs
Improved reproducibility with tools to keep records of the datasets, their lineage, versions, and other metadata details
Enhanced sharing and collaboration so that any authorized member of the team can access the datasets and run experiments
To implement dataset caching with NetApp ONTAP data management software, customers must perform the following tasks:
Configure and set the NFS storage that is closest to the compute resources.
Determine which dataset and version to cache.
Monitor the total memory committed to cached datasets and how much NFS storage is available for additional cache commits (for example, cache management).
Age out of datasets in the cache if they have not been used in certain time. The default is one day; other configuration options are available.