Use Case Overview and Problem Statement
Datasets and dataset versions are typically located in a data lake, such as NetApp StorageGRID object-based storage, which offers reduced cost and other operational advantages. Data scientists pull these datasets and engineer them in multiple steps to prepare them for training with a specific model, often creating multiple versions along the way. As the next step, the data scientist must pick optimized compute resources (GPUs, high-end CPU instances, an on-premises cluster, and so on) to run the model. The following figure depicts the lack of dataset proximity in an ML compute environment.
However, multiple training experiments must run in parallel in different compute environments, each of which requires a download of the dataset from the data lake, which is an expensive and time-consuming process. Proximity of the dataset to the compute environment (especially for a hybrid cloud) is not guaranteed. In addition, other team members who run their own experiments with the same dataset must go through the same arduous process. Beyond the obvious slow data access, challenges include difficulties with tracking dataset versions, dataset sharing, collaboration, and reproducibility.
Customer Requirements
Customer requirements for achieving high-performance ML runs while using resources efficiently can vary; for example, customers might require the following:
- Fast access to datasets from each compute instance executing the training model without incurring expensive downloads and data access complexities
- The use of any compute instance (GPU or CPU) in the cloud or on-premises without concern for the location of the datasets
- Increased efficiency and productivity by running multiple training experiments in parallel with different compute resources on the same dataset without unnecessary delays and data latency
- Minimized compute instance costs
- Improved reproducibility with tools to keep records of the datasets, their lineage, versions, and other metadata details
- Enhanced sharing and collaboration so that any authorized member of the team can access the datasets and run experiments
To implement dataset caching with NetApp ONTAP data management software, customers must perform the following tasks:
- Configure and set up the NFS storage that is closest to the compute resources.
- Determine which dataset and version to cache.
- Monitor the total memory committed to cached datasets and how much NFS storage is available for additional cache commits (that is, cache management).
- Age out datasets in the cache if they have not been used within a certain time. The default is one day; other configuration options are available.
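The monitoring and age-out tasks above can be illustrated with a minimal Python sketch. It is not the product's implementation and does not call any ONTAP API. It assumes the NFS cache is mounted as an ordinary directory with a hypothetical layout of `<cache_root>/<dataset>/<version>/`, and that each job updates a hypothetical `.last_used` marker file when it reads a cached version. The function names and the marker file are illustrative assumptions; only the one-day default comes from the text.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Iterator, List, Optional, Tuple

ONE_DAY_SECONDS = 24 * 60 * 60  # default age-out period stated in the text
MARKER = ".last_used"           # hypothetical marker file, not an ONTAP feature


def mark_used(version_dir: Path, when: Optional[float] = None) -> None:
    """Record that a cached dataset version was just used by a job."""
    marker = version_dir / MARKER
    marker.touch(exist_ok=True)  # creates the marker or bumps its mtime to now
    if when is not None:
        os.utime(marker, (when, when))  # explicit timestamp, useful for testing


def cached_versions(cache_root: Path) -> Iterator[Path]:
    """Yield every <dataset>/<version> directory under the cache root."""
    for dataset_dir in sorted(cache_root.iterdir()):
        if dataset_dir.is_dir():
            for version_dir in sorted(dataset_dir.iterdir()):
                if version_dir.is_dir():
                    yield version_dir


def cache_usage(cache_root: Path) -> Tuple[int, int]:
    """Return (bytes committed to cached datasets, bytes free on the volume)."""
    committed = sum(p.stat().st_size for p in cache_root.rglob("*") if p.is_file())
    free = shutil.disk_usage(cache_root).free
    return committed, free


def age_out(cache_root: Path,
            max_age: float = ONE_DAY_SECONDS,
            now: Optional[float] = None) -> List[str]:
    """Delete cached versions not used within max_age seconds."""
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletion does not disturb iteration.
    for version_dir in list(cached_versions(cache_root)):
        marker = version_dir / MARKER
        # With no marker, fall back to the directory's own modification time.
        last_used = (marker if marker.exists() else version_dir).stat().st_mtime
        if now - last_used > max_age:
            shutil.rmtree(version_dir)
            removed.append(f"{version_dir.parent.name}/{version_dir.name}")
    return removed
```

In this sketch, a scheduler could call `cache_usage` before committing a new dataset version to check the remaining capacity, and call `age_out` periodically with a `max_age` that matches the configured retention period. An explicit marker file is used instead of file access times because access-time updates are often disabled or coarsened on mounted file systems.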