TR-4810: NetApp ONTAP and Lenovo ThinkSystem SR670 for AI and ML Model Training Workloads

Karthikeyan Nagalingam, NetApp
Miroslav Hodak, Lenovo

TR-4810 describes a cost-effective, entry-level compute and storage architecture to deploy GPU-based artificial intelligence (AI) training on NetApp storage controllers and Lenovo ThinkSystem servers. The setup is designed as a shared resource for small to medium-sized teams running multiple training jobs in parallel.

TR-4810 provides performance data for the industry-standard MLPerf benchmark evaluating image classification training with TensorFlow on V100 GPUs. To measure performance, we used ResNet50 with the ImageNet dataset, a batch size of 512, half precision, CUDA, and cuDNN. We performed this analysis using four-GPU SR670 servers and an entry-level NetApp storage system. The results show highly efficient performance across the multiple use cases tested here: shared, multiuser, multijob cases, with individual jobs scaling up to four servers. Large scale-out jobs were less efficient but still feasible.