TR-4810: NetApp AFF A400 with Lenovo ThinkSystem SR670 V2 for AI and ML Model Training
Sathish Thyagarajan, David Arnette, NetApp
Mircea Troaca, Lenovo
This solution presents a mid-range cluster architecture using NetApp storage and Lenovo servers optimized for artificial intelligence (AI) workloads. It is meant for small- to medium-sized enterprises in which most compute jobs run on a single node (with one or more GPUs) or are distributed over a few compute nodes. This profile matches most day-to-day AI training jobs for many businesses.
This document covers testing and validation of a compute and storage configuration consisting of eight-GPU Lenovo SR670 V2 servers, a mid-range NetApp AFF A400 storage system, and a 100GbE interconnect switch. To measure performance, we used ResNet50 with the ImageNet dataset, a batch size of 408, half precision, CUDA, and cuDNN. This architecture provides an efficient and cost-effective solution for small- and medium-sized organizations that are just starting out with AI initiatives and that require the enterprise-grade capabilities of NetApp ONTAP cloud-connected data storage.
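The benchmark parameters named above can be captured in a small configuration object. The following is a minimal sketch, not the harness used in the validation. The class name `BenchmarkConfig`, its field names, and the `steps_per_epoch` helper are hypothetical. It assumes that 408 is the global batch size (the text does not say whether the figure is per GPU or global) and that "ImageNet" refers to the ILSVRC-2012 training split with 1,281,167 images.

```python
import math
from dataclasses import dataclass

# Assumption: ILSVRC-2012 training split size; the document only says "ImageNet".
IMAGENET_TRAIN_IMAGES = 1_281_167


@dataclass(frozen=True)
class BenchmarkConfig:
    """Hypothetical container for the benchmark settings named in the text."""

    model: str = "resnet50"
    dataset: str = "imagenet"
    batch_size: int = 408    # from the text; assumed here to be the global batch size
    precision: str = "fp16"  # "half precision" in the text

    def steps_per_epoch(self, num_images: int = IMAGENET_TRAIN_IMAGES) -> int:
        """Number of training steps needed to read every image once."""
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
        # The last batch may be partial, so round up.
        return math.ceil(num_images / self.batch_size)


if __name__ == "__main__":
    cfg = BenchmarkConfig()
    print(cfg)
    print("steps per epoch:", cfg.steps_per_epoch())
```

Under these assumptions, the step count gives a rough sense of how many batches the storage system must deliver per epoch. If 408 is instead a per-GPU value, multiply `batch_size` by the number of GPUs before calling the helper.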
Target audience
This document is intended for the following audiences:
- Data scientists, data engineers, data administrators, and developers of AI systems
- Enterprise architects who design solutions for the development of AI models
- Data scientists and data engineers who are looking for efficient ways to achieve deep learning (DL) and machine learning (ML) development goals
- Business leaders and OT/IT decision makers who want to achieve the fastest possible time to market for AI initiatives
Solution architecture
This solution, which combines Lenovo ThinkSystem servers with NetApp ONTAP on AFF storage, is designed to handle AI training on large datasets using the processing power of GPUs alongside traditional CPUs. This validation demonstrates high performance and optimal data management with a scale-out architecture that uses one, two, or four Lenovo SR670 V2 servers alongside a single NetApp AFF A400 storage system. The following figure provides an architectural overview.
This NetApp and Lenovo solution offers the following key benefits:
- Highly efficient and cost-effective performance when executing multiple training jobs in parallel
- Scalable performance based on different numbers of Lenovo servers and different models of NetApp storage controllers
- Robust data protection to meet low recovery point objectives (RPOs) and recovery time objectives (RTOs) with no data loss
- Optimized data management with snapshots and clones to streamline development workflows
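One way to quantify the scalable-performance benefit across the one-, two-, and four-server configurations is scaling efficiency: measured throughput divided by the throughput that perfectly linear scaling from the smallest configuration would give. The following is a minimal sketch under stated assumptions. The function name `scaling_efficiency` is hypothetical, and the throughput values in the usage example are placeholders chosen for illustration, not results from this validation.

```python
from typing import Dict, Mapping


def scaling_efficiency(throughput_by_servers: Mapping[int, float]) -> Dict[int, float]:
    """Return throughput relative to linear scaling from the smallest configuration.

    Keys are server counts; values are aggregate training throughput
    (for example, images per second). An efficiency of 1.0 means linear scaling.
    """
    if not throughput_by_servers:
        raise ValueError("need at least one measurement")
    base_servers = min(throughput_by_servers)
    base_throughput = throughput_by_servers[base_servers]
    if base_servers <= 0 or base_throughput <= 0:
        raise ValueError("server counts and baseline throughput must be positive")

    result = {}
    for servers, throughput in sorted(throughput_by_servers.items()):
        # Throughput expected if performance grew in proportion to server count.
        ideal = base_throughput * servers / base_servers
        result[servers] = throughput / ideal
    return result


if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute measured values.
    example = {1: 1000.0, 2: 1900.0, 4: 3600.0}
    for servers, eff in scaling_efficiency(example).items():
        print(f"{servers} server(s): {eff:.0%} of linear scaling")
```

Values close to 1.0 suggest that neither the storage system nor the network is limiting the added servers. A pronounced drop at higher server counts is a signal to look at the shared components.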