Technology overview
This section introduces the major components of this solution in greater detail.
NetApp AFF systems
NetApp AFF storage systems enable businesses to meet enterprise storage requirements with industry-leading performance, superior flexibility, cloud integration, and best-in-class data management. Designed specifically for flash, AFF systems help accelerate, manage, and protect business-critical data.
NetApp AFF A400 is a mid-range NVMe flash storage system that includes the following features:
- Maximum effective capacity: ~20PB
- Maximum scale-out: 2-24 nodes (12 HA pairs)
- 25GbE and 16Gb FC host support
- 100GbE RDMA over Converged Ethernet (RoCE) connectivity to NVMe expansion storage shelves
- 100GbE RoCE ports can be used for host network attachment if NVMe shelves aren’t attached
- Full 12Gbps SAS connectivity for expansion storage shelves
- Available in two configurations:
  - Ethernet: 4x 25Gb Ethernet (SFP28) ports
  - Fibre Channel: 4x 16Gb FC (SFP+) ports
- 100% 8KB random read: 400k IOPS at 0.4 ms
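As a rough aid to reading the random-read figure quoted for the AFF A400 (400k IOPS at 0.4 ms with 8KB blocks), the following sketch converts it into an implied bandwidth and an implied number of I/Os in flight. It assumes that "8KB" means 8 KiB and that the quoted latency is an average, neither of which the source states. The function names are illustrative only.

```python
# Back-of-the-envelope conversions for a quoted IOPS figure.
KIB = 1024  # assumption: "8KB" means 8 KiB (8,192 bytes)


def throughput_bytes_per_s(iops: int, block_bytes: int) -> int:
    """Bandwidth implied by an IOPS rate at a fixed block size."""
    return iops * block_bytes


def outstanding_ios(iops: int, latency_us: int) -> float:
    """Average I/Os in flight, by Little's law: concurrency = rate x latency."""
    return iops * latency_us / 1_000_000


if __name__ == "__main__":
    bandwidth = throughput_bytes_per_s(400_000, 8 * KIB)
    print(f"Implied read bandwidth: {bandwidth / 1e9:.2f} GB/s")
    print(f"Implied I/Os in flight: {outstanding_ios(400_000, 400):.0f}")
```

Under these assumptions, sustaining the quoted rate requires the hosts to keep many requests outstanding at once, which is why parallel readers matter for this class of workload.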
NetApp AFF A250 features for entry-level AI/ML deployments include the following:
- Maximum effective capacity: 35PB
- Maximum scale-out: 2-24 nodes (12 HA pairs)
- 440k IOPS random reads at 1 ms
- Built on NetApp ONTAP 9.8 or later
- Two 25Gb Ethernet ports for HA and cluster interconnect
NetApp also offers other storage systems, such as the AFF A800 and AFF A700, that provide higher performance and scalability for larger-scale AI/ML deployments.
NetApp ONTAP
ONTAP 9, the latest generation of storage management software from NetApp, enables businesses to modernize infrastructure and transition to a cloud-ready data center. Leveraging industry-leading data management capabilities, ONTAP enables the management and protection of data with a single set of tools, regardless of where that data resides. Data can also be moved freely to wherever it’s needed: the edge, the core, or the cloud. ONTAP 9 includes numerous features that simplify data management, accelerate and protect critical data, and future-proof infrastructure across hybrid cloud architectures.
Simplify data management
Data management is crucial to enterprise IT operations so that appropriate resources are used for applications and datasets. ONTAP includes the following features to streamline and simplify operations and reduce the total cost of operation:
- Inline data compaction and expanded deduplication. Data compaction reduces wasted space inside storage blocks, and deduplication significantly increases effective capacity. This applies to data stored locally and data tiered to the cloud.
- Minimum, maximum, and adaptive quality of service (QoS). Granular QoS controls help maintain performance levels for critical applications in highly shared environments.
- ONTAP FabricPool. This feature automatically tiers cold data to public and private cloud storage options, including Amazon Web Services (AWS), Azure, and NetApp StorageGRID object storage.
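To make the idea of adaptive QoS more concrete, the following toy model shows IOPS limits that grow with volume size instead of being fixed. It is a sketch of the general concept only and is not ONTAP's implementation. The class, field names, and numeric values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptivePolicy:
    """Hypothetical policy whose IOPS allowances scale with volume size."""
    expected_iops_per_tb: int  # target floor per TB of volume size
    peak_iops_per_tb: int      # ceiling per TB of volume size
    absolute_min_iops: int     # floor applied to very small volumes


def iops_limits(policy: AdaptivePolicy, size_tb: float) -> tuple[int, int]:
    """Return (floor, ceiling) IOPS for a volume of the given size."""
    floor = max(policy.absolute_min_iops,
                int(policy.expected_iops_per_tb * size_tb))
    # The ceiling never drops below the floor, even for tiny volumes.
    ceiling = max(floor, int(policy.peak_iops_per_tb * size_tb))
    return floor, ceiling


if __name__ == "__main__":
    # Illustrative values, not taken from any ONTAP default policy.
    policy = AdaptivePolicy(expected_iops_per_tb=1000,
                            peak_iops_per_tb=2000,
                            absolute_min_iops=500)
    for size_tb in (0.25, 1.0, 4.0):
        print(size_tb, iops_limits(policy, size_tb))
```

In this model, a volume that grows automatically receives a larger allowance, whereas a fixed minimum or maximum policy would need to be adjusted by hand.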
Accelerate and protect data
ONTAP delivers superior levels of performance and data protection and extends these capabilities in the following ways:
- Performance and lower latency. ONTAP offers the highest possible throughput at the lowest possible latency.
- Data protection. ONTAP provides built-in data protection capabilities with common management across all platforms.
- NetApp Volume Encryption. ONTAP offers native volume-level encryption with both onboard and external key management support.
Future-proof infrastructure
ONTAP 9 helps meet demanding and constantly changing business needs:
- Seamless scaling and nondisruptive operations. ONTAP supports the nondisruptive addition of capacity to existing controllers as well as to scale-out clusters. Customers can upgrade to the latest technologies, such as NVMe and 32Gb FC, without costly data migrations or outages.
- Cloud connection. ONTAP is the most cloud-connected storage management software, with options for software-defined storage (ONTAP Select) and cloud-native instances (Google Cloud NetApp Volumes) in all public clouds.
- Integration with emerging applications. ONTAP offers enterprise-grade data services for next-generation platforms and applications such as OpenStack, Hadoop, and MongoDB by using the same infrastructure that supports existing enterprise apps.
NetApp FlexGroup volumes
Training datasets typically consist of a very large number of files, potentially billions. These files can include text, audio, video, and other forms of unstructured data that must be stored and then read in parallel during processing. The storage system must therefore store many small files and serve parallel reads for both sequential and random I/O.
A FlexGroup volume (the following figure) is a single namespace made up of multiple constituent member volumes. To storage administrators, it is managed and behaves like a NetApp FlexVol volume. Files in a FlexGroup volume are allocated to individual member volumes and are not striped across volumes or nodes. FlexGroup volumes enable the following capabilities:
- Up to 20 petabytes of capacity and predictable low latency for high-metadata workloads
- Up to 400 billion files in the same namespace
- Parallelized operations in NAS workloads across CPUs, nodes, aggregates, and constituent FlexVol volumes
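The statement that files are allocated whole to individual member volumes, rather than striped, can be illustrated with a small toy model. This sketch is not ONTAP's placement logic, which the text above does not describe. The "most free space" rule, the class names, and the sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MemberVolume:
    """One constituent volume; each file lives entirely inside one member."""
    name: str
    capacity: int
    files: dict = field(default_factory=dict)  # path -> size

    @property
    def free(self) -> int:
        return self.capacity - sum(self.files.values())


class ToyFlexGroup:
    """Single namespace over several members (hypothetical placement rule)."""

    def __init__(self, member_count: int, member_capacity: int):
        self.members = [MemberVolume(f"member_{i}", member_capacity)
                        for i in range(member_count)]
        self.index = {}  # path -> member name, giving one shared namespace

    def write(self, path: str, size: int) -> str:
        # Assumed rule: place the whole file on the member with most free space.
        target = max(self.members, key=lambda m: m.free)
        if size > target.free:
            # No striping: a file must fit inside a single member volume.
            raise OSError("no member volume has room for this file")
        target.files[path] = size
        self.index[path] = target.name
        return target.name

    def locate(self, path: str) -> str:
        return self.index[path]


if __name__ == "__main__":
    group = ToyFlexGroup(member_count=4, member_capacity=100)
    for i in range(8):
        print(f"file_{i}", "->", group.write(f"/data/file_{i}", 10))
```

Because each file maps to exactly one member, many clients reading different files naturally spread their load across members and nodes, while clients still see a single namespace.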
Lenovo ThinkSystem portfolio
Lenovo ThinkSystem servers feature innovative hardware, software, and services that solve customers’ challenges today and deliver an evolutionary, fit-for-purpose, modular design approach to address tomorrow’s challenges. These servers capitalize on best-in-class, industry-standard technologies coupled with differentiated Lenovo innovations to provide the greatest possible flexibility in x86 servers.
Key advantages of deploying Lenovo ThinkSystem servers include the following:
- Highly scalable, modular designs that grow with your business
- Industry-leading resilience to save hours of costly unscheduled downtime
- Fast flash technologies for lower latencies, quicker response times, and smarter data management in real time
In the AI area, Lenovo is taking a practical approach to helping enterprises understand and adopt the benefits of ML and AI for their workloads. Lenovo customers can explore and evaluate Lenovo AI offerings in Lenovo AI Innovation Centers to fully understand the value for their particular use case. To improve time to value, this customer-centric approach gives customers proofs of concept for solution development platforms that are ready to use and optimized for AI.
Lenovo SR670 V2
The Lenovo ThinkSystem SR670 V2 rack server delivers optimal performance for accelerated AI and high-performance computing (HPC). Supporting up to eight GPUs, the SR670 V2 is suited for the computationally intensive workload requirements of ML, DL, and inference.
With the latest scalable Intel Xeon CPUs and support for high-end GPUs (including up to eight NVIDIA A100 80GB PCIe GPUs), the ThinkSystem SR670 V2 delivers optimized, accelerated performance for AI and HPC workloads.
As more workloads take advantage of accelerators, the demand for GPU density has increased. Industries such as retail, financial services, energy, and healthcare are using GPUs to extract greater insights and drive innovation with ML, DL, and inference techniques.
The ThinkSystem SR670 V2 is an optimized, enterprise-grade solution for deploying accelerated HPC and AI workloads in production, maximizing system performance while maintaining data center density for supercomputing clusters with next-generation platforms.
Other features include:
- Support for GPUDirect RDMA, in which high-speed network adapters are directly connected to the GPUs to maximize I/O performance.
- Support for GPUDirect Storage, in which NVMe drives are directly connected to the GPUs to maximize storage performance.
MLPerf
MLPerf is the industry-leading benchmark suite for evaluating AI performance. In this validation, we used its image-classification benchmark with MXNet, one of the most popular AI frameworks. The MXNet_benchmarks training script was used to drive AI training. The script contains implementations of several popular conventional models and is designed to be as fast as possible. It can be run on a single machine or in distributed mode across multiple hosts.
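The text does not describe how the benchmark script partitions data in distributed mode. As a generic illustration of data-parallel training across hosts, the following sketch gives each worker a disjoint slice of the dataset. The function name, the rank and world-size parameters, and the file names are hypothetical and are not taken from MXNet or MLPerf.

```python
def shard_for_worker(samples: list, rank: int, world_size: int) -> list:
    """Return the disjoint subset of samples that one worker reads."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in the range [0, world_size)")
    # Strided slicing keeps shard sizes within one sample of each other.
    return samples[rank::world_size]


if __name__ == "__main__":
    files = [f"img_{i}.jpg" for i in range(10)]  # placeholder file names
    for rank in range(4):
        print(rank, shard_for_worker(files, rank, world_size=4))
```

With this kind of partitioning, every host reads a different set of files at the same time, which is the parallel access pattern the storage system has to serve.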