TR-4904: Distributed Training in Azure - Click-through Rate Prediction

Contributors Download PDF of this page

Rick Huang, Verron Martina, Muneer Ahmad, NetApp

The work of a data scientist should be focused on the training and tuning of machine learning (ML) and artificial intelligence (AI) models. However, according to research by Google, data scientists spend approximately 80% of their time figuring out how to make their models work with enterprise applications and run at scale.

To manage end-to-end AI/ML projects, a wider understanding of enterprise components is needed. Although DevOps have taken over the definition, integration, and deployment, these types of components, ML operations target a similar flow that includes AI/ML projects. To get an idea of what an end-to-end AI/ML pipeline touches in the enterprise, see the following list of required components:

  • Storage

  • Networking

  • Databases

  • File systems

  • Containers

  • Continuous integration and continuous deployment (CI/CD) pipeline

  • Integrated development environment (IDE)

  • Security

  • Data access policies

  • Hardware

  • Cloud

  • Virtualization

  • Data science toolsets and libraries

Target audience

The world of data science touches multiple disciplines in IT and business:

  • The data scientist needs the flexibility to use their tools and libraries of choice.

  • The data engineer needs to know how the data flows and where it resides.

  • A DevOps engineer needs the tools to integrate new AI/ML applications into their CI/CD pipelines.

  • Cloud administrators and architects need to be able to set up and manage Azure resources.

  • Business users want to have access to AI/ML applications.

In this technical report, we describe how Azure NetApp Files, RAPIDS AI, Dask, and Azure help each of these roles bring value to business.

Solution overview

This solution follows the lifecycle of an AI/ML application. We start with the work of data scientists to define the different steps needed to prepare data and train models. By leveraging RAPIDS on Dask, we perform distributed training across the Azure Kubernetes Service (AKS) cluster to drastically reduce the training time when compared to the conventional Python scikit-learn approach. To complete the full cycle, we integrate the pipeline with Azure NetApp Files.

Azure NetApp Files provides various performance tiers. Customers can start with a Standard tier and scale out and scale up to a high-performance tier nondisruptively without moving any data. This capability enables data scientists to train models at scale without any performance issues, avoiding any data silos across the cluster, as shown in figure below.

Error: Missing Graphic Image