Lane detection – Distributed training with RUN:AI

Contributors kevin-hoke

This section provides details on setting up the platform for performing lane detection distributed training at scale using the RUN:AI orchestrator. We discuss installation of all the solution elements and running the distributed training job on that platform. ML versioning is achieved by using NetApp Snapshot™ technology linked with RUN:AI experiments to provide data and model reproducibility. ML versioning plays a crucial role in tracking models, sharing work between team members, reproducing results, rolling out new model versions to production, and tracking data provenance. NetApp ML version control (Snapshot) can capture point-in-time versions of the data, trained models, and logs associated with each experiment. It has rich API support, making it easy to integrate with the RUN:AI platform; you just have to trigger an event based on the training state. You can also capture the state of the whole experiment without changing anything in the code or the containers running on top of Kubernetes (K8s).

Finally, this technical report wraps up with performance evaluation on multiple GPU-enabled nodes across AKS.

Distributed training for lane detection use case using the TuSimple dataset

In this technical report, distributed training is performed on the TuSimple dataset for lane detection. Horovod is used in the training code to conduct data-parallel distributed training on multiple GPU nodes simultaneously in the Kubernetes cluster through AKS. Code is packaged as container images for TuSimple data download and processing. Processed data is stored on persistent volumes allocated by the NetApp Trident plug-in. For the training, one more container image is created, and it uses the data stored on the persistent volumes created while downloading the data.

To submit the data and training jobs, use RUN:AI to orchestrate resource allocation and management. RUN:AI allows you to perform Message Passing Interface (MPI) operations, which are needed for Horovod. This layout allows multiple GPU nodes to communicate with each other to update the training weights after every training mini-batch. It also enables monitoring of training through the UI and CLI, making it easy to track the progress of experiments.

NetApp Snapshot is integrated within the training code and captures the state of data and the trained model for every experiment. This capability enables you to track the version of data and code used, and the associated trained model generated.

AKS setup and installation

For setup and installation of the AKS cluster, go to Create an AKS Cluster. Then, follow this series of steps:

  1. When selecting the type of nodes (whether it be system (CPU) or worker (GPU) nodes), select the following:

    1. Add primary system node named agentpool at the Standard_DS2_v2 size. Use the default three nodes.

    2. Add worker node gpupool with the Standard_NC6s_v3 pool size. Use three nodes minimum for GPU nodes.

      Figure showing input/output dialog or representing written content

      Note Deployment takes 5–10 minutes.
  2. After deployment is complete, click Connect to Cluster. To connect to the newly created AKS cluster, install the Kubernetes command-line tool (kubectl) in your local environment (laptop/PC). Visit Install Tools to install it for your OS.

  3. Install Azure CLI on your local environment.

  4. To access the AKS cluster from the terminal, first enter az login and put in the credentials.

  5. Run the following two commands:

    az account set --subscription xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx
    az aks get-credentials --resource-group resourcegroup --name aksclustername
  6. Enter this command in the terminal:

    kubectl get nodes
    Note If all six nodes are up and running as seen here, your AKS cluster is ready and connected to your local environment.

    Figure showing input/output dialog or representing written content
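The readiness check in step 6 can also be scripted. The following is a minimal sketch, assuming the default tabular output of kubectl get nodes (a header row, with the node status in the second column). The function name count_ready_nodes is hypothetical and is not part of any tool used in this report.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes` output on stdin; assumes the default table layout
# (header row first, STATUS in column 2).
count_ready_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Example (not run here): kubectl get nodes | count_ready_nodes
```

With the three system nodes and three GPU nodes selected in step 1, the expected count is 6.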

Create a delegated subnet for Azure NetApp Files

To create a delegated subnet for Azure NetApp Files, follow this series of steps:

  1. Navigate to Virtual networks within the Azure portal. Find your newly created virtual network. It should have a prefix such as aks-vnet, as seen here. Click the name of the virtual network.

    Figure showing input/output dialog or representing written content

  2. Click Subnets and select +Subnet from the top toolbar.

    Figure showing input/output dialog or representing written content

  3. Provide the subnet with a name such as ANF.sn and under the Subnet Delegation heading, select Microsoft.NetApp/volumes. Do not change anything else. Click OK.

    Figure showing input/output dialog or representing written content
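The portal steps above can also be sketched with the Azure CLI. This is an illustrative wrapper, not part of the original procedure: the function name create_anf_subnet and the address prefix argument are assumptions, and you must pick an unused address range inside your AKS virtual network.

```shell
# Create a subnet delegated to Azure NetApp Files.
# Arguments: resource group, VNet name, subnet name, address prefix (CIDR).
create_anf_subnet() {
  az network vnet subnet create \
    --resource-group "$1" \
    --vnet-name "$2" \
    --name "$3" \
    --address-prefixes "$4" \
    --delegations "Microsoft.NetApp/volumes"
}

# Example with placeholder values (not run here):
# create_anf_subnet <aks-vnet-resource-group> <aks-vnet-name> ANF.sn <unused-cidr>
```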

Azure NetApp Files volumes are allocated to the application cluster and are consumed as persistent volume claims (PVCs) in Kubernetes. In turn, this allocation provides the flexibility to map volumes to different services, such as Jupyter notebooks, serverless functions, and so on.

Users of services can consume storage from the platform in many ways. The main benefits of Azure NetApp Files are:

  • Provides users with the ability to use snapshots.

  • Enables users to store large quantities of data on Azure NetApp Files volumes.

  • Enables users to obtain the performance benefits of Azure NetApp Files volumes when running their models on large sets of files.

Azure NetApp Files setup

To complete the setup of Azure NetApp Files, you must first configure it as described in Quickstart: Set up Azure NetApp Files and create an NFS volume.

However, you may omit the steps to create an NFS volume for Azure NetApp Files as you will create volumes through Trident. Before continuing, be sure that you have:

Peering of AKS virtual network and Azure NetApp Files virtual network

Next, peer the AKS virtual network (VNet) with the Azure NetApp Files VNet by following these steps:

  1. In the search box at the top of the Azure portal, type virtual networks.

  2. Click VNet aks-vnet-name, then enter Peerings in the search field.

  3. Click +Add and enter the information provided in the table below:

    Field                   Value or description
    Peering link name       aks-vnet-name_to_anf
    SubscriptionID          Subscription of the Azure NetApp Files VNet to which you’re peering
    VNet peering partner    Azure NetApp Files VNet

    Note Leave all the non-asterisk sections on default.
  4. Click ADD or OK to add the peering to the virtual network.
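The peering can also be sketched with the Azure CLI instead of the portal. The wrapper below is an assumption-laden illustration: the function name create_vnet_peering is hypothetical, and the arguments are placeholders for your own resource group, VNet names, and the remote VNet resource ID. With the CLI, a peering link is created in one direction only, so a second call in the reverse direction is normally needed.

```shell
# Create a one-directional VNet peering link.
# Arguments: resource group of the local VNet, local VNet name,
# peering link name, name or resource ID of the remote VNet.
create_vnet_peering() {
  az network vnet peering create \
    --resource-group "$1" \
    --vnet-name "$2" \
    --name "$3" \
    --remote-vnet "$4" \
    --allow-vnet-access
}

# Example with placeholder values (not run here):
# create_vnet_peering <aks-rg> <aks-vnet-name> aks-vnet-name_to_anf <anf-vnet-resource-id>
```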

Trident

Trident is an open-source project that NetApp maintains for application container persistent storage. Trident has been implemented as an external provisioner controller that runs as a pod itself, monitoring volumes and completely automating the provisioning process.

NetApp Trident enables smooth integration with K8s by creating and attaching persistent volumes for storing training datasets and trained models. This capability makes it easier for data scientists and data engineers to use K8s without the hassle of manually storing and managing datasets. Trident also eliminates the need for data scientists to learn to manage new data platforms, because it integrates data management tasks through its API.

Install Trident

To install Trident software, complete the following steps:

  1. Install Helm.

  2. Download and extract the Trident 21.01.1 installer.

    wget https://github.com/NetApp/trident/releases/download/v21.01.1/trident-installer-21.01.1.tar.gz
    tar -xf trident-installer-21.01.1.tar.gz
  3. Change the directory to trident-installer.

    cd trident-installer
  4. Copy tridentctl to a directory in your system $PATH.

    cp ./tridentctl /usr/local/bin
  5. Install Trident on K8s cluster with Helm:

    1. Change directory to helm directory.

      cd helm
    2. Install Trident.

      helm install trident trident-operator-21.01.1.tgz --namespace trident --create-namespace
    3. Check the status of Trident pods the usual K8s way:

      kubectl -n trident get pods
    4. If all the pods are up and running, Trident is installed and you are good to move forward.

Set up Azure NetApp Files back-end and storage class

To set up Azure NetApp Files back-end and storage class, complete the following steps:

  1. Switch back to the home directory.

    cd ~
  2. Clone the project repository lane-detection-SCNN-horovod.

  3. Go to the trident-config directory.

    cd ./lane-detection-SCNN-horovod/trident-config
  4. Create an Azure service principal (the service principal is how Trident communicates with Azure to access your Azure NetApp Files resources).

    az ad sp create-for-rbac --name

    The output should look like the following example:

    {
      "appId": "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "displayName": "netapptrident",
      "name": "http://netapptrident",
      "password": "xxxxxxxxxxxxxxx.xxxxxxxxxxxxxx",
      "tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx"
    }
  5. Create the Trident backend JSON file.

  6. Using your preferred text editor, complete the following fields from the table below inside the anf-backend.json file.

    Field           Value
    subscriptionID  Your Azure Subscription ID
    tenantID        Your Azure Tenant ID (from the output of az ad sp in the previous step)
    clientID        Your appID (from the output of az ad sp in the previous step)
    clientSecret    Your password (from the output of az ad sp in the previous step)

    The file should look like the following example:

    {
        "version": 1,
        "storageDriverName": "azure-netapp-files",
        "subscriptionID": "fakec765-4774-fake-ae98-a721add4fake",
        "tenantID": "fakef836-edc1-fake-bff9-b2d865eefake",
        "clientID": "fake0f63-bf8e-fake-8076-8de91e57fake",
        "clientSecret": "SECRET",
        "location": "westeurope",
        "serviceLevel": "Standard",
        "virtualNetwork": "anf-vnet",
        "subnet": "default",
        "nfsMountOptions": "vers=3,proto=tcp",
        "limitVolumeSize": "500Gi",
        "defaults": {
            "exportRule": "0.0.0.0/0",
            "size": "200Gi"
        }
    }
  7. Instruct Trident to create the Azure NetApp Files back end in the trident namespace, using anf-backend.json as the configuration file as follows:

    tridentctl create backend -f anf-backend.json -n trident
  8. Create the storage class:

    1. K8s users provision volumes by using PVCs that specify a storage class by name. Instruct K8s to create a storage class azurenetappfiles that references the Azure NetApp Files back end created in the previous step by using the following command:

      kubectl create -f anf-storage-class.yaml
    2. Check that storage class is created by using the following command:

      kubectl get sc azurenetappfiles

      The output should look like the following example:

      Figure showing input/output dialog or representing written content
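The contents of anf-storage-class.yaml are not shown in this section. As a rough sketch of what a Trident storage class for this back end might contain, the snippet below writes an illustrative manifest to a separate file (anf-storage-class-sketch.yaml, a hypothetical name chosen so that the repository file is not overwritten). The real file in the lane-detection-SCNN-horovod repository may differ.

```shell
# Write an illustrative storage class manifest (not the repository's file).
cat > anf-storage-class-sketch.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurenetappfiles               # name referenced by PVCs and by runai --pvc
provisioner: csi.trident.netapp.io     # Trident CSI provisioner
parameters:
  backendType: "azure-netapp-files"    # matches storageDriverName in anf-backend.json
EOF
```

The backendType parameter is what ties the storage class to the back end created with tridentctl in step 7.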

Deploy and set up volume snapshot components on AKS

If your cluster does not come pre-installed with the correct volume snapshot components, you may manually install these components by running the following steps:

Note AKS 1.18.14 does not have a pre-installed Snapshot Controller.
  1. Install Snapshot Beta CRDs by using the following commands:

    kubectl create -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
    kubectl create -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
    kubectl create -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
  2. Install Snapshot Controller by using the following documents from GitHub:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
  3. Set up the K8s volumesnapshotclass. Before you create a volume snapshot, a volume snapshot class must be set up. Create a volume snapshot class for Azure NetApp Files, and use it to achieve ML versioning by using NetApp Snapshot technology. Create the volumesnapshotclass netapp-csi-snapclass and set it as the default volumesnapshotclass as follows:

    kubectl create -f netapp-volume-snapshot-class.yaml

    The output should look like the following example:

    Figure showing input/output dialog or representing written content

  4. Check that the volume snapshot class was created by using the following command:

    kubectl get volumesnapshotclass

    The output should look like the following example:

    Figure showing input/output dialog or representing written content
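The netapp-volume-snapshot-class.yaml file used in step 3 is not reproduced in this section. The following sketch shows what such a manifest might look like for the v1beta1 snapshot CRDs installed from release-3.0 in step 1. It is written to a separate, hypothetically named file, and the deletionPolicy value is an assumption; the repository's file may differ.

```shell
# Write an illustrative volume snapshot class manifest (not the repository's file).
cat > netapp-volume-snapshot-class-sketch.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: netapp-csi-snapclass
  annotations:
    # Marks this class as the default volume snapshot class.
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: csi.trident.netapp.io    # Trident CSI driver
deletionPolicy: Delete           # assumed; Retain keeps the back-end snapshot
EOF
```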

RUN:AI installation

To install RUN:AI, complete the following steps:

  1. Install RUN:AI cluster on AKS.

  2. Go to app.runai.ai, create a new project, and name it lane-detection. RUN:AI creates a namespace on the K8s cluster whose name starts with runai- followed by the project name. In this case, the namespace created is runai-lane-detection.

    Figure showing input/output dialog or representing written content

  3. Install RUN:AI CLI.

  4. On your terminal, set lane-detection as the default RUN:AI project by using the following command:

    runai config project lane-detection

    The output should look like the following example:

    Figure showing input/output dialog or representing written content

  5. Create ClusterRole and ClusterRoleBinding for the project namespace (for example, lane-detection) so the default service account belonging to runai-lane-detection namespace has permission to perform volumesnapshot operations during job execution:

    1. List namespaces to check that runai-lane-detection exists by using this command:

      kubectl get namespaces

      The output should appear like the following example:

      Figure showing input/output dialog or representing written content

    2. Create ClusterRole netappsnapshot and ClusterRoleBinding netappsnapshot using the following commands:

      kubectl create -f runai-project-snap-role.yaml
      kubectl create -f runai-project-snap-role-binding.yaml
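The two role files are not shown in this section. As a hedged illustration of the permissions described above, the sketch below writes a ClusterRole and ClusterRoleBinding that let the default service account in runai-lane-detection work with volume snapshots. The file name, the rule list, and the verbs are assumptions; the repository's manifests may grant a different set.

```shell
# Write an illustrative RBAC manifest (not the repository's files).
cat > runai-project-snap-rbac-sketch.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: netappsnapshot
rules:
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots", "volumesnapshotcontents", "volumesnapshotclasses"]
    verbs: ["get", "list", "watch", "create", "delete"]   # assumed verb set
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: netappsnapshot
subjects:
  - kind: ServiceAccount
    name: default                      # service account used by the RUN:AI job pods
    namespace: runai-lane-detection
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netappsnapshot
EOF
```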

Download and process the TuSimple dataset as RUN:AI job

Downloading and processing the TuSimple dataset as a RUN:AI job is optional. It involves the following steps:

  1. Build and push the docker image, or omit this step if you want to use an existing docker image (for example, muneer7589/download-tusimple:1.0)

    1. Switch to the home directory:

      cd ~
    2. Go to the data directory of the project lane-detection-SCNN-horovod:

      cd ./lane-detection-SCNN-horovod/data
    3. Modify build_image.sh shell script and change docker repository to yours. For example, replace muneer7589 with your docker repository name. You could also change the docker image name and TAG (such as download-tusimple and 1.0):

      Figure showing input/output dialog or representing written content

    4. Run the script to build the docker image and push it to the docker repository using these commands:

      chmod +x build_image.sh
      ./build_image.sh
  2. Submit the RUN:AI job to download, extract, pre-process, and store the TuSimple lane detection dataset in a PVC, which is dynamically created by NetApp Trident:

    1. Use the following command to submit the RUN:AI job:

      runai submit \
        --name download-tusimple-data \
        --pvc azurenetappfiles:100Gi:/mnt \
        --image muneer7589/download-tusimple:1.0
    2. Enter the information from the table below to submit the RUN:AI job:

      Field     Value or description
      --name    Name of the job
      --pvc     PVC of the format [StorageClassName]:Size:ContainerMountPath. In the above job submission, you are creating a PVC on demand using Trident with storage class azurenetappfiles. Persistent volume capacity here is 100Gi, and it’s mounted at path /mnt.
      --image   Docker image to use when creating the container for this job

      The output should look like the following example:

      Figure showing input/output dialog or representing written content

    3. List the submitted RUN:AI jobs.

      runai list jobs

      Figure showing input/output dialog or representing written content

    4. Check the submitted job logs.

      runai logs download-tusimple-data -t 10

      Figure showing input/output dialog or representing written content

    5. List the PVC created. Use this PVC for training in the next step.

      kubectl get pvc | grep download-tusimple-data

      The output should look like the following example:

      Figure showing input/output dialog or representing written content

    6. Check the job in the RUN:AI UI (or app.runai.ai).

      Figure showing input/output dialog or representing written content
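To make the three-part --pvc value from the job submission above concrete, the following sketch splits it into its fields. The function name parse_pvc_spec is hypothetical, and the sketch covers only the StorageClassName:Size:ContainerMountPath form described in the table.

```shell
# Split a "StorageClassName:Size:ContainerMountPath" value into its parts.
parse_pvc_spec() {
  printf '%s\n' "$1" | {
    # IFS=: makes read split the line on colons into three fields.
    IFS=: read -r storage_class size mount_path
    printf 'storage class=%s size=%s mount path=%s\n' "$storage_class" "$size" "$mount_path"
  }
}

parse_pvc_spec "azurenetappfiles:100Gi:/mnt"
```

For the value used in this report, the fields are the storage class azurenetappfiles, a capacity of 100Gi, and the mount path /mnt.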

Perform distributed lane detection training using Horovod

Performing distributed lane detection training using Horovod is optional. It involves the following steps:

  1. Build and push the docker image, or skip this step if you want to use the existing docker image (for example, muneer7589/dist-lane-detection:3.1):

    1. Switch to home directory.

      cd ~
    2. Go to the project directory lane-detection-SCNN-horovod.

      cd ./lane-detection-SCNN-horovod
    3. Modify the build_image.sh shell script and change docker repository to yours (for example, replace muneer7589 with your docker repository name). You could also change the docker image name and TAG (dist-lane-detection and 3.1, for example).

      Figure showing input/output dialog or representing written content

    4. Run the script to build the docker image and push to the docker repository.

      chmod +x build_image.sh
      ./build_image.sh
  2. Submit the RUN:AI job for carrying out distributed training (MPI):

    1. The PVC that runai submit created automatically in the previous section (for downloading data) has ReadWriteOnce (RWO) access, which does not allow multiple pods or nodes to access the same PVC for distributed training. Use kubectl patch to update the access mode to ReadWriteMany.

    2. First, get the volume name of the PVC by running the following command:

      kubectl get pvc | grep download-tusimple-data

      Figure showing input/output dialog or representing written content

    3. Patch the volume and update access mode to ReadWriteMany (replace volume name with yours in the following command):

      kubectl patch pv pvc-bb03b74d-2c17-40c4-a445-79f3de8d16d5 -p '{"spec":{"accessModes":["ReadWriteMany"]}}'
    4. Submit the RUN:AI MPI job for executing the distributed training job using information from the table below:

      runai submit-mpi \
        --name dist-lane-detection-training \
        --large-shm \
        --processes=3 \
        --gpu 1 \
        --pvc pvc-download-tusimple-data-0:/mnt \
        --image muneer7589/dist-lane-detection:3.1 \
        -e USE_WORKERS="true" \
        -e NUM_WORKERS=4 \
        -e BATCH_SIZE=33 \
        -e USE_VAL="false" \
        -e VAL_BATCH_SIZE=99 \
        -e ENABLE_SNAPSHOT="true" \
        -e PVC_NAME="pvc-download-tusimple-data-0"

      Field            Value or description
      --name           Name of the distributed training job
      --large-shm      Mount a large /dev/shm device. It is a shared file system mounted on RAM and provides large enough shared memory for multiple CPU workers to process and load batches into CPU RAM.
      --processes      Number of distributed training processes
      --gpu            Number of GPUs to allocate per process. In this job, there are three GPU worker processes (--processes=3), each allocated a single GPU (--gpu 1).
      --pvc            Use the existing persistent volume claim (pvc-download-tusimple-data-0) created by the previous job (download-tusimple-data), mounted at path /mnt
      --image          Docker image to use when creating the container for this job
      -e               Define environment variables to be set in the container
      USE_WORKERS      Setting the argument to true turns on multi-process data loading
      NUM_WORKERS      Number of data loader worker processes
      BATCH_SIZE       Training batch size
      USE_VAL          Setting the argument to true allows validation
      VAL_BATCH_SIZE   Validation batch size
      ENABLE_SNAPSHOT  Setting the argument to true enables taking data and trained model snapshots for ML versioning purposes
      PVC_NAME         Name of the PVC to take a snapshot of. In the above job submission, you are taking a snapshot of pvc-download-tusimple-data-0, consisting of the dataset and trained models.

      The output should look like the following example:

      Figure showing input/output dialog or representing written content

    5. List the submitted job.

      runai list jobs

      Figure showing input/output dialog or representing written content

    6. Submitted job logs:

      runai logs dist-lane-detection-training

      Figure showing input/output dialog or representing written content

    7. Check the training job in the RUN:AI GUI (or app.runai.ai) on the RUN:AI Dashboard, as seen in the figures below. The first figure shows three GPUs allocated for the distributed training job, spread across three nodes on AKS, and the second shows the RUN:AI jobs:

      Figure showing input/output dialog or representing written content

      Figure showing input/output dialog or representing written content

    8. After the training is finished, check the NetApp Snapshot copy that was created and linked with the RUN:AI job.

      runai logs dist-lane-detection-training --tail 1

      Figure showing input/output dialog or representing written content

      kubectl get volumesnapshots | grep download-tusimple-data-0
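Substeps 2 and 3 above (finding the volume behind the PVC, then patching its access mode) can be combined into one helper. This is a minimal sketch under stated assumptions: the function name patch_pv_rwx is hypothetical, and the PVC is assumed to live in the runai-lane-detection namespace.

```shell
# Look up the persistent volume bound to a PVC and set its access mode
# to ReadWriteMany. Arguments: PVC name, namespace.
patch_pv_rwx() {
  pv_name=$(kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}') || return 1
  if [ -z "$pv_name" ]; then
    echo "PVC $1 is not bound to a volume" >&2
    return 1
  fi
  kubectl patch pv "$pv_name" -p '{"spec":{"accessModes":["ReadWriteMany"]}}'
}

# Example (not run here):
# patch_pv_rwx pvc-download-tusimple-data-0 runai-lane-detection
```

Reading the volume name with jsonpath avoids copying it by hand from the kubectl get pvc output.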

Restore data from the NetApp Snapshot copy

To restore data from the NetApp Snapshot copy, complete the following steps:

  1. Switch to home directory.

    cd ~
  2. Go to the project directory lane-detection-SCNN-horovod.

    cd ./lane-detection-SCNN-horovod
  3. Modify restore-snapshot-pvc.yaml and update the dataSource name field to the Snapshot copy from which you want to restore data. You can also change the name of the PVC to which the data is restored; in this example, it is restored-tusimple.

    Figure showing input/output dialog or representing written content

  4. Create a new PVC by using restore-snapshot-pvc.yaml.

    kubectl create -f restore-snapshot-pvc.yaml

    The output should look like the following example:

    Figure showing input/output dialog or representing written content

  5. To use the restored data for training, submit the job as before, but replace the PVC name in --pvc and in PVC_NAME with the name of the restored PVC, as seen in the following command:

    runai submit-mpi \
      --name dist-lane-detection-training \
      --large-shm \
      --processes=3 \
      --gpu 1 \
      --pvc restored-tusimple:/mnt \
      --image muneer7589/dist-lane-detection:3.1 \
      -e USE_WORKERS="true" \
      -e NUM_WORKERS=4 \
      -e BATCH_SIZE=33 \
      -e USE_VAL="false" \
      -e VAL_BATCH_SIZE=99 \
      -e ENABLE_SNAPSHOT="true" \
      -e PVC_NAME="restored-tusimple"
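The restore manifest edited in step 3 is shown only as a figure. As a sketch of the general shape of a PVC restored from a volume snapshot, the snippet below writes an illustrative manifest to a separate, hypothetically named file. The snapshot name is a placeholder you must replace, and the access mode and size are assumptions based on the values used earlier in this section.

```shell
# Write an illustrative restore manifest (not the repository's file).
cat > restore-snapshot-pvc-sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-tusimple
spec:
  storageClassName: azurenetappfiles
  dataSource:
    name: "REPLACE_WITH_SNAPSHOT_NAME"   # from: kubectl get volumesnapshots
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteMany                      # assumed, so several training pods can mount it
  resources:
    requests:
      storage: 100Gi                     # assumed to match the source PVC
EOF
```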

Performance evaluation

To show the near-linear scalability of the solution, performance tests were run for two scenarios: one GPU and three GPUs. GPU allocation, GPU and memory utilization, and other single-node and three-node metrics were captured during training on the TuSimple lane detection dataset. The data was increased fivefold solely to analyze resource utilization during the training processes.

The solution enables customers to start with a small dataset and a few GPUs. When the amount of data and the demand for GPUs increase, customers can dynamically scale out the terabytes in the Standard Tier and quickly scale up to the Premium Tier to get four times the throughput per terabyte without moving any data. This process is further explained in the section, Azure NetApp Files service levels.

Processing time on one GPU was 12 hours and 45 minutes. Processing time on three GPUs across three nodes was approximately 4 hours and 30 minutes.
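The two processing times translate into a speedup and a parallel efficiency figure. The short calculation below is a sketch that treats the approximate three-GPU time as exactly 4 hours and 30 minutes; the function name scaling_summary is hypothetical.

```shell
# Speedup and parallel efficiency from two wall-clock times.
# Arguments: single-GPU minutes, multi-GPU minutes, GPU count.
scaling_summary() {
  awk -v t1="$1" -v tn="$2" -v n="$3" \
    'BEGIN { printf "speedup=%.2fx efficiency=%.0f%%\n", t1 / tn, 100 * t1 / (tn * n) }'
}

# 12 h 45 min on one GPU versus about 4 h 30 min on three GPUs.
scaling_summary $((12 * 60 + 45)) $((4 * 60 + 30)) 3
```

This prints speedup=2.83x efficiency=94%, meaning three GPUs delivered roughly 2.8 times the single-GPU throughput.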

The figures shown throughout the remainder of this document illustrate examples of performance and scalability based on individual business needs.

The figure below illustrates 1 GPU allocation and memory utilization.

Figure showing input/output dialog or representing written content

The figure below illustrates single node GPU utilization.

Figure showing input/output dialog or representing written content

The figure below illustrates single node memory size (16GB).

Figure showing input/output dialog or representing written content

The figure below illustrates single node GPU count (1).

Figure showing input/output dialog or representing written content

The figure below illustrates single node GPU allocation (%).

Figure showing input/output dialog or representing written content

The figure below illustrates three GPUs across three nodes – GPUs allocation and memory.

Figure showing input/output dialog or representing written content

The figure below illustrates three GPUs across three nodes utilization (%).

Figure showing input/output dialog or representing written content

The figure below illustrates three GPUs across three nodes memory utilization (%).

Figure showing input/output dialog or representing written content

Azure NetApp Files service levels

You can change the service level of an existing volume by moving the volume to another capacity pool that uses the service level you want for the volume. Changing the service level in this way does not require you to migrate data, and it does not affect access to the volume.

Dynamically change the service level of a volume

To change the service level of a volume, use the following steps:

  1. On the Volumes page, right-click the volume whose service level you want to change. Select Change Pool.

    Figure showing input/output dialog or representing written content

  2. In the Change Pool window, select the capacity pool you want to move the volume to. Then, click OK.

    Figure showing input/output dialog or representing written content

Automate service level change

Dynamic Service Level change is currently in Public Preview and is not enabled by default. To enable this feature on the Azure subscription, follow the steps provided in the document “Dynamically change the service level of a volume.”

  • You can also use the following command for the Azure CLI. For more information about changing the pool of an Azure NetApp Files volume, visit az netappfiles volume: Manage Azure NetApp Files (ANF) volume resources.

    az netappfiles volume pool-change -g mygroup \
      --account-name myaccname \
      --pool-name mypoolname \
      --name myvolname \
      --new-pool-resource-id mynewresourceid
  • The Set-AzNetAppFilesVolumePool cmdlet shown here can change the pool of an Azure NetApp Files volume. For more information about changing the volume pool with Azure PowerShell, visit Change pool for an Azure NetApp Files volume.

    Set-AzNetAppFilesVolumePool `
      -ResourceGroupName "MyRG" `
      -AccountName "MyAnfAccount" `
      -PoolName "MyAnfPool" `
      -Name "MyAnfVolume" `
      -NewPoolResourceId 7d6e4069-6c78-6c61-7bf6-c60968e45fbf
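Both commands above take the resource ID of the destination capacity pool. The helper below is a sketch that assembles such an ID from its parts, assuming the standard Azure Resource Manager path layout for NetApp capacity pools. The function name anf_pool_resource_id and all argument values are placeholders.

```shell
# Build the resource ID of an Azure NetApp Files capacity pool.
# Arguments: subscription ID, resource group, NetApp account name, pool name.
anf_pool_resource_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.NetApp/netAppAccounts/%s/capacityPools/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Example with placeholder values:
anf_pool_resource_id "<subscription-id>" "mygroup" "myaccname" "mypremiumpool"
```

The resulting string can be passed to --new-pool-resource-id in the Azure CLI command.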