Achieving High Cluster Utilization
In this section, we emulate a realistic scenario in which four data science teams each submit their own workloads, demonstrating how the Run:AI orchestration solution achieves high cluster utilization while maintaining prioritization and balancing GPU resources. We start with the ResNet-50 benchmark described in the section ResNet-50 with ImageNet Dataset Benchmark Summary:
$ runai submit netapp1 -i netapp/tensorflow-tf1-py3:20.01.0 --local-image --large-shm -v /mnt:/mnt -v /tmp:/tmp --command python --args "/netapp/scripts/run.py" --args "--dataset_dir=/mnt/mount_0/dataset/imagenet/imagenet_original/" --args "--num_mounts=2" --args "--dgx_version=dgx1" --args "--num_devices=1" -g 1
We ran the same ResNet-50 benchmark as in NVA-1121. We used the --local-image flag for containers not residing in the public Docker repository. We mounted the directories /mnt and /tmp on the host DGX-1 node to /mnt and /tmp in the container, respectively. The dataset resides on the NetApp AFF A800 system, and the dataset_dir argument points to its directory. Both --num_devices=1 and -g 1 indicate that we allocate one GPU for this job; the former is an argument for the run.py script, while the latter is a flag for the runai submit command.
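To emulate the four-team scenario, each team submits its workloads under its own Run:AI project so that the scheduler can enforce per-team quotas and priorities. As a minimal sketch, assuming hypothetical project names such as team-a through team-d (created beforehand by the cluster administrator), the same benchmark can be directed to a specific project with the -p flag:
$ runai submit netapp1 -p team-a -i netapp/tensorflow-tf1-py3:20.01.0 --local-image --large-shm -v /mnt:/mnt -v /tmp:/tmp --command python --args "/netapp/scripts/run.py" --args "--dataset_dir=/mnt/mount_0/dataset/imagenet/imagenet_original/" --args "--num_mounts=2" --args "--dgx_version=dgx1" --args "--num_devices=1" -g 1
Analogous submissions for team-b, team-c, and team-d, each with a distinct job name, would occupy the remaining GPUs in the cluster.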
The following figure shows a system overview dashboard with 97% GPU utilization and all sixteen available GPUs allocated. The GPUs/Project bar chart shows how many GPUs are allocated to each team. The Running Jobs pane lists each running job's name, project, user, type, node, GPUs consumed, run time, progress, and utilization. The Pending Jobs pane lists the workloads waiting in the queue along with their wait times. Finally, the Nodes box reports the number of GPUs and their utilization for each individual DGX-1 node in the cluster.
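The same information can also be queried from the command line. The following is a rough sketch, assuming the hypothetical project and job names from the example above; exact subcommands and output columns may vary with the Run:AI CLI version:
$ runai list jobs -p team-a
$ runai describe job netapp1 -p team-a
The first command lists the jobs in a project together with their status and GPU allocation, and the second shows detailed scheduling and utilization information for a single job.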