GPU setup

04/28/2023 Contributors

Previous: Output for execution of GATK using the Cromwell engine.

At the time of publication, the GATK tool does not have native support for GPU-based execution on premises. The following setup and guidance is provided to enable the readers understand how simple it is to use FlexPod with a rear-mounted NVIDIA Tesla P6 GPU using a PCIe mezzanine card for GATK.

We used the following Cisco-Validated Design (CVD) as the reference architecture and best-practice guide to set up the FlexPod environment so that we can run applications that use GPUs.

FlexPod Datacenter for AI/ML with Cisco UCS 480 ML for Deep Learning

Here is a set of key takeaways during this setup:

We used a PCIe NVIDIA Tesla P6 GPU in a mezzanine slot in the UCS B200 M5 servers.
For this setup, we registered on the NVIDIA partner portal and obtained an evaluation license (also known as an entitlement) to be able to use the GPUs in compute mode.
We downloaded the NVIDIA vGPU software required from the NVIDIA partner website.
We downloaded the entitlement *.bin file from the NVIDIA partner website.
We installed an NVIDIA vGPU license server and added the entitlements to the license server using the *.bin file downloaded from the NVIDIA partner site.
Make sure to choose the correct NVIDIA vGPU software version for your deployment on the NVIDIA partner portal. For this setup we used driver version 460.73.02.

This command installs the NVIDIA vGPU Manager in ESXi.

[root@localhost:~] esxcli software vib install -v /vmfs/volumes/infra_datastore_nfs/nvidia/vib/NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992.vib
Installation Result
Message: Operation finished successfully.
Reboot Required: false
VIBs Installed: NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992
VIBs Removed:
VIBs Skipped:

After rebooting the ESXi server, run the following command to validate the installation and check the health of the GPUs.

[root@localhost:~] nvidia-smi
Wed Aug 18 21:37:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.02    Driver Version: 460.73.02    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P6            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P8     9W /  90W |  15208MiB / 15359MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2812553    C+G   RHEL01                          15168MiB |
+-----------------------------------------------------------------------------+
[root@localhost:~]

Using vCenter, configure the graphics device settings to “Shared Direct.”
Make sure that secure boot is disabled for the RedHat VM.
Make sure that the VM Boot Options firmware is set to EFI ( ref).
Make sure that the following PARAMS are added to the VM Options advanced Edit Configuration. The value of the pciPassthru.64bitMMIOSizeGB parameter depends on the GPU memory and number of GPUs assigned to the VM. For example:
1. If a VM is assigned 4 x 32GB V100 GPUs, then this value should be 128.
2. If a VM is assigned 4 x 16GB P6 GPUs, then this value should be 64.
When adding vGPUs as a new PCI Device to the virtual machine in vCenter, make sure to select NVIDIA GRID vGPU as the PCI Device type.
Choose the correct GPU profile that suites the GPU being used, the GPU memory, and the usage purpose: for example, graphics versus compute.
On the RedHat Linux VM, NVIDIA drivers can be installed by running the following command:
```
[root@genomics1 genomics]#sh NVIDIA-Linux-x86_64-460.73.01-grid.run
```

Verify that the correct vGPU profile is being reported by running the following command:

[root@genomics1 genomics]# nvidia-smi –query-gpu=gpu_name –format=csv,noheader –id=0 | sed -e ‘s/ /-/g’
GRID-P6-16C
[root@genomics1 genomics]#

After reboot, verify that the correct NVIDIA vGPU are reported along with the driver versions.

[root@genomics1 genomics]# nvidia-smi
Wed Aug 18 20:30:56 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID P6-16C         On   | 00000000:02:02.0 Off |                  N/A |
| N/A   N/A    P8    N/A /  N/A |   2205MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8604      G   /usr/libexec/Xorg                  13MiB |
+-----------------------------------------------------------------------------+
[root@genomics1 genomics]#

Make sure that the license server IP is configured on the VM in the vGPU grid configuration file.
1. Copy the template.
  [root@genomics1 genomics]# cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
2. Edit the file /etc/nvidia/rid.conf, add the license server IP address, and set the feature type to 1.
  ServerAddress=192.168.169.10
  FeatureType=1
After restarting the VM, you should see an entry under Licensed Clients in the license server as shown below.
Refer to the Solutions Setup section for more information on downloading the GATK and Cromwell software.

After GATK can use GPUs on premises, the workflow description language *. wdl has the runtime attributes as shown below.

task ValidateBAM {
  input {
    # Command parameters
    File input_bam
    String output_basename
    String? validation_mode
    String gatk_path
    # Runtime parameters
    String docker
    Int machine_mem_gb = 4
    Int addtional_disk_space_gb = 50
  }
  Int disk_size = ceil(size(input_bam, "GB")) + addtional_disk_space_gb
  String output_name = "${output_basename}_${validation_mode}.txt"
  command {
    ${gatk_path} \
      ValidateSamFile \
      --INPUT ${input_bam} \
      --OUTPUT ${output_name} \
      --MODE ${default="SUMMARY" validation_mode}
  }
  runtime {
    gpuCount: 1
    gpuType: "nvidia-tesla-p6"
    docker: docker
    memory: machine_mem_gb + " GB"
    disks: "local-disk " + disk_size + " HDD"
  }
  output {
    File validation_report = "${output_name}"
  }
}

Next: Conclusion.

GPU setup

Creating your file...