GPU setup

Contributors

At the time of publication, the GATK tool does not have native support for GPU-based execution on premises. The following setup and guidance is provided to enable the readers understand how simple it is to use FlexPod with a rear-mounted NVIDIA Tesla P6 GPU using a PCIe mezzanine card for GATK.

We used the following Cisco-Validated Design (CVD) as the reference architecture and best-practice guide to set up the FlexPod environment so that we can run applications that use GPUs.

Here is a set of key takeaways during this setup:

  1. We used a PCIe NVIDIA Tesla P6 GPU in a mezzanine slot in the UCS B200 M5 servers.

    Error: Missing Graphic Image

    Error: Missing Graphic Image

  2. For this setup, we registered on the NVIDIA partner portal and obtained an evaluation license (also known as an entitlement) to be able to use the GPUs in compute mode.

  3. We downloaded the NVIDIA vGPU software required from the NVIDIA partner website.

  4. We downloaded the entitlement *.bin file from the NVIDIA partner website.

  5. We installed an NVIDIA vGPU license server and added the entitlements to the license server using the *.bin file downloaded from the NVIDIA partner site.

  6. Make sure to choose the correct NVIDIA vGPU software version for your deployment on the NVIDIA partner portal. For this setup we used driver version 460.73.02.

  7. This command installs the NVIDIA vGPU Manager in ESXi.

    [root@localhost:~] esxcli software vib install -v /vmfs/volumes/infra_datastore_nfs/nvidia/vib/NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992.vib
    Installation Result
    Message: Operation finished successfully.
    Reboot Required: false
    VIBs Installed: NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992
    VIBs Removed:
    VIBs Skipped:
  8. After rebooting the ESXi server, run the following command to validate the installation and check the health of the GPUs.

    [root@localhost:~] nvidia-smi
    Wed Aug 18 21:37:19 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.73.02    Driver Version: 460.73.02    CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla P6            On   | 00000000:D8:00.0 Off |                    0 |
    | N/A   35C    P8     9W /  90W |  15208MiB / 15359MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A   2812553    C+G   RHEL01                          15168MiB |
    +-----------------------------------------------------------------------------+
    [root@localhost:~]
  9. Using vCenter, configure the graphics device settings to “Shared Direct.”

    Error: Missing Graphic Image

    Error: Missing Graphic Image

  10. Make sure that secure boot is disabled for the RedHat VM.

  11. Make sure that the VM Boot Options firmware is set to EFI ( ref).

    Error: Missing Graphic Image

  12. Make sure that the following PARAMS are added to the VM Options advanced Edit Configuration. The value of the pciPassthru.64bitMMIOSizeGB parameter depends on the GPU memory and number of GPUs assigned to the VM. For example:

    1. If a VM is assigned 4 x 32GB V100 GPUs, then this value should be 128.

    2. If a VM is assigned 4 x 16GB P6 GPUs, then this value should be 64.

      Error: Missing Graphic Image

      Error: Missing Graphic Image

  13. When adding vGPUs as a new PCI Device to the virtual machine in vCenter, make sure to select NVIDIA GRID vGPU as the PCI Device type.

  14. Choose the correct GPU profile that suites the GPU being used, the GPU memory, and the usage purpose: for example, graphics versus compute.

    Error: Missing Graphic Image

  15. On the RedHat Linux VM, NVIDIA drivers can be installed by running the following command:

    [root@genomics1 genomics]#sh NVIDIA-Linux-x86_64-460.73.01-grid.run
  16. Verify that the correct vGPU profile is being reported by running the following command:

    [root@genomics1 genomics]# nvidia-smi –query-gpu=gpu_name –format=csv,noheader –id=0 | sed -e ‘s/ /-/g’
    GRID-P6-16C
    [root@genomics1 genomics]#
  17. After reboot, verify that the correct NVIDIA vGPU are reported along with the driver versions.

    [root@genomics1 genomics]# nvidia-smi
    Wed Aug 18 20:30:56 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GRID P6-16C         On   | 00000000:02:02.0 Off |                  N/A |
    | N/A   N/A    P8    N/A /  N/A |   2205MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      8604      G   /usr/libexec/Xorg                  13MiB |
    +-----------------------------------------------------------------------------+
    [root@genomics1 genomics]#
  18. Make sure that the license server IP is configured on the VM in the vGPU grid configuration file.

    1. Copy the template.

      [root@genomics1 genomics]# cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
    2. Edit the file /etc/nvidia/rid.conf, add the license server IP address, and set the feature type to 1.

       ServerAddress=192.168.169.10
       FeatureType=1
  19. After restarting the VM, you should see an entry under Licensed Clients in the license server as shown below.

    Error: Missing Graphic Image

  20. Refer to the Solutions Setup section for more information on downloading the GATK and Cromwell software.

  21. After GATK can use GPUs on premises, the workflow description language *. wdl has the runtime attributes as shown below.

    task ValidateBAM {
      input {
        # Command parameters
        File input_bam
        String output_basename
        String? validation_mode
        String gatk_path
        # Runtime parameters
        String docker
        Int machine_mem_gb = 4
        Int addtional_disk_space_gb = 50
      }
      Int disk_size = ceil(size(input_bam, "GB")) + addtional_disk_space_gb
      String output_name = "${output_basename}_${validation_mode}.txt"
      command {
        ${gatk_path} \
          ValidateSamFile \
          --INPUT ${input_bam} \
          --OUTPUT ${output_name} \
          --MODE ${default="SUMMARY" validation_mode}
      }
      runtime {
        gpuCount: 1
        gpuType: "nvidia-tesla-p6"
        docker: docker
        memory: machine_mem_gb + " GB"
        disks: "local-disk " + disk_size + " HDD"
      }
      output {
        File validation_report = "${output_name}"
      }
    }