
Deploy a Kubernetes Cluster with NVIDIA DeepOps Automated Deployment

Contributors netapp-dorianh

To deploy and configure the Kubernetes Cluster with NVIDIA DeepOps, complete the following steps:

  1. Make sure that the same user account is present on all the Kubernetes master and worker nodes.

  2. Clone the DeepOps repository.

    git clone https://github.com/NVIDIA/deepops.git
  3. Check out a recent release tag.

    cd deepops
    git checkout tags/20.08

    If this step is skipped, the latest development code is used, not an official release.

  4. Prepare the Deployment Jump by installing the necessary prerequisites.

    ./scripts/setup.sh
  5. Create and edit the Ansible inventory by opening deepops/config/inventory in a text editor such as vi.

    1. List all the master and worker nodes under [all].

    2. List all the master nodes under [kube-master].

    3. List all the master nodes under [etcd].

    4. List all the worker nodes under [kube-node].
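
As a rough illustration of the four groups above, an inventory for one master node and two worker nodes might look like the following. The host names (master01, worker01, worker02) and the 192.0.2.x addresses are placeholders invented for this sketch, not values from this deployment. Any other groups already present in the file are assumed to stay as shipped.

```ini
# Hypothetical hosts; replace the names and addresses with your own.
[all]
master01    ansible_host=192.0.2.10
worker01    ansible_host=192.0.2.11
worker02    ansible_host=192.0.2.12

# Master nodes run the Kubernetes control plane.
[kube-master]
master01

# etcd is placed on the same master nodes.
[etcd]
master01

# Worker nodes run the workloads.
[kube-node]
worker01
worker02
```

The ansible_host variable tells Ansible which address to connect to when the inventory name is not resolvable through DNS.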


  6. To enable the GPU Operator, open deepops/config/group_vars/k8s-cluster.yml in a text editor such as vi.


  7. Set the value of deepops_gpu_operator_enabled to true.
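
If you prefer to make this change without an interactive editor, the edit can be sketched as a small shell function. The function name enable_gpu_operator is hypothetical, and the sketch assumes that the key deepops_gpu_operator_enabled already exists, uncommented, at the start of a line in the file. If it does not, the function reports failure and leaves the other lines untouched.

```shell
# Set deepops_gpu_operator_enabled to true in the given group_vars file.
# Assumption: the key is already present at the start of a line.
enable_gpu_operator() {
    vars_file="$1"
    tmp_file="$(mktemp)" || return 1
    # Rewrite only the matching line; every other line is copied unchanged.
    sed 's/^deepops_gpu_operator_enabled:.*/deepops_gpu_operator_enabled: true/' \
        "$vars_file" > "$tmp_file" && cat "$tmp_file" > "$vars_file"
    rm -f "$tmp_file"
    # Succeed only if the file now contains the enabled flag.
    grep -q '^deepops_gpu_operator_enabled: true$' "$vars_file"
}

# Usage, from the deepops directory:
#   enable_gpu_operator config/group_vars/k8s-cluster.yml
```

Writing to a temporary file first avoids depending on the in-place option of sed, which differs between implementations.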

  8. Verify the permissions and network configuration.

    ansible all -m raw -a "hostname" -k -K
    • If SSH to the remote hosts requires a password, use -k.

    • If sudo on the remote hosts requires a password, use -K.

  9. If the previous step passed without any issues, proceed with the setup of Kubernetes.

    ansible-playbook --limit k8s-cluster playbooks/k8s-cluster.yml -k -K
  10. To verify the status of the Kubernetes nodes and the pods, run the following commands:

    kubectl get nodes


    kubectl get pods -A

    It can take a few minutes for all the pods to run.
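
Rather than rereading the pod list by eye while waiting, you could count the pods that have not yet settled. The following is a minimal sketch; the function name count_unsettled_pods is hypothetical, and it assumes the usual column order of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Count pods whose STATUS column is neither Running nor Completed.
# Reads the output of `kubectl get pods -A --no-headers` on stdin.
count_unsettled_pods() {
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Usage (requires a configured kubectl):
#   kubectl get pods -A --no-headers | count_unsettled_pods
```

A result of 0 suggests that every pod is either running or has finished; any other number is the count of pods still starting or failing.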


  11. Verify that the Kubernetes setup can access and use the GPUs.

    ./scripts/k8s_verify_gpu.sh

    Expected sample output:

    rarvind@deployment-jump:~/deepops$ ./scripts/k8s_verify_gpu.sh
    job_name=cluster-gpu-tests
    Node found with 3 GPUs
    Node found with 3 GPUs
    total_gpus=6
    Creating/Deleting sandbox Namespace
    updating test yml
    downloading containers ...
    job.batch/cluster-gpu-tests condition met
    executing ...
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Number of Nodes: 2
    Number of GPUs: 6
    6 / 6 GPU Jobs COMPLETED
    job.batch "cluster-gpu-tests" deleted
    namespace "cluster-gpu-verify" deleted
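
The line to look for in this output is the summary "6 / 6 GPU Jobs COMPLETED". If the check needs to run unattended, that line can be tested in a script. The sketch below uses a hypothetical function name, gpu_jobs_all_completed, and assumes that the summary line keeps the format shown in the sample output above.

```shell
# Succeed only if the "N / M GPU Jobs COMPLETED" line reports N equal to M.
# Reads the output of ./scripts/k8s_verify_gpu.sh on stdin.
gpu_jobs_all_completed() {
    awk '/GPU Jobs COMPLETED/ { found = 1; ok = ($1 == $3) }
         END { exit !(found && ok) }'
}

# Usage, from the deepops directory:
#   ./scripts/k8s_verify_gpu.sh | gpu_jobs_all_completed && echo "GPU check passed"
```

The function also fails when no summary line is present, so a script that aborts early is not mistaken for a pass.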
  12. Install Helm on the Deployment Jump.

    ./scripts/install_helm.sh
  13. Remove the taints on the master nodes.

    kubectl taint nodes --all node-role.kubernetes.io/master-

    This step is required to run the LoadBalancer pods, because the LoadBalancer controller is scheduled on the master nodes.

  14. Deploy LoadBalancer by completing the following two steps.

  15. Edit the config/helm/metallb.yml file and provide a range of IP addresses in the Application Network for LoadBalancer to assign.

    ---
    # Default address range matches private network for the virtual cluster
    # defined in virtual/.
    # You should set this address range based on your site's infrastructure.
    configInline:
      address-pools:
      - name: default
        protocol: layer2
        addresses:
        - 172.21.231.130-172.21.231.140  # Application Network
    controller:
      nodeSelector:
        node-role.kubernetes.io/master: ""
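
Each Service of type LoadBalancer normally consumes one address from this pool, so it can help to check how many addresses a range provides before committing to it. The helper names below (ip_to_int and count_range_addresses) are hypothetical, and the sketch assumes a range written in the first-last form used in the file above.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs="$IFS"; IFS=.
    set -- $1              # split the address into its four octets
    IFS="$old_ifs"
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print how many addresses a "first-last" range contains, inclusive.
count_range_addresses() {
    first="${1%-*}"        # text before the hyphen
    last="${1#*-}"         # text after the hyphen
    echo $(( $(ip_to_int "$last") - $(ip_to_int "$first") + 1 ))
}

# Usage:
#   count_range_addresses 172.21.231.130-172.21.231.140
```

For the sample range in the file, the result is 11 addresses.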
  16. Run a script to deploy LoadBalancer.

    ./scripts/k8s_deploy_loadbalancer.sh
  17. Deploy an Ingress Controller.

    ./scripts/k8s_deploy_ingress.sh