
Deploy a Kubernetes Cluster with NVIDIA DeepOps Automated Deployment

Contributors: kevin-hoke

To deploy and configure the Kubernetes cluster with NVIDIA DeepOps, complete the following steps:

  1. Make sure that the same user account exists on all the Kubernetes master and worker nodes.

  2. Clone the DeepOps repository.

    git clone https://github.com/NVIDIA/deepops.git
  3. Check out a recent release tag.

    cd deepops
    git checkout tags/20.08

    If this step is skipped, the latest development code is used instead of an official release.

  4. Prepare the deployment jump host by installing the necessary prerequisites.

    ./scripts/setup.sh
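  5. Create and edit the Ansible inventory by opening the file deepops/config/inventory in a vi editor.

    1. List all the master and worker nodes under [all].

    2. List all the master nodes under [kube-master].

    3. List all the master nodes under [etcd].

    4. List all the worker nodes under [kube-node].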
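
    As an illustration of the group layout described in step 5, the following sketch writes a sample inventory to a scratch file rather than to config/inventory. The host names (master01, worker01, worker02) and the 192.0.2.x addresses are hypothetical placeholders. The [k8s-cluster:children] group is an assumption inferred from the k8s-cluster limit used later in this procedure; compare it against the example inventory shipped with your DeepOps release before relying on it.

```shell
#!/bin/sh
set -eu

# Write the sample to a temporary file so nothing in the repository is touched.
INVENTORY_EXAMPLE="$(mktemp)"

cat > "$INVENTORY_EXAMPLE" <<'EOF'
# Every master and worker node, with hypothetical names and addresses
[all]
master01 ansible_host=192.0.2.11
worker01 ansible_host=192.0.2.21
worker02 ansible_host=192.0.2.22

# Master nodes
[kube-master]
master01

# etcd runs on the master nodes
[etcd]
master01

# Worker nodes
[kube-node]
worker01
worker02

# Assumed parent group that combines masters and workers
[k8s-cluster:children]
kube-master
kube-node
EOF

cat "$INVENTORY_EXAMPLE"
```

    Ansible group names are case-sensitive, so the section headers should be typed exactly as the playbooks expect them.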

  6. Enable the GPU Operator by opening the file deepops/config/group_vars/k8s-cluster.yml in a vi editor.

  7. Change the value of deepops_gpu_operator_enabled to true.
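
    For a non-interactive alternative to editing the file by hand, the change in steps 6 and 7 can be sketched with sed. The example below operates on a scratch file that holds a single stand-in line, not on the real configuration; it assumes the variable appears at the start of a line, which you should confirm in your copy of the file.

```shell
#!/bin/sh
set -eu

# Scratch file standing in for config/group_vars/k8s-cluster.yml.
cfg="$(mktemp)"
printf 'deepops_gpu_operator_enabled: false\n' > "$cfg"

# Rewrite the whole line so that any previous value is replaced.
# -i.bak keeps a backup copy and works with both GNU and BSD sed.
sed -i.bak 's/^deepops_gpu_operator_enabled:.*/deepops_gpu_operator_enabled: true/' "$cfg"

# Show the resulting setting.
grep '^deepops_gpu_operator_enabled:' "$cfg"
```

    To apply the same substitution to a real deployment, point sed at config/group_vars/k8s-cluster.yml from inside the deepops directory and review the result before running the playbook.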

  8. Verify the permissions and network configuration.

    ansible all -m raw -a "hostname" -k -K
    • Use -k if SSH to the remote hosts requires a password.

    • Use -K if sudo on the remote hosts requires a password.
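
    Because -k and -K differ only in case, the long option names can be easier to read in scripts: --ask-pass corresponds to -k and --ask-become-pass corresponds to -K. The sketch below only assembles and prints the command (a dry run); the two NEED_* variables are hypothetical toggles introduced for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical toggles: set to "yes" when a password prompt is needed.
NEED_SSH_PASSWORD=yes
NEED_SUDO_PASSWORD=yes

flags=""
if [ "$NEED_SSH_PASSWORD" = yes ]; then
  flags="$flags --ask-pass"          # same as -k
fi
if [ "$NEED_SUDO_PASSWORD" = yes ]; then
  flags="$flags --ask-become-pass"   # same as -K
fi

# Print the command instead of running it.
cmd="ansible all -m raw -a \"hostname\"$flags"
echo "$cmd"
```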

  9. If the previous step completed without any issues, proceed with the setup of Kubernetes.

    ansible-playbook --limit k8s-cluster playbooks/k8s-cluster.yml -k -K
  10. To verify the status of the Kubernetes nodes and pods, run the following commands:

    kubectl get nodes

    kubectl get pods -A

    It can take a few minutes for all the pods to reach the Running state.
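
    While waiting, it can help to list only the pods that are not yet healthy. The following sketch filters kubectl get pods -A --no-headers output and prints the pods whose STATUS is neither Running nor Completed. The pending_pods helper and the sample pod names are hypothetical; the sample text stands in for real cluster output.

```shell
#!/bin/sh
set -eu

# Print "namespace/pod status" for pods that are neither Running nor Completed.
# Expects `kubectl get pods -A --no-headers` output on stdin, whose columns
# are NAMESPACE NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Made-up sample output used in place of a live cluster.
sample_pods='kube-system   example-dns-abc     1/1   Running     0   5m
kube-system   example-net-xyz     0/1   Init:0/3    0   1m
example-ns    example-check-123   0/1   Completed   0   4m'

printf '%s\n' "$sample_pods" | pending_pods
```

    Against a live cluster, the same helper can be used as kubectl get pods -A --no-headers | pending_pods; empty output means that every pod is either running or has completed.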

  11. Verify that the Kubernetes setup can access and use the GPUs.

    ./scripts/k8s_verify_gpu.sh

    Expected sample output:

    rarvind@deployment-jump:~/deepops$ ./scripts/k8s_verify_gpu.sh
    job_name=cluster-gpu-tests
    Node found with 3 GPUs
    Node found with 3 GPUs
    total_gpus=6
    Creating/Deleting sandbox Namespace
    updating test yml
    downloading containers ...
    job.batch/cluster-gpu-tests condition met
    executing ...
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Mon Aug 17 16:02:45 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    Number of Nodes: 2
    Number of GPUs: 6
    6 / 6 GPU Jobs COMPLETED
    job.batch "cluster-gpu-tests" deleted
    namespace "cluster-gpu-verify" deleted
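
    The key line in the output above is the summary "6 / 6 GPU Jobs COMPLETED". As a minimal sketch, assuming the summary keeps that "N / M GPU Jobs COMPLETED" form, the check below succeeds only when both numbers match. The gpu_jobs_all_completed helper is a hypothetical name, and the sample text is copied from the output shown above.

```shell
#!/bin/sh
set -eu

# Succeeds when a "N / M GPU Jobs COMPLETED" line is present and N equals M.
gpu_jobs_all_completed() {
  awk '/GPU Jobs COMPLETED/ { found = 1; ok = ($1 == $3) }
       END { if (found && ok) exit 0; exit 1 }'
}

# Summary lines taken from the sample output above.
sample_summary='Number of Nodes: 2
Number of GPUs: 6
6 / 6 GPU Jobs COMPLETED'

if printf '%s\n' "$sample_summary" | gpu_jobs_all_completed; then
  echo "all GPU jobs completed"
else
  echo "some GPU jobs did not complete"
fi
```

    The helper reads from standard input, so saved output from ./scripts/k8s_verify_gpu.sh can be piped into it in the same way.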
  12. Install Helm on the deployment jump host.

    ./scripts/install_helm.sh
  13. Remove the taints on the master nodes.

    kubectl taint nodes --all node-role.kubernetes.io/master-

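    This step is required to run the load balancer pods.

  14. Deploy the load balancer.

  15. Edit the config/helm/metallb.yml file and provide a range of IP addresses in the application network to be used by the load balancer.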

    ---
    # Default address range matches private network for the virtual cluster
    # defined in virtual/.
    # You should set this address range based on your site's infrastructure.
    configInline:
      address-pools:
      - name: default
        protocol: layer2
        addresses:
        - 172.21.231.130-172.21.231.140 # Application Network
    controller:
      nodeSelector:
        node-role.kubernetes.io/master: ""
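
    Each LoadBalancer service consumes one address from this pool, so it is worth checking how many addresses a range provides. The sketch below counts the addresses in the sample range from the file above; the ip_to_int helper is a hypothetical name, and the arithmetic assumes a well-formed IPv4 start-end range.

```shell
#!/bin/sh
set -eu

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  # Split on dots into the positional parameters $1..$4.
  set -- $(printf '%s' "$1" | tr '.' ' ')
  echo $(( (($1 * 256 + $2) * 256 + $3) * 256 + $4 ))
}

# Sample range from the file above.
range="172.21.231.130-172.21.231.140"
start="${range%-*}"   # text before the hyphen
end="${range#*-}"     # text after the hyphen

# The range is inclusive, hence the +1.
count=$(( $(ip_to_int "$end") - $(ip_to_int "$start") + 1 ))
echo "$range provides $count addresses"
```

    For the sample range this reports 11 addresses, which is the upper bound on the number of services that can receive an external IP from the pool.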
  16. Run the script to deploy the load balancer.

    ./scripts/k8s_deploy_loadbalancer.sh
  17. Deploy an ingress controller.

    ./scripts/k8s_deploy_ingress.sh