Kubeflow 구축
이 섹션에서는 Kubernetes 클러스터에 Kubeflow를 구축하기 위해 완료해야 하는 작업에 대해 설명합니다.
필수 구성 요소
이 섹션에 요약된 배포 연습을 수행하기 전에 이미 다음 작업을 수행했다고 가정합니다.
-
Kubernetes 작업 클러스터가 이미 있으며, Kubeflow에서 지원하는 Kubernetes 버전을 실행하고 있습니다. 지원되는 버전 목록은 를 참조하십시오 "Kubeflow 공식 문서".
-
에 설명된 대로 Kubernetes 클러스터에 NetApp Trident를 이미 설치 및 구성했습니다 "Trident 구축 및 구성".
기본 Kubernetes StorageClass를 설정합니다
Kubeflow를 구현하기 전에 Kubernetes 클러스터 내에서 기본 StorageClass를 지정해야 합니다. Kubeflow 배포 프로세스에서는 기본 StorageClass를 사용하여 새 영구 볼륨의 프로비저닝을 시도합니다. 기본 StorageClass로 지정된 StorageClass가 없으면 배포가 실패합니다. 클러스터 내에서 기본 StorageClass를 지정하려면 배포 점프 호스트에서 다음 작업을 수행합니다. 클러스터 내에서 기본 StorageClass를 이미 지정한 경우에는 이 단계를 건너뛸 수 있습니다.
-
기존 StorageClasses 중 하나를 기본 StorageClass로 지정합니다. 다음 명령을 실행하면 기본 StorageClass로 ONTAP-ai-FlexVols-Retain이라는 StorageClass가 지정됩니다.
ONTAP-NAS-Flexgroup Trident 백엔드 유형은 PVC 크기가 매우 큽니다. 기본적으로 Kubeflow는 크기가 몇 GB인 PVC를 프로비저닝하려고 시도합니다. 따라서 Kubeflow 구축을 위해 "ONTAP-NAS-flexgroup" 백엔드 유형을 기본 StorageClass로 사용하는 StorageClass를 지정할 수 없습니다. |
$ kubectl get sc NAME PROVISIONER AGE ontap-ai-flexgroups-retain csi.trident.netapp.io 25h ontap-ai-flexgroups-retain-iface1 csi.trident.netapp.io 25h ontap-ai-flexgroups-retain-iface2 csi.trident.netapp.io 25h ontap-ai-flexvols-retain csi.trident.netapp.io 3s $ kubectl patch storageclass ontap-ai-flexvols-retain -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' storageclass.storage.k8s.io/ontap-ai-flexvols-retain patched $ kubectl get sc NAME PROVISIONER AGE ontap-ai-flexgroups-retain csi.trident.netapp.io 25h ontap-ai-flexgroups-retain-iface1 csi.trident.netapp.io 25h ontap-ai-flexgroups-retain-iface2 csi.trident.netapp.io 25h ontap-ai-flexvols-retain (default) csi.trident.netapp.io 54s
NVIDIA DeepOps를 사용하여 Kubeflow를 배포합니다
NVIDIA DeepOps에서 제공하는 Kubeflow 구현 툴을 사용할 것을 권장합니다. DeepOps 구축 툴을 사용하여 Kubernetes 클러스터에 Kubeflow를 배포하려면 배포 점프 호스트에서 다음 작업을 수행합니다.
또는 에 따라 Kubeflow를 수동으로 배포할 수도 있습니다 "설치 지침" 공식 Kubeflow 문서에서 제공됩니다 |
-
에 따라 클러스터에 Kubeflow를 구현합니다 "Kubeflow 구축 지침" NVIDIA DeepOps GitHub 사이트에서 다운로드할 수 있습니다.
-
DeepOps Kubeflow 구현 도구에서 출력하는 Kubeflow 대시보드 URL을 기록합니다.
$ ./scripts/k8s/deploy_kubeflow.sh -x … INFO[0007] Applied the configuration Successfully! filename="cmd/apply.go:72" Kubeflow app installed to: /home/ai/kubeflow It may take several minutes for all services to start. Run 'kubectl get pods -n kubeflow' to verify To remove (excluding CRDs, istio, auth, and cert-manager), run: ./scripts/k8s_deploy_kubeflow.sh -d To perform a full uninstall : ./scripts/k8s_deploy_kubeflow.sh -D Kubeflow Dashboard (HTTP NodePort): http://10.61.188.111:31380
-
Kubeflow 네임스페이스 내에 배포된 모든 Pod에 'Running'이라는 'Status'가 표시되는지 확인하고 네임스페이스 내에 배포된 구성 요소가 오류 상태에 있지 않은지 확인합니다. 모든 Pod를 시작하는 데 몇 분 정도 걸릴 수 있습니다.
$ kubectl get all -n kubeflow NAME READY STATUS RESTARTS AGE pod/admission-webhook-bootstrap-stateful-set-0 1/1 Running 0 95s pod/admission-webhook-deployment-6b89c84c98-vrtbh 1/1 Running 0 91s pod/application-controller-stateful-set-0 1/1 Running 0 98s pod/argo-ui-5dcf5d8b4f-m2wn4 1/1 Running 0 97s pod/centraldashboard-cf4874ddc-7hcr8 1/1 Running 0 97s pod/jupyter-web-app-deployment-685b455447-gjhh7 1/1 Running 0 96s pod/katib-controller-88c97d85c-kgq66 1/1 Running 1 95s pod/katib-db-8598468fd8-5jw2c 1/1 Running 0 95s pod/katib-manager-574c8c67f9-wtrf5 1/1 Running 1 95s pod/katib-manager-rest-778857c989-fjbzn 1/1 Running 0 95s pod/katib-suggestion-bayesianoptimization-65df4d7455-qthmw 1/1 Running 0 94s pod/katib-suggestion-grid-56bf69f597-98vwn 1/1 Running 0 94s pod/katib-suggestion-hyperband-7777b76cb9-9v6dq 1/1 Running 0 93s pod/katib-suggestion-nasrl-77f6f9458c-2qzxq 1/1 Running 0 93s pod/katib-suggestion-random-77b88b5c79-l64j9 1/1 Running 0 93s pod/katib-ui-7587c5b967-nd629 1/1 Running 0 95s pod/metacontroller-0 1/1 Running 0 96s pod/metadata-db-5dd459cc-swzkm 1/1 Running 0 94s pod/metadata-deployment-6cf77db994-69fk7 1/1 Running 3 93s pod/metadata-deployment-6cf77db994-mpbjt 1/1 Running 3 93s pod/metadata-deployment-6cf77db994-xg7tz 1/1 Running 3 94s pod/metadata-ui-78f5b59b56-qb6kr 1/1 Running 0 94s pod/minio-758b769d67-llvdr 1/1 Running 0 91s pod/ml-pipeline-5875b9db95-g8t2k 1/1 Running 0 91s pod/ml-pipeline-persistenceagent-9b69ddd46-bt9r9 1/1 Running 0 90s pod/ml-pipeline-scheduledworkflow-7b8d756c76-7x56s 1/1 Running 0 90s pod/ml-pipeline-ui-79ffd9c76-fcwpd 1/1 Running 0 90s pod/ml-pipeline-viewer-controller-deployment-5fdc87f58-b2t9r 1/1 Running 0 90s pod/mysql-657f87857d-l5k9z 1/1 Running 0 91s pod/notebook-controller-deployment-56b4f59bbf-8bvnr 1/1 Running 0 92s pod/profiles-deployment-6bc745947-mrdkh 2/2 Running 0 90s pod/pytorch-operator-77c97f4879-hmlrv 1/1 Running 0 92s pod/seldon-operator-controller-manager-0 1/1 Running 1 91s pod/spartakus-volunteer-5fdfddb779-l7qkm 1/1 Running 0 92s pod/tensorboard-6544748d94-nh8b2 1/1 Running 0 92s pod/tf-job-dashboard-56f79c59dd-6w59t 1/1 Running 0 92s pod/tf-job-operator-79cbfd6dbc-rb58c 1/1 Running 0 91s pod/workflow-controller-db644d554-cwrnb 1/1 Running 0 97s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/admission-webhook-service ClusterIP 10.233.51.169 <none> 443/TCP 97s service/application-controller-service ClusterIP 10.233.4.54 <none> 443/TCP 98s service/argo-ui NodePort 10.233.47.191 <none> 80:31799/TCP 97s service/centraldashboard ClusterIP 10.233.8.36 <none> 80/TCP 97s service/jupyter-web-app-service ClusterIP 10.233.1.42 <none> 80/TCP 97s service/katib-controller ClusterIP 10.233.25.226 <none> 443/TCP 96s service/katib-db ClusterIP 10.233.33.151 <none> 3306/TCP 97s service/katib-manager ClusterIP 10.233.46.239 <none> 6789/TCP 96s service/katib-manager-rest ClusterIP 10.233.55.32 <none> 80/TCP 96s service/katib-suggestion-bayesianoptimization ClusterIP 10.233.49.191 <none> 6789/TCP 95s service/katib-suggestion-grid ClusterIP 10.233.9.105 <none> 6789/TCP 95s service/katib-suggestion-hyperband ClusterIP 10.233.22.2 <none> 6789/TCP 95s service/katib-suggestion-nasrl ClusterIP 10.233.63.73 <none> 6789/TCP 95s service/katib-suggestion-random ClusterIP 10.233.57.210 <none> 6789/TCP 95s service/katib-ui ClusterIP 10.233.6.116 <none> 80/TCP 96s service/metadata-db ClusterIP 10.233.31.2 <none> 3306/TCP 96s service/metadata-service ClusterIP 10.233.27.104 <none> 8080/TCP 96s service/metadata-ui ClusterIP 10.233.57.177 <none> 80/TCP 96s service/minio-service ClusterIP 10.233.44.90 <none> 9000/TCP 94s service/ml-pipeline ClusterIP 10.233.41.201 <none> 8888/TCP,8887/TCP 94s service/ml-pipeline-tensorboard-ui ClusterIP 10.233.36.207 <none> 80/TCP 93s service/ml-pipeline-ui ClusterIP 10.233.61.150 <none> 80/TCP 93s service/mysql ClusterIP 10.233.55.117 <none> 3306/TCP 94s service/notebook-controller-service ClusterIP 10.233.10.166 <none> 443/TCP 95s service/profiles-kfam ClusterIP 10.233.33.79 <none> 8081/TCP 92s service/pytorch-operator ClusterIP 10.233.37.112 <none> 8443/TCP 95s service/seldon-operator-controller-manager-service ClusterIP 10.233.30.178 <none> 443/TCP 92s service/tensorboard ClusterIP 10.233.58.151 <none> 9000/TCP 94s service/tf-job-dashboard ClusterIP 10.233.4.17 <none> 80/TCP 94s service/tf-job-operator ClusterIP 10.233.60.32 <none> 8443/TCP 94s service/webhook-server-service ClusterIP 10.233.32.167 <none> 443/TCP 87s NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/admission-webhook-deployment 1/1 1 1 97s deployment.apps/argo-ui 1/1 1 1 97s deployment.apps/centraldashboard 1/1 1 1 97s deployment.apps/jupyter-web-app-deployment 1/1 1 1 97s deployment.apps/katib-controller 1/1 1 1 96s deployment.apps/katib-db 1/1 1 1 97s deployment.apps/katib-manager 1/1 1 1 96s deployment.apps/katib-manager-rest 1/1 1 1 96s deployment.apps/katib-suggestion-bayesianoptimization 1/1 1 1 95s deployment.apps/katib-suggestion-grid 1/1 1 1 95s deployment.apps/katib-suggestion-hyperband 1/1 1 1 95s deployment.apps/katib-suggestion-nasrl 1/1 1 1 95s deployment.apps/katib-suggestion-random 1/1 1 1 95s deployment.apps/katib-ui 1/1 1 1 96s deployment.apps/metadata-db 1/1 1 1 96s deployment.apps/metadata-deployment 3/3 3 3 96s deployment.apps/metadata-ui 1/1 1 1 96s deployment.apps/minio 1/1 1 1 94s deployment.apps/ml-pipeline 1/1 1 1 94s deployment.apps/ml-pipeline-persistenceagent 1/1 1 1 93s deployment.apps/ml-pipeline-scheduledworkflow 1/1 1 1 93s deployment.apps/ml-pipeline-ui 1/1 1 1 93s deployment.apps/ml-pipeline-viewer-controller-deployment 1/1 1 1 93s deployment.apps/mysql 1/1 1 1 94s deployment.apps/notebook-controller-deployment 1/1 1 1 95s deployment.apps/profiles-deployment 1/1 1 1 92s deployment.apps/pytorch-operator 1/1 1 1 95s deployment.apps/spartakus-volunteer 1/1 1 1 94s deployment.apps/tensorboard 1/1 1 1 94s deployment.apps/tf-job-dashboard 1/1 1 1 94s deployment.apps/tf-job-operator 1/1 1 1 94s deployment.apps/workflow-controller 1/1 1 1 97s NAME DESIRED CURRENT READY AGE replicaset.apps/admission-webhook-deployment-6b89c84c98 1 1 1 97s replicaset.apps/argo-ui-5dcf5d8b4f 1 1 1 97s replicaset.apps/centraldashboard-cf4874ddc 1 1 1 97s replicaset.apps/jupyter-web-app-deployment-685b455447 1 1 1 97s replicaset.apps/katib-controller-88c97d85c 1 1 1 96s replicaset.apps/katib-db-8598468fd8 1 1 1 97s replicaset.apps/katib-manager-574c8c67f9 1 1 1 96s replicaset.apps/katib-manager-rest-778857c989 1 1 1 96s replicaset.apps/katib-suggestion-bayesianoptimization-65df4d7455 1 1 1 95s replicaset.apps/katib-suggestion-grid-56bf69f597 1 1 1 95s replicaset.apps/katib-suggestion-hyperband-7777b76cb9 1 1 1 95s replicaset.apps/katib-suggestion-nasrl-77f6f9458c 1 1 1 95s replicaset.apps/katib-suggestion-random-77b88b5c79 1 1 1 95s replicaset.apps/katib-ui-7587c5b967 1 1 1 96s replicaset.apps/metadata-db-5dd459cc 1 1 1 96s replicaset.apps/metadata-deployment-6cf77db994 3 3 3 96s replicaset.apps/metadata-ui-78f5b59b56 1 1 1 96s replicaset.apps/minio-758b769d67 1 1 1 93s replicaset.apps/ml-pipeline-5875b9db95 1 1 1 93s replicaset.apps/ml-pipeline-persistenceagent-9b69ddd46 1 1 1 92s replicaset.apps/ml-pipeline-scheduledworkflow-7b8d756c76 1 1 1 91s replicaset.apps/ml-pipeline-ui-79ffd9c76 1 1 1 91s replicaset.apps/ml-pipeline-viewer-controller-deployment-5fdc87f58 1 1 1 91s replicaset.apps/mysql-657f87857d 1 1 1 92s replicaset.apps/notebook-controller-deployment-56b4f59bbf 1 1 1 94s replicaset.apps/profiles-deployment-6bc745947 1 1 1 91s replicaset.apps/pytorch-operator-77c97f4879 1 1 1 94s replicaset.apps/spartakus-volunteer-5fdfddb779 1 1 1 94s replicaset.apps/tensorboard-6544748d94 1 1 1 93s replicaset.apps/tf-job-dashboard-56f79c59dd 1 1 1 93s replicaset.apps/tf-job-operator-79cbfd6dbc 1 1 1 93s replicaset.apps/workflow-controller-db644d554 1 1 1 97s NAME READY AGE statefulset.apps/admission-webhook-bootstrap-stateful-set 1/1 97s statefulset.apps/application-controller-stateful-set 1/1 98s statefulset.apps/metacontroller 1/1 98s statefulset.apps/seldon-operator-controller-manager 1/1 92s $ kubectl get pvc -n kubeflow NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE katib-mysql Bound pvc-b07f293e-d028-11e9-9b9d-00505681a82d 10Gi RWO ontap-ai-flexvols-retain 27m metadata-mysql Bound pvc-b0f3f032-d028-11e9-9b9d-00505681a82d 10Gi RWO ontap-ai-flexvols-retain 27m minio-pv-claim Bound pvc-b22727ee-d028-11e9-9b9d-00505681a82d 20Gi RWO ontap-ai-flexvols-retain 27m mysql-pv-claim Bound pvc-b2429afd-d028-11e9-9b9d-00505681a82d 20Gi RWO ontap-ai-flexvols-retain 27m
-
웹 브라우저에서 2단계에서 기록해 둔 URL로 이동하여 Kubeflow 중앙 대시보드에 액세스합니다.
기본 사용자 이름은 admin@kubeflow.org, 기본 암호는 12341234입니다. 추가 사용자를 생성하려면 의 지침을 따르십시오 "Kubeflow 공식 문서".