Create a Kubeflow Pipeline to Rapidly Clone a Dataset for a Data Scientist Workspace

To define and execute a new Kubeflow Pipeline that uses NetApp FlexClone technology to rapidly and efficiently clone a dataset volume for a data scientist or developer workspace, perform the following tasks. For more information about Kubeflow Pipelines, see the official Kubeflow documentation. Note that the example pipeline shown in this section works only with volumes that reside on ONTAP storage systems or software-defined ONTAP instances.

  1. If you have not already done so, create a Kubernetes secret containing the username and password of the cluster admin account for the ONTAP cluster on which your volumes reside. This secret must be created in the kubeflow namespace because that is the namespace in which pipelines are executed. When executing the following commands, replace username and password with your actual credentials, and use the output of the base64 commands as the values of the username and password fields in your secret definition.

    $ echo -n 'username' | base64
    dXNlcm5hbWU=
    $ echo -n 'password' | base64
    cGFzc3dvcmQ=
    $ cat << EOF > ./secret-ontap-cluster-mgmt-account.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: ontap-cluster-mgmt-account
      namespace: kubeflow
    data:
      username: dXNlcm5hbWU=
      password: cGFzc3dvcmQ=
    EOF
    $ kubectl create -f ./secret-ontap-cluster-mgmt-account.yaml
    secret/ontap-cluster-mgmt-account created
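
    To sanity-check this step, you can list the secret in the kubeflow namespace (the secret name below assumes the example definition above):

    $ kubectl -n kubeflow get secret ontap-cluster-mgmt-account
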
  2. If you have not already done so, you must install the Kubeflow Pipelines SDK. Refer to the official Kubeflow documentation for installation instructions.
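
    The SDK is distributed as the kfp Python package; assuming pip is available for your Python 3 environment, a typical installation looks like the following:

    $ python3 -m pip install kfp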

  3. Define your Kubeflow Pipeline in Python using the Kubeflow Pipelines SDK. The example commands that follow create a definition for a pipeline that accepts the following parameters at run time and then executes the steps listed below. Modify the pipeline definition as needed for your specific process.

    Run-time Parameters:

    • ontap_cluster_mgmt_hostname: Hostname/IP address of the ONTAP cluster on which your dataset and model volumes are stored.

    • ontap_cluster_admin_acct_k8s_secret: Name of the Kubernetes secret that was created in step 1.

    • ontap_api_verify_ssl_cert: Denotes whether or not to verify your cluster’s SSL certificate when communicating with the ONTAP API (True/False).

    • workspace_name: Name that you want to give to your new workspace.

    • jupyter_namespace: Namespace in which you intend to create a Jupyter Notebook workspace. See the section Provision a Jupyter Notebook Workspace for details on creating a Jupyter Notebook workspace. The dataset clone that this pipeline creates will be mountable in the Jupyter Notebook workspace.

    • dataset_volume_pv_existing: Name of the Kubernetes PersistentVolume (PV) object that corresponds to the dataset volume PVC that is tied to the volume that you wish to clone. To get the name of the PV, you can run kubectl -n <namespace> get pvc; the name of the PV that corresponds to a given PVC appears in the VOLUME column (see the example after this parameter list). Note that this could be the volume that was imported in the section Provision a Jupyter Notebook Workspace, step 1, or it could be a different volume.

    • trident_storage_class: Kubernetes StorageClass that you wish to use to create this clone. This would generally be a StorageClass that you created in the section Example Kubernetes StorageClasses for ONTAP AI Deployments.

    • trident_namespace: Namespace that Trident is installed in. Note that, by default, Trident is installed in the trident namespace.

    • trident_backend: Trident Backend that you wish to use to create this clone. This would generally be a Backend that you created in the section Example Kubernetes StorageClasses for ONTAP AI Deployments.
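
      As noted in the dataset_volume_pv_existing description above, you can look up the PV name for a given PVC with kubectl. The namespace, PVC name, capacity, and age below are hypothetical examples; the VOLUME column contains the PV name to pass to the pipeline:

      $ kubectl -n admin get pvc
      NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
      dataset-vol   Bound    pvc-db963a53-abf2-4ffa-9c07-8815ce78d506   10Ti       RWX            ontap-ai-flexvols-retain   5d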

      Pipeline Steps:

      1. Triggers the creation of a clone, using NetApp FlexClone technology, of your dataset volume.

      2. Prints instructions for deploying an interactive Jupyter Notebook workspace that has access to the dataset clone.

        $ cat << EOF > ./create-data-scientist-workspace.py
        # Kubeflow Pipeline Definition: Create Data Scientist Workspace
        import kfp.dsl as dsl
        import kfp.components as comp
        from kubernetes import client as k8s_client
        # Define function that triggers the creation of a NetApp FlexClone copy of the dataset volume
        def netappClone(
            ontapClusterMgmtHostname: str,
            sourcePvName: str,
            verifySSLCert: bool = True
        ) -> str :
            # Install netapp_ontap package
            import sys, subprocess;
            subprocess.run([sys.executable, '-m', 'pip', 'install', 'netapp_ontap'])
        
            # Import needed functions/classes
            from netapp_ontap import config as netappConfig
            from netapp_ontap.host_connection import HostConnection as NetAppHostConnection
            from netapp_ontap.resources import Volume, Snapshot
            from datetime import datetime
            import json
            # Retrieve ONTAP cluster admin account details from mounted K8s secrets
            usernameSecret = open('/mnt/secret/username', 'r')
            ontapClusterAdminUsername = usernameSecret.read().strip()
            passwordSecret = open('/mnt/secret/password', 'r')
            ontapClusterAdminPassword = passwordSecret.read().strip()
        
            # Configure connection to ONTAP cluster/instance
            netappConfig.CONNECTION = NetAppHostConnection(
                host = ontapClusterMgmtHostname,
                username = ontapClusterAdminUsername,
                password = ontapClusterAdminPassword,
                verify = verifySSLCert
            )
        
            # Convert pv name to ONTAP volume name
            # The following will not work if you specified a custom storagePrefix when creating your
            #   Trident backend. If you specified a custom storagePrefix, you will need to update this
            #   code to match your prefix.
            sourceVolumeName = 'trident_%s' % sourcePvName.replace("-", "_")
            print('\nSource pv name: ', sourcePvName)
            print('Source ONTAP volume name: ', sourceVolumeName)
            # Create clone
            sourceVolume = Volume.find(name = sourceVolumeName)
            timestamp = datetime.today().strftime("%Y%m%d_%H%M%S")
            cloneVolumeName = 'kfp_clone_%s' % timestamp
            cloneVolume = Volume.from_dict({
                'name': cloneVolumeName,
                'svm': sourceVolume.to_dict()['svm'],
                'clone': {
                    'is_flexclone':'true',
                    'parent_volume': sourceVolume.to_dict()
                },
                'nas': {
                    'path': '/%s' % cloneVolumeName
                }
            })
            response = cloneVolume.post()
            print("\nAPI Response:")
            print(response.http_response.text)
            # Retrieve clone volume details
            cloneVolume.get()
            # Convert clone volume details to JSON string
            cloneVolumeDetails = cloneVolume.to_dict()
            print("\nClone Volume Details:")
            print(json.dumps(cloneVolumeDetails, indent=2))
            # Return name of new clone volume
            return cloneVolumeDetails['name']
        # Convert netappClone function to Kubeflow Pipeline ContainerOp named 'NetappCloneOp'
        NetappCloneOp = comp.func_to_container_op(netappClone, base_image='python:3')
        # Define Kubeflow Pipeline
        @dsl.pipeline(
            name="Create Data Scientist Workspace",
            description="Template for cloning dataset volume in order to create data scientist/developer workspace"
        )
        def create_data_scientist_workspace(
            # Define variables that the user can set in the pipelines UI; set default values
            ontap_cluster_mgmt_hostname: str = "10.61.188.40",
            ontap_cluster_admin_acct_k8s_secret: str = "ontap-cluster-mgmt-account",
            ontap_api_verify_ssl_cert: bool = True,
            workspace_name: str = "dev",
            jupyter_namespace: str = "admin",
            dataset_volume_pv_existing: str = "pvc-db963a53-abf2-4ffa-9c07-8815ce78d506",
            trident_storage_class: str = "ontap-ai-flexvols-retain",
            trident_namespace: str = "trident",
            trident_backend: str = "ontap-ai"
        ) :
            # Pipeline Steps:
            # Create a clone of the source dataset volume
            dataset_clone = NetappCloneOp(
                ontap_cluster_mgmt_hostname,
                dataset_volume_pv_existing,
                ontap_api_verify_ssl_cert
            )
            # Mount k8s secret containing ONTAP cluster admin account details
            dataset_clone.add_pvolumes({
                '/mnt/secret': k8s_client.V1Volume(
                    name='ontap-cluster-admin',
                    secret=k8s_client.V1SecretVolumeSource(
                        secret_name=ontap_cluster_admin_acct_k8s_secret
                    )
                )
            })
            # Retrieve clone volume name from op output
            clone_volume_Name = dataset_clone.output
            # Convert clone volume name to allowed pvc name (for user instructions)
            workspace_pvc_name = 'dataset-workspace-' + str(workspace_name)
            # Define user instructions
            user_instructions = '''
        1) Execute the following commands against your Kubernetes cluster:
        cat << EOD > import-pvc-pipeline-clone.yaml
        kind: PersistentVolumeClaim
        apiVersion: v1
        metadata:
          name: %s
          namespace: %s
        spec:
          accessModes:
            - ReadWriteMany
          storageClassName: %s
        EOD
        tridentctl -n %s import volume %s %s -f ./import-pvc-pipeline-clone.yaml
        2) From Kubeflow "Notebook Servers" dashboard, provision a new Jupyter workspace in namespace, "%s", and mount dataset pvc, "%s".
        ''' % (workspace_pvc_name, jupyter_namespace, trident_storage_class, trident_namespace, trident_backend, clone_volume_Name, jupyter_namespace, workspace_pvc_name)
            # Print instructions for deploying an interactive workspace
            print_instructions = dsl.ContainerOp(
                name="print-instructions",
                image="ubuntu:bionic",
                command=["sh", "-c"],
                arguments=["echo '%s'" % user_instructions]
            )
            # State that instructions should be printed after clone is created
            print_instructions.after(dataset_clone)
        if __name__ == '__main__' :
            import kfp.compiler as compiler
            compiler.Compiler().compile(create_data_scientist_workspace, __file__ + '.yaml')
        EOF
        $ python3 create-data-scientist-workspace.py
        $ ls create-data-scientist-workspace.py.yaml
        create-data-scientist-workspace.py.yaml
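
        As an optional alternative to uploading the compiled .yaml file through the web UI in the following steps, you can also upload the pipeline with the Kubeflow Pipelines SDK. The commands below are only a sketch: the host URL is a placeholder, and multi-user Kubeflow deployments may require additional authentication when constructing the client.

        $ python3 << EOF
        import kfp

        # Placeholder endpoint; point this at your Kubeflow Pipelines API/gateway
        client = kfp.Client(host='http://<kubeflow-gateway>/pipeline')

        # Upload the compiled pipeline definition created in this step
        client.upload_pipeline(
            pipeline_package_path='create-data-scientist-workspace.py.yaml',
            pipeline_name='create-data-scientist-workspace'
        )
        EOF
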
  4. From the Kubeflow central dashboard, click Pipelines in the main menu to navigate to the Kubeflow Pipelines administration page.

  5. Click Upload Pipeline to upload your pipeline definition.

  6. Choose the .yaml file containing your pipeline definition that you created in step 3, give your pipeline a name, and click Upload.

  7. You should now see your new pipeline in the list of pipelines on the pipeline administration page. Click your pipeline’s name to view it.

  8. Review your pipeline to confirm that it looks correct.

  9. Click Create run to run your pipeline.

  10. You are now presented with a screen from which you can start a pipeline run. Create a name for the run, enter a description, choose an experiment to file the run under, and choose whether you want to initiate a one-off run or schedule a recurring run.

  11. Define parameters for the run, and then click Start. In the following example, the default values are accepted for most parameters. The name of an already-existing PV is entered for dataset_volume_pv_existing. The value admin is entered for jupyter_namespace because this is the namespace in which we intend to provision a new Jupyter Notebook workspace. Note that you defined the default values for the parameters within your pipeline definition (see step 3).

  12. You are now presented with a screen that lists all runs under the chosen experiment. Click the name of the run that you just started to view it.

    At this point, the run is likely still in progress.

  13. Confirm that the run completed successfully. When the run is complete, every stage of the pipeline shows a green check mark icon.

  14. Click the netappclone stage, and then click Logs to view output for that stage.

  15. Click the print-instructions stage, and then click Logs to view the printed instructions. See the section Provision a Jupyter Notebook Workspace for details on creating a Jupyter Notebook workspace.
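
    After you execute the printed commands against your cluster, you can confirm that the imported dataset clone is ready to mount before provisioning the new Jupyter workspace. The namespace and PVC name below are hypothetical examples that match the default pipeline parameters; verify that the STATUS column shows Bound:

    $ kubectl -n admin get pvc dataset-workspace-dev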
