Example Workflow - Train an Image Recognition Model Using Kubeflow and the NetApp DataOps Toolkit
This section describes the steps involved in training and deploying a neural network for image recognition using Kubeflow and the NetApp DataOps Toolkit. It is intended as an example of a training job that incorporates NetApp storage.
Prerequisites
Create a Dockerfile with the required configurations to use for the train and test steps within the Kubeflow pipeline.
Here is an example Dockerfile:

```dockerfile
FROM pytorch/pytorch:latest
RUN pip install torchvision numpy scikit-learn matplotlib tensorboard
WORKDIR /app
COPY train_mnist.py /app/train_mnist.py
CMD ["python", "train_mnist.py"]
```
Depending on your requirements, install any additional libraries and packages needed to run your program. These steps assume that you already have a working Kubeflow deployment.
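The `train_mnist.py` script referenced in the Dockerfile could look roughly like the sketch below. The layer sizes, learning rate, and other hyperparameters here are illustrative assumptions, not values prescribed by this workflow.

```python
# Sketch of a train_mnist.py: a 2-layer feedforward network for 28x28 MNIST
# digits. Hyperparameters (hidden size, lr, epochs) are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerNet(nn.Module):
    """2-layer feedforward network: 784 inputs -> hidden -> 10 digit classes."""

    def __init__(self, hidden_size=128, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten the 28x28 images
        return self.fc2(F.relu(self.fc1(x)))


def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one training epoch and return the mean cross-entropy loss."""
    model.train()
    total_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)


if __name__ == "__main__":
    from torchvision import datasets, transforms

    # 60,000 training images; the 10,000-image split is used for validation
    train_set = datasets.MNIST("./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    model = TwoLayerNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for epoch in range(5):
        print(f"epoch {epoch}: loss {train_one_epoch(model, loader, optimizer):.4f}")
```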
Train a Small NN on MNIST Data Using PyTorch and Kubeflow Pipelines
We use the example of a small neural network trained on MNIST data. The MNIST dataset consists of handwritten images of the digits 0-9. The images are 28x28 pixels in size, and the dataset is divided into 60,000 training images and 10,000 validation images. The neural network used for this experiment is a 2-layer feedforward network. Training is executed using Kubeflow Pipelines; refer to the Kubeflow Pipelines documentation for more information. Our Kubeflow pipeline incorporates the Docker image from the Prerequisites section.
![Kubeflow Pipeline Run Visualization](./../media/kubeflow_pipeline.png)
Visualize Results Using Tensorboard
Once the model is trained, we can visualize the results using TensorBoard. TensorBoard is available as a feature on the Kubeflow Dashboard, and you can create a custom TensorBoard for your job. The example below plots training accuracy and training loss against the number of epochs.
![Tensorboard graph for training loss and accuracy](./../media/tensorboard_graph.png)
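To produce such plots, the training script needs to write per-epoch scalars to a TensorBoard log directory. The sketch below shows one way to do this with PyTorch's `SummaryWriter`; the tag names, log directory, and metric values are illustrative assumptions.

```python
# Sketch of logging per-epoch metrics so TensorBoard can plot accuracy and
# loss vs. epochs. Tag names and values here are illustrative only.
import tempfile
from torch.utils.tensorboard import SummaryWriter


def log_metrics(log_dir, history):
    """Write (accuracy, loss) pairs, one per epoch, as TensorBoard scalars."""
    writer = SummaryWriter(log_dir=log_dir)
    for epoch, (acc, loss) in enumerate(history):
        writer.add_scalar("train/accuracy", acc, epoch)
        writer.add_scalar("train/loss", loss, epoch)
    writer.close()
    return log_dir


if __name__ == "__main__":
    # e.g. accuracy rising and loss falling over three epochs
    log_metrics(tempfile.mkdtemp(), [(0.85, 0.5), (0.92, 0.3), (0.96, 0.2)])
```

Point the TensorBoard instance created from the Kubeflow Dashboard at this log directory to view the graphs.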
Experiment with Hyperparameters Using Katib
Katib is a tool within Kubeflow for experimenting with model hyperparameters. To create an experiment, first define a target metric or goal; this is usually the test accuracy. Once the metric is defined, choose the hyperparameters you would like to tune (for example, the optimizer, learning rate, or number of layers). Katib performs a hyperparameter sweep over the user-defined values to find the combination of parameters that best satisfies the target metric. You can define these parameters in each section of the UI, or alternatively define a YAML file with the necessary specifications. Below is an illustration of a Katib experiment:
![Katib Experiment Dashboard with hyperparameters](./../media/katib_experiment_1.png)
![Successful trial check](./../media/katib_experiment_2.png)
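A YAML specification for such an experiment could be sketched as below. The experiment name, metric name, parameter ranges, and trial container are illustrative assumptions; adapt them to your own training job.

```yaml
# Hypothetical Katib Experiment spec; names, ranges, and counts are
# illustrative, not taken from the experiment shown above.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: mnist-hyperparam-sweep
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: test-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - sgd
          - adam
  trialTemplate:
    primaryContainerName: training-container
    # trialSpec (the Job that runs train_mnist.py with the sampled
    # hyperparameter values) is omitted here for brevity
```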
Use NetApp Snapshots to Save Data for Traceability
During model training, we may want to save a snapshot of the training dataset for traceability. To do this, we can add a snapshot step to the pipeline as shown below. To create the snapshot, we can use the NetApp DataOps Toolkit for Kubernetes.
![Code to build a Snapshot pipeline in Kubeflow](./../media/kubeflow_snapshot.png)
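Such a snapshot step could be sketched as follows using the toolkit's Python library. The function and parameter names are assumptions based on the NetApp DataOps Toolkit for Kubernetes API, and the PVC name is a placeholder; consult the toolkit documentation for the exact interface.

```python
# Sketch of a pipeline snapshot step using the NetApp DataOps Toolkit for
# Kubernetes. The PVC name is a placeholder and the toolkit call is an
# assumption based on the toolkit's Python API.
def snapshot_dataset(pvc_name: str, namespace: str = "kubeflow"):
    """Create a NetApp Snapshot of the PVC that holds the training dataset."""
    # Imported lazily so this sketch can be read without the toolkit installed
    from netapp_dataops.k8s import create_volume_snapshot

    create_volume_snapshot(
        pvc_name=pvc_name,      # PVC backing the training dataset
        namespace=namespace,
        print_output=True,
    )


if __name__ == "__main__":
    snapshot_dataset("mnist-dataset-pvc")  # placeholder PVC name
```

Because the snapshot captures the dataset as it existed at training time, the trained model can later be traced back to the exact data it was trained on.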
Refer to the NetApp DataOps Toolkit example for Kubeflow for more information.