Running distributed machine learning workloads has been a hot topic lately. Intel has shared documents that walk through the process of using Kubeflow* to run distributed TensorFlow* jobs on Kubernetes, a blog on using the Volume Controller for Kubernetes (KVC) for data management on clusters, and a blog describing a real-world use case in which a distributed TensorFlow model predicts the location of tumors in brain scan images. Although running distributed workloads on a cluster sounds attractive to many data scientists, it is almost never easy.
Introducing Machine Learning Container Templates
Intel is introducing Machine Learning Container Templates (MLT) v0.1.2. MLT is a new open-source command line tool used for streamlining the creation and deployment of machine learning jobs on Kubernetes. MLT bridges the gap between the data scientist and the infrastructure engineer by providing templates that serve as a starting point for machine learning jobs, and simple commands to build Docker images and deploy jobs on the cluster. At the same time, data scientists have complete flexibility to customize their application, because MLT templates include the raw ingredients that are used to build and deploy the job (such as the Dockerfile and Kubernetes resource file).
We believe that MLT is like the "Keras of Kubernetes" because it provides simple commands that let data scientists get started with running distributed model training jobs on Kubernetes without having to be DevOps experts.
The MLT workflow
In creating their application, users begin by listing the different templates MLT has to offer and selecting the one that most closely resembles their use-case. When the app is initialized, it creates a directory that includes a Dockerfile, a template for the Kubernetes job manifest, a configuration file, and the model training code. In the example below, we list the templates and then initialize an app based on the distributed TensorFlow MNIST template.
$ mlt templates list
Template              Description
--------------        --------------------------------------------------------------------------------------------------
hello-world           A TensorFlow python HelloWorld example run through Kubernetes Jobs.
pytorch               Sample distributed application taken from http://pytorch.org/tutorials/intermediate/dist_tuto.html
pytorch-distributed   A distributed PyTorch MNIST example run using the pytorch-operator.
tf-dist-mnist         A distributed TensorFlow MNIST model which designates worker 0 as the chief.
tf-distributed        A distributed TensorFlow matrix multiplication run through the TensorFlow Kubernetes Operator.

$ mlt init my-app --template tf-dist-mnist --namespace dina
[master (root-commit) b2ded22] Initial commit.
 8 files changed, 502 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 Dockerfile
 create mode 100644 Makefile
 create mode 100644 README.md
 create mode 100644 crd-requirements.txt
 create mode 100644 k8s-templates/tfjob.yaml
 create mode 100644 main.py
 create mode 100644 requirements.txt

$ cd my-app/
$ ls
Dockerfile  Makefile  README.md  crd-requirements.txt  k8s  k8s-templates  main.py  mlt.json  requirements.txt

$ mlt config list
Parameter Name                   Value
-------------------------------  ----------------------
gceProject                       cluster-123456
namespace                        dina
name                             my-app
template_parameters.num_ps       1
template_parameters.num_workers  2
MLT’s templates work out-of-the-box, which means that after the user has initialized their app, they can use MLT’s build and deploy commands to run the app on their Kubernetes cluster. After deploying the application, we check the status of the job and see that the pods are running on the cluster, then view the logs. When the job is done, MLT’s undeploy command is used to delete the job and free resources on the cluster.
$ mlt build
Starting build my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12
Building -  (Elapsed Time: 0:00:02)
Built my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12

$ mlt deploy
Pushing  (Elapsed Time: 0:00:21)
Pushed to gcr.io/cluster-123456/my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12
Deploying gcr.io/cluster-123456/my-app:ba4ab530-42f9-4a0e-8036-7e41e5409f12
Inspect created objects by running:
$ kubectl get --namespace=dina all

$ mlt status
TF Job:
NAME                                          AGE
my-app-62d0687d-a375-43ea-b4be-391a33750be9   5s

Pods:
NAME                                                           READY   STATUS    RESTARTS   AGE
my-app-62d0687d-a375-43ea-b4be-391a33750-ps-1cjq-0-lu16r       1/1     Running   0          6s
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-0-ywxoy   1/1     Running   0          6s
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-1-0cbd1   1/1     Running   0          6s

$ mlt logs
Checking for pod(s) readiness
Will tail 3 logs...
my-app-62d0687d-a375-43ea-b4be-391a33750-ps-1cjq-0-lu16r
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-0-ywxoy
my-app-62d0687d-a375-43ea-b4be-391a33750-worker-1cjq-1-0cbd1
...

$ mlt undeploy
Next, we can perform iterative development by making model updates, then rebuilding and redeploying our app using MLT. MLT also has a --watch option for doing automatic rebuilds when file changes are detected and an --interactive option that gives the user a shell into the container. The Dockerfiles in the MLT templates are designed so that subsequent builds to update source files are faster than the initial build.
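The fast-rebuild behavior comes from ordering the Dockerfile so that dependency installation happens before the source code is copied in, letting Docker's layer cache skip everything above the line that changed. A minimal sketch of that pattern follows; the base image and file names are illustrative, not necessarily MLT's exact template:

```dockerfile
# Dependency layers first: these are rebuilt only when
# requirements.txt itself changes.
FROM tensorflow/tensorflow:1.8.0
COPY requirements.txt /src/requirements.txt
RUN pip install -r /src/requirements.txt

# Source code last: editing main.py invalidates only this final
# layer, so iterative rebuilds stay fast.
COPY main.py /src/main.py
CMD ["python", "/src/main.py"]
```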
Locating brain tumors using MLT
Now that we have walked through the basics of MLT, let’s look at how to adapt this app to a real-world use case. For this example, we will be using a U-Net model which predicts the location of tumors in brain scan images using the BraTS dataset. Details on this model are described in a previous blog.
To get the dataset downloaded onto the nodes in the Kubernetes cluster, we used the Volume Controller for Kubernetes (KVC). (We won’t go through the whole process of using KVC; there is already a blog discussing this.)
After KVC finished downloading the dataset to the nodes, we got the node list and host path from the volumemanager custom resource and used this information to add a node affinity and volume mount in the k8s-templates/tfjob.yaml file. In this same file, we also added environment variables with our cloud storage credentials and bucket information, since the model will be writing checkpoint files to the cloud.
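To make the shape of those edits concrete, here is a sketch of the kind of fragment that goes into the pod template inside k8s-templates/tfjob.yaml. The node names, host path, and credential variable names below are placeholders (the real values come from the volumemanager custom resource and your cloud provider), not a copy of our actual manifest:

```yaml
# Fragment of a tfjob.yaml pod template -- placeholder values throughout.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:            # node list from the KVC volumemanager
                - node-1
                - node-2
      containers:
      - name: tensorflow
        env:
        - name: STORAGE_ACCESS_KEY   # cloud credentials for checkpoint writes
          value: <access-key>
        - name: STORAGE_BUCKET
          value: <bucket-name>
        volumeMounts:
        - name: dataset
          mountPath: /data
      volumes:
      - name: dataset
        hostPath:
          path: /var/datasets/kvc-resource   # host path from the volumemanager
```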
Next, we replaced the original MNIST main.py file with the U-Net model training Python files (from the GitHub repo here). We updated the Dockerfile to execute test_dist.py, which is the name of the main model training script. Because we are using the TFJob operator, modifications were needed so that the model training script gets the cluster information (list of workers and parameter servers, job name, and task index) from the TF_CONFIG environment variable instead of from flags. Lastly, we updated the requirements.txt file to include libraries that are required to run this particular model.
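The TF_CONFIG environment variable set by the TFJob operator is a JSON document containing the cluster spec and the current task's identity. A small sketch of how a training script might read it in place of command-line flags (the function name here is ours, not part of the MLT template):

```python
import json
import os


def read_tf_config():
    """Parse the TF_CONFIG environment variable set by the TFJob operator.

    Returns the worker list, parameter server list, job name, and task
    index that the training script would otherwise have taken as flags.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return (cluster.get("worker", []),
            cluster.get("ps", []),
            task.get("type"),
            task.get("index"))
```

The returned lists and indices can then be fed to tf.train.ClusterSpec and tf.train.Server just as the flag-based values were.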
After making those changes, we are ready to rebuild and deploy the model using MLT:
$ mlt build
Starting build distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a
Building  (Elapsed Time: 0:02:26)
Built distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a

$ mlt deploy
Pushing |  (Elapsed Time: 0:01:43)
Pushed to gcr.io/cluster-123456/distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a
Deploying gcr.io/cluster-123456/distributed-unet:9d271e57-e27c-47d8-b1a8-34a9d824c03a
Inspect created objects by running:
$ kubectl get --namespace=dina all

$ mlt status
TF Job:
NAME                                                    AGE
distributed-unet-e925bc66-3766-4bf2-bbdd-cd20eb0cb033   3m

Pods:
NAME                                                           READY   STATUS    RESTARTS   AGE
distributed-unet-e925bc66-3766-4bf2-bbdd-ps-fkoj-0-4kupn       1/1     Running   0          3m
distributed-unet-e925bc66-3766-4bf2-bbdd-worker-fkoj-0-yc96p   1/1     Running   0          3m
distributed-unet-e925bc66-3766-4bf2-bbdd-worker-fkoj-1-divnx   1/1     Running   0          3m
distributed-unet-e925bc66-3766-4bf2-bbdd-worker-fkoj-2-zac9u   1/1     Running   0          3m
For our model, checkpoint files are saved to a cloud storage location, so we can point TensorBoard to that location and watch the progress of the model training. The screenshot below was taken after letting it train for a while. The first row of images is the ground truth, the second row has the brain scan image, and the last row is our prediction. As you can see, the predictions are getting pretty close!
We have demonstrated a simple example that runs an MLT template out-of-the-box, as well as a more complex use-case that uses volume mounts from KVC and modifies the MLT template to run the U-Net model on a dataset of brain scan images. We found that starting out with a simple working example helped ease the process of moving to a more complex use-case. And of course, it didn’t work on the first try, so MLT’s build and deploy features helped to simplify the iterative development process.
We are constantly adding new features to MLT in order to further streamline the process of deploying machine learning workloads on Kubernetes. Upcoming features in our pipeline include a Horovod* template, code syncing commands in order to reduce the iteration time, and a hyperparameter experiments template.
After you’ve used MLT, please share with us any feature requests or ideas you may have to improve the process of running machine learning jobs using Kubernetes.
Intel AI Blog: Biomedical Image Segmentation with U-Net:
MLT example using the distributed U-Net model:
Distributed U-Net model:
Information on using the BraTS datasets:
Volume Controller for Kubernetes (KVC):
The Kubeflow TF-Operator:
Notices and Disclaimers
Whenever using and/or referring to the BraTS datasets, please cite: https://www.ncbi.nlm.nih.gov/pubmed/25494501
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
© Intel Corporation
*Other names and brands may be claimed as the property of others.