Optimize a ResNet50* V1.5 FP32 Training Package with TensorFlow* for Kubernetes*

Published: 11/20/2020



Download Command

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/resnet50v1-5-fp32-training-k8s.tar.gz

Description

This document has instructions for running ResNet50* v1.5 FP32 training using Intel® Optimization for TensorFlow* on Kubernetes*.

The ImageNet dataset is used in these ResNet50 v1.5 Kubernetes examples. Download and preprocess the ImageNet dataset using the instructions here. After running the conversion script you should have a directory with the ImageNet dataset in the TF records format.
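As a sanity check after preprocessing, you can confirm the dataset directory actually contains TF record shards. The shard naming below (train-00000-of-01024, and so on) follows the common ImageNet conversion convention; the sketch fakes two shards in a temporary directory purely for illustration, so adjust the pattern if your converter names files differently.

```shell
# Sanity-check sketch: confirm the preprocessed ImageNet directory holds
# TF record shards. Shard names follow the usual conversion convention
# (train-00000-of-01024, ...); adjust if your converter differs.
DATASET_DIR=${DATASET_DIR:-/tmp/imagenet-demo}

# For illustration only: fake two shards so the check below has input.
mkdir -p "$DATASET_DIR"
touch "$DATASET_DIR/train-00000-of-01024" "$DATASET_DIR/validation-00000-of-00128"

# Count the training shards present.
train_shards=$(ls "$DATASET_DIR" | grep -c '^train-')
echo "found $train_shards training shard(s) in $DATASET_DIR"
```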

Kubernetes*

Download and untar the ResNet50 v1.5 FP32 training package:

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/resnet50v1-5-fp32-training-k8s.tar.gz
tar -xvf resnet50v1-5-fp32-training-k8s.tar.gz

Prerequisites

Both single and multi-node deployments use Kustomize v3.8.7 to configure deployment parameters. Download Kustomize v3.8.7, extract it, and move it to a directory within your PATH. Verify that you've installed the correct version by running kustomize version. On OS X* you would see:

{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:19:38Z GoOs:darwin GoArch:amd64}
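Since a later or earlier Kustomize may behave differently, it can be worth gating your setup script on the required version. The sketch below matches against the version string; the sample string is the output shown above, while in practice you would capture it with `version_output=$(kustomize version)`.

```shell
# Version-gate sketch: proceed only if kustomize v3.8.7 is on the PATH.
# The hard-coded sample stands in for `version_output=$(kustomize version)`.
version_output='{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:19:38Z GoOs:darwin GoArch:amd64}'

case "$version_output" in
  *kustomize/v3.8.7*) status=ok ;;
  *)                  status=wrong-version ;;
esac
echo "kustomize check: $status"
```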

Execution

The Kubernetes* package for ResNet50 v1.5 FP32 training includes single and multi-node Kubernetes deployments. Each deployment covers common use cases with the storage and security variations found across different Kubernetes installations. The directory tree within the model package is shown below: the single and multi-node directories sit below the mlops directory, and the common use cases are found under each of them:

quickstart
└── mlops
      ├── multi-node
      │       ├── user-allocated-pvc
      │       └── user-mounted-nfs
      └── single-node
              ├── user-allocated-pvc
              └── user-mounted-nfs

Multi-node Distributed Training

The multi-node use cases (user-allocated-pvc, user-mounted-nfs) make the following assumptions:

  • The message passing interface (MPI) operator has been deployed on the cluster by DevOps (see below).
  • The OUTPUT_DIR parameter is a shared volume that is writable and available cluster wide.
  • The DATASET_DIR parameter is a dataset volume, also available cluster wide (for example, on ZFS or other performant storage). Typically these volumes are read-only.

DevOps

The Kubernetes resources needed to run the multi-node resnet50v1-5-fp32-training-k8s quickstart require deployment of a message passing interface (MPI) operator. See the MPI Operator deployment section of the Kubernetes DevOps document for instructions.

Once these resources have been deployed, the mlops user then has a choice of running resnet50v1-5-fp32-training-k8s multi-node (distributed training) or single-node.

Mlops

Distributed training is done by posting an MPI job to the k8s api-server, where it is handled by the MPI operator that was deployed by DevOps. The operator parses the MPI job and then runs the launcher and workers it specifies. The launcher and workers communicate through Horovod*, and the distributed training run itself is handled by mpirun.
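For orientation, an MPI job of this kind is typically expressed with an MPIJob custom resource like the sketch below. This is an illustrative outline assuming the Kubeflow MPI operator's kubeflow.org/v1 API; the image, command, and replica counts are placeholders, not the package's actual manifest (which kustomize build generates for you).

```yaml
# Illustrative MPIJob sketch (kubeflow.org/v1 API); all values are placeholders.
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: resnet50v1-5-fp32-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: <training image>                    # placeholder
              command: ["mpirun", "-np", "2", "bash", "<training script>"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: <training image>                    # placeholder
```

The operator creates one launcher pod and the requested worker pods from this single resource, which is why the log-viewing steps later in this document go through the launcher pod.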

In a terminal, cd to the multi-node directory. Each use case under this directory has parameters that can be changed using Kustomize's cfg set command.

User-mounted NFS and user-allocated PVC parameters:

NAME         VALUE      DESCRIPTION
DATASET_DIR  /datasets  input dataset directory
FS_ID        0          owner ID of mounted volumes
GROUP_ID     0          process group ID
GROUP_NAME   root       process group name
NFS_PATH     /nfs       NFS path
NFS_SERVER   0.0.0.0    NFS server
PVC_NAME     workdisk   model-builder
PVC_PATH     /pvc       model-builder
USER_ID      0          process owner ID
USER_NAME    root       process owner name

For the user-mounted NFS use case, change NFS_PATH and NFS_SERVER.

For the user-allocated PVC use case, change PVC_NAME and PVC_PATH.

For example, to change the NFS_SERVER address, run:

kustomize cfg set . NFS_SERVER <ip address> -R

To change the PVC_NAME, run:

kustomize cfg set . PVC_NAME <PVC Name> -R

In both use cases, you should change the values below so the pod is deployed with the user's identity.

kustomize cfg set . FS_ID <Group ID> -R
kustomize cfg set . GROUP_ID <Group ID> -R
kustomize cfg set . GROUP_NAME <Group Name> -R
kustomize cfg set . USER_ID <User ID> -R
kustomize cfg set . USER_NAME <User Name> -R
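The identity values can be taken directly from id(1). This sketch only derives and prints them; you would then pass each value to the corresponding kustomize cfg set call above.

```shell
# Sketch: derive the identity parameters from id(1) for the cfg set calls above.
user_id=$(id -u)
group_id=$(id -g)
user_name=$(id -un)
group_name=$(id -gn)

echo "USER_ID=$user_id USER_NAME=$user_name GROUP_ID=$group_id GROUP_NAME=$group_name"
```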

Change the default namespace of all resources by running the following Kustomize command:

pushd <use-case>
kustomize edit set namespace <User's namespace>
popd

This will place all resources within the specified namespace. Note: this namespace should be created prior to deployment.
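If the namespace does not exist yet, it can be created ahead of time from a minimal manifest like the one below (the name is a placeholder to replace before applying).

```yaml
# Minimal namespace manifest; replace the name before applying.
apiVersion: v1
kind: Namespace
metadata:
  name: <User's namespace>
```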

To change your default kubectl context, run:

kubectl config set-context --current --namespace=<User's namespace>

After you change the parameter values, deploy the use case by running:

kustomize build <use-case> > <use-case>.yaml
kubectl apply -f <use-case>.yaml

Multi-node training output

To view the log output of the resnet50v1_5 MPI job, view the logs of the launcher pod, which aggregates output from the worker pods. Find this pod by filtering the list of pods for the name 'launcher':

kubectl get pods -oname|grep launch|cut -c5-

This can be combined with the kubectl logs subcommand to tail the output of the training job:

kubectl logs -f $(kubectl get pods -oname|grep launch|cut -c5-)
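The grep/cut pipeline above simply strips the leading pod/ prefix (four characters) from kubectl's -oname output. The sketch below runs the same pipeline on hypothetical pod names to show what it yields; the names are made up for illustration.

```shell
# Demo of the name-extraction pipeline on hypothetical
# `kubectl get pods -oname` output (pod names are made up).
pods='pod/resnet50v1-5-fp32-training-launcher-abc12
pod/resnet50v1-5-fp32-training-worker-0
pod/resnet50v1-5-fp32-training-worker-1'

# Keep only the launcher line and drop the "pod/" prefix.
launcher=$(echo "$pods" | grep launch | cut -c5-)
echo "$launcher"
```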

Note that the mpirun parameter -output-filename causes a segfault when attempting to write to an NFS-mounted $OUTPUT_DIR if the securityContext has been changed to run as the user's UID/GID.

Multi-node Training Cleanup

Removing the MPI job and related resources is done by running:

kubectl delete -f <use-case>.yaml

Single-node training

Single-node training is similar to the Docker* use case, but the command runs within a pod. Training is done by submitting a pod.yaml to the k8s api-server, which creates the pod and runs the fp32_training_demo.sh script within the pod's container.
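Conceptually, the submitted pod looks like the sketch below. This is an illustrative outline assuming the NFS-backed use case; the image name, mount details, and UID/GID values are placeholders, and the real pod.yaml generated by kustomize build will differ.

```yaml
# Illustrative single-node training pod; all values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: resnet50v1-5-fp32-training
spec:
  securityContext:
    runAsUser: 1000      # set from USER_ID
    runAsGroup: 1000     # set from GROUP_ID
  containers:
    - name: training
      image: <training image>              # placeholder
      command: ["bash", "fp32_training_demo.sh"]
      volumeMounts:
        - name: datasets
          mountPath: /datasets             # DATASET_DIR
  volumes:
    - name: datasets
      nfs:
        server: <NFS_SERVER>
        path: <NFS_PATH>
```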

In a terminal, cd to the single-node directory. Each use case under this directory has parameters that can be changed using Kustomize's cfg set command.

User-mounted NFS and user-allocated PVC parameters:

NAME         VALUE      DESCRIPTION
DATASET_DIR  /datasets  input dataset directory
FS_ID        0          owner ID of mounted volumes
GROUP_ID     0          process group ID
GROUP_NAME   root       process group name
NFS_PATH     /nfs       NFS path
NFS_SERVER   0.0.0.0    NFS server
PVC_NAME     workdisk   model-builder
PVC_PATH     /pvc       model-builder
USER_ID      0          process owner ID
USER_NAME    root       process owner name

For the user-mounted NFS use case, change NFS_PATH and NFS_SERVER.

For the user-allocated PVC use case, change PVC_NAME and PVC_PATH.

For example, to change the NFS_SERVER address, run:

kustomize cfg set . NFS_SERVER <ip address> -R

To change the PVC_NAME, run:

kustomize cfg set . PVC_NAME <PVC Name> -R

In both use cases, you should change the values below so the pod is deployed with the user's identity.

kustomize cfg set . FS_ID <Group ID> -R
kustomize cfg set . GROUP_ID <Group ID> -R
kustomize cfg set . GROUP_NAME <Group Name> -R
kustomize cfg set . USER_ID <User ID> -R
kustomize cfg set . USER_NAME <User Name> -R

Change the default namespace of all resources by running the following Kustomize command:

pushd <use-case>
kustomize edit set namespace <User's namespace>
popd

This will place all resources within the specified namespace. Note: this namespace should be created prior to deployment.

To change your default kubectl context, run:

kubectl config set-context --current --namespace=<User's namespace>

After you change the parameter values, deploy the use case by running:

kustomize build <use-case> > <use-case>.yaml
kubectl apply -f <use-case>.yaml

Single-node Training Output

To view the log output of the resnet50v1_5 pod, view the logs of the training pod. Find this pod by filtering the list of pods for the name 'training':

kubectl get pods -oname|grep training|cut -c5-

This can be combined with the kubectl logs subcommand to tail the output of the training job:

kubectl logs -f $(kubectl get pods -oname|grep training|cut -c5-)

Single-node training cleanup

Removing the pod and related resources is done by running:

kubectl delete -f <use-case>.yaml

Documentation and Sources

Get Started
Main GitHub*
Readme
Release Notes
Get Started Guide

Code Sources
Report Issue


License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.


Related Containers and Solutions

ResNet50 V1.5 FP32 Training TensorFlow* Container
ResNet50 V1.5 FP32 Training TensorFlow* Model Package


Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.