Wide & Deep Large Dataset FP32 Training TensorFlow* Kubernetes* Package

Published: 12/09/2020



Download Command

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/wide-deep-large-ds-fp32-training-k8s.tar.gz

Description

This document has instructions for running Wide & Deep FP32 training using Intel® Optimizations for TensorFlow* on Kubernetes*.

Dataset

The large Kaggle* Display Advertising Challenge Dataset will be used to train Wide & Deep. The data is from Criteo and has a field indicating if an ad was clicked (1) or not (0), along with integer and categorical features.

Download large Kaggle Display Advertising Challenge Dataset from Criteo Labs.

The directory where you've downloaded the train.csv and eval.csv files should be used as the DATASET_DIR when running quickstart scripts.
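
For example, if train.csv and eval.csv were downloaded to a directory on the node, the variable could be set as follows (the path shown is hypothetical):

export DATASET_DIR=/home/<user>/criteo-dataset
ls $DATASET_DIR   # should list train.csv and eval.csv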

Quick Start Scripts

Script name                      Description
fp32_training_check_accuracy.sh  Trains the model for a specified number of steps (default is 500) and then compares the accuracy against the specified target accuracy. If the target accuracy is not met, the script exits with error code 1. The CHECKPOINT_DIR environment variable can optionally be defined to start training from a previous set of checkpoints.
launch_benchmark.py              Trains the model for 10 epochs if --steps is not specified. The CHECKPOINT_DIR environment variable can optionally be defined to start training from a previous set of checkpoints.

Kubernetes*

Download and untar the model training package to get the yaml and config files for running training on a single node using Kubernetes.

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/wide-deep-large-ds-fp32-training-k8s.tar.gz
tar -xvf wide-deep-large-ds-fp32-training-k8s.tar.gz
Execution

The Kubernetes* package for Wide & Deep Large Dataset FP32 training includes single-node and pipeline Kubernetes deployments. Each deployment type contains use cases covering the storage and security variations that are common across different Kubernetes installations. The directory tree within the Kubernetes package is shown below: the single-node and pipeline directories sit under the mlops directory, and the use cases are found under each of them:

quickstart
└── mlops
    ├── pipeline
    │       ├── user-allocated-pvc
    │       └── user-mounted-nfs
    └── single-node
            ├── user-allocated-pvc
            └── user-mounted-nfs
Prerequisites

Both single-node and pipeline deployments use kustomize v3.8.7 to configure deployment parameters. Download and extract kustomize v3.8.7, then move the binary to a directory within your PATH. You can verify that you've installed the correct version by running kustomize version. On macOS you would see:

{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:19:38Z GoOs:darwin GoArch:amd64}
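
A minimal install sketch for Linux, assuming the standard kustomize GitHub release URL (adjust the archive name for your platform):

wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv3.8.7/kustomize_v3.8.7_linux_amd64.tar.gz
tar -xzf kustomize_v3.8.7_linux_amd64.tar.gz
mv kustomize /usr/local/bin/   # or any other directory on your PATH
kustomize version
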
Single-Node Training

Single-node training is similar to the Docker* use case, but is run within a pod. Training is done by submitting a pod.yaml to the k8s api-server, which creates the pod and runs the launch_benchmark.py script within the pod's container.

In a terminal, cd to the single-node directory. Each use case under this directory has parameters that can be changed using kustomize's cfg set.

User Mounted NFS and User Allocated PVC Parameter Values

NAME         VALUE      DESCRIPTION
DATASET_DIR  /datasets  input dataset directory
FS_ID        0          owner id of mounted volumes
GROUP_ID     0          process group id
GROUP_NAME   root       process group name
NFS_PATH     /nfs       nfs path
NFS_SERVER   0.0.0.0    nfs server
PVC_NAME     workdisk   model-builder
PVC_PATH     /pvc       model-builder
USER_ID      0          process owner id
USER_NAME    root       process owner name

For the user mounted NFS use case, the user should change NFS_PATH and NFS_SERVER.

For the user allocated PVC use case, the user should change PVC_NAME and PVC_PATH.

For example, to change the NFS_SERVER address, the user would run:

kustomize cfg set . NFS_SERVER <ip address> -R

To change the PVC_NAME the user would run:

kustomize cfg set . PVC_NAME <PVC Name> -R

In both use cases, the user should change the values below so the pod is deployed with the user's identity.

kustomize cfg set . FS_ID <Group ID> -R
kustomize cfg set . GROUP_ID <Group ID> -R
kustomize cfg set . GROUP_NAME <Group Name> -R
kustomize cfg set . USER_ID <User ID> -R
kustomize cfg set . USER_NAME <User Name> -R
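
For example, to deploy the pod with your current identity, the values can be taken from the id and whoami commands (a sketch, run from within the use case directory):

kustomize cfg set . FS_ID $(id -g) -R
kustomize cfg set . GROUP_ID $(id -g) -R
kustomize cfg set . GROUP_NAME $(id -gn) -R
kustomize cfg set . USER_ID $(id -u) -R
kustomize cfg set . USER_NAME $(whoami) -R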

The user should change the default namespace of all the resources by running the kustomize command:

pushd <use-case>
kustomize edit set namespace <User's namespace>
popd

This will place all resources within the specified namespace. Note: this namespace should be created prior to deployment.

The user can also change their default kubectl context by running:

kubectl config set-context --current --namespace=<User's namespace>

Once the user has changed parameter values they can then deploy the use-case by running:

kustomize build <use-case> > <use-case>.yaml
kubectl apply -f <use-case>.yaml
Single-Node Training Output

The log output of the training job can be viewed in the logs of the training pod. This pod can be found by filtering the list of pods for the name 'training':

kubectl get pods -oname|grep training|cut -c5-

This can be combined with the kubectl logs subcommand to tail the output of the training job:

kubectl logs -f $(kubectl get pods -oname|grep training|cut -c5-)
Single-Node Training Cleanup

Removing the pod and related resources is done by running:

kubectl delete -f <use-case>.yaml
Model Training and TF Serving Pipeline

This pipeline runs the following steps using an Argo workflow:

  1. Train the model on a single node using the fp32_training_check_accuracy.sh script. This script runs model training for a specified number of steps, exports the saved model, and compares the accuracy against the value specified in the TARGET_ACCURACY environment variable. If the model's accuracy does not meet the target accuracy, this step is retried and training continues from the previous checkpoints in the specified CHECKPOINT_DIR. If the TARGET_ACCURACY environment variable has not been defined, no accuracy check is done and the workflow continues on to the next step, regardless of the model's accuracy.
  2. Deploy TensorFlow Serving containers with the saved model
  3. Create a service that exposes the TensorFlow Serving containers as a NodePort

The TensorFlow Serving steps in this pipeline follow the TensorFlow Serving with Kubernetes instructions, with the exception that they do not use a Google Cloud Kubernetes cluster. Since the Kubernetes cluster being used does not have a load balancer, the service is configured as a NodePort, which allows external requests. In a terminal, cd to the pipeline directory. Each use case under this directory has parameters that can be changed using kustomize's cfg set.

User Mounted NFS and User Allocated PVC Parameter Values

NAME             VALUE      DESCRIPTION
DATASET_DIR      /datasets  input dataset directory
FS_ID            0          owner id of mounted volumes
GROUP_ID         0          process group id
GROUP_NAME       root       process group name
NFS_PATH         /nfs       nfs path
NFS_SERVER       0.0.0.0    nfs server
PVC_NAME         workdisk   model-builder
PVC_PATH         /pvc       model-builder
REPLICAS         3          replica number
RETRY_LIMIT      10         retry limit
TARGET_ACCURACY  0.74       target accuracy
TF_SERVING_PORT  8501       tf serving port
USER_ID          0          process owner id
USER_NAME        root       process owner name

For the user mounted NFS use case, the user should change NFS_PATH and NFS_SERVER.

For the user allocated PVC use case, the user should change PVC_NAME and PVC_PATH.

For example, to change the NFS_SERVER address, the user would run:

kustomize cfg set . NFS_SERVER <ip address> -R

To change the PVC_NAME the user would run:

kustomize cfg set . PVC_NAME <PVC Name> -R
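
Pipeline-specific parameters can be set the same way. For example, a sketch using the default values from the table above:

kustomize cfg set . TARGET_ACCURACY 0.74 -R
kustomize cfg set . REPLICAS 3 -R
kustomize cfg set . RETRY_LIMIT 10 -R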

In both use cases, the user should change the values below so the Argo workflow is deployed with the user's identity.

kustomize cfg set . FS_ID <Group ID> -R
kustomize cfg set . GROUP_ID <Group ID> -R
kustomize cfg set . GROUP_NAME <Group Name> -R
kustomize cfg set . USER_ID <User ID> -R
kustomize cfg set . USER_NAME <User Name> -R

The user should change the default namespace of all the resources by running the kustomize command:

pushd <use-case>
kustomize edit set namespace <User's namespace>
popd

This will place all resources within the specified namespace. Note: this namespace should be created prior to deployment.

The user can also change their default kubectl context by running:

kubectl config set-context --current --namespace=<User's namespace>

Once the user has changed parameter values they can then deploy the use-case by running:

kustomize build <use-case> > <use-case>.yaml
kubectl apply -f <use-case>.yaml

Once the job has been submitted, the status and logs can be viewed using the Argo user interface or from the command line using kubectl or argo. The commands below describe how to use kubectl to see the workflow, pods, and log files:

$ kubectl get wf
$ kubectl get pods
$ kubectl logs <pod name> main
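
If the argo CLI is installed, the equivalent commands are (a sketch; the exact log syntax varies slightly by argo CLI version):

$ argo list
$ argo get <workflow name>
$ argo logs <workflow name>
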
TensorFlow Serving Client

Once all the steps in the workflow have completed, the TensorFlow Serving gRPC client can be used to run inference on the served model.

Prior to running the client script, install the following dependency in your environment:

  • tensorflow-serving-api
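
For example, using pip:

pip install tensorflow-serving-api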

The client script reads a csv file (in this example, the eval.csv file), formats the data into the model's input format, and then calls the served model. Accuracy and benchmarking metrics are printed out.

Run the run_tf_serving_client.py script with the --help flag to see the argument options:

$ python run_tf_serving_client.py --help
usage: wide-deep-large-ds-fp32-training-k8s/quickstart/run_tf_serving_client.py [-h]
       [-s SERVER] -d DATA_FILE [-b BATCH_SIZE] [-n NUM_ITERATION] [-w WARM_UP_ITERATION]

optional arguments:
  -h, --help            show this help message and exit
  -s SERVER, --server SERVER
                        Server URL and port (default=localhost:8500).
  -d DATA_FILE, --data_file DATA_FILE
                        Path to csv data file
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size to use (default=1).
  -n NUM_ITERATION, --num_iteration NUM_ITERATION
                        Number of times to repeat (default=40).
  -w WARM_UP_ITERATION, --warm_up_iteration WARM_UP_ITERATION
                        Number of initial iterations to ignore in benchmarking (default=10).
  1. Find the INTERNAL-IP of one of the nodes in your cluster using kubectl get nodes -o wide. This IP should be used as the server URL in the --server arg (see the sketch after this list).

  2. Get the NodePort using kubectl describe service. This NodePort should be used as the port in the --server arg.

  3. Run the client script with your preferred parameters. For example:

    python wide-deep-large-ds-fp32-training-k8s/quickstart/run_tf_serving_client.py -s <Internal IP>:<Node Port> -d <path to eval.csv> -b <batch size>

    The script will call the served model using data from the csv file and output performance and accuracy metrics.
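
A sketch of steps 1 and 2, assuming kubectl is pointed at the namespace where the serving service was created (the grep filter is illustrative):

kubectl get nodes -o wide                     # INTERNAL-IP column has the node IP
kubectl describe service | grep 'NodePort:'   # shows the assigned NodePort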

Clean-Up the Pipeline

To clean up the model training/serving pipeline, delete the service, deployment, and other resources using the following commands:

kubectl delete -f <use-case>.yaml

Troubleshooting

  • Pod doesn't start. Status is ErrImagePull.
    Docker* recently implemented rate limits.
    See this note about rate limits and work-arounds.

  • Argo workflow steps do not execute.
    Error from argo get is 'failed to save outputs: Failed to establish pod watch: timed out waiting for the condition'.
    See this argo issue. This is due to the workflow running as non-root.
    Devops will need to change the workflow-executor to k8sapi as described here (see the sketch after this list).

  • MpiOperator can't create workers. Error is '/bin/sh: /etc/hosts: Permission denied'. This is due to a bug in mpi-operator in the 'latest' container image when the workers run as non-root. See this issue.
    Use the container images: mpioperator/mpi-operator:v0.2.3 and mpioperator/kubectl-delivery:v0.2.3.
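
For the Argo workflow-executor issue above, a sketch of switching the executor to k8sapi by patching the workflow-controller configmap (the argo namespace is an assumption; adjust to wherever Argo is installed):

kubectl -n argo patch configmap workflow-controller-configmap \
  --type merge -p '{"data":{"containerRuntimeExecutor":"k8sapi"}}'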


Documentation and Sources

Get Started
Main GitHub* Repository
Readme
Release Notes
Get Started Guide

Code Sources
Report Issue


License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.


Related Containers and Solutions

Wide & Deep Large Dataset FP32 Training TensorFlow* Container
Wide & Deep Large Dataset FP32 Training TensorFlow* Model Package


Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.