Optimize a BERT-Large FP32 Training Model Package with TensorFlow* for Kubernetes*

Published: 12/09/2020



Download Command

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/bert-large-fp32-training-k8s.tar.gz

Description

This document has instructions for running BERT FP32 training using Intel® Optimization for TensorFlow* on Kubernetes*.

For all fine-tuning, the datasets (Stanford Question Answering Dataset [SQuAD], MultiNLI, Microsoft* Research Paraphrase Corpus [MRPC], and more) and checkpoints should be downloaded as mentioned in the Google* BERT repository.

Refer to the Google reference page for checkpoints.

Pretrained Models

Download and extract checkpoints for the BERT pretrained model from the Google BERT repository. The extracted directory should be set to the CHECKPOINT_DIR environment variable when running the quick start scripts.
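As an illustrative sketch (the checkpoint name and URL below are for the whole-word-masking uncased BERT-Large model listed in the Google BERT repository; verify them there before use), the download and environment setup might look like:

```shell
# Hypothetical sketch: download and extract a BERT-Large checkpoint zip,
# then point CHECKPOINT_DIR at the extracted directory.
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip -d "$(pwd)/checkpoints"
export CHECKPOINT_DIR=$(pwd)/checkpoints/wwm_uncased_L-24_H-1024_A-16
```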

For training from scratch, Wikipedia and BookCorpus need to be downloaded and preprocessed.

General Language Understanding Evaluation (GLUE) Data

GLUE data is used when running BERT classification training. Download and unpack the GLUE data by running this script.
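For illustration only (the BERT README points to a community download_glue_data.py script; the script name and flags below are assumptions taken from that script), downloading the data might look like:

```shell
# Hypothetical sketch: fetch all GLUE tasks into ./glue_data using the
# download_glue_data.py script referenced by the Google BERT repository.
python download_glue_data.py --data_dir ./glue_data --tasks all
export GLUE_DIR=$(pwd)/glue_data
```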

SQuAD Data

Download the SQuAD dataset files referenced in the Google BERT repository. Place the three files (train-v1.1.json, dev-v1.1.json, and evaluate-v1.1.py) in the same directory. Set DATASET_DIR to point to that directory when running BERT fine-tuning with the SQuAD data.
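A minimal sketch, assuming the two SQuAD v1.1 JSON files are fetched from the rajpurkar.github.io mirror linked in the BERT README (verify the evaluate-v1.1.py link there as well):

```shell
# Hypothetical sketch: place the three SQuAD v1.1 files in one directory
# and point DATASET_DIR at it.
mkdir -p squad && cd squad
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
# evaluate-v1.1.py is linked from the BERT README; download it into the
# same directory.
export DATASET_DIR=$(pwd)
```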

Quick Start Scripts

Script name          Description
launch_benchmark.py  The single-node Kubernetes job uses this script to run BERT classifier inference.

Kubernetes

Download and untar the BERT-Large FP32 training package:

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/bert-large-fp32-training-k8s.tar.gz
tar -xvf bert-large-fp32-training-k8s.tar.gz

Prerequisites

Both single-node and multi-node deployments use Kustomize v3.8.7 to configure deployment parameters. Download Kustomize v3.8.7, extract it, and move the binary to a directory in your PATH. Verify that you have installed the correct version by running kustomize version. On OS X* you would see:

{Version:kustomize/v3.8.7 GitCommit:ad092cc7a91c07fdf63a2e4b7f13fa588a39af4f BuildDate:2020-11-11T23:19:38Z GoOs:darwin GoArch:amd64}
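One way to install it on Linux (the release asset name below follows the Kustomize GitHub release naming convention for v3.8.7 and is an assumption to verify against the releases page):

```shell
# Illustrative install of Kustomize v3.8.7 on Linux x86_64.
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv3.8.7/kustomize_v3.8.7_linux_amd64.tar.gz
tar -xzf kustomize_v3.8.7_linux_amd64.tar.gz
sudo mv kustomize /usr/local/bin/
kustomize version
```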

Running

The Kubernetes package for BERT-Large FP32 training includes single-node and multi-node Kubernetes deployments. Each deployment covers common use cases for the storage and security variations found across different Kubernetes installations. The directory tree within the model package is shown below: the single-node and multi-node directories sit under the mlops directory, and the common use cases sit under each of them:

quickstart
└── mlops
      ├── multi-node
      │       ├── user-allocated-pvc
      │       └── user-mounted-nfs
      └── single-node
              ├── user-allocated-pvc
              └── user-mounted-nfs

Multi-node Distributed Training

The multi-node use cases (user-allocated-pvc, user-mounted-nfs) make the following assumptions:

  • the mpi-operator has been deployed on the cluster by DevOps (see below).
  • the OUTPUT_DIR parameter is a shared volume that is writable by the user and available cluster wide.
  • the DATASET_DIR parameter is a dataset volume, also available cluster-wide (for example, using ZFS or other performant storage). Typically these volumes are read-only.
DevOps

The k8s resources needed to run the multi-node bert-large-fp32-training-k8s quick start require deployment of an mpi-operator. See the MPI Operator deployment section of the Kubernetes DevOps document for instructions.

Once these resources have been deployed, the mlops user then has a choice of running bert-large-fp32-training-k8s multi-node (distributed training) or single-node.

Mlops

Distributed training is done by posting an MPIJob to the k8s api-server which is handled by the mpi-operator that was deployed by DevOps. The mpi-operator parses the MPIJob and then runs a launcher and workers specified in the MPIJob. Launcher and workers communicate through Horovod*. The distributed training algorithm is handled by mpirun.

In a terminal, cd to the multi-node directory. Each use case under this directory has parameters that can be changed using kustomize cfg set.

User-mounted NFS and user-allocated PVC parameter values:

NAME         VALUE      DESCRIPTION
DATASET_DIR  /datasets  input dataset directory
FS_ID        0          owner id of mounted volumes
GROUP_ID     0          process group id
GROUP_NAME   root       process group name
NFS_PATH     /nfs       nfs path
NFS_SERVER   0.0.0.0    nfs server
PVC_NAME     workdisk   model-builder
PVC_PATH     /pvc       model-builder
USER_ID      0          process owner id
USER_NAME    root       process owner name

For the user-mounted NFS use case, change NFS_PATH and NFS_SERVER.

For the user-allocated PVC use case, change PVC_NAME and PVC_PATH.

For example, to change the NFS_SERVER address, run:

kustomize cfg set . NFS_SERVER <ip address> -R

To change the PVC_NAME, run:

kustomize cfg set . PVC_NAME <PVC Name> -R

In both use cases, change the values below so the pod is deployed with the user's identity.

kustomize cfg set . FS_ID <Group ID> -R
kustomize cfg set . GROUP_ID <Group ID> -R
kustomize cfg set . GROUP_NAME <Group Name> -R
kustomize cfg set . USER_ID <User ID> -R
kustomize cfg set . USER_NAME <User Name> -R
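For example, the identity values can be filled in from the current shell user with the id command (illustrative; run from the use-case directory):

```shell
# Set the pod identity parameters from the current user's uid/gid.
kustomize cfg set . FS_ID "$(id -g)" -R
kustomize cfg set . GROUP_ID "$(id -g)" -R
kustomize cfg set . GROUP_NAME "$(id -gn)" -R
kustomize cfg set . USER_ID "$(id -u)" -R
kustomize cfg set . USER_NAME "$(id -un)" -R
```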

Change the default namespace of all the resources by running the Kustomize command:

pushd <use-case>
kustomize edit set namespace <User's namespace>
popd

This will place all resources within the specified namespace. Note: this namespace should be created prior to deployment.

You can also change your default kubectl context by running:

kubectl config set-context --current --namespace=<User's namespace>

After you change the parameter values, you can deploy the use case by running:

kustomize build <use-case> > <use-case>.yaml
kubectl apply -f <use-case>.yaml
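After applying, you can confirm that the MPIJob and its launcher/worker pods were created (illustrative commands; resource names depend on the use case):

```shell
# Check the MPIJob resource and the pods it spawned.
kubectl get mpijobs
kubectl get pods
```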

Multi-node Training Output

View the log output of the bert-large-fp32-training-k8s MPIJob by viewing the logs of the launcher pod. The launcher pod aggregates output from the worker pods. Locate this pod by filtering the list of pods for the name launcher:

kubectl get pods -oname|grep launch|cut -c5-

This can be combined with the kubectl logs subcommand to tail the output of the training job:

kubectl logs -f $(kubectl get pods -oname|grep launch|cut -c5-)

Note that the mpirun parameter -output-filename is actually a directory and is set to $OUTPUT_DIR.

Multi-node Training Cleanup

Removing the mpijob and related resources is done by running:

kubectl delete -f <use-case>.yaml

Single-Node Training

Single-node training is similar to the Docker* use case but the command is run within a pod. Training is done by submitting a pod.yaml to the k8s api-server that results in the pod creation and running the fp32_training_demo.sh command within the pod's container.

In a terminal, cd to the single-node directory. Each use case under this directory has parameters that can be changed using kustomize cfg set.

Mlops
User-mounted NFS and user-allocated PVC parameter values:

NAME         VALUE      DESCRIPTION
DATASET_DIR  /datasets  input dataset directory
FS_ID        0          owner id of mounted volumes
GROUP_ID     0          process group id
GROUP_NAME   root       process group name
NFS_PATH     /nfs       nfs path
NFS_SERVER   0.0.0.0    nfs server
PVC_NAME     workdisk   model-builder
PVC_PATH     /pvc       model-builder
USER_ID      0          process owner id
USER_NAME    root       process owner name

For the user-mounted NFS use case, change NFS_PATH and NFS_SERVER.

For the user-allocated PVC use case, change PVC_NAME and PVC_PATH.

For example, to change the NFS_SERVER address, run:

kustomize cfg set . NFS_SERVER <ip address> -R

To change the PVC_NAME, run:

kustomize cfg set . PVC_NAME <PVC Name> -R

In both use cases, the user should change the values below so the pod is deployed with the user's identity.

kustomize cfg set . FS_ID <Group ID> -R
kustomize cfg set . GROUP_ID <Group ID> -R
kustomize cfg set . GROUP_NAME <Group Name> -R
kustomize cfg set . USER_ID <User ID> -R
kustomize cfg set . USER_NAME <User Name> -R

Change the default namespace of all the resources by running the Kustomize command:

pushd <use-case>
kustomize edit set namespace <User's namespace>
popd

This will place all resources within the specified namespace. Note: this namespace should be created prior to deployment.

You can also change your default kubectl context by running:

kubectl config set-context --current --namespace=<User's namespace>

After you change the parameter values, you can deploy the use case by running:

kustomize build <use-case> > <use-case>.yaml
kubectl apply -f <use-case>.yaml

Single-Node Training Output

View the log output of the bert-large-fp32-training-k8s pod by viewing the logs of the training pod. Find this pod by filtering the list of pods for the name training:

kubectl get pods -oname|grep training|cut -c5-

This can be combined with the kubectl logs subcommand to tail the output of the training job:

kubectl logs -f $(kubectl get pods -oname|grep training|cut -c5-)

Single-Node Training Cleanup

Removing the pod and related resources is done by running:

kubectl delete -f <use-case>.yaml

Troubleshooting

  • Pod doesn't start. Status is ErrImagePull.
    Docker recently implemented rate limits.
    See this note about rate limits and work-arounds.

  • Argo workflow steps do not run.
    Error from argo get is 'failed to save outputs: Failed to establish pod watch: timed out waiting for the condition'.
    See this Argo issue. This is due to the workflow running as non-root.
    Devops will need to change the workflow-executor to k8sapi as described in workflow-executors.

  • Mpi-operator can't create workers. Error is '/bin/sh: /etc/hosts: Permission denied'. This is due to a bug in mpi-operator in the 'latest' container image when the workers run as non-root. See this issue.
    Use the container images mpioperator/mpi-operator:v0.2.3 and mpioperator/kubectl-delivery:v0.2.3.


Documentation and Sources

Get Started
Main GitHub*
Readme
Release Notes
Get Started Guide

Code Sources
Report Issue


License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.


Related Containers and Solutions

BERT-Large FP32 Training TensorFlow* Container
BERT-Large FP32 Training TensorFlow* Model Package


Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.