Optimize a ResNet50* V1.5 Bfloat16 Training Container with TensorFlow*

ID 679191
Updated 6/15/2022
Version Latest
Public

Pull Command

docker pull intel/image-recognition:tf-latest-resnet50v1-5-bfloat16-training

Description

This document has instructions for running ResNet50* v1.5 bfloat16 training using Intel® Optimization for TensorFlow*.

Note that these ResNet50 v1.5 examples use the ImageNet dataset. Download and preprocess the ImageNet dataset using the instructions here. After running the conversion script, you should have a directory containing the ImageNet dataset in TFRecord format.

Set the DATASET_DIR environment variable to point to this directory when running ResNet50 v1.5.
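
Before launching a run, it can help to confirm that DATASET_DIR actually contains the converted files. The following is a minimal sketch, not part of the container: the helper name count_tfrecords is hypothetical, and the "train-" and "validation-" file name prefixes are an assumption about how the conversion script names its shards, so adjust them to match your own output.

```shell
#!/usr/bin/env bash
# Count regular files directly inside a directory whose names start with
# "<prefix>-". Hypothetical helper; the prefixes used below are an assumption
# about the shard names produced by the conversion script.
count_tfrecords() {
  local dir="$1"
  local prefix="$2"
  if [ ! -d "$dir" ]; then
    # Directory is missing: report zero shards and signal failure.
    echo 0
    return 1
  fi
  find "$dir" -maxdepth 1 -type f -name "${prefix}-*" | wc -l | tr -d ' '
}

# Example usage (uncomment after setting DATASET_DIR):
# echo "train shards:      $(count_tfrecords "$DATASET_DIR" train)"
# echo "validation shards: $(count_tfrecords "$DATASET_DIR" validation)"
```

A count of zero for either prefix suggests that DATASET_DIR points at the wrong directory or that the conversion step did not finish.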

Quick Start Scripts

Script name                  Description
bfloat16_training_demo       Launches a short run using small batch sizes and a limited number of steps to demonstrate the training flow.
bfloat16_training_1_epoch    Launches a test run that trains the model for one epoch and saves checkpoint files to an output directory.
bfloat16_training_full       Trains the model using the full dataset and runs until convergence (90 epochs) and saves checkpoint files to an output directory. Note that this will take a considerable amount of time.

Docker*

The ResNet50 v1.5 bfloat16 training model container includes the scripts and libraries needed to run ResNet50 v1.5 bfloat16 training. To run one of the model training quickstart scripts using this container, you'll need to provide volume mounts for the ImageNet dataset and an output directory where checkpoint files will be written.

DATASET_DIR=<path to the preprocessed imagenet dataset>
OUTPUT_DIR=<directory where checkpoint and log files will be written>

docker run \
  --env DATASET_DIR=${DATASET_DIR} \
  --env OUTPUT_DIR=${OUTPUT_DIR} \
  --env http_proxy=${http_proxy} --env https_proxy=${https_proxy} \
  --volume ${DATASET_DIR}:${DATASET_DIR} \
  --volume ${OUTPUT_DIR}:${OUTPUT_DIR} \
  --privileged --init -t \
  intel/image-recognition:tf-latest-resnet50v1-5-bfloat16-training \
  /bin/bash quickstart/<script name>.sh
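
As a concrete illustration, the sketch below fills in the placeholders for one of the quick start scripts and prints the resulting command so it can be reviewed before it is run. The helper name print_docker_cmd is hypothetical and not part of the container, the paths in the usage example are made up, and the proxy --env flags from the command above are left out for brevity.

```shell
#!/usr/bin/env bash
# Print (do not run) the docker command for a given quickstart script.
# Arguments: <script name> <dataset dir> <output dir>
# Hypothetical helper for illustration; proxy settings are omitted.
print_docker_cmd() {
  local script="$1"
  local dataset_dir="$2"
  local output_dir="$3"

  # Accept only the three quick start scripts listed in this document.
  case "$script" in
    bfloat16_training_demo|bfloat16_training_1_epoch|bfloat16_training_full) ;;
    *) echo "unknown quickstart script: $script" >&2; return 1 ;;
  esac

  printf '%s\n' "docker run --env DATASET_DIR=${dataset_dir} --env OUTPUT_DIR=${output_dir} --volume ${dataset_dir}:${dataset_dir} --volume ${output_dir}:${output_dir} --privileged --init -t intel/image-recognition:tf-latest-resnet50v1-5-bfloat16-training /bin/bash quickstart/${script}.sh"
}

# Example usage with made-up paths:
# print_docker_cmd bfloat16_training_demo /data/imagenet_tfrecords /tmp/resnet50_output
```

Starting with bfloat16_training_demo is a reasonable way to check the volume mounts before committing to a longer run.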

To run distributed training (one message passing interface [MPI] process per socket) for better throughput, set the MPI_NUM_PROCESSES environment variable to the number of sockets to use.

DATASET_DIR=<path to the preprocessed imagenet dataset>
OUTPUT_DIR=<directory where checkpoint and log files will be written>
MPI_NUM_PROCESSES=<number of sockets to use>

docker run \
  --env DATASET_DIR=${DATASET_DIR} \
  --env OUTPUT_DIR=${OUTPUT_DIR} \
  --env MPI_NUM_PROCESSES=${MPI_NUM_PROCESSES} \
  --env http_proxy=${http_proxy} --env https_proxy=${https_proxy} \
  --volume ${DATASET_DIR}:${DATASET_DIR} \
  --volume ${OUTPUT_DIR}:${OUTPUT_DIR} \
  --privileged --init -t \
  intel/image-recognition:tf-latest-resnet50v1-5-bfloat16-training \
  /bin/bash quickstart/<script name>.sh
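
If you are unsure how many sockets the host has, one possible way to find out on Linux is to read the "Socket(s)" line reported by lscpu. The sketch below assumes lscpu is installed and prints English field labels; the helper name parse_socket_count is hypothetical.

```shell
#!/usr/bin/env bash
# Extract the socket count from `lscpu` output supplied on stdin.
# Assumes a line of the form "Socket(s):   <number>" (English locale).
parse_socket_count() {
  awk -F: '$1 == "Socket(s)" { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example usage on a Linux host with lscpu available:
# MPI_NUM_PROCESSES=$(lscpu | parse_socket_count)
# echo "Using ${MPI_NUM_PROCESSES} MPI processes"
```

Reading from stdin keeps the parsing step separate from the lscpu call, so it can be checked against saved output from another machine.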

Documentation and Sources

Get Started
Docker* Repository
Main GitHub*
Readme
Release Notes
Get Started Guide

Code Sources
Dockerfile
Report Issue


License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.

