Optimize a ResNet50* V1.5 Bfloat16 Training Model Package with TensorFlow*

Published: 12/09/2020  

Last Updated: 06/15/2022

Download Command

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/resnet50v1-5-bfloat16-training.tar.gz

Description

This document has instructions for running ResNet50* v1.5 bfloat16 training using Intel® Optimization for TensorFlow*.

These ResNet50 v1.5 examples use the ImageNet dataset. Download and preprocess the ImageNet dataset using the instructions here. After running the conversion script, you should have a directory containing the ImageNet dataset in TFRecord format.

Set the DATASET_DIR environment variable to point to this directory when running ResNet50 v1.5.
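
Before launching a run, it can help to confirm that DATASET_DIR points at a populated directory. The following is a minimal sketch and is not part of the model package: the function name check_dataset_dir is hypothetical, and it only checks that the directory exists and contains at least one regular file. It does not validate the TFRecord contents.

```shell
# Hypothetical helper (not part of the model package): verifies that the
# given directory exists and holds at least one regular file.
check_dataset_dir() {
    dir="$1"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "DATASET_DIR is unset or not a directory: '$dir'" >&2
        return 1
    fi
    # Look only at the top level and stop after the first file found.
    if [ -z "$(find "$dir" -maxdepth 1 -type f | head -n 1)" ]; then
        echo "No files found in $dir" >&2
        return 1
    fi
    echo "Found dataset files in $dir"
}
```

Calling check_dataset_dir "$DATASET_DIR" returns a non-zero status if the directory is missing or empty, so it can guard a run before any training starts.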

Quick Start Scripts

Script name                  Description
bfloat16_training_demo       Launches a short run using small batch sizes and a limited number of steps to demonstrate the training flow.
bfloat16_training_1_epoch    Launches a test run that trains the model for one epoch and saves checkpoint files to an output directory.
bfloat16_training_full       Trains the model on the full dataset until convergence (90 epochs) and saves checkpoint files to an output directory. This takes a considerable amount of time.

Bare Metal

To run on bare metal, the following prerequisites must be installed in your environment:

Download and untar the model package and then run a quick start script.

export DATASET_DIR=<path to the preprocessed imagenet dataset>
export OUTPUT_DIR=<directory where checkpoint and log files will be written>

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/resnet50v1-5-bfloat16-training.tar.gz
tar -xvf resnet50v1-5-bfloat16-training.tar.gz
cd resnet50v1-5-bfloat16-training

quickstart/<script name>.sh
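
The final step above can be wrapped in a small guard that fails early when a variable or script is missing. This is a sketch under stated assumptions, not part of the model package: run_quickstart is a hypothetical name, it assumes the quick start scripts read DATASET_DIR and OUTPUT_DIR from their environment, and it assumes that creating OUTPUT_DIR ahead of time is harmless.

```shell
# Hypothetical wrapper (not part of the model package). Run it from the
# untarred package directory so that quickstart/<script name>.sh resolves.
run_quickstart() {
    script="quickstart/$1.sh"
    if [ -z "$DATASET_DIR" ] || [ -z "$OUTPUT_DIR" ]; then
        echo "Set DATASET_DIR and OUTPUT_DIR first" >&2
        return 1
    fi
    if [ ! -f "$script" ]; then
        echo "No such quick start script: $script" >&2
        return 1
    fi
    # Assumption: creating the output directory in advance is acceptable.
    mkdir -p "$OUTPUT_DIR"
    # Pass both variables in the environment of the child process.
    DATASET_DIR="$DATASET_DIR" OUTPUT_DIR="$OUTPUT_DIR" bash "$script"
}
```

For example, run_quickstart bfloat16_training_demo would launch the demo script once both variables are set.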

To run distributed training (one MPI process per socket) for better throughput, set the MPI_NUM_PROCESSES variable to the number of sockets to use. Running with multiple instances requires these additional dependencies in your environment:

  • openmpi-bin
  • openmpi-common
  • OpenSSH client
  • OpenSSH server
  • libopenmpi-dev
  • Horovod* 0.19.1

export DATASET_DIR=<path to the preprocessed imagenet dataset>
export OUTPUT_DIR=<directory where checkpoint and log files will be written>
export MPI_NUM_PROCESSES=<number of sockets to use>

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/resnet50v1-5-bfloat16-training.tar.gz
tar -xvf resnet50v1-5-bfloat16-training.tar.gz
cd resnet50v1-5-bfloat16-training

quickstart/<script name>.sh
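
If the number of sockets on the machine is not known, it can be read from lscpu. The helper below is a hedged sketch rather than part of the model package: count_sockets is a hypothetical name, and it assumes lscpu prints an English-language "Socket(s):" field, which may not hold under other locales.

```shell
# Hypothetical helper: reads `lscpu` output on stdin and prints the value
# of its "Socket(s):" field (assumes English-language lscpu output).
count_sockets() {
    awk -F: '/^Socket\(s\)/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Example use (left commented out so that sourcing this file has no side effects):
#   export MPI_NUM_PROCESSES=$(LC_ALL=C lscpu | count_sockets)
```

Setting MPI_NUM_PROCESSES to a smaller value than the detected count is also valid when only some sockets should be used.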

Documentation and Sources

Get Started
Main GitHub*
Readme
Release Notes
Get Started Guide

Code Sources
Report Issue

 


License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.


Related Containers and Solutions

ResNet50 V1.5 BFloat16 Training TensorFlow* Container

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.