Optimize a BERT-Large FP32 Training Model Package with TensorFlow*

Published: 10/23/2020  

Last Updated: 06/15/2022

Download Command

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/bert-large-fp32-training.tar.gz

Description

This document has instructions for running BERT FP32 training using Intel® Optimizations for TensorFlow*.

For all fine-tuning datasets (Stanford Question Answering Dataset [SQuAD], MultiNLI, Microsoft* Research Paraphrase Corpus [MRPC], and more), download the pretrained checkpoints as described in the Google* BERT repository.

Refer to the Google reference page for checkpoints.

Datasets

Follow instructions in BERT Large datasets to download and preprocess the dataset. You can do either classification training or fine-tuning using SQuAD.

Quick Start Scripts

  • fp32_classifier_training: Fine-tunes the BERT base model on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 examples. Download the BERT base uncased (12-layer, 768-hidden) pretrained model and set CHECKPOINT_DIR to that directory. DATASET_DIR should point to the GLUE data.
  • fp32_squad_training: Fine-tunes BERT using SQuAD data. Download the BERT large uncased (whole word masking) pretrained model and set CHECKPOINT_DIR to that directory. DATASET_DIR should point to the SQuAD data files.
  • fp32_squad_training_demo: Runs a short demo of 0.01 epochs using the mini-dev-v1.1.json file instead of the full SQuAD dataset.

Bare Metal

To run on bare metal, the following prerequisites must be installed in your environment:

Once the above dependencies have been installed, download and untar the model package, set the environment variables, and then run a quick start script. See the datasets and the list of quick start scripts for more details on the different options. If you switch between SQuAD and classifier training, or run classifier training multiple times, use a new, empty OUTPUT_DIR to prevent incompatible checkpoints from being picked up.
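
One way to guarantee a new, empty OUTPUT_DIR for every run is to generate a uniquely named directory each time. The sketch below is only an illustration of that idea and is not part of the model package; the helper name new_output_dir, the run label, and the base path are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the model package): create a new, empty
# output directory under a base path so that checkpoints from earlier
# runs are never picked up.
new_output_dir() {
    local base="$1"      # parent directory for all runs (assumed layout)
    local label="$2"     # free-form run label, e.g. "squad" or "classifier"
    mkdir -p "$base" || return 1
    # mktemp -d creates a uniquely named, empty directory and prints its path.
    mktemp -d "$base/${label}_XXXXXX"
}

# Example: a fresh directory for a classifier run.
OUTPUT_DIR="$(new_output_dir "${TMPDIR:-/tmp}/bert_runs" classifier)"
export OUTPUT_DIR
echo "Using OUTPUT_DIR=$OUTPUT_DIR"
```

Because every call returns a different directory, repeated classifier runs or a switch to SQuAD training cannot collide with older checkpoints.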

The snippet below shows a quick start script running with a single instance:

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/bert-large-fp32-training.tar.gz
tar -xvf bert-large-fp32-training.tar.gz
cd bert-large-fp32-training

# Export the variables so that the quick start script can read them
export CHECKPOINT_DIR=<path to the pretrained bert model directory>
export DATASET_DIR=<path to the dataset being used>
export OUTPUT_DIR=<directory where checkpoints and log files will be saved>

# Run a script for your desired usage
./quickstart/<script name>.sh
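
Before launching a quick start script, it can help to confirm that the three directory variables are set and point to real directories. The following is a minimal sketch of such a check; the function name check_dirs is hypothetical and the check is not something the model package provides.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the model package).
# Takes variable NAMES as arguments and returns 0 only when each one is set
# and its value is an existing directory.
check_dirs() {
    local name value status=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "error: $name=$value is not a directory" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage before launching a quick start script:
#   check_dirs CHECKPOINT_DIR DATASET_DIR OUTPUT_DIR && ./quickstart/<script name>.sh
```

The function reports every problem it finds instead of stopping at the first one, so a single run shows all variables that still need attention.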

To run distributed training (one Message Passing Interface [MPI] process per socket) for better throughput, set the MPI_NUM_PROCESSES variable to the number of sockets to use. Note that the global batch size is mpi_num_processes * train_batch_size, and the learning rate sometimes needs to be adjusted for convergence. By default, the script uses square root learning rate scaling.
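
The arithmetic behind the global batch size and the learning rate adjustment can be sketched as follows. The numeric values are illustrative only and are not defaults taken from the package, and the scaling formula (base learning rate multiplied by the square root of the number of processes) is an assumption about what "square root learning rate scaling" means here.

```shell
#!/usr/bin/env bash
# Illustrative values only; they are not defaults taken from the model package.
MPI_NUM_PROCESSES=4
TRAIN_BATCH_SIZE=24
BASE_LEARNING_RATE=0.00003

# Global batch size = number of MPI processes * per-process batch size.
GLOBAL_BATCH_SIZE=$(( MPI_NUM_PROCESSES * TRAIN_BATCH_SIZE ))

# Square root scaling (assumed form): lr = base_lr * sqrt(number of processes).
SCALED_LEARNING_RATE=$(LC_ALL=C awk -v lr="$BASE_LEARNING_RATE" -v n="$MPI_NUM_PROCESSES" \
    'BEGIN { printf "%.6f", lr * sqrt(n) }')

echo "global batch size: $GLOBAL_BATCH_SIZE"
echo "scaled learning rate: $SCALED_LEARNING_RATE"
```

With four processes, the global batch size grows fourfold while the learning rate under this assumed rule only doubles, which is why convergence may still need to be checked after changing the process count.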

For BERT fine-tuning tasks, state-of-the-art accuracy can be achieved via parallel training without synchronizing gradients between MPI workers. The mpi_workers_sync_gradients=[True/False] option controls whether the MPI workers sync gradients. By default, it is set to False, meaning that the workers train independently and the best-performing training result is picked at the end. To enable gradient synchronization, set mpi_workers_sync_gradients to True in the BERT options. To modify the BERT options, edit the quick start .sh script or call the launch_benchmarks.py script directly with your preferred arguments.

To run with multiple instances, these additional dependencies will need to be installed in your environment:

  • openmpi-bin
  • openmpi-common
  • OpenSSH client
  • OpenSSH server
  • libopenmpi-dev
  • Horovod* 0.19.1

The snippet below shows a quick start script running with multiple instances:

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v2_3_0/bert-large-fp32-training.tar.gz
tar -xvf bert-large-fp32-training.tar.gz
cd bert-large-fp32-training

# Export the variables so that the quick start script can read them
export CHECKPOINT_DIR=<path to the pretrained bert model directory>
export DATASET_DIR=<path to the dataset being used>
export OUTPUT_DIR=<directory where checkpoints and log files will be saved>
export MPI_NUM_PROCESSES=<number of sockets to use>

# Run a script for your desired usage
./quickstart/<script name>.sh
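
Since MPI_NUM_PROCESSES is meant to match the number of sockets in use, the socket count can be read from the system instead of being typed by hand. The sketch below assumes a Linux host with the util-linux lscpu tool and its English "Socket(s):" output line; the helper name count_sockets is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lscpu-style text on stdin and print the value of
# the "Socket(s):" field. Assumes the English util-linux lscpu output format.
count_sockets() {
    awk -F: '/^Socket\(s\)/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Use the detected socket count when lscpu is available.
if command -v lscpu >/dev/null 2>&1; then
    MPI_NUM_PROCESSES="$(LC_ALL=C lscpu | count_sockets)"
fi
# Fall back to a single process when detection was not possible.
MPI_NUM_PROCESSES="${MPI_NUM_PROCESSES:-1}"
export MPI_NUM_PROCESSES
echo "MPI_NUM_PROCESSES=$MPI_NUM_PROCESSES"
```

Keeping the parsing in a function that reads standard input makes it easy to try against saved lscpu output from another machine before relying on it.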

Documentation and Sources

Get Started
Main GitHub*
Readme
Release Notes
Get Started Guide

Code Sources
Report Issue


License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.


Related Containers and Solutions

BERT Large FP32 Training TensorFlow Container


Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.