
Pretrain Llama-2-7B Using Megatron-DeepSpeed* with the FP8 Datatype on the Intel® Gaudi® 2 AI Accelerator

Learn how to run pretraining of Meta* Llama-2-7b using the Megatron-DeepSpeed* library on the Intel® Gaudi® AI accelerator. The Megatron-DeepSpeed library reduces memory consumption on the Intel Gaudi AI accelerator when running large language models.

Set up the environment, select parameters, run the workload, and then see a price-performance comparison. The Intel Gaudi AI accelerator supports PyTorch* as the main framework for training (based on the Habana* implementation of DeepSpeed). Additional examples can be found for training large transformer language models such as Llama 2 at scale.

The following steps will let you:

  • Get access to a node for the Intel Gaudi AI accelerator on the Intel® Tiber™ AI Cloud.
  • Ensure that all the software is installed and configured properly by running the PyTorch version of the Docker* image for the accelerator.
  • Install prerequisites.
  • Download and preprocess the dataset.
  • Select parameters and run pretraining on the model.

 

Performance Evaluation

Before running the model, look at the performance measurements and the price-performance comparison against an equivalent H100 pretraining example. In this case, the Llama-2-7b model is pretrained using FP8 with a sequence length of 4,096 and a global batch size of 1,024 on eight Intel Gaudi AI accelerators (see Model Performance), and compared against the same model configuration on the H100 GPU using published benchmarks from NVIDIA*.

The following figure shows that tokens per dollar are higher for the Intel Gaudi solution than for the NVIDIA solution.

[Figure: Performance cost differences]


Accessing the Intel Gaudi Node in the Intel® Tiber™ AI Cloud

To access an Intel® Gaudi® node in the Intel® Tiber™ AI Cloud, go to the Intel® Tiber™ AI Cloud Console, open the hardware instances, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.


The website provides an ssh command to log in to the node. It is advisable to add local port forwarding to that command so you can reach a Jupyter Notebook running on the node, for example: ssh -L 8888:localhost:8888 ...
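The complete command might look like the following sketch; the key path, user name, and address are placeholders, so use the values shown by the console for your instance:

# Hypothetical example: substitute the key, user, and IP provided for your node
ssh -L 8888:localhost:8888 -i ~/.ssh/gaudi_key ubuntu@<node-ip-address>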

Details about setting up Jupyter Notebooks on an Intel® Gaudi® Platform are available here.
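If you prefer to launch the notebook server yourself, the following is a minimal sketch, assuming JupyterLab is not already installed on the node or in the container:

# Install and launch JupyterLab on the forwarded port 8888;
# --allow-root is needed when running as the root user (as in the Gaudi container)
pip install jupyterlab
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root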

Docker Setup

With access to the node, use the latest Intel® Gaudi® Docker image. The docker run command below automatically downloads the image (if needed) and starts the container:

docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Start the Docker image and enter the Docker environment by issuing the following command:

docker exec -it Gaudi_Docker bash

More information on Gaudi Docker setup and validation can be found here.
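Before continuing, you can confirm from inside the container that all of the accelerators are visible. This is a quick check, assuming the standard Gaudi image, which ships the hl-smi utility:

# List the Gaudi devices visible inside the container
hl-smi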

Install Prerequisites

Once in the Docker environment, install the necessary libraries.

Start in the root user's home directory and install the DeepSpeed library:

cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

Now install the Hugging Face Optimum for Intel® Gaudi® library (optimum-habana) and clone the Megatron-DeepSpeed examples from GitHub, selecting the latest validated release of each:

pip install optimum-habana==1.16.0
git clone -b 1.19.0 https://github.com/HabanaAI/Megatron-DeepSpeed.git

Next, change to the Megatron-DeepSpeed directory and install the requirements needed for training:

cd Megatron-DeepSpeed
pip install -r megatron/core/requirements.txt

Set up the correct path for Megatron-DeepSpeed:

export MEGATRON_DEEPSPEED_ROOT=`pwd`
export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH

Finally, set Python 3.10 as the default Python version. If Python 3.10 is not the default, replace any call to the python command in your model scripts with $PYTHON and define the environment variable as follows:

export PYTHON=/usr/bin/python3.10
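As an optional sanity check, and as a minimal sketch assuming the packages installed above, verify that the key libraries import and that the Megatron-DeepSpeed path is set:

# Verify the DeepSpeed and Optimum Habana installs and the Megatron-DeepSpeed path
$PYTHON -c "import deepspeed; print('deepspeed', deepspeed.__version__)"
$PYTHON -c "import optimum.habana; print('optimum-habana OK')"
echo "MEGATRON_DEEPSPEED_ROOT=${MEGATRON_DEEPSPEED_ROOT}"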

Download Dataset

To download the datasets used for training Llama 2, follow the directions on the Megatron-DeepSpeed GitHub page. This tutorial uses a subset of the OSCAR dataset, which is commonly used to pretrain language models and word representations.

You can download the full (500 GB+) OSCAR dataset, or download a subset of it for a quick start. These steps are based on the OSCAR dataset repository.

First, clone the dataset repository:

cd ~
git clone https://github.com/bigscience-workshop/bigscience.git
cd bigscience/data/oscar

Next, edit the file oscar-to-jsonl.py. This example downloads the zh (Chinese) subset. In the language_subsets list, uncomment unshuffled_deduplicated_zh and comment out unshuffled_deduplicated_en:

### Build/Load Datasets

# Once this part of the process completes it gets cached, so on subsequent runs it'll be much faster
language_subsets = (
     # "unshuffled_deduplicated_ar",
     # "unshuffled_deduplicated_sw",
     "unshuffled_deduplicated_zh",
     # "unshuffled_deduplicated_en",
     # "unshuffled_deduplicated_fr",
     # "unshuffled_deduplicated_pt",
     # "unshuffled_deduplicated_es",
)

Run the Python script that downloads and preprocesses the data. Note the -s option, which downloads only a subset of the dataset for the purposes of this tutorial (the operation can take some time, depending on the download speed and hardware used):

$PYTHON ./oscar-to-jsonl.py -s

When the above operation completes, the ~/bigscience/data/oscar/ directory will contain the following data files:

-rw-r--r-- 1 root root 66707628 Jul 26 00:38 oscar-0.jsonl
-rw-r--r-- 1 root root 63555928 Jul 26 00:38 oscar-1.jsonl
-rw-r--r-- 1 root root 59082488 Jul 26 00:38 oscar-2.jsonl
-rw-r--r-- 1 root root 63054515 Jul 26 00:38 oscar-3.jsonl
-rw-r--r-- 1 root root 59592060 Jul 26 00:38 oscar-4.jsonl
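Optionally, confirm the shards downloaded correctly by counting the records in each file; each line is one JSON document, and the exact counts and sizes will vary with the dataset snapshot:

# Count the records (one document per line) in each downloaded shard
wc -l oscar-*.jsonl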

The next step is to tokenize the dataset. There are different ways to tokenize a dataset; this example uses the GPT2BPETokenizer method (byte-pair encoding).

According to the directions on the Gaudi Megatron-DeepSpeed GitHub page, the five JSONL files above can either be concatenated into a single large file and then tokenized, or tokenized separately and the five tokenized files merged afterward. In this tutorial, the smaller files are processed individually to prevent possible host out-of-memory issues.

The GPT2BPETokenizer method is used to tokenize the five JSONL files separately. First, download the GPT-2 vocabulary (gpt2-vocab.json) and merges (gpt2-merges.txt) files:

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

Next, create and execute a shell script as follows. This script tokenizes the individual JSONL files one at a time and writes the tokenized output to the zh_tokenized directory. The --workers value (16 here) can be changed according to the number of cores in the CPU being used:

# tokenize individual jsonl files
# loop count will change based on number of files for a given dataset
mkdir zh_tokenized
for i in $(seq 0 4);
do
    $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-${i}.jsonl --output-prefix zh_tokenized/tokenized${i} --tokenizer-type GPT2BPETokenizer --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --append-eod --workers 16
done

After the above operation completes, the zh_tokenized directory will contain the following files:

-rw-r--r-- 1 root root 93115006 Jul 26 00:47 tokenized0_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized0_text_document.idx
-rw-r--r-- 1 root root 88055238 Jul 26 00:47 tokenized1_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized1_text_document.idx
-rw-r--r-- 1 root root 82539576 Jul 26 00:47 tokenized2_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized2_text_document.idx
-rw-r--r-- 1 root root 87806904 Jul 26 00:47 tokenized3_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized3_text_document.idx
-rw-r--r-- 1 root root 82680922 Jul 26 00:48 tokenized4_text_document.bin
-rw-r--r-- 1 root root 166862 Jul 26 00:48 tokenized4_text_document.idx

To complete the tokenization step, merge the tokenized dataset files generated above into a single file by running the following commands:

# merge tokenized files
mkdir zh_tokenized_merged
$PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/merge_datasets.py --input zh_tokenized --output-prefix zh_tokenized_merged/tokenized_text_document

This creates the zh_tokenized_merged directory, which contains the following merged files:

ls -lt zh_tokenized_merged
-rw-r--r-- 1 root root 834222 Jul 26 00:57 tokenized_text_document.idx
-rw-r--r-- 1 root root 434197646 Jul 26 00:57 tokenized_text_document.bin

To make pretraining easier, copy the gpt2-merges.txt and gpt2-vocab.json files into the zh_tokenized_merged directory. Using the GPT2BPETokenizer with pretraining requires those files to be in the same directory as the data.

cp gpt2-* zh_tokenized_merged

This completes the dataset downloading and preprocessing steps.

 

 

 


 

Llama2 7B Training

Write an example script, called run_llama_wrapper.sh, to perform training of Llama 2 7B. The first part of the script enables debug logging; these Habana log-enablement environment variables are described in the debugging guide documentation:

export LOG_LEVEL_ALL=4
export ENABLE_CONSOLE=true
export HABANA_LOGS=./habana_log

Next, set up environment variables for the directories containing the model references and the data used for training:

export MODEL_REFERENCES_ROOT=${MEGATRON_DEEPSPEED_ROOT}
export HL_DATA_DIR_ROOT=~/bigscience/data/oscar/zh_tokenized_merged
export HL_DATA_FILE_PREFIX=tokenized_text_document
export OUT_DIR="Llama2-7B-training"
export HL_HOSTSFILE=/launch/hostsfile
export HL_TOKENIZER_TYPE=GPT2BPETokenizer 

The rest of the script contains variables that will control training:

mkdir -p ${OUT_DIR}

HL_SAVE=0 \
HL_EXIT_INTERVAL=80 \
HL_RESULTS_DIR=${OUT_DIR} \
HL_LOG_INTERVAL=10 \
HL_TOKENIZER_TYPE=${HL_TOKENIZER_TYPE} \
HL_NUM_NODES=1 \
HL_PP=1 HL_TP=1 HL_DP=8 \
HL_DATA_DIR_ROOT=${HL_DATA_DIR_ROOT} \
HL_LLAMA_MODEL_SIZE=7 \
HL_LLAMA_VER=2 \
HL_DATA_FILE_PREFIX=${HL_DATA_FILE_PREFIX} \
HL_ZERO_STAGE=1 \
HL_CKP_ACT=2 \
HL_SEQ_LEN=4096 \
HL_GBS=512 \
HL_USE_FAST_SOFTMAX=1 \
HL_GRAD_ACCUM_DTYPE=bf16 \
HL_USE_TRANSFORMER_ENGINE=1 \
HL_USE_CACHE_FP8_WEIGHT_FWD=1 \
HL_USE_CACHE_FP8_WEIGHT=1 \
${MODEL_REFERENCES_ROOT}/scripts/run_llama.sh 2>&1 | tee ${OUT_DIR}/llama_8x.log

Execute the script to start the training:

./run_llama_wrapper.sh &

FP8 training is enabled by setting HL_USE_TRANSFORMER_ENGINE=1; the HL_USE_CACHE_FP8_WEIGHT_FWD=1 and HL_USE_CACHE_FP8_WEIGHT=1 settings further improve FP8 performance.

Untested ZeRO Optimizer Errors

If the version of DeepSpeed being used includes an untested ZeRO optimizer, the run may terminate with the following error message:

AssertionError: You are using an untested ZeRO Optimizer. Please add <"zero_allow_untested_optimizer": true> in the configuration file to use it.

To bypass this issue, add the following entry to the heredoc (EOT) statement that creates the ds_config.json file in the ~/Megatron-DeepSpeed/scripts/run_llama.sh shell script:

  "zero_allow_untested_optimizer": true

Llama2 7B Training Results

Because performance results can vary depending on the hardware used, the results shown in this section should be considered examples rather than benchmark results. Detailed performance data for Intel® Gaudi® AI accelerators can be found here.

In a sample run of the run_llama script, the following information is reported in the output log at the end of execution (remember that the sample run ended after 80 iterations, as specified by the environment variable HL_EXIT_INTERVAL=80):

iteration 80/ 500000 | consumed samples:81920 | consumed tokens:335544320 | elapsed time per iteration (ms): 62373.1 | learning rate:1.200E-05 | global batch size: 1024 | lm loss:3.354671E+00 | loss scale:1.0 | grad norm:4.962 | num zeros:0.0 | actual seqlen:4096 | number of skipped iterations:0 | number of nan iterations:0 | samples per second:16.417 | tokens per gpu per second (tgs):8405.678 | TFLOPs:409.21 |

The total number of tokens per second is approximately:

tokens per gpu per second (tgs) × 8 HPUs ≈ 8,400 × 8 ≈ 67,200 tokens/sec
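As a minimal sketch, assuming the log line format shown above and the log path written by run_llama_wrapper.sh, the same estimate can be computed directly from the log:

# Pull the last reported tgs value from the training log and scale by 8 HPUs
awk -F'tgs\\):' '/tokens per gpu per second/ {split($2, a, "|"); t = a[1]} END {printf "total tokens/sec ~ %.0f\n", t * 8}' Llama2-7B-training/llama_8x.log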

These results align with the published numbers for Intel Gaudi 2.

Next Steps

Now that you have run a pretraining case, you can go back to the Hugging Face* Optimum Habana validated models to see more options for running training or inference.

 

 

 
