
Pretrain Llama-2-7B Using Megatron-DeepSpeed* with the FP8 Datatype on the Intel® Gaudi® 2 AI Accelerator

Learn how to run pretraining of Meta* Llama-2-7b using the Megatron-DeepSpeed* library on the Intel® Gaudi® AI accelerator. The Megatron-DeepSpeed library reduces memory consumption on the Intel Gaudi AI accelerator when running large language models.

Set up the environment, select parameters, run the workload, and then see a price-performance comparison. The Intel Gaudi AI accelerator supports PyTorch* as the main framework for training (based on the Habana* implementation of DeepSpeed). Additional examples can be found for training large transformer language models such as Llama 2 at scale.

The following steps will let you:

  • Get access to a node for the Intel Gaudi AI accelerator on the Intel® Tiber™ AI Cloud.
  • Ensure that all the software is installed and configured properly by running the PyTorch version of the Docker* image for the accelerator.
  • Install prerequisites.
  • Download and preprocess the dataset.
  • Select parameters and run pretraining on the model.

 

Performance Evaluation

Before running the model, look at the performance measurements and the price-performance comparison against an equivalent H100 pretraining example. In this case, the Llama-2-7b model is pretrained using FP8 with a sequence length of 4,096 and a global batch size of 1,024 on eight Intel Gaudi AI accelerators (see Model Performance), and compared against the same model configuration on the H100 GPU using published benchmarks from NVIDIA*.

The following figure shows that tokens per dollar are higher for the Intel Gaudi solution than for the NVIDIA solution.

[Figure: Performance cost differences]


Accessing the Intel Gaudi Node in the Intel® Tiber™ AI Cloud

To access an Intel® Gaudi® node in the Intel® Tiber™ AI Cloud, go to the Intel® Tiber™ AI Cloud Console, open the hardware instances, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.


The website provides an ssh command to log in to the node. It is advisable to add local port forwarding to that command so you can reach a Jupyter Notebook running on the node, for example: ssh -L 8888:localhost:8888 ...
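The complete command might look like the following sketch; the key path, user name, and address are placeholders, so use the values shown by the console for your instance:

# Hypothetical example: substitute the key, user, and IP provided for your node
ssh -L 8888:localhost:8888 -i ~/.ssh/gaudi_key ubuntu@<node-ip-address>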

Details about setting up Jupyter Notebooks on an Intel® Gaudi® Platform are available here.
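If you prefer to launch the notebook server yourself, the following is a minimal sketch, assuming JupyterLab is not already installed on the node or in the container:

# Install and launch JupyterLab on the forwarded port 8888;
# --allow-root is needed when running as the root user (as in the Gaudi container)
pip install jupyterlab
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root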

Docker Setup

With access to the node, use the latest Intel® Gaudi® Docker image. The docker run command below automatically downloads the image (if needed) and starts the container:

docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Start the Docker image and enter the Docker environment by issuing the following command:

docker exec -it Gaudi_Docker bash

More information on Gaudi Docker setup and validation can be found here.
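Before continuing, you can confirm from inside the container that all of the accelerators are visible. This is a quick check, assuming the standard Gaudi image, which ships the hl-smi utility:

# List the Gaudi devices visible inside the container
hl-smi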

Install Prerequisites

Once in the Docker environment, install the necessary libraries.

Start in the root user's home directory and install the DeepSpeed library:

cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

Now install the Hugging Face Optimum for Intel® Gaudi® library (optimum-habana) and clone the Megatron-DeepSpeed examples from GitHub, selecting the latest validated release of each:

pip install optimum-habana==1.16.0
git clone -b 1.19.0 https://github.com/HabanaAI/Megatron-DeepSpeed.git

Next, change to the Megatron-DeepSpeed directory and install the requirements needed for training:

cd Megatron-DeepSpeed
pip install -r megatron/core/requirements.txt

Set up the correct path for Megatron-DeepSpeed:

export MEGATRON_DEEPSPEED_ROOT=`pwd`
export PYTHONPATH=$MEGATRON_DEEPSPEED_ROOT:$PYTHONPATH

Finally, set Python 3.10 as the default Python version. If Python 3.10 is not the default, replace any call to the python command in your model scripts with $PYTHON and define the environment variable as follows:

export PYTHON=/usr/bin/python3.10
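As an optional sanity check, and as a minimal sketch assuming the packages installed above, verify that the key libraries import and that the Megatron-DeepSpeed path is set:

# Verify the DeepSpeed and Optimum Habana installs and the Megatron-DeepSpeed path
$PYTHON -c "import deepspeed; print('deepspeed', deepspeed.__version__)"
$PYTHON -c "import optimum.habana; print('optimum-habana OK')"
echo "MEGATRON_DEEPSPEED_ROOT=${MEGATRON_DEEPSPEED_ROOT}"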

Download Dataset

To download the datasets used for training Llama 2, follow the directions on the Megatron-DeepSpeed GitHub page. This tutorial uses a subset of the OSCAR dataset, which is commonly used to pretrain language models and word representations.

You can download the full (500 GB+) OSCAR dataset, or download a subset of it for a quick start. These steps are based on the OSCAR dataset repository.

First, clone the dataset repository:

cd ~
git clone https://github.com/bigscience-workshop/bigscience.git
cd bigscience/data/oscar

Next, edit the file oscar-to-jsonl.py. This example downloads the zh (Chinese) subset. In the language_subsets list, uncomment unshuffled_deduplicated_zh and comment out unshuffled_deduplicated_en:

### Build/Load Datasets

# Once this part of the process completes it gets cached, so on subsequent runs it'll be much faster
language_subsets = (
     # "unshuffled_deduplicated_ar",
     # "unshuffled_deduplicated_sw",
     "unshuffled_deduplicated_zh",
     # "unshuffled_deduplicated_en",
     # "unshuffled_deduplicated_fr",
     # "unshuffled_deduplicated_pt",
     # "unshuffled_deduplicated_es",
)

Run the Python script that downloads and preprocesses the data. Note the -s option, which downloads only a subset of the dataset for the purposes of this tutorial (the operation can take some time, depending on the download speed and hardware used):

$PYTHON ./oscar-to-jsonl.py -s

When the above operation completes, the ~/bigscience/data/oscar/ directory will contain the following data files:

-rw-r--r-- 1 root root 66707628 Jul 26 00:38 oscar-0.jsonl
-rw-r--r-- 1 root root 63555928 Jul 26 00:38 oscar-1.jsonl
-rw-r--r-- 1 root root 59082488 Jul 26 00:38 oscar-2.jsonl
-rw-r--r-- 1 root root 63054515 Jul 26 00:38 oscar-3.jsonl
-rw-r--r-- 1 root root 59592060 Jul 26 00:38 oscar-4.jsonl
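Optionally, confirm the shards downloaded correctly by counting the records in each file; each line is one JSON document, and the exact counts and sizes will vary with the dataset snapshot:

# Count the records (one document per line) in each downloaded shard
wc -l oscar-*.jsonl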

The next step is to tokenize the dataset. There are different ways to tokenize a dataset; this example uses the GPT2BPETokenizer method (byte-pair encoding).

According to the directions on the Gaudi Megatron-DeepSpeed GitHub page, the five JSONL files above can either be concatenated into a single large file and then tokenized, or tokenized separately and the five tokenized files merged afterward. In this tutorial, the smaller files are processed individually to prevent possible host out-of-memory issues.

The GPT2BPETokenizer method is used to tokenize the five JSONL files separately. First, download the GPT-2 vocabulary (gpt2-vocab.json) and merges (gpt2-merges.txt) files:

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

Next, create and execute a shell script as follows. This script tokenizes the individual JSONL files one at a time and writes the tokenized output to the zh_tokenized directory. The --workers value (16 here) can be changed according to the number of cores in the CPU being used:

# tokenize individual jsonl files
# loop count will change based on number of files for a given dataset
mkdir zh_tokenized
for i in $(seq 0 4);
do
    $PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/preprocess_data.py --input oscar-${i}.jsonl --output-prefix zh_tokenized/tokenized${i} --tokenizer-type GPT2BPETokenizer --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --append-eod --workers 16
done

After the above operation completes, the zh_tokenized directory will contain the following files:

-rw-r--r-- 1 root root 93115006 Jul 26 00:47 tokenized0_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized0_text_document.idx
-rw-r--r-- 1 root root 88055238 Jul 26 00:47 tokenized1_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized1_text_document.idx
-rw-r--r-- 1 root root 82539576 Jul 26 00:47 tokenized2_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized2_text_document.idx
-rw-r--r-- 1 root root 87806904 Jul 26 00:47 tokenized3_text_document.bin
-rw-r--r-- 1 root root 166882 Jul 26 00:47 tokenized3_text_document.idx
-rw-r--r-- 1 root root 82680922 Jul 26 00:48 tokenized4_text_document.bin
-rw-r--r-- 1 root root 166862 Jul 26 00:48 tokenized4_text_document.idx

To complete the tokenization step, merge the tokenized dataset files generated above into a single file by running the following commands:

# merge tokenized files
mkdir zh_tokenized_merged
$PYTHON $MEGATRON_DEEPSPEED_ROOT/tools/merge_datasets.py --input zh_tokenized --output-prefix zh_tokenized_merged/tokenized_text_document

This creates the zh_tokenized_merged directory, which contains the following merged files:

ls -lt zh_tokenized_merged
-rw-r--r-- 1 root root 834222 Jul 26 00:57 tokenized_text_document.idx
-rw-r--r-- 1 root root 434197646 Jul 26 00:57 tokenized_text_document.bin

To make pretraining easier, copy the gpt2-merges.txt and gpt2-vocab.json files into the zh_tokenized_merged directory. Using the GPT2BPETokenizer with pretraining requires those files to be in the same directory as the data.

cp gpt2-* zh_tokenized_merged

This completes the dataset downloading and preprocessing steps.

 

 

 


 

Llama2 7B Training

Write an example script, called run_llama_wrapper.sh, to perform training of Llama 2 7B. The first part of the script enables debug logging; these Habana log-enablement environment variables are described in the debugging guide documentation:

export LOG_LEVEL_ALL=4
export ENABLE_CONSOLE=true
export HABANA_LOGS=./habana_log

Next, set up environment variables for the directories containing the model references and the data used for training:

export MODEL_REFERENCES_ROOT=${MEGATRON_DEEPSPEED_ROOT}
export HL_DATA_DIR_ROOT=~/bigscience/data/oscar/zh_tokenized_merged
export HL_DATA_FILE_PREFIX=tokenized_text_document
export OUT_DIR="Llama2-7B-training"
export HL_HOSTSFILE=/launch/hostsfile
export HL_TOKENIZER_TYPE=GPT2BPETokenizer 

The rest of the script contains variables that will control training:

mkdir -p ${OUT_DIR}

HL_SAVE=0 \
HL_EXIT_INTERVAL=80 \
HL_RESULTS_DIR=${OUT_DIR} \
HL_LOG_INTERVAL=10 \
HL_TOKENIZER_TYPE=${HL_TOKENIZER_TYPE} \
HL_NUM_NODES=1 \
HL_PP=1 HL_TP=1 HL_DP=8 \
HL_DATA_DIR_ROOT=${HL_DATA_DIR_ROOT} \
HL_LLAMA_MODEL_SIZE=7 \
HL_LLAMA_VER=2 \
HL_DATA_FILE_PREFIX=${HL_DATA_FILE_PREFIX} \
HL_ZERO_STAGE=1 \
HL_CKP_ACT=2 \
HL_SEQ_LEN=4096 \
HL_GBS=512 \
HL_USE_FAST_SOFTMAX=1 \
HL_GRAD_ACCUM_DTYPE=bf16 \
HL_USE_TRANSFORMER_ENGINE=1 \
HL_USE_CACHE_FP8_WEIGHT_FWD=1 \
HL_USE_CACHE_FP8_WEIGHT=1 \
${MODEL_REFERENCES_ROOT}/scripts/run_llama.sh 2>&1 | tee ${OUT_DIR}/llama_8x.log

Execute the script to start the training:

./run_llama_wrapper.sh &

FP8 training is enabled by setting HL_USE_TRANSFORMER_ENGINE=1; the HL_USE_CACHE_FP8_WEIGHT_FWD=1 and HL_USE_CACHE_FP8_WEIGHT=1 settings further improve FP8 performance.

Untested ZeRO Optimizer Errors

If the version of DeepSpeed being used includes an untested ZeRO optimizer, the run may terminate with the following error message:

AssertionError: You are using an untested ZeRO Optimizer. Please add <"zero_allow_untested_optimizer": true> in the configuration file to use it.

To bypass this issue, add the following entry to the heredoc (EOT) statement that creates the ds_config.json file in the ~/Megatron-DeepSpeed/scripts/run_llama.sh shell script:

  "zero_allow_untested_optimizer": true

Llama2 7B Training Results

Because performance results can vary depending on the hardware used, the results shown in this section should be considered examples rather than benchmark results. Detailed performance data for Intel® Gaudi® AI accelerators can be found here.

In a sample run of the run_llama script, the following information is reported in the output log at the end of execution (remember that the sample run ended after 80 iterations, as specified by the environment variable HL_EXIT_INTERVAL=80):

iteration 80/ 500000 | consumed samples:81920 | consumed tokens:335544320 | elapsed time per iteration (ms): 62373.1 | learning rate:1.200E-05 | global batch size: 1024 | lm loss:3.354671E+00 | loss scale:1.0 | grad norm:4.962 | num zeros:0.0 | actual seqlen:4096 | number of skipped iterations:0 | number of nan iterations:0 | samples per second:16.417 | tokens per gpu per second (tgs):8405.678 | TFLOPs:409.21 |

The total number of tokens per second is approximately:

tokens per gpu per second (tgs) × 8 HPUs ≈ 8,400 × 8 ≈ 67,200 tokens/sec
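As a minimal sketch, assuming the log line format shown above and the log path written by run_llama_wrapper.sh, the same estimate can be computed directly from the log:

# Pull the last reported tgs value from the training log and scale by 8 HPUs
awk -F'tgs\\):' '/tokens per gpu per second/ {split($2, a, "|"); t = a[1]} END {printf "total tokens/sec ~ %.0f\n", t * 8}' Llama2-7B-training/llama_8x.log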

These results align with the published numbers for Intel Gaudi 2.

Next Steps

Now that you have run a pretraining case, you can go back to the Hugging Face* Optimum Habana validated models to see more options for running training or inference.

 

 

 
