
Develop Solutions on Intel® Gaudi® AI Accelerators

 

 


Run an Inference Use Case on the Intel® Gaudi® 2 AI Accelerator

Learn how to select a model, set up the environment, run the workload, and then see a price-performance comparison. The accelerator supports PyTorch* as the main framework for inference.

The following steps guide you through how to:

  • Get access to an Intel Gaudi AI accelerator node on the Intel® Tiber™ AI Cloud.
  • Ensure that all the software is installed and configured properly by running the PyTorch* version of the Docker* image for the accelerator.
  • Select the model to run by loading the desired model repository and the appropriate libraries for model acceleration.
  • Run the model and extract the details for evaluation.

There are four methods for running inference on models:

  1. Using Hugging Face* models with the Optimum for Intel Gaudi AI accelerators library.
  2. Using the Intel Gaudi AI accelerators model references repository, which provides built-in PyTorch models.
  3. Using the GPU Migration toolkit to automatically convert GPU-based models to be compatible with Intel Gaudi AI accelerators.
  4. Manually migrating PyTorch models from the public domain.

The Optimum for Intel Gaudi AI accelerators library and the model-reference repository contain fully optimized and fully documented model examples. Use them as a starting point for running a model.

This example shows model inference with Hugging Face by running the Meta* Llama-2-70b model using the Optimum for Intel Gaudi AI accelerators library. Hugging Face models are always run against an associated task; here, inference uses the text-generation task.
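For context, here is a minimal sketch of what the Hugging Face text-generation task looks like with the plain transformers pipeline API, using the small GPT-2 checkpoint as a stand-in; the actual Gaudi runs in this example use the optimum-habana run_generation.py script shown later rather than this generic path.

# Minimal sketch of the Hugging Face "text-generation" task (generic path).
# The small "gpt2" checkpoint is only a stand-in for illustration; the Gaudi
# runs below use the optimum-habana run_generation.py example script instead.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
outputs = generator("Deep learning accelerators are", max_new_tokens=32)
print(outputs[0]["generated_text"])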

 

Performance Evaluation

Before running the model, let's look at the performance measurements and a price-performance comparison against an equivalent H100 inference run. In this case, the 70-billion-parameter Llama-2-70b model was run in FP8 with 128 input tokens, 2048 output tokens, and four Intel Gaudi AI accelerators.

The tokens per dollar, or inference runs per dollar, are significantly higher than those of the NVIDIA solution (a simple way to compute this metric is sketched after the chart below).

  • View the Intel Gaudi benchmarks and performance data.
  • The model is compared to the same model configuration on the H100 GPU, using NVIDIA*-published inference benchmarks from June 25, 2024.

[Chart: Performance cost differences]
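As a rough illustration of how such a price-performance metric can be computed, the following sketch converts a throughput figure and an hourly instance price into tokens per dollar. The numeric values in it are placeholders, not measured or published results.

# Illustrative tokens-per-dollar calculation. All numeric values below are
# placeholders, not benchmark results; substitute your own measured throughput
# and your cloud provider's hourly instance pricing.
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    # Tokens generated in one hour divided by the cost of that hour.
    return tokens_per_second * 3600 / price_per_hour

example = tokens_per_dollar(tokens_per_second=9000.0, price_per_hour=10.0)  # placeholder values
print(f"~{example:,.0f} tokens per dollar")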


Runtime Instructions

The following instructions cover setting up the node and the model infrastructure, and the full runtime commands for the model.

Accessing the Intel® Gaudi® Node

To access an Intel® Gaudi® node in the Intel® Tiber™ AI Cloud, go to the Intel® Tiber™ AI Cloud console, open the hardware instances, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.

The website provides an ssh command to log in to the node. It's advisable to add local port forwarding to that command so you can reach a local Jupyter Notebook; for example, add -L 8888:localhost:8888 to the provided ssh command to access the notebook.

Details about setting up Jupyter Notebooks on an Intel® Gaudi® Platform are available here.

Docker Setup

With access to the node, use the latest Docker image by first issuing the docker run command, which automatically downloads the image and starts the container:

docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Enter the running container by issuing the following command:

docker exec -it Gaudi_Docker bash

More information on Gaudi Docker setup and validation can be found here.

Model Setup

In the running Docker environment, install the remaining libraries and model repositories:

Start in the home directory and install the DeepSpeed* library; DeepSpeed is used to reduce memory consumption on Intel® Gaudi® while running large language models.

cd ~
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

Now install the Hugging Face Optimum for Intel® Gaudi® library and clone the GitHub examples, selecting the latest validated release of optimum-habana:

pip install optimum-habana==1.16.0
git clone -b v1.16.0 https://github.com/huggingface/optimum-habana

Then, change to the text-generation example directory and install the final set of requirements to run the model:

cd ~/optimum-habana/examples/text-generation
pip install -r requirements.txt
pip install -r requirements_lm_eval.txt

How to Access and Use the Llama 2 Model

Use of the pretrained model is subject to compliance with third-party licenses, including the "Llama 2 Community License Agreement" (LLAMAV2). For guidance on the intended use of the Llama 2 model, what is considered misuse or out of scope, who the intended users are, and additional terms, review and follow the model's instructions. Users bear sole liability and responsibility for following and complying with any third-party licenses, and Habana Labs disclaims and bears no liability with respect to users' use of or compliance with third-party licenses.

To run gated models such as Llama-2-70b-hf, you need to do the following:

  • Have a Hugging Face account and agree to the model's terms of use on its model card on the Hugging Face hub.
  • Create a read token and request access to the Llama 2 model from meta-llama.
  • Log in to your account using the Hugging Face CLI (a programmatic alternative is sketched after this list):
huggingface-cli login --token <your_hugging_face_token_here>
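If you prefer to authenticate from Python instead of the CLI, a minimal sketch using the huggingface_hub library (installed with the Hugging Face stack) looks like this; the token string is a placeholder for your own read token.

# Programmatic alternative to `huggingface-cli login`.
# The token below is a placeholder; use the read token created for your account.
from huggingface_hub import login

login(token="<your_hugging_face_token_here>")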

If you want to run inference with the associated Jupyter Notebook, see the run and fine-tune addendum section for Jupyter Notebook setup; you can then run these steps directly in the Jupyter interface.

 

 

 

  • Intel® Tiber® AI Cloud
  • Text-generation example on GitHub
  • PyTorch Inference Jupyter Notebook

 

Running the Llama 2 70B Model Using the FP8 Datatype

Note To learn more about Intel® Gaudi® FP8 quantization, see the user guide.

First, run a quantization measurement by invoking the local quantization tool with the maxabs_measure.json file, which is already included in the example's quantization_config folder on GitHub.

PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lm_eval.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
-o acc_70b_bs1_measure4.txt \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--bf16 \
--batch_size 1 \
--use_flash_attention \
--flash_attention_recompute

Note The model will ask permission to run custom code associated with dataset loading. If this is acceptable, answer yes and execution will proceed.

The code generates a set of measurement values in an hqt_output folder, showing which operations were converted to the FP8 datatype. A quick way to inspect these files is sketched after the listing below.

-rw-r--r--  1 root root 297695 Dec 22 02:31 measure_hooks_maxabs_0_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_0_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_0_8_mod_list.json
-rw-r--r--  1 root root 297684 Dec 22 02:31 measure_hooks_maxabs_1_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_1_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_1_8_mod_list.json
-rw-r--r--  1 root root 297751 Dec 22 02:32 measure_hooks_maxabs_2_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:32 measure_hooks_maxabs_2_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:32 measure_hooks_maxabs_2_8_mod_list.json
-rw-r--r--  1 root root 297751 Dec 22 02:31 measure_hooks_maxabs_3_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_3_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_3_8_mod_list.json
-rw-r--r--  1 root root 297710 Dec 22 02:32 measure_hooks_maxabs_4_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:32 measure_hooks_maxabs_4_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:32 measure_hooks_maxabs_4_8_mod_list.json
-rw-r--r--  1 root root 297835 Dec 22 02:32 measure_hooks_maxabs_5_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:32 measure_hooks_maxabs_5_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:32 measure_hooks_maxabs_5_8_mod_list.json
-rw-r--r--  1 root root 297764 Dec 22 02:31 measure_hooks_maxabs_6_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_6_8.npz
-rw-r--r--  1 root root  40297 Dec 22 02:31 measure_hooks_maxabs_6_8_mod_list.json
-rw-r--r--  1 root root 297655 Dec 22 02:31 measure_hooks_maxabs_7_8.json
-rw-r--r--  1 root root 156380 Dec 22 02:31 measure_hooks_maxabs_7_8.npz
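A quick, assumption-light way to sanity-check the measurement output is to load each *_mod_list.json file and report its size. This sketch only assumes the files are valid JSON; the exact schema is defined by the quantization tool.

# Report the size and top-level type of each *_mod_list.json measurement file.
# Assumes only that the files are valid JSON; the schema itself is defined by
# the quantization tool.
import json
from pathlib import Path

for path in sorted(Path("hqt_output").glob("*_mod_list.json")):
    with path.open() as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{path.name}: {type(data).__name__} with {size} top-level entries")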

You can use these measurements to run the throughput benchmark of the model. In this case, a standard input prompt is used:

PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--max_new_tokens 2048 \
--max_input_tokens 128 \
--bf16 \
--batch_size 210 \
--use_flash_attention \
--flash_attention_recompute

Notice that the maxabs_quant.json config file is used instead of the measurement file, and additional input and output parameters are added: --max_new_tokens 2048 determines the number of output tokens generated, and --max_input_tokens 128 defines the number of input tokens.

You can now see the final values that align with the published numbers.

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 9095.210106548433 tokens/second
Memory allocated                    = 93.69 GB
Max memory allocated                = 94.57 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 164.68858193303458 seconds
----------------------------------------------------------------------------------
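As a rough sanity check on what that throughput means in wall-clock terms, the following back-of-the-envelope calculation estimates the time to generate one full batch from the reported figure; it ignores tokenization, warm-up, and graph-compilation overhead, so treat it only as an approximation.

# Rough wall-clock estimate from the reported throughput figure above.
# Ignores tokenization, warm-up, and graph-compilation time.
batch_size = 210                     # --batch_size used in the run
max_new_tokens = 2048                # --max_new_tokens used in the run
throughput_tokens_per_s = 9095.21    # reported "Throughput (including tokenization)"

batch_seconds = batch_size * max_new_tokens / throughput_tokens_per_s
print(f"~{batch_seconds:.0f} s to generate one batch "
      f"of {batch_size} sequences (~{batch_seconds / 60:.1f} minutes)")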

Note These performance numbers were generated using Gaudi 2 devices (HL-225). Better throughput can be achieved on Gaudi 3 devices by running with larger --batch_size values.

Next Steps

Now that you have run a full inference case, you can go back to the Hugging Face Optimum Intel® Gaudi® validated models to see more options for running inference.

 

 

 
