Develop Solutions on Intel® Gaudi® AI Accelerators

 

 

 

 


Fine-Tune Use Case on Intel® Gaudi® 2 AI Accelerators

Learn how to run a typical model fine-tuning use case on the Intel® Gaudi® AI accelerator. Select a model, set up the environment, and run the workload. Intel Gaudi accelerators support PyTorch* as the main framework for fine-tuning.

Run Fine-Tuning

Fine-tuning on the Intel Gaudi AI accelerator is streamlined, and the code takes you step-by-step through the following items:

  • Get access to a node for the Intel Gaudi AI accelerator on the Intel® Tiber™ AI Cloud.
  • Ensure that all the software is installed and configured properly by running the PyTorch* version of the Docker* image for the accelerator.
  • Select the model to run by loading the desired model repository and appropriate libraries for model acceleration.
  • Run the model and extract the details for evaluation.

Access Models

Models for fine-tuning can be accessed in four main ways:

  1. Using Hugging Face* models with the Optimum for Intel Gaudi library at Hugging Face.
  2. Using the Intel Gaudi AI accelerator model references repository to use built-in PyTorch models.
  3. Using the GPU Migration toolkit to automatically convert GPU-based models to be compatible with Intel Gaudi AI accelerators.
  4. Manual migration from PyTorch models in the public domain.

The Optimum for Intel Gaudi library at Hugging Face and the model-reference repository contain fully optimized and fully documented model examples. Use them as a starting point for running a model.

This example shows model fine-tuning with Hugging Face by running the Meta* Llama-3-70b-Instruct model using the Optimum for Intel Gaudi library at Hugging Face. Since Hugging Face models are used with an associated task, run fine-tuning with the language-modeling task.

 


Runtime Instructions

The following are the run instructions needed to set up the node, the model infrastructure, and the full runtime for the model.

Accessing the Intel® Gaudi® Node 

To access an Intel® Gaudi® node in the Intel® Tiber™ AI Cloud, go to the Intel® Tiber™ AI Cloud console, open the hardware instances, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.

The console provides an SSH command to log in to the node. It is advisable to add local port forwarding to that command so you can access a Jupyter Notebook running on the node; for example, ssh -L 8888:localhost:8888 .. makes the notebook reachable at localhost:8888.

Details about setting up Jupyter Notebooks on an Intel® Gaudi® Platform are available here.
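As a minimal sketch of that setup (the user, address, and the JupyterLab choice here are placeholders and assumptions, not values from the console), the two steps might look like this:

# On your workstation: log in with local port forwarding so that port 8888 on
# the node is reachable as localhost:8888 locally. Replace <user>@<node-address>
# with the login shown in the Intel Tiber AI Cloud console.
ssh -L 8888:localhost:8888 <user>@<node-address>

# On the Gaudi node: install and start a notebook server on port 8888.
pip install jupyterlab
jupyter lab --no-browser --port 8888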

Docker Setup  

With access to the node, use the latest Intel® Gaudi® Docker image by first calling the Docker run command, which automatically downloads the image and starts the container:

docker run -itd --name Gaudi_Docker \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--ipc=host \
vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Enter the running Docker environment by issuing the following command:

docker exec -it Gaudi_Docker bash

More information on Gaudi Docker setup and validation can be found here.
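As a quick sanity check inside the container, and assuming the environment includes the hl-smi utility (the Gaudi counterpart to nvidia-smi, shipped with the standard Gaudi software stack), confirm that the accelerator cards are visible:

# List the Gaudi devices visible inside the container, along with their
# utilization and memory; an Intel Gaudi 2 node should report eight HPUs.
hl-smi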

Model Setup  

Once the Docker environment is running, install the remaining libraries and model repositories.

Start in the home directory and install the DeepSpeed library. DeepSpeed reduces memory consumption on Intel® Gaudi® accelerators while running large language models.

cd ~ 
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0
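Optionally, confirm that the Habana fork of DeepSpeed installed correctly before continuing:

# Import DeepSpeed and print its version to confirm the installation succeeded.
python3 -c "import deepspeed; print(deepspeed.__version__)"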

Now install the Hugging Face Optimum for Intel Gaudi library and its GitHub examples, selecting the latest validated release of optimum-habana:

pip install optimum-habana==1.16.0
git clone -b v1.16.0 https://github.com/huggingface/optimum-habana 

Finally, transition to the language-modeling example and install the final set of requirements to run the model: 

cd ~/optimum-habana/examples/language-modeling  
pip install -r requirements.txt 
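Optionally, verify that the Optimum for Intel Gaudi library is importable; a minimal check (assuming the GaudiTrainer entry point, which the library exposes) is:

# Import the Gaudi-enabled Trainer class to confirm optimum-habana is installed.
python3 -c "from optimum.habana import GaudiTrainer; print('optimum-habana is ready')"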

How to Access and Use the Llama 3 Model 

Use of the pretrained model is subject to compliance with third-party licenses, including the “META LLAMA 3 COMMUNITY LICENSE AGREEMENT”. For guidance on the intended use of the Llama 3 model, what is considered misuse and out-of-scope use, who the intended users are, and additional terms, please review and read the instructions. Users bear sole liability and responsibility to follow and comply with any third-party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use of or compliance with third-party licenses. To run gated models like this Llama-3-70B, perform the following steps:

  • Have a Hugging Face account and agree to the terms of use of the model in its model card on the Hugging Face Hub.
  • Create a read token and request access to the Llama 3 model from meta-llama.
  • Log in to your account using the Hugging Face CLI:
huggingface-cli login --token <your_hugging_face_token_here> 

To run with the associated Jupyter Notebook for fine-tuning, see the Run and Fine-Tune section for setting up the Jupyter Notebook. You can run these steps directly in the Jupyter interface.
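After logging in, you can confirm that the token was accepted by asking the CLI which account it is authenticated as:

# Print the Hugging Face account associated with the stored token; an error here
# means the login step above did not complete.
huggingface-cli whoami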

 

 

 


Run and Fine-Tune

Fine-tuning a Simple GPT Model 

Start with a simple example of fine-tuning from the Hugging Face* language-modeling page. It uses the wikitext dataset to fine-tune the gpt2 model. Fine-tuning this model takes only a few minutes, and the fine-tuned model output is placed in the test-clm folder.

python run_clm.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--overwrite_output_dir \
--report_to none \
--output_dir ./test-clm \
--gaudi_config_name Habana/gpt2 \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs \
--throughput_warmup_steps 3
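After the run finishes, a quick way to sanity-check the result is to load the checkpoint from the output folder and generate a short continuation. This is a minimal sketch that assumes the script saved both the model and the tokenizer to ./test-clm (the Hugging Face example scripts do this by default); the prompt text is arbitrary.

# Load the fine-tuned GPT-2 checkpoint and generate a few tokens (on CPU by default).
python3 - <<'EOF'
from transformers import pipeline

generator = pipeline("text-generation", model="./test-clm")
print(generator("The history of Wikipedia", max_new_tokens=20)[0]["generated_text"])
EOF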

Fine-tuning the Llama 3 70B Model 

Once simple fine-tuning is complete, run the full Llama 3 70B model for fine-tuning. Since Llama 3 70B is a large model, employ the DeepSpeed* library to manage the local HBM memory on each Intel Gaudi card more efficiently. This example also deploys some additional techniques for fine-tuning:

  • Parameter Efficient fine-tuning (PEFT) is a strategy for adapting large pre-trained language models to specific tasks.  Instead of fine-tuning the entire pre-trained model, PEFT adds a task-specific layer or a few task-specific layers on top of the pre-trained model. These additional layers are relatively smaller and have fewer parameters compared to the base model.

  • DeepSpeed significantly optimizes training efficiency, reducing both computational and memory requirements. It enables the handling of extremely large models by providing advanced parallelism techniques and memory optimization strategies.

  • Flash attention is used to reduce memory usage and enhance computational speed through a fused implementation. This includes FusedSDPA (fused scaled dot product attention), which applies similar principles in the Intel Gaudi processor environment, optimizing the scaled dot product attention function for reduced memory usage and faster performance while maintaining compatibility with standard PyTorch* functionality.

  • Setting epochs = 2 is enough to ensure that the training loss drops below 1.0; running more epochs is not needed.

PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--deepspeed llama2_ds_zero3_config.json \
--dataset_name tatsu-lab/alpaca \
--bf16 True \
--output_dir ./llama3_fine_tuning_output \
--num_train_epochs 2 \
--max_seq_len 2048 \
--per_device_train_batch_size 10 \
--per_device_eval_batch_size 10 \
--gradient_checkpointing \
--evaluation_strategy epoch \
--eval_delay 2 \
--save_strategy no \
--learning_rate 0.0018 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--dataset_concatenation \
--attn_softmax_bf16 True \
--do_train \
--do_eval \
--use_habana \
--use_lazy_mode \
--pipelining_fwd_bwd \
--throughput_warmup_steps 3 \
--report_to none \
--lora_rank 4 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--validation_split_percentage 4 \
--use_flash_attention True \
--flash_attention_causal_mask True

The result of the run shows that the fine-tuning of the model required only 38 minutes and achieved 2.2 samples (or sentences) per second. 

***** train metrics ***** 
  epoch                       =        2.0 
  max_memory_allocated (GB)   =      94.53 
  memory_allocated (GB)       =      27.15 
  total_flos                  =  1037280GF 
  total_memory_available (GB) =      94.62 
  train_loss                  =     1.1525 
  train_runtime               = 0:38:47.30 
  train_samples_per_second    =      2.221 
  train_steps_per_second      =      0.028 

The output of the run is in the llama3_fine_tuning_output folder. The fine-tuned weights are saved as adapter_model.safetensors, which contains the additional weights generated by the parameter-efficient fine-tuning. These weights can be used for inference.
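As an illustrative sketch of that inference step (not verified against a specific release: the text-generation example path, the --peft_model flag, and the prompt are assumptions here, so check the optimum-habana text-generation README for the exact options), the adapter can be applied on top of the base model with DeepSpeed across the eight cards:

# Move to the text-generation example and install its requirements.
cd ~/optimum-habana/examples/text-generation
pip install -r requirements.txt

# Generate with the base Llama 3 70B model plus the LoRA adapter produced above.
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--peft_model ~/optimum-habana/examples/language-modeling/llama3_fine_tuning_output \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 128 \
--prompt "Explain parameter-efficient fine-tuning in one paragraph."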

 

 

 
