This tutorial demonstrates fine-tuning a GPT-2* model hosted by Hugging Face* on Intel® Gaudi® AI processors by using the Optimum Habana library developed by Intel and the Microsoft DeepSpeed* library.
Fine-Tuning Defined
Training models from scratch can be expensive, especially with today’s large-scale models. Depending on a model’s size and scale, the hardware needed to train it can cost anywhere from thousands to millions of dollars. Fine-tuning is the process of taking a neural network model that has already been trained (usually called a pretrained model) and updating it so that it performs a specific task. Assuming the original task is similar to the new task, starting from a pretrained model lets us take full advantage of the feature representations the network learned during pretraining without having to develop and train a model from scratch.
This blog focuses on transformers. Pretrained transformers can be fine-tuned quickly for numerous downstream tasks and perform well on them. Consider a pretrained transformer model that already understands language: fine-tuning then trains the model to perform question answering, language generation, named-entity recognition, sentiment analysis, or other tasks.
Given the cost and complexity of training large models, making use of pretrained models is an appealing approach. In fact, there are many publicly available pretrained models targeting specific tasks. This blog focuses on the most popular open source transformer library, Hugging Face Transformers. The Hugging Face Hub contains a wide variety of pretrained transformer models that can be used for fine-tuning.
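For example, loading a pretrained checkpoint from the Hub takes only a few lines. The following is a minimal sketch using the Transformers Auto classes; gpt2 here stands in for any causal language model you plan to fine-tune:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "gpt2"  # any causal language model available on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_name)  # download the pretrained weights
tokenizer = AutoTokenizer.from_pretrained(model_name)     # download the matching tokenizer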
Fine-Tuning Is Architecture Agnostic
The pretraining process of a model is done on a specific architecture, but this does not preclude the saved pretrained model from being used on different architectures. For example, a model pretrained using an Intel® Gaudi® AI processor can later be fine-tuned using a GPU. Or a publicly available pretrained model originally pretrained on a GPU can be loaded and trained or fine-tuned on an Intel® Gaudi® AI processor. In other words, fine-tuning a pretrained model is architecture agnostic.
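To illustrate, the same checkpoint can be loaded onto whichever accelerator is present. The sketch below assumes a Gaudi environment for the HPU branch; on a GPU system the device string would simply be cuda:
import torch
from transformers import GPT2LMHeadModel
import habana_frameworks.torch.core as htcore  # registers the HPU device on Gaudi systems
# Load a checkpoint regardless of the hardware it was pretrained on
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Move it to the accelerator available in the current environment
device = torch.device("hpu")  # use torch.device("cuda") on a GPU system instead
model.to(device)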
Obtain an Intel® Gaudi® Platform
This tutorial requires an execution environment that enables Intel® Gaudi® AI processors with the appropriate firmware, drivers, and runtime libraries. If needed, reserve the appropriate hardware in the Intel® Tiber™ AI Cloud.
The latest Docker* images for PyTorch* that support the Intel® Gaudi® processors are available on the Habana Vault and can be downloaded, started, and accessed with the following commands:
docker run -it -d --name GPT2-fine-tune --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0
docker exec -it GPT2-fine-tune /bin/bash
Note Successful use of these commands requires a properly installed and configured Habana container runtime that supports the Docker server. For installation and configuration instructions, see the Intel® Gaudi® software documentation.
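Once inside the container, you can optionally confirm that the Gaudi cards are visible. hl-smi is the device status tool that ships with the Intel® Gaudi® software stack; this check is not part of the original tutorial steps:
hl-smi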
Install Optimum-Habana and Other Frameworks and Requirements
After obtaining the base Intel® Gaudi® software execution environment, install Optimum-Habana, DeepSpeed, and the required modules that support language modeling:
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && python3 setup.py install
cd examples/language-modeling && pip install -r requirements.txt
cd ~ && pip install git+https://github.com/HabanaAI/DeepSpeed.git
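As an optional sanity check (an added step, not part of the original instructions), confirm that the main packages import cleanly before moving on:
python3 -c "import habana_frameworks.torch.core; import optimum.habana; import deepspeed; print('environment OK')"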
Fine-Tune the Model
To fine-tune the GPT-2 model, create the following main.py Python* script:
from optimum.habana.distributed import DistributedRunner
training_args = {
    "output_dir": "/tmp/clm_gpt2_xl",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-2-raw-v1",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_checkpointing": True,
    "do_train": True,
    "do_eval": True,
    "overwrite_output_dir": True,
}
model_name = "gpt2-xl"
training_args["model_name_or_path"] = model_name
training_args["use_habana"] = True # Whether to use HPUs or not
training_args["use_lazy_mode"] = True # Whether to use lazy or eager mode
training_args["gaudi_config_name"] = "Habana/gpt2" # Gaudi configuration to use
training_args["deepspeed"] = "optimum-habana/tests/configs/deepspeed_zero_2.json"
# Build the command to execute
training_args_command_line = " ".join(f"--{key} {value}" for key, value in training_args.items())
command = f"optimum-habana/examples/language-modeling/run_clm.py {training_args_command_line}"
# Instantiate a distributed runner
distributed_runner = DistributedRunner(
    command_list=[command],  # The command(s) to execute
    world_size=8,  # The number of HPUs
    use_deepspeed=True,  # Enable DeepSpeed
)
# Launch training
ret_code = distributed_runner.run()
This code fine-tunes the pretrained GPT-2 model on the WikiText dataset and places the fine-tuned weights in the /tmp/clm_gpt2_xl directory. When multiple Intel® Gaudi® AI processors are available, training runs in distributed mode through the DeepSpeed library.
Note The model_name_or_path argument specifies the checkpoint whose weights are loaded to initialize the model before fine-tuning.
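For reference, the command string that the script assembles and hands to the distributed runner expands to roughly the following set of arguments for run_clm.py (the runner then launches it through DeepSpeed on the available cards):
optimum-habana/examples/language-modeling/run_clm.py \
  --output_dir /tmp/clm_gpt2_xl --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 \
  --gradient_checkpointing True --do_train True --do_eval True --overwrite_output_dir True \
  --model_name_or_path gpt2-xl --use_habana True --use_lazy_mode True \
  --gaudi_config_name Habana/gpt2 --deepspeed optimum-habana/tests/configs/deepspeed_zero_2.json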
Now, run the code:
PT_HPU_LAZY_MODE=1 python main.py
On an Intel® Gaudi® 2 platform with eight cards available, the run produces output similar to the following:
***** train metrics *****
epoch = 1.0
max_memory_allocated (GB) = 12.36
memory_allocated (GB) = 6.04
total_flos = 19723388GF
total_memory_available (GB) = 94.62
train_loss = 2.6473
train_runtime = 0:01:51.02
train_samples = 2318
train_samples_per_second = 20.878
train_steps_per_second = 0.657
01/31/2025 21:53:31 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:1881] 2025-01-31 21:53:31,010 >>
***** Running Evaluation *****
***** eval metrics *****
epoch = 1.0
eval_accuracy = 0.4806
eval_loss = 2.5549
eval_runtime = 0:00:16.02
eval_samples = 240
eval_samples_per_second = 14.98
eval_steps_per_second = 0.499
max_memory_allocated (GB) = 12.36
memory_allocated (GB) = 6.04
perplexity = 12.8706
total_memory_available (GB) = 94.62
The /tmp/clm_gpt2_xl directory now contains the fine-tuned version of the model.
Use the New Fine-Tuned Model for Text Prediction
To test the new fine-tuned model, create a new file called test.py with the following contents:
# The sequence to complete
prompt_text = "Contrary to the common belief, Chocolate is actually good for you because "
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import habana_frameworks.torch.core as htcore
path_to_model = "/tmp/clm_gpt2_xl" # the folder where everything related to our run was saved
device = torch.device("hpu")
# Load the tokenizer and the model
tokenizer = GPT2Tokenizer.from_pretrained(path_to_model)
model = GPT2LMHeadModel.from_pretrained(path_to_model)
model.to(device)
# Encode the prompt
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
encoded_prompt = encoded_prompt.to(device)
# Generate the continuation of the prompt
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=16 + len(encoded_prompt[0]),
    do_sample=True,
    num_return_sequences=1,
)
# Remove the batch dimension when returning multiple sequences
if len(output_sequences.shape) > 2:
    output_sequences.squeeze_()
generated_sequences = []
for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    print(f"=== GENERATED SEQUENCE {generated_sequence_idx + 1} ===")
    generated_sequence = generated_sequence.tolist()
    # Decode text
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
    # Keep only the text up to the first period
    text = text[: text.find(".")]
    # Add the prompt at the beginning of the sequence and remove the excess text used for preprocessing
    total_sequence = (
        prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
    )
    generated_sequences.append(total_sequence)
    print(total_sequence)
Note The /tmp/clm_gpt2_xl directory is specified as the path_to_model in the test.py script, so the fine-tuned version of the model will be used to generate a response.
To test the fine-tuned model, run the test.py script:
PT_HPU_LAZY_MODE=1 python test.py
Because sampling is enabled, each run varies; this command produces results similar to the following:
=== GENERATED SEQUENCE 1 ===
Contrary to the common belief, Chocolate is actually good for you because
It provides you with a balanced blood sugar level, and reduces your risk of high blood pressure and improves blood flow.
What’s next?
Try running the model with different prompts and configurations. For more information, see the Optimum-Habana documentation.
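As a starting point, the sampling behavior in test.py can be adjusted through the standard generate() arguments; the values below are only illustrative examples, not recommendations:
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=64 + len(encoded_prompt[0]),  # generate a longer continuation
    do_sample=True,
    temperature=0.8,          # lower values make sampling more conservative
    top_k=50,                 # sample only from the 50 most likely next tokens
    top_p=0.95,               # nucleus sampling threshold
    num_return_sequences=3,   # produce several candidate continuations
)
You can also point path_to_model at the base gpt2-xl checkpoint from the Hub to compare its output against the fine-tuned model.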