
Enable DeepSpeed on Intel® Gaudi® Processors

This tutorial provides example training scripts to demonstrate different DeepSpeed* optimization technologies on Intel® Gaudi® processors. It focuses on memory optimization technologies, including Zero Redundancy Optimizer (ZeRO) and Activation Checkpointing.


Example Overview

The PyTorch* minGPT example is based on the source code forked from the GitHub* repository minGPT.

Setup

Follow the instructions provided in the Installation Guide to set up the environment, including the $PYTHON environment variable. The guide demonstrates how to set up your system to run the model on Intel Gaudi processors.

Clone the Repository

In the Docker* container, use the following commands to clone the Gaudi-tutorials repository and switch to the branch that matches your Intel Gaudi software version. To determine the Intel Gaudi software version, run the hl-smi utility.

git clone https://github.com/HabanaAI/Gaudi-tutorials /path/to/Gaudi-tutorials
cd Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/

Install DeepSpeed*

To install Intel® Extension for DeepSpeed on Intel Gaudi software, follow the instructions provided in the DeepSpeed User Guide.

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.7.1

Memory Consumption Under Different DeepSpeed Technologies

This section shows how to run the minGPT model under different DeepSpeed configurations and provides tables of memory consumption across training phases.

Before You Begin

  • Make sure there are available Intel Gaudi accelerators. This tutorial uses eight Intel Gaudi accelerators.
  • To dump per-phase memory consumption, add --dump-memory to the command line.
  • To limit the number of training steps (for example, to 4 steps), add --steps 4 to the command line.
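The ds_config*.json files passed to --deepspeed_config in the commands below differ mainly in the ZeRO stage they select. The following is a minimal sketch of the shape such a file takes; the field values here are illustrative assumptions, not the repository's exact files:

```json
{
  "train_batch_size": 8,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```

Setting "stage" to 0, 1, or 2 selects ZeRO0, ZeRO1, or ZeRO2. In this tutorial, activation checkpointing is additionally enabled with the --activation-checkpoint command-line flag.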

Run minGPT with Different DeepSpeed Technologies

  1. Create a big model instead of the default gpt-nano model. This makes the memory variation more obvious during different phases.
    To do so, change the model type from gpt-nano to gpt2.
    --- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
    +++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
    @@ -146,7 +146,7 @@ for a, b in zip(x,y):
     from mingpt.model import GPT
    
     model_config = GPT.get_default_config()
    -model_config.model_type = 'gpt-nano'
    +model_config.model_type = 'gpt2'
     model_config.vocab_size = train_dataset.get_vocab_size()
     model_config.block_size = train_dataset.get_block_size()
  2. Run minGPT with DeepSpeed ZeRO0.
    cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
    deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory

    The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).

    | Step | Before forward | After forward | Before backward | After backward | Before step | After step | Max memory |
    |------|----------------|---------------|-----------------|-----------------|-------------|------------|------------|
    | 0 | 328 | 328 | 328 | 1726 (max 1735) | 1726 | 1402 | 1735 |
    | 1 | 1726 (max 2700) | 1726 | 1726 | 2051 (max 2384) | 2051 | 2051 | 2700 |
    | 2 | 2051 | 2051 | 2051 | 2051 (max 2384) | 2051 | 1726 | 2384 |
    | 3 | 1726 | 1726 | 1726 | 2051 (max 2384) | 2051 | 1726 | 2384 |

  3. Run minGPT with DeepSpeed ZeRO1.

    cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
    deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory

    The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).

    | Step | Before forward | After forward | Before backward | After backward | Before step | After step | Max memory |
    |------|----------------|---------------|-----------------|-----------------|-------------|------------|------------|
    | 0 | 166 | 166 | 166 | 830 (max 1056) | 830 | 835 | 1056 |
    | 1 | 672 | 672 | 672 | 695 (max 997) | 695 | 672 (max 857) | 997 |
    | 2 | 672 | 672 | 672 | 695 (max 997) | 695 | 672 (max 857) | 997 |
    | 3 | 672 | 672 | 672 | 695 (max 997) | 695 | 672 (max 857) | 997 |

  4. Run minGPT with DeepSpeed ZeRO1 and Activation Checkpointing.

    cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
    deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1_ac.json --use_hpu --steps 4 --dump-memory --activation-checkpoint

    The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).

    | Step | Before forward | After forward | Before backward | After backward | Before step | After step | Max memory |
    |------|----------------|---------------|-----------------|-----------------|-------------|------------|------------|
    | 0 | 166 | 166 | 166 | 581 (max 758) | 581 | 423 (max 586) | 758 |
    | 1 | 423 | 423 | 423 | 446 (max 755) | 446 | 423 (max 608) | 755 |
    | 2 | 423 | 423 | 423 | 446 (max 758) | 446 | 423 (max 608) | 758 |
    | 3 | 423 | 423 | 423 | 446 (max 758) | 446 | 423 (max 608) | 758 |

  5. Run minGPT with DeepSpeed ZeRO2.

    cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
    deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero2.json --use_hpu --steps 4 --dump-memory

    The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).

    | Step | Before forward | After forward | Before backward | After backward | Before step | After step | Max memory |
    |------|----------------|---------------|-----------------|-----------------|-------------|------------|------------|
    | 0 | 166 | 166 | 166 | 660 (max 993) | 660 | 682 | 993 |
    | 1 | 520 | 520 | 520 | 663 (max 935) | 663 | 523 (max 708) | 935 |
    | 2 | 523 | 523 | 523 | 568 (max 935) | 568 | 523 (max 708) | 935 |
    | 3 | 523 | 523 | 523 | 568 (max 935) | 568 | 523 (max 708) | 935 |
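The --activation-checkpoint run above trades compute for memory: intermediate activations are discarded during the forward pass and recomputed during backward. The following is a minimal CPU sketch of the same idea using PyTorch's torch.utils.checkpoint; the toy block and its dimensions are illustrative assumptions, not the tutorial's model.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

blocks = torch.nn.ModuleList(Block(64) for _ in range(4))
x = torch.randn(8, 64, requires_grad=True)

h = x
for blk in blocks:
    # Activations inside blk are not kept for backward;
    # they are recomputed when gradients are needed.
    h = checkpoint(blk, h, use_reentrant=False)

h.sum().backward()  # gradients flow as usual, at the cost of a second forward
```

This is why the "After forward" columns in the activation-checkpointing table stay flat: only the block boundaries are saved, and everything else is rebuilt on demand during backward.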

Results

  • ZeRO0 (essentially standard data-parallel training) consumes the most memory.
  • ZeRO1 and ZeRO2 consume less memory than ZeRO0.
  • Adding Activation Checkpointing reduces memory consumption further.
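This ordering matches a back-of-envelope accounting of model states per device. Following the ZeRO paper's breakdown for mixed-precision Adam (2 bytes/parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), a rough sketch is shown below; the parameter count is approximate and activations and buffers are ignored, so treat the numbers as estimates only.

```python
def zero_memory_per_device_gb(num_params, num_devices, stage):
    """Estimate per-device model-state memory (GB) under mixed-precision Adam.

    Accounting per parameter: 2 bytes fp16 weights, 2 bytes fp16 gradients,
    12 bytes fp32 optimizer states (master weights, momentum, variance).
    ZeRO1 shards optimizer states; ZeRO2 also shards gradients;
    ZeRO3 also shards the weights themselves.
    """
    params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        opt /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return (params + grads + opt) / 1024**3

# gpt2 (~124M parameters) on the 8 accelerators used in this tutorial:
for s in (0, 1, 2):
    print(f"ZeRO{s}: {zero_memory_per_device_gb(124e6, 8, s):.2f} GB/device")
```

The absolute numbers will not match the tables above (which also include activations and runtime buffers), but the relative ordering ZeRO0 > ZeRO1 > ZeRO2 does.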

Use ZeRO to Solve the Out-of-Memory Issue

Because Intel Gaudi processors have limited device memory, they may fail to run large models with the default configuration (for example, ZeRO0).

  1. Create a very big model with minGPT.
    To do so, change the model type from gpt-nano to gpt2-xl.
    --- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
    +++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
    @@ -146,7 +146,7 @@ for a, b in zip(x,y):
     from mingpt.model import GPT
    
     model_config = GPT.get_default_config()
    -model_config.model_type = 'gpt-nano'
    +model_config.model_type = 'gpt2-xl'
     model_config.vocab_size = train_dataset.get_vocab_size()
     model_config.block_size = train_dataset.get_block_size()
  2. Run minGPT with DeepSpeed ZeRO0.
    cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
    deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory

    The following example shows what an out-of-memory error looks like.

    ...
    RuntimeError: FATAL ERROR :: MODULE:BRIDGE Exception in Launch thread...
    FATAL ERROR :: MODULE:DEVMEM Allocation failed for size::40960000 (39.0625)MB
  3. Run minGPT with DeepSpeed ZeRO1.
    cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
    deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory

    Applying a ZeRO technology such as ZeRO1 allows the model to run successfully on Intel Gaudi processors.
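Rough arithmetic shows why gpt2-xl exceeds device memory at ZeRO0 but fits at ZeRO1. Assuming mixed-precision Adam (about 16 bytes of model state per parameter, of which 12 bytes are optimizer state) and gpt2-xl's roughly 1.5 billion parameters, both figures below are estimates, not measurements:

```python
# Rough per-device model-state accounting (mixed-precision Adam):
# ZeRO0 replicates everything: 16 bytes/parameter on every device.
# ZeRO1 shards the 12 bytes/parameter of optimizer state across N devices.
psi = 1.5e9   # gpt2-xl parameter count (approximate)
n = 8         # number of Gaudi accelerators used in this tutorial

zero0_gb = 16 * psi / 1024**3
zero1_gb = (2 + 2 + 12 / n) * psi / 1024**3
print(f"ZeRO0: ~{zero0_gb:.1f} GB/device, ZeRO1: ~{zero1_gb:.1f} GB/device")
```

At ZeRO0 the model states alone exceed 20 GB per device before activations are counted, while ZeRO1 brings them under 10 GB, which is consistent with the out-of-memory failure and successful run observed above.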