Example Overview
The PyTorch* minGPT example is based on source code forked from the minGPT GitHub* repository.
Setup
Follow the instructions provided in the Installation Guide to set up the environment, including the $PYTHON environment variable. The guide demonstrates how to set up your system to run the model on Intel Gaudi processors.
Clone the Repository
In the Docker* container, use the following code to clone the Intel Gaudi tutorials repository (Gaudi-tutorials) and switch to the branch that matches your Intel Gaudi software version. To determine the Intel Gaudi software version, run the hl-smi utility.
git clone https://github.com/HabanaAI/Gaudi-tutorials /path/to/Gaudi-tutorials
cd Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/
Install DeepSpeed*
To install Intel® Extension for DeepSpeed on Intel Gaudi software, follow the instructions provided in the DeepSpeed User Guide.
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.7.1
Memory Consumption Under Different DeepSpeed Technologies
This section shows how to run two models and then provides charts that show memory consumption across phases.
Before You Begin
- Make sure Intel Gaudi accelerators are available. This tutorial uses eight Intel Gaudi accelerators.
- To dump memory statistics, add --dump-memory to the command line.
- To limit the number of training steps (for example, to 4 steps), add --steps 4 to the command line.
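The --dump-memory option reports device memory at each phase boundary of a training step. As a rough illustration of that kind of instrumentation (not the tutorial's actual implementation), a phase logger might look like the following sketch, where `get_mem` stands in for a device-specific query such as the memory-allocated counter on the accelerator:

```python
# Hypothetical sketch of per-phase memory logging, roughly what a
# --dump-memory option might record. `get_mem` is a stand-in for a
# device-specific memory query; it is not part of the tutorial code.
def make_memory_dumper(get_mem):
    records = []

    def dump(step, phase):
        mb = get_mem() // (1024 * 1024)  # bytes -> MB
        records.append((step, phase, mb))
        print(f"step {step}: {phase}: {mb} MB")

    return dump, records

# Demonstration with a fake memory source (values in MB):
fake = iter([328, 328, 328, 1726, 1726, 1402])
dump, records = make_memory_dumper(lambda: next(fake) * 1024 * 1024)
for phase in ("before forward", "after forward", "before backward",
              "after backward", "before step", "after step"):
    dump(0, phase)
```

Logging at these six points per step is what produces the phase columns in the tables below.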
Run minGPT with Different DeepSpeed Technologies
- Create a big model instead of the default gpt-nano model. This makes the memory variation more obvious during different phases.
To do so, change the model type from gpt-nano to gpt2:

--- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
+++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
@@ -146,7 +146,7 @@ for a, b in zip(x,y):
 from mingpt.model import GPT
 model_config = GPT.get_default_config()
-model_config.model_type = 'gpt-nano'
+model_config.model_type = 'gpt2'
 model_config.vocab_size = train_dataset.get_vocab_size()
 model_config.block_size = train_dataset.get_block_size()
- Run minGPT with DeepSpeed ZeRO0.
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory
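The ds_config.json file ships with the tutorial and is not reproduced here. A minimal DeepSpeed configuration at ZeRO stage 0 (no partitioning, equivalent to plain data parallelism) generally looks like the following sketch; the batch size and optimizer values are illustrative, not the tutorial's actual settings:

```json
{
  "train_batch_size": 64,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "zero_optimization": {
    "stage": 0
  }
}
```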
The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).
| Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
|------|---------------------|--------------------|----------------------|---------------------|------------------|-----------------|-----------------|
| 0    | 328                 | 328                | 328                  | 1726 (max 1735)     | 1726             | 1402            | 1735            |
| 1    | 1726 (max 2700)     | 1726               | 1726                 | 2051 (max 2384)     | 2051             | 2051            | 2700            |
| 2    | 2051                | 2051               | 2051                 | 2051 (max 2384)     | 2051             | 1726            | 2384            |
| 3    | 1726                | 1726               | 1726                 | 2051 (max 2384)     | 2051             | 1726            | 2384            |
- Run minGPT with DeepSpeed ZeRO1.
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory
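The tutorial's ds_config_zero1.json is not reproduced here. In standard DeepSpeed configuration terms, the key difference from the ZeRO0 run is the zero_optimization stage: stage 1 partitions the optimizer states across devices. An illustrative fragment:

```json
{
  "zero_optimization": {
    "stage": 1
  }
}
```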
The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).
| Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
|------|---------------------|--------------------|----------------------|---------------------|------------------|-----------------|-----------------|
| 0    | 166                 | 166                | 166                  | 830 (max 1056)      | 830              | 835             | 1056            |
| 1    | 672                 | 672                | 672                  | 695 (max 997)       | 695              | 672 (max 857)   | 997             |
| 2    | 672                 | 672                | 672                  | 695 (max 997)       | 695              | 672 (max 857)   | 997             |
| 3    | 672                 | 672                | 672                  | 695 (max 997)       | 695              | 672 (max 857)   | 997             |
- Run minGPT with DeepSpeed ZeRO1 and Activation Checkpointing.
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1_ac.json --use_hpu --steps 4 --dump-memory --activation-checkpoint
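The config file name suggests ds_config_zero1_ac.json combines ZeRO1 with activation checkpointing, which recomputes activations during the backward pass instead of keeping them in memory. In standard DeepSpeed configurations, checkpointing options live in an activation_checkpointing section; the following is a hypothetical sketch, not the tutorial's exact file:

```json
{
  "zero_optimization": {
    "stage": 1
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false
  }
}
```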
The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).
| Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
|------|---------------------|--------------------|----------------------|---------------------|------------------|-----------------|-----------------|
| 0    | 166                 | 166                | 166                  | 581 (max 758)       | 581              | 423 (max 586)   | 758             |
| 1    | 423                 | 423                | 423                  | 446 (max 755)       | 446              | 423 (max 608)   | 755             |
| 2    | 423                 | 423                | 423                  | 446 (max 758)       | 446              | 423 (max 608)   | 758             |
| 3    | 423                 | 423                | 423                  | 446 (max 758)       | 446              | 423 (max 608)   | 758             |
- Run minGPT with DeepSpeed ZeRO2.
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero2.json --use_hpu --steps 4 --dump-memory
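At stage 2, ZeRO partitions gradients in addition to optimizer states. The zero_optimization section of a typical ds_config_zero2.json would set the stage accordingly (an illustrative sketch, not the tutorial's exact file):

```json
{
  "zero_optimization": {
    "stage": 2
  }
}
```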
The following table shows the memory consumption in different training phases and the maximum memory consumption (in MB).
| Step | Before forward (MB) | After forward (MB) | Before backward (MB) | After backward (MB) | Before step (MB) | After step (MB) | Max memory (MB) |
|------|---------------------|--------------------|----------------------|---------------------|------------------|-----------------|-----------------|
| 0    | 166                 | 166                | 166                  | 660 (max 993)       | 660              | 682             | 993             |
| 1    | 520                 | 520                | 520                  | 663 (max 935)       | 663              | 523 (max 708)   | 935             |
| 2    | 523                 | 523                | 523                  | 568 (max 935)       | 568              | 523 (max 708)   | 935             |
| 3    | 523                 | 523                | 523                  | 568 (max 935)       | 568              | 523 (max 708)   | 935             |
Results
- ZeRO0 (essentially the default DDP behavior, with no partitioning) consumes the most memory.
- ZeRO1 and ZeRO2 consume less memory than ZeRO0.
- Adding activation checkpointing reduces memory consumption even further.
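This ordering matches the well-known ZeRO model-state accounting: with mixed-precision Adam, each parameter needs roughly 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states; ZeRO1 shards the optimizer states across devices, and ZeRO2 also shards the gradients. A rough sketch (the byte counts follow the standard ZeRO accounting; measured numbers above also include activations, which is what activation checkpointing reduces):

```python
def model_state_bytes_per_param(num_devices, stage):
    """Approximate per-device model-state bytes per parameter under ZeRO,
    assuming mixed-precision Adam: 2 B fp16 weights + 2 B fp16 grads
    + 12 B fp32 optimizer states (master weights, momentum, variance).
    Stage 1 shards optimizer states; stage 2 also shards gradients."""
    weights, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:
        opt /= num_devices
    if stage >= 2:
        grads /= num_devices
    return weights + grads + opt

# gpt2 is roughly 124M parameters; this tutorial uses 8 devices.
num_params, devices = 124e6, 8
for stage in (0, 1, 2):
    total_mb = num_params * model_state_bytes_per_param(devices, stage) / 2**20
    print(f"ZeRO{stage}: ~{total_mb:.0f} MB of model states per device")
```

The estimate shows why the higher ZeRO stages leave the most headroom for activations and temporary buffers.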
Use ZeRO to Solve the Out-of-Memory Issue
Because device memory is limited, Intel Gaudi processors may fail to run large models with the default configuration (for example, ZeRO0).
- Create a very big model with minGPT.
To do so, change the model type from gpt-nano to gpt2-xl:

--- a/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
+++ b/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed/demo_ds.py
@@ -146,7 +146,7 @@ for a, b in zip(x,y):
 from mingpt.model import GPT
 model_config = GPT.get_default_config()
-model_config.model_type = 'gpt-nano'
+model_config.model_type = 'gpt2-xl'
 model_config.vocab_size = train_dataset.get_vocab_size()
 model_config.block_size = train_dataset.get_block_size()
- Run minGPT with DeepSpeed ZeRO0.
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config.json --use_hpu --steps 4 --dump-memory
The following example shows what an out-of-memory error looks like.
...
RuntimeError: FATAL ERROR :: MODULE:BRIDGE Exception in Launch thread...
FATAL ERROR :: MODULE:DEVMEM Allocation failed for size::40960000 (39.0625)MB
- Run minGPT with DeepSpeed ZeRO1.
cd /path/to/Gaudi-tutorials/PyTorch/Large_Model_DeepSpeed
deepspeed demo_ds.py --deepspeed --deepspeed_config ds_config_zero1.json --use_hpu --steps 4 --dump-memory
Applying a ZeRO technology, such as ZeRO1, ensures that the model can run successfully on Intel Gaudi processors.