Detect Frequent Graph Recompilations

ID 838419
Updated 4/27/2023
Version Original
Public

The Intel® Gaudi® software and PyTorch* bridge identify the subset of the framework’s computation graph that can be accelerated by Intel® Gaudi® technology. The bridge compiles an Intel® Gaudi® software graph and launches the resulting recipe asynchronously. It also caches recipes to avoid recompiling the same graph, both at the eager operation level and at the just-in-time (JIT) graph level. During training, graph compilation is only required for the initial iteration; afterward, the same compiled recipe reruns for every iteration (with new inputs) unless the operations being run change.
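
As context for how recipe caching behaves, the following is a minimal sketch (not part of the tutorial) of a lazy-mode training loop. Because every iteration runs the same operations on identically shaped tensors, the recipe compiled at the first mark_step() is reused for all later iterations. It assumes a Gaudi PyTorch environment with lazy mode enabled (PT_HPU_LAZY_MODE=1) and uses arbitrary layer and batch sizes:

import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
model = torch.nn.Linear(64, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(10):
    # Fixed input shape every iteration -> the cached recipe is reused.
    data = torch.randn(128, 64, device=device)
    target = torch.randint(0, 10, (128,), device=device)
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # In lazy mode, mark_step() launches the accumulated graph;
    # compilation happens only the first time this graph is seen.
    htcore.mark_step()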

In training workloads, graph recompilations can sometimes occur. Frequent graph recompilations add latency and slow down the overall training process. One of the most common causes is a topology that generates variable output tensor shapes, due either to dynamic input data or to dynamic operators. The term dynamic shapes is broadly used to describe this behavior. For ways to eliminate it, see Handling Dynamic Shapes.
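
As an illustration (not from the tutorial), an operator such as boolean-mask indexing produces an output whose size depends on the data, which is one common source of dynamic shapes:

import torch

x = torch.randn(128, 64)
mask = x[:, 0] > 0     # which rows pass the filter depends on the data
selected = x[mask]     # the output shape varies from batch to batch
print(selected.shape)  # e.g. torch.Size([67, 64]) for one batch, torch.Size([58, 64]) for the next

On an Intel Gaudi device, each new output shape can trigger a fresh graph compilation.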

The following example demonstrates how frequent graph recompilations can impact performance on Intel® Gaudi® software.

Start a Docker* Container

On a platform with Intel® Gaudi® accelerators, start a container from the latest Docker* image for the accelerators and attach to it:

docker run -it -d --name GPT2-fine-tune --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0

docker exec -it GPT2-fine-tune /bin/bash

Prepare the Model

This tutorial uses a Modified National Institute of Standards and Technology (MNIST) example model that is available in the Model References repository. Clone this repository inside the container that you just started and set the PYTHONPATH to the top-level directory:

cd ~ && git clone https://github.com/HabanaAI/Model-References.git
cd Model-References && export PYTHONPATH=$PYTHONPATH:${PWD}

Navigate to the subdirectory containing the hello_world example:

cd PyTorch/examples/computer_vision/hello_world/

Training on a Single Intel® Gaudi® Processor

Start by running training on a single Intel® Gaudi® processor with BF16 mixed precision enabled:

PT_HPU_LAZY_MODE=1 python mnist.py --batch-size=128 --epochs=1 --lr=1.0 --gamma=0.7 --hpu --autocast

The model outputs the following after completing the training:

....
Train Epoch: 1 [57600/60000.0 (96%)]    Loss: 0.111816
Train Epoch: 1 [58880/60000.0 (98%)]    Loss: 0.012939
Total test set: 10000, number of workers: 1
* Average Acc 98.350 Average loss 0.051

Detect Recompilations

Modify the code to enable the Metric APIs to detect frequent recompilations. To do this, edit the mnist.py file and add the following code immediately above the line in the train function that prints the epoch training statistics:

....
        if batch_idx % args.log_interval == 0:
            from habana_frameworks.torch.hpu.metrics import metric_global
            gc_metric = metric_global("graph_compilation")
            print("graph_compilation: ", gc_metric.stats())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
....

Rerun the mnist.py code. Statistics on graph compilations are reported:

....
graph_compilation:  [('TotalNumber', 2), ('TotalTime', 591406), ('AvgTime', 295703.0)]
Train Epoch: 1 [57600/60000.0 (96%)]    Loss: 0.111816
graph_compilation:  [('TotalNumber', 2), ('TotalTime', 591406), ('AvgTime', 295703.0)]
Train Epoch: 1 [58880/60000.0 (98%)]    Loss: 0.012939

Total test set: 10000, number of workers: 1
* Average Acc 98.430 Average loss 0.050
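
The example above uses the global metric, which accumulates statistics for the whole process. To measure only a specific block of code, the Metric APIs also provide a local context manager. The following sketch assumes the metric_localcontext API and reuses the model and data objects that already exist in mnist.py's train function:

from habana_frameworks.torch.hpu.metrics import metric_localcontext

# Count only the graph compilations triggered inside this block.
with metric_localcontext("graph_compilation") as local_metric:
    output = model(data)
print(local_metric.stats())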

Introduce some artificial dynamicity to the model by adding the following code at the top of the train function's main for loop:

....
    for batch_idx, (data, target) in enumerate(train_loader):
        index = max(-batch_idx -1, -args.batch_size + 1)
        data, target = data[:-index, :, :, :], target[:-index]
....
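
To see why this forces recompilations, note that the slice keeps one sample on the first iteration, two on the second, and so on, until it saturates at args.batch_size - 1 (127 here), so the device sees 127 distinct input shapes. A quick check of the arithmetic in plain Python (no accelerator required):

# Effective batch sizes produced by the slicing above, with batch_size = 128.
batch_size = 128
sizes = [-max(-batch_idx - 1, -batch_size + 1) for batch_idx in range(130)]
print(sizes[:5], "...", sizes[125:129])   # [1, 2, 3, 4, 5] ... [126, 127, 127, 127]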

When you rerun the training code, graph compilation occurs much more frequently and model latency is much worse until all possible input shapes have been compiled and cached:

graph_compilation:  [('TotalNumber', 0), ('TotalTime', 0), ('AvgTime', 0)]
Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.437500
graph_compilation:  [('TotalNumber', 10), ('TotalTime', 2617815), ('AvgTime', 261781.5)]
Train Epoch: 1 [110/60000.0 (2%)]       Loss: 2.171875
graph_compilation:  [('TotalNumber', 20), ('TotalTime', 5318485), ('AvgTime', 265924.25)]
Train Epoch: 1 [420/60000.0 (4%)]       Loss: 1.585938
graph_compilation:  [('TotalNumber', 30), ('TotalTime', 8067065), ('AvgTime', 268902.1666666667)]
....
graph_compilation:  [('TotalNumber', 90), ('TotalTime', 25365331), ('AvgTime', 281837.0111111111)]
Train Epoch: 1 [8190/60000.0 (19%)]     Loss: 0.373047
graph_compilation:  [('TotalNumber', 100), ('TotalTime', 28352504), ('AvgTime', 283525.04)]
Train Epoch: 1 [10100/60000.0 (21%)]    Loss: 0.294922
graph_compilation:  [('TotalNumber', 110), ('TotalTime', 31392682), ('AvgTime', 285388.0181818182)]
Train Epoch: 1 [12210/60000.0 (23%)]    Loss: 0.333984
graph_compilation:  [('TotalNumber', 120), ('TotalTime', 34436428), ('AvgTime', 286970.23333333334)]
Train Epoch: 1 [14520/60000.0 (26%)]    Loss: 0.235352
graph_compilation:  [('TotalNumber', 127), ('TotalTime', 36547638), ('AvgTime', 287776.6771653543)]
Train Epoch: 1 [16510/60000.0 (28%)]    Loss: 0.267578
graph_compilation:  [('TotalNumber', 127), ('TotalTime', 36547638), ('AvgTime', 287776.6771653543)]
....
graph_compilation:  [('TotalNumber', 127), ('TotalTime', 36547638), ('AvgTime', 287776.6771653543)]
Train Epoch: 1 [58420/60000.0 (98%)]    Loss: 0.013000

Total test set: 10000, number of workers: 1
* Average Acc 98.150 Average loss 0.053

As long as the batch size keeps changing, there are more recompilations and training is noticeably slower.
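
One common way to remove this kind of dynamicity, covered in Handling Dynamic Shapes, is to pad partial batches to a fixed size (or to a small set of bucket sizes) so that cached recipes are reused. The helper below is a minimal, hypothetical sketch; pad_to_bucket is not part of the tutorial, and in a real workload you would also exclude the padded samples from the loss:

import torch

def pad_to_bucket(data, target, bucket_size):
    # Pad a partial batch up to a fixed size by repeating the last sample,
    # so the accelerator always sees the same input shape.
    pad = bucket_size - data.shape[0]
    if pad > 0:
        data = torch.cat([data, data[-1:].expand(pad, *data.shape[1:])])
        target = torch.cat([target, target[-1:].expand(pad)])
    return data, target

# Usage inside the train loop (illustrative):
# data, target = pad_to_bucket(data, target, bucket_size=args.batch_size)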

What’s Next?

Use the same technique to check the frequency of graph recompilations in other code. If needed, look for possible resolutions in Handling Dynamic Shapes.
