This tutorial shows how to run the Intel® Gaudi® accelerator Profiler tool (habana_perf_tool) and the TensorBoard* plug-in. These tools provide optimization tips and information that you can use to modify your model for better performance.

For information on how to set up the Profiler tool, see the Intel Gaudi Accelerator Profiler User Guide. For more information on other optimization techniques, see the Model Performance Optimization Guide.

Initial Setup

Start with a Docker* image for PyTorch* on an Intel Gaudi accelerator, and then run this notebook. This example fine-tunes the Swin Transformer model from the Hugging Face* repository using the Hugging Face Optimum for Intel Gaudi library (formerly Optimum Habana).

Install the Optimum library and the Hugging Face model examples:

python -m pip install optimum[habana]
git clone https://github.com/huggingface/optimum-habana
cd optimum-habana/examples/image-classification
pip install -r requirements.txt

The utils.py file in the Optimum for Intel Gaudi library already has profiling fully instrumented:

cat -n ../../optimum/habana/utils.py | head -n 254 | tail -n 10

   245      schedule = torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=1)
   246      activities = [torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.HPU]
   247
   248      profiler = torch.profiler.profile(
   249          schedule=schedule,
   250          activities=activities,
   251          on_trace_ready=torch.profiler.tensorboard_trace_handler(output_dir),
   252          record_shapes=True,
   253          with_stack=True,
   254      )
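For context, here is a minimal, standalone sketch of the same torch.profiler setup. The wait/warmup/active values and the output directory are illustrative; in the actual runs they come from the profiling command-line flags shown below.

import torch

# Illustrative values; the real ones are derived from the command-line flags.
output_dir = "./hpu_profile"
wait, warmup, active = 0, 10, 3

schedule = torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=1)
activities = [torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.HPU]

profiler = torch.profiler.profile(
    schedule=schedule,
    activities=activities,
    on_trace_ready=torch.profiler.tensorboard_trace_handler(output_dir),
    record_shapes=True,
    with_stack=True,
)

# In a training loop, the profiler is started once and stepped every iteration:
# profiler.start()
# for batch in dataloader:
#     train_step(batch)
#     profiler.step()
# profiler.stop()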

Run the Model to Collect a Trace File (Unoptimized)

The Swin Transformer model serves as a general-purpose backbone for computer vision. The run_image_classification.py script showcases how to fine-tune it on Intel Gaudi accelerators.

Note the Intel Gaudi accelerator-specific commands:

  • --use_habana allows training to run on Intel Gaudi accelerators.
  • --use_hpu_graphs reduces recompilation by replaying the graph.
  • --gaudi_config_name Habana/swin maps to the Hugging Face Swin Transformer model configuration.

Note the torch.profiler-specific commands:

  • --profiling_warmup_steps 10 tells the profiler to wait through 10 warmup steps before recording.
  • --profiling_steps 3 records the next 3 active steps.

The collected trace files are saved to ./hpu_profile; copies are kept in the ./swin_profile folder for reference.
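For reference, these flags map to fields on GaudiTrainingArguments in the Optimum for Intel Gaudi library. A minimal, hedged sketch of setting the same options programmatically follows; the field names are assumed from the optimum-habana release used in this tutorial, and run_image_classification.py normally builds them from the command line for you.

from optimum.habana import GaudiTrainingArguments

# Sketch only: field names assumed from the optimum-habana release used here.
training_args = GaudiTrainingArguments(
    output_dir="/tmp/outputs/",
    use_habana=True,                   # run training on the Intel Gaudi accelerator (HPU)
    use_lazy_mode=True,                # lazy-mode graph execution
    use_hpu_graphs=True,               # replay HPU graphs to reduce recompilation
    gaudi_config_name="Habana/swin",   # Hugging Face Gaudi configuration for Swin
    throughput_warmup_steps=2,
    profiling_warmup_steps=10,         # profiler waits through 10 warmup steps
    profiling_steps=3,                 # then records 3 active steps
)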

python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs \
    --gaudi_config_name Habana/swin \
    --throughput_warmup_steps 2 \
    --overwrite_output_dir \
    --ignore_mismatched_sizes \
    --profiling_warmup_steps 10 \
    --profiling_steps 3

These are the results at the end of the run:

***** train metrics *****
epoch = 1.0
max_memory_allocated (GB) = 92.25
memory_allocated (GB) = 90.84
total_memory_available (GB) = 93.74
train_loss = 0.2722
train_runtime = 0:03:27.66
train_samples_per_second = 240.412
train_steps_per_second = 3.762

Two Ways to Use Performance Analysis Tools

Launch TensorBoard to see performance analysis results:

tensorboard --logdir xxx

Or, use habana_perf_tool (the Profiler tool) to see the console output analysis:

habana_perf_tool --trace xxx.trace.json

Both tools provide the same information.

Note these key sections of the habana_perf_tool console output:

  • Device/Host ratio shows overall performance and device utilization.
  • Host Summary shows host-side performance of DataLoader, graph build, data copy, and compile.
  • Device Summary shows device-side performance of the matrix multiplication engine (MME), tensor processing cores (TPC), and DMA.
  • Host/Device Recommendations shows performance recommendations for model optimization.

Run the Profiler tool on the trace file from this first run and review the console output for guidance on how to improve:

habana_perf_tool --trace ./swin_profile/unoptimized/UNOPT.pt.trace.json
2023-07-19 22:07:04,476 - pytorch_profiler - DEBUG - Loading ./swin_profile/unoptimized/UNOPT.pt.trace.json
Import Data (KB): 100%|█████████████| 200068/200068 [00:01<00:00, 101312.72it/s]
2023-07-19 22:07:07,468 - pytorch_profiler - DEBUG - Please wait for initialization to finish ...
2023-07-19 22:07:15,881 - pytorch_profiler - DEBUG - PT Track ids: BridgeTrackIds.Result(pt_bridge_launch='46,51,6', pt_bridge_compute='15', pt_mem_copy='6', pt_mem_log='', pt_build_graph='48,49,45,5')
2023-07-19 22:07:15,881 - pytorch_profiler - DEBUG - Track ids: TrackIds.Result(forward='4', backward='44', synapse_launch='0,47,50', synapse_wait='1,9', device_mme='40,41,42,43', device_tpc='16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39', device_dma='7,10,11,12,13,14')
2023-07-19 22:07:18,228 - pytorch_profiler - DEBUG - Device ratio: 61.66 % (288.393 ms, 467.734 ms)
2023-07-19 22:07:18,228 - pytorch_profiler - DEBUG - Device/Host ratio: 61.66% / 38.34%
2023-07-19 22:07:19,098 - pytorch_profiler - DEBUG - Host Summary Graph Build: 14.50 % (60.240976 ms, 415.491 ms)
2023-07-19 22:07:19,288 - pytorch_profiler - DEBUG - Host Summary DataLoader: 55.98 % (232.607 ms, 415.491 ms)
2023-07-19 22:07:19,565 - pytorch_profiler - DEBUG - Host Summary Input Time: 4.62 % (19.187 ms, 415.491 ms)
2023-07-19 22:07:19,772 - pytorch_profiler - DEBUG - Host Summary Compile Time: 1.52 % (6.31 ms, 415.491 ms)
2023-07-19 22:07:20,245 - pytorch_profiler - DEBUG - Device Summary MME Lower Precision Ratio: 77.08%
2023-07-19 22:07:20,245 - pytorch_profiler - DEBUG - Device Host Overlapping degree: 81.88 %
2023-07-19 22:07:20,245 - pytorch_profiler - DEBUG - Host Recommendations:
2023-07-19 22:07:20,245 - pytorch_profiler - DEBUG - This run has high time cost on input data loading. 55.98% of the step time is in DataLoader. You could use Habana DataLoader. Or you could try to tune num_workers on DataLoader's construction.
2023-07-19 22:07:20,245 - pytorch_profiler - DEBUG - Compile times per step : [2]. Compile ratio: 1.52% (total time: 6.31 ms)
2023-07-19 22:07:20,561 - pytorch_profiler - DEBUG - [Device Summary] MME total time 88.28 ms
2023-07-19 22:07:27,530 - pytorch_profiler - DEBUG - [Device Summary] MME/TPC overlap time 57.94 ms
2023-07-19 22:07:27,531 - pytorch_profiler - DEBUG - [Device Summary] TPC total time 165.36 ms
2023-07-19 22:07:29,530 - pytorch_profiler - DEBUG - [Device Summary] DMA total time 29.43 ms
2023-07-19 22:07:29,530 - pytorch_profiler - DEBUG - [Device Summary] Idle total time: 5.32 ms

In this case, the DataLoader is taking too much time on the host. The tool recommends using the Intel Gaudi accelerator DataLoader or increasing the number of workers used by the DataLoader.
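The second recommendation corresponds to the num_workers argument of a standard PyTorch DataLoader. Here is a minimal sketch of multiprocess data loading outside the Trainer; the toy dataset and batch size are placeholders, not taken from the tutorial script, where --dataloader_num_workers sets this for you.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the CIFAR-10 images used in the tutorial.
train_dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.randint(0, 10, (512,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,   # four worker processes prepare batches in parallel with training
)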

Apply Optimization 1 (Tune num_workers)

Note the flag added for this optimization:

  • --dataloader_num_workers 4 enables multiprocess data loading by setting num_workers to a positive integer.

python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs \
    --gaudi_config_name Habana/swin \
    --throughput_warmup_steps 2 \
    --overwrite_output_dir \
    --ignore_mismatched_sizes \
    --dataloader_num_workers 4 \
    --profiling_warmup_steps 10 \
    --profiling_steps 3

These are the results at the end of the run:

***** train metrics *****
epoch = 1.0
max_memory_allocated (GB) = 92.25
memory_allocated (GB) = 90.84
total_memory_available (GB) = 93.74
train_loss = 0.2853
train_runtime = 0:02:43.50
train_samples_per_second = 322.011
train_steps_per_second = 5.039

Run the Profiler tool to see if the workload is better optimized:

habana_perf_tool --trace ./swin_profile/1st_optim_num_worker/1stOPT.pt.trace.json
2023-07-19 22:05:39,782 - pytorch_profiler - DEBUG - Loading ./swin_profile/1st_optim_num_worker/1stOPT.pt.trace.json
Import Data (KB): 100%|█████████████| 177474/177474 [00:01<00:00, 102009.17it/s]
2023-07-19 22:05:42,539 - pytorch_profiler - DEBUG - Please wait for initialization to finish ...
2023-07-19 22:05:49,949 - pytorch_profiler - DEBUG - PT Track ids: BridgeTrackIds.Result(pt_bridge_launch='9,54,49', pt_bridge_compute='18', pt_mem_copy='9', pt_mem_log='', pt_build_graph='8,48,51,52')
2023-07-19 22:05:49,950 - pytorch_profiler - DEBUG - Track ids: TrackIds.Result(forward='7', backward='47', synapse_launch='0,50,53', synapse_wait='1,12', device_mme='43,45,46,44', device_tpc='36,30,26,31,23,25,35,19,29,38,24,22,33,37,27,20,41,32,28,34,40,42,39,21', device_dma='10,17,15,13,14,16')
2023-07-19 22:05:52,033 - pytorch_profiler - DEBUG - Device ratio: 90.84 % (283.428 ms, 312.02 ms)
2023-07-19 22:05:52,033 - pytorch_profiler - DEBUG - Device/Host ratio: 90.84% / 9.16%
2023-07-19 22:05:52,798 - pytorch_profiler - DEBUG - Host Summary Graph Build: 28.77 % (59.886976 ms, 208.177 ms)
2023-07-19 22:05:52,939 - pytorch_profiler - DEBUG - Host Summary DataLoader: 1.56 % (3.249 ms, 208.177 ms)
2023-07-19 22:05:53,161 - pytorch_profiler - DEBUG - Host Summary Input Time: 11.58 % (24.109 ms, 208.177 ms)
2023-07-19 22:05:53,343 - pytorch_profiler - DEBUG - Host Summary Compile Time: 2.28 % (4.746 ms, 208.177 ms)
2023-07-19 22:05:53,810 - pytorch_profiler - DEBUG - Device Summary MME Lower Precision Ratio: 77.08%
2023-07-19 22:05:53,811 - pytorch_profiler - DEBUG - Device Host Overlapping degree: 86.27 %
2023-07-19 22:05:53,811 - pytorch_profiler - DEBUG - Host Recommendations:
2023-07-19 22:05:53,811 - pytorch_profiler - DEBUG - 11.58% H2D of the step time is in Input Data Time. Step call times: [28, 28, 28]. You could try to set non-blocking in torch.Tensor.to and pin_memory in DataLoader's construction to asynchronously convert CPU tensor with pinned memory to a HPU tensor.
2023-07-19 22:05:53,811 - pytorch_profiler - DEBUG - Compile times per step : [2]. Compile ratio: 2.28% (total time: 4.75 ms)
2023-07-19 22:05:54,126 - pytorch_profiler - DEBUG - [Device Summary] MME total time 88.26 ms
2023-07-19 22:06:01,047 - pytorch_profiler - DEBUG - [Device Summary] MME/TPC overlap time 57.95 ms
2023-07-19 22:06:01,049 - pytorch_profiler - DEBUG - [Device Summary] TPC total time 165.50 ms
2023-07-19 22:06:03,065 - pytorch_profiler - DEBUG - [Device Summary] DMA total time 26.40 ms
2023-07-19 22:06:03,065 - pytorch_profiler - DEBUG - [Device Summary] Idle total time: 3.26 ms

The result is much better now: the host ratio dropped to about 9%, and throughput improved by about 30%. However, the tool now recommends using a non-blocking (asynchronous) data copy to further reduce host-side input time.

Apply Optimization 2 (Use Asynchronous Copy)

Note the flag added for this optimization:

  • --non_blocking_data_copy True passes non_blocking=True to the data copy, so the Python* thread can continue running other tasks while the copy proceeds in the background (see the sketch below).
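Under the hood, this corresponds to the standard PyTorch pattern of pinned host memory plus a non-blocking host-to-device copy. A minimal sketch follows; the toy dataset, variable names, and the "hpu" device string are illustrative, since in the tutorial the Trainer handles this via the flag above.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the tutorial's images.
train_dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.randint(0, 10, (512,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,   # page-locked host memory enables asynchronous copies
)

for images, labels in train_loader:
    # The copy call returns immediately; the Python thread keeps working
    # while the data transfers to the device in the background.
    images = images.to("hpu", non_blocking=True)
    labels = labels.to("hpu", non_blocking=True)
    # ... forward/backward pass ...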

python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs \
    --gaudi_config_name Habana/swin \
    --throughput_warmup_steps 2 \
    --overwrite_output_dir \
    --ignore_mismatched_sizes \
    --dataloader_num_workers 4 \
    --non_blocking_data_copy True \
    --profiling_warmup_steps 10 \
    --profiling_steps 3

These are the results at the end of the run:

***** train metrics *****
epoch = 1.0
max_memory_allocated (GB) = 92.25
memory_allocated (GB) = 90.84
total_memory_available (GB) = 93.74
train_loss = 0.2853
train_runtime = 0:02:43.38
train_samples_per_second = 330.061
train_steps_per_second = 5.165

To see if the workload is better optimized, run the Profiler tool one final time:

habana_perf_tool --trace ./swin_profile/2nd_optim_non_blocking/2ndOPT.pt.trace.json
2023-07-19 22:04:37,679 - pytorch_profiler - DEBUG - Loading ./swin_profile/2nd_optim_non_blocking/2ndOPT.pt.trace.json
Import Data (KB): 100%|█████████████| 177495/177495 [00:01<00:00, 102617.38it/s]
2023-07-19 22:04:40,426 - pytorch_profiler - DEBUG - Please wait for initialization to finish ...
2023-07-19 22:04:47,805 - pytorch_profiler - DEBUG - PT Track ids: BridgeTrackIds.Result(pt_bridge_launch='56,9,51', pt_bridge_compute='15', pt_mem_copy='9,58,13,57', pt_mem_log='', pt_build_graph='8,50,53,54')
2023-07-19 22:04:47,806 - pytorch_profiler - DEBUG - Track ids: TrackIds.Result(forward='7', backward='49', synapse_launch='0,52,55', synapse_wait='1,12', device_mme='45,47,48,46', device_tpc='29,31,26,32,41,25,27,21,36,28,30,24,35,43,39,22,44,38,34,42,33,37,40,23', device_dma='10,19,17,20,16,18')
2023-07-19 22:04:49,814 - pytorch_profiler - DEBUG - Device ratio: 91.74 % (280.442 ms, 305.698 ms)
2023-07-19 22:04:49,814 - pytorch_profiler - DEBUG - Device/Host ratio: 91.74% / 8.26%
2023-07-19 22:04:50,555 - pytorch_profiler - DEBUG - Host Summary Graph Build: 33.66 % (67.771976 ms, 201.314 ms)
2023-07-19 22:04:50,699 - pytorch_profiler - DEBUG - Host Summary DataLoader: 1.69 % (3.412 ms, 201.314 ms)
2023-07-19 22:04:50,915 - pytorch_profiler - DEBUG - Host Summary Input Time: 1.33 % (2.687 ms, 201.314 ms)
2023-07-19 22:04:51,093 - pytorch_profiler - DEBUG - Host Summary Compile Time: 2.31 % (4.652 ms, 201.314 ms)
2023-07-19 22:04:51,552 - pytorch_profiler - DEBUG - Device Summary MME Lower Precision Ratio: 77.08%
2023-07-19 22:04:51,552 - pytorch_profiler - DEBUG - Device Host Overlapping degree: 87.45 %
2023-07-19 22:04:51,552 - pytorch_profiler - DEBUG - Host Recommendations:
2023-07-19 22:04:51,552 - pytorch_profiler - DEBUG - Compile times per step : [2]. Compile ratio: 2.31% (total time: 4.65 ms)
2023-07-19 22:04:51,867 - pytorch_profiler - DEBUG - [Device Summary] MME total time 88.22 ms
2023-07-19 22:04:58,786 - pytorch_profiler - DEBUG - [Device Summary] MME/TPC overlap time 57.90 ms
2023-07-19 22:04:58,788 - pytorch_profiler - DEBUG - [Device Summary] TPC total time 165.44 ms
2023-07-19 22:05:00,819 - pytorch_profiler - DEBUG - [Device Summary] DMA total time 33.71 ms

Summary of Optimizations

First Run

Device use is at 61.6%; the host is heavy with DataLoader costs at 55.9%.

Recommendations: Tune num_workers or use the Intel Gaudi accelerator DataLoader.

Second Run (Tune num_workers)

Device use is up to 90.8%, but data copy costs 11.5% of host step time.

Recommendations: Set non_blocking=True in torch.Tensor.to and pin_memory=True in the DataLoader.

Third Run (Use Asynchronous Copy)

Device use is up to 91.7%; the model is now highly optimized.

TensorBoard* Viewer

Finally, launch the TensorBoard Viewer for the last training run. The viewer shows three main sections: Intel Gaudi accelerator overview, Intel Gaudi accelerator kernel view, and memory profiling.

Intel Gaudi Accelerator Overview

The TensorBoard Viewer's initial view includes a comprehensive summary of the Intel Gaudi accelerator, with device runtime information and host CPU information. To guide performance optimization, the overview shows use information for both the host and device, plus debugging guidance at the bottom of the section.

Intel Gaudi Accelerator Kernel View

The Intel Gaudi accelerator kernel view provides specific details about the Intel Gaudi accelerator kernels, such as use of the TPC and MME.

Memory Profiling

To monitor Intel Gaudi accelerator memory during training, set the profile_memory argument to True in the torch.profiler.profile function.
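A minimal sketch of that change, reusing the torch.profiler setup shown earlier (schedule, activities, and output_dir as defined there):

profiler = torch.profiler.profile(
    schedule=schedule,
    activities=activities,
    on_trace_ready=torch.profiler.tensorboard_trace_handler(output_dir),
    record_shapes=True,
    profile_memory=True,   # additionally record device memory allocations
    with_stack=True,
)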

For more information on instrumentation, see the Intel Gaudi Profiler User Guide.

%load_ext tensorboard
%tensorboard --logdir=./swin_profile/2nd_optim_non_blocking/ --port 6006

from IPython.display import Image

# Display a saved screenshot of the TensorBoard Viewer dashboard
img_path = 'tensorboard.jpg'
display(Image(img_path))

[Figure: screenshot of the TensorBoard Viewer dashboard]