Optimize PyTorch* & TensorFlow* Models: Two On-demand, Hands-on Trainings

Deep learning frameworks offer researchers and developers the building blocks for designing and training deep neural networks quickly. This makes it easier for practitioners to apply deep learning to real-world problems.

PyTorch* and TensorFlow* are among the most popular open source deep learning frameworks. Although they have differences in how they run code, both are optimized tensor libraries used for deep learning applications on CPUs and GPUs.

This blog focuses on two recent trainings delivered at the oneAPI DevSummit for AI and HPC. Each on-demand session can help you acquire new development skills or reinforce existing ones:

  • Accelerate PyTorch* Deep Learning Models on Intel® XPUs
  • Take Advantage of Default Optimizations in TensorFlow* from Intel

Let’s get started.

Workshop 1: Accelerate PyTorch* Deep Learning Models on Intel® XPUs

Session Overview

PyTorch is an open source deep learning framework based on Python*. It is built for both research and production deployment. In this session, Pramod Pai, Intel AI software and solutions engineer, begins with an overview of Intel® Optimization for PyTorch*, as well as the newest optimizations and usability features that are released first in Intel® Extension for PyTorch* before being incorporated into open source PyTorch.

[Figure: Components of Intel Optimization for PyTorch]

Next, he explains major optimization methodologies:

  • Operator Optimization: Uses vectorization and parallelization to make full use of the CPU's cores and vector units.
  • Graph Optimization: Constant folding and operator fusion are the two main components (see the sketch after this list).
  • Runtime Extension: Avoids overhead through thread affinity and improved memory allocation.
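
As a rough illustration of the graph-level techniques, here is a minimal sketch (not the session's code) that uses TorchScript to trace and freeze a model so that optimizations such as constant folding and operator fusion can apply; the model and input shape are placeholders:

    import torch
    import torchvision

    # Placeholder model; any eval-mode CNN works for this illustration.
    model = torchvision.models.resnet50(weights=None).eval()
    example_input = torch.rand(1, 3, 224, 224)

    with torch.no_grad():
        # Trace the eager model into a TorchScript graph.
        traced = torch.jit.trace(model, example_input)
        # Freezing inlines parameters as constants, which enables constant
        # folding and operator fusion on the graph.
        frozen = torch.jit.freeze(traced)
        output = frozen(example_input)
    print(output.shape)  # torch.Size([1, 1000])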

He then gives an overview of the structure of the Intel extension, including examples of how to use it for FP32 and bfloat16 (sketched below). Check out Intel Extension for PyTorch (GitHub*) and feel free to contribute to this project.

[Figure: Side-by-side comparison of FP32 and bfloat16 code]
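
The gist of those examples is a one-line call to ipex.optimize(). Here is a minimal sketch for inference, assuming Intel Extension for PyTorch is installed; the ResNet-50 model and random input are placeholders:

    import torch
    import torchvision
    import intel_extension_for_pytorch as ipex

    model = torchvision.models.resnet50(weights=None).eval()
    data = torch.rand(1, 3, 224, 224)

    # FP32: apply the extension's operator and graph optimizations.
    model_fp32 = ipex.optimize(model)

    # bfloat16: the same call with a dtype, run under CPU autocast.
    model_bf16 = ipex.optimize(model, dtype=torch.bfloat16)
    with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
        output = model_bf16(data)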

Finally, Pramod shows the performance improvements from Intel Extension for PyTorch compared to stock PyTorch. For more information, see Configurations.

[Figure: FP32 offline inference throughput]

[Figure: FP32 real-time inference throughput]

Hands-On Lab

To access the code sample used in this session, go to the GitHub repository for the workshop. It includes:

  • Getting started with Intel Extension for PyTorch for sample computer vision and natural language processing (NLP) workloads
  • Loading two models from the PyTorch hub: Faster R-CNN and DistilBERT (see the sketch after this list)
  • Applying sequential optimizations from the Intel extension and examining performance gains for each incremental change
  • Trying out the code sample in a Linux* environment and on Intel® Developer Cloud
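
To give a taste of the lab's pattern, here is a minimal sketch, assuming torchvision and Intel Extension for PyTorch are installed; the lab itself pulls Faster R-CNN and DistilBERT from the PyTorch hub, while torchvision's Faster R-CNN stands in here:

    import time
    import torch
    import torchvision
    import intel_extension_for_pytorch as ipex

    # Stand-in for the lab's hub download of Faster R-CNN.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights="DEFAULT").eval()
    images = [torch.rand(3, 224, 224)]

    # Time the stock model, then the same model with the extension applied.
    for label, m in [("stock", model), ("ipex", ipex.optimize(model))]:
        with torch.no_grad():
            start = time.time()
            m(images)
            print(f"{label}: {time.time() - start:.3f} s")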

Watch the full session recording [25:19] and download the Workshop 1 Presentation (PDF).

Workshop 2: Take Advantage of Default Optimizations in TensorFlow from Intel

Session Overview

TensorFlow is a free, open source, end-to-end machine learning platform that makes developing and training neural networks faster and easier. In this session, Sachin Muradi, Intel deep learning software engineer, begins with a synopsis of TensorFlow, followed by an introduction to Intel’s optimizations in the framework, contributed through Intel® oneAPI Deep Neural Network Library (oneDNN). oneDNN is an open source, cross-platform library for deep learning applications that supports the FP32, FP16, bfloat16, and int8 data types. (Starting with TensorFlow 2.9, the oneDNN optimizations are on by default. For TensorFlow v2.5 through v2.8, enable them by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1, as sketched below.)
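
A minimal sketch of opting in on those older versions (the flag must be set before TensorFlow is imported):

    import os

    # Needed only for TensorFlow v2.5 through v2.8; v2.9+ enables oneDNN by default.
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

    import tensorflow as tf  # import after setting the flag

    # On startup, TensorFlow logs a message confirming that oneDNN
    # custom operations are on.
    print(tf.__version__)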

When combined with oneDNN, TensorFlow provides a CPU performance boost as shown in the following graphs. For more information, see Configurations.

[Figure: Speedup bar charts on Amazon Web Services instances]

Next, Sachin explains the bfloat16 data type, a floating-point format that occupies 16 bits of computer memory but covers approximately the same dynamic range as 32-bit floating-point (FP32) numbers. The bfloat16 format is as follows:

  • 1 bit - sign
  • 8 bits - exponent
  • 7 bits - fraction

Compared to FP32, bfloat16 delivers better performance with minimal accuracy loss. Its performance enhancement is supported on 3rd gen Intel® Xeon® Scalable processors and, with Intel® Advanced Matrix Extensions (Intel® AMX) instructions, on 4th gen Intel Xeon Scalable processors.
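
A quick sketch of both points, using TensorFlow since the session does; the cast simply illustrates the 7-bit fraction, and the policy line opts a Keras model into bfloat16 mixed precision:

    import tensorflow as tf

    # 8 exponent bits keep FP32's dynamic range; 7 fraction bits round coarsely.
    x = tf.constant(3.14159265, dtype=tf.float32)
    print(tf.cast(x, tf.bfloat16).numpy())  # ~3.140625

    # Opt Keras models into bfloat16 mixed precision (accelerated on 3rd and
    # 4th gen Intel Xeon Scalable processors).
    tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")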

Hands-On Lab

Go to the GitHub repository for the workshop to access the code sample used in this session, which demonstrates three important concepts:

  • Benefits of using automatically mixed precision to accelerate tasks like transfer learning, with minimal changes to existing scripts
  • Importance of inference optimization on performance
  • Ease of using Intel® Optimization for TensorFlow* (enabled by default in TensorFlow 2.9 and newer)

The sample shows how to fine-tune a pretrained model for image classification using the TensorFlow Flowers dataset. Here, a batch of 512 images measuring 224 x 224 x 3 is used. The following steps are performed (and sketched in code after the list):

  • Perform transfer learning for image classification using TensorFlow Hub's pretrained ResNet-50 v1.5 model
  • Export the fine-tuned model in the SavedModel format
  • Optimize the SavedModel for faster inference
  • Serve the SavedModel using TensorFlow Serving
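
A minimal sketch of the first two steps, assuming tensorflow and tensorflow_hub are installed; the hub handle and layer sizes are illustrative rather than the lab's exact code:

    import tensorflow as tf
    import tensorflow_hub as hub

    # Illustrative feature-vector handle; the lab uses ResNet-50 v1.5 from TF Hub.
    FEATURE_URL = "https://tfhub.dev/google/imagenet/resnet_v1_50/feature_vector/5"

    model = tf.keras.Sequential([
        hub.KerasLayer(FEATURE_URL, trainable=False, input_shape=(224, 224, 3)),
        tf.keras.layers.Dense(5, activation="softmax"),  # 5 flower classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_ds, epochs=5)  # train_ds: batches of 224 x 224 x 3 images

    # Export the fine-tuned model in the SavedModel format.
    tf.saved_model.save(model, "flowers_savedmodel")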

Try out the code sample in a Linux environment and on the Intel Developer Cloud.
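
For the serving step, here is a sketch of querying a locally running TensorFlow Serving instance over its REST API; the model name, path, and port are illustrative:

    import json
    import numpy as np
    import requests

    # Assumes TensorFlow Serving is already running, for example:
    #   docker run -p 8501:8501 \
    #     -v "$PWD/flowers_savedmodel:/models/flowers/1" \
    #     -e MODEL_NAME=flowers tensorflow/serving
    batch = np.random.rand(1, 224, 224, 3).astype("float32")
    resp = requests.post(
        "http://localhost:8501/v1/models/flowers:predict",
        data=json.dumps({"instances": batch.tolist()}))
    print(resp.json()["predictions"])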

Sachin concludes the workshop by highlighting improved AI performance with new Intel AMX instructions on 4th gen Intel Xeon Scalable processors.

Watch the full session recording [30:50] and download the Workshop 2 Presentation (PDF).

What’s Next?

We encourage you to explore Intel’s other AI and machine learning framework optimizations and its end-to-end portfolio of tools, and to incorporate them into your AI workflow. Also, visit the AI and machine learning page, which covers Intel’s AI software development resources for preparing, building, deploying, and scaling your AI solutions.

For more details about the new 4th gen Intel Xeon Scalable processors, visit the AI Platform page, where you can learn how Intel is empowering developers to run end-to-end AI pipelines on these powerful CPUs.

About the Speakers

Pramod Pai, Intel software solutions engineer

Pramod helps customers optimize their machine learning workflows using solutions from Intel, such as Intel® AI Analytics Toolkit and Intel Extension for PyTorch. He holds a master's degree in information systems from Northeastern University in Massachusetts.

Sachin Muradi, Intel deep learning software engineer

Sachin is part of the Intel team focused on direct optimization of TensorFlow through oneDNN, where he applies his expertise in performance libraries and compilers for deep learning accelerator hardware. He holds a master’s degree in electrical and computer engineering from Portland State University in Oregon.

Useful Resources

Configurations

FP32 Offline Inference Throughput

Testing Date: Performance results are based on testing by Intel as of January 10, 2022. Configuration Details and Workload Setup: Hardware Configuration: Intel® Xeon® Platinum 8380 CPU at 2.30 GHz, two sockets with 40 cores per socket, 256 GB RAM (16 slots/16 GB/3200 MHz), Hyperthreading: on; Operating System: Ubuntu* v18.04.5 LTS; Software Configuration: PyTorch v1.10.1, Intel Extension for PyTorch v1.10.100; Offline inference refers to running single instance inference with large batch using all cores of a socket.

FP32 Realtime Inference Throughput

Testing Date: Performance results are based on testing by Intel as of January 10, 2022. Configuration Details and Workload Setup: Hardware Configuration: Intel Xeon Platinum 8380 CPU at 2.30 GHz, two sockets with 40 cores per socket, 256 GB RAM (16 slots/16 GB/3200 MHz), Hyperthreading: on; Operating System: Ubuntu v18.04.5 LTS; Software Configuration: PyTorch v1.10.1, Intel Extension for PyTorch v1.10.100; Realtime inference refers to running a multi-instance single batch inference with four cores per instance.

Speedup from TensorFlow 2.8 to 2.9 on Amazon Web Services (AWS)* c6i-12xlarge

Testing Date: Performance results are based on testing by Intel as of May 19, 2021. Configuration Details and Workload Setup: Hardware Configuration: Intel Xeon Platinum 8375C CPU at 2.90 GHz, one socket, 96 GB RAM; Software Configuration: Operating System: Ubuntu v20.04.2; Kernel: 5.11.0-1019-aws x86_64; Model Zoo v2.6.

Latency Speedup from an Intel Optimization on AWS c6i-2xlarge

Testing Date: Performance results are based on testing by Intel as of May 19, 2021. Configuration Details and Workload Setup: Hardware Configuration: Intel Xeon Platinum 8375C CPU at 2.90 GHz, one socket, 16 GB RAM; Software Configuration: Operating System: Ubuntu v20.04.2, Kernel: 5.11.0-1019-aws x86_64, Model Zoo v2.6.

Intel® oneAPI Base Toolkit

Develop high-performance, data-centric applications for CPUs, GPUs, and FPGAs with this core set of tools, libraries, and frameworks including LLVM-based compilers.
