Overview
Deep learning is a branch of machine learning that uses multilayered neural networks to process unstructured data such as images, text, and video. Deep learning frameworks make it easier for data scientists and developers to collect, analyze, and interpret large amounts of data.
To optimize deep learning framework performance and build faster applications on various hardware architectures, Intel offers the Intel® oneAPI Deep Neural Network Library (oneDNN).
Benefits
oneDNN is a performance library that provides highly optimized implementations of building blocks for deep learning applications and frameworks. It is an open source, cross-platform library that helps developers and data scientists use the same API for CPUs, GPUs, or both. The advantages are:
- Improve the performance of frameworks that you already use, such as PyTorch*, TensorFlow*, AI Tools from Intel, and OpenVINO™ toolkit.
- Build faster deep learning applications and frameworks using optimized building blocks.
- Implement AI applications optimized across hardware architectures (including Intel CPUs and GPUs) without writing any target-specific code.
Features
Figure: Primitive attributes and descriptors in the oneDNN programming model.
An Abstract Programming Model
The key concepts of the oneDNN programming model are primitives, engines, streams, and memory objects; a short sketch mapping them to the C++ API follows the list below.
- Primitives: Low-level operations from which more complex operations are constructed, such as convolution, data format reorder, and memory.
- Engines: An abstraction of a computational device, such as a CPU or GPU.
- Streams: A queue of primitive operations on an engine.
- Memory objects: Handles to memory allocated on a specific engine, tensor dimensions, data type, and memory format.
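As a rough illustration of how these concepts map to the C++ API, here is a minimal sketch (assuming the oneDNN v3.x API; the Get Started walkthrough below is the full example):
    #include "oneapi/dnnl/dnnl.hpp"
    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);  // engine: an abstraction of a CPU device
        stream s(eng);                     // stream: a queue of primitive operations on the engine

        // memory object: dimensions, data type, and memory format bound to the engine
        auto md = memory::desc({1, 3, 13, 13}, memory::data_type::f32, memory::format_tag::nchw);
        auto mem = memory(md, eng);

        // primitive: a forward ReLU built from its primitive descriptor
        auto relu_pd = eltwise_forward::primitive_desc(eng, prop_kind::forward_inference,
                algorithm::eltwise_relu, md, md, 0.f, 0.f);
        auto relu = eltwise_forward(relu_pd);

        relu.execute(s, {{DNNL_ARG_SRC, mem}, {DNNL_ARG_DST, mem}});  // run in place
        s.wait();
        return 0;
    }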
Automatic Optimization
oneDNN works with the deep learning frameworks you already use. You can develop platform-independent deep learning applications: the library detects the instruction set architecture (ISA) at run time and automatically dispatches to the code paths optimized for it, as the sketch below illustrates.
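No target-specific code is needed for this dispatch; the same binary picks the best kernels at run time. If you want to cap the dispatcher, for example to test an older code path, the library exposes a control for that. A minimal sketch, assuming the library was built with the (default) max-CPU-ISA control enabled:
    #include "oneapi/dnnl/dnnl.hpp"

    int main() {
        // Optional: limit the dispatcher to AVX2 kernels, e.g., for testing.
        // Must be called before the first CPU primitive is created.
        dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx2);

        // From here on, CPU primitives dispatch to kernels no newer than AVX2,
        // even if the machine supports a newer ISA.
        dnnl::engine eng(dnnl::engine::kind::cpu, 0);
        (void)eng;
        return 0;
    }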
Network Optimization
The library lets you identify performance bottlenecks using Intel® VTune™ Profiler. It also provides automatic memory format selection and propagation based on the hardware and on convolution parameters (see the sketch below).
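Automatic format selection is requested with memory::format_tag::any: the primitive descriptor then chooses the layout the selected implementation prefers, and user data is reordered to it if needed. A minimal sketch, assuming the oneDNN v3.x API, an engine eng created as shown later, and hypothetical convolution shapes:
    using namespace dnnl;

    // Hypothetical shapes: 1x3x13x13 input, 16 filters of size 3x3, stride 1, padding 1
    auto conv_src_md = memory::desc({1, 3, 13, 13}, memory::data_type::f32, memory::format_tag::any);
    auto conv_wei_md = memory::desc({16, 3, 3, 3}, memory::data_type::f32, memory::format_tag::any);
    auto conv_dst_md = memory::desc({1, 16, 13, 13}, memory::data_type::f32, memory::format_tag::any);

    // With format_tag::any the library picks the memory formats it considers fastest
    auto conv_pd = convolution_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::convolution_direct,
            conv_src_md, conv_wei_md, conv_dst_md,
            {1, 1},   // strides
            {1, 1},   // padding (left/top)
            {1, 1});  // padding (right/bottom)

    // Query the formats the implementation actually chose; reorder user data if they differ
    auto chosen_src_md = conv_pd.src_desc();
    auto chosen_wei_md = conv_pd.weights_desc();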
Optimized Implementations of Key Building Blocks
oneDNN supports primitives such as convolution, matrix multiplication, pooling, batch normalization, activation functions, recurrent neural network (RNN) cells, and long short-term memory (LSTM) cells.
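For example, a matrix multiplication primitive is created from just the memory descriptors of its operands. A minimal sketch, assuming the oneDNN v3.x API, an engine eng, and hypothetical 2D shapes:
    using namespace dnnl;

    // Hypothetical shapes: (128 x 256) * (256 x 64) = (128 x 64), all f32, row-major
    auto a_md = memory::desc({128, 256}, memory::data_type::f32, memory::format_tag::ab);
    auto b_md = memory::desc({256, 64}, memory::data_type::f32, memory::format_tag::ab);
    auto c_md = memory::desc({128, 64}, memory::data_type::f32, memory::format_tag::ab);

    // The primitive descriptor selects an implementation optimized for the target engine
    auto matmul_pd = matmul::primitive_desc(eng, a_md, b_md, c_md);
    auto matmul_prim = matmul(matmul_pd);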
Get Started
Installation
The oneDNN binary distribution can be installed in the following ways:
- As a part of the Intel® oneAPI Base Toolkit
- From Anaconda*
- As a stand-alone version
If the configuration you need is unavailable, you can build the oneDNN library from source. The library is optimized for Intel® architecture processors and Intel® Processor Graphics, and it boosts the performance of deep learning frameworks such as PyTorch and TensorFlow. Check the system requirements page and build options for more details about CPU and GPU runtimes.
Code Example
This C++ code example demonstrates the basics of the oneDNN programming model:
- Creating oneDNN memory objects and oneDNN primitives.
- Running the primitives.
The first step is to create a getting_started_tutorial() function that contains all the steps of the oneDNN programming model. This function is then called from the main() function. The steps implemented in the code sample are:
- Include public headers.
To use the oneDNN library, we must first include the dnnl.hpp header file in the program. We also include dnnl_debug.h for debugging facilities.
    #include "oneapi/dnnl/dnnl.hpp"
    #include "oneapi/dnnl/dnnl_debug.h"
- Create an engine and stream to run a primitive.
oneDNN primitives and memory objects are attached to a particular dnnl::engine and require a dnnl::stream for running. An engine requires a dnnl::engine::kind and the index of the device of the given kind. A stream just needs an engine object, like the following:
    engine eng(engine_kind, 0);
    stream engine_stream(eng);
- Prepare data.
Create a 4D tensor in NHWC format. Note that even though we work with one image only, the image tensor is still 4D. The extra dimension (here, N) corresponds to the batch and, in the case of a single image, is equal to 1. The prepared 4D tensor needs to be wrapped into a oneDNN memory object.
    int N = 1, H = 13, W = 13, C = 3;

    // Compute physical strides for each dimension
    int stride_N = H * W * C;
    int stride_H = W * C;
    int stride_W = C;
    int stride_C = 1;

    // An auxiliary function that maps a logical index to the physical offset
    auto offset = [=](int n, int h, int w, int c) {
        return n * stride_N + h * stride_H + w * stride_W + c * stride_C;
    };

    // The image size
    const int image_size = N * H * W * C;

    // Allocate a buffer for the image
    std::vector<float> image(image_size);

    // Initialize the image with some values
    for (int n = 0; n < N; ++n)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                for (int c = 0; c < C; ++c) {
                    int off = offset(n, h, w, c); // Get the physical offset of a pixel
                    image[off] = -std::cos(off / 10.f);
                }
- Wrap data into a oneDNN memory object.
Wrap the prepared image in a dnnl::memory object, which allows us to pass it to oneDNN primitives. This can be performed in two steps:
a. Initialize the dnnl::memory::desc struct:
    auto src_md = memory::desc(
            {N, C, H, W},             // logical dims, the order is defined by a primitive
            memory::data_type::f32,   // tensor's data type
            memory::format_tag::nhwc  // memory format, NHWC in this case
    );
b. Create the dnnl::memory object itself:
    // src_mem contains a copy of image after the write_to_dnnl_memory() call below
    // (write_to_dnnl_memory() is a helper from the oneDNN example utilities)
    auto src_mem = memory(src_md, eng);
    write_to_dnnl_memory(image.data(), src_mem);

    // For dst_mem the library allocates the buffer
    auto dst_mem = memory(src_md, eng);
- Create a ReLU primitive. This requires two steps:
a. Create an operation primitive descriptor that defines the operation parameters and is a lightweight descriptor of the actual algorithm that implements the given operation:
    auto relu_pd = eltwise_forward::primitive_desc(
            eng,                           // an engine the primitive will be created for
            prop_kind::forward_inference,
            algorithm::eltwise_relu,
            src_md,                        // source memory descriptor
            src_md,                        // destination memory descriptor
            0.f,                           // alpha parameter means negative slope in case of ReLU
            0.f                            // beta parameter is ignored in case of ReLU
    );
b. Create a primitive that can be run on memory objects to compute the operation:
    auto relu = eltwise_forward(relu_pd);
Note: Primitive creation is an expensive operation, so consider creating a primitive once and running it multiple times.
- Run the ReLU primitive and wait for its completion.
Input and output memory objects are passed to the execute() method using a <tag, memory> map. A primitive runs in a stream. Depending on the stream kind, a run might be blocking or nonblocking, which means that we need to call dnnl::stream::wait before accessing the results.
    // Execute ReLU (out-of-place)
    relu.execute(engine_stream,           // The execution stream
            {                             // A map with all inputs and outputs
                    {DNNL_ARG_SRC, src_mem},  // Source tag and memory obj
                    {DNNL_ARG_DST, dst_mem},  // Destination tag and memory obj
            });

    // Wait for the stream to complete the execution
    engine_stream.wait();
- Obtain and validate the result.
The result is stored in the dst_mem memory object, so we need to read it back into a plain float buffer. This is safe since we created dst_mem as an f32 tensor with a known memory format. A sketch of a simple validity check on this buffer appears after these steps.
    std::vector<float> relu_image(image_size);
    read_from_dnnl_memory(relu_image.data(), dst_mem);
- Call the prepared function in main().
Here, we can define additional error handling if needed.
    int main(int argc, char **argv) {
        engine::kind engine_kind = parse_engine_kind(argc, argv);
        getting_started_tutorial(engine_kind);
        finalize();
        return 0;
    }
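To validate the result read back in the step above, we could, for example, check that the ReLU output contains no negative values. This check is not part of the original sample; a minimal sketch (std::logic_error requires <stdexcept>):
    // Hypothetical validation: ReLU output must be non-negative
    for (int i = 0; i < image_size; ++i)
        if (relu_image[i] < 0.f)
            throw std::logic_error("Unexpected negative value in ReLU output.");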
The getting_started.cpp code example highlights how to create and run oneDNN memory objects and primitives. Additionally, it demonstrates that these key concepts of oneDNN play a significant role in improving the deep learning performance on various hardware architectures.
What's Next?
Adopt the oneDNN library to accelerate deep learning performance on various hardware architectures. Watch the get started video about oneDNN and learn how to develop high-performance, optimized deep learning applications on CPUs and GPUs.
Learn about feature information and release downloads for the latest and previous releases of oneDNN on GitHub* and feel free to contribute to the project.
We encourage you to also check out and incorporate Intel's other AI and machine learning framework optimizations and end-to-end portfolio of tools into your AI workflow. Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of the Intel® AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.
Get Started with AI Development
Additional Resources
- AI Frameworks
- Overview of oneDNN
- oneDNN Documentation
- oneDNN Developer Guide and Reference
- Optimized Machine Learning and Deep Learning with oneDNN
- AI Concepts: Machine Learning | Inference
- AI and Machine Learning Ecosystem: Developer Hub | Developer Resources
- AI Tools Documentation
Featured Software
Download oneDNN as a part of the Intel® oneAPI Base Toolkit (Base Kit) or as a stand-alone version.
AI Code Samples
- Accelerate PyTorch Models Using Quantization Techniques with Intel® Extension for PyTorch*
- How to Build an Interactive Chat-Generation Model Using DialoGPT and PyTorch
- Fine-Tune Text Classification with Intel® Neural Compressor