Intel Core Ultra processors integrate three compute engines (CPU, GPU, and NPU) to deliver an industry-leading end-user experience, and each of them plays a role in accelerating AI workloads. The Neural Processing Unit (NPU) is an AI accelerator integrated into Intel Core Ultra processors, with a unique architecture that combines compute acceleration and data transfer capabilities. Its compute acceleration is provided by Neural Compute Engines, which consist of hardware acceleration blocks for AI operations.
NPUs excel at sustained, low-power AI-assisted workloads that run over an extended period. For example, webcam tasks such as background blur and image segmentation deliver better overall performance when executed on the NPU.
To maximize the performance of your AI application, you need to ensure maximum utilization of the NPU's compute and memory resources. In this brief tutorial, we show you how this is done.
Using Intel VTune Profiler, you can identify bottlenecks in the code running on the NPU:
- Understand how much data is transferred between the NPU and DDR memory
- Identify the most time-consuming tasks running on the NPU
- Visualize NPU utilization over time
- Understand NPU bandwidth utilization
This tutorial walks you through the steps of configuring VTune Profiler for NPU analysis and interpreting the results.
Analysis Flow
1. Create a Python Environment
2. Configure VTune for NPU Analysis
3. Interpret the Results
   - Summary View (NPU Exploration)
   - Bottom-Up View
   - Timeline View
4. Optimize the Workload Based on the Results
Example Code
For this specific tutorial, we are using the Post-Training Quantization of MobileNet* v2 OpenVINO™ Model, which can be found on GitHub*. Please check the README for setup and configuration details.
It demonstrates the Post-Training Quantization API of the Neural Network Compression Framework (NNCF) by quantizing an OpenVINO MobileNet v2 model pre-trained on the Imagenette* dataset.
1. Create a Python Environment
Create a Python environment for the workload and install its dependencies.
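For example, the following commands (one possible setup; the virtual-environment name and the use of git to fetch NNCF are illustrative) create an isolated environment and fetch the NNCF repository that contains the example:
python -m venv .venv
source .venv/bin/activate
git clone https://github.com/openvinotoolkit/nncf.git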
cd nncf/examples/post_training_quantization/openvino/mobilenet_v2/
pip install -r requirements.txt
2. Configure VTune Profiler for NPU Analysis
For NPU analysis, select NPU Exploration from the list of analysis types.
There are two data collection modes, each with a different purpose:
- Time-based: collects system-wide metrics
  - Use for larger workloads
  - Lower overhead
- Query-based: collects metrics for each Level Zero API call instance
  - Use for smaller workloads
  - Higher overhead
Currently, the only supported metric set is the NOC (network-on-chip) fabric; the NPU goes through this fabric to read and write data from and to DDR memory. The sampling interval can also be set based on your workload and the required profiling granularity. Additional options can be set in the HOW pane for more fine-grained control.
You can also run this analysis from the command line. Here is a sample command line format for the NPU exploration analysis.
Common format:
vtune -collect npu [-knob <knob_name=knob_option>] -- <target> [target_options]
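For instance, to list the knobs available for this analysis and then store a collection in a named result directory (npu_result is an arbitrary name and <your_script.py> is a placeholder for your workload), you could use something like:
vtune -help collect npu
vtune -collect npu -result-dir npu_result -- python <your_script.py>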
In this example, we run a MobileNet v2 model on the NPU and profile the application with VTune's NPU Exploration analysis.
Original Sample Code
mobilnet_unoptimized.py:
import re
import subprocess
from pathlib import Path
from typing import List
import numpy as np
import openvino as ov
import torch
from fastdownload import FastDownload
from rich.progress import track
from sklearn.metrics import accuracy_score
from torchvision import datasets
from torchvision import transforms
import nncf
ROOT = Path(__file__).parent.resolve()
DATASET_PATH = Path().home() / ".cache" / "nncf" / "datasets"
MODEL_PATH = Path().home() / ".cache" / "nncf" / "models"
MODEL_URL = "https://huggingface.co/alexsu52/mobilenet_v2_imagenette/resolve/main/openvino_model.tgz"
DATASET_URL = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz"
DATASET_CLASSES = 10
def download(url: str, path: Path) -> Path:
    downloader = FastDownload(base=path.resolve(), archive="downloaded", data="extracted")
    return downloader.get(url)
def validate(model: ov.Model, val_loader: torch.utils.data.DataLoader) -> float:
    predictions = []
    references = []
    # Compile the model for the NPU device and run inference over the validation set.
    compiled_model = ov.compile_model(model, device_name="NPU")
    output = compiled_model.outputs[0]
    for images, target in track(val_loader, description="Validating"):
        pred = compiled_model(images)[output]
        predictions.append(np.argmax(pred, axis=1))
        references.append(target)
    predictions = np.concatenate(predictions, axis=0)
    references = np.concatenate(references, axis=0)
    return accuracy_score(predictions, references)
def run_benchmark(model_path: Path, shape: List[int]) -> float:
    # Benchmark the model on the NPU device (the device profiled in this tutorial).
    cmd = ["benchmark_app", "-m", model_path.as_posix(), "-d", "NPU", "-api", "async", "-niter", "50000", "-shape", str(shape)]
    cmd_output = subprocess.check_output(cmd, text=True)  # nosec
    print(*cmd_output.splitlines()[-8:], sep="\n")
    match = re.search(r"Throughput\: (.+?) FPS", cmd_output)
    return float(match.group(1))
def get_model_size(ir_path: Path, m_type: str = "Mb") -> float:
    xml_size = ir_path.stat().st_size
    bin_size = ir_path.with_suffix(".bin").stat().st_size
    for t in ["bytes", "Kb", "Mb"]:
        if m_type == t:
            break
        xml_size /= 1024
        bin_size /= 1024
    model_size = xml_size + bin_size
    print(f"Model graph (xml): {xml_size:.3f} Mb")
    print(f"Model weights (bin): {bin_size:.3f} Mb")
    print(f"Model size: {model_size:.3f} Mb")
    return model_size
###############################################################################
# Create an OpenVINO model and dataset
dataset_path = download(DATASET_URL, DATASET_PATH)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
val_dataset = datasets.ImageFolder(
    root=dataset_path / "val",
    transform=transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]
    ),
)
val_data_loader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False)
path_to_model = download(MODEL_URL, MODEL_PATH)
ov_model = ov.Core().read_model(path_to_model / "mobilenet_v2_fp32.xml")
###############################################################################
# Quantize an OpenVINO model
#
# The transformation function transforms a data item into model input data.
#
# To validate the transform function use the following code:
# >> for data_item in val_loader:
# >> model(transform_fn(data_item))
def transform_fn(data_item):
    images, _ = data_item
    return images
calibration_dataset = nncf.Dataset(val_data_loader, transform_fn)
ov_quantized_model = nncf.quantize(ov_model, calibration_dataset)
###############################################################################
# Benchmark performance, calculate compression rate and validate accuracy
fp32_ir_path = ROOT / "mobilenet_v2_fp32.xml"
ov.save_model(ov_model, fp32_ir_path, compress_to_fp16=False)
print(f"[1/7] Save FP32 model: {fp32_ir_path}")
fp32_model_size = get_model_size(fp32_ir_path)
print("[3/7] Benchmark FP32 model:")
fp32_fps = run_benchmark(fp32_ir_path, shape=[1, 3, 224, 224])
Sample Command Line:
vtune -c npu -- python mobilnet_unoptimized.py
3. Interpret the Results
You can see the NPU exploration results by launching the VTune Profiler GUI and opening the .vtune file.
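If you prefer to stay on the command line, you can also print a summary of the collected data with the vtune -report option; for example (replace npu_result with the result directory produced by your collection):
vtune -report summary -r npu_result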
Top Task Overview
Figure 1: Summary View of NPU Exploration Analysis
The NPU Device Load section shows the data transferred from DDR SDRAM to the NPU through the NOC. As can be seen in Figure 1, the amount of data transferred is 679.8 GB.
There is also an “NPU Top Compute Tasks” section that lists the top compute tasks executing on the NPU. In this example, zeAppendGraphExecute appends a graph execution command to the command list; in this context, a graph represents a sequence of operations that can be executed on the NPU. These API calls take ~35s in total, with a task count of 50,001.
Figure 2: List of Top Tasks on the Host Side
The ’Top Tasks’ section lists the most active tasks on the host side of the application's execution. This section helps identify which functions or code regions consume the most CPU time. The most time-consuming host tasks are zeFenceHostSynchronize API calls, which take ~54s; these calls make the host wait until a fence is signaled or a specified timeout expires. The list shows the total time taken by each task, the total task count, and the average time per task.
A Deeper Look
Figure 3: Bottom-Up View of NPU and Host Tasks
The next step is to investigate each task more deeply. For that, go to the Bottom-Up view, where you can see the execution timeline and a bottom-up breakdown of the functions and tasks based on the selected grouping.
The recommended grouping for understanding the NPU and host tasks is Task Type/Function/Call Stack.
Figure 4 (timeline view) shows the list of threads running on the CPU and NPU, along with the corresponding NPU Utilization and NPU Bandwidth over the timeline. Here, we select the region where the NPU was active and see that the inference duration is ~54s.
Figure 4: Timeline view of the Application Execution
The rest of the execution time (first 35s) is spent on other CPU tasks (Statistical Collection, Fast Bias Correction). You can see more details on CPU activity by expanding the ‘Thread’ section.
As Figure 5 shows, CPU tasks like Memory Allocation and Queue Creation occur right before the inference starts.
Figure 5: Expanded view of the Thread section of Bottom-up viewpoint
If you see a dip in the NPU Utilization or NPU Bandwidth, you can zoom into that time slice and inspect the corresponding CPU and NPU tasks to understand which API or task is causing the performance loss.
Whenever there is a dip in NPU utilization or NPU bandwidth, a host-side task occurs (marked in purple). To investigate further, expand the corresponding thread and hover over the task to see its details.
Figure 6: NPU DDR Bandwidth
From the timeline view, we can also see the NPU DDR bandwidth utilization throughout the execution time. Figure 6 shows that the peak NPU DDR bandwidth is ~14 GB/s.
Typical NPU Utilization Bottlenecks
Bottlenecks Observed Using Intel VTune Profiler:
- High DDR-to-NPU data transfer, adding data-transfer overhead
- High pressure on NPU memory
- Significant time spent on host-side synchronization
- Dips in NPU utilization and NPU bandwidth utilization
Recommendation
When possible, use lower-precision data to reduce the amount of data transferred and the pressure on the memory subsystem. Lower-precision data also reduces latency in host-device communication, alleviating the synchronization overhead.
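As a sketch of what this recommendation amounts to for this workload (reusing the variable names from the scripts in this article; the full optimized script follows in the next step), the core change is to quantize the OpenVINO model to INT8 with NNCF before compiling it for the NPU:
# Condensed sketch of the optimization; assumes ov_model, val_data_loader, and
# transform_fn are defined as in the sample code above.
import nncf
import openvino as ov

calibration_dataset = nncf.Dataset(val_data_loader, transform_fn)  # calibration data for PTQ
ov_quantized_model = nncf.quantize(ov_model, calibration_dataset)  # FP32 -> INT8
compiled_model = ov.compile_model(ov_quantized_model, device_name="NPU")  # run the smaller model on the NPU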
4. Optimize the Workload Based on the Results
In this step, we apply an optimization technique based on the bottlenecks we observed using Intel VTune Profiler. We will quantize the model using the Neural Network Compression Framework (NNCF) Post-Training Quantization algorithm. The goal of quantization is to reduce the pressure on the memory subsystem per the recommendation above.
Changes Made for Code Optimization:
- Quantized the FP32 model using the nncf.quantize function with a calibration dataset.
- Saved the quantized (INT8) model.
- Benchmarked the quantized model on the NPU device.
import re
import subprocess
from pathlib import Path
from typing import List
import numpy as np
import openvino as ov
import torch
from fastdownload import FastDownload
from rich.progress import track
from sklearn.metrics import accuracy_score
from torchvision import datasets
from torchvision import transforms
import nncf
ROOT = Path(__file__).parent.resolve()
DATASET_PATH = Path().home() / ".cache" / "nncf" / "datasets"
MODEL_PATH = Path().home() / ".cache" / "nncf" / "models"
MODEL_URL = "https://huggingface.co/alexsu52/mobilenet_v2_imagenette/resolve/main/openvino_model.tgz"
DATASET_URL = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz"
DATASET_CLASSES = 10
def download(url: str, path: Path) -> Path:
    downloader = FastDownload(base=path.resolve(), archive="downloaded", data="extracted")
    return downloader.get(url)
def validate(model: ov.Model, val_loader: torch.utils.data.DataLoader) -> float:
    predictions = []
    references = []
    # Compile the model for the NPU device and run inference over the validation set.
    compiled_model = ov.compile_model(model, device_name="NPU")
    output = compiled_model.outputs[0]
    for images, target in track(val_loader, description="Validating"):
        pred = compiled_model(images)[output]
        predictions.append(np.argmax(pred, axis=1))
        references.append(target)
    predictions = np.concatenate(predictions, axis=0)
    references = np.concatenate(references, axis=0)
    return accuracy_score(predictions, references)
def run_benchmark(model_path: Path, shape: List[int]) -> float:
    # Benchmark the quantized model on the NPU device (the device profiled in this tutorial).
    cmd = ["benchmark_app", "-m", model_path.as_posix(), "-d", "NPU", "-api", "async", "-niter", "50000", "-shape", str(shape)]
    cmd_output = subprocess.check_output(cmd, text=True)  # nosec
    print(*cmd_output.splitlines()[-8:], sep="\n")
    match = re.search(r"Throughput\: (.+?) FPS", cmd_output)
    return float(match.group(1))
def get_model_size(ir_path: Path, m_type: str = "Mb") -> float:
    xml_size = ir_path.stat().st_size
    bin_size = ir_path.with_suffix(".bin").stat().st_size
    for t in ["bytes", "Kb", "Mb"]:
        if m_type == t:
            break
        xml_size /= 1024
        bin_size /= 1024
    model_size = xml_size + bin_size
    print(f"Model graph (xml): {xml_size:.3f} Mb")
    print(f"Model weights (bin): {bin_size:.3f} Mb")
    print(f"Model size: {model_size:.3f} Mb")
    return model_size
###############################################################################
# Create an OpenVINO model and dataset
dataset_path = download(DATASET_URL, DATASET_PATH)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
val_dataset = datasets.ImageFolder(
    root=dataset_path / "val",
    transform=transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]
    ),
)
val_data_loader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False)
path_to_model = download(MODEL_URL, MODEL_PATH)
ov_model = ov.Core().read_model(path_to_model / "mobilenet_v2_fp32.xml")
def transform_fn(data_item):
    images, _ = data_item
    return images
calibration_dataset = nncf.Dataset(val_data_loader, transform_fn)
ov_quantized_model = nncf.quantize(ov_model, calibration_dataset)
###############################################################################
# Benchmark performance, calculate compression rate and validate accuracy
int8_ir_path = ROOT / "mobilenet_v2_int8.xml"
ov.save_model(ov_quantized_model, int8_ir_path)
print(f"[2/7] Save INT8 model: {int8_ir_path}")
int8_model_size = get_model_size(int8_ir_path)
print("[4/7] Benchmark INT8 model:")
int8_fps = run_benchmark(int8_ir_path, shape=[1, 3, 224, 224])
After applying the optimization, re-run the NPU Exploration analysis (a sample command follows the list) and inspect the following metrics in VTune Profiler:
- The change in the amount of DDR-to-NPU data transferred
- The inference time in VTune's timeline view
- The peak NPU memory bandwidth
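For example, assuming the optimized code above is saved as mobilnet_optimized.py (a filename chosen here for illustration) and using an arbitrary result-directory name, the collection can be repeated with:
vtune -c npu -r npu_result_int8 -- python mobilnet_optimized.py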
Observed Execution Speedup
Figure 7: NPU Device Load and NPU Top Compute Tasks for Quantized Model
As can be seen from Figure 7, the NPU Device Load has dropped from ~680 GB to ~146 GB, a reduction of almost 78.5% in the NPU DDR data transferred. Reducing this data-transfer overhead should accelerate the inference time on the NPU.
Also, the NPU Top Compute Tasks section shows that the top time-consuming task, zeAppendGraphExecute, now takes only ~22s, which is a 37.14% improvement.
Figure 8: Top Tasks on the Host side after quantization
We can also see significant differences in the top tasks on the host side. For example, zeFenceHostSynchronize and zeCommandQueueExecuteCommandLists now take much less time than in the unquantized implementation of the model.
Figure 9: Inference Time of the quantized model (Timeline View)
As seen in Figure 9, the inference duration is now ~38s, about 30% shorter than in the previous implementation. You can see the inference time by selecting the region of the timeline where the NPU was active.
Figure 10: NPU Bandwidth of the quantized Model
As can be seen from Figure 10, the peak NPU DDR bandwidth has dropped from ~14 GB/s to ~4 GB/s, reflecting the reduced NPU memory activity and DDR-to-NPU data transfer that significantly improve the overall inference performance.
Test it Yourself
To walk through these steps yourself and apply them to your own application, you can simply do the following:
- Set up your own environment as described at the beginning of this article.
- Run any examples provided here, or use your own development project.
- Profile the application using Intel VTune Profiler to detect and then fix any bottlenecks.
Get the Software
You can install Intel VTune Profiler as a part of the Intel® oneAPI Base Toolkit or download its stand-alone version for free!
Streamline your software setup with our toolkit selector to install full kits or new, right-sized sub-bundles containing only the needed components for specific use cases! Save time, reduce hassle, and get the perfect tools for your project with Intel® C++ Essentials, Intel® Fortran Essentials, and Intel® Deep Learning Essentials.