# Profile DPDK with Intel® VTune™ Amplifier

Published: 11/14/2018

Last Updated: 11/12/2018

## Overview

It is a good idea to profile your Data Plane Development Kit (DPDK) application at different stages in development. In this tutorial we show how to use VTune™ Amplifier 2019 to run two Data Plane Development Kit Test Suite (DPDK Test Suite) microbenchmarks, distributor_perf_autotest and ring_perf_autotest, using DPDK v16.11.8 LTS. We then analyze data collected during each of the profiling runs.

The system used in this tutorial is running Ubuntu* 16.04.5 LTS on an Intel® Xeon® processor E5-2699 v4 with two 10-Gigabit network interface cards (NICs). Each NIC has two 10-Gigabit ports; the NICs used in this tutorial are the Intel® 82599 10 Gigabit Ethernet Controller and the Intel® Ethernet Controller X540-AT2.

Note: You must have root access on your test system to follow the steps in this tutorial.

## Install Kernel Debug Symbols

To get started, install the Linux* debug symbols and download the correct Linux source files. This step is required to enable the profiling report to display function names and corresponding source code.

The size of the Linux debug symbols package is roughly 600 MB.

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | tee -a /etc/apt/sources.list.d/ddebs.list
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse" | tee -a /etc/apt/sources.list.d/ddebs.list
apt-get update
apt-get install linux-image-$(uname -r)-dbgsym
uname -r
apt install linux-source-4.15.0
cd /usr/src/linux-source-4.15.0
tar xf linux-source-4.15.0.tar.bz2
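The commands above hard-code kernel version 4.15.0. As a minimal sketch, assuming Ubuntu's usual `linux-source-<base version>` package naming, the matching package name can be derived from the output of `uname -r`. The helper name `source_pkg_for` is hypothetical and not part of any Ubuntu tool.

```shell
#!/usr/bin/env bash
# Derive the linux-source package name from a kernel release string.
# source_pkg_for is a hypothetical helper used only for illustration.
source_pkg_for() {
    local release="$1"            # for example: 4.15.0-39-generic
    local base="${release%%-*}"   # drop everything from the first '-'
    printf 'linux-source-%s\n' "$base"
}

# Print the package name that corresponds to the running kernel.
source_pkg_for "$(uname -r)"
```

If the printed name differs from linux-source-4.15.0, substitute it in the `apt install`, `cd`, and `tar` commands, and in the Sources path used later in VTune Amplifier.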


## Install DPDK

cd
wget http://fast.dpdk.org/rel/dpdk-16.11.8.tar.xz
tar xf dpdk-16.11.8.tar.xz

### Build

Install the build tools and an ancillary library header, set the required environment variables, and then build the DPDK.

cd
apt install build-essential
apt install libnuma-dev
cd dpdk-stable-16.11.8
export RTE_SDK=/home/dpdk/dpdk-stable-16.11.8
export RTE_TARGET=x86_64-native-linuxapp-gcc
export EXTRA_CFLAGS='-g'
make install T=x86_64-native-linuxapp-gcc DESTDIR=install
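Before moving on, it may be worth confirming that the build produced the two artifacts the later steps rely on: the test application and the igb_uio kernel module. The sketch below assumes the source tree sits in your home directory. The helper name `check_dpdk_build` is hypothetical, and the paths are the ones used elsewhere in this tutorial.

```shell
#!/usr/bin/env bash
# check_dpdk_build is a hypothetical helper, not a DPDK tool.
# It reports whether the files used later in this tutorial exist.
check_dpdk_build() {
    local sdk="$1"
    local target="${2:-x86_64-native-linuxapp-gcc}"
    local missing=0 f
    for f in "$sdk/$target/app/test" "$sdk/$target/kmod/igb_uio.ko"; do
        if [ -e "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Assumes the tree was unpacked in the home directory.
check_dpdk_build "$HOME/dpdk-stable-16.11.8" || echo "build looks incomplete"
```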

## Configure DPDK

After the DPDK is built, you’ll configure your system to use hugepages, then bind a NIC to a DPDK-compatible driver.

### Configure Hugepages

The Getting Started Guide for Linux at DPDK.org describes why hugepage support is needed: “Hugepage support is required for the large memory pool allocation used for packet buffers … By using hugepage allocations, performance is increased since fewer pages are needed, and therefore fewer Translation Lookaside Buffers (TLBs, high-speed translation caches), which reduce the time it takes to translate a virtual page address to a physical page address. Without hugepages, high TLB miss rates would occur with the standard 4k page size, slowing performance.”

The Use of Hugepages in the Linux Environment section of the Getting Started guide mentioned above will guide you through configuring hugepages for your system.

### Bind DPDK to a NIC

To bind the DPDK to a NIC on this system, first load the correct drivers into the environment, then use the dpdk-devbind tool to bind the NIC to the DPDK driver. To achieve this, run the following commands:

cd
cd dpdk-stable-16.11.8
modprobe uio
insmod ./x86_64-native-linuxapp-gcc/kmod/igb_uio.ko
./tools/dpdk-devbind.py --status

This outputs the status of the network and crypto devices on the system and the driver each device is currently using.

Figure 1. dpdk-devbind status

Bind the network device enp61s0f1 to the DPDK using the following command:

./tools/dpdk-devbind.py --bind=igb_uio enp61s0f1

Figure 2. dpdk-devbind bind
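Once a port is bound to igb_uio, its kernel interface name (enp61s0f1 here) is no longer available, so it can help to record the port's PCI address before running the bind command. The sketch below reads it from sysfs, where `/sys/class/net/<interface>/device` is a symbolic link to the PCI device. The helper name `pci_addr_of` and its optional second argument (an alternate sysfs root, useful only for testing) are hypothetical.

```shell
#!/usr/bin/env bash
# pci_addr_of is a hypothetical helper: it prints the PCI address behind a
# kernel network interface by resolving the sysfs "device" symlink.
pci_addr_of() {
    local iface="$1"
    local sysfs="${2:-/sys}"      # alternate root, for testing only
    local link="$sysfs/class/net/$iface/device"
    [ -e "$link" ] || return 1    # interface unknown or already unbound
    basename "$(readlink -f "$link")"
}

# Prints an address of the form dddd:bb:dd.f if the interface exists.
pci_addr_of enp61s0f1 || echo "enp61s0f1 not found (already bound to igb_uio?)"
```

The dpdk-devbind.py --status listing identifies devices by this same PCI address, so the recorded value can be used to find the port again after binding.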

Now that the network device is bound to a DPDK-compatible driver, configure the DPDK to use hugepages. Do this by running the dpdk-setup.sh tool.

./tools/dpdk-setup.sh

Since this example uses a two-socket NUMA system, we will select option 20 and allocate 10,000 hugepages, each 2 MB in size.

Note: Use a lower number of hugepages for memory-constrained systems.
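To judge whether a given hugepage count fits your system, it helps to work out how much memory the reservation takes. The sketch below does that arithmetic and then prints the kernel's current hugepage counters. The helper name `hugepages_mib` is hypothetical, and the 2048 kB default simply reflects the 2 MB page size used in this tutorial.

```shell
#!/usr/bin/env bash
# hugepages_mib is a hypothetical helper: memory reserved, in MiB, for a
# given number of hugepages of a given size in kB.
hugepages_mib() {
    local count="$1"
    local size_kb="${2:-2048}"    # 2 MB pages, as used in this tutorial
    echo $(( count * size_kb / 1024 ))
}

# 10,000 pages of 2 MB each: prints 20000 (MiB), roughly 19.5 GiB.
hugepages_mib 10000 2048

# Show the kernel's current hugepage counters, if the file is available.
grep -i '^Huge' /proc/meminfo 2>/dev/null || true
```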

Figure 3. dpdk-setup

## Profiling DPDK Test Suite Microbenchmarks

Now that the DPDK is configured, we will use VTune Amplifier Hotspots analysis with Hardware Event-Based Sampling to profile two DPDK microbenchmark tests, distributor_perf_autotest and ring_perf_autotest.

### Configure and start VTune™ Amplifier

If you haven’t installed VTune Amplifier, do it now. For more information and installation instructions, read Get Started with VTune™ Amplifier 2019.

Type the two commands below to load environment variables and start the VTune Amplifier graphical user interface (GUI).

source /opt/intel/vtune_amplifier/amplxe-vars.sh
/opt/intel/vtune_amplifier/bin64/amplxe-gui

Figure 4. The VTune Amplifier GUI

Select New Project and give it an appropriate name.

### Start the DPDK test suite

Open a new terminal window and use the commands below to start the DPDK Test Suite test application:

cd
cd dpdk-stable-16.11.8
./x86_64-native-linuxapp-gcc/app/test

Figure 5. Start the DPDK Test Suite application test

At the RTE>> prompt, typing ? lists all the available benchmarks.

After you have created your project, do the following steps:

• Select Configure Analysis
• Run a default Hotspots analysis with Hardware Event-Based Sampling selected.

Figure 6

• Next, attach VTune Amplifier to the running test process by specifying its Process ID (PID).

To determine the correct PID for the test process, run the following:
ps ax | grep test

Figure 7

Figure 8

• Finally, set the path for the source files and binaries for the project. Under Configure Analysis click Search Sources/Binaries.

Figure 9

Figure 10

Place the following paths in the appropriate text box:

Binaries/Symbols: /root/dpdk-stable-16.11.8/x86_64-native-linuxapp-gcc/app

Sources: /usr/src/linux-source-4.15.0/linux-source-4.15.0
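For the attach step above, `ps ax | grep test` can return many unrelated lines, including the grep command itself. A narrower lookup, sketched below, uses `pgrep -x`, which matches only processes whose name is exactly the given string. The helper name `pid_of_exact` is hypothetical, and the sketch assumes the procps `pgrep` utility is installed.

```shell
#!/usr/bin/env bash
# pid_of_exact is a hypothetical helper: it prints the PIDs of processes
# whose name matches the argument exactly (pgrep -x).
pid_of_exact() {
    pgrep -x -- "$1"
}

# The DPDK test application binary is named "test".
pid_of_exact test || echo "no running process named test"
```

If more than one PID is printed, `ps -o pid,args -p <PID>` shows the full command line for each so you can pick the DPDK test process.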

### Profiling distributor_perf_autotest

First, we’ll profile the distributor_perf_autotest microbenchmark, which measures the cost of interprocessor communication: moving a cache line from one processor to another.

#### Test and analyze

Start the Hotspots analysis.

Figure 11

Once the Hotspots analysis has started, return to the terminal running the ./x86_64-native-linuxapp-gcc/app/test process, and run the distributor_perf_autotest microbenchmark.

Figure 12. distributor_perf_autotest benchmark

After the distributor_perf_autotest microbenchmark has completed, click the stop button for VTune Amplifier to end profiling. From there, VTune Amplifier will analyze the collection and output a report, as shown below.

Figure 13. distributor_perf_autotest profile
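The same collection can likely be driven from the command line with amplxe-cl instead of the GUI. The sketch below only prints the command rather than running it. The helper name `hotspots_attach_cmd`, the PID 12345, and the result directory name are hypothetical, and the options shown reflect our understanding of the VTune Amplifier 2019 command line, so check them against `amplxe-cl -help collect hotspots` on your installation.

```shell
#!/usr/bin/env bash
# hotspots_attach_cmd is a hypothetical helper that prints (does not run)
# an amplxe-cl command attaching a hardware event-based Hotspots
# collection to an already running process.
hotspots_attach_cmd() {
    local pid="$1"
    local result_dir="${2:-r000hs}"
    printf 'amplxe-cl -collect hotspots -knob sampling-mode=hw -target-pid %s -result-dir %s\n' \
        "$pid" "$result_dir"
}

# 12345 is a placeholder PID; substitute the PID of the test process.
hotspots_attach_cmd 12345 dpdk_distributor_hs
```

Printing the command first makes it easy to review the options before running anything as root.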

Observe on the Summary page in Figure 14 that the function that runs the longest in the distributor_perf_autotest microbenchmark is _mm_pause. Also, notice that CPU 0 is essentially the only CPU utilized while the benchmark runs. The graph on this page shows whether the workload is parallelized.

Note: Not all workloads can be parallelized.

Figure 14

Figure 15

On the Bottom-up tab, it is easy to see that _mm_pause takes up the majority of the CPU time. The bottom-up stack also shows how the _mm_pause function is called:

start_thread -> eal_thread_loop -> handle_work -> rte_distributor_get_pkt -> rte_pause -> _mm_pause

VTune Amplifier also shows the source file of the function. Double-click the function to view the location of your source files and binaries, as shown below.

Figure 16

After running the microbenchmark, the results show that the majority of the time is spent in spin-wait loops.

### Profiling ring_perf_autotest

Communication between cores, as well as communication between cores and the NIC, happens through rings and descriptors.

While NIC hardware optimizes by batching work through the report status (RS) and descriptor done (DD) bits, DPDK extends this batching, and amortizes the per-operation cost, by offering an API for bulk communication through rings.

The ring tests show that single producer, single consumer (SP/SC) with bulk sizes both in enqueue and dequeue give the best performance, compared to multiple producers, multiple consumers (MP/MC).

#### Test and analyze

Run another Hotspots analysis. Return to the terminal running the test process and run the ring_perf_autotest microbenchmark; stop collection when the benchmark has finished.

Figure 17

Observe from the Summary page that the function that runs the longest in the ring_perf_autotest microbenchmark is __rte_ring_mc_do_dequeue. However, the function that runs the longest is not necessarily inefficient, as shown later on. Also, notice on the Summary page that the ring_perf_autotest microbenchmark utilizes only three CPUs.

Figure 18

Figure 19

Figure 20

On the Bottom-up tab, notice the light red-shaded cells under the Microarchitecture Usage column. This metric estimates how effectively the code runs on the current microarchitecture. Within the Microarchitecture Usage column, there is the CPI Rate column. Cycles per Instruction Retired (CPI) rate is a fundamental performance metric: it measures how many cycles each instruction takes on average. In this analysis, __rte_ring_mc_do_dequeue is the biggest offender, taking five cycles per instruction, making this a good place to optimize for performance.
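As a worked illustration of the metric, CPI is the number of unhalted clock cycles divided by the number of instructions retired. The sketch below computes it with awk. The helper name `cpi` is hypothetical, and the counts are made-up values chosen to reproduce a CPI of five, not measurements from this profile.

```shell
#!/usr/bin/env bash
# cpi is a hypothetical helper: cycles per instruction retired, printed
# with two decimal places. It fails if the instruction count is zero.
cpi() {
    awk -v c="$1" -v i="$2" 'BEGIN { if (i == 0) exit 1; printf "%.2f\n", c / i }'
}

# Made-up counts: 5,000,000 cycles over 1,000,000 instructions.
cpi 5000000 1000000   # prints 5.00
```

A lower value means more instructions complete per cycle, so comparing the CPI of a function before and after a change is one way to see whether an optimization helped.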

## Summary

This tutorial showed how to configure your system to analyze DPDK using VTune Amplifier, then how to use VTune Amplifier Hotspots Analysis to profile two DPDK Test Suite microbenchmarks. Use this article and the resources listed below to get started profiling your DPDK application with VTune Amplifier.

## Resources

The DPDK Cookbook

VTune™ Amplifier 2019

Get Started with VTune™ Amplifier 2019

Data Plane Development Kit (DPDK)

Data Plane Development Kit Test Suite (DPDK Test Suite)
