It is a good idea to profile your Data Plane Development Kit (DPDK) application at different stages in development. In this tutorial we show how to use VTune™ Amplifier 2019 to run two Data Plane Development Kit Test Suite (DPDK Test Suite) microbenchmarks, distributor_perf_autotest and ring_perf_autotest, using DPDK v16.11.8 LTS. We then analyze data collected during each of the profiling runs.
The system used in this tutorial is running Ubuntu* 16.04.5 LTS on an Intel® Xeon® processor E5-2699 v4 with two 10-Gigabit network interface cards (NICs). Each NIC has two 10-Gigabit ports; the NICs used in this tutorial are the Intel® 82599 10 Gigabit Ethernet Controller and Intel® Ethernet Controller X540-AT2.
Note: You must have root access on your test system to follow the steps in this tutorial.
Install Kernel Debug Symbols
To get started, install the Linux* debug symbols and download the correct Linux source files. This step is required to enable the profiling report to display function names and corresponding source code.
The size of the Linux debug symbols package is roughly 600 MB.
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | tee -a /etc/apt/sources.list.d/ddebs.list
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse" | tee -a /etc/apt/sources.list.d/ddebs.list
apt-get update
apt-get install linux-image-$(uname -r)-dbgsym
uname -r
apt install linux-source-4.15.0
cd /usr/src/linux-source-4.15.0
tar xf linux-source-4.15.0.tar.bz2
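Before moving on, it can be worth confirming that the debug image is where a profiler will look for it. The sketch below assumes the Ubuntu dbgsym package places the kernel image at /usr/lib/debug/boot/vmlinux-<release>; treat that path as an assumption and check it against your package. The function name kernel_dbgsym_path and its optional prefix argument are hypothetical helpers introduced only for this example.

```shell
# Hypothetical helper: print where the kernel debug image is expected to live.
# Assumption: the dbgsym package installs /usr/lib/debug/boot/vmlinux-<release>.
# $1 = kernel release (defaults to the running kernel), $2 = optional path prefix.
kernel_dbgsym_path() {
    release="${1:-$(uname -r)}"
    printf '%s/usr/lib/debug/boot/vmlinux-%s\n' "${2:-}" "$release"
}

path="$(kernel_dbgsym_path)"
if [ -f "$path" ]; then
    echo "debug symbols found: $path"
else
    echo "debug symbols missing: $path"
fi
```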
Next, download and build the DPDK using the following commands.
wget http://fast.dpdk.org/rel/dpdk-16.11.8.tar.xz
tar xf dpdk-16.11.8.tar.xz
Install the build tools and the libnuma development headers, set the required environment variables, and then build the DPDK from its source directory.
cd
apt install build-essential
apt install libnuma-dev
export RTE_SDK=/home/dpdk/dpdk-stable-16.11.8
export RTE_TARGET=x86_64-native-linuxapp-gcc
export EXTRA_CFLAGS='-g'
cd $RTE_SDK
make install T=x86_64-native-linuxapp-gcc DESTDIR=install
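Setting EXTRA_CFLAGS to '-g' is what lets the profiler map samples back to source lines, so it can be useful to check that the resulting binary really carries debug information. The following is a minimal sketch, assuming readelf from binutils is available; has_debug_info is a hypothetical helper name, and the fallback path simply mirrors the RTE_SDK value used above.

```shell
# Hypothetical helper: succeed only if the ELF file contains a .debug_info section.
# Returns 2 if the file does not exist, 1 if it exists but has no debug info.
has_debug_info() {
    [ -f "$1" ] || return 2
    readelf -S "$1" 2>/dev/null | grep -q '\.debug_info'
}

test_bin="${RTE_SDK:-/home/dpdk/dpdk-stable-16.11.8}/x86_64-native-linuxapp-gcc/app/test"
if has_debug_info "$test_bin"; then
    echo "$test_bin carries debug info"
else
    echo "$test_bin is missing or was built without -g"
fi
```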
After the DPDK is built, you’ll configure your system to use hugepages, then bind the DPDK to a specific NIC.
The Getting Started Guide for Linux at DPDK.org describes why hugepage support is needed: “Hugepage support is required for the large memory pool allocation used for packet buffers … By using hugepage allocations, performance is increased since fewer pages are needed, and therefore fewer Translation Lookaside Buffers (TLBs, high-speed translation caches), which reduce the time it takes to translate a virtual page address to a physical page address. Without hugepages, high TLB miss rates would occur with the standard 4k page size, slowing performance.”
The Use of Hugepages in the Linux Environment section of the same Getting Started Guide explains how to configure hugepages for your system.
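For orientation, the manual route described in that section amounts to writing a page count into sysfs for each NUMA node and mounting hugetlbfs. The sketch below illustrates the idea under a few assumptions: reserve_hugepages is a hypothetical helper, the page count of 1024 is only an example, and the optional third argument (a prefix for the sysfs tree) exists purely so the function can be tried against a scratch directory without root.

```shell
# Hypothetical helper: reserve <count> 2 MB hugepages on NUMA node <node>.
# $3 is an optional prefix for the sysfs tree, intended only for dry runs.
reserve_hugepages() {
    node="$1"; count="$2"; prefix="${3:-}"
    f="$prefix/sys/devices/system/node/node$node/hugepages/hugepages-2048kB/nr_hugepages"
    [ -w "$f" ] || { echo "cannot write $f (need root, or no such node)" >&2; return 1; }
    echo "$count" > "$f"
}

# Example use (as root) on a two-socket system, followed by the hugetlbfs mount:
#   reserve_hugepages 0 1024 && reserve_hugepages 1 1024
#   mkdir -p /mnt/huge && mount -t hugetlbfs nodev /mnt/huge
```

The dpdk-setup.sh tool used later in this tutorial performs the equivalent steps interactively.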
Bind DPDK to a NIC
To bind the DPDK to a NIC on this system, first load the correct drivers into the environment, then use the dpdk-devbind tool to bind the NIC to the DPDK driver. To achieve this, run the following commands:
cd
cd dpdk-stable-16.11.8
modprobe uio
insmod ./x86_64-native-linuxapp-gcc/kmod/igb_uio.ko
./tools/dpdk-devbind.py --status
This outputs the status of the network and crypto devices on the system and the driver each device is currently using.
Figure 1. dpdk-devbind status
Bind the network device enp61s0f1 to the DPDK using the following command:
./tools/dpdk-devbind.py --bind=igb_uio enp61s0f1
Figure 2. dpdk-devbind bind
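Once a port is bound to igb_uio it no longer appears as a kernel network interface, so it has to be checked by PCI address rather than by name. As a rough sketch, the bound driver can be read from sysfs as shown below. pci_driver is a hypothetical helper, its second argument is a sysfs prefix used only for dry runs, and the address 0000:3d:00.1 is illustrative; take the real address from the dpdk-devbind.py --status output.

```shell
# Hypothetical helper: print the kernel driver currently bound to a PCI device,
# or "none" if no driver is bound (or the device does not exist).
# $1 = PCI address, $2 = optional prefix for the sysfs tree (dry runs only).
pci_driver() {
    link="${2:-}/sys/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Illustrative address only; substitute the one reported by dpdk-devbind.py --status.
echo "0000:3d:00.1 -> $(pci_driver 0000:3d:00.1)"
```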
Now that the network device is bound to a DPDK-compatible driver, configure the DPDK to use hugepages. Do this by running the dpdk-setup.sh tool.
Since this example uses a two-socket NUMA system, we will select option 20 and allocate 10,000 hugepages of 2 MB each.
Note: Use a lower number of hugepages for memory-constrained systems.
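To judge what a given page count costs in memory, multiply the number of pages by the page size. The sketch below works through the figure used above and then prints the hugepage counters from /proc/meminfo so that you can compare them with what the kernel actually granted. hugepage_mb is a hypothetical helper; if the setup script prompts for a count per NUMA node, the total is that figure multiplied by the number of nodes.

```shell
# Hypothetical helper: memory reserved by N hugepages of a given size, in MB.
# $1 = number of pages, $2 = page size in MB.
hugepage_mb() {
    echo $(( $1 * $2 ))
}

echo "10000 pages x 2 MB = $(hugepage_mb 10000 2) MB"

# HugePages_Total, HugePages_Free and Hugepagesize report the kernel's view.
grep -i '^huge' /proc/meminfo 2>/dev/null || true
```

At 20,000 MB (roughly 19.5 GiB), the allocation used in this tutorial is clearly sized for a large server, which is why a smaller count is advisable elsewhere.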
Figure 3. dpdk-setup
Profiling DPDK Test Suite Microbenchmarks
Now that the DPDK is configured, we will use VTune Amplifier Hotspots analysis with Hardware Event-Based Sampling to profile two DPDK microbenchmarks, distributor_perf_autotest and ring_perf_autotest.
Configure and start VTune™ Amplifier
If you haven’t installed VTune Amplifier, do it now. For more information and installation instructions, read Get Started with VTune™ Amplifier 2019.
Type the two commands below to load environment variables and start the VTune Amplifier graphical user interface (GUI).
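A minimal sketch of those two steps follows. It assumes a default VTune Amplifier 2019 installation under /opt/intel/vtune_amplifier_2019; that path is an assumption, and VTUNE_DIR is a variable introduced here only for illustration, so adjust it to match your system. The sketch relies on the amplxe-vars.sh environment script and the amplxe-gui launcher that ship with that release.

```shell
# Assumption: default install location for VTune Amplifier 2019; adjust if yours differs.
VTUNE_DIR="${VTUNE_DIR:-/opt/intel/vtune_amplifier_2019}"

if [ -f "$VTUNE_DIR/amplxe-vars.sh" ]; then
    . "$VTUNE_DIR/amplxe-vars.sh"   # load the VTune Amplifier environment variables
    amplxe-gui &                    # start the GUI in the background
else
    echo "amplxe-vars.sh not found in $VTUNE_DIR; set VTUNE_DIR to your install path" >&2
fi
```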
Figure 4. The VTune Amplifier GUI
Select New Project and give it an appropriate name.
Start the DPDK test suite
Open a new terminal window and use the commands below to start the DPDK Test Suite test application:
cd
cd dpdk-stable-16.11.8
./x86_64-native-linuxapp-gcc/app/test
Figure 5. Start the DPDK Test Suite application test
Typing ? at the RTE>> prompt lists all the available benchmarks.
After you have created your project, do the following steps:
- Select Configure Analysis
- Run a default Hotspots analysis with Hardware Event-Based Sampling selected.
- Next, attach VTune Amplifier to the running test process by specifying its Process ID (PID).
To determine the correct PID for the test process, run the following:
ps ax | grep test
- Finally, set the path for the source files and binaries for the project. Under Configure Analysis click Search Sources/Binaries.
Place the following paths in the appropriate text box:
First, we’ll profile the distributor_perf_autotest microbenchmark, which measures the cost of interprocessor communication, that is, of moving a cache line from one processor to another.
Test and analyze
Start the Hotspots analysis.
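If you prefer the command line, the GUI steps above (attach by PID, Hotspots with hardware event-based sampling) have a rough equivalent in the amplxe-cl tool. The sketch below only prints the command it would run. vtune_attach_cmd is a hypothetical helper, the result directory name is arbitrary, and the pgrep pattern assumes the test binary was started from the build path used in this tutorial. Using pgrep -f also avoids the extra matches that ps ax | grep test tends to return.

```shell
# Hypothetical helper: print an amplxe-cl command line that attaches to a PID
# and collects Hotspots data with hardware event-based sampling.
# $1 = PID of the running test process, $2 = result directory.
vtune_attach_cmd() {
    pid="$1"; result_dir="$2"
    echo "amplxe-cl -collect hotspots -knob sampling-mode=hw -target-pid $pid -r $result_dir"
}

# pgrep -f matches against the full command line of each process.
pid="$(pgrep -f 'linuxapp-gcc/app/test' 2>/dev/null | head -n 1 || true)"
if [ -n "$pid" ]; then
    vtune_attach_cmd "$pid" "$HOME/dpdk_hotspots"
else
    echo "test application is not running" >&2
fi
```

A collection started this way runs until it is stopped, for example with amplxe-cl -r <result directory> -command stop from another terminal.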
Once the Hotspots analysis has started, return to the terminal running the ./x86_64-native-linuxapp-gcc/app/test process, and run the distributor_perf_autotest microbenchmark.
Figure 12. distributor_perf_autotest benchmark
After the distributor_perf_autotest microbenchmark has completed, click the stop button for VTune Amplifier to end profiling. From there, VTune Amplifier will analyze the collection and output a report, as shown below.
Figure 13. distributor_perf_autotest profile
Observe on the Summary page in Figure 14 that the function that runs the longest in the distributor_perf_autotest microbenchmark is _mm_pause. Also, notice that CPU 0 is essentially the only CPU being utilized when running the benchmark. This graph shows whether the workload is parallelized.
Note: Not all workloads can be parallelized.
On the Bottom-up tab, it is easy to see that _mm_pause takes up the majority of the CPU time. Expanding the bottom-up stack also shows how the _mm_pause function is called:
start_thread -> eal_thread_loop -> handle_work -> rte_distributor_get_pkt -> rte_pause -> _mm_pause
VTune Amplifier also shows the source file of the function. Double-click the function to view the location of your source files and binaries, as shown below.
After running the microbenchmark, the results show that the majority of the time is spent in spin-wait loops.
Communication between cores, as well as communication between cores and the NIC, happens through rings and descriptors.
NIC hardware batches its work by means of the report status (RS) and descriptor done (DD) bits; the DPDK likewise amortizes the per-operation cost by offering an API for bulk communication through rings.
The ring tests show that single producer, single consumer (SP/SC) with bulk sizes in both enqueue and dequeue gives the best performance, compared to multiple producers, multiple consumers (MP/MC).
Test and analyze
Run another Hotspots analysis. Return to the terminal running the test process and run the ring_perf_autotest microbenchmark; stop collection when the benchmark has finished.
Observe from the Summary page that the function that runs the longest in the ring_perf_autotest microbenchmark is __rte_ring_mc_do_dequeue. However, running the longest does not by itself mean that a function is inefficient, as shown later on. Also, notice on the Summary page that the ring_perf_autotest microbenchmark utilizes only three CPUs.
On the Bottom-up tab, notice the light red-shaded cells under the Microarchitecture Usage column. This metric estimates how effectively the code runs on the current microarchitecture. Within the Microarchitecture Usage column, there is the CPI Rate column. Cycles per Instruction Retired (CPI) rate is a fundamental performance metric, as it measures how many cycles each instruction takes on average. In this analysis, __rte_ring_mc_do_dequeue is the biggest offender at five cycles per instruction, making it a good place to start optimizing.
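The CPI rate itself is simply the number of unhalted clock cycles divided by the number of instructions retired over the same interval. The sketch below shows that arithmetic. cpi is a hypothetical helper, and the counter values are illustrative numbers chosen to reproduce a CPI of five, not measured data.

```shell
# Hypothetical helper: CPI = clock cycles / instructions retired.
# $1 = cycle count, $2 = retired instruction count.
cpi() {
    awk -v c="$1" -v i="$2" 'BEGIN { if (i == 0) exit 1; printf "%.2f\n", c / i }'
}

# Illustrative counts only: 5,000,000 cycles over 1,000,000 instructions.
cpi 5000000 1000000
```

A lower CPI means more instructions retire per cycle, so a function with a high CPI and a large share of CPU time is usually the more promising optimization target.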
This tutorial showed how to configure your system to analyze DPDK using VTune Amplifier, then how to use VTune Amplifier Hotspots Analysis to profile two DPDK Test Suite microbenchmarks. Use this article and the resources listed below to get started profiling your DPDK application with VTune Amplifier.