DPDK Event Device Profiling
Use Intel® VTune™ Profiler to analyze the efficiency of DPDK Event Device pipeline utilization in your DPDK-based application and identify issues, such as inhomogeneous load distribution and worker core underutilization.
Content Experts:
Eugeny Parshutin, Kurakin Ilia
The Data Plane Development Kit (DPDK) is a framework that consists of libraries that accelerate packet processing workloads running on a wide variety of CPU architectures. One of these libraries is the eventdev library, which enables you to improve system load balancing by using an event-based model in your application. In an event-based approach, the work that the system needs to do is presented as separate units called events. One common example of the event-based programming model in DPDK is the network packet processing pipeline, where each packet plays the role of an event.
This figure gives an example of an eventdev pipeline configuration:

Here, each block represents the following unit:
- Event Device – device with event scheduling capability, implemented either in hardware or software.
- Queue – logical stage of the processing pipeline that contains events of different flows associated with scheduling types (atomic, ordered, or parallel).
- Ports – points of contact between cores and the eventdev library that are used to enqueue and dequeue events to and from eventdev queues.
- Worker Cores – CPU cores that are available to the application to perform work.
- Rx Core – CPU core that receives packets from the NIC.
- Tx Core – CPU core that transmits packets to the NIC.
- NIC – Network Interface Card.
This example demonstrates an Event Device that is configured to manage four atomic stages that are presented as four event queues:
- queue_0 holds newly arrived packets. Only the Rx core enqueues packets (events) to this queue.
- queue_1 and queue_2 are each dedicated to an event processing stage, such as setting the destination address, cryptography processing, or compression. Worker cores perform these tasks and transfer packets between queues 0, 1, 2, and 3.
- queue_3 holds packets that are ready to be transmitted. Only the Tx core dequeues packets from this queue.
The dequeue operation is performed using the rte_event_dequeue_burst() routine in an endless loop. Thus, worker cores continuously poll Event Device ports, looking for a batch of events to be processed. The batch size depends on the overall load and the performance of the different stages. The maximum batch size is defined by the workload.
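The polling behavior described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not DPDK code: `dequeue_burst` is a hypothetical stand-in for rte_event_dequeue_burst(), the queue is an ordinary `collections.deque`, and `MAX_BURST` is an assumed maximum batch size chosen for illustration. A real worker loops forever; the sketch polls a fixed number of times so that it terminates.

```python
from collections import Counter, deque

MAX_BURST = 4  # assumed maximum batch size, chosen for illustration


def dequeue_burst(queue, max_events):
    """Hypothetical stand-in for rte_event_dequeue_burst().

    Returns up to max_events events, or an empty list if none are ready.
    """
    batch = []
    while queue and len(batch) < max_events:
        batch.append(queue.popleft())
    return batch


def poll(queue, arrivals):
    """Poll the queue once per entry in `arrivals`.

    arrivals[i] is the number of new events that show up before poll i.
    Returns a Counter mapping batch size -> number of polls of that size.
    """
    batch_sizes = Counter()
    next_event = 0
    for n_new in arrivals:
        for _ in range(n_new):
            queue.append(next_event)
            next_event += 1
        batch = dequeue_burst(queue, MAX_BURST)
        batch_sizes[len(batch)] += 1  # size 0 means an empty poll
    return batch_sizes


stats = poll(deque(), arrivals=[0, 2, 0, 6, 0, 0, 1])
print(dict(stats))
```

In this run, three of the seven polls return nothing, and the burst of six events is split into a full batch of four followed by a batch of two. The same two effects, empty polls and capped batches, are what the per-port dequeue statistics make visible.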
Per-worker dequeue statistics provided by Intel® VTune™ Profiler reveal the load balancing details and enable you to analyze pipeline configuration efficiency and identify pipeline bottlenecks.
This recipe defines the following steps to analyze the efficiency of the pipeline processing model in DPDK-based applications:
- Run Input and Output Analysis
- Analyze Load per Stage
- Analyze CPU Utilization
Ingredients
This section lists the hardware and software tools used in this performance analysis scenario:
- Application: the DPDK eventdev_pipeline application demonstrates the usage of the eventdev API and shows how an application can configure a pipeline and assign a set of worker cores to perform event processing. The application is compiled with DPDK with VTune Profiler support enabled.
- Tools:
  - DPDK, compiled with VTune Profiler support enabled. To enable eventdev profiling on the DPDK side, you need to apply a patch and recompile DPDK and the target DPDK application. Use the following patches:
  - Intel® VTune™ Profiler 2020: Input and Output analysis.
  - Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
  - Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
  - Get the latest version of Intel® VTune™ Profiler:
    - From the Intel® VTune™ Profiler product page.
    - Download the latest standalone package from the Intel® oneAPI standalone components page.
- System Setup:
  - Traffic generator: a system that generates traffic for the system being tested.
  - System under test: a system running the eventdev_pipeline application for packet (event) processing and VTune Profiler for performance data collection.
- CPU: Intel® Xeon® Platinum 8168 processor (formerly code named Skylake).
- Operating System: Linux* OS.
Run Input and Output Analysis
To collect DPDK eventdev dequeue statistics, use the Input and Output analysis of VTune Profiler.
To run the analysis from the GUI:
- Launch the VTune Profiler GUI and create a new project.
- In the HOW pane, select the Input and Output analysis.
- In Select IO API type to profile, select DPDK IO API.
- Click the Start button.
To run Input and Output analysis with DPDK profiling from the command line, use the following command:
vtune -collect io -knob kernel-stack=false -knob dpdk=true --target-process=eventdev_pipeline
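If you script your collections, the same command line can be assembled programmatically. The following is a minimal sketch that only rebuilds the command shown above; `build_vtune_command` is a hypothetical helper name, and the sketch assumes the `vtune` binary is on your PATH when you actually run the result.

```python
import shlex


def build_vtune_command(target_process):
    """Assemble the Input and Output collection command as an argument list.

    Only the options from the command above are used; target_process is the
    name of the running DPDK application to attach to.
    """
    return [
        "vtune",
        "-collect", "io",
        "-knob", "kernel-stack=false",
        "-knob", "dpdk=true",
        f"--target-process={target_process}",
    ]


cmd = build_vtune_command("eventdev_pipeline")
print(shlex.join(cmd))
```

To launch the collection, pass the list to `subprocess.run(cmd, check=True)`. Keeping the arguments in a list avoids shell quoting problems if the target name ever contains unusual characters.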
Analyze Load per Stage
To get an overall characterization of DPDK eventdev pipeline utilization, start your investigation with the Summary tab and explore the DPDK Events Dequeue Statistics histogram:

This histogram represents the statistics for the number of dequeued events for each eventdev port, that is, for each worker thread that polls the event device. Explore the different areas of the histogram to identify inhomogeneous load distribution and oversubscribed or underutilized workers.
If you identify any imbalance in worker thread load distribution, try to reconfigure your pipeline to avoid it and re-run the analysis.
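To make the idea of load imbalance concrete, here is a sketch that builds a per-port histogram of dequeue batch sizes and flags workers whose share of events deviates from the average. The input format, the function names, and the 25% tolerance are all assumptions made for illustration; they are not taken from VTune Profiler output, and the sample numbers are invented.

```python
from collections import Counter


def batch_histogram(batch_sizes):
    """Map batch size -> number of dequeue calls that returned that size."""
    return dict(sorted(Counter(batch_sizes).items()))


def find_imbalance(per_port_batches, tolerance=0.25):
    """Return ports whose total dequeued events differ from the mean
    by more than `tolerance`, expressed as a fraction of the mean."""
    totals = {port: sum(sizes) for port, sizes in per_port_batches.items()}
    mean = sum(totals.values()) / len(totals)
    flagged = {}
    for port, total in totals.items():
        if mean and abs(total - mean) / mean > tolerance:
            flagged[port] = "oversubscribed" if total > mean else "underutilized"
    return flagged


# Hypothetical batch sizes recorded per eventdev port (one port per worker).
samples = {
    0: [4, 4, 3, 4, 4],  # mostly full batches
    1: [1, 0, 0, 2, 0],  # mostly empty polls
    2: [2, 3, 2, 2, 3],  # close to the average
}

for port, sizes in samples.items():
    print(port, batch_histogram(sizes))
print(find_imbalance(samples))
```

With this sample data, port 0 is flagged as oversubscribed and port 1 as underutilized, while port 2 stays within the tolerance. A histogram that piles up at size 0 for one port and at the maximum batch size for another is the same pattern to look for in the Summary tab.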
Analyze CPU Utilization
To understand the CPU utilization for workers performing event dequeue operations, navigate to the Platform tab and explore the DPDK Event Dequeue Spin Time over-time metric attributed to worker threads.

The DPDK Event Dequeue Spin Time per-thread metric shows the ratio of empty dequeue cycles, which is the ratio of rte_event_dequeue_burst() calls that returned zero events with respect to the total number of dequeue calls. Explore this metric to estimate worker thread load and to decide whether the application underutilizes cores or needs more resources.
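The ratio defined above is simple to compute by hand, which can help when you interpret the metric. The sketch below assumes you have a list of batch sizes returned by successive dequeue calls for each worker; the function name and the sample data are hypothetical and only illustrate the arithmetic.

```python
def dequeue_spin_ratio(batch_sizes):
    """Fraction of dequeue calls that returned zero events.

    batch_sizes holds the number of events returned by each dequeue call.
    Returns 0.0 when no calls were recorded.
    """
    if not batch_sizes:
        return 0.0
    empty_calls = sum(1 for n in batch_sizes if n == 0)
    return empty_calls / len(batch_sizes)


# Hypothetical per-worker batch sizes, invented for illustration.
samples = {
    "worker_0": [4, 4, 0, 3],
    "worker_1": [0, 0, 0, 1, 0, 0, 0, 0],
}

for name, sizes in samples.items():
    print(name, dequeue_spin_ratio(sizes))
```

Here worker_0 spins on a quarter of its calls, while worker_1 spins on seven out of eight. A value close to 1 suggests the core is mostly idle and polling, and a value close to 0 suggests the worker almost always finds events waiting.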