Announcing NumaTOP

Performance analysis engineers know that non-uniform memory access (NUMA) can seriously impact performance and that NUMA performance analysis can be challenging.

Linux* has a NUMA observation tool: numastat. It reports how memory allocations are split between local and remote nodes and shows the overall memory configuration of all nodes. It also counts allocations that landed on a node other than the one the process preferred (displayed in the numa_miss column) and allocations that were intended for a node but were satisfied by another node (displayed in the numa_foreign column). However, numastat accounts only for memory allocation; it does not measure the real-time memory traffic generated by the CPUs. Users need more: there is no easy-to-use tool that lets you observe NUMA-related issues and see where the NUMA bottlenecks reside. These issues are difficult to locate, especially in complex server environments.
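To make the allocation-only view concrete, here is a minimal Python sketch that parses counters in the format of the per-node sysfs file /sys/devices/system/node/node<N>/numastat and derives a local allocation ratio. The sample values are made up for illustration, and the helper names parse_numastat and local_ratio are hypothetical. Note that the result still says nothing about runtime memory traffic.

```python
def parse_numastat(text):
    """Parse the 'name value' lines of a per-node numastat file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def local_ratio(counters):
    """Fraction of this node's allocations made by tasks running on this node."""
    total = counters["local_node"] + counters["other_node"]
    return counters["local_node"] / total if total else 0.0


# Made-up sample in the sysfs format; on a real system, read
# /sys/devices/system/node/node0/numastat instead.
SAMPLE = """\
numa_hit 900
numa_miss 100
numa_foreign 40
interleave_hit 0
local_node 750
other_node 250
"""

counters = parse_numastat(SAMPLE)
print(f"local allocation ratio: {local_ratio(counters):.2f}")
```

These counters accumulate from boot, so a single ratio describes allocation history rather than what any process is doing at the moment.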

This author needed a tool that follows typical steps used in NUMA analysis and provides a good starting point to dive in and fix NUMA-related bottlenecks.

The typical steps that the author uses are:

1. Find the memory-intensive applications with the poorest memory locality.
2. Determine the application's node affinity.
3. Explore the memory hot spots in the application and find the poorest latency.
4. Determine the locality of these memory hot spots.
5. Find the places in the application that access the memory hot spots, and then get a list of call chains to the code that accesses these memory hot spots.

NumaTOP is a new Linux observation tool that automates these steps. Unlike numastat, it characterizes runtime memory locality and analyzes the processes and threads running on a NUMA system. It helps you:

Characterize the NUMA behavior of processes and threads
Identify where the NUMA-related performance bottlenecks reside

The tool uses performance counter-sampling technologies from Intel. It associates the performance data with Linux system runtime information to provide real-time analysis for production systems.

The following step-by-step example shows how NumaTOP helps you perform a NUMA analysis. Figure 1 shows the NUMA topology of a two-socket platform based on the Intel® Xeon® E5-2680 processor. A simple example program, mgen, is used to demonstrate NumaTOP. It generates memory accesses that are guaranteed to miss the last-level cache (LLC).

Figure 1

Step 1: Find Processes with the Poorest Memory Locality

Figure 2

The process mgen is memory intensive and has the poorest memory locality.
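The ranking behind this step can be sketched in a few lines of Python. This is an illustration only, not NumaTOP's implementation: the ProcSample type, the min_accesses threshold, and all access counts are hypothetical, and it assumes per-process counts of remote and local memory accesses are already available from some sampling source.

```python
from dataclasses import dataclass


@dataclass
class ProcSample:
    name: str
    remote_accesses: int  # accesses served by another node's memory
    local_accesses: int   # accesses served by the local node's memory

    @property
    def remote_ratio(self):
        total = self.remote_accesses + self.local_accesses
        return self.remote_accesses / total if total else 0.0


def rank_by_poor_locality(samples, min_accesses=1000):
    """Drop processes with little memory traffic, then sort worst locality first."""
    busy = [s for s in samples
            if s.remote_accesses + s.local_accesses >= min_accesses]
    return sorted(busy, key=lambda s: s.remote_ratio, reverse=True)


# Hypothetical counts for illustration only.
samples = [
    ProcSample("mgen", remote_accesses=95000, local_accesses=5000),
    ProcSample("db", remote_accesses=20000, local_accesses=80000),
    ProcSample("idle", remote_accesses=10, local_accesses=2),
]

ranked = rank_by_poor_locality(samples)
for s in ranked:
    print(f"{s.name}: {s.remote_ratio:.0%} remote")
```

Filtering on total accesses first keeps a nearly idle process with a handful of remote accesses from appearing at the top of the list.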

Step 2: Determine the mgen Node Affinity

Figure 3

The process mgen is running on Node 1.
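A related check can be done by hand with standard Linux interfaces. The sketch below maps a set of CPUs to NUMA nodes using cpulist strings in the format of /sys/devices/system/node/node<N>/cpulist. It works from the CPUs a process is allowed to run on (for example, the result of os.sched_getaffinity(pid)), which is a different signal from the per-node measurement shown in the figure. The CPU numbering below is made up, and the helper names are hypothetical.

```python
def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nodes_for_cpus(allowed_cpus, node_cpulists):
    """Return the sorted node ids whose CPUs overlap the allowed CPU set."""
    return sorted(node for node, cpulist in node_cpulists.items()
                  if allowed_cpus & parse_cpulist(cpulist))


# Made-up topology for a two-node system; on a live system, read each
# /sys/devices/system/node/node<N>/cpulist file instead.
NODE_CPULISTS = {0: "0-7,16-23", 1: "8-15,24-31"}

# Hypothetical affinity mask; on a live system use os.sched_getaffinity(pid).
allowed = {8, 9, 10, 11}
print("nodes:", nodes_for_cpus(allowed, NODE_CPULISTS))
```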

Step 3: Explore Memory Hot Spots

Find the hot spots in mgen with the poorest latency.

Figure 4

The mgen memory hot spot is a single memory area of 256 MB.
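Conceptually, finding the hot spot with the poorest latency means grouping sampled memory accesses by the memory area they fall in and comparing average latencies. The Python sketch below shows that grouping under stated assumptions: the (address, latency) samples, the area boundaries, and the function name are all hypothetical, and NumaTOP's own data collection is not shown.

```python
from collections import defaultdict


def average_latency_by_area(areas, samples):
    """areas: list of (start, size); samples: list of (address, latency_cycles).

    Returns {(start, size): average latency} for areas that received samples.
    """
    totals = defaultdict(lambda: [0, 0])  # area -> [latency sum, sample count]
    for addr, latency in samples:
        for start, size in areas:
            if start <= addr < start + size:
                totals[(start, size)][0] += latency
                totals[(start, size)][1] += 1
                break  # samples outside every area are simply ignored
    return {area: total / count for area, (total, count) in totals.items()}


# Hypothetical data: a small 4 KB area and a 256 MB area.
areas = [(0x1000, 0x1000), (0x10000000, 256 * 1024 * 1024)]
samples = [(0x1800, 80), (0x10000040, 300), (0x18000000, 340), (0x90000000, 500)]

averages = average_latency_by_area(areas, samples)
worst = max(averages, key=averages.get)
print(f"worst area: start={worst[0]:#x}, size={worst[1] // (1024 * 1024)} MB")
```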

Step 4: Determine the Locality

Figure 5

The memory hot spot is physically allocated on Node 0, while mgen runs on Node 1, so its accesses to the hot spot are remote. The previous steps show how many remote memory accesses are generated; the next step goes further and finds where in the code these accesses originate.
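The physical placement of a mapping can also be cross-checked through /proc/<pid>/numa_maps, where each line carries N<node>=<pages> fields. The following Python sketch parses one such line. The sample line is made up to resemble a 256 MB anonymous mapping (65536 pages of 4 KB) that sits entirely on Node 0, and the helper name is hypothetical.

```python
import re

# Matches per-node page counts such as 'N0=65536' in a numa_maps line.
NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_line):
    """Return {node id: page count} for one line of /proc/<pid>/numa_maps."""
    pages = {}
    for field in numa_maps_line.split():
        match = NODE_FIELD.match(field)
        if match:
            pages[int(match.group(1))] = int(match.group(2))
    return pages


# Made-up sample line; real lines come from /proc/<pid>/numa_maps.
LINE = "7f5c40000000 default anon=65536 dirty=65536 N0=65536 kernelpagesize_kB=4"

pages = pages_per_node(LINE)
total = sum(pages.values())
for node, count in sorted(pages.items()):
    print(f"node {node}: {count} pages ({count / total:.0%})")
```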

Step 5: What's Accessing Memory

Find the places in mgen that are accessing the memory hot spots and get a list of call chains to the code that accesses the memory hot spots.

Figure 6

The buf_read() function is the key function that generates a huge number of memory accesses. With this information, you can revisit how the memory it reads was originally allocated in order to improve mgen's locality and overall performance.
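Producing a list of call chains amounts to counting how often each distinct chain appears among the sampled accesses to the hot spot. Here is a minimal Python sketch of that aggregation. The samples are hypothetical, and every function name other than buf_read is invented for illustration.

```python
from collections import Counter


def top_call_chains(chains, n=3):
    """Count identical call chains and return the n most frequent ones."""
    counts = Counter(tuple(chain) for chain in chains)
    return counts.most_common(n)


# Hypothetical call chains (outermost caller first), one per sampled access.
chains = [
    ["main", "run_test", "buf_read"],
    ["main", "run_test", "buf_read"],
    ["main", "run_test", "buf_read"],
    ["main", "buf_init"],
]

top = top_call_chains(chains)
for chain, hits in top:
    print(f"{hits:3d}  {' -> '.join(chain)}")
```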

NumaTOP can help you find out what you want to know about NUMA bottlenecks.

