Use Intel® Resource Director Technology to Allocate Memory Bandwidth

Overview

Intel® Resource Director Technology (Intel® RDT) provides monitoring and control over shared platform resources. This helps ensure that administrator or application Quality of Service (QoS) targets can be met. With an Intel RDT enabled system, a system administrator or orchestrator can monitor and allocate memory bandwidth or Last Level Cache (LLC) per application, container, virtual machine (VM), or even per thread if necessary. In this article we show how to allocate memory bandwidth on Intel systems via Intel RDT. Using a Redis* in-memory data store workload, we demonstrate how Intel RDT can improve performance in densely consolidated environments where shared resources come under pressure.

Requirements

To follow along with this tutorial and reproduce the expected results, your system should include a second-generation Intel® Xeon® Scalable processor and a Linux* distribution with a kernel of 4.10 or later, which integrates the resource control (resctrl) kernel interface. From here on, we assume that you have ssh access and are logged in as root on the system.

The system demonstrated in this tutorial is running Ubuntu* 18.04.1 Long Term Support (LTS) on a two-socket system with an Intel® Xeon® Gold 6252 processor.

Configuring Ubuntu* Linux*

To configure our system, we edit the /etc/default/grub file to enable Memory Bandwidth Allocation (MBA) technology and isolate nine cores from the Linux scheduler. To do this, we will change the following line:

GRUB_CMDLINE_LINUX=""

to this:

GRUB_CMDLINE_LINUX="rdt=mba isolcpus=0-8"

In the line above, rdt=mba enables the MBA feature of Intel RDT, and isolcpus=0-8 isolates CPUs 0-8 from the kernel scheduler. Monitoring features such as Memory Bandwidth Monitoring (MBM) are controlled by separate rdt= options and are not changed by this line. Next, we update the grub configuration and reboot so that the options take effect.

update-grub
reboot
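
After the reboot, it can be worth confirming that the options actually reached the kernel. The following is a minimal sketch; has_kernel_arg is a hypothetical helper name introduced here for illustration, and the BOOT_IMAGE part of the sample string is made up.

```shell
# Hypothetical helper: succeed if a kernel command line contains an exact argument.
# $1 = command-line string (for example, the contents of /proc/cmdline)
# $2 = argument to look for (for example, rdt=mba)
has_kernel_arg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example using the options configured above.
sample="BOOT_IMAGE=/boot/vmlinuz ro rdt=mba isolcpus=0-8"
if has_kernel_arg "$sample" "rdt=mba"; then
    echo "rdt=mba is present"
fi
```

On the live system, pass "$(cat /proc/cmdline)" as the first argument. The isolated CPUs can usually also be read from /sys/devices/system/cpu/isolated, which should list 0-8 if the isolcpus option took effect.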

Finally, we install the following:

apt update
apt install git
apt install build-essential
git clone https://github.com/jeffhammond/STREAM.git # STREAM BENCHMARK
apt install redis-server

This clones the STREAM repository onto our system. More information on the STREAM benchmark can be found on the University of Virginia STREAM FAQ page.

Also, we edit the /etc/redis/redis.conf file to allow the init system, systemd, to manage our Redis server. Set the supervised directive to systemd, as shown on the last line below:

# If you run Redis from upstart or systemd, Redis can interact with your
# supervision tree. Options:
#   supervised no      - no supervision interaction
#   supervised upstart - signal upstart by putting Redis into SIGSTOP mode
#   supervised systemd - signal systemd by writing READY=1 to $NOTIFY_SOCKET
#   supervised auto    - detect upstart or systemd method based on
#                        UPSTART_JOB or NOTIFY_SOCKET environment variables
# Note: these supervision methods only signal "process is ready."
#       They do not enable continuous liveness pings back to your supervisor.
supervised systemd

We must restart redis-server for the changes to take effect:

systemctl restart redis.service

Working with Intel® RDT

To work with Intel RDT, mount the resource control virtual file system with the following command:

mount -t resctrl resctrl /sys/fs/resctrl
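
The resctrl file system is typically rejected if it is mounted a second time, so a script may want to check first. This is a minimal sketch, assuming the usual /proc/mounts layout (device, mount point, file system type). is_resctrl_mounted is a hypothetical helper, and its optional argument exists only so that the check can be tried against a sample file.

```shell
# Hypothetical helper: succeed if resctrl is mounted at /sys/fs/resctrl.
# $1 = mounts file to inspect (optional, defaults to /proc/mounts)
is_resctrl_mounted() {
    mounts_file=${1:-/proc/mounts}
    awk '$2 == "/sys/fs/resctrl" && $3 == "resctrl" { found = 1 }
         END { exit !found }' "$mounts_file"
}
```

A typical use would be: is_resctrl_mounted || mount -t resctrl resctrl /sys/fs/resctrl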

Once the virtual file system is mounted, the directory structure will be as follows:

tree /sys/fs/resctrl

/sys/fs/resctrl/
├── cpus
├── cpus_list
├── info
│   ├── L3
│   │   ├── cbm_mask
│   │   ├── min_cbm_bits
│   │   ├── num_closids
│   │   └── shareable_bits
│   ├── L3_MON
│   │   ├── max_threshold_occupancy
│   │   ├── mon_features
│   │   └── num_rmids
│   ├── last_cmd_status
│   └── MB
│       ├── bandwidth_gran
│       ├── delay_linear
│       ├── min_bandwidth
│       └── num_closids
├── mon_data
│   ├── mon_L3_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_L3_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
├── mon_groups
├── schemata
└── tasks

As shown above, the resctrl interface exposes the Intel RDT features: Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), MBA, and MBM. With resctrl, the operating system manages the per-thread, per-application, and per-process tags defined by the Intel RDT architecture, such as the resource monitoring IDs (RMIDs) and Classes of Service (CLOS or COS).

The root of /sys/fs/resctrl is the default CLOS. Let's change directories to the root of Resource Control:

cd /sys/fs/resctrl

By printing the schemata file to standard output we get the following:

cat /sys/fs/resctrl/schemata
    L3:0=7ff;1=7ff
    MB:0=100;1=100

In the MB row we see that both socket 0 and socket 1 have a value of 100. Thus, the default CLOS may use 100 percent of the available memory bandwidth on each socket.
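
To read a single value out of a schemata file from a script, the MB line can be split on its separators. This is a minimal sketch, assuming the MB line format shown above (domain=percent pairs separated by semicolons); mb_percent is a hypothetical helper, and the sample file stands in for /sys/fs/resctrl/schemata.

```shell
# Hypothetical helper: print the MB percentage for one domain (socket) ID.
# $1 = schemata file, $2 = domain ID
mb_percent() {
    awk -v dom="$2" '
        { sub(/^[ \t]+/, "") }              # tolerate leading indentation
        /^MB:/ {
            sub(/^MB:/, "")
            n = split($0, entries, ";")
            for (i = 1; i <= n; i++) {
                split(entries[i], kv, "=")
                if (kv[1] + 0 == dom + 0) { print kv[2] + 0; exit }
            }
        }
    ' "$1"
}

# Example with the default values shown above.
sample=$(mktemp)
printf 'L3:0=7ff;1=7ff\nMB:0=100;1=100\n' > "$sample"
mb_percent "$sample" 1    # prints 100
rm -f "$sample"
```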

Now, let’s look at the info directory by changing directories to the following:

cd /sys/fs/resctrl/info
tree

├── L3
│   ├── cbm_mask
│   ├── min_cbm_bits
│   ├── num_closids
│   └── shareable_bits
├── L3_MON
│   ├── max_threshold_occupancy
│   ├── mon_features
│   └── num_rmids
├── last_cmd_status
└── MB
    ├── bandwidth_gran
    ├── delay_linear
    ├── min_bandwidth
    └── num_closids

We will keep the scope of this article to the MB directory. Printing the num_closids file shows how many CLOS are available for partitioning memory bandwidth.

cat /sys/fs/resctrl/info/MB/num_closids
8

We observe that the memory bandwidth on each socket can be partitioned among up to 8 CLOS.

Also, by printing the bandwidth_gran, min_bandwidth, and delay_linear files, we see the allocation granularity, the minimum allocation, and the type of throttling scale for the MB resource.

cat /sys/fs/resctrl/info/MB/bandwidth_gran
10
cat /sys/fs/resctrl/info/MB/min_bandwidth
10
cat /sys/fs/resctrl/info/MB/delay_linear
1

Bandwidth can be allocated in steps of 10 percent, with a minimum allocation of 10 percent. Finally, delay_linear specifies whether the throttling scale is linear (1 for linear, 0 for non-linear).
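
The kernel's resctrl documentation describes requested bandwidth values as being rounded to the next available control step. The sketch below mimics that arithmetic so a value can be sanity-checked before it is written to a schemata file. mb_step is a hypothetical helper; the round-up behavior and the upper bound of 100 are assumptions here and should be checked against the running kernel.

```shell
# Hypothetical helper: print the bandwidth step a request would map to.
# $1 = requested percent, $2 = min_bandwidth, $3 = bandwidth_gran
mb_step() {
    req=$1
    min=$2
    gran=$3
    # Assumption: values below the minimum or above 100 are rejected.
    if [ "$req" -lt "$min" ] || [ "$req" -gt 100 ]; then
        echo "request $req is outside $min-100" >&2
        return 1
    fi
    # Assumption: round up to the next multiple of the granularity.
    echo $(( (req + gran - 1) / gran * gran ))
}

mb_step 35 10 10    # prints 40
```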

Other CLOS groups may be created by using the mkdir command in the root directory.

cd /sys/fs/resctrl
mkdir test

We created a CLOS (resource group) called test that we can now define. The virtual file system automatically generates its cpus, tasks, and schemata files.

cd /sys/fs/resctrl/test
tree

├── cpus
├── cpus_list
├── mon_data
│   ├── mon_L3_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_L3_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
├── mon_groups
├── schemata
└── tasks

To define the attributes of this CLOS, or any other CLOS, we edit the cpus (or cpus_list) and schemata files. Alternatively, the system admin can assign tasks (PIDs) to a CLOS instead of CPUs.

Tasks:

A list of tasks that belong to this group. Tasks can be added to a group by writing the task ID (PID) to the tasks file, which automatically removes them from the group to which they previously belonged. New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a PID is not in any sub-partition, it is in the root partition (that is, the default partition).

CPUs:

A bitmask of logical CPUs assigned to this group. Writing a new mask can add or remove CPUs from the group. Added CPUs are removed from their previous group. Removed ones are placed in the default (root) group. You cannot remove CPUs from the default group.

Schemata:

A list of all the resources available to this group. Each resource has its own line and format. For example, the L3 line holds a bitmask representing the cache ways available to the group, while the MB line holds a memory bandwidth percentage for each socket.

For more information regarding the /sys/fs/resctrl file structure, follow this GitHub* Resctrl page.
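
As an illustration of the tasks file described above, the sketch below writes every thread ID of a process into a group, one ID per write. It assumes that each thread of a multithreaded process (such as redis-server) has to be added individually. assign_threads is a hypothetical helper, and the optional third argument (an alternative proc root) exists only so that the sketch can be tried without a real resctrl mount.

```shell
# Hypothetical helper: add every thread of a process to a resctrl group.
# $1 = PID, $2 = group directory, $3 = proc root (optional, defaults to /proc)
assign_threads() {
    pid=$1
    group=$2
    proc_root=${3:-/proc}
    moved=0
    for task_dir in "$proc_root/$pid"/task/*; do
        [ -d "$task_dir" ] || continue
        # Write one thread ID at a time to the group's tasks file.
        echo "${task_dir##*/}" > "$group/tasks" || return 1
        moved=$(( moved + 1 ))
    done
    # Fail if the process had no visible threads (for example, a wrong PID).
    [ "$moved" -gt 0 ]
}
```

For example, assign_threads 2031 /sys/fs/resctrl/test would place a process with PID 2031 and all of its threads in the test group.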

Configuring STREAM

The STREAM benchmark measures memory transfer rates in megabytes per second (MB/s) for simple computational kernels coded in C.

Now, let us cd into the cloned STREAM repository and edit the Makefile as follows:

CC = gcc
CFLAGS = -DSTREAM_ARRAY_SIZE=99999999 -DNTIMES=1000 -O2 -fopenmp


all: stream_c


# The target name matches the output file, so make can track it
stream_c: stream.c
	$(CC) $(CFLAGS) stream.c -o stream_c

clean:
	rm -f stream_c *.o

# an example of a more complex build line for the Intel icc compiler
stream.icc: stream.c
	icc -O3 -xCORE-AVX2 -ffreestanding -qopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 stream.c -o stream.omp.AVX2.80M.20x.icc

We set the array size to 99999999 elements and configured each kernel to run 1000 times. Next, we build the benchmark:

make all

After we compile our program we will have an executable named stream_c.

MBA Allocation in Action

In this demonstration, we run redis-benchmark three times: on an idle system, in an environment with noisy neighbors, and finally with those noisy neighbors placed in a low-priority CLOS to control their interference.

Before we run our benchmark, let us pin the redis-server process and threads to isolated core 0.

# ps ax | grep redis
 2031 ?        Ssl   10:50 /usr/bin/redis-server 127.0.0.1:6379
 6327 pts/0    S+     0:00 grep --color=auto redis
# taskset -acp 0 2031
pid 2031's current affinity list: 0
pid 2031's new affinity list: 0
pid 2051's current affinity list: 0
pid 2051's new affinity list: 0
pid 2052's current affinity list: 0
pid 2052's new affinity list: 0
pid 2053's current affinity list: 0
pid 2053's new affinity list: 0

Benchmark on Idle System

Let’s run our benchmark on isolated core 1 when the system is idle, to observe the best-case performance (running alone).

Note: Performance numbers may vary based on system configuration, BIOS settings, operating system tuning, benchmark tuning, and many other parameters, so these numbers should be considered as an example only, rather than a demonstration of the performance capabilities of the system.

Note: Before each redis-benchmark run, a redis-cli flushall is executed.

taskset -ac 1 redis-benchmark -d 500000 -r 10000 -n 100000 -t set,get -q
SET: 4140.79 requests per second
GET: 5265.65 requests per second

The redis-benchmark parameters set in our benchmark:

-d <size> Data size of SET/GET value in bytes (default 2)
-r <keyspacelen> Use random keys for SET/GET/INCR, random values for SADD
-n <requests> Total number of requests (default 100000)
-t <tests> Only run the comma separated list of tests. The test names are the same as the ones produced as output.
-q Quiet. Just show query/sec values

On our idle system, we observe that we can achieve 4140 SET requests per second, and 5265 GET requests per second.
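
When comparing several runs, it can help to pull the numbers out of the benchmark output with a script. This is a minimal sketch, assuming the quiet (-q) output format shown above; rps_for is a hypothetical helper. Carriage returns are converted to newlines first, since redis-benchmark may emit in-place progress updates before the final line.

```shell
# Hypothetical helper: read redis-benchmark -q output on stdin and print
# the final requests-per-second value for one test name (for example, GET).
rps_for() {
    tr '\r' '\n' |
        awk -v name="$1:" '
            $1 == name && /requests per second/ { rps = $2 }
            END { if (rps != "") print rps }'
}

# Example with the idle-system output shown above.
printf 'SET: 4140.79 requests per second\nGET: 5265.65 requests per second\n' |
    rps_for GET    # prints 5265.65
```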

Benchmark with Noisy Neighbor

We will run our noisy neighbors in the background on isolated cores 2-8, and rerun our benchmark. Performance scores degrade due to the presence of the noisy neighbors on the platform, which compete for LLC capacity and memory bandwidth.

cd
cd STREAM
taskset -ac 2 ./stream_c &
taskset -ac 3 ./stream_c &
taskset -ac 4 ./stream_c &
taskset -ac 5 ./stream_c &
taskset -ac 6 ./stream_c &
taskset -ac 7 ./stream_c &
taskset -ac 8 ./stream_c &
taskset -ac 1 redis-benchmark -d 500000 -r 10000 -n 100000 -t set,get -q
SET: 3551.64 requests per second
GET: 3778.29 requests per second

Our high-priority application in this case has reduced performance due to the shared resource contention in the system.

approximately 14 percent decrease in SET requests per second (4140.79 to 3551.64)
approximately 28 percent decrease in GET requests per second (5265.65 to 3778.29)
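
The relative change between two runs is (after - before) / before. The following sketch reproduces that arithmetic for the idle and contended results reported above; pct_change is a hypothetical helper name.

```shell
# Hypothetical helper: print the percentage change from $1 (before) to $2 (after).
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

pct_change 4140.79 3551.64    # SET, idle vs. noisy neighbors: prints -14.2
pct_change 5265.65 3778.29    # GET, idle vs. noisy neighbors: prints -28.2
```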

Low-Priority CLOS

To configure a CLOS for the low-priority applications, do the following:

cd /sys/fs/resctrl
mkdir low_priority
cd low_priority
echo "1fc" > cpus

Edit the schemata file in our low_priority CLOS, to look like the following configuration:

L3:0=7ff;1=7ff
MB:0= 10;1=100

We have assigned isolated cores 2-8 to the low_priority CLOS and set the memory bandwidth of the CLOS to the lowest value of 10 on socket 0.
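
The value 1fc written to the cpus file is a hexadecimal bitmask with bits 2 through 8 set. The sketch below shows how such a mask can be derived from a contiguous CPU range. cpu_range_mask is a hypothetical helper, and it relies on 64-bit shell arithmetic, so it is only suitable for CPU numbers below 63.

```shell
# Hypothetical helper: print the hex bitmask for a contiguous CPU range.
# $1 = first CPU, $2 = last CPU (inclusive)
cpu_range_mask() {
    first=$1
    last=$2
    count=$(( last - first + 1 ))
    # Build 'count' one-bits, then shift them up to start at bit 'first'.
    printf '%x\n' $(( ((1 << count) - 1) << first ))
}

cpu_range_mask 2 8    # prints 1fc
```

The cpus_list file in the same directory offers an alternative: it takes a range such as 2-8 instead of a bitmask.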

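We will rerun our noisy neighbors in the low_priority CLOS, then benchmark our mission-critical Redis workload. Because the previous step left us in the resctrl directory, we first change back to the STREAM directory.

cd
cd STREAM
taskset -ac 2 ./stream_c &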
taskset -ac 3 ./stream_c &
taskset -ac 4 ./stream_c &
taskset -ac 5 ./stream_c &
taskset -ac 6 ./stream_c &
taskset -ac 7 ./stream_c &
taskset -ac 8 ./stream_c &
taskset -ac 1 redis-benchmark -d 500000 -r 10000 -n 100000 -t set,get -q
SET: 3989.79 requests per second
GET: 4744.73 requests per second

After dampening our noisy neighbors by setting them to an MBA throttling level of 10 percent, the performance of our application improves substantially versus the contended run, as follows:

approximately 12 percent increase in SET requests per second
approximately 25 percent increase in GET requests per second

By placing our noisy neighbors in a low-priority CLOS, we recover most of the lost performance: about 96 percent of the idle SET rate and about 90 percent of the idle GET rate. Other techniques, such as Cache Allocation Technology (CAT), could also help reduce the impact of the contention.

Summary

In this article, we demonstrated how to set up a system to take advantage of Intel RDT MBA technology, one of the advanced features available on the latest Intel® Xeon® processors. An operating system or Virtual Machine Manager (VMM) traditionally controls how many CPU cores and how much memory and storage an app, container, or VM receives. With Intel RDT, a system admin or orchestrator also gains granular control over how much LLC and memory bandwidth a workload can consume. The Intel RDT monitoring features such as CMT and MBM, though not covered here, can also provide insight into how shared system resources are being used, and can help find noisy neighbor applications present on the system.

Notices

No computer system can be absolutely secure.

Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer.

Performance results are based on testing as of March 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Per node: two-socket second-generation Intel® Xeon® Scalable processor Gold 6252 @ 2.1 GHz. Total memory: 12 slots / 32 GB / 2666 MT/s DDR4 DIMM. Hyper-Threading: disabled. Turbo: enabled. Storage: INTEL SSDSC2BA80 800 GB SATA 3.0 6 Gb/s. Network adapters: two Intel® Ethernet Connection X722 for 10GBASE-T. Redis server version 4.0.9, STREAM version 5.10. OS: Ubuntu 18.04.1 LTS. Kernel: Linux 4.15.0-46-generic.

For more information go to the Intel Product Performance page.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at Intel.