Use Intel® Resource Director Technology to Allocate Last Level Cache (LLC)

Published: 03/25/2019  

Last Updated: 03/25/2019

Overview

Intel® Resource Director Technology (Intel® RDT) is all about determinism and Quality of Service (QoS). With an Intel RDT enabled system, a system administrator can monitor and allocate memory bandwidth and/or last level cache (LLC) per application, container, or virtual machine (VM). In this article, we demonstrate how to allocate LLC on our system via Intel RDT and show how this can increase QoS for our compression workload.

Requirements

To follow along with this tutorial, you must have an Intel® Xeon® processor E family or a 2nd generation Intel® Xeon® Scalable processor. The installed Linux* distribution must run kernel 4.10 or later, in which the Resource Control (resctrl) kernel interface is standard. We assume that you have ssh access and are logged in as root on the system.

The system demonstrated in this tutorial is running Ubuntu* 18.04.1 Long Term Support (LTS) on a four-socket system with an Intel® Xeon® E7-8890 processor version 4.
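The kernel requirement above can be checked before going further. The following is a minimal sketch; the function name kernel_at_least is hypothetical, and it only compares the major and minor numbers of a release string such as the one printed by uname -r.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# Succeeds when RELEASE (for example the output of `uname -r`) is at least MAJOR.MINOR.
kernel_at_least() (
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
)

# Usage on the target system (4.10 is the version named above):
# kernel_at_least "$(uname -r)" 4 10 && echo "kernel is new enough for resctrl"
```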

Configuring Ubuntu* Linux*

To configure our system, we will edit the /etc/default/grub file to enable Intel RDT and isolate four cores from the Linux scheduler. To do this, we will change the following line:

GRUB_CMDLINE_LINUX=""

to this:

GRUB_CMDLINE_LINUX=" rdt=cmt,l3cat isolcpus=0-3"

In the line above, we enabled two Intel RDT features: Cache Monitoring Technology (CMT) and L3 Cache Allocation Technology (CAT). We also isolated CPUs 0-3 from the kernel scheduler. Next, we update the grub configuration and reboot so that the options take effect.

update-grub
reboot
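After the reboot, it can be worth confirming that the options reached the running kernel. The sketch below is one way to do that, assuming the kernel command line is read from /proc/cmdline; the helper name has_boot_option is hypothetical.

```shell
# has_boot_option CMDLINE OPTION
# Succeeds when OPTION appears as a whole space-separated word in CMDLINE.
has_boot_option() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage after the reboot:
# cmdline=$(cat /proc/cmdline)
# has_boot_option "$cmdline" "rdt=cmt,l3cat" || echo "rdt option missing"
# has_boot_option "$cmdline" "isolcpus=0-3"  || echo "isolcpus option missing"
```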

Finally, we install the following:

apt update
apt-get install stress-ng

This installs stress-ng, a tool we will use to create our noisy neighbor. More information on stress-ng can be found on the Ubuntu manuals page.

Working with Intel® RDT

To work with Intel RDT, all a user needs to do is mount the resource control virtual file system using the command below:

mount -t resctrl resctrl /sys/fs/resctrl
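Running the mount command a second time on an already mounted file system returns an error, so a script may want to check first. A minimal sketch, assuming the mount table is available at /proc/mounts; the function name resctrl_mounted is hypothetical.

```shell
# resctrl_mounted [MOUNTS_FILE]
# Succeeds when MOUNTS_FILE (default /proc/mounts) lists a resctrl file system.
resctrl_mounted() {
    awk '$3 == "resctrl" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}

# Usage:
# resctrl_mounted || mount -t resctrl resctrl /sys/fs/resctrl
```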

Once the virtual file system is mounted, you’ll see the following directory structure:

tree /sys/fs/resctrl

/sys/fs/resctrl/
├── cpus
├── cpus_list
├── info
│   ├── L3
│   │   ├── cbm_mask
│   │   ├── min_cbm_bits
│   │   ├── num_closids
│   │   └── shareable_bits
│   ├── L3_MON
│   │   ├── max_threshold_occupancy
│   │   ├── mon_features
│   │   └── num_rmids
│   ├── last_cmd_status
│   └── MB
│       ├── bandwidth_gran
│       ├── delay_linear
│       ├── min_bandwidth
│       └── num_closids
├── mon_data
│   ├── mon_L3_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_L3_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
├── mon_groups
├── schemata
└── tasks

The root of /sys/fs/resctrl is the default Class of Service (CoS). Let's change directories to the root of Resource Control.

cd /sys/fs/resctrl

By printing the schemata file to the standard output, we get the following:

cat /sys/fs/resctrl/schemata

    L3:0=fffff;1=fffff;2=fffff;3=fffff
    MB:0=100;1=100;2=100;3=100

In the L3 row above, we see that sockets 0, 1, 2, and 3 each have a bitmask of fffff. Each of the 20 set bits represents one cache way, so each socket has 20 cache ways. Each processor on our system has an L3 cache of 60 megabytes (MB) divided into 20 cache ways, so each cache way is about 3 MB. The MB (memory bandwidth) row states that CPUs or tasks on sockets 0, 1, 2, and 3 in the root/default group have access to 100 percent of the available memory bandwidth.
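The arithmetic behind these numbers can be reproduced in the shell. The sketch below counts the set bits in a hex bitmask and divides the cache size by that count; the function names count_ways and way_size_kb are hypothetical, and the 60 MB figure is the L3 size quoted above.

```shell
# count_ways HEXMASK: print the number of set bits (cache ways) in a hex bitmask.
count_ways() (
    mask=$((0x$1))
    ways=0
    while [ "$mask" -gt 0 ]; do
        ways=$((ways + (mask & 1)))
        mask=$((mask >> 1))
    done
    echo "$ways"
)

# way_size_kb CACHE_MB HEXMASK: print the approximate size of one way in KB.
way_size_kb() (
    echo $(( $1 * 1024 / $(count_ways "$2") ))
)

count_ways fffff        # prints 20
way_size_kb 60 fffff    # prints 3072
```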

Now, take a look at the info directory by changing directories to the following:

cd /sys/fs/resctrl/info
tree

├── L3
│   ├── cbm_mask
│   ├── min_cbm_bits
│   ├── num_closids
│   └── shareable_bits
├── L3_MON
│   ├── max_threshold_occupancy
│   ├── mon_features
│   └── num_rmids
├── last_cmd_status
└── MB
    ├── bandwidth_gran
    ├── delay_linear
    ├── min_bandwidth
    └── num_closids

We’ll focus on the L3 directory. Printing the num_closids file shows how many CoSs the L3 cache can be partitioned into.

cat /sys/fs/resctrl/info/L3/num_closids

16

We see that each L3 cache can be partitioned into 16 CoSs.

Also, by printing out the cbm_mask and min_cbm_bits files, we see the maximum and minimum bitmask available for the L3 resource.

cat /sys/fs/resctrl/info/L3/cbm_mask

fffff

cat /sys/fs/resctrl/info/L3/min_cbm_bits

1
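The two values above bound the bitmasks that a schemata line may use. The sketch below checks a candidate mask against them. The function name valid_cbm is hypothetical, and the contiguity check reflects the usual Intel CAT requirement that the set bits form one unbroken run, which is an assumption about this platform rather than something the files above report.

```shell
# valid_cbm CANDIDATE CBM_MASK MIN_BITS   (masks in hex, without a 0x prefix)
# Succeeds when CANDIDATE looks usable as an L3 capacity bitmask.
valid_cbm() (
    cand=$((0x$1)); full=$((0x$2)); min_bits=$3
    [ "$cand" -ne 0 ] || exit 1                          # an empty mask is rejected
    [ $((cand & ~full)) -eq 0 ] || exit 1                # no bits outside cbm_mask
    m=$cand; bits=0
    while [ $((m & 1)) -eq 0 ]; do m=$((m >> 1)); done   # drop trailing zeros
    [ $((m & (m + 1))) -eq 0 ] || exit 1                 # remaining bits must be one run of 1s
    while [ "$m" -gt 0 ]; do bits=$((bits + 1)); m=$((m >> 1)); done
    [ "$bits" -ge "$min_bits" ]                          # at least min_cbm_bits bits set
)

valid_cbm 3 fffff 1 && echo "0x3 is usable"
valid_cbm 5 fffff 1 || echo "0x5 is rejected (bits are not contiguous)"
```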

Other CoS groups may be created by using the mkdir command in the root directory.

cd /sys/fs/resctrl
mkdir my_special_cos

We created a CoS called my_special_cos that we will now define. When a new resource group is created, the virtual file system automatically generates its cpus, tasks, and schemata files.

cd /sys/fs/resctrl/my_special_cos
tree

├── cpus
├── cpus_list
├── mon_data
│   ├── mon_L3_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_L3_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
├── mon_groups
├── schemata
└── tasks

To define the attributes of our CoS, we’ll edit the cpus (or cpus_list) and schemata files. Instead of CPUs, the system admin can also assign individual tasks (PIDs) to a CoS.

tasks:

A list of tasks that belong to this group. Tasks can be added to a group by writing the task ID (PID) to the "tasks" file, which automatically removes them from the group to which they previously belonged. New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a PID is not in any subpartition, it is in the root partition (that is, the default partition).

cpus:

A bitmask of logical CPUs assigned to this group. Writing a new mask can add/remove CPUs from this group. Added CPUs are removed from their previous group. Removed ones are given to the default (root) group. You cannot remove CPUs from the default group.

schemata:

A list of all the resources available to this group. Each resource has its own line and format. The format consists of a mask that controls access to the resource. For example, a schema for an L3 cache will have a mask representing the cache ways available.

For more information regarding the /sys/fs/resctrl file structure, follow this GitHub* Resctrl page.
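The rest of this article assigns CPUs to each CoS. For completeness, the task-based alternative described under "tasks" above could be sketched as follows; the helper name add_task is hypothetical, and it simply writes one PID to the group's tasks file.

```shell
# add_task GROUP_DIR PID: move PID into the resource group at GROUP_DIR.
add_task() {
    [ -d "$1" ] || { echo "no such resource group: $1" >&2; return 1; }
    echo "$2" > "$1/tasks"
}

# Usage (moves the current shell, and the children it starts afterwards, into the group):
# add_task /sys/fs/resctrl/my_special_cos "$$"
```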

LLC Allocation in Action

In this demonstration, we will run a contrived compression benchmark three ways: on an idle system, in our high-priority CoS, and finally with noisy neighbors added to the CoS to showcase a rogue app that decreases CoS performance.

First, to observe the best-case performance, we run our benchmark while the system is idle. We download a sample file to use in our benchmark.

wget http://downloads.dvdloc8.com/trailers/divxdigest/rogue_one_a_star_wars_story-trailer2.zip
apt install unzip
unzip rogue_one_a_star_wars_story-trailer2.zip
mv H.264 Rogue\ One\ -\ A\ Star\ Wars\ Story\ -\ Trailer\ 2.mp4  ~/r1.mp4

Benchmark on Idle System

time tar -czf test.tar.gz r1.mp4

real    0m4.428s
user    0m4.224s
sys     0m0.473s

The best-case scenario for our compression benchmark is about 4.4 seconds.

Benchmark in the CoS Group

First, cd into the my_special_cos resource group and set its attributes.

cd /sys/fs/resctrl/my_special_cos
echo "f" > cpus
echo "L3:0=1;1=fffff;2=fffff;3=fffff" > schemata

The first command sets the bitmask that defines which CPUs belong to this CoS to f, that is, CPUs 0-3. The second sets the bitmask that defines which cache ways this CoS can use to 0x1 on socket 0. Thus, CPUs 0-3 can consume at most ~3 MB (1 cache way * ~3 MB) of LLC.
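The value written to the cpus file is a hex bitmask with one bit per logical CPU. As a small sketch of how such a mask is derived from a CPU range (the function name cpu_range_mask is hypothetical):

```shell
# cpu_range_mask FIRST LAST: print the hex bitmask with bits FIRST..LAST set.
cpu_range_mask() (
    mask=0
    cpu=$1
    while [ "$cpu" -le "$2" ]; do
        mask=$((mask | (1 << cpu)))
        cpu=$((cpu + 1))
    done
    printf '%x\n' "$mask"
)

cpu_range_mask 0 3    # prints f
cpu_range_mask 1 3    # prints e
cpu_range_mask 0 0    # prints 1
```

The cpus_list file in the same directory is meant for ranges rather than bitmasks, so writing a range such as 0-3 there should be an alternative to computing the mask by hand.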

Next, open another terminal logged in as root to see how much LLC our process consumes. To view the LLC being used in a CoS, print the monitor data file located at /sys/fs/resctrl/<cos_name>/mon_data/mon_L3_00/llc_occupancy.
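A minimal sketch of reading that file in a more readable unit follows. It assumes the file holds a single occupancy value in bytes; the function name llc_occupancy_kb is hypothetical.

```shell
# llc_occupancy_kb FILE: print the occupancy stored in FILE (assumed bytes) as whole KB.
llc_occupancy_kb() (
    read -r bytes < "$1" || exit 1
    echo $((bytes / 1024))
)

# Usage in the second terminal, refreshing once per second:
# while sleep 1; do
#     llc_occupancy_kb /sys/fs/resctrl/my_special_cos/mon_data/mon_L3_00/llc_occupancy
# done
```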

Now, we’ll run our compression benchmark. First, we pin the process and its threads to isolated core 0.

Note: We use the taskset utility to pin our process to a specified core. More information on taskset can be read on the Linux manuals page.

taskset -ac 0 time tar -czf test.tar.gz r1.mp4

5.41user 0.24system 0:05.66elapsed 99%CPU (0avgtext+0avgdata 3268maxresident)k
0inputs+288520outputs (0major+418minor)pagefaults 0swaps

We now have a real run time of 5.66 seconds, a 28 percent increase in time compared to an optimal run.
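The percentage is the measured time divided by the idle-system baseline, minus one. A short sketch of that calculation using awk (the function name slowdown_pct is hypothetical):

```shell
# slowdown_pct MEASURED BASELINE: print how much slower MEASURED is, as a rounded percentage.
slowdown_pct() {
    awk -v m="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (m / b - 1) * 100 }'
}

slowdown_pct 5.66 4.428    # prints 28
```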

We are not near optimal performance. To remedy this, we will allow cores 0-3 to use more cache by editing our CoS, my_special_cos. Let us change our L3 attribute to:

cd /sys/fs/resctrl/my_special_cos
echo "L3:0=3;1=fffff;2=fffff;3=fffff" > schemata

This time we set the cache ways bitmask for this CoS to 0x3 on socket 0. Thus, CPUs 0-3 can now consume up to ~6 MB (2 cache ways * ~3 MB) of LLC.
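The relationship between a number of cache ways and its bitmask can be sketched as follows. The function name ways_mask is hypothetical; the optional second argument shifts the run of bits to a different position in the cache.

```shell
# ways_mask WAYS [SHIFT]: print a hex mask of WAYS contiguous bits, starting at bit SHIFT.
ways_mask() (
    ways=$1
    shift_by=${2:-0}
    printf '%x\n' $(( ((1 << ways) - 1) << shift_by ))
)

ways_mask 1      # prints 1      (one way)
ways_mask 2      # prints 3      (two ways)
ways_mask 1 3    # prints 8      (one way, placed at bit 3)
ways_mask 20     # prints fffff  (all 20 ways on this system)
```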

Rerun the benchmark.

taskset -ac 0 time tar -czf test.tar.gz r1.mp4

4.12user 0.25system 0:04.37elapsed 99%CPU (0avgtext+0avgdata 3272maxresident)k
0inputs+288520outputs (0major+420minor)pagefaults 0swaps

After increasing our CoS to ~6 MB, the benchmark returns to optimal performance: the real run time of 4.37 seconds is slightly faster than our idle-system baseline of 4.428 seconds.

Benchmark with Noisy Neighbor

This is where things get interesting. We will add noisy neighbors on cores 1-3, running each one in its own terminal, and then rerun our benchmark.

taskset -ac 1 stress-ng --cpu 1 --cache 16 --cache-level 3
taskset -ac 2 stress-ng --cpu 1 --cache 16 --cache-level 3
taskset -ac 3 stress-ng --cpu 1 --cache 16 --cache-level 3
taskset -ac 0 time tar -czf test.tar.gz r1.mp4

4.94user 0.38system 0:05.33elapsed 99%CPU (0avgtext+0avgdata 3200maxresident)k
0inputs+288520outputs (0major+419minor)pagefaults 0swaps

We see that our mission critical application's performance has decreased. With noisy neighbors present, we observe a real run time of 5.33 seconds, a 20 percent increase in time compared to an optimal run.

We can remedy this by creating a CoS that we will deem low priority and place our noisy neighbors in that resource group.

To do this, do the following:

cd /sys/fs/resctrl
mkdir low_priority
cd low_priority
echo "e" > cpus
echo "L3:0=8;1=fffff;2=fffff;3=fffff" > schemata

cd /sys/fs/resctrl/my_special_cos
echo "1" > cpus

We placed the noisy neighbors in the low_priority CoS, which contains CPUs 1-3 and is limited to 1 cache way (~3 MB) by an L3 bitmask of 0x8 (0b1000 in binary). Then, we modified the my_special_cos CoS to contain only CPU 0. It keeps its maximum of 2 cache ways (~6 MB), with an L3 bitmask of 0x3 (0b0011 in binary).

Warning: Do not let the L3 bitmasks of different CoSs overlap when partitioning the LLC. For example, a CoS with an L3 bitmask of 0x3 will overlap a CoS with an L3 bitmask of 0x1.
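Two masks overlap when their bitwise AND is nonzero. A quick way to check a pair before writing them to the schemata files might look like this (the function name masks_overlap is hypothetical):

```shell
# masks_overlap MASK_A MASK_B: succeed when the two hex masks share at least one bit.
masks_overlap() {
    [ $(( 0x$1 & 0x$2 )) -ne 0 ]
}

masks_overlap 3 1 && echo "0x3 and 0x1 overlap"
masks_overlap 3 8 || echo "0x3 and 0x8 do not overlap"
```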

Finally, let us rerun our noisy neighbors in the low_priority CoS. Then we’ll benchmark our mission critical app in the my_special_cos CoS, and see if we get back our optimal performance.

Low Priority CoS

taskset -ac 1 stress-ng --cpu 1 --cache 16 --cache-level 3
taskset -ac 2 stress-ng --cpu 1 --cache 16 --cache-level 3
taskset -ac 3 stress-ng --cpu 1 --cache 16 --cache-level 3

Note: Run the noisy neighbors in their respective terminals.
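If opening three terminals is inconvenient, the same commands could be started from one shell as background jobs. The sketch below only prints the command lines when run as-is; the background launch is shown in comments. The helper name noisy_neighbor_cmd is hypothetical, and the stress-ng options are the ones used above.

```shell
# noisy_neighbor_cmd CPU: print the stress-ng command line used above for one core.
noisy_neighbor_cmd() {
    echo "taskset -ac $1 stress-ng --cpu 1 --cache 16 --cache-level 3"
}

# Print the three commands (a dry run):
for cpu in 1 2 3; do
    noisy_neighbor_cmd "$cpu"
done

# To start them in the background and stop them after the benchmark:
# pids=""
# for cpu in 1 2 3; do
#     $(noisy_neighbor_cmd "$cpu") &
#     pids="$pids $!"
# done
# ... run the benchmark ...
# kill $pids
```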

My Special CoS

taskset -ac 0 time tar -czf test.tar.gz r1.mp4

4.54user 0.26system 0:04.81elapsed 99%CPU (0avgtext+0avgdata 3276maxresident)k
0inputs+288520outputs (0major+414minor)pagefaults 0swaps

Just like that, our mission critical app is back to near-optimal performance. We are still about 9 percent slower than an optimal run, with a real run time of 4.81 seconds versus the optimal run time of 4.428 seconds. By using Intel RDT and CAT, we limited the effect of our noisy neighbors on our mission critical app by placing them in the low_priority CoS.

Summary

In this article we set up our system to take advantage of Intel RDT and CAT. Not only can you control the number of CPUs, memory, and storage assigned to an app/container/VM, but with Intel RDT you have granular control of how much LLC and memory bandwidth a workload can consume. It is now up to you to determine how you will define your CoSs, and which apps/containers/VMs will be assigned to them, depending on their importance.

 

Notices

No computer system can be absolutely secure.

Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer.

Performance results are based on testing as of March 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Per node using 4-socket Intel® Xeon® E7-8890 processor version 4 @ 2.20GHz. Total memory 512GBs 64 slots/ 8GB/ 1600 MT/s DDR4 DIMM. Hyper Threading: Disabled, Turbo: Enabled. Storage INTEL SSDSC2BA40 400GBs SATA 3.0 6Gb/s. Network Adapters: 2x Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01). stress-ng, version 0.09.25, tar (GNU tar) 1.29. OS: Ubuntu 18.04.1 LTS, Kernel: Linux 4.15.0-45-generic

§ For more information go to the Intel Product Performance page.
