Intel® TSX exposes a speculative execution mode to the programmer to improve locking performance. Tuning speculation relies heavily on a PMU profiler. This document describes TSX profiling using the Linux perf (or “perf events”) profiler, which comes integrated with newer Linux systems. More details on TSX are available at ACM queue, Wikipedia, at LWN for the Linux glibc implementation, and in the specification (chapter 8).
Preliminaries
The techniques described in this guide need an updated perf version with TSX and Haswell support. Perf is integrated into the Linux kernel, so this requires updating to a new kernel. The 3.13 kernel includes all the features described here. Please get it from /pub/linux/kernel/v3.x/. At the time of this writing an RC pre-release is available. The earlier 3.11 and 3.12 kernels contain a subset. For best results please use 3.13 or later.
After downloading the kernel tree it should be installed and booted. In addition the perf binary needs to be built (in tools/perf) and installed. Some perf features may require additional library packages to be installed. The build procedure suggests the package names.
Some Linux distributions may also provide updated kernel and perf packages. For example, Fedora 20 with the latest updates has full support.
Basic Cycle Sampling to Understand the Program
The program first has to be enabled for TSX. Once it is running with TSX it can be profiled using perf.
Normal cycle sampling aborts transactions, so it may impact TSX performance. Some initial cycle sampling (perf record; perf report) is still a good idea to get a basic overview of the expensive parts of the workload, but it should be understood that it affects TSX performance.
perf record -e cycles:pp -g program
perf report
Use the interactive browser or write the report to a file (perf report > file).
When sampling TSX it is important to use the :pp qualifier for cycles to enable PEBS, otherwise the sampling instruction pointer will always be in the abort handler (or near the lock instruction for HLE) when a sample hits a transaction.
Measuring Basic Transactional Success with perf stat -T
The first step after the program is running with TSX is to use perf stat -T to measure the basic transactional success.
perf stat -T program
or, if the program is long running and in a steady state, run in parallel from another terminal:
perf stat -T -a sleep 1
Using -a may require being root or setting /proc/sys/kernel/perf_event_paranoid to -1 first. With -a the complete machine is measured. Alternatively, it is also possible to attach to specific pids with -p.
The -T option reports the number of transactional cycles. When the number is low the program may not spend much time in locks or the locks are not enabled for TSX lock elision.
-T also reports the aborted cycles, that is cycles spent in doomed transactions that did not commit. The goal of TSX tuning is normally to make that number as small as possible, that is to make the commit rate of transactions as large as possible.
These numbers should be only trusted for relatively long running processes. At startup there are typically various transient abort causes (for example faulting in the working set) that will disappear later. If the startup phase of the application is very expensive it is preferable to use -a or -p in parallel to only measure when the program is past the start-up phase.
Newer versions of perf also have a -I option to enable interval sampling. For programs with very different phases it can be useful to combine this with -T to get separate measurements for the different phases.
In addition, -T reports the number of transactions, separated for HLE (el) and RTM (tx), and their average length. In general it is preferable if transactions are not too short.
The overhead of perf stat -T counting is normally low; it should not affect the run time of the program significantly. perf stat -T counts both kernel and user transactions. When an RTM-enabled kernel is used but only the user program should be measured, it is possible to specify the events used by -T manually using -e to perf stat, with an additional :u qualifier to count them only for ring 3. The computations for the various ratios will still be done.
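When the events are counted manually, the ratios that -T prints can be recomputed from the raw counts. The following is a minimal Python sketch, assuming the counts for cycles, cycles-t and cycles-ct and the number of started transactions have already been read from perf stat output. The function name `tsx_ratios` and its parameters are hypothetical, and the formulas follow the event descriptions in this guide rather than any official definition.

```python
def tsx_ratios(cycles, cycles_t, cycles_ct, starts):
    """Derive perf stat -T style ratios from raw counter values.

    cycles    -- total cycles
    cycles_t  -- cycles spent inside transactions (cycles-t)
    cycles_ct -- transactional cycles that committed (cycles-ct)
    starts    -- started transactions (tx-start plus el-start); assumption:
                 HLE and RTM are lumped together here for simplicity
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    # Cycles spent in doomed transactions that did not commit.
    aborted = cycles_t - cycles_ct
    return {
        "transactional_pct": 100.0 * cycles_t / cycles,
        "aborted_pct": 100.0 * aborted / cycles,
        # Average transaction length in cycles; None if nothing started.
        "cycles_per_transaction": cycles_t / starts if starts else None,
    }
```

A low transactional percentage points at locks that are not elided, and a high aborted percentage points at aborts worth sampling.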
Profiling for Missing Locks
When the number of transactional cycles reported by -T is low, not all locks may be elided. To find all locks in a program and make sure they are elided, it's useful to count the MEM_UOPS_RETIRED.LOCK_LOADS event in comparison with RTM_RETIRED.START or HLE_RETIRED.START.
perf stat -e '{r21d0,tx-start,el-start,cycles}' program
When the number of lock loads is significantly higher than the number of started transactions, it is possible that not all locks are elided. In this case, sample for the common locks and make sure they are all elided:
perf record -g -e r21d0:p …
perf report
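The comparison between lock loads and started transactions can be expressed as a simple ratio. This is a rough Python sketch, assuming the three counts were read from the perf stat run above. The function names and the default threshold of 2.0 are hypothetical and purely illustrative; what counts as "significantly higher" depends on the workload.

```python
def lock_to_start_ratio(lock_loads, tx_start, el_start):
    """Ratio of atomic lock loads (r21d0) to started transactions."""
    started = tx_start + el_start
    if started == 0:
        # No transaction started at all: any lock load is unelided.
        return float("inf") if lock_loads else 0.0
    return lock_loads / started


def maybe_missing_elision(lock_loads, tx_start, el_start, threshold=2.0):
    """Heuristic only: True when lock loads clearly outnumber starts.

    The threshold is an arbitrary assumption, not a documented value.
    """
    return lock_to_start_ratio(lock_loads, tx_start, el_start) > threshold
```

A True result is only a hint to go sampling with r21d0:p, not proof that a lock is missing elision.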
Profiling Abort Causes with perf record Sampling
When the number of aborted cycles reported by stat -T is high, the location of aborts should be profiled using sampling. Abort sampling does not affect the transaction commit rate, because the transactions have already aborted when sampled (it still adds some overhead, however).
HLE and RTM use different sampling events. perf stat -T reports whether HLE or RTM are used (el-starts or tx-starts). When el-starts is high, el-aborts is the abort event; when tx-starts is high, tx-aborts is the abort event. It is also possible to sample for both at the same time, but it is recommended not to specify an event that is not needed. The example below samples for RTM aborts.
perf record -e cpu/tx-abort/pp program # measure program
perf record -e cpu/tx-abort/pp -a sleep 1 # measure whole system for 1 second
perf report # display samples
PEBS monitoring needs to be enabled explicitly with pp (otherwise only the abort handler is sampled). This is required for correct abort locations. When the event is specified in another way, precise needs to be 2 (either with :pp or with precise=2). The PEBS record contains additional information that needs to be explicitly enabled. The important information for aborts is the weight (--weight), the transaction flags (--transaction) and often the call chain (-g).
The program should have debugging information and a symbol table available. This can be done either by compiling with -g and having the object files available or, for a distribution program, by installing the debuginfo package. This allows displaying the symbols and browsing sample results for individual lines. When the source code is available on the system, perf is also able to report samples in a source listing.
By default the assembler code is displayed in AT&T syntax. Intel assembler syntax can be enabled for “perf report” with -M intel. The overview displays the samples by symbol. When srcline is added to the sort argument below, samples can also be reported by source line.
The additional information needs to be explicitly specified for perf report using the --sort option, so that it is displayed:
perf record -g --transaction --weight -e cpu/tx-abort/pp program
perf report --sort symbol,transaction,weight
The transaction weight is the cost in cycles that the transaction spent before aborting. Aborts with a high weight are more expensive.
Note that the current perf version does not sort on weight, that is the top entries are not necessarily the most expensive.
There is a global_weight which is the global sum of the weight and local_weight which is the average weight of the sample. The default weight is global.
perf currently splits samples by weight, which may lead to a lot of entries. After weight has been examined it is sometimes useful to remove it from the sort keys to collapse the output.
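Because the report is not ordered by cost, it can help to re-rank the samples outside of perf. The sketch below assumes the abort samples have already been extracted into (symbol, weight) pairs by some other means; that extraction step and the function name `rank_by_weight` are hypothetical. It reports a total and a mean weight per symbol, in the spirit of the global and local weights described above.

```python
from collections import defaultdict


def rank_by_weight(samples):
    """Rank symbols by total abort weight.

    samples -- iterable of (symbol, weight) pairs, one per abort sample.
    Returns a list of (symbol, total_weight, mean_weight) tuples,
    most expensive symbol first.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for symbol, weight in samples:
        totals[symbol] += weight
        counts[symbol] += 1
    rows = [(s, totals[s], totals[s] / counts[s]) for s in totals]
    # Most expensive first, which the report itself does not guarantee.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Sorting by the total finds the symbols that cost the most overall, while the mean highlights individually expensive aborts.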
The transaction flags describe the type of abort. The tuning strategy varies depending on the type.
Name | Description |
---|---|
EL | Abort in a HLE transaction. |
TX | Abort in a RTM transaction. |
SYNC | Synchronous abort. The abort was caused by the specified instruction, for example a system call or other unfriendly instruction (see 14.3.8.1 in the specification). |
ASYNC | Asynchronous abort. The abort was caused by another thread and the instruction pointed to just happened to be running at that time. |
RETRY | The transaction is retry-able. |
CON | Conflict: The abort was caused by a write conflict, typically caused by another thread. The location can be random in the transaction, but is often near a conflict causing pattern. |
CAP-WRITE | Capacity: the abort was caused by the local transaction exceeding the maximum write buffering capacity of the transaction. |
CAP-READ | Capacity: the abort was caused by the local transaction exceeding the maximum read buffer capacity of a transaction. Rarely this can be also caused by other abort types. |
:NUMBER | Abort code. The program explicitly aborted with the XABORT instruction with code NUMBER. A common code is 0xff for lock busy in the lock library, that is the lock being not elided for a long time. |
Call chains are often needed to understand the context in the program. The record call chain option (-g) requires compiling the program and the used libraries with -fno-omit-frame-pointer, or using -g dwarf in perf (spelled --call-graph dwarf in newer perf versions) when dwarf2 unwind information is available. Using dwarf2 is slower than using frame pointers. It also requires compiling the perf tool with the unwind library.
The call chain of an abort is only recorded after the transaction has aborted, which is typically in the lock library. The sample hit is the actual abort point. So the call chain is discontinuous: it starts with the abort point and continues with the call graph of the lock library lock function. This is important to keep in mind when looking at the call graphs, as it differs from samples not hitting transactions.
For asynchronous (and to a lesser degree capacity) aborts it is often more useful to look at the whole critical section than at the specific abort point. The abort is triggered by another CPU and appears randomly in the transaction, so the instruction reported in the sample may have nothing to do with the abort cause. The memory accesses of the whole transaction need to be examined to minimize read-write or write-write memory sharing. This can be done by sampling for the return IP of the abort, using either tx-aborts-count or el-aborts-count with callgraph. For non-inlined locks, typically the first caller outside the locking library defines the critical section. The non-PEBS *-count events do not support weight and transaction flags. If those are needed with the return IP, the non-eventingrip PEBS version of these abort events can be used: r4c8:p (HLE abort) or r4c9:p (RTM abort). They currently only exist in raw form, but this may change in future perf versions.
Last Branch Records (LBRs) to Look Inside the Transaction
To see the control flow inside a transaction that led to an abort, LBRs can be used. When enabled, the CPU stores the last 16 branches in the LBR registers. Perf record can sample them for aborts with the -b option. The default display uses basic block histograms and collapses all paths, so it is often not very useful for analyzing individual abort samples. A workaround is to use perf report -D, extract the branches manually from the samples, and translate the addresses with addr2line. Future perf versions will hopefully improve this cumbersome procedure.
Update: an experimental patchkit for perf report to enable this with --branch--call-graph is available.
Additional TSX Events and Qualifiers
Perf has some additional builtin TSX events. The counting events are separate for HLE and RTM. All the available builtin events can be listed with “perf list”. Except for the abort events, these events are not precise. When they are used with perf report and hit a transaction, they will report an instruction after the abort, which is often not useful.
RTM event | HLE event | Description |
---|---|---|
cpu/tx-abort/pp | cpu/el-abort/pp | Precise abort event for sampling. Use this to profile for abort locations. |
tx-abort | el-abort | Abort event for counting (not using PEBS). This event should be used with perf stat instead of the precise version. When sampled it will sample the point of the abort handler or original lock for HLE, and not the abort point. |
tx-capacity | el-capacity | Transactions that exceeded the buffering capacity. Not precise, use for counting. |
tx-commit | el-commit | Successful transaction commits. |
tx-start | el-start | Transaction starts. Only use for counting. |
tx-conflict | el-conflict | Abort due to a memory conflict. Only use for counting. |
cycles-t | (same) | Transaction cycles. Only use for counting. |
cycles-ct | (same) | Transactional cycles minus cycles of aborted transactions (committed cycles). Only use for counting. |
cpu/instructions,in_tx=1/ | (same) | Transactional instructions. Only use for counting. |
cpu/instructions,in_tx=1,in_tx_cp=1/ | (same) | Transactional instructions minus aborted transactions (committed instructions). Only use for counting. |
For analyzing capacity and conflict aborts it is usually preferable to sample aborts with --transaction and examine the transaction abort types in the sample browser. Most of these events are more useful for counting (perf stat) which does not cause additional aborts.
Specifying Raw Perf Events
Perf has only a limited builtin event list. Additional events can be specified in a raw hex form:
rUUEE
where EE is the hex event code and UU is the unit mask. See the SDM for a full list of valid events on Haswell. Additional qualifiers can be added to the mask, for example 0x100000000 to count the event only inside a transaction and 0x200000000 to set a checkpointed event (see section 18.10.5 in the SDM for more details). The raw mask is directly mapped to the control register of the performance counter.
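The encoding can be illustrated with a few lines of Python. This is a minimal sketch of the rUUEE layout and the two transaction qualifier bits named above; the helper name `raw_event` and the constant names are hypothetical, and the event and unit mask values still have to be looked up in the SDM.

```python
IN_TX = 0x100000000     # count only inside a transaction
IN_TX_CP = 0x200000000  # checkpointed event


def raw_event(event, umask, in_tx=False, in_tx_cp=False):
    """Build a perf raw event string (rUUEE) from event code and unit mask."""
    if not 0 <= event <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event and umask must each fit in one byte")
    # Unit mask occupies the byte above the event code.
    value = (umask << 8) | event
    if in_tx:
        value |= IN_TX
    if in_tx_cp:
        value |= IN_TX_CP
    return "r%x" % value
```

For example, `raw_event(0xd0, 0x21)` yields r21d0, the MEM_UOPS_RETIRED.LOCK_LOADS event used earlier in this guide, and `raw_event(0xc9, 0x04)` yields r4c9.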
Some alternative frontends for perf provide a full symbolic event list (for example ocperf in pmu-tools), which avoids this complicated procedure. ocperf can also generate raw event strings for later use.
Perf supports multiplexing events to run more than 4 (or 8 without HyperThreading) counters in parallel. When events are used in equations that depend on each other, it is important to run all the events in an equation at the same time. This can be done by specifying event groups with {}. It is valid to run multiple groups at a time; they will then be multiplexed.
perf stat -e '{cycles,cycles-t,cycles-ct}' program
perf stat -e '{r154,r254,r21d0},{r1c9,r1c8,r4c8,r4c9}' -p PID
Newer perf versions also have an alternative, more verbose syntax:
perf stat -e 'cpu/event=0x12,umask=0x34,in_tx=1,in_tx_cp=1/' -a sleep 1
When the events are post-processed by a program, it is useful to enable CSV mode for perf stat (-x,).
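A post-processing script might read the CSV output along the following lines. This is only a sketch: the exact column layout of perf stat -x, output differs between perf versions, so it assumes that the count is the first field and that the event name appears verbatim in one of the later fields. The function name `parse_perf_csv` is hypothetical, and output produced with -I (which adds a leading timestamp) is not handled.

```python
def parse_perf_csv(text, events, sep=","):
    """Extract counts for the named events from perf stat -x output.

    text   -- captured perf stat output (perf stat writes it to stderr)
    events -- event names to look for, exactly as passed to -e
    Returns {event: count}; count is None when perf printed a
    placeholder such as <not counted> instead of a number.
    """
    wanted = set(events)
    counts = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) < 2:
            continue
        # Assumption: the event name is one of the fields after the count.
        name = next((f for f in fields[1:] if f in wanted), None)
        if name is None:
            continue
        try:
            counts[name] = int(fields[0])
        except ValueError:
            counts[name] = None
    return counts
```

The resulting dictionary can then be fed into whatever ratio computation the script needs.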
Additional raw TSX Events Diagnosing Various Abort Conditions
Most of these events are speculative and may over-count. They are not precise and will only report an IP after the abort. They can be used to count specific abort reasons. In most cases it is better to analyze aborts using tx-abort/el-abort PEBS sampling, as that can be done by source line. The HLE events may be useful to confirm specific HLE commit problems when debugging a new HLE-enabled lock library.
SDM Name | Perf raw event | Description |
---|---|---|
tx_exec.misc4 | r85d | HLE executed inside RTM and other rare abort causes |
tx_exec.misc3 | r45d | Transaction nesting limit overflow and other rare abort causes |
tx_mem.abort_hle_elision_buffer_full | r4054 | Too deep nesting for HLE |
tx_mem.abort_hle_elision_buffer_unsupported_alignment | r2054 | Read from lock value inside HLE region using unsupported alignment. |
tx_mem.abort_hle_elision_buffer_mismatch | r1054 | XRELEASE address or value did not match XACQUIRE |
tx_mem.abort_hle_elision_buffer_not_empty | r854 | HLE XRELEASE without matching XACQUIRE |
tx_mem.abort_hle_store_to_elided_lock | r454 | Store to lock inside HLE region |
The following additional perf events are also useful. Using the software trace points may require running as root or making /sys/kernel/debug/tracing world readable first. The kernel also needs to support system call tracing. For more available events see “perf list”.
Name | Perf event | Description |
---|---|---|
syscalls:sys_enter_futex | syscalls:sys_enter_futex | Count futex syscalls, which give a rough estimate of how often a sleeping futex lock (e.g. a pthread mutex) blocked. This also counts lock wake-ups on unlock when another thread is waiting. For adaptive pthread mutexes it will under-count contention. |
mem_uops_retired.lock_loads | r21d0 | Count atomic operations. Useful to find locks that are not elided. Can be also used as a PEBS event (:p) for sampling. Should not be used with multiplexing. |
Updated 09/13 for latest perf code
Updated 09/25 to fix some mistakes and clarify some descriptions.
Updated 11/07 to report merge status
Updated 12/15 to remove mention of git tree, just point to released kernels
Updated 01/11 to fix a broken reference, point to rawhide and add pointer to LBR callgraph patch.
Updated 04/29 to fix another broken reference, add a pointer to ACM queue article and update FC20 reference.