Intel recently released the 4th Generation Intel® Core™ processors, which have Intel® Transactional Synchronization Extensions (Intel® TSX) enabled. Intel TSX can improve the performance of applications that use lock-based synchronization to protect data structure updates. This feature allows multiple non-conflicting lock-protected changes to data to occur in parallel.
I’ve found over the years that one of the best ways to improve parallel performance is to get rid of coarse-grained locks, replacing them with multiple fine-grained locks that protect parts of the structure being modified. While this technique can yield better performance, it is difficult to get right, and it increases the number of locks that must be acquired to perform an operation. There may also be subtle performance problems when multiple locks occupy the same cache line: even though the protected changes themselves do not overlap, locking adjacent mutexes in the same cache line results in false sharing.
Transactional memory addresses the problem a different way, by allowing multiple threads to access or update the protected data, and guaranteeing the updates appear atomically to all other threads. This gives some of the benefits of fine-grained locking without having to make changes to the code beyond replacing the locks.
The two interfaces for Intel TSX are Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM).
- HLE is compatible with older Intel processors. Prefixes (XACQUIRE and XRELEASE) on the lock/unlock pair mark the lock as eligible for elision. On processors that support Intel TSX, the lock is not modified in memory, and the processor continues execution until the unlock is reached. If no conflicts are detected, the changes are atomically committed to memory. Code compiled to use HLE will also execute on older processors; the only difference is that the lock is actually taken rather than elided.
- RTM is not compatible with older processors. The new instructions are XBEGIN, XEND, XTEST, and XABORT. A transactional region is bracketed by an XBEGIN/XEND pair. If a conflict is detected, the results of the computation are discarded, the abort reason is returned in EAX, and control passes to an abort handler. The status flags in EAX indicate whether the transaction might succeed if retried, and carry the abort code if an XABORT instruction ended the transaction.
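To make the RTM side concrete, here is a minimal sketch of the canonical pattern using the `_xbegin`/`_xend` intrinsics from `<immintrin.h>`. The helper names and the fallback logic are mine, not part of any Intel or TBB API; the `target("rtm")` attribute is just a GCC/Clang convenience so the sketch compiles without `-mrtm` (the transactional function still must only be run on a processor with Intel TSX):

```cpp
#include <cassert>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>   // _xbegin, _xend, _XABORT_* macros
#else
// Abort-status bit layout, per the instruction set reference, so the
// decoding helpers still compile on non-x86 builds.
#define _XABORT_EXPLICIT (1u << 0)
#define _XABORT_RETRY    (1u << 1)
#define _XABORT_CODE(x)  (((x) >> 24) & 0xFFu)
#endif

// EAX status bit 1 set means the transaction may succeed if retried.
inline bool may_succeed_on_retry(unsigned status) {
    return (status & _XABORT_RETRY) != 0;
}

// Bits 24-31 carry the XABORT operand when the abort was explicit.
inline int explicit_abort_code(unsigned status) {
    return (status & _XABORT_EXPLICIT) ? (int)_XABORT_CODE(status) : -1;
}

#if defined(__x86_64__) || defined(__i386__)
__attribute__((target("rtm")))
bool try_transactional_increment(int* data) {
    unsigned status = _xbegin();     // XBEGIN: enter transactional region
    if (status == _XBEGIN_STARTED) {
        ++*data;                     // speculative update
        _xend();                     // XEND: commit atomically
        return true;
    }
    // Abort handler: status holds the reason from EAX; the caller
    // would consult may_succeed_on_retry() or fall back to a lock.
    return false;
}
#endif
```

A caller would loop on `try_transactional_increment` while `may_succeed_on_retry` says a retry is worthwhile, then take a conventional lock; RTM, unlike HLE, always requires such an explicit fallback path.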
For more details on this technology please see the Intel Architecture Instruction Set Extensions Programming Reference.
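The HLE bullet above can likewise be sketched as an elided test-and-set spin lock using GCC's `__atomic` builtins. The `__ATOMIC_HLE_ACQUIRE`/`__ATOMIC_HLE_RELEASE` memory-model bits are provided by GCC and Clang on x86 when HLE code generation is enabled; the guard below makes the sketch degrade to an ordinary spin lock elsewhere, exactly as HLE itself degrades on older processors. The class and function names are mine:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Use the HLE memory-model bits when the compiler provides them
// (x86 with -mhle); otherwise fall back to a plain spin lock.
#if defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#  define HLE_ACQ __ATOMIC_HLE_ACQUIRE
#  define HLE_REL __ATOMIC_HLE_RELEASE
#else
#  define HLE_ACQ 0
#  define HLE_REL 0
#endif

struct hle_spinlock {
    int word = 0;
    void lock() {
        // XACQUIRE-prefixed exchange: on TSX hardware the write to
        // `word` is elided and a transaction begins instead.
        while (__atomic_exchange_n(&word, 1, __ATOMIC_ACQUIRE | HLE_ACQ))
            while (__atomic_load_n(&word, __ATOMIC_RELAXED))
                ;  // spin until the lock looks free, then retry
    }
    void unlock() {
        // XRELEASE-prefixed store: commits the transaction.
        __atomic_store_n(&word, 0, __ATOMIC_RELEASE | HLE_REL);
    }
};

hle_spinlock lk;
long shared_total = 0;

// Hammer the lock from several threads; the final total shows that
// mutual exclusion holds whether or not elision happened.
long hammer(int nthreads, int iters) {
    shared_total = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([iters] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++shared_total;
                lk.unlock();
            }
        });
    for (auto& th : ts) th.join();
    return shared_total;
}
```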
The Intel® Threading Building Blocks (Intel® TBB) 4.2 release adds a new mutex class, speculative_spin_mutex, implemented using the HLE interface. This lock is very simple to use: when a speculative_spin_mutex is locked, with HLE the processor starts a transaction. The visible state of the lock is not modified, and any memory locations that are modified will not appear changed to other processors until the mutex is unlocked. When the unlock happens, all the changes made under the lock are committed atomically, i.e. they all appear in memory simultaneously as far as other processors can tell. If a conflict occurs, either because the code under the lock sees a change to data it has read or written, or because the lock is taken non-speculatively by another thread, the changes are discarded and execution starts over at the point where the lock was taken. In a simple use of HLE the hardware would then take the lock non-speculatively, but our story is a little different.
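Using the new class looks like using any other TBB mutex. A sketch (the function and variable names are mine, and a std::mutex stand-in keeps it buildable where TBB is not installed):

```cpp
#include <cassert>
#include <thread>
#include <vector>

#if __has_include(<tbb/spin_mutex.h>)
#include <tbb/spin_mutex.h>   // declares tbb::speculative_spin_mutex in TBB 4.2+
using counter_mutex = tbb::speculative_spin_mutex;
using counter_lock  = counter_mutex::scoped_lock;
#else
// Stand-in so the sketch still builds without TBB; it has the same
// construct-to-lock / destruct-to-unlock shape as scoped_lock.
#include <mutex>
using counter_mutex = std::mutex;
using counter_lock  = std::lock_guard<std::mutex>;
#endif

counter_mutex table_mutex;
long table[16] = {0};   // assumes nthreads <= 16

// Each thread updates its own slot under the one shared mutex. The
// updates don't conflict logically, so on TSX hardware they can
// commit in parallel under the elided lock. (Nearby slots may still
// share a cache line and abort each other; real code would pad them.)
long run(int nthreads, int iters) {
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([t, iters] {
            for (int i = 0; i < iters; ++i) {
                counter_lock lock(table_mutex);
                ++table[t];
            }
        });
    for (auto& th : ts) th.join();
    long sum = 0;
    for (long v : table) sum += v;
    return sum;
}
```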
The speculative_spin_mutex acquires its lock with a fetch-and-store. If a transaction is aborted, the thread goes back to the lock-acquisition step. It attempts to acquire the lock non-speculatively, and if the acquisition succeeds it continues executing non-speculatively, i.e. all other threads see the lock as unavailable. This is the standard hardware fallback. However, the thread may instead find that another thread already holds the lock. In that case, the Intel® TBB implementation spin-waits on the lock, and when it becomes available the thread re-attempts speculative execution.
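The control flow just described can be sketched as follows. This is not the actual TBB source: try_speculative_acquire is a placeholder for the hardware's elision attempt, and making it always fail models a machine without Intel TSX, so the fallback path is the one exercised:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> lock_word{0};

// Placeholder for the hardware's speculative path (hypothetical).
bool try_speculative_acquire() { return false; }

void acquire() {
    for (;;) {
        if (try_speculative_acquire())
            return;                       // running transactionally
        // Abort path: take the lock for real with a fetch-and-store.
        if (lock_word.exchange(1, std::memory_order_acquire) == 0)
            return;                       // acquired non-speculatively
        // Another thread holds the lock: spin-wait until it is free,
        // then loop back and re-attempt speculation instead of
        // immediately fighting over the lock word.
        while (lock_word.load(std::memory_order_relaxed) != 0)
            ;
    }
}

void release() { lock_word.store(0, std::memory_order_release); }

// Exercise the loop: mutual exclusion must hold on the fallback path.
long counter_demo(int nthreads, int iters) {
    static long counter;
    counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([iters] {
            for (int i = 0; i < iters; ++i) {
                acquire();
                ++counter;
                release();
            }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```

The design point is the last comment: rather than queueing up on a contended lock word, a thread that finds the lock held waits for it to clear and then retries elision, keeping as much work speculative as possible.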
The HLE interface does not require an explicit fallback path; the code protected by the lock can be executed either speculatively or under a lock that is actually taken, with the same results. This imposes the limitation that the state of the lock after it is released must be identical to its state before it was taken. This holds for a spin_mutex. Other kinds of locks, such as ticket locks, do not return to their pre-locking state, so they are difficult to implement with HLE.
One concern for speculative mutexes is false sharing. Concurrent writes to the cache line occupied by a speculative lock will abort its transactions. This is true even if the write is to some other variable that shares the cache line with the lock.
To avoid this problem, speculative_spin_mutex occupies two cache lines, because memory allocators and stack maintenance do not guarantee where in a cache line an object falls. The spin_mutex used by speculative_spin_mutex is guaranteed to fall at the start of a cache line, and no other object will occupy that cache line.
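The placement trick can be sketched like this, assuming a 64-byte cache line (the names are mine; the real implementation is in the TBB headers). The object reserves two cache lines of raw bytes, and the lock word is placed at the first cache-line boundary inside them, so wherever the allocator puts the object, no other object shares the lock's line:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <new>

constexpr std::size_t cache_line = 64;   // assumed line size

struct padded_lock_sketch {
    char storage[2 * cache_line];

    // Round the storage address up to the next cache-line boundary;
    // the lock word lives there, and the rest of that line is still
    // inside `storage`, so nothing else can occupy it.
    std::atomic<int>* lock_word() {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(storage);
        std::uintptr_t aligned =
            (p + cache_line - 1) & ~static_cast<std::uintptr_t>(cache_line - 1);
        return reinterpret_cast<std::atomic<int>*>(aligned);
    }

    padded_lock_sketch() { new (lock_word()) std::atomic<int>(0); }
};
```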
For processor architectures that do not support HLE, the speculative_spin_mutex is implemented as a tbb::spin_mutex with the cache-line padding just described.
As always with a new feature, your mileage may vary. Careful performance testing will tell whether speculative_spin_mutex is right for you.
The new mutex class with lock elision begins the story of transactional memory support in Intel TBB. Future updates will add more; for example, we soon plan to release the RTM-based speculative_spin_rw_mutex as a Community Preview Feature. I’ll talk about speculative_spin_rw_mutex in another post once it is available.
For help optimizing your program with Intel TSX, you should consult the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Chapter 12.