To use HLE/RTM to improve lock scalability, the lock library needs to be enabled for TSX. If you already have an enabled lock library, such as glibc on Linux, you can just use normal locking with that library. If the lock library does not support it, or you have your own lock, it needs to be enabled as discussed in Roman's blog post. Enabling a lock library requires using the RTM instructions or the HLE prefixes. Newer compilers, such as gcc 4.8, support intrinsics for HLE and RTM.
RTM
    #include <immintrin.h>

    if (_xbegin() == _XBEGIN_STARTED) {
        /* transaction */
        _xend();    /* commit the transaction */
    } else {
        /* fallback path -- take lock */
    }
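A slightly fuller sketch of the RTM pattern follows, assuming gcc or clang on x86. The names `fallback_lock`, `counter`, `cpu_has_rtm()` and `increment()` are hypothetical and exist only for illustration. The transaction reads the fallback lock so that a thread holding the real lock forces speculating threads to abort, and RTM is only attempted when CPUID leaf 7 (EBX bit 11) reports it. This is not how any particular lock library does it, just a minimal illustration.

```c
#include <immintrin.h>   /* _xbegin(), _xend(), _xabort(), _mm_pause() */
#include <cpuid.h>       /* __get_cpuid_max(), __cpuid_count() */

static int fallback_lock;   /* hypothetical spinlock: 0 = free, 1 = held */
static long counter;        /* hypothetical data protected by the lock */

/* RTM is reported in CPUID leaf 7, subleaf 0, EBX bit 11. */
static int cpu_has_rtm(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_max(0, 0) < 7)
		return 0;
	__cpuid_count(7, 0, eax, ebx, ecx, edx);
	return (ebx >> 11) & 1;
}

/* Fallback path: take the real lock. */
static void increment_locked(void)
{
	while (__atomic_exchange_n(&fallback_lock, 1, __ATOMIC_ACQUIRE) != 0)
		_mm_pause();
	counter++;
	__atomic_store_n(&fallback_lock, 0, __ATOMIC_RELEASE);
}

/* The target attribute lets this compile without passing -mrtm. */
static void __attribute__((target("rtm"))) increment_elided(void)
{
	if (_xbegin() == _XBEGIN_STARTED) {
		/* Put the lock into the read set; abort if someone holds it. */
		if (__atomic_load_n(&fallback_lock, __ATOMIC_RELAXED) != 0)
			_xabort(0xff);
		counter++;
		_xend();
		return;
	}
	/* Any abort lands here: fall back to the real lock. */
	increment_locked();
}

static void increment(void)
{
	if (cpu_has_rtm())
		increment_elided();
	else
		increment_locked();
}
```

A real implementation would cache the CPUID result and retry the transaction a few times before falling back; both are left out here to keep the sketch short.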
HLE
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE) != 0) {
        int val;
        /* Wait for lock to become free again before retrying. */
        do {
            _mm_pause(); /* Abort speculation */
            val = __atomic_load_n(lock, __ATOMIC_CONSUME);
        } while (val == 1);
    }
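The acquire loop above needs a matching release that restores the lock word to its original value, otherwise the elided region cannot commit. A minimal sketch of the pair, assuming gcc 4.8+ on x86, is shown here. The function names `hle_lock()` and `hle_unlock()` are hypothetical, and the `#ifndef` fallback is an assumption added so the sketch still builds (without elision) on compilers that do not define the HLE flags.

```c
#include <immintrin.h>   /* _mm_pause() */

/* Assumption: compilers without HLE support get plain atomics instead. */
#ifndef __ATOMIC_HLE_ACQUIRE
#define __ATOMIC_HLE_ACQUIRE 0
#define __ATOMIC_HLE_RELEASE 0
#endif

/* Hypothetical elided spinlock acquire: 0 = free, 1 = held. */
static void hle_lock(int *lock)
{
	while (__atomic_exchange_n(lock, 1,
			__ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE) != 0) {
		/* Wait for the lock to become free before retrying. */
		do {
			_mm_pause();   /* abort speculation */
		} while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0);
	}
}

/* Hypothetical elided release: writes back the original value (0). */
static void hle_unlock(int *lock)
{
	__atomic_store_n(lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

On CPUs without HLE the prefixes are ignored, so the same code behaves as an ordinary spinlock.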
Older compilers do not support these intrinsics directly. tsx-tools has compatibility headers that implement them for older gcc and compatible compilers (clang, icc, sun cc, etc.). This makes it possible to use TSX without updating the compiler.
rtm.h
Provides the standard TSX intrinsics _xbegin(), _xend(), _xtest(), _xabort()
hle-official.h
Same as rtm.h (compat name)
rtm-goto.h
Alternative unofficial RTM intrinsics implementation for gcc 4.6 with asm goto support (Fedora) or gcc 4.7+. This saves a few instructions for every transaction setup by exposing the jump to the abort handler to the programmer. Useful for people who care about micro-optimizations.
hle-emulation.h
An emulation of the gcc 4.8+ HLE atomic intrinsics. These are similar in spirit, but do not fully match the intrinsics.
gcc 4.8+ implements HLE as an additional memory ordering model for the C11+ atomic intrinsics. gcc has its own flavour of these intrinsics, which is similar to C11 but uses a different naming convention. We cannot directly emulate the full memory model.
So the operations are mapped to __hle_acquire_ and __hle_release_ without an explicit memory model parameter.
The other problem is that C11 atomics use argument overloading to support different types. While that would be possible to emulate, it would generate very ugly macros. We instead add the type size as a postfix.
So for example:
    int foo;
    __atomic_add_fetch(&foo, 1, __ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE);
becomes
    __hle_acquire_add_fetch4(&foo, 1);
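To illustrate the naming scheme, here is a rough sketch of how a size-postfixed operation could be emulated with inline assembly on x86. The `sketch_` names are hypothetical and this is not the actual contents of hle-emulation.h. It assumes GCC-style inline asm and relies on XACQUIRE/XRELEASE being encoded as the 0xf2/0xf3 prefix bytes, which CPUs without HLE ignore.

```c
#include <stdint.h>

/* XACQUIRE and XRELEASE are plain prefix bytes in front of the locked op. */
#define SKETCH_HLE_ACQUIRE ".byte 0xf2 ; "
#define SKETCH_HLE_RELEASE ".byte 0xf3 ; "

/* Hypothetical 4-byte add_fetch with acquire elision: returns the new value. */
static inline int32_t sketch_hle_acquire_add_fetch4(int32_t *ptr, int32_t val)
{
	int32_t old = val;

	/* xadd leaves the previous memory value in the register operand. */
	__asm__ volatile(SKETCH_HLE_ACQUIRE "lock ; xaddl %0,%1"
			 : "+r" (old), "+m" (*ptr)
			 :
			 : "memory", "cc");
	return old + val;
}

/* Hypothetical 4-byte sub_fetch with release elision: returns the new value. */
static inline int32_t sketch_hle_release_sub_fetch4(int32_t *ptr, int32_t val)
{
	int32_t old = -val;

	__asm__ volatile(SKETCH_HLE_RELEASE "lock ; xaddl %0,%1"
			 : "+r" (old), "+m" (*ptr)
			 :
			 : "memory", "cc");
	return old - val;
}
```

The memory model parameter disappears because the acquire or release choice is baked into the function name, and the `4` postfix replaces the type overloading of the compiler intrinsics.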
Also, C11 has some operations that do not map directly to x86 atomic instructions. Since HLE requires that a single instruction starts a transaction, we omit those. That includes nand and the value-returning (fetch) forms of xor, and, or. While they could be mapped to CMPXCHG, this would require a spin loop, which is better not done implicitly. There is also no HLE load.
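For reference, this is roughly what a CMPXCHG-based mapping would look like, using the plain gcc atomic builtins. The helper name `fetch_or_cmpxchg()` is hypothetical. The retry loop is the part that cannot be expressed as the single instruction HLE needs to start a transaction.

```c
#include <stdint.h>

/* Hypothetical fetch_or built from compare-and-exchange; returns the old value. */
static inline uint32_t fetch_or_cmpxchg(uint32_t *ptr, uint32_t mask)
{
	uint32_t old = __atomic_load_n(ptr, __ATOMIC_RELAXED);

	/* On failure `old` is refreshed with the current value, then we retry. */
	while (!__atomic_compare_exchange_n(ptr, &old, old | mask,
					    0 /* strong */,
					    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		;
	return old;
}
```

Every failed iteration is another locked instruction, so hiding such a loop behind an intrinsic that looks like one operation would be misleading.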
x86 supports HLE prefixes for all atomic operations, but not all can currently be generated in this scheme, as many operations have no support for fetch.
A real compiler could generate them by detecting that the fetch value is not used, but we don't have this luxury. For this we have non-_fetch variants. These also support and, or, xor (but not nand) as an extension.
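A non-fetch variant can map onto a single locked read-modify-write instruction because nothing has to be returned. The following is a sketch under the same assumptions as before (x86, GCC-style inline asm, hypothetical `sketch_` names, not the real header):

```c
#include <stdint.h>

/* Hypothetical 4-byte "or" with acquire elision; no value is returned. */
static inline void sketch_hle_acquire_or4(int32_t *ptr, int32_t val)
{
	__asm__ volatile(".byte 0xf2 ; lock ; orl %1,%0"   /* 0xf2 = XACQUIRE */
			 : "+m" (*ptr)
			 : "ri" (val)
			 : "memory", "cc");
}

/* Hypothetical 4-byte "and" with release elision; no value is returned. */
static inline void sketch_hle_release_and4(int32_t *ptr, int32_t val)
{
	__asm__ volatile(".byte 0xf3 ; lock ; andl %1,%0"  /* 0xf3 = XRELEASE */
			 : "+m" (*ptr)
			 : "ri" (val)
			 : "memory", "cc");
}
```

A typical pairing would set a lock bit with the acquire form and clear the same bit with the release form, so the lock word returns to its original value.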
Intrinsics for sbb, adc, neg, btr, bts, btc are not supported.
We also don't implement the non _n generic version of some operations.
Available Operations
(size 8 is only valid on 64-bit)
    __hle_{acquire,release}_add_fetch{1,2,4,8}
    __hle_{acquire,release}_sub_fetch{1,2,4,8}
    __hle_{acquire,release}_fetch_add{1,2,4,8}
    __hle_{acquire,release}_fetch_sub{1,2,4,8}
    __hle_{acquire,release}_{add,sub,or,xor,and}{1,2,4,8} (extension)
    __hle_{acquire,release}_store_n{1,2,4,8}
    __hle_{acquire,release}_clear{1,2,4,8}
    __hle_{acquire,release}_exchange_n{1,2,4,8}
    __hle_{acquire,release}_compare_exchange_n{1,2,4,8}
    __hle_{acquire,release}_test_and_set{1,2,4,8} (sets to 1)
hle-ms.h
An emulation of the Microsoft compiler HLE intrinsics for gcc.