One of the most popular ad-hoc functional debugging techniques is to use the printf or fprintf functions to display the state of variables. However, if these functions are used inside an Intel® Transactional Synchronization Extensions (Intel® TSX) transaction they can cause transaction aborts. The reason is that flushing the print output buffer involves an operating system call and an I/O operation: operations that cannot be roll backed by Intel® TSX. That means that the (f)printf output from transaction may be lost due to the machine state roll-back as a result of the transaction abort caused by the attempt to flush the I/O buffer inside the transaction. If the flush happens after a committed transaction then the printf output won’t be lost.1
In general, any transaction abort handler needs to use a fall-back synchronization mechanism that does not involve Intel TSX. It should, therefore, be possible to see the problem that is being debugged there where printf works as expected. However, what can you do if, for some reason, the problem is not reproducible in the fall-back execution? So far I haven’t had this problem, but if you do please consider the trick shown below.
The idea is to use Intel® Processor Trace (Intel® PT) technology. This records an instruction trace that includes execution inside transactions, but it doesn’t record any data values. To overcome this limitation I map a data value into a code execution path that I can then see in the captured trace.
The basic building block of the method is function that “prints” a char (1 byte) into the processor instruction trace. It takes the character (in the %rdi register) and jumps to that offset in a special code block defined by character’s value. The code block contains 256 1-byte “ret” instructions (return from the function).
The decoder of the processor trace dump can then read the records indicating control transfer from the jmp *%r8 instruction offset to a position in the ‘ret’ table (tsx_print_char_0 label) and interpret them as characters. Using Linux perf this can be done as follows ($7 and $11 are fields in the perf Intel PT dump that contain source and destination of an execution control transfer, 3 is the length of the “jmp *%r8” instruction encoding):
Using this primitive one can built a function that is compatible with the normal printf:
Here is a performance micro-test for the tsx_printf function:
Compiling and running the test:
The output:
The table below shows cycles per iteration when the test runs 10000 iterations on an Intel® Core™ i7-6700 CPU @ 3.40GHz (“Skylake”) with Linux kernel 4.4 and “performance” power governor (“cpupower frequency-set -g performance”). The output of the normal printf (stdout) is redirected to a file to use buffering and improve speed (flushed just once before the end of the test).
printf variant | cycles per iteration |
cycles per iteration with Intel PT collection running |
---|---|---|
no printf | 495 | 583 |
buffered printf | 697 | 848 |
tsx_printf | 1015 | 1356 |
As one can see the overhead of tsx_printf is similar to that of the normal buffered printf.
The source code is available under conditions of the Intel Sample Source Code license.
Web Resources about Intel® Transactional Synchronization Extensions
1 If the functional issue one wants to debug were transaction aborts in the first place then all printf output before the debugged abort point will be lost. However, the performance monitoring unit is able to provide a good guidance on the type of instruction abort (TSX-unfriendly instruction, access to TSX-incompatible memory type) and its location in the code.
.
"