I. How long does record/replay take?
Record/replay overhead is a function of the number of memory accesses and the amount of sharing in the test program.
1. Time for recording/replaying a 'region':
Source: CGO 2014 paper on DrDebug
2. Slow-down for whole-program recording and replay:
Source: measured with PinPlay kit 2.0 (we are continuously looking to improve these numbers).
| Benchmark/Input | How recorded/replayed (pin -t pinplay-driver.so ...) | Recording slow-down | Replay slow-down |
|---|---|---|---|
| SPEC2006/'ref' | -log:mt 0 / -replay:addr_trans 0 | 98x | 11x |
| PARSEC/'native' >=4T | -log:mt 1 / -replay:addr_trans 0 | 197x | 37x |
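
As a rough worked example of what these factors mean in wall-clock terms, the sketch below multiplies a native run time by the measured slow-down. It assumes the factors are relative to native execution and that the first figure in each row is recording and the second is replay. The 60-second native run and the function name `estimate_seconds` are hypothetical.

```python
# Slow-down factors copied from the table above. Reading the first figure as
# recording and the second as replay is an assumption.
SLOWDOWN = {
    "SPEC2006/ref": {"record": 98, "replay": 11},
    "PARSEC/native >=4T": {"record": 197, "replay": 37},
}


def estimate_seconds(native_seconds, suite, phase):
    """Estimated wall-clock time = native run time x slow-down factor."""
    return native_seconds * SLOWDOWN[suite][phase]


# A hypothetical program that takes 60 s when run natively.
for suite in SLOWDOWN:
    rec = estimate_seconds(60, suite, "record")
    rep = estimate_seconds(60, suite, "replay")
    print(f"{suite}: record ~{rec / 60:.0f} min, replay ~{rep / 60:.0f} min")
```

Actual times will vary with the number of memory accesses and the amount of sharing in the program, so this is only a first-order estimate.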
II. Why does PinPlay have a high overhead (especially for recording)?
The design goals of PinPlay were:
- No special HW requirement (including no reliance on HW performance counters).
- No special operating system requirement (no virtual machine and no modified kernel).
- Complete and faithful reproduction of multi-threaded schedules.
- Portability (small size, OS-independence) of recording ("pinball").
- No program source needed. No re-compilation/re-linking required.
As a result, PinPlay works on multiple operating systems 'out of the box' and provides the guarantee that a bug once captured will not escape. However, that comes with a high overhead, especially during recording.
There are two major sources of slow-down in PinPlay (we are continuously looking to improve these):
1. System call side-effect analysis.
A shadow memory is maintained during recording. Every memory write the program performs is replicated in the shadow memory. On every memory read, the 'real' memory value is compared with the 'shadow' value; a mismatch or a missing shadow value means the location was changed by something other than the program's own writes (such as a system call), so an injection is emitted to the *.sel file. At replay time, all memory reads are monitored and the recorded values are injected where present. The details are described in our SIGMETRICS 2006 paper "Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation".
The overhead of this technique is proportional to the number of memory accesses in the program.
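
The shadow-memory idea can be illustrated with a toy model. This is a minimal sketch, not PinPlay's implementation: the class names, the dictionary-based memory, and the `(read_index, address, value)` injection record are all assumptions made for illustration and do not reflect the real *.sel format.

```python
_MISSING = object()  # marks "no shadow value yet"


class SideEffectRecorder:
    """Toy recorder: shadow memory mirrors only the program's own writes."""

    def __init__(self):
        self.shadow = {}      # address -> last value written by the program
        self.injections = []  # (read_index, address, value), stand-in for *.sel
        self.reads = 0

    def on_write(self, addr, value):
        self.shadow[addr] = value

    def on_read(self, addr, real_value):
        # A missing or different shadow value means memory was changed behind
        # the program's back (e.g. by a system call): log the real value.
        if self.shadow.get(addr, _MISSING) != real_value:
            self.injections.append((self.reads, addr, real_value))
            self.shadow[addr] = real_value
        self.reads += 1


class SideEffectReplayer:
    """Toy replayer: injects a logged value just before the matching read."""

    def __init__(self, injections):
        self.pending = {(i, a): v for i, a, v in injections}
        self.memory = {}
        self.reads = 0

    def on_write(self, addr, value):
        self.memory[addr] = value

    def on_read(self, addr):
        key = (self.reads, addr)
        if key in self.pending:
            self.memory[addr] = self.pending[key]
        self.reads += 1
        return self.memory[addr]


# Recording: one program write, then a simulated system call that changes
# real memory without going through on_write.
real = {}
rec = SideEffectRecorder()
real[0x10] = 1
rec.on_write(0x10, 1)
real[0x10] = 7   # side effect on a location the program had written
real[0x20] = 42  # side effect on a location the program never wrote
for addr in (0x10, 0x20, 0x10):
    rec.on_read(addr, real[addr])

# Replay: no system call runs, yet the reads observe the recorded values.
rep = SideEffectReplayer(rec.injections)
rep.on_write(0x10, 1)
replayed = [rep.on_read(a) for a in (0x10, 0x20, 0x10)]
print(rec.injections, replayed)
```

Every read and write passes through a callback in both phases, which is why the cost scales with the total number of memory accesses.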
2. Shared memory access order analysis.
During recording, all memory accesses are monitored and a cache coherency protocol is simulated, which includes tracking the last reader/writer of each shared memory location. A subset of the detected read-after-write, write-after-read, and write-after-write dependences is recorded in the *.race file. During replay, all memory accesses are monitored and a thread is delayed if it tries to access a shared memory location out of the recorded order.
The overhead of this technique is proportional to the number of shared memory accesses in the program.
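
The dependence tracking can be sketched with a toy model as well. The code below is an illustration under stated assumptions, not PinPlay's algorithm: it skips the cache coherency simulation, records every cross-thread dependence rather than a reduced subset, and uses a hypothetical `(kind, address, earlier, later)` tuple in place of the real *.race format. Accesses are identified as `(thread, per-thread sequence number)`.

```python
from collections import defaultdict


class AccessOrderRecorder:
    """Toy recorder of cross-thread RAW, WAR and WAW dependences."""

    def __init__(self):
        self.last_writer = {}             # addr -> (thread, seq) of last write
        self.readers = defaultdict(dict)  # addr -> {thread: seq} since last write
        self.seq = defaultdict(int)       # per-thread access counter
        self.deps = []                    # (kind, addr, earlier, later)

    def access(self, thread, addr, is_write):
        me = (thread, self.seq[thread])
        self.seq[thread] += 1
        writer = self.last_writer.get(addr)
        if is_write:
            # Order this write after reads by other threads (write-after-read).
            for t, s in self.readers[addr].items():
                if t != thread:
                    self.deps.append(("WAR", addr, (t, s), me))
            # Order this write after another thread's write (write-after-write).
            if writer is not None and writer[0] != thread:
                self.deps.append(("WAW", addr, writer, me))
            self.last_writer[addr] = me
            self.readers[addr] = {}
        else:
            # Order this read after another thread's write (read-after-write).
            if writer is not None and writer[0] != thread:
                self.deps.append(("RAW", addr, writer, me))
            self.readers[addr][thread] = me[1]
        return me


def may_proceed(deps, completed, access):
    """Replay check: an access may run once all its recorded predecessors ran."""
    return all(earlier in completed
               for _, _, earlier, later in deps if later == access)


X = 0x1000
rec = AccessOrderRecorder()
rec.access(0, X, is_write=True)    # thread 0 writes X
rec.access(1, X, is_write=False)   # thread 1 reads X  -> RAW
rec.access(0, X, is_write=True)    # thread 0 writes X -> WAR
rec.access(1, X, is_write=True)    # thread 1 writes X -> WAW
print(rec.deps)

# During replay, thread 1's first access must wait for thread 0's first write.
print(may_proceed(rec.deps, set(), (1, 0)))     # delayed
print(may_proceed(rec.deps, {(0, 0)}, (1, 0)))  # allowed
```

Only accesses to locations touched by more than one thread produce dependences, which matches the observation that the cost of this analysis scales with the number of shared memory accesses.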