Tracing Failing MPI Applications
$ export LD_PRELOAD=libVTfs.so
$ mpirun -n 4 ./myApp
$ mpiicc -profile=vtfs myApp.c -o myApp
> echo SET PROFILE_PRELIB=%VT_ROOT%\lib\VTfs.lib > %I_MPI_ROOT%\lib\VTfs.conf
> mpiicc -profile=VTfs myApp.c
> mpiexec -n 4 myApp.exe
Includes events inside the application like segmentation faults and floating point errors, and also abort signals sent from outside.
SIGKILL will abort the application without writing a trace because it cannot be caught.
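Why SIGKILL differs from the other signals can be checked directly on a POSIX system. The following is a minimal sketch, not how libVTfs is implemented: the helper name can_install_handler is hypothetical, and the code only asks the operating system whether a handler may be registered for a given signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Empty handler; only whether it can be installed matters here. */
static void on_signal(int signo) { (void)signo; }

/* Hypothetical helper: returns 1 if a handler can be installed for signo,
   0 if the system refuses. The previous disposition is restored afterwards,
   so the check has no lasting effect on the process. */
int can_install_handler(int signo)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) != 0)
        return 0;                 /* fails with EINVAL for SIGKILL and SIGSTOP */
    sigaction(signo, &old, NULL); /* put the original disposition back */
    return 1;
}
```

Calling can_install_handler(SIGSEGV) or can_install_handler(SIGTERM) from a test program should return 1, while can_install_handler(SIGKILL) should return 0. A tool that relies on signal handlers therefore never gets a chance to write a trace after SIGKILL.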
One or more processes exit without calling
MPI detects certain errors itself, like communication problems or invalid parameters for MPI functions.
If Intel® Trace Collector observes no progress for a certain amount of time in any process, it assumes that a deadlock has occurred, stops the application and writes a trace file.
You can configure the timeout with DEADLOCK-TIMEOUT. "No progress" is defined as "inside the same MPI call". This is only a heuristic and can produce both false positives and false negatives.
If the application uses MPI_Test() or a similar non-blocking function to poll for a message that cannot arrive, Intel® Trace Collector still assumes that progress is being made and does not stop the application.
To avoid this, use blocking MPI calls in the application, which is also better for performance.
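The "inside the same MPI call" rule and the polling case can be illustrated with a small simulation. This is a sketch under stated assumptions, not Intel® Trace Collector's actual implementation: the type progress_state and the functions enter_call, looks_stuck and simulate_wait are hypothetical, time is counted in whole seconds, and 300 seconds stands in for the five-minute default.

```c
#include <stdbool.h>

/* Hypothetical per-process record of the MPI call currently being executed. */
typedef struct {
    long call_id;     /* incremented each time a new MPI call is entered */
    int  entered_at;  /* simulated time (seconds) at which it was entered */
} progress_state;

/* Record that the process has entered a new MPI call. */
void enter_call(progress_state *p, int now)
{
    p->call_id += 1;
    p->entered_at = now;
}

/* "No progress": still inside the same call after timeout_s seconds. */
bool looks_stuck(const progress_state *p, int now, int timeout_s)
{
    return (now - p->entered_at) >= timeout_s;
}

/* Simulate a process waiting total_s seconds for a message that never arrives.
   poll_every_s == 0 models a single blocking call; a positive value models a
   loop that issues a new non-blocking test every poll_every_s seconds.
   Returns true if the heuristic would report a deadlock. */
bool simulate_wait(int total_s, int poll_every_s, int timeout_s)
{
    progress_state p = {0, 0};
    enter_call(&p, 0);
    for (int now = 1; now <= total_s; ++now) {
        if (poll_every_s > 0 && now % poll_every_s == 0)
            enter_call(&p, now);   /* each new call resets the timer */
        if (looks_stuck(&p, now, timeout_s))
            return true;
    }
    return false;
}
```

With these definitions, simulate_wait(600, 0, 300) returns true because the blocking call is detected after 300 seconds, while simulate_wait(600, 1, 300) returns false because the polling loop enters a new call every second and hides the hang. The same rule cannot tell a hung blocking call from a legitimate one that simply lasts longer than the timeout.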
If all processes remain in MPI for a long time, for instance because of a long data transfer, the timeout might be reached. Because the default timeout is five minutes, this is very unlikely.
After writing the trace, libVTfs tries to clean up the MPI application run by sending an INT signal to all processes in the same process group. This is necessary because certain versions of MPICH* may have spawned child processes that keep running when an application aborts prematurely. However, there is a risk that the invoking shell also receives this signal and terminates. If that happens, it helps to invoke mpirun inside a remote shell:
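The process-group behavior behind this cleanup can be sketched in C on a POSIX system. The example assumes that a remote shell helps because it places the job in a different process group than the invoking shell; the text does not state this explicitly. The function name interrupt_isolated_group is hypothetical, and the code does not reproduce what libVTfs does internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that moves into its own process group and
   then sends SIGINT to that whole group. Returns 1 if the child was terminated
   by SIGINT while the caller kept running, 0 if the child ended some other way,
   and -1 if fork() or waitpid() failed. */
int interrupt_isolated_group(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (setpgid(0, 0) != 0)   /* leave the parent's process group */
            _exit(2);
        signal(SIGINT, SIG_DFL);  /* make sure the default action applies */
        kill(0, SIGINT);          /* pid 0 = every process in the caller's group */
        _exit(1);                 /* only reached if SIGINT did not terminate us */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGINT;
}
```

Because the child calls setpgid(0, 0) before signaling, the group-wide SIGINT reaches only the child, and the parent continues to run and can collect the exit status. Without that call, the parent would belong to the same group and would receive the signal as well, which is the situation described for the invoking shell.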
MPI errors cannot be ignored by installing an error handler. libVTfs overrides all requests to install one and uses its own handler instead. This handler stops the application and writes a trace without trying to proceed, because otherwise it would be impossible to guarantee that any trace is written at all.
On Windows* OS, not all features of POSIX* signal handling are available. Therefore, VTfs on Windows* OS uses some heuristics and may not work as reliably as on Linux* OS. It is not possible to stop a Windows* application run and get a trace file by sending a signal or terminating the job in the Windows task manager.