Supported Errors
Errors fall into two different categories:
- Local errors that need only the information available in the process itself and do not require additional communication between processes
- Global errors that require information from other processes
Another aspect of errors is whether the application can continue after they
occur. Minor problems are reported as warnings and allow the application to
continue, but they lead to resource leaks or portability problems. Real errors are
invalid operations that can only be skipped to proceed, but this either changes the
application semantics (for example, transmission errors) or leads to follow-up errors
(for example, skipping an invalid send can lead to a deadlock because of the missing
message). Fatal errors cannot be resolved at all and require an application
shutdown.
Problems are counted separately per process. Disabled errors are neither reported
nor counted, even if they still happen to be detected. The application is aborted
as soon as a certain number of errors is encountered; the first fatal error always
requires an abort. Once the number of errors reaches CHECK-MAX-ERRORS or the total
number of reports (regardless of whether they are warnings or errors) reaches
CHECK-MAX-REPORTS, whichever comes first, the application is aborted. These limits
apply to each process separately. Even if one process gets stopped, the other
processes are allowed to continue to see whether they run into further errors. The
whole application is then aborted after a certain grace period. This timeout can be
set through CHECK-TIMEOUT.
The default for CHECK-MAX-ERRORS is 1, so the first error already aborts, whereas
CHECK-MAX-REPORTS defaults to 100, so that many warnings are allowed. Setting both
values to 0 removes the limits. Setting CHECK-MAX-REPORTS to 1 turns the first
warning into a reason to abort.
When using an interactive debugger, the limits can be set to 0 manually and thus
removed, because the user can decide to abort using the normal debugger facilities
for application shutdown. If the user chooses to continue, Intel® Trace Collector
skips over warnings and non-fatal errors and tries to proceed. Fatal errors still
force Intel® Trace Collector to abort the application.
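
The counting rules above can be modeled in a few lines. The following is a minimal sketch, not Intel® Trace Collector code: the names `Severity` and `ProcessReportCounter` are hypothetical, and it assumes that a fatal error counts as a report and aborts regardless of the limits. The default values mirror the documented defaults of CHECK-MAX-ERRORS and CHECK-MAX-REPORTS.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"
    FATAL = "fatal"


@dataclass
class ProcessReportCounter:
    """Hypothetical per-process model of the abort limits (illustration only)."""

    max_errors: int = 1     # documented default of CHECK-MAX-ERRORS; 0 = no limit
    max_reports: int = 100  # documented default of CHECK-MAX-REPORTS; 0 = no limit
    errors: int = 0
    reports: int = 0

    def record(self, severity: Severity, enabled: bool = True) -> bool:
        """Count one detected problem; return True if this process must abort."""
        if not enabled:
            # Disabled errors are neither reported nor counted.
            return False
        self.reports += 1
        if severity is Severity.FATAL:
            # The first fatal error always requires an abort.
            return True
        if severity is Severity.ERROR:
            self.errors += 1
        # A limit of 0 is falsy, so that limit is skipped entirely.
        if self.max_errors and self.errors >= self.max_errors:
            return True
        if self.max_reports and self.reports >= self.max_reports:
            return True
        return False
```

With the defaults, a warning lets the process continue and the first error aborts it. With both limits set to 0, only a fatal error aborts.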
The following tables list the supported errors. Each description provides just a
few keywords; a more detailed description of each error can be found in the
following sections.
Local Errors
| Error Name | Type | Description |
|---|---|---|
| | Fatal | Process terminated by fatal signal |
| | Fatal | Process exits without calling MPI_Finalize() |
| | Depends on MPI and error | MPI itself or wrapper detects an error |
| | Warning | Multiple MPI operations are started using the same memory |
| | Error | Data modified while owned by MPI |
| | Error | Buffer given to MPI cannot be read or written |
| | Error | Read or write access to memory currently owned by MPI |
| | Error | Distributed memory checking |
| | Error | Invalid sequence of calls |
| | Warning | Program creates suspiciously high number of requests or exits with pending requests |
| | Warning | An active request has been freed |
| | Warning | Program creates high number of data types |
| | Warning | Not enough space for buffered send |
Global Errors
| Error Name | Type | Description |
|---|---|---|
| | Error | The type signature does not match |
| | Error | Data modified during transmission |
| | Warning | Program terminates with unreceived messages |
| | Fatal | A cycle of processes waiting for each other |
| | Fatal (a) | A cycle of processes, one or more in blocking send |
| | Warning | Warning when application might be stuck |
| | Error | Processes enter different collective operations |
| | Error | More or less data than expected |
| | Error | Reduction operation inconsistent |
| | Error | Root parameter inconsistent |
| | Error | Invalid parameter for collective operation |
| | Warning | MPI_Comm_free() must be called collectively |

(a) Fatal if the check is enabled; otherwise it depends on the MPI implementation.
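
To illustrate what "a cycle of processes waiting for each other" means, the following sketch searches a wait-for relation for a cycle. It is an illustration under simplifying assumptions, not the detection method used by Intel® Trace Collector: the function name `find_wait_cycle` is hypothetical, and each blocked rank is assumed to wait on exactly one other rank.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_for: Dict[int, int]) -> Optional[List[int]]:
    """Return the ranks that form a wait cycle, or None if there is none.

    waits_for maps a blocked rank to the single rank it is waiting on.
    Ranks that do not appear as keys are assumed not to be blocked.
    """
    finished = set()  # ranks already shown not to lead into a cycle
    for start in waits_for:
        path = []      # ranks visited on the current walk, in order
        position = {}  # rank -> its index in path
        rank = start
        while rank in waits_for and rank not in finished:
            if rank in position:
                # The walk returned to a rank on the current path: a cycle.
                return path[position[rank]:]
            position[rank] = len(path)
            path.append(rank)
            rank = waits_for[rank]
        finished.update(path)
    return None
```

For example, if rank 0 waits on rank 1, rank 1 on rank 2, and rank 2 on rank 0, the function returns all three ranks. A chain that ends at a rank that is not blocked yields `None`.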