IMB-IO Non-blocking Benchmarks
Intel(R) MPI Benchmarks implements the blocking and nonblocking modes of the IMB-IO benchmarks as different benchmark flavors. In the names of the nonblocking flavors, the Read and Write components of the blocking benchmark name are replaced by IRead and IWrite, respectively.
The definitions of blocking and nonblocking flavors are identical, except for their behavior in regard to:
- Aggregation. The nonblocking versions only run in the non-aggregate mode. 
- Synchronism. Only the meaning of an elementary transfer differs from the equivalent blocking benchmark. 
In the nonblocking flavors, an elementary transfer is timed as follows:
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
{
    Initiate transfer
    Exploit CPU
    Wait for the end of transfer
}
time = (MPI_Wtime()-time)/n_sample 
The Exploit CPU step in the example above is arbitrary. Intel(R) MPI Benchmarks exploits the CPU as described below.
Exploiting CPU
Intel(R) MPI Benchmarks uses the following method to exploit the CPU. A kernel loop is executed repeatedly. The kernel is a fully vectorizable multiplication of a 100x100 matrix with a vector. The run time of the function is scalable through its desired_time argument:
IMB_cpu_exploit(float desired_time, int initialize);
The input value of desired_time determines how long the function executes the kernel loop, with a slight variance. At the very beginning, the function is called with initialize=1 and an input value for desired_time. This call determines an Mflop/s rate and a timing t_CPU, as close as possible to desired_time, obtained by running without any concurrent activity. During the actual benchmarking, IMB_cpu_exploit is called with initialize=0, concurrently with the particular I/O action, and always performs the same type and number of operations as in the initialization step.
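The calibrate-then-replay behavior described above can be illustrated with a minimal sketch. This is not the IMB source: the names cpu_exploit_sketch, mflops_rate, run_kernel, and calibrated_reps are hypothetical, the probe count of 200 is an arbitrary choice, and the timer is the standard C clock() function (CPU time) rather than whatever timer the benchmark actually uses.

```c
#include <assert.h>
#include <time.h>

#define N 100                       /* kernel works on a 100x100 matrix */

static float a[N][N], x[N], y[N];
static volatile float sink = 1.0f;  /* keeps the compiler from removing the loop */
static long calibrated_reps = 1;    /* kernel repetitions fixed at initialization */

/* CPU seconds consumed so far, using the standard C clock(). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Run the matrix-vector kernel `reps` times and return the elapsed time. */
static double run_kernel(long reps)
{
    double start = cpu_seconds();
    for (long r = 0; r < reps; r++) {
        x[0] = sink;                /* volatile read: the pass cannot be hoisted */
        for (int i = 0; i < N; i++) {
            float sum = 0.0f;
            for (int j = 0; j < N; j++)
                sum += a[i][j] * x[j];
            y[i] = sum;
        }
        sink = y[r % N];
    }
    return cpu_seconds() - start;
}

/* Hypothetical stand-in for IMB_cpu_exploit; returns the measured time. */
double cpu_exploit_sketch(float desired_time, int initialize)
{
    if (initialize) {
        assert(desired_time > 0.0f);
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0f / N;
        }
        /* Time a short probe run, then scale the repetition count so that
           one call takes roughly desired_time seconds. */
        const long probe_reps = 200;
        double probe_time = run_kernel(probe_reps);
        if (probe_time <= 0.0)
            probe_time = 1e-6;      /* guard against coarse timer resolution */
        calibrated_reps = (long)(probe_reps * desired_time / probe_time);
        if (calibrated_reps < 1)
            calibrated_reps = 1;
    }
    /* With initialize=0 the same number of operations is simply replayed. */
    return run_kernel(calibrated_reps);
}

/* Mflop/s rate: each kernel pass performs 2*N*N floating-point operations. */
double mflops_rate(double elapsed_seconds)
{
    return 2.0 * N * N * (double)calibrated_reps / elapsed_seconds / 1e6;
}
```

In this sketch, a first call such as cpu_exploit_sketch(0.01f, 1) fixes the repetition count and returns a dedicated timing that plays the role of t_CPU. Later calls with initialize=0 repeat exactly the same work, so any extra time they take can be attributed to the concurrent activity.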
Displaying Results
Three timings are crucial to interpret the behavior of nonblocking I/O, overlapped with CPU exploitation:
- t_pure is the time for the corresponding pure blocking I/O action, non-overlapping with CPU activity 
- t_CPU is the time that the IMB_cpu_exploit calls (which run concurrently with the nonblocking I/O) would take when running dedicated, that is, with no I/O in progress 
- t_ovrl is the time for the analogous nonblocking I/O action, concurrent with CPU activity (the CPU activity being the one that takes t_CPU when running dedicated) 
A perfect overlap means: t_ovrl = max(t_pure, t_CPU).
No overlap means: t_ovrl = t_pure + t_CPU.
The actual amount of overlap is:
overlap = (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)    (*)
The Intel(R) MPI Benchmarks result tables report the timings t_ovrl, t_pure, t_CPU and the estimated overlap obtained by the (*) formula above. At the beginning of a run, the Mflop/s rate corresponding to t_CPU is displayed.
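As a quick check of the (*) formula, the following sketch evaluates it for the two limiting cases and one intermediate case. The function name overlap_estimate is hypothetical and is not part of Intel(R) MPI Benchmarks. The sketch returns the raw ratio and makes no assumption about how the benchmark scales or formats the value in its output.

```c
#include <assert.h>

/* Overlap estimate per the (*) formula:
   (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU).
   All three timings must be in the same unit. */
double overlap_estimate(double t_pure, double t_cpu, double t_ovrl)
{
    double shorter = (t_pure < t_cpu) ? t_pure : t_cpu;
    assert(shorter > 0.0);          /* both timings must be positive */
    return (t_pure + t_cpu - t_ovrl) / shorter;
}
```

With t_pure = 2 and t_CPU = 3, a perfect overlap (t_ovrl = 3) yields 1, no overlap (t_ovrl = 5) yields 0, and t_ovrl = 4 yields 0.5, meaning that half of the shorter activity was hidden behind the longer one.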