Intel(R) MPI Benchmarks implements blocking and nonblocking modes of the IMB-IO benchmarks as different benchmark flavors. The Read and Write components of the blocking benchmark name are replaced for nonblocking flavors by IRead and IWrite, respectively.
The definitions of blocking and nonblocking flavors are identical, except for their behavior in regard to:
Basically, an elementary transfer looks as follows:
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
{
    Initiate transfer
    Exploit CPU
    Wait for the end of transfer
}
time = (MPI_Wtime()-time)/n_sample
The Exploit CPU section in the above example is arbitrary. Intel(R) MPI Benchmarks exploits CPU as described below.
Intel(R) MPI Benchmarks uses the following method to exploit the CPU. A kernel loop is executed repeatedly. The kernel is a fully vectorizable multiplication of a 100x100 matrix with a vector. The function is scalable in the following way:
IMB_cpu_exploit(float desired_time, int initialize);
The input value of desired_time determines the time for the function to execute the kernel loop, with a slight variance. At the very beginning, the function is called with initialize=1 and an input value for desired_time. This determines an Mflop/s rate and a timing t_CPU, as close as possible to desired_time, obtained by running without any obstruction. During the actual benchmarking, IMB_cpu_exploit is called with initialize=0, concurrently with the particular I/O action, and always performs the same type and number of operations as in the initialization step.
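The calibrate-then-replay behavior described above can be sketched in plain C. This is only an illustration under stated assumptions, not the Intel(R) MPI Benchmarks source: the names cpu_exploit_sketch, run_kernel, now_sec, n_reps and cpu_mflops are hypothetical, the standard clock() function stands in for MPI_Wtime so that the sketch builds without MPI, and the feedback into x exists only to keep a compiler from discarding repeated passes.

```c
#include <assert.h>
#include <time.h>

#define N 100                     /* matrix dimension given in the text */

static float  a[N][N], x[N], y[N];
static long   n_reps = 1;         /* repetitions fixed by the calibration call */
static double cpu_mflops = 0.0;   /* rate measured during calibration */

/* CPU time in seconds; clock() stands in for MPI_Wtime (assumption). */
static double now_sec(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Run `reps` kernel passes; each pass computes y = A*x for a 100x100 matrix. */
static void run_kernel(long reps)
{
    for (long r = 0; r < reps; r++) {
        for (int i = 0; i < N; i++) {
            float s = 0.0f;
            for (int j = 0; j < N; j++)
                s += a[i][j] * x[j];
            y[i] = s;
        }
        /* Feed one result back so repeated passes cannot be optimized away. */
        x[r % N] = 0.5f * y[r % N] + 0.5f;
    }
}

/* Hypothetical stand-in for IMB_cpu_exploit; returns the time it spent. */
double cpu_exploit_sketch(float desired_time, int initialize)
{
    assert(desired_time > 0.0f);
    if (initialize) {
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0f / N;
        }
        /* Double a probe batch until its run time is measurable. */
        long   probe = 1;
        double t_probe;
        do {
            probe *= 2;
            double t0 = now_sec();
            run_kernel(probe);
            t_probe = now_sec() - t0;
        } while (t_probe < 1e-3);
        /* 2*N*N floating-point operations per pass (multiply and add). */
        cpu_mflops = 2.0 * N * N * (double)probe / t_probe / 1e6;
        /* Scale the repetition count so one call takes about desired_time. */
        n_reps = (long)((double)desired_time * (double)probe / t_probe);
        if (n_reps < 1)
            n_reps = 1;
    }
    double t0 = now_sec();
    run_kernel(n_reps);           /* same operation count on every later call */
    return now_sec() - t0;
}
```

In this sketch, a first call such as cpu_exploit_sketch(0.05f, 1) fixes n_reps and returns a value that plays the role of t_CPU. Later calls with initialize=0 repeat exactly n_reps kernel passes, so any extra time they take can be attributed to contention with the concurrent I/O.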
Three timings are crucial to interpret the behavior of nonblocking I/O, overlapped with CPU exploitation:
t_pure is the time for the corresponding pure blocking I/O action, non-overlapping with CPU activity
t_CPU is the time the IMB_cpu_exploit periods (running concurrently with nonblocking I/O) would use when running dedicated
t_ovrl is the time for the analogous nonblocking I/O action, concurrent with CPU activity (exploiting t_CPU when running dedicated)
A perfect overlap means: t_ovrl = max(t_pure,t_CPU)
No overlap means: t_ovrl = t_pure+t_CPU.
The actual amount of overlap is:
overlap = (t_pure + t_CPU - t_ovrl)/min(t_pure,t_CPU) (*)
The Intel(R) MPI Benchmarks result tables report the timings t_ovrl, t_pure, t_CPU and the estimated overlap obtained by the (*) formula above. At the beginning of a run, the Mflop/s rate corresponding to t_CPU is displayed.
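As a small worked illustration, the overlap estimate can be computed from the three timings as sketched below. The sketch assumes the normalization overlap = (t_pure + t_CPU - t_ovrl)/min(t_pure,t_CPU), which gives 1 for the perfect-overlap case and 0 for the no-overlap case defined above. The function name overlap_estimate and the helper min2 are hypothetical and are not part of Intel(R) MPI Benchmarks.

```c
#include <assert.h>

/* Smaller of two doubles. */
static double min2(double a, double b)
{
    return a < b ? a : b;
}

/*
 * Estimated overlap from the three timings (all in the same unit):
 *   t_pure: blocking I/O alone
 *   t_cpu : CPU exploitation alone
 *   t_ovrl: nonblocking I/O running concurrently with CPU exploitation
 * Yields 1.0 when t_ovrl == max(t_pure, t_cpu) and 0.0 when
 * t_ovrl == t_pure + t_cpu.
 */
double overlap_estimate(double t_pure, double t_cpu, double t_ovrl)
{
    assert(t_pure > 0.0 && t_cpu > 0.0);   /* avoid division by zero */
    return (t_pure + t_cpu - t_ovrl) / min2(t_pure, t_cpu);
}
```

For example, with t_pure = 2.0, t_CPU = 1.0 and t_ovrl = 2.5, the estimate is (2.0 + 1.0 - 2.5)/1.0 = 0.5, meaning that half of the shorter activity was hidden behind the longer one.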