Introduction
by Michael Stoner
If you are an application performance engineer, you might be familiar with the process of developing a benchmark to measure the execution time of the piece of code you are optimizing. This is essential to determine whether your code changes are helping or hurting performance. Most tuning experts accomplish this using assembly code to access the processor's internal time-stamp counter, a fairly simple task on the IA-32 platform running the Windows* operating system. However, what happens when you try to do this on Linux*, or the Itanium® 2 processor? If these platforms constitute unknown territory, don't worry — all of the functionality you need is available in the "IAPerf.h" performance measurement macros for Intel® architecture.
How to use "IAPerf.h"
Appendix A contains the code making up the "IAperf.h" header. Actually, you can copy and paste it into a file and call it whatever you like, but "IAperf.h" seems suitable.
Figure 1 outlines how the macros can be used. The first macro, PERFINITMHZ, takes the clock speed of the machine in megahertz as input. It declares the variables used by the timing code and sets the clockspeed variable so that time can be reported in seconds. (Hard-coding the clock speed is an obvious limitation, since there are ways to obtain it at run time.) The macros PERFSTART and PERFSTOP both record the value of the time-stamp counter and are to be used as bookends around the code you want to evaluate. Finally, the PERFREPORT macro uses printf() to display the elapsed time between PERFSTART and PERFSTOP.
#include "IAperf.h"
...
Foo(...)
{
    [initialization code, variable declarations, etc.]
    PERFINITMHZ(500)    // initialize performance measurement macros for a 500 MHz machine
    [more set up code - not under test]
    PERFSTART           // record time-stamp counter prior to entering performance-critical code
    [performance-critical code under examination by the user]
    PERFSTOP            // record time-stamp counter after leaving performance-critical code
    PERFREPORT          // report the length of time spent in the performance-critical code
    [more code not under test]
}
Figure 1. "IAPerf.h" example usage.
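The hard-coded clock speed passed to PERFINITMHZ can in principle be replaced by a value estimated at run time. The following is a minimal sketch of one way to do that, not part of "IAPerf.h": it counts ticks over a short busy-wait and divides by the wall-clock time that elapsed. The names tick_source_fn, wall_seconds, demo_ticks, and estimate_ticks_per_second are hypothetical, and the sketch assumes a POSIX system with clock_gettime(). The demo_ticks function is only a stand-in so the sketch can be tried anywhere; on a real target, a small wrapper that calls rdtsc() and casts the result to uint64_t would be passed instead.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime() on strict compilers */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for any tick source, e.g. a wrapper around rdtsc(). */
typedef uint64_t (*tick_source_fn)(void);

/* Wall-clock time in seconds, read from the POSIX monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Stand-in tick source (assumption, for illustration only):
   advances at a nominal 1,000,000 ticks per second. */
static uint64_t demo_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)(ts.tv_nsec / 1000);
}

/* Busy-waits for about interval_sec seconds and returns the observed
   tick rate in ticks per second, or 0.0 if the interval is not positive. */
static double estimate_ticks_per_second(tick_source_fn read_ticks,
                                        double interval_sec)
{
    double t0, t1;
    uint64_t c0, c1;

    if (interval_sec <= 0.0) {
        return 0.0;
    }

    t0 = wall_seconds();
    c0 = read_ticks();
    do {
        t1 = wall_seconds();
        c1 = read_ticks();
    } while (t1 - t0 < interval_sec);

    return (double)(c1 - c0) / (t1 - t0);
}
```

Dividing the returned rate by 1,000,000 gives a megahertz figure comparable to the constant in Figure 1. A longer interval gives a steadier estimate at the cost of a longer start-up delay.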
When optimizing a function within a large application, you may desire to break that function into a small console application to facilitate faster recompilation and easier performance timing. This is an ideal framework for "IAPerf.h" because it can simply measure the time it takes to loop the function for a fixed number of iterations (typically, the number of times necessary to get a workload that runs for several seconds).
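A micro-benchmark of this kind can be sketched as a loop around the function under test. The example below is illustrative only: work_fn, time_loop_seconds, sample_work, and work_calls are hypothetical names, and the standard C clock() function stands in for the PERFSTART/PERFSTOP pair so that the sketch compiles without the header.

```c
#include <time.h>

/* Hypothetical signature for the function under test. */
typedef void (*work_fn)(void);

/* Calls fn() `iterations` times and returns the elapsed CPU time in seconds.
   The two clock() calls mark where PERFSTART and PERFSTOP would go. */
static double time_loop_seconds(work_fn fn, unsigned long iterations)
{
    unsigned long i;
    clock_t start, stop;

    start = clock();              /* PERFSTART would go here */
    for (i = 0; i < iterations; i++) {
        fn();
    }
    stop = clock();               /* PERFSTOP would go here */

    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Trivial example workload: counts how many times it was called.
   volatile keeps the compiler from optimizing the increment away. */
static volatile unsigned long work_calls = 0;

static void sample_work(void)
{
    work_calls++;
}
```

Dividing the result by the iteration count gives the time per call. The iteration count would normally be raised until the whole loop runs for several seconds, as described above.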
Often, though, you will want to measure some part of an application that cannot be easily separated into a "micro-benchmark". The metric of choice for this is usually the average time spent passing through the region of interest, assuming the code runs repeatedly during the application workload. The "IAPerf.h" macros will require some modification to gather the average time over many passes through the code, but this should be a simple task for an experienced C programmer. For a GUI application without console output, the reporting mechanism will need to either write to the screen or to a file.
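One possible shape for that modification is sketched below. It is not part of "IAPerf.h": the perf_accum type and its functions are hypothetical names. The caller passes in the counter value (for example, the result of rdtsc()) at the start and end of each pass, and the accumulator keeps a running total and a pass count. The report function takes a FILE pointer so that a GUI application can direct the output to a log file instead of the console.

```c
#include <stdint.h>
#include <stdio.h>

/* Running totals for one instrumented region (hypothetical type). */
typedef struct {
    uint64_t start_ticks;  /* counter value at the start of the current pass */
    uint64_t total_ticks;  /* sum of ticks over all completed passes */
    uint64_t passes;       /* number of completed passes */
} perf_accum;

/* Record the counter value on entry to the region. */
static void perf_accum_start(perf_accum *acc, uint64_t now)
{
    acc->start_ticks = now;
}

/* Add the ticks spent in this pass and count the pass. */
static void perf_accum_stop(perf_accum *acc, uint64_t now)
{
    acc->total_ticks += now - acc->start_ticks;
    acc->passes++;
}

/* Average time per pass in seconds; 0.0 if nothing has been recorded. */
static double perf_accum_average_seconds(const perf_accum *acc,
                                         double ticks_per_second)
{
    if (acc->passes == 0 || ticks_per_second <= 0.0) {
        return 0.0;
    }
    return ((double)acc->total_ticks / (double)acc->passes) / ticks_per_second;
}

/* Write a one-line summary to any stream, e.g. stdout or a log file. */
static void perf_accum_report(const perf_accum *acc, double ticks_per_second,
                              FILE *out)
{
    fprintf(out, "passes = %llu, average time = %f sec\n",
            (unsigned long long)acc->passes,
            perf_accum_average_seconds(acc, ticks_per_second));
}
```

In use, the accumulator would be zero-initialized once, perf_accum_start and perf_accum_stop would bracket the region of interest on every pass, and perf_accum_report would be called when the workload ends.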
Many more macros could be written to manage the timing data in different ways, but it would be difficult to provide a method that would suit every user. The intent of "IAPerf.h" is not to provide a black box solution for all performance measurement activities, but rather to abstract the task of reading the processor time-stamp counter in different environments.
Supported Environments
The environments supported by "IAPerf.h" are categorized by processor architecture, operating system, and compiler, as shown in Table 1. The macros may also work on UNIX* operating systems running on Intel Architecture, but proper results have not been verified.
| Architecture | Windows* | Linux* |
| Intel® IA-32 architecture | Microsoft Visual C++* 6.0/7.0; Intel C/C++ Compiler 5.0/6.0 | GCC; Intel C/C++ Compiler 5.0/6.0 |
| Itanium® architecture | Microsoft Visual C++ 6.0/7.0; Intel C/C++ Compiler 5.0/6.0 | GCC; Intel C/C++ Compiler 5.0/6.0 |
Table 1. Supported Environments.
Some general comments on the code that makes up "IAPerf.h":
- The PERFSTART and PERFSTOP macros both call the rdtsc() function, which uses #define's to compile the code compatible with the targeted platform. It seems logical to reduce call/return overhead by inlining rdtsc(), but this may have adverse effects on IA-32 architecture where inline assembly is used to access the time-stamp counter. Certain compiler optimizations may be turned off whenever inline assembly is present in a function, and inlining the rdtsc() function will be equivalent to putting inline assembly in the code you are timing.
- If you choose to edit the macros, make sure not to leave any extraneous tabs or spaces after the '\' line continuation characters. This is an easy mistake to make, yet very hard to debug if you have not seen this kind of error before.
- The "volatile" keyword is critical for some compilers to indicate that the value of the time-stamp counter is constantly changing. For example, early versions of the Microsoft 64-bit compiler assumed that the value of the ar.itc register was static, since the program never writes a new value into it. The compiler would read ar.itc once during PERFSTART and then reuse the same value for PERFSTOP, giving a result of zero seconds elapsed.
- On IA-32 architecture, the CPUID instruction is inserted prior to reading the time-stamp counter. This instruction serializes the out-of-order execution engine so that all instructions ahead of CPUID retire before the time-stamp counter is read. Likewise, none of the instructions following CPUID will be executed until the CPUID instruction has retired. Serialization at such fine granularity is not that critical unless you are measuring a very short time window.
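The line-continuation point above is easiest to see in a small multi-line macro. The sketch below is illustrative and not taken from the header: PERF_FORMAT is a hypothetical macro that formats an elapsed time into a caller-supplied buffer. Each continued line ends with a backslash as its very last character. If a space or tab follows the backslash, some compilers no longer treat the next line as part of the macro, which usually shows up as a confusing error far from the real cause.

```c
#include <stdio.h>

/* Hypothetical macro, for illustration only.
   The backslash must be the last character on each continued line;
   the final line of the macro has no backslash. */
#define PERF_FORMAT(buf, size, ticks, ticks_per_sec)                 \
    snprintf((buf), (size), "time elapsed = %f sec",                 \
             (double)(ticks) / (double)(ticks_per_sec))
```

Wrapping each macro argument in parentheses, as done here, also keeps expressions such as a sum passed as ticks from being split by operator precedence after expansion.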
About the Author

Mike Stoner is a Senior Applications Engineer with Intel's Software Solutions Group. He has been with Intel since 1996, mainly in the role of helping independent software vendors develop optimized code for Intel platforms. Prior to joining Intel, Mike received bachelor's and master's degrees in Electrical Engineering from The Ohio State University. He can be reached at michael.stoner@intel.com.
Appendix A - IAPerf.h code
// IAPERF.H
//
#include <stdio.h>
// RDTSC functions
#ifdef __linux__
#ifdef __ia64__
#ifdef __INTEL_COMPILER
// Itanium, Intel Compiler: read ar.itc through the __getReg intrinsic
#ifdef __cplusplus
extern "C" { unsigned __int64 __getReg(int whichReg); }
#else
unsigned __int64 __getReg(int whichReg);
#endif
#pragma intrinsic(__getReg)
#define INL_REGID_APITC 3116
unsigned __int64 rdtsc()
{
    volatile unsigned __int64 temp;
    temp = __getReg(INL_REGID_APITC);
    return temp;
}
#else // __INTEL_COMPILER
// Assume 64-bit gcc
unsigned long rdtsc()
{
    volatile unsigned long temp;
    __asm__ __volatile__("mov %0=ar.itc" : "=r"(temp) :: "memory");
    return temp;
}
#endif // __INTEL_COMPILER
#else // __ia64__
// Assume IA32, gcc or Intel Compiler
unsigned long long rdtsc()
{
    unsigned long long temp;
    __asm__ __volatile__
    (
        "cpuid\n"                  // serialize before reading the counter
        "rdtsc\n"                  // counter value returned in edx:eax
        "leal %0, %%ecx\n"         // ecx = address of temp
        "movl %%eax, (%%ecx)\n"    // store low 32 bits
        "movl %%edx, 4(%%ecx)\n"   // store high 32 bits
        :
        : "m" (temp)
        : "%eax", "%ebx", "%ecx", "%edx");
    return temp;
}
#endif // __ia64__
#else // __linux__
#ifdef WIN64
// Assume Microsoft or Intel Compiler
#ifdef __cplusplus
extern "C" { unsigned __int64 __getReg(int whichReg); }
#else
unsigned __int64 __getReg(int whichReg);
#endif
#pragma intrinsic(__getReg)
#define INL_REGID_APITC 3116
unsigned __int64 rdtsc()
{
    volatile unsigned __int64 temp;
    temp = __getReg(INL_REGID_APITC);
    return temp;
}
#else // WIN64
// Assume WIN32 with Microsoft or Intel Compiler
unsigned __int64 rdtsc()
{
    volatile unsigned __int64 temp;
    _asm cpuid
    _asm rdtsc
    _asm lea ecx, temp
    _asm mov [ecx], eax
    _asm mov [ecx+4], edx
    return temp;
}
#endif // WIN64
#endif // __linux__
// The IAperf Macros
#ifdef __linux__
#define PERFINITMHZ(clkspd) \
    unsigned long long clocks; \
    double clockspeed = (unsigned long long)(clkspd) * 1000000;
#define PERFREPORT \
    printf("time elapsed = %f sec\n", ((double)clocks)/clockspeed);
#else
#define PERFINITMHZ(clkspd) \
    unsigned __int64 clocks; \
    double clockspeed = (double)1000000 * (clkspd);
#define PERFREPORT \
    printf("time elapsed = %f sec\n", ((double)clocks)/clockspeed);
#endif
#define PERFSTART clocks = rdtsc();
#define PERFSTOP clocks = rdtsc() - clocks;