Portable Performance Measurement Macros for Intel® Architecture



Last Modified On :   May 1, 2008 11:18 PM PDT


Introduction
by Michael Stoner

If you are an application performance engineer, you might be familiar with the process of developing a benchmark to measure the execution time of the piece of code you are optimizing. This is essential to determine whether your code changes are helping or hurting performance. Most tuning experts accomplish this using assembly code to access the processor's internal time-stamp counter, a fairly simple task on the IA-32 platform running the Windows* operating system. However, what happens when you try to do this on Linux*, or the Itanium® 2 processor? If these platforms constitute unknown territory, don't worry — all of the functionality you need is available in the "IAPerf.h" performance measurement macros for Intel® architecture.
How to use "IAPerf.h"
Appendix A contains the code making up the "IAperf.h" header. Actually, you can copy and paste it into a file and call it whatever you like, but "IAperf.h" seems suitable.

Figure 1 outlines how the macros can be used. The first macro, PERFINITMHZ, takes the clock speed of the machine in megahertz as input. It declares the variables used by the timing code and sets the clockspeed variable so that time can be reported in seconds. (Hard-coding the clock speed is an obvious limitation, since there are ways to obtain it at run time.) The macros PERFSTART and PERFSTOP both record the value of the time-stamp counter and are to be used as bookends around the code you want to evaluate. Finally, the PERFREPORT macro uses printf() to display the elapsed time between PERFSTART and PERFSTOP.

 #include "IAperf.h"

 __..


 Foo(_)

 {

        [initialization code, variable declarations, etc.]

        PERFINITMHZ(500) // initialize performance measurement macros for a 500 MHz machine

        [more set up code - not under test]

        PERFSTART // record time-stamp counter prior to entering performance-critical code

        [performance-critical code under examination by the user]

        PERFSTOP // record time-stamp counter after leaving performance-critical code

        PERFREPORT // report the length of time spent in the performance-critical code

        [more code not under test]

 }



Figure 1. "IAPerf.h" example usage.
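
The arithmetic behind the reported time is simply the counter difference divided by the clock rate in hertz. The sketch below spells that out; the helper names cycles_to_seconds and report_elapsed are hypothetical and are not part of the header in Appendix A.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not in the Appendix A header): converts a
   time-stamp counter difference into seconds for a machine whose
   clock speed is given in megahertz. */
double cycles_to_seconds(unsigned long long cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);   /* a zero clock speed would divide by zero */
    return (double)cycles / (clock_mhz * 1000000.0);
}

/* Prints the result in the same style as PERFREPORT. */
void report_elapsed(unsigned long long cycles, double clock_mhz)
{
    printf("time elapsed = %f sec\n", cycles_to_seconds(cycles, clock_mhz));
}
```

For example, a difference of 500,000,000 counter ticks on a 500 MHz machine corresponds to one second.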
When optimizing a function within a large application, you may desire to break that function into a small console application to facilitate faster recompilation and easier performance timing. This is an ideal framework for "IAPerf.h" because it can simply measure the time it takes to loop the function for a fixed number of iterations (typically, the number of times necessary to get a workload that runs for several seconds).
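
A micro-benchmark of this kind might be sketched as follows. This is only an illustration under stated assumptions: time_loop, sample_work, and clock_counter are hypothetical names, and clock_counter is a portable stand-in built on the standard clock() function rather than the processor time-stamp counter.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Function types for the counter reader and the code under test. */
typedef unsigned long long (*counter_fn)(void);
typedef void (*work_fn)(void);

/* Hypothetical workload: counts how many times it has been called. */
unsigned long sample_calls = 0;
void sample_work(void) { sample_calls++; }

/* Stand-in counter using the standard clock(); it is NOT the
   time-stamp counter and has much coarser resolution. */
unsigned long long clock_counter(void)
{
    return (unsigned long long)clock();
}

/* Runs work() a fixed number of times and returns the counter
   difference measured around the whole loop. */
unsigned long long time_loop(work_fn work, unsigned long iterations,
                             counter_fn read_counter)
{
    unsigned long long start, stop;
    unsigned long i;

    assert(work != NULL && read_counter != NULL);

    start = read_counter();
    for (i = 0; i < iterations; i++)
        work();
    stop = read_counter();

    return stop - start;
}
```

On a platform where the header's rdtsc() returns unsigned long long, it could presumably be passed in place of clock_counter, and the iteration count raised until the loop runs for several seconds.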

Often, though, you will want to measure some part of an application that cannot be easily separated into a "micro-benchmark". The metric of choice for this is usually the average time spent passing through the region of interest, assuming the code runs repeatedly during the application workload. The "IAPerf.h" macros will require some modification to gather the average time over many passes through the code, but this should be a simple task for an experienced C programmer. For a GUI application without console output, the reporting mechanism will need to either write to the screen or to a file.
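
One possible shape for that modification is sketched below: accumulate the counter difference and a pass count on every trip through the region, then divide at reporting time. The perf_avg type and its functions are hypothetical and do not appear in Appendix A; the counter value is passed in by the caller, so the header's rdtsc() could supply it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator for the average time per pass. */
typedef struct {
    unsigned long long total_clocks;  /* sum of all measured intervals */
    unsigned long long passes;        /* number of completed intervals */
    unsigned long long start;         /* counter value at the last start */
} perf_avg;

void perf_avg_init(perf_avg *p)
{
    assert(p != NULL);
    p->total_clocks = 0;
    p->passes = 0;
    p->start = 0;
}

/* Call on entry to the region; 'now' is the current counter value. */
void perf_avg_start(perf_avg *p, unsigned long long now)
{
    p->start = now;
}

/* Call on exit from the region; adds one interval to the totals. */
void perf_avg_stop(perf_avg *p, unsigned long long now)
{
    p->total_clocks += now - p->start;
    p->passes++;
}

/* Average seconds per pass, or 0.0 if nothing has been measured. */
double perf_avg_seconds(const perf_avg *p, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    if (p->passes == 0)
        return 0.0;
    return ((double)p->total_clocks / (double)p->passes)
           / (clock_mhz * 1000000.0);
}
```

Keeping the totals in a structure rather than in macro-declared locals lets the data outlive a single function call, which is what a repeatedly executed region needs.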

Many more macros could be written to manage the timing data in different ways, but it would be difficult to provide a method that would suit every user. The intent of "IAPerf.h" is not to provide a black box solution for all performance measurement activities, but rather to abstract the task of reading the processor time-stamp counter in different environments.

Supported Environments
The environments supported by "IAPerf.h" are categorized by processor architecture, operating system, and compiler, as shown in Table 1. The macros may also work on UNIX* operating systems running on Intel Architecture, but proper results have not been verified.

Architecture                 Windows*                          Linux*
Intel® IA-32 architecture    Microsoft Visual C++* 6.0/7.0     GCC
                             Intel C/C++ Compiler 5.0/6.0      Intel C/C++ Compiler 5.0/6.0
Itanium® architecture        Microsoft Visual C++ 6.0/7.0      GCC
                             Intel C/C++ Compiler 5.0/6.0      Intel C/C++ Compiler 5.0/6.0

Table 1. Supported Environments.

Some general comments on the code that makes up "IAPerf.h":
  • The PERFSTART and PERFSTOP macros both call the rdtsc() function, which uses #define's to compile the code compatible with the targeted platform. It seems logical to reduce call/return overhead by inlining rdtsc(), but this may have adverse effects on IA-32 architecture where inline assembly is used to access the time-stamp counter. Certain compiler optimizations may be turned off whenever inline assembly is present in a function, and inlining the rdtsc() function will be equivalent to putting inline assembly in the code you are timing.
  • If you choose to edit the macros, make sure not to leave any extraneous tabs or spaces after the '\' line continuation characters. This is an easy mistake to make, yet very hard to debug if you have not seen this kind of error before.
  • The "volatile" keyword is critical for some compilers to indicate that the value of the time-stamp counter is constantly changing. For example, early versions of the Microsoft 64-bit compiler assumed that the value of the ar.itc register was static since the program never writes a new value into it. The compiler would read ar.itc once during PERFSTART and then reuse the same value for PERFSTOP, giving a result of zero seconds elapsed.
  • On IA-32 architecture, the CPUID instruction is inserted prior to reading the time-stamp counter. This instruction serializes the out-of-order execution engine so that all instructions ahead of CPUID retire before the time-stamp counter is read. Likewise, none of the instructions following CPUID will be executed until the CPUID instruction has retired. Serialization at such fine granularity is not that critical unless you are measuring a very short time window.
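
To illustrate the point about line continuation characters, here is a minimal sketch of a multi-line macro. DECLARE_TIMER and timer_seconds_demo are hypothetical names used only for this example; the important detail is that each backslash is the very last character on its line.

```c
#include <assert.h>

/* Hypothetical multi-line macro (not from the Appendix A header).
   A tab or space after any backslash would end the macro early and
   leave the following line as ordinary code. */
#define DECLARE_TIMER(name, mhz)                      \
    unsigned long long name##_clocks = 0;             \
    double name##_hz = (double)(mhz) * 1000000.0;

/* Uses the macro and returns the elapsed time for a fixed,
   made-up counter difference of 250,000,000 ticks at 500 MHz. */
double timer_seconds_demo(void)
{
    DECLARE_TIMER(demo, 500)

    demo_clocks = 250000000ULL;
    assert(demo_hz > 0.0);
    return (double)demo_clocks / demo_hz;
}
```

With the values above, timer_seconds_demo() returns 0.5.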
About the Author

Mike Stoner is a Senior Applications Engineer with Intel's Software Solutions Group. He has been with Intel since 1996, mainly in the role of helping independent software vendors develop optimized code for Intel platforms. Prior to joining Intel, Mike received Bachelors and Masters degrees in Electrical Engineering from The Ohio State University. He can be reached at michael.stoner@intel.com.

Appendix A - IAPerf.h code
 

 

 

 


// IAPERF.H
//

#include <stdio.h>

// RDTSC functions





#ifdef __linux__

    #ifdef __ia64__

        #ifdef __INTEL_COMPILER

            // Intel compiler on Itanium: read ar.itc through the __getReg intrinsic
            #ifdef __cplusplus
                extern "C" { unsigned __int64 __getReg(int whichReg); }
            #else
                unsigned __int64 __getReg(int whichReg);
            #endif

            #pragma intrinsic(__getReg)
            #define INL_REGID_APITC 3116

            unsigned __int64 rdtsc()
            {
                volatile unsigned __int64 temp;

                temp = __getReg(INL_REGID_APITC);

                return temp;
            }

        #else // __INTEL_COMPILER

            // Assume 64-bit gcc
            unsigned long rdtsc()
            {
                volatile unsigned long temp;

                __asm__ __volatile__("mov %0=ar.itc" : "=r"(temp) :: "memory");

                return temp;
            }

        #endif // __INTEL_COMPILER


    #else // __ia64__

        // Assume IA32, gcc or Intel Compiler
        unsigned long long rdtsc()
        {
            unsigned long long temp;

            // cpuid serializes execution before rdtsc reads the counter
            // into edx:eax; temp is written, so it is an output operand
            __asm__ __volatile__ (
                    "cpuid\n\t"
                    "rdtsc\n\t"
                    "leal %0, %%ecx\n\t"
                    "movl %%eax, (%%ecx)\n\t"
                    "movl %%edx, 4(%%ecx)\n\t"
                    : "=m" (temp)
                    :
                    : "%eax", "%ebx", "%ecx", "%edx");

            return temp;
        }

    #endif // __ia64__


#else // __linux__


    #ifdef WIN64

        // Assume Microsoft or Intel Compiler
        #ifdef __cplusplus
            extern "C" { unsigned __int64 __getReg(int whichReg); }
        #else
            unsigned __int64 __getReg(int whichReg);
        #endif

        #pragma intrinsic(__getReg)
        #define INL_REGID_APITC 3116

        unsigned __int64 rdtsc()
        {
            volatile unsigned __int64 temp;

            temp = __getReg(INL_REGID_APITC);

            return temp;
        }

    #else // WIN64

        // Assume WIN32 with Microsoft or Intel Compiler
        unsigned __int64 rdtsc()
        {
            volatile unsigned __int64 temp;

            _asm cpuid
            _asm rdtsc

            _asm lea ecx, temp
            _asm mov [ecx], eax
            _asm mov [ecx+4], edx

            return temp;
        }

    #endif // WIN64

#endif // __linux__





// The IAperf Macros

#ifdef __linux__

    #define PERFINITMHZ(clkspd)                                     \
        unsigned long long clocks;                                  \
        double clockspeed = (unsigned long long)clkspd * 1000000;

    #define PERFREPORT                                              \
        printf("time elapsed = %f sec\n", ((double)clocks)/clockspeed);

#else

    #define PERFINITMHZ(clkspd)                                     \
        unsigned __int64 clocks;                                    \
        double clockspeed = (double)1000000 * clkspd;

    #define PERFREPORT                                              \
        printf("time elapsed = %f sec\n", ((double)clocks)/clockspeed);

#endif

// PERFSTART stores the starting count; PERFSTOP replaces it with the elapsed count
#define PERFSTART clocks = rdtsc();

#define PERFSTOP clocks = rdtsc() - clocks;






