Intel VTune™ Performance Analyzer
Intel VTune Performance Analyzer is a powerful tool for analyzing the performance of an application.
The VTune analyzer has two main modes: sampling and call graph [20].
Sampling mode is a non-intrusive way to profile the entire system. In sampling
mode, statistics are rolled up to the application level. The user can then
delve down to the function level. The user can double-click on a function and
get instruction-level statistics as annotations in the source files [20].

Figure 15: VTune Performance Analyzer sampling mode output [20]
Call graph mode is more invasive and slows the program under test. It does,
however, enable a graphic view of calling sequences of the different procedures
within the program. It further identifies critical paths in the program.
Statistics are provided for each procedure, including "wait time,"
the amount of time the function spends waiting for an event to occur.

Figure 16: VTune Performance Analyzer call graph mode output [20]
Beyond CPU utilization and wait time, VTune Performance Analyzer collects important statistics such
as cache misses, clock ticks per instruction, and thread utilization.
Under Windows, another tool, VTune Performance Analyzer Tuning Assistant, makes recommendations on
improving the performance of certain functions.
Use of VTune Performance Analyzer can be thought of as a three-step process: set up, find an area to
improve, and improve. Finding an area to improve, given VTune analyzer's intuitive user
interface, usually takes no more than an hour.
VTune analyzer's output is function-centric. In some cases, it is useful to roll
function-level results up into higher-level module results. A simple parsing
utility can aggregate function-level statistics into module-level
statistics.
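As an illustration of such a parsing utility, here is a minimal sketch in Python. It assumes the function-level results have been exported as CSV text with `function` and `clockticks` columns, and that the project maintains its own function-to-module mapping. The column names, the mapping, and the sample numbers are all hypothetical; the actual export format depends on the VTune analyzer version and view used.

```python
import csv
import io
from collections import defaultdict

# Hypothetical function-to-module mapping maintained by the project.
FUNCTION_TO_MODULE = {
    "encode_frame": "codec",
    "motion_search": "codec",
    "send_packet": "rtp",
    "build_header": "rtp",
}

# Hypothetical export: one row per function with a sample count.
# Real profiler output will use different column names and values.
SAMPLE_EXPORT = """function,clockticks
encode_frame,5200
motion_search,8100
send_packet,1300
build_header,400
memcpy,1000
"""


def aggregate_by_module(rows, mapping):
    """Sum per-function clockticks into per-module totals.

    Functions missing from the mapping are collected under "other".
    """
    totals = defaultdict(int)
    for row in rows:
        module = mapping.get(row["function"], "other")
        totals[module] += int(row["clockticks"])
    return dict(totals)


rows = csv.DictReader(io.StringIO(SAMPLE_EXPORT))
module_totals = aggregate_by_module(rows, FUNCTION_TO_MODULE)
overall = sum(module_totals.values())

# Print modules from most to least expensive, with their share of the total.
for module, ticks in sorted(module_totals.items(), key=lambda kv: -kv[1]):
    print(f"{module:8s} {ticks:8d} {ticks / overall:6.1%}")
```

Collecting unmapped functions under "other" keeps the totals complete and shows how much of the profile the mapping does not yet cover.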
VTune Performance Analyzer should not be used as a primary tool for the measurement of performance.
VTune analyzer's role is in improving performance. However, VTune analyzer can be useful for
understanding the performance characteristics of one release over another. For
instance, suppose we upgrade a software release from A to B. If, as a matter of
course, we measure A and B with VTune analyzer, we can compare their performance
characteristics. Are we seeing the same relative performance of our code? Has
one module significantly decreased in performance? Does this make sense? While
use of this A-B sanity check is not strictly necessary, it can help rapidly
identify probable design or coding inefficiencies.
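The A-B sanity check can be sketched in a few lines of Python. The sketch assumes that module-level totals are already available for both releases and that both were measured under the same workload for the same duration, so the raw counts are comparable. The module names, the counts, and the 20% threshold are hypothetical.

```python
def compare_releases(a_totals, b_totals, threshold=0.20):
    """Return modules whose cost changed by more than `threshold` from A to B.

    The change is relative to release A: (B - A) / A. Modules that are
    absent from either release, or that have a zero count in A, are skipped.
    """
    flagged = {}
    for module in sorted(set(a_totals) & set(b_totals)):
        a_ticks = a_totals[module]
        if a_ticks == 0:
            continue
        change = (b_totals[module] - a_ticks) / a_ticks
        if abs(change) > threshold:
            flagged[module] = change
    return flagged


# Hypothetical module-level clocktick totals for two releases.
release_a = {"codec": 6000, "rtp": 2000, "signaling": 2000}
release_b = {"codec": 6300, "rtp": 4000, "signaling": 1900}

flagged = compare_releases(release_a, release_b)
for module, change in flagged.items():
    print(f"{module}: {change:+.0%} relative to release A")
```

With these sample numbers only the `rtp` module is flagged, which is the cue to ask whether the change makes sense for that release.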
VTune Performance Analyzer is an important tool for getting maximum performance out of the Intel
Architecture. Intel NetStructure® Host Media Processing Software, in particular,
benefits from regular inspection with VTune analyzer.
Intel C++ Compiler
The Intel C++ compiler is one of the software development tools available to
accelerate software performance on Intel platforms. It is available for a
number of different operating systems, including Windows and Linux. The Intel
C++ Compiler 9.0 for Linux provides outstanding application performance for
software running on Intel processors [25]. It includes advanced optimization
features such as full support for multi-core processors, with capabilities
including Auto-Parallelization, Optimized floating-point instruction
throughput, Interprocedural Optimization (IPO), Profile-Guided Optimization
(PGO), and Data prefetching. It includes a Compiler Code-Coverage tool and a
Compiler Test-Prioritization tool. Optimizations specific to the IA-32
architecture are provided, such as full support for Streaming SIMD Extensions 3
(SSE3), Automatic vectorizer, and Processor Dispatch, as well as support for
Intel Extended Memory 64 Technology. More details are available in "Intel
C++ Compiler for Linux" [25]. The compiler also includes an enhanced
debugger that allows debugging of optimized code, as well as support for stack
frame runtime error checking to help reduce buffer overrun security exploits.
Various case studies are presented in "Intel Software Tools Case
Studies" [27], highlighting performance improvements due to Intel software
tools. Table 9 highlights the performance improvement provided by the Intel
Compiler and Intel IPP in one particular case study: H.263 video encoding [26].
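Table 9: H.263 Encode Performance Improvement due to Intel Compiler and IPP [26]

The first two columns give the ImageCom PC encoder configuration; the last two give results on an Intel Pentium 4 processor-based system. A blank cell means the tool was not used.

| Intel IPP | Intel Compiler | Encode Time for 80-sec clip | Percentage Improvement |
|-----------|----------------|-----------------------------|------------------------|
|           |                | 123 sec                     | 0%                     |
| Y         |                | 84 sec                      | 32%                    |
|           | Y              | 69 sec                      | 44%                    |
| Y         | Y              | 57 sec                      | 54%                    |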
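
The percentages in Table 9 are consistent with measuring each configuration's encode time against the 123-second baseline. The short Python check below reproduces them from the reported times. The formula is an assumption inferred from the numbers, not one stated in the case study.

```python
BASELINE_SEC = 123  # encode time with neither Intel IPP nor the Intel Compiler

# Encode times for the 80-second clip, as reported in Table 9.
ENCODE_TIMES_SEC = {
    "Intel IPP only": 84,
    "Intel Compiler only": 69,
    "Intel IPP and Intel Compiler": 57,
}


def percent_improvement(baseline_sec, optimized_sec):
    """Reduction in encode time as a percentage of the baseline time."""
    return (baseline_sec - optimized_sec) / baseline_sec * 100


improvements = {
    name: round(percent_improvement(BASELINE_SEC, secs))
    for name, secs in ENCODE_TIMES_SEC.items()
}
for name, pct in improvements.items():
    print(f"{name}: {pct}%")
```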