Intel® Advisor User Guide

ID 766448
Date 12/16/2022


Amdahl's law: A theoretical formula for predicting the maximum performance benefit of parallelizing an application program. Amdahl's law states that the speedup in execution time is limited by the part of the program that is not parallelized (executes serially). To achieve results close to this potential, overhead must be minimized and all cores need to be fully utilized. See also Use Amdahl's Law and Measuring the Program.
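
As a worked illustration, the usual statement of Amdahl's law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of serial run time that can be parallelized and n is the number of cores. The sketch below evaluates that formula; the function and parameter names are chosen for this example only, and the result assumes zero parallel overhead and fully utilized cores.

```cpp
#include <cassert>

// Predicted speedup under Amdahl's law.
// parallel_fraction: share of the serial run time that can be parallelized (0..1).
// cores: number of cores, assumed fully utilized with no parallel overhead.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0 && cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, when 90% of the run time can be parallelized, the predicted speedup stays below 10 no matter how many cores are added, because the serial 10% always remains.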

annotation: A method of conveying information about proposed parallel execution. In the Intel® Advisor, you create annotations by adding macros or function calls. These annotations are used by Intel Advisor tools to predict parallel execution. For example, the C/C++ ANNOTATE_SITE_BEGIN(sitename) macro identifies where a parallel site begins. Later, to allow this code to execute in parallel, you replace the annotations with code needed to use a parallel framework. See also parallel framework and Annotation Types Summary.
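
A minimal sketch of an annotated loop might look like the following. The function, site name, and task name are hypothetical. The no-op macro definitions are stand-ins so the sketch compiles without Intel Advisor installed; in a real project the macros would come from the advisor-annotate.h header instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in no-op definitions (an assumption for this sketch only).
// With Intel Advisor installed, include advisor-annotate.h instead.
#ifndef ANNOTATE_SITE_BEGIN
#define ANNOTATE_SITE_BEGIN(site)
#define ANNOTATE_ITERATION_TASK(task)
#define ANNOTATE_SITE_END()
#endif

// Hypothetical function: scales every element. Iterations are independent,
// so each one is proposed as a task inside one parallel site.
void scale_all(std::vector<double>& values, double factor) {
    ANNOTATE_SITE_BEGIN(scale_site);          // proposed parallel site begins
    for (std::size_t i = 0; i < values.size(); ++i) {
        ANNOTATE_ITERATION_TASK(scale_task);  // each iteration is a proposed task
        values[i] *= factor;
    }
    ANNOTATE_SITE_END();                      // proposed parallel site ends
}
```

The annotations do not change how the program runs; the code still executes serially until the annotations are replaced with parallel framework code.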

atomic operation: An operation performed by a thread on one or more memory locations that is guaranteed not to be interfered with by other threads. See also synchronization.
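
As an illustration using the C++ standard library rather than any Intel Advisor API, the sketch below has several threads increment one shared counter through std::atomic. The helper name and its parameters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each add 1 to a shared counter
// increments_per_thread times and the final count is returned.
int count_with_atomics(int num_threads, int increments_per_thread) {
    assert(num_threads >= 0 && increments_per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1);  // indivisible read-modify-write
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return counter.load();
}
```

Because each fetch_add is atomic, no increment is lost even though all threads update the same memory location.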

chunking: The ability of a parallel framework to aggregate multiple instances of a task into groups for more efficient parallel processing. For tasks that do small amounts of computation and many iterations, task chunking can minimize task overhead. You can also restructure a single loop into an inner and outer loop (strip-mining). See also task and Enable Task Chunking.
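
The strip-mining restructuring mentioned above can be sketched as follows. The function name and chunk size parameter are hypothetical; the code runs serially and only shows the loop shape, where each pass of the outer loop is a candidate for one larger task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the data with a strip-mined loop: the outer loop walks over chunks,
// and the inner loop handles the iterations that belong to one chunk.
long long strip_mined_sum(const std::vector<int>& data, std::size_t chunk_size) {
    assert(chunk_size > 0);
    long long total = 0;
    for (std::size_t start = 0; start < data.size(); start += chunk_size) {
        const std::size_t end = std::min(start + chunk_size, data.size());
        long long chunk_total = 0;  // work that one chunk-sized task would do
        for (std::size_t i = start; i < end; ++i) {
            chunk_total += data[i];
        }
        total += chunk_total;
    }
    return total;
}
```

With a chunk size of 1000, a loop of one million small iterations becomes 1000 larger units of work, so any per-task overhead is paid far less often.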

code region: A subtree of loops/functions in a call tree. Synonym: whole loopnest.

critical section: A synchronization construct that allows only one thread to enter its associated code region at a time. Critical sections enforce mutual exclusion on enclosed regions of code. With Intel Advisor, mark critical sections by using ANNOTATE_LOCK_ACQUIRE() and ANNOTATE_LOCK_RELEASE() annotations.
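
For comparison, this is roughly what a critical section looks like once the code is parallel. The sketch uses std::mutex from the C++ standard library, not the Intel Advisor annotations, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: every thread appends its ids to one shared vector.
// The push_back is the critical section, so only one thread runs it at a time.
std::vector<int> collect_ids(int num_threads, int ids_per_thread) {
    assert(num_threads >= 0 && ids_per_thread >= 0);
    std::vector<int> shared_ids;
    std::mutex ids_mutex;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = 0; i < ids_per_thread; ++i) {
                std::lock_guard<std::mutex> guard(ids_mutex);  // enter critical section
                shared_ids.push_back(t * ids_per_thread + i);
            }  // guard is destroyed at the end of each iteration, releasing the lock
        });
    }
    for (auto& worker : workers) worker.join();
    return shared_ids;
}
```

The order of the ids in the result can differ from run to run, but every id appears exactly once because the append is mutually exclusive.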

data race: A condition in which multiple threads read and write a shared memory location and the program does not implement controls to manage the sequence of concurrent memory accesses. One thread can then inadvertently overwrite data written by another thread, or read stale data. This can produce execution errors that are difficult to detect and reproduce, such as obtaining different calculated results when the same executable is run on different systems. To prevent data races, you can add data synchronization constructs that restrict shared memory access to one thread at a time, or you might eliminate the sharing.
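
One way to eliminate the sharing is to give each thread a private accumulator and combine the results after all threads finish. The sketch below uses plain C++ standard library threads; the function name and the way the data is split are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums the data with num_threads threads and no shared accumulator.
// Each thread reads its own slice and writes only its own result slot.
long long race_free_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);  // one slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, t, chunk] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;  // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;   // no other thread touches this slot
        });
    }
    for (auto& worker : workers) worker.join();
    // All threads have finished, so reading the slots here is safe.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Had every thread added directly into a single shared total without synchronization, the updates could overwrite each other and the result would vary between runs.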

data parallelism: Occurs when a single portion of code is paired with multiple portions of data, and each pairing executes as a task. For example, tasks are made by pairing a loop body with each element of an array iterated by the loop, and the tasks execute in parallel. See also Task Patterns. Contrast task parallelism.

data set: A set of data used as input to an application or, for an interactive application, the way you interact with the application to cause a portion of it to execute. Because the Dependencies tool watches each memory access in a parallel site in great detail, the parallel site's code takes much longer to run than usual. To limit the time needed to run Dependencies analysis, reduce the data (such as the number of loop iterations) and, when using an interactive program, create a very small test case. See also Choose a Small, Representable Data Set for the Dependencies Tool.

deadlock: A situation where a set of threads have each acquired some locks and are waiting for other locks to be released. All threads in the set are waiting for a lock held by a different thread, and since none can proceed and release their lock(s), they all remain waiting.
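
A classic way to create this situation is for two threads to acquire the same two locks in opposite order. The sketch below shows one common way to avoid it with the C++ standard library: std::lock acquires both mutexes using a deadlock-avoidance algorithm. The Account type and both functions are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared object protected by its own lock.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Acquires both locks together, so two opposing transfers cannot each hold
// one lock while waiting for the other.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> from_guard(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> to_guard(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; returns the combined balance.
long run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If transfer instead locked from.m and then to.m one after the other, thread t1 could hold a's lock while t2 holds b's lock, and each would wait for the other indefinitely.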

dynamic extent: All code that may possibly be executed by a parallel site or task. For example, a dynamic extent might include a loop, all functions called from the loop, all functions the called functions may in turn call, and so on. Contrast static extent. See also Task Organization and Annotations.

false positive: When viewing the Dependencies Report, a problem reported by the Dependencies tool that is not an actual problem.

framework: See parallel framework

head: A loop or function at the top of a subtree, which contains one or more child loops/functions.

hotspot: A small code region that consumes much of the program's run time. Hotspots can be identified by a profiler, such as the Intel Advisor Survey tool. See also Use Amdahl's Law and Measuring the Program.

Intel® oneAPI Threading Building Blocks (oneTBB): A C++ template library for writing programs that take advantage of multiple cores. You can use this library to write scalable programs that specify tasks rather than threads, emphasize data parallel programming, and take advantage of concurrent collections and parallel algorithms. This is provided as an Intel® software product, Intel® oneAPI Threading Building Blocks (oneTBB), as well as open source. Intel® oneAPI Threading Building Blocks (oneTBB) is one of several parallel frameworks. Abbreviation: oneTBB.

load balancing: The equal division of work among cores. If the load is balanced, the cores are busy most of the time.

lock: A synchronization mechanism that allows one thread to wait until another thread allows it to continue. A lock can be used to synchronize threads accessing a specific memory location. See also synchronization and nested lock.

multi-core: A processor that combines two or more independent cores. Although each core shares the interconnection to the rest of the system, it executes instructions independently by using its dedicated CPU, architectural state, and interrupt controllers, as well as private and/or shared cache. Most multi-core systems use identical cores. The number of cores determines whether the system is called a dual-core (2), quad-core (4), or many-core system.

multithreaded processing: See parallel processing

mutual exclusion: A type of locking typically used to prevent actions from occurring at the same time. Abbreviation: mutex. See also synchronization.

nested lock: A type of lock that can be locked again by a task when the task already owns the lock. Nested locks are convenient when several inter-related functions use the same lock. See also synchronization and lock.
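
In standard C++, std::recursive_mutex behaves as a nested lock. The hypothetical Counter class below is a small sketch of the "several inter-related functions use the same lock" case: add_twice holds the lock and then calls add, which acquires it again.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical class whose member functions all use the same nested lock.
class Counter {
public:
    void add(int n) {
        std::lock_guard<std::recursive_mutex> guard(mutex_);
        value_ += n;
    }

    void add_twice(int n) {
        std::lock_guard<std::recursive_mutex> guard(mutex_);  // first acquisition
        add(n);  // re-acquires the lock this thread already owns
        add(n);
    }

    int value() {
        std::lock_guard<std::recursive_mutex> guard(mutex_);
        return value_;
    }

private:
    std::recursive_mutex mutex_;
    int value_ = 0;
};
```

With a plain std::mutex in place of std::recursive_mutex, the inner call to add would try to lock a mutex the thread already holds, which is undefined behavior and typically hangs.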

node: A loop or function.

oneTBB: See Intel® oneAPI Threading Building Blocks (oneTBB)

OpenMP*: A high-level parallel framework and language extension designed to support shared-memory parallel programming that consists of compiler directives (C/C++ pragmas and Fortran directives), library functions, and environment variables. The OpenMP specification was developed by multiple hardware and software vendors to provide a scalable, portable interface for parallel programming on a variety of platforms. OpenMP is one of several parallel frameworks. See also

parallel framework: A combination of libraries, language features, or other software techniques that enable code for a program to execute in parallel. Examples include OpenMP, Intel® oneAPI Threading Building Blocks (oneTBB), Message Passing Interface (MPI), Intel® Concurrent Collections for C/C++, Microsoft Task Parallel Library* (TPL), and low-level, basic threading APIs, like POSIX* threads (Pthreads). Some parallel frameworks support shared-memory parallel processing, while others like MPI support non-shared-memory parallel processing. See also Intel® oneAPI Threading Building Blocks (oneTBB) and Parallel Frameworks Overview.

parallel processing: The use of multiple threads during execution of a program. Intel Advisor focuses on parallel processing for shared-memory systems. There are other types of parallel processing, such as for clusters or grids and vector processing. Shortened version is parallelism. See also hotspot and thread.

parallel region: Offload Modeling term. A code region that starts with a specific parallel framework construct. The Intel® oneAPI Threading Building Blocks (oneTBB), Intel® oneAPI Data Analytics Library (oneDAL), OpenMP*, and SYCL parallel frameworks are supported.

parallel site: A region of code that contains tasks that can execute in parallel. See also annotation and Task Organization and Annotations.

pipeline: An approach to organizing task computations that uses both data parallelism and task parallelism, and organizes the computation into stages that run in a predetermined order.

self time: In the Survey Report window, how much time was spent in a particular function or loop.

site: See parallel site

shared-memory parallelism: See parallel processing

static extent: The code between a site's or a task's _BEGIN and _END annotations. A static extent might not be lexically paired; for example, a parallel site may have one _BEGIN point, but may require multiple independent _END exit points. Contrast with dynamic extent. See also annotation, parallel site, and Task Organization and Annotations.

synchronization: Coordinating the execution of multiple threads. In some cases, you can provide synchronization within a task by using a private memory location instead of a shared memory location. In other cases, a lock or mutex can be used to restrict access to shared data. See also Data Sharing Problem Types.

task: A portion of code and its data that can be given to a thread to execute. See also Task Organization and Annotations, Choosing the Tasks, and chunking.

task parallelism: Occurs when two different portions of the code are made into tasks and execute in parallel. For example, a task is made by pairing a display algorithm with the state to display, another task by pairing a compute-next-state algorithm with the same state, and the two tasks execute in parallel. See also Task Patterns. Contrast data parallelism.
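
The display and compute-next-state example can be sketched with std::async from the C++ standard library. All function and type names here are hypothetical, and the "display" task simply renders the state to a string. Both tasks only read the shared state, so they can run at the same time without synchronization.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical display task: renders the state as text.
std::string render(const std::vector<int>& state) {
    std::string frame;
    for (int cell : state) {
        frame += std::to_string(cell);
        frame += ' ';
    }
    return frame;
}

// Hypothetical compute-next-state task: returns a new state, leaving the input unchanged.
std::vector<int> next_state(const std::vector<int>& state) {
    std::vector<int> next(state);
    for (int& cell : next) cell += 1;
    return next;
}

struct StepResult {
    std::string frame;
    std::vector<int> next;
};

// Runs the two different tasks in parallel on the same (read-only) state.
StepResult step(const std::vector<int>& state) {
    auto display_task = std::async(std::launch::async, [&state] { return render(state); });
    auto compute_task = std::async(std::launch::async, [&state] { return next_state(state); });
    StepResult result;
    result.frame = display_task.get();
    result.next = compute_task.get();
    return result;
}
```

The two tasks run different code on the same data, which is the opposite arrangement from data parallelism, where the same code runs on different data.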

thread: A thread executes instructions within a process. Each process has one or more threads active at a time. Threads share the address space of the process, but have their own stack, program counters, and other registers.

total time: In the Survey Report window, how much time was spent in a particular function or loop, plus the time spent by anything that entity calls.
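
The relationship between self time and total time can be sketched on a small call tree. The Node type below is hypothetical and is not how Intel Advisor stores its data; it only shows that total time is a node's self time plus the total time of everything it calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical call-tree node: a loop or function with its measured self time.
struct Node {
    double self_time;            // seconds spent in this node's own code
    std::vector<Node> children;  // loops/functions called from this node
};

// Total time is the node's self time plus the total time of all its callees.
double total_time(const Node& node) {
    double total = node.self_time;
    for (const Node& child : node.children) {
        total += total_time(child);
    }
    return total;
}
```

For a head node with 0.5 seconds of self time and two children with 1.0 and 2.0 seconds, the total time of the head is 3.5 seconds.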

vector processing: A form of parallel processing where multiple data items are packed together in vector registers to allow vector instructions to operate on the packed data with a single instruction. Reducing the number of instructions needed to process the packed vector data minimizes memory use and latency, and provides good locality of reference and data cache utilization. Vector instructions are Single Instruction Multiple Data (SIMD) instructions. Some SIMD vector instructions support large register sizes to accommodate more packed data, such as Intel® Advanced Vector Extensions (Intel® AVX).