Development Reference Guides

Contents

Worksharing Using OpenMP*

To get the maximum performance benefit from a processor with multi-core and Intel® Hyper-Threading Technology (Intel® HT Technology), an application needs to be executed in parallel. Parallel execution requires threads, and threading an application is not a simple thing to do; using OpenMP* can make the process a lot easier. Using the OpenMP pragmas, most loops with no loop-carried dependencies can be threaded with one simple statement. This topic explains how to start using OpenMP to parallelize loops, which is also called worksharing.
Options that use OpenMP are available for both Intel® and non-Intel microprocessors, but these options may perform additional optimizations on Intel® microprocessors than they perform on non-Intel microprocessors. The list of major, user-visible OpenMP constructs and features that may perform differently on Intel® microprocessors than on non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.
Most loops can be threaded by inserting one pragma immediately prior to the loop. Further, by leaving the details to the
Intel® oneAPI
DPC++/C++
Compiler
and OpenMP, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for maximum performance. The maximum performance of OpenMP is realized when it is used to thread hotspots, the most time-consuming loops in your application.
The power and simplicity of OpenMP is demonstrated by looking at an example. The following loop converts a 32-bit RGB (red, green, blue) pixel to an 8-bit gray-scale pixel. One pragma, which has been inserted immediately before the loop, is all that is needed for parallel execution.
#pragma omp parallel for for (i=0; i < numPixels; i++) { pGrayScaleBitmap[i] = (unsigned BYTE) (pRGBBitmap[i].red * 0.299 + pRGBBitmap[i].green * 0.587 + pRGBBitmap[i].blue * 0.114); }
First, the example uses worksharing, which is the general term used in OpenMP to describe distribution of work across threads. When worksharing is used with the
for
construct, as shown in the example, the iterations of the loop are distributed among multiple threads so that each loop iteration is executed exactly once with different iterations executing if there is more than one available threads. The
for
construct on its own only distributes the loop iterations among existing threads. The example uses a
parallel for
construct, which combines
parallel
and
for
constructs to first create a team of threads and then distribute the loop iterations among the threads. Since there is no explicit
num_threads
clause, OpenMP determines the number of threads to create and how to best create, synchronize, and destroy them. OpenMP places the following five restrictions on which loops can be threaded:
  • The loop variable must be of type signed or unsigned integer, random access iterator, or pointer.
  • The comparison operation must be in the form
    loop_variable <, <=, >, >=, or != loop_invariant_expression
    of a compatible type.
  • The third expression or increment portion of the
    for
    loop must be either addition or subtraction by a loop invariant value.
  • If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
  • The loop body must be single-entry-single-exit, meaning no jumps are permitted from inside to outside the loop, with the exception of the
    exit
    statement that terminates the whole application. If the statements
    goto
    or
    break
    are used, the statements must jump within the loop, not outside it. Similarly, for exception handling, exceptions must be caught within the loop.
Although these restrictions might sound somewhat limiting, non-conforming loops can frequently be rewritten to follow these restrictions.

Basics of Compilation

Using the OpenMP pragmas requires an OpenMP-compatible compiler and thread-safe libraries. Adding the
/Qopenmp
(Windows*) or
-qopenmp
(Linux*) option to the compiler instructs the compiler to pay attention to the OpenMP pragmas and to generate multi-threaded code. If you omit the
/Qopenmp
(Windows) or
-qopenmp
(Linux) option, the compiler will ignore OpenMP pragmas, which provides a very simple way to generate a single-threaded version without changing any source code. To compile programs containing target and related constructs for offloading to a GPU, the
-fopenmp-targets=spir64
and
/Qopenmp-targets:spir64
flags are needed on Linux and Windows respectively.
For conditional compilation, the compiler defines the
_OPENMP
macro. If needed, the macro can be tested as shown in the following example.
#ifdef _OPENMP fn(); #endif

A Few Simple Examples

The following examples illustrate how simple OpenMP is to use. In common practice, additional issues need to be addressed, but these examples illustrate a good starting point.
In the first example, the loop clips an array to the range from 0 to 255.
// clip an array to 0 <= x <= 255 for (i=0; i < numElements; i++) { if (array[i] < 0) array[i] = 0; else if (array[i] > 255) array[i] = 255; }
You can thread it using a single OpenMP pragma; insert the pragma immediately prior to the loop:
#pragma omp parallel for for (i=0; i < numElements; i++) { if (array[i] < 0) array[i] = 0; else if (array[i] > 255) array[i] = 255; }
In the second example, the loop generates a table of square roots for the numbers from 0 to 100.
double value; double roots[100]; for (value = 0.0; value < 100.0; value ++) { roots[(int)value] = sqrt(value); }
Thread the loop by changing the loop variable to a signed integer or unsigned integer and inserting a
#pragma omp parallel for
pragma.
int value; double roots[100]; #pragma omp parallel for for (value = 0; value < 100; value ++) { roots[value] = sqrt((double)value); }

Avoid Data Dependencies and Race Conditions

When a loop meets all five loop restrictions (listed above) and the compiler threads the loop, the loop still might not work correctly due to the existence of data dependencies.
Data dependencies exist when different iterations of a loop (more specifically a loop iteration that is executed on a different thread) read or write the same location in shared memory. Consider the following example that calculates factorials.
// Each loop iteration writes a value that a different iteration reads. #pragma omp parallel for for (i=2; i < 10; i++) { factorial[i] = i * factorial[i-1]; }
The compiler will thread this loop, but the threading will fail because at least one of the loop iterations is data-dependent upon a different iteration. This situation is referred to as a race condition. Race conditions can only occur when using shared resources (like memory) and parallel execution. To address this problem either rewrite the loop or pick a different algorithm, one that does not contain the race condition.
Race conditions are difficult to detect because, for a given case or system, the threads might win the race in the order that happens to make the program function correctly. Because a program works once does not mean that the program will work under all conditions. Testing your program on various machines, some with Intel® Hyper-Threading Technology and some with multiple physical processors, is a good starting point to help identify race conditions.
Traditional debuggers are useless for detecting race conditions because they cause one thread to stop the race while the other threads continue to significantly change the runtime behavior; however, thread checking tools can help.

Manage Shared and Private Data

Nearly every loop (in real applications) reads from or writes to memory; it's your responsibility, as the developer, to instruct the compiler what memory should be shared among the threads and what memory should be kept private. When memory is identified as shared, all threads access the same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop ends, the private copies are destroyed. By default, all variables are shared except for the loop variable, which is private.
Memory can be declared as private in two ways:
  • Declare the variable inside the loop-really inside the parallel OpenMP pragma-without the static keyword.
  • Specify the private clause on an OpenMP pragma.
The following loop fails to function correctly because the variable
temp
is shared. It should be private.
// Variable temp is shared among all threads, so while one thread // is reading variable temp another thread might be writing to it #pragma omp parallel for for (i=0; i < 100; i++) { temp = array[i]; array[i] = do_something(temp); }
The following two examples both declare the variable
temp
as private memory, which solves the problem.
#pragma omp parallel for for (i=0; i < 100; i++) { int temp; // variables declared within a parallel construct // are, by definition, private temp = array[i]; array[i] = do_something(temp); }
The
temp
variable can also be made private in the following way:
#pragma omp parallel for private(temp) for (i=0; i < 100; i++) { temp = array[i]; array[i] = do_something(temp); }
Every time you use OpenMP to parallelize a loop, you should carefully examine all memory references, including the references made by called functions. Variables declared within a parallel construct are defined as private except when they are declared with the
static
declarator, because static variables are not allocated on the stack.

Reductions

Loops that accumulate a value are fairly common, and OpenMP has a specific clause to accommodate them. Consider the following loop that calculates the sum of an array of integers.
sum = 0; for (i=0; i < 100; i++) { sum += array[i]; // this variable needs to be shared to generate // the correct results, but private to avoid // race conditions from parallel execution }
The variable sum in the previous loop must be shared to generate the correct result, but it also must be private to permit access by multiple threads. OpenMP provides the
reduction
clause that is used to efficiently combine the mathematical reduction of one or more variables in a loop. The following example demonstrates how the loop can use the
reduction
clause to generate the correct results.
sum = 0; #pragma omp parallel for reduction(+:sum) for (i=0; i < 100; i++) { sum += array[i]; }
In the case of the example listed above, the reduction provides private copies of the variable
sum
for each thread, and when the threads exit, it adds the values together and places the result in the one global copy of the variable.
The following table lists the possible reduction operations, along with their initial values (mathematical identity values).
Operation
private
Variable Initialization Value
+ (addition)
0
- (subtraction)
0
* (multiplication)
1
& (bitwise and)
~0
| (bitwise or)
0
^ (bitwise exclusive or)
0
&& (conditional and)
1
|| (conditional or)
0
Multiple reductions in a loop are possible by specifying comma-separated variables and operations on a given
parallel
construct. Reduction variables must meet the following requirements:
  • can be listed in just one reduction.
  • cannot be declared constant.
  • cannot be declared private in the
    parallel
    construct.

Load Balancing and Loop Scheduling

Load balancing, the equal division of work among threads, is among the most important attributes for parallel application performance. Load balancing is extremely important, because it ensures that the processors are busy most, if not all, of the time. Without a balanced load, some threads may finish significantly before others, leaving processor resources idle and wasting performance opportunities.
Within loop constructs, poor load balancing is often caused by variations in compute time among loop iterations. It is usually easy to determine the variability of loop iteration compute time by examining the source code. In most cases, you will see that loop iterations consume a uniform amount of time. When that is not true, it may be possible to find a set of iterations that consume similar amounts of time. For example, sometimes the set of all even iterations consumes about as much time as the set of all odd iterations. Similarly, it might be the case that the set of the first half of the loop consumes about as much time as the second half. In contrast, it might be impossible to find sets of loop iterations that have a uniform execution time. Regardless of the case, you should provide this extra loop scheduling information to OpenMP so it can better distribute the iterations of the loop across the threads (and therefore processors) for optimum load balancing.
If you know that all loop iterations consume roughly the same amount of time, the OpenMP
schedule
clause should be used to distribute the iterations of the loop among the threads in roughly equal amounts via the scheduling policy. In addition, you need to minimize the chances of memory conflicts that may arise because of false sharing due to using large chunks. This behavior is possible because loops generally touch memory sequentially, so splitting up the loop in large chunks— like the first half and second half when using two threads— will result in the least chance for overlapping memory. While this may be the best choice for memory issues, it may be bad for load balancing. Unfortunately, the reverse is also true; what might be best for load balancing may be bad for memory performance. You must strike a balance between optimal memory usage and optimal load balancing by measuring the performance to see what method produces the best results.
Use the following general form on the
parallel
construct to schedule an OpenMP loop:
#pragma omp parallel for schedule(kind [, chunk size])
Four different loop scheduling types (kinds) can be provided to OpenMP, as shown in the following table. The optional parameter (chunk), when specified, must be a positive integer.
Kind
Description
static
Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, chunk size is
loop_count
/
number_of_threads
.
Set chunk to 1 to interleave the iterations.
dynamic
Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue.
By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.
guided
Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies them minimum size chunk to use.
By default the chunk size is approximately
loop_count
/
number_of_threads
.
auto
When schedule (auto) is specified, the decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.
runtime
Uses the
OMP_SCHEDULE
environment
variable to specify which one of the three loop-scheduling types should be used.
OMP_SCHEDULE
is a string formatted exactly the same as would appear on the parallel construct.
Assume that you want to parallelize the following loop.
for (i=0; i < NumElements; i++) { array[i] = StartVal; StartVal++; }
As written, the loop contains a data dependency, making it impossible to parallelize without a change. The new loop, shown below, fills the array in the same manner, but without data dependencies. The new loop benefits from using the SIMD instructions generated by the compiler.
#pragma omp parallel for for (i=0; i < NumElements; i++) { array[i] = StartVal + i; }
Observe that the code is not 100% identical because the value of variable
StartVal
is not incremented. As a result, when the parallel loop is finished, the variable will have a value different from the one produced by the serial version. If the value of
StartVal
is needed after the loop, the additional statement, shown below, is needed.
// This works and is identical to the serial version. #pragma omp parallel for for (i=0; i < NumElements; i++) { array[i] = StartVal + i; } StartVal += NumElements;

OpenMP Tasking Model

The OpenMP tasking model enables parallelization of a large range of applications. A task is an instance of executable code and its data environment that can be scheduled for execution by threads.
The
task
Construct
The
task
construct defines an explicit
task
region as shown in the following example:
void test1(LIST *head) { #pragma omp parallel shared(head) { #pragma omp single { LIST *p = head; while (p != NULL) { #pragma omp task firstprivate(p) { do_work1(p); } p = p->next; } } }
The binding thread set of the
task
region is the current parallel team. A
task
region binds to the innermost enclosing
parallel
region. When a thread encounters a
task
construct, a task is generated from the structured block enclosed in the construct. The encountering thread may immediately execute the task, or defer its execution. A
task
construct may be nested inside an outer task, but the
task
region of the inner task is not a part of the
task
region of the outer task.
Use Clauses with the
task
Construct
The
task
construct can take optional clauses. The data environment of the task is created according to the data-sharing attribute clauses on the
task
construct and any defaults that apply. The example below shows a way to generate
N
tasks with one thread and execute the generated tasks with the threads in the parallel team:
double data[N]; int i; #pragma omp parallel shared(data) { #pragma omp single private(i) { for (i=0, i<N; i++) { #pragma omp task firstprivate(i) shared(data)) { do_work(data, i); } } } }
Task Scheduling
When a thread reaches a task scheduling point, it may perform a task switch, suspending the current task and beginning or resuming execution of a different task bound to the current team. Refer to the OpenMP 5.1 specifications for the full list of task scheduling point locations. Some examples include:
  • the point where a task is explicitly generated.
  • the point immediately following the generation of an explicit task.
  • after the last instruction of a
    task
    region.
  • in a
    taskwait
    region.
  • in a
    taskyield
    region.
  • in implicit and explicit barrier regions.
Task scheduling points dynamically divide
task
regions into parts. Each part is executed from start to finish without interruption. Different parts of the same
task
region are executed in the order in which they are encountered. In the absence of task synchronization constructs, the order in which a thread executes parts of different schedulable tasks is unspecified. A correct program must behave correctly and consistently with all conceivable scheduling sequences.
The
taskwait
Construct
The
taskwait
construct specifies a wait on the completion of child tasks generated since the beginning of the current task. A
taskwait
region binds to the current
task
region. The binding thread set of the
taskwait
region is the current team.
The
taskwait
region includes an implicit task scheduling point in the current
task
region. The current
task
region is suspended at the task scheduling point until execution of all its child tasks generated before the
taskwait
region is completed.
#pragma omp task // TASK1 { ... #pragma omp task // TASK 2 (child of TASK1) { do_work1(); } #pragma omp task // TASK3 (child of TASK 1) { ... #pragma omp task // TASK4 (child of TASK3, not TASK1) { do_work2(); } ... } #pragma omp taskwait // suspend TASK1; wait for TASK2 and TASK3 to complete ... }
The
taskyield
Construct
The
taskyield
construct specifies that the current task can be suspended at that point and the thread may switch to the execution of a different task. You can use this construct to provide an explicit task scheduling point at a particular point in the task.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.