Exploit Nested Parallelism with OpenMP* Tasking Model

Published: 11/07/2017  

Last Updated: 11/07/2017

By Yuan Chen

The new-generation Intel® Xeon® processor Scalable family (formerly code-named Skylake-SP), Intel's most scalable processor to date, has up to 28 processor cores per socket with options to scale from 2 to 8 sockets. The Intel® Xeon Phi™ processor provides massive parallelism with up to 72 cores per package. Hardware keeps introducing more parallelism, and software must exploit it.

However, this is not always easy: there may not be enough parallel tasks, temporary memory use can balloon as the thread count grows, and workloads can be imbalanced. In these cases, nested parallelism can help scale the number of parallel tasks across multiple levels. It can also curb temporary-space explosion by sharing memory and parallelizing at a given level while enclosed by another parallel region.

There are two ways to enable nested parallelism with OpenMP*. One is explicitly documented in the OpenMP specification: set the OMP_NESTED environment variable or call the omp_set_nested runtime routine. There are good examples and explanations of this topic in the online tutorial OpenMP Lab on Nested Parallelism and Task.

The other is the OpenMP tasking model. Compared to the worksharing constructs in OpenMP, the tasking constructs provide more flexibility in supporting various kinds of parallelism: a task construct can be nested inside a parallel region, inside other task constructs, or inside worksharing constructs. With the introduction of taskloop reduction and taskgroup, this becomes even more useful.

Here we use an example to demonstrate how to apply nested parallelism in different ways.

void fun1()
{
    for (int i=0; i<80; i++)
        ...
}


int main(void)
{
#pragma omp parallel
   {
#pragma omp for 
       for (int i=0; i<100; i++)
           ...

#pragma omp for
       for (int i=0; i<10; i++)
           fun1();
    } 
}

In the above example, the second loop in main has a small trip count: with omp for it can be distributed across at most 10 threads. However, fun1, which the main loop calls 10 times, contains 80 loop iterations of its own. The product of the two trip counts yields 800 iterations in total! This offers much more parallelism potential if both levels can be parallelized.

Here is how nested parallel regions work:

void fun1()
{
#pragma omp parallel for
    for (int i=0; i<80; i++)
        ...
}

int main(void)
{
#pragma omp parallel
    {
        #pragma omp for
        for (int i=0; i<100; i++)
            ...

        #pragma omp for 
        for (int i=0; i<10; i++)
            fun1();
    }
}

The problem with this implementation is that you either have too few threads for the first main loop, which has the larger trip count, or create an explosion of threads for the second main loop when OMP_NESTED=TRUE. The simple solution is to split the parallel region in main and create a separate one for each loop, each with a distinct thread count specified.

In contrast, here's how omp tasking works:

void fun1()
{
#pragma omp taskloop
     for (int i = 0; i < 80; i++)
         ...
}

int main(void)
{
#pragma omp parallel
    {
#pragma omp for
        for (int i=0; i<100; i++)
            ...
       
#pragma omp for
        for (int i=0; i<10; i++)
            fun1();
    } 
}

As you can see, you no longer have to worry about the thread count changing between the first and second main loops. Even though the second main loop has only a small number (10) of iterations to hand out, the remaining available threads can execute the tasks generated by omp taskloop in fun1.

In general, OpenMP nested parallel regions distribute work by creating (forking) more threads. In OpenMP, the parallel region is the only construct that determines the number of executing threads and controls thread affinity. With nested parallel regions, each thread in the parent region yields multiple threads in the enclosed regions, which multiplies the total thread count.

OpenMP tasking offers another way to exploit parallelism: by adding more tasks instead of more threads. Although the thread count stays as specified at the entry of the parallel region, the additional tasks generated by nested tasking constructs can be distributed to and executed by any available or idle threads in the current team of the same parallel region. This makes full use of every thread's capacity and improves load balance automatically.

With the introduction of omp taskloop, omp taskloop reduction, omp taskgroup, and omp taskgroup reduction, the OpenMP tasking model becomes an even more powerful solution for nested parallelism. For more details on these new features in the OpenMP 5.0 TR, please refer to OpenMP* 5.0 support in Intel® Compiler 18.0.

Please note that there is a known issue with nested parallelism and the reduction clause in the initial 18.0 release. The fix is expected in 2018 Update 1, which will be available soon.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.