Programming Principles
To achieve good performance with implicit scaling, cross-stack memory accesses must be minimized, but it is not required to eliminate all cross-stack accesses. A certain amount of cross-stack traffic can be handled by the stack-to-stack interconnect if it is performed concurrently with local memory accesses. For memory-bandwidth-bound workloads, the amount of acceptable cross-stack access is determined by the ratio of local memory bandwidth to cross-stack bandwidth (see Cross-Stack Traffic).
The following principles should be embraced by workloads that use implicit scaling:
- Kernels must have enough work-items to utilize both stacks.
  - The minimum number of work-items needed to utilize both stacks is: <number of VEs> * <hardware-threads per VE> * <SIMD width>.
  - A 2-stack Intel® Data Center GPU Max with 1024 VEs, 8 hardware threads per VE, and SIMD32 therefore requires at least 262,144 work-items (see the sketch after this list).
- Device time must dominate runtime to observe whole-application scaling.
- Minimize cross-stack memory accesses by exploiting locality in the algorithm.
- The slowest moving dimension should be large to avoid stack load imbalance.
- Cross-stack memory accesses and local memory accesses should be interleaved.
- Avoid stride-1 memory access in the slowest moving dimension for 2D and 3D kernel launches.
- If the memory access pattern changes dynamically over time, perform a sorting step every nth iteration to minimize cross-stack memory accesses.
- Don't use a memory pool based on a single allocation (see Memory Coloring).
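For concreteness, the minimum work-item count from the first principle can be computed as in the small sketch below. The device values (1024 VEs, 8 hardware threads per VE, SIMD32) are the ones used in the example above; adjust them for other devices.
#include <cstdint>
#include <iostream>
int main() {
    // Device properties used in the example above (2-stack Intel(R) Data Center GPU Max):
    const std::int64_t num_ve            = 1024; // vector engines across both stacks
    const std::int64_t hw_threads_per_ve = 8;    // hardware threads per VE
    const std::int64_t simd_width        = 32;   // SIMD width
    // Minimum number of work-items needed to utilize both stacks.
    const std::int64_t min_work_items = num_ve * hw_threads_per_ve * simd_width;
    std::cout << "minimum work-items: " << min_work_items << std::endl; // prints 262144
    return 0;
}
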
Many applications naturally have a concept of locality. These applications are expected to be a good fit for implicit scaling due to their low cross-stack traffic. To illustrate this concept, let's use a stencil kernel as an example. A stencil operates on a grid that can be divided into blocks where the majority of stencil computations within a block use stack-local data. Only stencil operations at the border of a block require data from another block, i.e., from another stack. The amount of these cross-stack/cross-border accesses is suppressed by the halo-to-local-volume ratio. This concept is illustrated by the sketch below.

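The following minimal OpenMP offload stencil is an illustration only; the array names, grid sizes, and the 5-point update are assumptions, not taken from a specific application. The slowest moving dimension iy is large and gets partitioned across the two stacks by implicit scaling, so only rows adjacent to the partition boundary generate cross-stack accesses, while the stride-1 accesses along ix stay stack-local.
// Minimal 2D 5-point stencil sketch (illustrative names and sizes).
// Implicit scaling partitions the slowest moving dimension iy across the
// two stacks; only rows next to the partition boundary touch the other
// stack's memory, so cross-stack traffic scales with the halo size.
void stencil_step( const double *in, double *out, int nx, int ny ) {
    #pragma omp target teams distribute parallel for simd collapse(2) \
        map( to:in[0:nx*ny] ) map( tofrom:out[0:nx*ny] )
    for ( int iy = 1; iy < ny - 1; ++iy ) {      // slow dimension: large, split across stacks
        for ( int ix = 1; ix < nx - 1; ++ix ) {  // fast dimension: stride-1, stack-local
            out[iy*nx + ix] = 0.25 * ( in[(iy - 1)*nx + ix]
                                     + in[(iy + 1)*nx + ix]
                                     + in[iy*nx + ix - 1]
                                     + in[iy*nx + ix + 1] );
        }
    }
}
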
Cross-Stack Traffic
As mentioned in the last section, it is crucial to minimize cross-stack traffic. To guide how much traffic can be tolerated without significantly impacting application performance, we can benchmark a STREAM kernel with varying amounts of cross-stack traffic and compare it to stack-local STREAM performance. The worst case is 100% cross-stack traffic, which is generated by reversing the loop order in the STREAM kernel (see STREAM):
#pragma omp target teams distribute parallel for simd
for (int i = N - 1; i >= 0; --i)
{
    c[i] = a[i] + b[i];
}
 
Here, each stack has 100% cross-stack memory traffic, as work-groups on stack-0 access array elements N-1 to N/2, which are located in stack-1 memory. This kernel essentially benchmarks stack-to-stack bi-directional bandwidth. The approach can be generalized to interpolate between 0% and 100% cross-stack accesses with the modified STREAM below:
// Code for cross stack stream
#include <iostream>
#include <cstdlib>
#include <omp.h>
// compile via:
// icpx -O2 -fiopenmp -fopenmp-targets=spir64 ./stream_cross_stack.cpp
// run via:
// EnableWalkerPartition=1 ZE_AFFINITY_MASK=0 ./a.out
template <int cross_stack_fraction>
void cross_stack_stream() {
    constexpr int64_t size = 256*1e6;
    constexpr int64_t bytes = size * sizeof(int64_t);
    int64_t *a = static_cast<int64_t*>(malloc( bytes ));
    int64_t *b = static_cast<int64_t*>(malloc( bytes ));
    int64_t *c = static_cast<int64_t*>(malloc( bytes ));
    #pragma omp target enter data map( alloc:a[0:size] )
    #pragma omp target enter data map( alloc:b[0:size] )
    #pragma omp target enter data map( alloc:c[0:size] )
    for ( int i = 0; i < size; ++i ) {
        a[i] = i + 1;
        b[i] = i - 1;
        c[i] = 0;
    }
    #pragma omp target update to( a[0:size] )
    #pragma omp target update to( b[0:size] )
    #pragma omp target update to( c[0:size] )
    const int num_max_rep = 100;
    double time;
    for ( int irep = 0; irep < num_max_rep+10; ++irep ) {
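        // the first 10 iterations are warm-up; timing starts at irep == 10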
        if ( irep == 10 ) time = omp_get_wtime();
        #pragma omp target teams distribute parallel for simd
        for ( int j = 0; j < size; ++j ) {
            const int cache_line_id = j / 16;
            int i;
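            // every cross_stack_fraction'th cache line is offset by half the array and hence accessed cross-stack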
            if ( (cache_line_id%cross_stack_fraction) == 0 ) {
                i = (j+size/2)%size;
            }
            else {
                i = j;
            }
            c[i] = a[i] + b[i];
        }
    }
    time = omp_get_wtime() - time;
    time = time/num_max_rep;
    #pragma omp target update from( c[0:size] )
    for ( int i = 0; i < size; ++i ) {
        if ( c[i] != 2*i ) {
            std::cout << "wrong results!" << std::endl;
            exit(1);
        }
    }
    const int64_t streamed_bytes = 3 * size * sizeof(int64_t);
    std::cout << "cross_stack_percent = " << (1/(double)cross_stack_fraction)*100
              << "%, bandwidth = " << (streamed_bytes/time) * 1E-9 << " GB/s" << std::endl;
    #pragma omp target exit data map( release:a[0:size] )
    #pragma omp target exit data map( release:b[0:size] )
    #pragma omp target exit data map( release:c[0:size] )
    free(a);
    free(b);
    free(c);
}
int main() {
    cross_stack_stream< 1>();
    cross_stack_stream< 2>();
    cross_stack_stream< 4>();
    cross_stack_stream< 8>();
    cross_stack_stream<16>();
    cross_stack_stream<32>();
}
 
The kernel in the timed loop above accesses every cross_stack_fraction'th cache line cross-stack by offsetting the array access with (j+size/2)%size. For cross_stack_fraction==1, we generate 100% cross-stack memory accesses. By doubling cross_stack_fraction, we decrease the cross-stack traffic by a factor of 2. Note that this kernel is written such that cross-stack and local memory accesses are interleaved within work-groups to maximize hardware utilization. Measured performance on a 2-stack Intel® Data Center GPU Max with 2 GB arrays is shown below:
| Partial cross-stack STREAM bandwidth [GB/s] | cross_stack_fraction | % of cross-stack accesses | % of max local 2-stack STREAM bandwidth |
|---|---|---|---|
| 355 | 1 | 100% | 17% |
| 696 | 2 | 50% | 33% |
| 1223 | 4 | 25% | 58% |
| 1450 | 8 | 12.5% | 69% |
| 1848 | 16 | 6.25% | 87% |
| 2108 | 32 | 3.125% | 99% |
As can be seen in the table above, applications should try to limit cross-stack traffic to less than 10% of all memory traffic to avoid a significant drop in sustained memory bandwidth. For STREAM with 12.5% cross-stack accesses, we measure about 69% of the bandwidth of the local STREAM benchmark. These numbers can be used to estimate the impact of cross-stack memory accesses on the execution time of application kernels.
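One simple way to use these numbers is to estimate a kernel's sustained bandwidth from its cross-stack fraction by interpolating between the measured STREAM points. The sketch below assumes linear interpolation between neighboring table entries; it is a first-order estimate, not part of the measurement itself.
#include <cstddef>
#include <iostream>
// Rough first-order estimate of sustained bandwidth for a given fraction of
// cross-stack accesses, by linear interpolation between the measured STREAM
// points from the table above. The interpolation model is an assumption.
struct Point { double cross_stack_fraction; double bandwidth_gbs; };
double estimated_bandwidth_gbs( double fraction ) {
    const Point points[] = { {0.03125, 2108.0}, {0.0625, 1848.0}, {0.125, 1450.0},
                             {0.25, 1223.0}, {0.5, 696.0}, {1.0, 355.0} };
    const std::size_t n = sizeof(points) / sizeof(points[0]);
    if ( fraction <= points[0].cross_stack_fraction ) return points[0].bandwidth_gbs;
    for ( std::size_t k = 1; k < n; ++k ) {
        if ( fraction <= points[k].cross_stack_fraction ) {
            const double x0 = points[k-1].cross_stack_fraction, y0 = points[k-1].bandwidth_gbs;
            const double x1 = points[k].cross_stack_fraction,   y1 = points[k].bandwidth_gbs;
            return y0 + (fraction - x0) / (x1 - x0) * (y1 - y0); // interpolate between neighbors
        }
    }
    return points[n-1].bandwidth_gbs;
}
int main() {
    // Example: a memory-bandwidth-bound kernel with ~10% cross-stack traffic.
    std::cout << estimated_bandwidth_gbs(0.10) << " GB/s (estimated)" << std::endl;
}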