Registerization and Avoid Register Spills
Registers and Performance
Register is the fastest storage in the
memory hierarchy. Keeping data in registers as long as possible
is critical to performance. However, register space is
limited and much smaller than memory space. The current generation of
Intel
®
GPUs, for example, has 128 general-purpose registers, each
32 bytes wide by default for each XVE thread. Though the compiler
aims to assign as many variables to registers as possible, the
limited number of registers can be allocated only to a small set
of variables at any point during execution. A given register can
hold different variables at different times because different sets
of variables are needed at different times. If there are not enough
registers to hold all the variables, register can spill, or some
variables currently in the registers can be moved to memory to make
room for other variables.In SYCL, the compiler allocates registers to private variables in work items.
Multiple work items in a sub-group are packed into one XVE thread. By default,
the compiler uses register pressure as one of the heuristics to choose SIMD
width or sub-group size. High register pressures can result in smaller
sub-group size (for example 8 instead of 16) if a sub-group size is not
explicitly requested. It can also cause register spilling or cause certain
variables not to be promoted to registers.
The hardware may not be fully utilized if sub-group size or SIMD
width is not the maximum the hardware supports. Register spilling
can cause significant performance degradation, especially when spills
occur inside hot loops. When variables are not promoted to registers,
accesses to these variables incur significant increase of memory traffic.
Though the compiler uses intelligent algorithms to allocate variables
in registers and to minimize
register spills, optimizations by developers can help the compiler to
do a better job and often make a big performance difference.
Optimization Techniques
The following techniques can reduce register pressure:
- Keep live ranges of private variables as short as possible.Though the compiler schedules instructions and optimizes the distances, in some cases moving the loading and using the same variable closer or removing certain dependencies in the source can help the compiler do a better job.
- Avoid excessive loop unrolling.Loop unrolling exposes opportunities for instruction scheduling optimization by the compiler and thus can improve performance. However, temporary variables introduced by unrolling may increase pressure on register allocation and cause register spilling. It is always a good idea to compare the performance with and without loop unrolling and different times of unrolls to decide if a loop should be unrolled or how many times to unroll it.
- Prefer USM pointers.A buffer accessor takes more space than a USM pointer. If you can choose between USM pointers and buffer accessors, choose USM pointers.
- Recompute cheap-to-compute values on-demand that otherwise would be held in registers for a long time.
- Avoid big arrays or large structures, or break an array of big structures into multiple arrays of small structures.For example, an array ofsycl::float4:sycl::float4 v[8];can be broken into 4 arrays offloat:float x[8]; float y[8]; float z[8]; float w[8];All or part of the 4 arrays offloathave a better chance to be allocated in registers than the array ofsycl::float4.
- Break a large loop into multiple small loops to reduce the number of simultaneously live variables.
- Choose smaller sized data types if possible.
- Do not declare private variables as volatile.
- Share registers in a sub-group.
- Use sub-group block load/store if possible.
- Use shared local memory.
The list here is not exhaustive.
The rest of this chapter shows how to apply these techniques, especially
the last four, and gives examples.
Choosing Smaller Data Types
constexpr int blockSize = 256;
constexpr int NUM_BINS = 32;
std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);
auto e = q.submit([&](auto &h) {
sycl::accessor macc(mbuf, h, sycl::read_only);
auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
h.parallel_for(
sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
[=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
int group = it.get_group()[0];
int gSize = it.get_local_range()[0];
sycl::ext::oneapi::sub_group sg = it.get_sub_group();
int sgSize = sg.get_local_range()[0];
int sgGroup = sg.get_group_id()[0];
unsigned long
histogram[NUM_BINS]; // histogram bins take too much storage to be
// promoted to registers
for (int k = 0; k < NUM_BINS; k++) {
histogram[k] = 0;
}
for (int k = 0; k < blockSize; k++) {
unsigned long x =
sg.load(macc.get_pointer() + group * gSize * blockSize +
sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
for (int i = 0; i < 8; i++) {
unsigned int c = x & 0x1FU;
histogram[c] += 1;
x = x >> 8;
}
}
for (int k = 0; k < NUM_BINS; k++) {
hacc[k].fetch_add(histogram[k]);
}
});
});
This example calculates histograms with a bin size of 32. Each work
item has 32 private bins of unsigned long data type. Because of the
large storage required, the private bins cannot fit in registers,
resulting in poor performance.
With
blockSize
256, the maximum value of each private histogram
bin will not exceed the maximum value of an unsigned integer. Instead
of unsigned long type for private histogram bins, we can use
unsigned integers to reduce register pressure so the private bins
can fit in registers. This simple change makes significant performance
difference. constexpr int blockSize = 256;
constexpr int NUM_BINS = 32;
std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);
auto e = q.submit([&](auto &h) {
sycl::accessor macc(mbuf, h, sycl::read_only);
auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
h.parallel_for(
sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
[=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
int group = it.get_group()[0];
int gSize = it.get_local_range()[0];
sycl::ext::oneapi::sub_group sg = it.get_sub_group();
int sgSize = sg.get_local_range()[0];
int sgGroup = sg.get_group_id()[0];
unsigned int histogram[NUM_BINS]; // histogram bins take less storage
// with smaller data type
for (int k = 0; k < NUM_BINS; k++) {
histogram[k] = 0;
}
for (int k = 0; k < blockSize; k++) {
unsigned long x =
sg.load(macc.get_pointer() + group * gSize * blockSize +
sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
for (int i = 0; i < 8; i++) {
unsigned int c = x & 0x1FU;
histogram[c] += 1;
x = x >> 8;
}
}
for (int k = 0; k < NUM_BINS; k++) {
hacc[k].fetch_add(histogram[k]);
}
});
});
Do Not Declare Private Variables as Volatile
Now we make a small change to the code example:
constexpr int blockSize = 256;
constexpr int NUM_BINS = 32;
std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);
auto e = q.submit([&](auto &h) {
sycl::accessor macc(mbuf, h, sycl::read_only);
auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
h.parallel_for(sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
[=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
int group = it.get_group()[0];
int gSize = it.get_local_range()[0];
sycl::ext::oneapi::sub_group sg = it.get_sub_group();
int sgSize = sg.get_local_range()[0];
int sgGroup = sg.get_group_id()[0];
volatile unsigned int
histogram[NUM_BINS]; // volatile variables will not
// be assigned to any registers
for (int k = 0; k < NUM_BINS; k++) {
histogram[k] = 0;
}
for (int k = 0; k < blockSize; k++) {
unsigned long x = sg.load(
macc.get_pointer() + group * gSize * blockSize +
sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
for (int i = 0; i < 8; i++) {
unsigned int c = x & 0x1FU;
histogram[c] += 1;
x = x >> 8;
}
}
for (int k = 0; k < NUM_BINS; k++) {
hacc[k].fetch_add(histogram[k]);
}
});
});
The private histogram array is qualified as a volatile array. Volatile
variables are not prompted to registers because their
values may change between two different load operations.
There is really no reason for the private histogram array to be volatile,
because it is only accessible by the local execution thread.
In fact, if a private variable really needs to be volatile, it is not
private any more.
Sharing Registers in a Sub-group
Now we increase the histogram bins to 256:
constexpr int blockSize = 256;
constexpr int NUM_BINS = 256;
std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);
auto e = q.submit([&](auto &h) {
sycl::accessor macc(mbuf, h, sycl::read_only);
auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
h.parallel_for(
sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
[=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
int group = it.get_group()[0];
int gSize = it.get_local_range()[0];
sycl::ext::oneapi::sub_group sg = it.get_sub_group();
int sgSize = sg.get_local_range()[0];
int sgGroup = sg.get_group_id()[0];
unsigned int
histogram[NUM_BINS]; // histogram bins take too much storage to be
// promoted to registers
for (int k = 0; k < NUM_BINS; k++) {
histogram[k] = 0;
}
for (int k = 0; k < blockSize; k++) {
unsigned long x =
sg.load(macc.get_pointer() + group * gSize * blockSize +
sgGroup * sgSize * blockSize + sgSize * k);
#pragma unroll
for (int i = 0; i < 8; i++) {
unsigned int c = x & 0x1FU;
histogram[c] += 1;
x = x >> 8;
}
}
for (int k = 0; k < NUM_BINS; k++) {
hacc[k].fetch_add(histogram[k]);
}
});
});
With 256 histogram bins, the performance degrades even with smaller
data type unsigned integer. The storage of the private bins in each
work item is too large for registers.
Each Work Item Has 256 Private Histogram Bins

If the sub-group size is 16 as requested, we know that 16 work items
are packed into one EU thread. We also know work items in the same
sub-group can communicate and share data with each other very
efficiently. If the work items in the same sub-group share the private
histogram bins, only 256 private bins are needed for the whole sub-group,
or 16 private bins for each work item instead.
Sub-group Has 256 Private Histogram Bins

To share the histogram bins in the sub-group, each work item broadcasts
its input data to every work item in the same sub-group. The work item
that owns the corresponding histogram bin does the update.
constexpr int blockSize = 256;
constexpr int NUM_BINS = 256;
std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);
auto e = q.submit([&](auto &h) {
sycl::accessor macc(mbuf, h, sycl::read_only);
auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
h.parallel_for(
sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
[=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
int group = it.get_group()[0];
int gSize = it.get_local_range()[0];
sycl::ext::oneapi::sub_group sg = it.get_sub_group();
int sgSize = sg.get_local_range()[0];
int sgGroup = sg.get_group_id()[0];
unsigned int
histogram[NUM_BINS / 16]; // histogram bins take too much storage
// to be promoted to registers
for (int k = 0; k < NUM_BINS / 16; k++) {
histogram[k] = 0;
}
for (int k = 0; k < blockSize; k++) {
unsigned long x =
sg.load(macc.get_pointer() + group * gSize * blockSize +
sgGroup * sgSize * blockSize + sgSize * k);
// subgroup size is 16
#pragma unroll
for (int j = 0; j < 16; j++) {
unsigned long y = sycl::group_broadcast(sg, x, j);
#pragma unroll
for (int i = 0; i < 8; i++) {
unsigned int c = y & 0xFF;
// (c & 0xF) is the workitem in which the bin resides
// (c >> 4) is the bin index
if (sg.get_local_id()[0] == (c & 0xF)) {
histogram[c >> 4] += 1;
}
y = y >> 8;
}
}
}
for (int k = 0; k < NUM_BINS / 16; k++) {
hacc[16 * k + sg.get_local_id()[0]].fetch_add(histogram[k]);
}
});
});
Using Sub-group Block Load/Store
Memory loads/stores are vectorized. Each lane of a vector load/store
instrction has its own address and data. Both addresses and data take
register space. For example:
constexpr int N = 1024 * 1024;
int *data = sycl::malloc_shared<int>(N, q);
int *data2 = sycl::malloc_shared<int>(N, q);
memset(data2, 0xFF, sizeof(int) * N);
auto e = q.submit([&](auto &h) {
h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}),
[=](sycl::nd_item<1> it) {
int i = it.get_global_linear_id();
data[i] = data2[i];
});
});
The memory loads and stores in the statement
data[i] = data2[i];
are vectorized and each vector lane has its own address. Assuming the
SIMD width or the sub-group size is 16, total register space for addresses
of the 16 lanes is 128 bytes. If each GRF rgister is 32-byte wide, 4 GRF
registers are needed for the addresses.
Noticing the addresses are contiguous, we can use sub-group block load/store
built-ins to save register space for addresses:
constexpr int N = 1024 * 1024;
int *data = sycl::malloc_shared<int>(N, q);
int *data2 = sycl::malloc_shared<int>(N, q);
memset(data2, 0xFF, sizeof(int) * N);
auto e = q.submit([&](auto &h) {
h.parallel_for(
sycl::nd_range(sycl::range{N}, sycl::range{32}),
[=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
sycl::ext::oneapi::sub_group sg = it.get_sub_group();
int x;
using global_ptr =
sycl::multi_ptr<int, sycl::access::address_space::global_space>;
int base = (it.get_group(0) * 32 +
sg.get_group_id()[0] * sg.get_local_range()[0]);
x = sg.load(global_ptr(&(data2[base + 0])));
sg.store(global_ptr(&(data[base + 0])), x);
});
});
The statements
x = sg.load(global_ptr(&(data2[base + 0])));
sg.store(global_ptr(&(data[base + 0])), x);
each loads/stores a contiguous block of memory and the compiler will compile
these 2 statements into special memory block load/store instructions. And
because it is a contiguous memory block, we only need the starting address of
the block. So 8, instead of 128, bytes of actual register space, or at most 1
register, is used for the address for each block load/store.