Box Blur Filter Using Intel Subgroup Extensions in OpenCL™

Sonal Sharma

Abstract
OpenCL™ Overview
Intel Subgroup Extensions in OpenCL™
Box Blur Filter Algorithm
OpenCL™ Application for Box Blur Filter
OpenCL™ Application for Box Blur Filter using Intel Subgroup Extensions
Performance Data on Intel® Processors with Intel® Graphics
Intel® VTune™ Amplifier Tool Profiles
Conclusion
About the Author
References

Abstract

This paper presents an OpenCL™ application for the Box Blur filter, an image processing and filtering algorithm, and describes how to optimize and accelerate a naïve OpenCL implementation using the Intel OpenCL Subgroup extensions. The paper focuses on block read and write calls. The Intel Subgroup extensions offer built-in functions that let an OpenCL application read and write data in bulk, which reduces the overall number of read/write calls. By taking advantage of these hardware capabilities, OpenCL application developers can read and write blocks of data and process more data per work item. The work items within a subgroup can share data without using shared local memory or barriers. This paper also reports the performance observed on 5th generation Intel® Core™ processors with Intel® Graphics. Profiles of the workload collected with the Intel® VTune™ Amplifier tool show better GPU utilization for the optimized version.

OpenCL Overview

OpenCL is an open industry standard maintained by the Khronos Group. It is a framework for parallel programming across heterogeneous systems, aimed at faster and more efficient processing. OpenCL is widely used in applications such as image processing, video processing, and gaming, where running data-parallel work on a suitable device can improve performance. OpenCL portability allows applications to run across multiple platforms and multiple devices within a platform. By combining the OpenCL™ standard, optimization techniques, heterogeneous compute concepts, and the set of extensions offered by Intel, developers can significantly improve the performance of their applications ¹.

Intel Subgroup Extensions in OpenCL

The concept of subgroups was introduced in OpenCL™ 2.0, where a workgroup consists of one or more subgroups. Two sets of subgroup extensions are offered: the Khronos Subgroup extensions and the Intel Subgroup extensions. Each offers a different set of APIs. Refer to the specification for details ². Note that the Intel subgroups extension can also be used with OpenCL™ 1.2.

In this article, we focus on the cl_intel_subgroups extension. The motivation for this extension is to let OpenCL applications benefit from the fact that the work items in a subgroup execute together. Work items in a subgroup can take advantage of hardware features that let them share data without using shared local memory or barriers. This advantage is not available to work items in a work group.

The Intel subgroup extension adds a set of subgroup “block read and write” functions that take advantage of specialized hardware to read and write blocks of data from/to buffers or images. In this article, we optimize the OpenCL application for the Box Blur filter using the block read/write APIs offered by the cl_intel_subgroups extension.

Block read API calls for buffers: Reads 1, 2, 4, or 8 unsigned integers (uints - 32 bits each) of data for each work item in the subgroup from the specified pointer as a block operation:

uint  intel_sub_group_block_read(const __global uint* p)
uint2 intel_sub_group_block_read2(const __global uint* p)
uint4 intel_sub_group_block_read4(const __global uint* p)
uint8 intel_sub_group_block_read8(const __global uint* p)

Block write API calls for buffers: Writes 1, 2, 4, or 8 uints of data for each work item in the subgroup to the specified pointer as a block operation:

void  intel_sub_group_block_write(__global uint* p, uint data)
void  intel_sub_group_block_write2(__global uint* p, uint2 data)
void  intel_sub_group_block_write4(__global uint* p, uint4 data)
void  intel_sub_group_block_write8(__global uint* p, uint8 data)
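The data layout of these calls can be illustrated with a small host-side emulation in plain C. This is a sketch under stated assumptions, not the extension itself: it assumes the strided layout described in the cl_intel_subgroups specification, where component k for the work item with subgroup-local ID lid comes from p[lid + k * subgroup size]. The helper names emulate_block_read4 and emulate_block_write4, the uint4_emu type, and the subgroup size of 8 are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed subgroup size for this illustration; real sizes depend on the
 * device and on how the kernel was compiled. */
#define SUBGROUP_SIZE 8u

/* Hypothetical stand-in for the OpenCL C uint4 vector type. */
typedef struct { uint32_t s[4]; } uint4_emu;

/* Emulates what the work item with subgroup-local ID `lid` would receive
 * from intel_sub_group_block_read4(p), assuming the strided layout:
 * component k is taken from p[lid + k * SUBGROUP_SIZE]. */
static uint4_emu emulate_block_read4(const uint32_t *p, uint32_t lid)
{
    uint4_emu v;
    assert(lid < SUBGROUP_SIZE);
    for (uint32_t k = 0; k < 4; ++k)
        v.s[k] = p[lid + k * SUBGROUP_SIZE];
    return v;
}

/* Matching emulation of intel_sub_group_block_write4(p, data) for one
 * work item: component k is stored to p[lid + k * SUBGROUP_SIZE]. */
static void emulate_block_write4(uint32_t *p, uint32_t lid, uint4_emu data)
{
    assert(lid < SUBGROUP_SIZE);
    for (uint32_t k = 0; k < 4; ++k)
        p[lid + k * SUBGROUP_SIZE] = data.s[k];
}
```

Under this assumed layout, the whole subgroup touches one contiguous range of 4 * SUBGROUP_SIZE uints per call, while the values seen by a single work item are SUBGROUP_SIZE elements apart.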

Box Blur Filter Algorithm

Box Blur is an image processing and filtering algorithm ⁵. It is a simple filter in which each pixel in the output image is the average of the neighboring pixels in the input image. The input pixels are unpacked to get the RGB components, and the filter is applied to each component before the result is packed back into a pixel. The diagram of the algorithm is shown in Figure 1, and the mathematical representation of the algorithm is shown in Figure 2.

Figure 1: Box Blur filter for a diameter of 3. The output for pixel (1,1) is computed from the value of pixel (1,1) and its 8 neighboring pixels.

For example, to calculate the Box Blur of pixel (1,1) for a Box Blur of a diameter of 3, the value of the current pixel and all the 8 neighboring pixels (the shaded pixels in the diagram) are used to compute the output of the pixel (1,1).

Figure 2: Mathematical formula of a Box Blur filter.

The radius is derived from the Box Blur size. For example, a box of size 3x3 has a diameter of 3 and a radius of floor(3/2) = 1. The factor is defined as Factor = 1 / (diameter × diameter); for example, for a diameter of 3, the factor is 1/9, which yields the average of the 9 pixels. If x or y goes out of bounds, clamp the value to the valid range between 0 and the image size (not shown in the formula).
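The description above can be written out as a formula. The following is a reconstruction from the text, not a copy of Figure 2: the symbols I (input image), O (output image), d (diameter), r (radius), and W × H (image size) are assumptions chosen for illustration, and the formula is applied to each of the R, G, and B components separately. Clamping is included here even though the text notes it is not shown in the original formula.

```latex
r = \left\lfloor \frac{d}{2} \right\rfloor, \qquad
\mathrm{Factor} = \frac{1}{d^{2}}

O(x, y) = \mathrm{Factor} \sum_{i=-r}^{r} \sum_{j=-r}^{r}
  I\bigl(\operatorname{clamp}(x + i,\, 0,\, W - 1),\;
         \operatorname{clamp}(y + j,\, 0,\, H - 1)\bigr)
```

For an odd diameter the double sum has (2r + 1)² = d² terms, so for d = 3 the output is the sum of 9 pixel values multiplied by 1/9.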

OpenCL Application For Box Blur Filter

OpenCL™ kernel development for the Box Blur filter was done using the Intel® Code Builder for OpenCL™ API tool ⁴. The Box Blur filter was implemented using OpenCL 1.2 with buffer memory objects. Zero copy buffers were created using CL_MEM_USE_HOST_PTR. Input and output buffers were created using the “unsigned char” datatype, and the size of each buffer is (width × height × 4). For test cases, two image sizes were used: 1920x1080 and 4256x2832 resolution. The test cases applied a Box Blur filter with various diameters: 3, 5, 7, 9, and 11. The host code steps included zero copy buffer creation and allocation for both input and output buffers. The global workgroup size was set to {width, height}.

After the kernel arguments are set and the kernel is dispatched, the output buffer is mapped to access the result. The kernel code reads “uint data” and extracts the Red, Green, and Blue (RGB) byte components of each pixel. To apply the Box Blur filter, each color component is averaged with the corresponding color components of the neighboring pixels. The resulting RGB component bytes are packed into the uint pixel value again before it is written to the output buffer (refer to the Box Blur Filter Algorithm section for more details). This kernel implementation processes one pixel per work item.
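As an illustration of the host-side buffer setup, the sketch below computes the buffer size and allocates page-aligned host memory that could be passed to clCreateBuffer with CL_MEM_USE_HOST_PTR. It is a minimal sketch under assumptions: the 4096-byte alignment, the rounding of the size, and the helper names (image_buffer_bytes, round_up_to_page, alloc_image_host_memory) are not taken from the original application, and the OpenCL call itself appears only in a comment so that the block compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for host memory intended for zero-copy use. */
#define PAGE_ALIGNMENT 4096u

/* Size in bytes of one image buffer: width x height x 4 (unsigned char). */
static size_t image_buffer_bytes(size_t width, size_t height)
{
    return width * height * 4u;
}

/* Rounds `bytes` up to the next multiple of PAGE_ALIGNMENT, which
 * aligned_alloc requires and which is also a multiple of 64 bytes. */
static size_t round_up_to_page(size_t bytes)
{
    return (bytes + PAGE_ALIGNMENT - 1u) / PAGE_ALIGNMENT * PAGE_ALIGNMENT;
}

/* Hypothetical helper: allocates page-aligned host memory for one image
 * and reports the padded size. Release the memory with free(). */
static unsigned char *alloc_image_host_memory(size_t width, size_t height,
                                              size_t *padded_bytes)
{
    assert(width > 0 && height > 0 && padded_bytes != NULL);
    *padded_bytes = round_up_to_page(image_buffer_bytes(width, height));
    return aligned_alloc(PAGE_ALIGNMENT, *padded_bytes); /* C11 */
}

/* True if `p` is aligned to PAGE_ALIGNMENT. */
static int is_page_aligned(const void *p)
{
    return ((uintptr_t)p % PAGE_ALIGNMENT) == 0;
}

/* The pointer would then be handed to the OpenCL runtime, for example:
 *
 *   cl_mem input = clCreateBuffer(context,
 *                                 CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
 *                                 padded_bytes, host_ptr, &err);
 */
```

For the 1920x1080 test image the unpadded size is already a multiple of 4096 bytes; for 4256x2832 the size is rounded up by 2048 bytes.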

Figure 3: Diagram showing computation of one pixel at a time. The neighboring pixel values (orange squares) are also read to compute the output pixel (green square).

Host code: Global workgroup size:

For input buffer and output buffer of type unsigned char and size (width × height × 4)
Global_size[] = {width, height};

Kernel pseudo code: Processing Box Blur for one pixel per work item:

Get x and y using get_global_id(0) and get_global_id(1).
Declare temporary variables Temp_R, Temp_G, Temp_B as float and initialize to 0.

Create a for loop to read the value of the main pixel and neighboring pixels based on the radius of Box Blur (see the formula in the Box Blur Filter Algorithm section for reference).

for (int i = -radius; i <= radius; i++)
{
    for (int j = -radius; j <= radius; j++)
    {
        a.	Using the value of i and j, calculate the offset and the index of a Pixel.
        b.	Read one Pixel value of type uint using the index value.
        c.	Unpack each Pixel to get R, G, B byte components.
        d.	Apply Box Blur Filter on each RGB component (see the formula in the Box Blur Filter Algorithm section for reference).
            i.	Temp_R  += R * Factor;
            ii.	Temp_G  += G * Factor;
            iii.	Temp_B  += B * Factor;
    }
}

Pack Temp_R, Temp_G, Temp_B components to a uint pixel value and write pixel value to the output buffer.
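The pseudo code above can be sketched as a plain C reference function that computes one output pixel, which is the work done by one work item. This is an illustrative host-side sketch, not the original kernel: in the kernel, x and y would come from get_global_id(0) and get_global_id(1). The function names, the assumed 0xAABBGGRR packing, the copy of the alpha byte from the center pixel, and the rounding to nearest are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Clamps v into the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Converts an accumulated float component to a byte, rounding to nearest. */
static uint32_t to_byte(float v)
{
    uint32_t b = (uint32_t)(v + 0.5f);
    return b > 255u ? 255u : b;
}

/* Hypothetical reference: returns the blurred value of pixel (x, y).
 * Pixels are assumed to be packed as 0xAABBGGRR; the alpha byte is
 * copied unchanged from the center pixel. */
static uint32_t box_blur_pixel(const uint32_t *src, int width, int height,
                               int x, int y, int diameter)
{
    assert(diameter > 0 && (diameter % 2) == 1);
    const int radius = diameter / 2;
    const float factor = 1.0f / (float)(diameter * diameter);
    float temp_r = 0.0f, temp_g = 0.0f, temp_b = 0.0f;

    for (int i = -radius; i <= radius; i++) {
        for (int j = -radius; j <= radius; j++) {
            /* a. Offset and index, clamped at the image border. */
            const int px = clamp_int(x + i, 0, width - 1);
            const int py = clamp_int(y + j, 0, height - 1);
            /* b. Read one pixel. */
            const uint32_t pixel = src[py * width + px];
            /* c. Unpack and d. accumulate each component. */
            temp_r += (float)(pixel & 0xFFu) * factor;
            temp_g += (float)((pixel >> 8) & 0xFFu) * factor;
            temp_b += (float)((pixel >> 16) & 0xFFu) * factor;
        }
    }

    /* Pack the averaged components back into a uint pixel value. */
    const uint32_t alpha = src[y * width + x] & 0xFF000000u;
    return alpha | (to_byte(temp_b) << 16) | (to_byte(temp_g) << 8)
                 | to_byte(temp_r);
}
```

Calling box_blur_pixel for every (x, y) in the image corresponds to a dispatch with a global size of {width, height}.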

OpenCL Application For Box Blur Filter Using Intel Subgroup Extensions

The naïve OpenCL application for the Box Blur filter is improved using the Intel Subgroup extensions, which are used here for their block read and write functions. The test case chosen to showcase the feature implements a kernel that computes 16 pixels per work item. In the current example, we read a block of “4 uint data” at once as a block read operation and similarly write a block of “4 uint data” to the output buffer as a block write operation. The new global workgroup size to compute 16 pixels per work item is {width/4, height/4}. The rest of the host code remains the same, and the kernel code is modified to calculate the output for the entire block of data, that is, for 16 pixels. Fewer work items are dispatched, and each work item handles more work because the kernel now computes 16 pixels.

Figure 4: Diagram showing computation of 16 pixels in a work item. The extra pixel values (orange squares) read to compute the output of 16 pixels (green squares).

Host code: Global workgroup size:

For input buffer and output buffer of type unsigned char and size (width × height × 4)
Global_size[] = {width/4, height/4};

Kernel pseudo code: Processing Box Blur for 16 pixels per work item:

Get x and y using get_global_id(0) and get_global_id(1).
1. int x = 4 * get_global_id(0);
2. int y = 4 * get_global_id(1);
Declare temporary vector variables TempR_rt, TempG_rt, TempB_rt as float4 and initialize them to 0, where t ϵ {1, 2, 3, 4}.

Create a for loop to read the value of the main pixel and neighboring pixels based on the radius of Box Blur (see the formula in the Box Blur Filter Algorithm section for reference).

for (int i = -radius; i <= radius; i++)
{
    for (int j = -radius; j <= radius; j++)
    {
        a.	Using the value of i and j, calculate the offset and the index.
        b.	Read blocks of data: read 4 uints for each row.
            // Reading for each row
            uint4 r1 = intel_sub_group_block_read4(src + index);
            uint4 r2 = intel_sub_group_block_read4(src + index + width);
            uint4 r3 = intel_sub_group_block_read4(src + index + 2*width);
            uint4 r4 = intel_sub_group_block_read4(src + index + 3*width);
        c.	Unpack rt to get the R, G, B components for each row, where t ϵ {1, 2, 3, 4}.
        d.	Apply Box Blur Filter on the RGB components for each row (see the formula in the Box Blur Filter Algorithm section for reference).
            i.	TempR_rt += Rt * Factor;
            ii.	TempG_rt += Gt * Factor;
            iii.	TempB_rt += Bt * Factor;
            where t ϵ {1, 2, 3, 4}
    }
}

Pack the TempR_rt, TempG_rt, TempB_rt components for each row into the variable Output_t, where t ϵ {1, 2, 3, 4} and Output_t is of type uint4.

Write 16 pixels to the output buffer:

intel_sub_group_block_write4(dst + out_index, Output1);
intel_sub_group_block_write4(dst + out_index + width, Output2);
intel_sub_group_block_write4(dst + out_index + width*2, Output3);
intel_sub_group_block_write4(dst + out_index + width*3, Output4);
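The 16-pixel variant can be sketched in plain C as a function that computes one 4x4 tile, which is the work assigned to one work item. This is a simplified host-side illustration and not the original kernel: read_row4 is a hypothetical stand-in that reads 4 adjacent pixels with border clamping, so it does not model the subgroup-wide layout or alignment rules of the real intel_sub_group_block_read4 call. The function names, the assumed 0xAABBGGRR packing, and the copy of the alpha byte from the input are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Clamps v into the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical stand-in for one row read: fetches 4 horizontally
 * adjacent pixels starting at (x, y), clamped at the image border. */
static void read_row4(const uint32_t *src, int width, int height,
                      int x, int y, uint32_t out[4])
{
    const int cy = clamp_int(y, 0, height - 1);
    for (int c = 0; c < 4; c++)
        out[c] = src[cy * width + clamp_int(x + c, 0, width - 1)];
}

/* Computes the 4x4 output tile whose top-left pixel is (4*gx, 4*gy),
 * where gx and gy play the role of get_global_id(0) and (1). */
static void box_blur_tile4x4(const uint32_t *src, uint32_t *dst,
                             int width, int height,
                             int gx, int gy, int diameter)
{
    assert(diameter > 0 && (diameter % 2) == 1);
    assert(width % 4 == 0 && height % 4 == 0);
    const int radius = diameter / 2;
    const float factor = 1.0f / (float)(diameter * diameter);
    const int x = 4 * gx, y = 4 * gy;
    float acc[4][4][3] = {{{0.0f}}}; /* [row t][column][R, G, B] */

    for (int i = -radius; i <= radius; i++) {
        for (int j = -radius; j <= radius; j++) {
            for (int t = 0; t < 4; t++) {          /* rows r1..r4 */
                uint32_t row[4];
                read_row4(src, width, height, x + i, y + j + t, row);
                for (int c = 0; c < 4; c++)
                    for (int ch = 0; ch < 3; ch++)
                        acc[t][c][ch] +=
                            (float)((row[c] >> (8 * ch)) & 0xFFu) * factor;
            }
        }
    }

    /* Pack each row of the tile and write 16 pixels to the output. */
    for (int t = 0; t < 4; t++) {
        for (int c = 0; c < 4; c++) {
            const int idx = (y + t) * width + (x + c);
            uint32_t pixel = src[idx] & 0xFF000000u; /* keep alpha */
            for (int ch = 0; ch < 3; ch++) {
                uint32_t b = (uint32_t)(acc[t][c][ch] + 0.5f);
                pixel |= (b > 255u ? 255u : b) << (8 * ch);
            }
            dst[idx] = pixel;
        }
    }
}
```

Calling box_blur_tile4x4 for every gx in [0, width/4) and gy in [0, height/4) covers the whole image, which matches the global size of {width/4, height/4} used on the host.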

Performance Data And Graph

The performance of the OpenCL buffer implementation and the OpenCL buffer implementation using Intel® Subgroup extensions was measured on a BDW Lenovo Yoga* system with four cores and Intel® Graphics GT2. The performance numbers were collected for bitmap images at two resolutions: 1920x1080 and 4256x2832. Box Blur filters of different diameters were used for the measurements: 3, 5, 7, 9, and 11. The graphs below show kernel times in ms (Figures 5 and 7) and total time (kernel time + host time) in ms (Figures 6 and 8). The lower the time, the better the performance. The average speed-up in kernel time is 1.52x, and the average speed-up in total time is 1.36x.

Figure 5: Box Blur filter performance comparison. Kernel time of naïve OpenCL™ application versus Intel Subgroup Extensions for an image size of 1920x1080 on 5th generation Intel® Core™ processors with Intel® Processor Graphics.

Figure 6: Box Blur filter performance comparison. Total time of naïve OpenCL™ application versus Intel Subgroup Extensions for an image size of 1920x1080 on 5th generation Intel® Core™ processors with Intel® Processor Graphics.

Figure 7: Box Blur filter performance comparison. Kernel time of naïve OpenCL™ application versus Intel Subgroup Extensions for image size of 4256x2832 on 5th generation Intel® Core™ processors with Intel® Processor Graphics.

Figure 8: Box Blur filter performance comparison. Total time of naïve OpenCL™ application versus Intel Subgroup Extensions for an image size of 4256x2832 on 5th generation Intel® Core™ processors with Intel® Processor Graphics.

Intel® VTune™ Amplifier Tool Profiles

The Intel VTune Amplifier performance tool was used to collect profiles of the workloads. Profiles for the Box Blur application using the OpenCL buffer and the OpenCL buffer with Intel Subgroup extensions were collected for 4256x2832 images and a Box Blur diameter of 11. The profiles from both implementations were analyzed to track GPU usage and EU utilization: EU Active%, EU Stall%, and EU Idle%. The GPU metrics were used to compare the performance of both implementations on the hardware.

The Graphics/Platform tab showed the EU utilization. Figure 9 shows the Intel VTune Amplifier profile of the naïve OpenCL application for the Box Blur filter, and Figure 10 shows the Subgroup implementation of the Box Blur filter. For the naïve OpenCL application, EU Active% is 90.6 percent and EU Stall% is 9.4 percent. For the Subgroup implementation of the Box Blur filter, EU Active% is 99.7 percent and EU Stall% is 0.3 percent. EU Active% for the OpenCL buffer with Intel Subgroup extensions therefore increased by 9.1 percentage points, or about 10 percent in relative terms. Overall, the EU utilization shown for the kernel using the subgroup extension is better.

Figure 9: Intel® VTune™ Amplifier tool profile of naïve OpenCL™ application for a Box Blur filter with an image size of 4256x2832 and a diameter of 11.

Figure 10: Intel® VTune™ Amplifier tool profile of a Box Blur filter using Intel Subgroup extensions, an image size of 4256x2832, and a diameter of 11.

Conclusion

The paper presented a basic Box Blur filter OpenCL application and an optimization technique using the Intel subgroup extensions for OpenCL. The test case showed how to optimize an OpenCL application and enhance its performance. The Subgroup example used for experimentation was 4x4, that is, computing 16 pixels per work item. It showed the benefit of using subgroups to increase the workload per work item for better EU utilization. The performance data showed an average speedup of 1.52x for the kernel time and 1.36x for the total time. Profiles from the Intel VTune Amplifier tool showed better EU utilization when subgroups were used. OpenCL application developers can experiment with Intel subgroup extensions using different subgroup sizes and tune their applications for their system specifications.

About The Author

Sonal Sharma is a software application engineer working at Intel in California. Her responsibilities include OpenCL enabling for applications running on Intel® platforms. She does performance profiling and GPU optimization for media applications and is well acquainted with Intel® performance tools such as Intel VTune Amplifier, Intel Code Builder for OpenCL API, and Intel® Graphics Performance Analyzers.