Coloring with Beignet: Performing Color Management on Intel® HD Graphics with OpenCL*

Published: 02/02/2016  

Last Updated: 02/01/2016

Co-authored by Alina Chera

Brief Introduction to Color Management

This article presents a proof of concept implementation that accelerates the computation of color profile transformations using OpenCL* on Intel® HD Graphics. To understand the context of this document, the reader must first be aware of some basic concepts about colors, color profiles, and their importance to color management.

Human perception of the world is influenced by the internal structure of the eye (its photoreceptor cells) and by the light sources it is exposed to. The perceived image of an object can therefore be influenced by the object’s original setting and by the characteristics of the recording device. To reduce representation errors and to capture and display images accurately, digital devices and graphics artists associate color profiles with the original content. This is an attempt to solve the problem of different devices exhibiting different color and lighting characteristics.

The end goal is to preserve the quality of digital images across devices. For example, color management helps photographers retain the accuracy of scenes, and it allows applications that work with photographs to read color profiles and display those photographs with high fidelity. Examples of color profile transforms applied to images are shown at the end of this article so the reader can get a visual understanding of the process.

To standardize color correction techniques, the International Color Consortium (ICC) created a set of specifications that helps software and hardware vendors represent color profile transforms. The specifications are intended to be vendor-neutral and set the rules and characteristics that color profiles should implement [1]. Color transformations represent the mapping from a source to an output (display) color profile.

A simplified flow is shown in the diagram below:

[Figure: simplified color management flow]

A simple way to describe a color profile is to specify its reference white point[2], parametric curve type, gamma value[3], and primaries (which are the three colors that define the color spectrum of the profile, commonly red, green, and blue references).

The Profile Connection Space (PCS) provides the bridge between the two color profiles.

A color profile can transition to PCS through a series of operations that transform its primaries and reference white point data from RGB or Grayscale into LAB or XYZ color space[4].
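As a rough illustration of the matrix step, the sketch below maps a linear RGB triple to CIE XYZ with a 3x3 matrix derived from the profile primaries. The matrix values are the widely published sRGB ones relative to D65 and are an assumption made for this example only; an ICC workflow would use a matrix adapted to the D50 white point of the PCS, and the gamma curve would be undone before this step. The function and constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply a linear RGB triple by a 3x3 RGB-to-XYZ matrix. */
static void rgb_linear_to_xyz(const float m[3][3], const float rgb[3], float xyz[3])
{
    assert(m != NULL && rgb != NULL && xyz != NULL);
    for (int row = 0; row < 3; ++row) {
        xyz[row] = m[row][0] * rgb[0] + m[row][1] * rgb[1] + m[row][2] * rgb[2];
    }
}

/* Published sRGB matrix relative to D65 (assumption; the ICC PCS uses D50). */
static const float SRGB_TO_XYZ_D65[3][3] = {
    { 0.4124f, 0.3576f, 0.1805f },
    { 0.2126f, 0.7152f, 0.0722f },
    { 0.0193f, 0.1192f, 0.9505f },
};
```

With this matrix, linear white (1, 1, 1) maps to a luminance Y of approximately 1, because the middle row sums to one.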


Quick Color Management System (QCMS)

From a software point of view, the operations required to perform a color transform translate into a series of scalar and matrix multiplications. Several open source tools or libraries are available to perform these tasks, such as:

  • Quick color management system (QCMS)
  • Little color management system (LCMS)
  • Sample ICC
  • Etc.

This article focuses on QCMS because it is the library currently used by the Chrome and Firefox browsers for color management.

QCMS can read color profiles from memory or disk and create transform objects that map colors from one color profile to another. The transform provides a routine that takes a chunk of data (image pixels) as input and outputs the adjusted content as an image ready to be displayed on the current system. Before doing the actual calculations, it performs a series of optimizations and caches constant data, which speeds up the transformation process.

QCMS’s transformation engine has code that uses SIMD instructions to accelerate the main computing loop on x86 hardware. In Chrome and Firefox, QCMS handles color transforms for JPEG, PNG, and WebP images with an embedded ICC profile. We will try to improve the transform routines by performing them on the GPU using OpenCL.


OpenCL is a framework for programming parallel heterogeneous hardware. The OpenCL 1.0 specification was released as an open standard in December 2008, with the first implementations following in 2009, and it is currently supported by a wide range of hardware and operating systems. We used OpenCL’s programming model to move the execution of critical code from the CPU to the GPU.

This article assumes that the reader has some understanding of OpenCL or other GPU programming APIs. You can find the OpenCL specification here:


Beignet[5] is an open source implementation of the OpenCL standard for Intel® graphics processors. The current implementation uses Beignet on Intel® HD Graphics 530 (GT2), the GPU integrated in a 6th Generation Intel® Core™ i7-6700 processor[6]. The GPU has 24 execution units running at 1.15 GHz, whereas the CPU can reach 4 GHz. Our goal is to use OpenCL efficiently to obtain faster execution times for QCMS transforms. This work used Beignet version 1.1, which supports OpenCL 1.2 contexts.


We used the QCMS code from the Chromium project’s repository:

We created a parallel implementation of one of the transform routines (qcms_transform_data_rgba_out_lut_precache) in OpenCL and performed comparative performance analysis with the serial version.
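For orientation, a serial transform of this kind can be sketched as a per-pixel loop: look up each input channel in a gamma table, multiply by a 3x3 matrix, clamp, and look up the result in a precomputed output table. The sketch below is a simplified assumption about the general shape of such a routine, not the QCMS source; the struct, the table resolution, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define OUT_TABLE_SIZE 8192                 /* hypothetical precache resolution */
#define OUT_TABLE_MAX  (OUT_TABLE_SIZE - 1)

/* Hypothetical, simplified stand-in for the data a precached transform keeps. */
typedef struct {
    float in_gamma[3][256];                     /* device value -> linear */
    float matrix[3][3];                         /* linear source -> linear output */
    unsigned char out_table[3][OUT_TABLE_SIZE]; /* linear -> device value */
} precache_transform;

static float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Serial reference: one RGBA pixel per iteration, alpha copied unchanged. */
static void transform_rgba_serial(const precache_transform *t,
                                  const unsigned char *src,
                                  unsigned char *dst, size_t pixels)
{
    assert(t != NULL && src != NULL && dst != NULL);
    for (size_t i = 0; i < pixels; ++i) {
        const unsigned char *p = src + 4 * i;
        float lin[3];

        for (int c = 0; c < 3; ++c)
            lin[c] = t->in_gamma[c][p[c]];
        for (int c = 0; c < 3; ++c) {
            float v = t->matrix[c][0] * lin[0]
                    + t->matrix[c][1] * lin[1]
                    + t->matrix[c][2] * lin[2];
            /* Round to the nearest table slot after clamping to [0, 1]. */
            size_t idx = (size_t)(clamp01(v) * OUT_TABLE_MAX + 0.5f);
            dst[4 * i + c] = t->out_table[c][idx];
        }
        dst[4 * i + 3] = p[3];
    }
}
```

Each iteration depends only on its own pixel and on read-only tables, which is what makes the loop a natural candidate for one-work-item-per-pixel parallelism.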

As the name precache suggests, the routine relies on a set of values that are computed before the transform runs (gamma tables, color lookup tables, etc.). We replaced the precache stage of the algorithm with our OpenCL setup code, which does the following:

  • Creates the context
  • Builds the program
  • Allocates buffers
  • Copies constant data to device memory
  • Sets up the kernel arguments
  • Etc.

We rewrote the body of the resulting function as a set of OpenCL calls that transfer the source data to the GPU, execute the kernel, and retrieve the results:

[Figure: code listing with the set of OpenCL calls that transfers the source data to the GPU]

The compute-intensive code was moved into a .cl file and adapted to run on OpenCL, as shown below:

[Figure: code listing of the .cl file adapted to run on OpenCL]
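A minimal sketch of what a one-pixel-per-work-item kernel of this kind might look like follows. It is written as plain C with a small emulation shim so that it can be compiled and run on the CPU; the body is meant to resemble OpenCL C but has not been verified on a device. The kernel name, argument layout, and table sizes are assumptions, and in real OpenCL C the built-in clamp() could replace the clamp01 helper.

```c
#include <assert.h>
#include <stddef.h>

/* Emulation shim (assumption): lets the kernel below compile as plain C.
 * In OpenCL C, `kernel`, `global`, `uchar`, and get_global_id() are
 * provided by the language, so this shim would be dropped. */
#define kernel
#define global
typedef unsigned char uchar;
static size_t emulated_gid;
static size_t get_global_id(unsigned dim) { (void)dim; return emulated_gid; }

static float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Hypothetical kernel: each work-item transforms exactly one RGBA pixel. */
kernel void transform_pixel(global const uchar *src, global uchar *dst,
                            global const float *in_gamma,  /* 3 x 256 */
                            global const float *matrix,    /* 3 x 3, row-major */
                            global const uchar *out_table, /* 3 x out_size */
                            int out_size)
{
    size_t gid = get_global_id(0);          /* pixel index of this work-item */
    global const uchar *p = src + 4 * gid;
    float lin[3];

    for (int c = 0; c < 3; ++c)
        lin[c] = in_gamma[c * 256 + p[c]];
    for (int c = 0; c < 3; ++c) {
        float v = matrix[3 * c] * lin[0] + matrix[3 * c + 1] * lin[1]
                + matrix[3 * c + 2] * lin[2];
        int idx = (int)(clamp01(v) * (float)(out_size - 1) + 0.5f);
        dst[4 * gid + c] = out_table[c * out_size + idx];
    }
    dst[4 * gid + 3] = p[3];                /* alpha is copied unchanged */
}

/* Stand-in for enqueueing the kernel with a global work size of `pixels`. */
static void run_one_item_per_pixel(const uchar *src, uchar *dst,
                                   const float *in_gamma, const float *matrix,
                                   const uchar *out_table, int out_size,
                                   size_t pixels)
{
    assert(out_size > 0);
    for (emulated_gid = 0; emulated_gid < pixels; ++emulated_gid)
        transform_pixel(src, dst, in_gamma, matrix, out_table, out_size);
}
```

Compared with a serial loop, the only structural change is that the loop index disappears: the pixel index comes from get_global_id(0), and the runtime decides how work-items are scheduled.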

We created another version that uses OpenCL’s vector data types to load and store multiple elements and to perform SIMD-like operations. In the previous version, each work-item was responsible for computing a single pixel. The new version also handles the case where the global work size differs from the input length, so the input length must be passed to the kernel as an additional argument. The resulting code is displayed below:

[Figure: code listing of the kernel that uses OpenCL’s vector data types to load/store multiple elements]
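The interaction between a padded global work size and the input length can be sketched in plain C as follows. This is a CPU emulation under stated assumptions, not the kernel used in this work: it does not use OpenCL vector types, each emulated work-item simply handles four consecutive pixels, and a byte lookup table stands in for the real per-pixel math. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PIXELS_PER_ITEM 4   /* hypothetical vector width, in pixels */

/* Work-items needed for `pixels`, rounded up to a multiple of `local_size`. */
static size_t global_size_for(size_t pixels, size_t local_size)
{
    assert(local_size > 0);
    size_t items = (pixels + PIXELS_PER_ITEM - 1) / PIXELS_PER_ITEM;
    return (items + local_size - 1) / local_size * local_size;
}

/* Emulated work-item: handles up to PIXELS_PER_ITEM consecutive RGBA pixels.
 * `length` is the pixel count passed as an extra kernel argument; pixels at
 * or beyond it are skipped. The byte LUT is a placeholder for the real math. */
static void vector_item(size_t gid, const unsigned char *src,
                        unsigned char *dst, const unsigned char *lut,
                        size_t length)
{
    for (size_t k = 0; k < PIXELS_PER_ITEM; ++k) {
        size_t px = gid * PIXELS_PER_ITEM + k;
        if (px >= length)
            return;                        /* padding work-item or partial tail */
        for (int c = 0; c < 3; ++c)
            dst[4 * px + c] = lut[src[4 * px + c]];
        dst[4 * px + 3] = src[4 * px + 3]; /* alpha is copied unchanged */
    }
}

/* Stand-in for enqueueing the kernel over the padded global range. */
static size_t run_vectorized(const unsigned char *src, unsigned char *dst,
                             const unsigned char *lut, size_t length,
                             size_t local_size)
{
    size_t global = global_size_for(length, local_size);
    for (size_t gid = 0; gid < global; ++gid)
        vector_item(gid, src, dst, lut, length);
    return global;
}
```

For example, 10 pixels with a work-group size of 8 need 3 work-items, which are padded to a global size of 8; the length check keeps the 5 surplus work-items, and the unused half of the third one, from touching memory past the end of the image.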

Another optimization involves the use of local memory and work-groups. The work-groups are chosen to be the same size as the global tables, and each work-group makes a local copy of the constant data.

A barrier is used as a synchronization point before each GPU thread is allowed to continue its execution, as shown here:

[Figure: code listing of the kernel with a synchronization point before each GPU thread continues]
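The copy-then-synchronize pattern can be emulated on the CPU with two loops per work-group, as in the sketch below. It is an illustration under assumptions, not the kernel used in this work: the table is a single 256-entry byte table, each emulated work-item processes one byte, and the names are hypothetical. In an OpenCL C kernel, the boundary between the two loops would correspond to a call to barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 256   /* work-group size chosen equal to the table size */

/* Emulates one work-group of TABLE_SIZE work-items. */
static void run_group(size_t group, const unsigned char *table_global,
                      const unsigned char *src, unsigned char *dst,
                      size_t length)
{
    unsigned char table_local[TABLE_SIZE];  /* stands in for local memory */

    /* Phase 1: every work-item copies one table entry, including
     * work-items whose global index lies beyond `length`. */
    for (size_t lid = 0; lid < TABLE_SIZE; ++lid)
        table_local[lid] = table_global[lid];

    /* A barrier would sit here on a device: no work-item may read
     * table_local until all of them have finished writing it. */

    /* Phase 2: only in-range work-items produce output. */
    for (size_t lid = 0; lid < TABLE_SIZE; ++lid) {
        size_t gid = group * TABLE_SIZE + lid;
        if (gid < length)
            dst[gid] = table_local[src[gid]];
    }
}

/* Stand-in for enqueueing the kernel with a local size of TABLE_SIZE. */
static void run_with_local_table(const unsigned char *table,
                                 const unsigned char *src,
                                 unsigned char *dst, size_t length)
{
    assert(table != NULL && src != NULL && dst != NULL);
    size_t groups = (length + TABLE_SIZE - 1) / TABLE_SIZE;
    for (size_t g = 0; g < groups; ++g)
        run_group(g, table, src, dst, length);
}
```

Note that the range check guards only the second phase. If out-of-range work-items returned before the copy, part of the local table would stay uninitialized, and on a real device they would also skip the barrier.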


To obtain these results, we ran the OpenCL kernels and compared their performance against each other and against the CPU implementation. The test hardware consisted of two machines: one based on a 6th Generation Intel® Core™ i7-6700 processor running 64-bit Ubuntu*, and one based on a 4th Generation Intel® Core™ i7 processor with HD Graphics 4600, running the same Beignet and OS versions.

The tests were performed with qcms_tests (QCMS internal unit and performance tests) on 2048x2048 images for 100 iterations:

  • ./qcms_tests -t qcms_test_precache -w 2048 -h 2048 -i sRGB.icc -o AdobeRGB.icc -n 100

We obtained the best performance on the 6th Generation Intel® Core™ i7 machine with the OpenCL kernel version that used SIMD operations. Beignet does not support local memory usage on 4th Generation Intel® Core™ i7 processors for Linux kernel versions prior to 4.2. On the 6th Generation Intel® Core™ machine, using local memory did not provide a significant speedup over the SIMD version (probably because of the small kernel size). Nevertheless, the measurements yielded some interesting results:

[Figure: Performance of QCMS on CPU vs GPU]


The best speedup was obtained on the 6th Generation Intel® Core™ i7’s GPU: 1.5x over a single CPU core running at 4 GHz. OpenCL performance on the 4th Generation Intel® Core™ i7 processor with local memory was significantly worse, but that is to be expected given the lower number of execution units and the memory restrictions imposed by the algorithm. Still, the HD 530 brings significant improvements to the state of the art, and is able to outperform the CPU with the correct optimizations.

Color management is an important topic for programs that have to work with images and videos. QCMS is one of the fastest open source solutions that can perform color transforms on the CPU. Expanding the QCMS library with routines that can offload some or all of the computation to the GPU can save time and power.

Technologies like OpenCL or OpenGL can be used to accelerate the color correction process in applications that use QCMS to provide a better user experience through colorimetric accuracy and increased speed. Beignet provides a solid implementation of the OpenCL standard that can be used to offload CPU work to great effect on modern integrated GPUs.


Some of the test images and scenarios used can be found below:

Original test image with sRGB profile.

Transformed image with high gamma output profile.

Same image viewed through a display profile that enhances blue channel.


[1] International Color Consortium:
[2] Standard Illuminant:
[3] Gamma Correction:
[4] CIE 1931 color space:
[5] Beignet:
[6] Skylake:

Special thanks

The authors would like to thank Noel Gordon from Google.
