Background
Kingsoft Cloud* [1] is a public cloud service provider. It provides many services, including cloud storage, and a massive number of images are stored in Kingsoft Cloud storage. Besides data storage, Kingsoft also provides image processing services to its public cloud customers. Customers can use these services for image scaling, cutting, quality changing, and watermarking according to their service requirements, which helps them provide the best experience to end users.
In the next section we will see how Kingsoft optimizes its image processing tasks to run on systems equipped with Intel® Xeon® processors.
Kingsoft Image Processing and Intel® Xeon® Processors
Intel® Advanced Vector Extensions 2 (Intel® AVX2) [8] can accelerate compression and decompression when processing a JPEG file. Those tasks are usually done using libjpeg-turbo [2], a widely used JPEG software codec. Unfortunately, the libjpeg-turbo library is implemented using Intel® Streaming SIMD Extensions 2 (Intel® SSE2) [9], not Intel AVX2.
To take advantage of Intel AVX2, Kingsoft engineers modified libjpeg-turbo to add Intel AVX2 support; the modified library is available in [3]. The new library accelerates color space conversion, down/up sampling, integer sample conversion, fast integer forward discrete cosine transform (DCT) [4], slow integer forward DCT, integer quantization, and integer inverse DCT.
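To make one of these kernels concrete, the following is a minimal scalar sketch of integer color space conversion (RGB to YCbCr) using the standard JFIF coefficients and 16-bit fixed-point arithmetic. It is not code from libjpeg-turbo or from Kingsoft's modified library; the names `fix`, `rgb_to_ycbcr`, and `convert_row` are hypothetical. The point is only that every pixel goes through the same multiply, add, and shift sequence, which is the kind of loop that SIMD instructions such as Intel SSE2 or Intel AVX2 can apply to many pixels at once.

```python
# Illustrative scalar sketch only; not taken from libjpeg-turbo.
SCALE_BITS = 16                     # fixed-point fraction bits (assumed here)
ONE_HALF = 1 << (SCALE_BITS - 1)    # rounding term
CENTER = 128 << SCALE_BITS          # chroma offset for 8-bit samples


def fix(x):
    """Convert a float coefficient to a fixed-point integer."""
    return int(x * (1 << SCALE_BITS) + 0.5)


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr with integer arithmetic."""
    y = (fix(0.299) * r + fix(0.587) * g + fix(0.114) * b
         + ONE_HALF) >> SCALE_BITS
    cb = (-fix(0.168736) * r - fix(0.331264) * g + fix(0.5) * b
          + CENTER + ONE_HALF - 1) >> SCALE_BITS
    cr = (fix(0.5) * r - fix(0.418688) * g - fix(0.081312) * b
          + CENTER + ONE_HALF - 1) >> SCALE_BITS
    return y, cb, cr


def convert_row(pixels):
    """Apply the same conversion to every pixel in a row.

    A SIMD implementation would process several pixels per instruction
    instead of one pixel per loop iteration.
    """
    return [rgb_to_ycbcr(r, g, b) for (r, g, b) in pixels]


row = [(255, 255, 255), (0, 0, 0), (255, 0, 0)]
converted = convert_row(row)
print(converted)
```

Because each output sample depends only on its own input pixel, the loop has no cross-iteration dependencies, which is what makes it a natural candidate for vectorization.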
Besides using Intel AVX2 to reduce processing time, Kingsoft image processing tasks also gain performance on systems equipped with Intel® Xeon® processors E5 v4 over systems equipped with Intel® Xeon® processors E5 v3, because the newer processors have more cores and a larger cache. Image processing tasks such as image cutting, scaling, and quality changing are cache-sensitive workloads, so a larger CPU cache lets them run faster. More cores also mean that more images can be processed in parallel. Together, shorter per-image processing time and more parallelism increase overall performance.
Kingsoft also makes use of the Intel® Math Kernel Library (Intel® MKL) [7], whose functions are optimized using Intel AVX2.
The next section shows how we tested the Kingsoft image processing workload to compare the performance of the current-generation Intel Xeon processors E5 v4 with that of the previous-generation Intel Xeon processors E5 v3.
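As a rough worked example of the core and cache argument, the sketch below compares the two processors using the core counts and cache sizes listed in the Test Configuration section of this article. It assumes those figures are per-processor values in a dual-socket system, and the helper name `relative_gain` is hypothetical. The result is only a ratio of hardware resources, not a measured or predicted speedup.

```python
# Values copied from the Test Configuration section; treated as
# per-processor figures in a dual-socket system (an assumption).
V3 = {"sockets": 2, "cores": 18, "cache_mb": 45}   # E5-2699 v3
V4 = {"sockets": 2, "cores": 22, "cache_mb": 55}   # E5-2699 v4


def relative_gain(new, old):
    """Return the percentage increase of new over old."""
    return (new - old) / old * 100.0


total_cores_v3 = V3["sockets"] * V3["cores"]
total_cores_v4 = V4["sockets"] * V4["cores"]

core_gain = relative_gain(total_cores_v4, total_cores_v3)
cache_gain = relative_gain(V4["cache_mb"], V3["cache_mb"])

# Cache available per core on each processor.
cache_per_core_v3 = V3["cache_mb"] / V3["cores"]
cache_per_core_v4 = V4["cache_mb"] / V4["cores"]

print(f"cores: {total_cores_v3} -> {total_cores_v4} (+{core_gain:.1f}%)")
print(f"cache per processor: +{cache_gain:.1f}%")
print(f"cache per core: {cache_per_core_v3} MB vs {cache_per_core_v4} MB")
```

Under these assumptions the newer system has about 22 percent more cores and about 22 percent more cache per processor, while the cache per core is the same. The core ratio is at best an upper bound on what core count alone could contribute if the workload scaled perfectly.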
Performance Test Procedure
We performed tests on two platforms. One system was equipped with the Intel® Xeon® processor E5-2699 v3 and the other with the Intel® Xeon® processor E5-2699 v4. We wanted to see how much performance improved when comparing the previous and the current generation of Intel Xeon processors and how Intel AVX2 plays a role in reducing the image processing time.
Test Configuration
System equipped with the dual-socket Intel Xeon processor E5-2699 v4
- System: Preproduction
- Processors: Intel Xeon processor E5-2699 v4 @2.2 GHz
- Cache: 55 MB
- Cores: 22
- Memory: 128 GB DDR4-2133 MT/s
System equipped with the dual-socket Intel Xeon processor E5-2699 v3
- System: Preproduction
- Processors: Intel Xeon processor E5-2699 v3 @2.3 GHz
- Cache: 45 MB
- Cores: 18
- Memory: 128 GB DDR4-2133 MT/s
Operating System: Red Hat Enterprise Linux* 7.2, kernel 3.10.0-327
Software:
- GNU* C Compiler Collection 4.8.2
- GraphicsMagick* 1.3.22
- libjpeg-turbo 1.4.2
- Intel MKL 11.3
Application: Kingsoft image cloud workload
Test Results
The following test results show the performance improvement when running the application on systems equipped with the current and previous generations of Intel Xeon processors, and when running the application with libjpeg-turbo builds without and with Intel AVX2 support.
Figure 1: Comparison between the application using the Intel® Xeon® processor E5-2699 v3 and the Intel® Xeon® processor E5-2699 v4.
Figure 1 compares the application running on the Intel Xeon processor E5-2699 v3 with the same application running on the Intel Xeon processor E5-2699 v4. The improvement comes from the additional cores and larger cache of the Intel Xeon processor E5 v4; Intel AVX2 is available on both generations, so it does not by itself explain the difference between them.
Figure 2: Performance comparison between libjpeg-turbo without Intel® Advanced Vector Extensions 2 (Intel® AVX2) support and libjpeg-turbo with Intel AVX2 support.
Figure 2 shows that, on a system equipped with the Intel Xeon processor E5 v4, application performance improves by up to 45 percent with the Intel AVX2 build of libjpeg-turbo compared with the Intel SSE2 build. The improvement comes from Intel AVX2 integer instructions operating on 256-bit vectors, twice the width of the 128-bit vectors used by Intel SSE2.
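The register-width difference behind this result can be illustrated with a little arithmetic. The sketch below counts how many 16-bit samples fit in a 128-bit Intel SSE2 register and in a 256-bit Intel AVX2 register, and how many vector iterations a row of samples would need in each case. The function names and the 1920-sample row are hypothetical choices for illustration.

```python
import math

SSE2_BITS = 128   # width of an SSE2 (XMM) register
AVX2_BITS = 256   # width of an AVX2 (YMM) register


def lanes(register_bits, element_bits=16):
    """Number of elements of the given size that fit in one register."""
    return register_bits // element_bits


def vector_iterations(samples, register_bits, element_bits=16):
    """Vector loop iterations needed to cover all samples in a row."""
    return math.ceil(samples / lanes(register_bits, element_bits))


row_samples = 1920  # hypothetical row length, e.g. one line of a 1080p image

sse2_iters = vector_iterations(row_samples, SSE2_BITS)
avx2_iters = vector_iterations(row_samples, AVX2_BITS)

print(f"SSE2: {lanes(SSE2_BITS)} lanes, {sse2_iters} iterations")
print(f"AVX2: {lanes(AVX2_BITS)} lanes, {avx2_iters} iterations")
```

Halving the number of vector iterations is an upper bound rather than a prediction. Memory bandwidth and the parts of the workload that are not vectorized limit the end-to-end gain, which is consistent with an observed improvement of up to 45 percent rather than a doubling.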
Conclusion
Kingsoft added support for Intel AVX2 to the libjpeg-turbo library. This allows applications using the modified library to take advantage of Intel AVX2 on Intel Xeon processors E5 v4. More cores and a larger cache also play an important role in improving the performance of applications running on systems equipped with these processors compared with systems equipped with the previous generation of Intel Xeon processors.
References