AN 831: Intel® FPGA SDK for OpenCL™: Host Pipelined Multithread

ID 683013
Date 11/20/2017
Public

1.4. Design Example: Data Compression Algorithm

This section presents a design example based on a data compression algorithm to show how the pipelining framework influences total system performance.
The following figure illustrates the sequential tasks of the data compression algorithm. Most of these tasks must execute in sequential order for each input file.
Figure 4. Data Compression Algorithm Sequential Tasks
  • Huffman calculation computes the frequency table of all symbols in the input file and generates the Huffman code.
  • Deflate, which comprises the LZ77 and Huffman encoding algorithms, generates the compressed data stream.
    Attention: This is the most time-consuming task in the sequence.
    • The LZ77 engine searches the input file for repeated strings and replaces each repetition with a pointer. This pointer consists of the distance to the previous occurrence of the string and the length of the match.
    • Huffman encoding further compresses the file by using the Huffman code.
  • CRC is calculated for the input file and later appended to the end of the compressed file for error detection. This is the only task that can run in parallel with the Deflate task.
  • Output generation, called Metadata, is the last task in the sequence; it assembles the output stream as defined in RFC 1951.
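As a rough illustration of these sequential stages, the sketch below uses Python's standard zlib module (which implements DEFLATE per RFC 1951) to run the Deflate, CRC, and output-generation steps in order. The CRC-appending framing is a simplification for illustration, not the exact Metadata format used by the design.

```python
import zlib

def compress_sequential(data: bytes) -> bytes:
    # Deflate (LZ77 + Huffman encoding) -- the most time-consuming stage.
    # zlib's compressobj with wbits=-15 emits a raw RFC 1951 stream.
    deflater = zlib.compressobj(level=9, wbits=-15)
    deflate_stream = deflater.compress(data) + deflater.flush()

    # CRC of the *input* file, used later for error detection.
    crc = zlib.crc32(data)

    # Metadata / output generation: append the CRC to the compressed
    # stream (a simplified stand-in for the real container format).
    return deflate_stream + crc.to_bytes(4, "little")

payload = b"abracadabra" * 1000
out = compress_sequential(payload)
assert len(out) < len(payload)  # repeated input compresses well
```

Because each stage waits for the previous one, the elapsed time of this function is the sum of the individual stage times, matching the figure above.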

As illustrated in Data Compression Algorithm Sequential Tasks, the total time and latency required to generate the compressed output from any input file is the sum of the processing times of the individual tasks.

To accelerate this algorithm, an FPGA platform is used. OpenCL* is a powerful tool that makes a hardware implementation much faster to develop than an RTL design, especially for software programmers. Using the Intel® FPGA SDK for OpenCL™ for this purpose helps optimize the Deflate algorithm (the most time-consuming task) for the FPGA platform, achieving a throughput of more than 3 GB/s for input files larger than 4 MB.

CRC is also offloaded to the FPGA with the same throughput as the Deflate task. The CRC task can run in parallel with the Deflate task on the FPGA hardware because it requires only the input file to calculate the checksum.
Note: Development and optimization on the FPGA hardware are not covered in this document.
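A host-side analogue of this Deflate/CRC overlap can be sketched with two threads: because the CRC depends only on the raw input, it can run while Deflate is still in flight. This sketch again uses zlib as a stand-in for the FPGA kernels (zlib releases the GIL during compression, so the two tasks genuinely overlap).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def deflate(data: bytes) -> bytes:
    # Raw RFC 1951 stream, standing in for the FPGA Deflate kernel.
    c = zlib.compressobj(level=9, wbits=-15)
    return c.compress(data) + c.flush()

def compress_with_parallel_crc(data: bytes) -> tuple:
    # CRC needs only the input file, so it is launched alongside
    # Deflate instead of after it, mirroring the FPGA arrangement.
    with ThreadPoolExecutor(max_workers=2) as pool:
        deflate_future = pool.submit(deflate, data)
        crc_future = pool.submit(zlib.crc32, data)
        return deflate_future.result(), crc_future.result()

payload = b"host pipelined multithread" * 512
stream, crc = compress_with_parallel_crc(payload)
```

With the two tasks overlapped, the pair completes in roughly the time of the slower one (Deflate) rather than the sum of both.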
The following figure illustrates how the tasks are arranged on a CPU and an FPGA:
Figure 5. Arrangement of Tasks on CPU and FPGA

Adding the Huffman calculation and Metadata processes to the CPU as a pre-process and post-process for the Deflate algorithm caused a large drop in total system performance. Using the pipelining framework along with the host channel streaming feature to pass data between the host and the device restored total system throughput to that of the Deflate algorithm when multiple input files are compressed. This proposed system architecture also minimizes latency.
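The idea behind the pipelining framework can be sketched as three queue-connected stages, one thread each: CPU pre-processing, the device Deflate task, and CPU post-processing. While one file is on the device, the next file's pre-process and the previous file's post-process run on the CPU, so with many input files the system throughput approaches that of the Deflate stage alone. All names here are illustrative, and zlib again stands in for the device kernel.

```python
import queue
import threading
import zlib

NUM_FILES = 8
DONE = object()  # sentinel marking the end of the stream of files

def stage(fn, inbox, outbox):
    # Generic pipeline stage: pull an item, process it, pass it on.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))

def pre_process(data):   # CPU: Huffman calculation (stand-in)
    return data

def run_device(data):    # device: Deflate + CRC (zlib as a stand-in)
    c = zlib.compressobj(wbits=-15)
    return (c.compress(data) + c.flush(), zlib.crc32(data))

def post_process(item):  # CPU: Metadata / output generation (simplified)
    stream, crc = item
    return stream + crc.to_bytes(4, "little")

q1, q2, q3, results = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(pre_process, q1, q2)),
    threading.Thread(target=stage, args=(run_device, q2, q3)),
    threading.Thread(target=stage, args=(post_process, q3, results)),
]
for t in threads:
    t.start()
for i in range(NUM_FILES):
    q1.put(bytes([i % 256]) * 4096)  # one "input file" per item
q1.put(DONE)

outputs = []
while (item := results.get()) is not DONE:
    outputs.append(item)
for t in threads:
    t.join()
```

Because each stage has a single worker and the queues are FIFO, the outputs arrive in input order while the three stages overlap across different files.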

For more information about the host channel streaming feature provided by the Intel® FPGA SDK for OpenCL™, refer to the Host Channel Streaming Feature section. For information about applying the pipelining framework to the data compression algorithm, refer to the Pipelining Framework Details section. For information about the CPU optimization required for fine-tuning the framework, refer to the Fine Tuning the Framework Design section, and for final results, refer to the Performance Evaluation section.