ID 683013
Date 11/20/2017
Public

## 1.4.3.1. Huffman Code (Tree) Generation

With the existing FPGA software solutions, one of the main bottlenecks is the Huffman code or tree generation. The Huffman code generation first computes the frequency of each character in the input data, and then uses this frequency to construct the Huffman tree and related codes corresponding to each character.

It is observed that the frequency computation is the major bottleneck but not the tree generation from the codes. Following is the code snippet that computes the frequency:

unsigned int freq[256]
for (int i = 0; i < length; ++i) {
freq[data[i]]++;
}
The existing implementations with the Huffman frequency table computed on a CPU had a maximum throughput of up to 4.5 GB/s and the worst case was up to 1.2 GB/s with multiple threads, when data was homogeneous. The following two aspects of computing Huffman data frequency table is considered:
• Improving worst case scenarios

### Worst Case Scenarios

You can solve this issue by adding a padding between the frequency table entries for each character. In case of the Huffman frequency counter, the loop was modified and updated to have 16 different tables. The data is accumulated after the loop ends.

unsigned int freq[256]

unsigned int freq_tab[16][256];
for (int i = 0; i < length/16; ++i) {
for (int j = 0; j < 16; ++j) {
freq_tab[j][data[i]]++;
}
}

for (int i = 0; i < 16; ++i) {
for (int j = 0; j < 256; ++j) {
freq[j] += freq_tab[i][j];
}
}

With the above distributed table update, the frequency computation throughput became identical for all cases.

### Threads in Huffman Frequency Counter

Data is divided into multiple blocks and the frequency table of each block is computed independently in separate threads. The frequency tables are then merged at the end to generate the final table.

While experimenting, it was observed that as the number of threads increased, the frequency computation did not scale. This issue can be improved with CPU affinity of each thread by assigning threads to specific CPUs, to keep caches valid for longer than usual. By setting CPU affinity to threads, you can improve the best performance of the frequency computation (up to 15 GB/s).

Figure 9. Update 16 Tables for 16 Adjacent Chars in the Stream