- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
Parallel Software Development with Intel® Threading Analysis Tools
MULTIPLE PATTERN MATCHING ALGORITHM
Aho-Corasick_Boyer-Moore Algorithm
The Aho-Corasick_Boyer-Moore (AC_BM) [8] algorithm is an efficient string pattern matching algorithm that is derived from the Boyer-Moore algorithm to improve efficiency when there are multiple string patterns to search against. Multiple string pattern matching is one of the core problems in the rule-based Intrusion Detection System (IDS) such as Snort [9]. The AC_BM algorithm can be used to improve the efficiency of the detection engine of the IDS. The basic idea of the algorithm is that the string patterns are first built into a pattern tree, and then the algorithm slides the pattern tree using bad character and good prefix shifts to find the matches. Figure 1 shows the pseudo code of the AC_BM algorithm. In the code, we use acbm_init() to construct the pattern tree, and we use acbm_search() to compare the given text with the pattern tree.
Serial Implementation

Figure 1: Serial implementation of AC_BM algorithm
click image for larger view
Before moving on to parallelize the application and tuning the performance, it is better to baseline the performance of the serial version. We test the performance against different workloads based on different file sizes. System environment:
- OS: Windows XP* SP2;
- CPU: Intel® Core™2 Duo processor 63001 @ 1.86 GHz; Memory: 1.00 GB
Figure 2 shows the test result of matching patterns against different file sizes:
Text size 1M 2M 4M 8M 16M 32M
Time (ms) 12.167 24.826 49.327 96.304 191.728 383.353

Figure 2: Performance of different file sizes
click image for larger view
Analysis
Identifying performance bottlenecks is the first step in improving the performance of an application. For simple applications, it's possible to find the bottleneck by looking at the algorithm and source code, and by analyzing the nature of an application. However, for large and complicated applications, it's difficult to predict performance issues when the bottlenecks are not only dependent on the algorithm, but also on the execution environment such as the processor or memory system. That is why we need a performance analyzing tool.
The Intel VTune™ Performance Analyzer finds performance bottlenecks by automating the process of data collection with three types of data collectors: sampling, call graph, and counter monitor. These data collectors help find the hotspots and the bottlenecks in the system, application, and micro-architecture levels. We use the VTune Performance Analyzer to profile the serial version of the AC_BM application and uncover the parallel opportunities.
In counter monitor view (as shown in Figure 3), we find that the average processor time is only 39.453%, which means that the processors are not fully utilized; therefore, we can improve the performance by introducing more threads.

Figure 3: Counter monitor view
click image for larger view
The call graph view (as shown in Figure 4) shows the critical path of the application. It helps to determine which function in the call tree to thread that will result in the best performance gain. Threading a function higher up in the tree results in a coarser-grain level of parallelism, while threading a child function lower in the tree results in a finer-grain level of parallelism. In the graph, we can see that the only function we can thread on the critical path is acbm_search, and acbm_search doesn't have any child function.

Figure 4: Call graph view
click image for larger view
Threading Methods
Currently there are several threading strategies: OS threading libraries such as the Win32* API; compiler directives OpenMP*; Message passing API (MPI); and the newly introduced Intel® TBB. Each method has its own features and there is no simple way to determine which threading method is the best. It mainly depends on the specific application to be threaded.
In the following sections, we parallelize the AC_BM algorithm using the Win32 threading library and Intel TBB, and we compare the performance of the two.
Estimate the Performance Speedup
Before threading the application it is better to estimate the performance speedup of the parallelization: on the one hand, you may know about the upper bound of the performance limit and set the optimization goal for threading and tuning; on the other hand, estimating the performance speedup helps you to determine when to stop and accept the performance gain without trying for more tuning iterations.
In our system, there are two processors, and in the VTune Performance Analyzer sampling view, we find that the acbm_search function takes up 92.90% of CPU time. According to Amdahl's law
Tparallel = {(1-0.929) + 0.929/2}*Tserial + ON;
Suppose ON -> 0, Scalability -> 1.867
So the upper bound of the speedup is 1.867.
[1] Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.
