Intel Technology Journal - Featuring Intel's Recent Research and Development
Converged Communications
Volume 10    Issue 01    Published February 15, 2006
ISSN 1535-864X    DOI: 10.1535/itj.1001.06

  Section 8 of 15  
Using Intel® Technologies to Build Next-Generation Media Servers
INTEL® ARCHITECTURE FOR SIGNAL PROCESSING APPLICATIONS

Intel offers an exceptional signal-processing platform for building next-generation media servers with the lowest total cost of ownership.

Intel® Integrated Performance Primitives

Intel Integrated Performance Primitives (Intel IPP) are a highly optimized suite of libraries for audio, video, imaging, cryptography, speech recognition, and signal processing functions [22]. To maximize performance, Intel IPP uses advanced performance-tuning techniques such as pre-fetching and cache-blocking, and avoids data and trace-cache misses as well as branch mispredictions. Intel IPP exploits instruction set extensions and processor features such as Intel Wireless MMX technology, Streaming SIMD Extensions (SSE), and HT Technology.

The Intel IPP libraries can be linked to the application as dynamically loadable modules, which makes the application platform independent. The libraries automatically detect the underlying processor platform at run-time and execute the function implemented for that particular platform. Table 6 lists some of the Intel IPP functions that are related to media processing. Intel NetStructure® Host Media Processing Software uses Intel IPP for high performance. Figure 8 demonstrates the performance gain offered by Intel IPP over compiled C code [22]. Figure 9 shows Intel IPP encoder performance for H.263 profile 0, QCIF at 15 frames per second.
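The run-time dispatch described above can be pictured with a small sketch. This is not the Intel IPP API: the function names, the feature strings, and the dot-product primitive are all hypothetical, and every variant here computes the same result so that only the selection logic is illustrated.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

# Hypothetical variants of one primitive. In a real library each would be
# hand-tuned for one instruction set; here they are functionally identical.
def dot_generic(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def dot_sse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def dot_sse3(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Ordered from most to least specialized; None means "no requirement".
_DOT_VARIANTS: List[Tuple[Optional[str], Callable]] = [
    ("sse3", dot_sse3),
    ("sse", dot_sse),
    (None, dot_generic),
]

def select_dot(cpu_features: Set[str]) -> Callable:
    """Return the most specialized variant the detected features allow."""
    for required, impl in _DOT_VARIANTS:
        if required is None or required in cpu_features:
            return impl
    raise RuntimeError("no usable implementation")  # unreachable with a generic fallback

# The feature set is passed in explicitly; how it is detected is
# platform-specific and outside the scope of this sketch.
dot = select_dot({"sse", "sse3"})
print(dot.__name__, dot([1.0, 2.0], [3.0, 4.0]))
```

The application calls one entry point and never names a processor, which is the property the paragraph above attributes to dynamic linking.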

Table 6: Intel IPP
 
| Media Processing Function | Intel IPP Available |
|---|---|
| Audio/Video Play and Record, Audio and Video Transcoding for multi-media connections over IP | Audio Codecs: G.711, G.723.1, G.729, G.722, G.722.1, G.726, G.728, GSM-AMR, MP3, AAC, AC3. Video Codecs: H.263, H.264, MPEG4 |
| AGC, Tone Detection, Tone Generation, VAD, Conferencing Summer | Signal Processing and Vector Math Library: Digital filtering (Adaptive, FIR, IIR), Fourier Transforms, Signal Generation |
| Speech Recognition | Audio Processing: Acoustic Echo Cancellation, Noise Reduction, VAD, Feature Extraction. Speech Processing: Pitch Detection, Speech Resampling |
| Echo Cancellation | G.168-2000 compliant Echo Canceller |
| Secure RTP | Symmetric Cryptography (DES, 3DES), Hash Algorithm, Data Authentication (DES, 3DES) |



Figure 8: Intel IPP 5.0 average performance gain over compiled C code [22] (footnote 3)
 



Figure 9: H.263 IPP encoder performance (footnote 4)
 

Figure 10 shows performance data for advanced video processing functions such as text overlay and tiling for video conferencing with video streams of QCIF @ 15fps.
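As a rough illustration of what tiling involves, the sketch below combines four QCIF frames (176x144 pixels) into one CIF-sized frame (352x288) in a 2x2 layout. It is only a sketch under simplifying assumptions: it handles a single 8-bit luma plane stored as a list of row byte strings, the helper names are hypothetical, and it says nothing about how the measured implementation works.

```python
from typing import List

QCIF_W, QCIF_H = 176, 144  # QCIF luma dimensions; CIF is twice each

Frame = List[bytes]  # one bytes object per row, one byte per pixel (luma only)

def tile_2x2(frames: List[Frame]) -> Frame:
    """Tile four QCIF luma planes into one 352x288 plane.

    Layout: frames[0] top-left, frames[1] top-right,
            frames[2] bottom-left, frames[3] bottom-right.
    """
    if len(frames) != 4:
        raise ValueError("exactly four frames are required")
    for f in frames:
        if len(f) != QCIF_H or any(len(row) != QCIF_W for row in f):
            raise ValueError("every frame must be 176x144")
    # Concatenate matching rows side by side, then stack the two halves.
    top = [frames[0][y] + frames[1][y] for y in range(QCIF_H)]
    bottom = [frames[2][y] + frames[3][y] for y in range(QCIF_H)]
    return top + bottom

def flat_frame(value: int) -> Frame:
    """Build a uniform test frame so each tile is easy to tell apart."""
    return [bytes([value]) * QCIF_W for _ in range(QCIF_H)]

cif = tile_2x2([flat_frame(v) for v in (10, 20, 30, 40)])
print(len(cif[0]), len(cif))
```

Each conference participant's stream must be decoded before tiling and the combined frame re-encoded afterwards, so the copy shown here is only one part of the per-frame cost.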



Figure 10: Video manipulation performance (footnote 4)
 

Processors and Chipsets

Intel offers a wide variety of processors and chipsets to address the desktop, server, and mobile market segments. Table 7 lists a sampling of the processors available today, addressing each of the market segments. These processors offer a combination of features and technologies such as HT Technology, dual-core processors, increased amounts of L2 and L3 cache, increased FSB speeds, Intel Extended Memory 64 Technology (Intel EM64T), Streaming SIMD Extensions 3 (SSE3), and multi-processor platforms (MP). The hardware platform combined with the software tools already mentioned makes Intel Architecture Processors a very compelling media processing hardware platform from both a functionality and a density standpoint. The impact of HT Technology, dual-core processors, multi-processors, cache sizes, Intel C++ Compiler, VTune Performance Analyzer, and Intel IPP on media processing is discussed in detail. All data presented in this paper have been generated on a specific platform configuration that may vary for each set of results. All numbers presented are representative of the platform configuration for only that given set of data.

Table 7: Sampling of Intel processors
 
  Pentium® Processor Extreme Edition 840 (footnote 5) Pentium® 4 Processor Extreme Edition Intel® Xeon® Processor Dual-Core Intel® Xeon® Processor MP
System Type UP UP DP MP
L2 Cache 2x1MB 512 KB 2x2MB 1MB
L3 Cache N/A 2 MB N/A 8 MB
Clock Speed 3.20 GHz 3.46 GHz 2.80 GHz 3.33 GHz
FSB (MHz) 800 MHz 1066 MHz 800 MHz 667 MHz
dual-core    
Intel® EM64T  
HT
Execute Disable Bit  
SSE3      
EIST      
DBS    
Chipset Intel® 955X Express Intel® 925XE Intel® E7520 Intel® E8500
Memory Type Dual-Channel DDR2 Dual-Channel DDR2 400/533 (CL3) Dual-Channel DDR, DDR2 Quad-Channel DDR, DDR2

Table 8: Overview of Intel processor technologies
 
System Type: Number of processor sockets in a platform or computer system. Uni-processor means that only one processor can be present. Dual-processor systems allow for up to two processors, and multi-processor systems allow for more than two processors.
L2 Cache: Ultra-fast memory that buffers information being transferred between the processor and the slower RAM in an attempt to speed these types of transfers.
L3 Cache: Size of 3rd-level cache, typically larger than L2. L3 Cache is ultra-fast memory that buffers information being transferred between the processor and the slower RAM in an attempt to speed these types of transfers. Integrated L3 cache provides a faster path to large data sets stored in cache on the processor.
FSB (Front Side Bus): Bus connecting the processor to the main external memory. The frequency shown in the table represents the operating frequency of the bus.
Intel® EM64T: Intel Extended Memory 64-bit technology enables 64-bit computing.
Execute Disable Bit: Allows the processor to classify areas in memory where application code can execute and where it cannot.
SSE3: Internet Streaming SIMD (Single Instruction Multiple Data) Extensions are instructions that reduce the overall number of instructions required to execute a particular program task. 3 refers to the 3rd iteration of these enhanced instructions.
EIST: Enhanced Intel SpeedStep® technology enables a system to dynamically adjust processor voltage and core frequency.
DBS: Demand-Based Switching uses EIST to dynamically lower the processor voltage and core frequency based on processor utilization.

Hyper-Threading Technology

HT Technology boosts computing performance by enabling a single processor to function as two "virtual" processors by executing two threads in parallel, allowing software to multi-task more effectively [23]. It provides thread-level parallelism on a single processor, resulting in more efficient use of processor resources, higher processing throughput, and improved performance of multi-threaded applications. Multi-threaded software divides its workloads into processes and threads that can be independently scheduled and dispatched by the operating system. In a multiprocessor system, those threads execute on different processors, whereas in a single processor that is HT Technology enabled, the threads execute in parallel on a single processor. HT Technology uses the idle cycles in a processor core, such as stalls due to memory access, to execute another thread in parallel, for example arithmetic computation that uses only internal registers.

Prior to HT Technology, media-processing functions such as playing audio (disk or memory I/O intensive) and decoding a G.729a VoIP packet would complete in order of priority. If the thread executing audio play happened to be higher priority, the G.729a decode thread would not be able to take full advantage of the processor cycles that the audio play thread spent stalled on memory or disk I/O. With HT Technology enabled, the G.729a decode thread can execute in parallel during the stalled cycles by doing any arithmetic operations that do not require memory I/O.

Figure 11 and Figure 12 compare the performance of HT Technology with single- and dual-processor systems for a specific media processing application: a SIP-to-SIP connection configured for G.711 and G.729a RTP connections. For a G.711 RTP connection, HT Technology results in close to a 50% reduction in CPU utilization, equivalent to executing on a dual-processor configuration. For a G.729a RTP connection, HT Technology results in a 15-20% reduction in CPU utilization, whereas a dual-processor configuration without HT Technology results in a 30-40% reduction in CPU utilization. In both test application configurations, a dual-processor (x2) platform with HT Technology enabled (x2) comes close to achieving the theoretical maximum of a fourfold performance increase over a single-processor platform without HT Technology. The difference can be attributed to more frequent conflicts between the various channel threads using shared resources such as the execution engine, external memory access, and cache. If there is contention in usage of these resources by two different threads, the operations are serialized. A G.711 encode/decode algorithm has a small code and data footprint that will most likely not result in any cache misses. The G.729 algorithm has a much larger data footprint and data structure per channel and there is likely to be more cache thrashing and contention for external memory access.
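To relate the utilization figures quoted above to throughput, the short sketch below converts a reduction in CPU utilization into an equivalent capacity factor. It rests on an assumption that is not stated in the article: that the workload is fixed and that the number of channels a platform can carry scales inversely with per-channel CPU utilization. The function name is illustrative.

```python
def capacity_gain(utilization_reduction: float) -> float:
    """Equivalent capacity factor for a fractional drop in CPU utilization.

    If the same workload needs (1 - r) of the CPU it needed before, then,
    under the linear-scaling assumption, the platform can carry 1 / (1 - r)
    times as much work.
    """
    r = utilization_reduction
    if not 0.0 <= r < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - r)

# Reductions quoted in the text above.
cases = [
    ("G.711, HT Technology", 0.50),
    ("G.729a, HT Technology (low)", 0.15),
    ("G.729a, HT Technology (high)", 0.20),
    ("G.729a, dual-processor (low)", 0.30),
    ("G.729a, dual-processor (high)", 0.40),
]
for label, r in cases:
    print(f"{label}: {capacity_gain(r):.2f}x")
```

Under this assumption a 50% reduction corresponds to a factor of 2, which is why the G.711 result is described as equivalent to a dual-processor configuration, while a 15-20% reduction corresponds to a factor of roughly 1.18 to 1.25.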



Figure 11: HT Technology and dual-processor performance, G.711 (footnote 6)
 



Figure 12: HT Technology and dual-processor performance, G.729 (footnote 6)
 

Dual-Core Processors

Dual-core processors provide increased computing performance by combining two full processing cores into a single processor. These processors are well suited to multi-tasking environments because there are two complete execution cores instead of one, each with an independent interface to the FSB. Unlike HT Technology, dual-core processors offer complete parallelism of threads through independent execution units with no contention between threads in accessing resources such as the execution engine, FSB, and cache. Dual-core processors also open up a multitude of opportunities for executing completely independent tasks on each core, such as gaming on one core while running a virus scan on the other. The dual-processor results in Figure 12 indicate the potential performance boost due to dual-core processors.

Cache–Intel Architecture Processors

Intel Architecture processors offer increasing amounts of L2 and in some cases L3 cache, with significant potential benefits for cache-friendly applications. Media-processing applications benefit from increasing levels of cache because CPU-intensive signal-processing algorithms execute more quickly, with fewer processor stalls associated with fetching instructions and data from memory. A typical signal-processing chain for an audio play application to a SIP endpoint involves fetching a block from memory or file, parsing the block for header and raw data, feeding the raw data into a decoder, and adjusting the gain in a gain block. The output of the gain block is then fed into an encoder, followed by packetization into RTP packets. A multi-channel voice mail server would repetitively perform the signal processing functions associated with audio play, with the instructions executing out of cache. Data for each channel are fetched only at the start of the processing, with subsequent blocks accessing the data out of cache. Large amounts of L2 and L3 cache ensure that there are fewer cache misses during the entire signal-processing chain. Figure 13 and Figure 14 show the performance boost (reduction of 12-15% in CPU utilization) due to L3 cache for an audio play application.
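The audio play chain described above (parse, decode, gain, encode, packetize) can be sketched as a sequence of small stages. Everything specific in this sketch is an assumption made for illustration: the 4-byte block header, the use of 16-bit big-endian linear PCM as a stand-in for a real codec such as G.711 or G.729a, the dynamic payload type 96, and all function names. Only the 12-byte fixed RTP header layout follows the RTP specification.

```python
import struct
from typing import List

def parse_block(block: bytes) -> bytes:
    """Split a stored block into header and raw data.

    Hypothetical header: 2-byte sample count, 2 reserved bytes.
    """
    count, _reserved = struct.unpack("!HH", block[:4])
    raw = block[4:4 + 2 * count]
    if len(raw) != 2 * count:
        raise ValueError("block shorter than its header claims")
    return raw

def decode(raw: bytes) -> List[int]:
    """Stand-in decoder: 16-bit big-endian linear PCM to integer samples."""
    return list(struct.unpack("!%dh" % (len(raw) // 2), raw))

def apply_gain(samples: List[int], gain: float) -> List[int]:
    """Gain block: scale each sample and clip to the 16-bit range."""
    return [max(-32768, min(32767, int(round(s * gain)))) for s in samples]

def encode(samples: List[int]) -> bytes:
    """Stand-in encoder: integer samples back to 16-bit big-endian PCM."""
    return struct.pack("!%dh" % len(samples), *samples)

def packetize(payload: bytes, seq: int, timestamp: int, ssrc: int,
              payload_type: int = 96) -> bytes:
    """Prepend a fixed RTP header: version 2, no padding, extension, or CSRCs."""
    header = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload

def play_block(block: bytes, gain: float, seq: int, timestamp: int,
               ssrc: int) -> bytes:
    """Run one block through the whole chain and return an RTP packet."""
    samples = apply_gain(decode(parse_block(block)), gain)
    return packetize(encode(samples), seq, timestamp, ssrc)

block = struct.pack("!HH", 3, 0) + struct.pack("!3h", 100, -200, 30000)
packet = play_block(block, gain=2.0, seq=1, timestamp=160, ssrc=0x1234)
print(len(packet))
```

Because every channel runs the same stages in the same order, the code for the chain is shared across channels while each channel's working data stays small, which is the access pattern the paragraph describes as cache friendly.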



Figure 13: Cache performance, G.729a (footnote 7)
 



Figure 14: Cache performance, G.711 (footnote 7)
 

 

  • 3 All code running on a PC with an Intel® Xeon® processor supporting Hyper-Threading Technology, 3.6 GHz, 1 MByte L2 cache and 2 GB RAM using Microsoft Windows* XP.
  • 4 All code running on a PC with an Intel® Pentium® M processor supporting Hyper-Threading Technology, 2.1 GHz, 2 MByte L2 cache and 1 GB RAM using Microsoft Windows* XP.
  • 5 Intel® processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.
  • 6 All code running on a PC with dual Intel® Xeon® processors supporting Hyper-Threading Technology, 3.2 GHz, 512 KByte L2 cache and 1 GB RAM using RedHat* Enterprise Linux.
  • 7 All code running on a PC with an Intel® Pentium® 4 processor supporting Hyper-Threading Technology (disabled), 3.2 GHz, 512 KByte L2 cache, 2 MByte L3 cache and 1GB RAM using RedHat* Enterprise Linux.

 

