|
Intel offers an exceptional signal-processing platform for building next-
generation media servers with the lowest total cost of ownership.
Intel® Integrated Performance Primitives
Intel IPP is a highly optimized suite of libraries for audio, video, imaging,
cryptography, speech recognition, and signal processing functions [22]. To
maximize performance, Intel IPP uses advanced performance-tuning techniques
such as pre-fetching and cache-blocking, and avoids data and trace-cache misses
as well as branch mispredictions. Intel IPP exploits instruction set extensions
and processor features such as Intel Wireless MMX™ technology, Streaming SIMD
Extensions (SSE), and HT Technology.
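As a rough illustration of cache-blocking, the sketch below transposes a matrix in small tiles so that the reads and writes for each tile stay within a limited working set. This is not Intel IPP code: the function name `transpose_blocked` and the tile size are assumptions chosen for illustration, and pure Python will not show the speedup a tuned native implementation would.

```python
def transpose_blocked(matrix, block=4):
    """Transpose a row-major matrix tile by tile (loop tiling).

    Illustrative only: 'block' is a hypothetical tile size; a real
    implementation would size it to the cache of the target processor.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[0] * rows for _ in range(cols)]
    # Walk the matrix one tile at a time instead of one full row at a time.
    for i0 in range(0, rows, block):
        for j0 in range(0, cols, block):
            # Within a tile, both the source and destination stay "close".
            for i in range(i0, min(i0 + block, rows)):
                for j in range(j0, min(j0 + block, cols)):
                    out[j][i] = matrix[i][j]
    return out


if __name__ == "__main__":
    m = [[r * 5 + c for c in range(5)] for r in range(3)]
    print(transpose_blocked(m, block=2))
```

The result is identical to a plain transpose; only the order in which elements are visited changes.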
The Intel IPP libraries can be linked to the application as dynamically
loadable modules, which makes the application platform independent. The
libraries automatically detect the underlying processor at run-time and
execute the function implementation optimized for that processor. Table 6
lists some of the Intel IPP functions related to media processing. Intel
NetStructure® Host Media Processing Software uses Intel IPP for high
performance. Figure 8 shows the performance gain offered by Intel IPP over
compiled C code [22]. Figure 9 shows Intel IPP encoder performance for H.263
profile 0, QCIF at 15 frames per second.
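The run-time dispatch described above can be sketched as a lookup over implementations ordered from most to least specialized. This is a minimal sketch and not the Intel IPP API: the feature name `"sse3"`, the two `dot_*` functions, and `select_implementation` are all hypothetical, and the caller is assumed to supply the detected processor features.

```python
def dot_generic(a, b):
    """Portable fallback: one multiply-add per iteration."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_unrolled(a, b):
    """Stand-in for a processor-tuned variant: four elements per step."""
    total = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        total += (a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3])
    for i in range(n, len(a)):  # leftover elements
        total += a[i] * b[i]
    return total


# Hypothetical registry: (required processor features, implementation),
# ordered from most to least specialized.
IMPLEMENTATIONS = [
    ({"sse3"}, dot_unrolled),
    (set(), dot_generic),
]


def select_implementation(cpu_features):
    """Return the first implementation whose requirements are all met."""
    available = set(cpu_features)
    for required, func in IMPLEMENTATIONS:
        if required <= available:
            return func
    raise RuntimeError("no suitable implementation")


if __name__ == "__main__":
    dot = select_implementation({"sse3", "ht"})  # features assumed detected
    print(dot.__name__, dot([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))
```

The application calls one entry point and never names a processor, which is what keeps it independent of the platform it runs on.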
Table 6: Intel IPP
| Media Processing Function | Intel IPP Available |
| Audio/Video Play and Record, Audio and Video Transcoding for multi-media connections over IP | Audio Codecs: G.711, G.723.1, G.729, G.722, G.722.1, G.726, G.728, GSM-AMR, MP3, AAC, AC3. Video Codecs: H.263, H.264, MPEG4 |
| AGC, Tone Detection, Tone Generation, VAD, Conferencing Summer | Signal Processing and Vector Math Library: Digital filtering (Adaptive, FIR, IIR), Fourier Transforms, Signal Generation |
| Speech Recognition | Audio Processing: Acoustic Echo Cancellation, Noise Reduction, VAD, Feature Extraction. Speech Processing: Pitch Detection, Speech Resampling |
| Echo Cancellation | G.168-2000 compliant Echo Canceller |
| Secure RTP | Symmetric Cryptography (DES, 3DES) Hash Algorithm Data Authentication (DES, 3DES) |

Figure 8: Intel IPP 5.0 average performance gain over compiled C code (footnote 3) [22]

Figure 9: H.263 IPP encoder performance (footnote 4)
Figure 10 shows performance data for advanced video processing functions, such
as text overlay and tiling for video conferencing, with QCIF video streams at
15 frames per second.

Figure 10: Video manipulation performance (footnote 4)
Processors and Chipsets
Intel offers a wide variety of processors and chipsets to address the desktop,
server, and mobile market segments. Table 7 lists a sampling of the processors
available today, addressing each of the market segments. These processors offer
a combination of features and technologies such as HT Technology, dual-core
processors, increased amounts of L2 and L3 cache, increased FSB speeds, Intel
Extended Memory 64 Technology (Intel EM64T), Streaming SIMD Extensions 3
(SSE3), and multi-processor platforms (MP). The hardware platform combined with
the software tools already mentioned makes Intel Architecture Processors a very
compelling media processing hardware platform from both a functionality and a
density standpoint. The impact of HT Technology, dual-core processors,
multi-processors, cache sizes, the Intel C++ Compiler, the VTune™ Performance
Analyzer, and Intel IPP on media processing is discussed in detail. All data
presented in this paper were generated on a specific platform configuration,
which may vary from one set of results to the next. All numbers presented are
representative only of the platform configuration used for that set of data.
Table 7: Sampling of Intel processors
| | Pentium® Processor Extreme Edition 840 (footnote 5) | Pentium® 4 Processor Extreme Edition | Intel® Xeon® Processor Dual-Core | Intel® Xeon® Processor MP |
| System Type | UP | UP | DP | MP |
| L2 Cache | 2x1 MB | 512 KB | 2x2 MB | 1 MB |
| L3 Cache | N/A | 2 MB | N/A | 8 MB |
| Clock Speed | 3.20 GHz | 3.46 GHz | 2.80 GHz | 3.33 GHz |
| FSB | 800 MHz | 1066 MHz | 800 MHz | 667 MHz |
| Dual-core | | | | |
| Intel® EM64T | | | | |
| HT | | | | |
| Execute Disable Bit | | | | |
| SSE3 | | | | |
| EIST | | | | |
| DBS | | | | |
| Chipset | Intel® 955X Express | Intel® 925XE | Intel® E7520 | Intel® E8500 |
| Memory Type | Dual-Channel DDR2 | Dual-Channel DDR2 400/533 (CL3) | Dual-Channel DDR, DDR2 | Quad-Channel DDR, DDR2 |
Table 8: Overview of Intel processor technologies
| System Type | Number of processor sockets in a platform or computer system. Uni-processor (UP) means that only one processor can be present. Dual-processor (DP) systems allow for up to two processors, and multi-processor (MP) systems allow for more than two processors. |
| L2 Cache | Ultra-fast memory that buffers information being transferred between the processor and the slower RAM in an attempt to speed these types of transfers. |
| L3 Cache | Size of 3rd-level cache, typically larger than L2. L3 cache is ultra-fast memory that buffers information being transferred between the processor and the slower RAM in an attempt to speed these types of transfers. Integrated L3 cache provides a faster path to large data sets stored in cache on the processor. |
| FSB (Front Side Bus) | Bus connecting the processor to the main external memory. The frequency shown in the table represents the operating frequency of the bus. |
| Intel® EM64T | Intel Extended Memory 64 Technology enables 64-bit computing. |
| Execute Disable Bit | Allows the processor to classify areas in memory where application code can execute and where it cannot. |
| SSE3 | Streaming SIMD (Single Instruction Multiple Data) Extensions are instructions that reduce the overall number of instructions required to execute a particular program task. The 3 refers to the 3rd iteration of these enhanced instructions. |
| EIST | Enhanced Intel SpeedStep® technology enables a system to dynamically adjust processor voltage and core frequency. |
| DBS | Demand-Based Switching uses EIST to dynamically lower the processor voltage and core frequency based on processor utilization. |
Hyper-Threading Technology
HT Technology boosts computing performance by enabling a single processor to
function as two "virtual" processors that execute two threads in parallel,
allowing software to multi-task more effectively [23]. It provides
thread-level parallelism on a single processor, resulting in more efficient use
of processor resources, higher processing throughput, and improved performance
of multi-threaded applications. Multi-threaded software divides its workloads
into processes and threads that can be independently scheduled and dispatched
by the operating system. In a multiprocessor system, those threads execute on
different processors, whereas on a single HT Technology-enabled processor, the
threads execute in parallel on the same processor. HT Technology uses the idle
cycles in a processor core, such as stalls due to memory access, to execute
work from another thread, such as arithmetic computation that uses only
internal registers.
Prior to HT Technology, media-processing functions such as playing audio (disk
or memory I/O intensive) and decoding a G.729a VoIP packet would complete in
order of priority. If the thread executing audio play had the higher priority,
the G.729a decode thread could not take full advantage of the processor cycles
stalled on memory or disk I/O in the audio play thread. With HT Technology
enabled, the G.729a decode thread can execute in parallel during the stalled
cycles, performing any arithmetic operations that do not require memory I/O.
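The effect can be sketched with a toy cycle-count model. It is deliberately simplistic and not a model of any real Intel pipeline: each thread is a hypothetical string of `C` (a cycle that needs the shared execution unit) and `S` (a cycle stalled on memory or disk I/O), and the function names are assumptions made for this illustration.

```python
def cycles_serial(high, low):
    """Threads complete one after the other: every cycle of both is paid."""
    return len(high) + len(low)


def cycles_smt(high, low):
    """Two hardware threads share one execution unit.

    'C' = cycle that needs the execution unit, 'S' = stalled cycle.
    The high-priority thread always advances. The low-priority thread
    advances when it is itself stalled or when the unit is free.
    """
    i = j = cycles = 0
    while i < len(high) or j < len(low):
        cycles += 1
        unit_free = True
        if i < len(high):
            if high[i] == "C":
                unit_free = False  # high-priority thread claims the unit
            i += 1
        if j < len(low) and (low[j] == "S" or unit_free):
            j += 1
    return cycles


if __name__ == "__main__":
    audio_play = "CSSSCSSS"  # I/O heavy: mostly stalled (hypothetical trace)
    decode = "CCCCCC"        # arithmetic heavy (hypothetical trace)
    print(cycles_serial(audio_play, decode), cycles_smt(audio_play, decode))
    # Two compute-bound threads contend for the unit and are serialized.
    print(cycles_serial(decode, decode), cycles_smt(decode, decode))
```

In this model the decode thread fills the stall cycles of the audio play thread, while two compute-bound threads gain nothing because they are serialized on the shared execution unit.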
Figure 11 and Figure 12 compare the performance of HT Technology with that of
single- and dual-processor systems for a specific media processing application
in which a SIP-to-SIP connection is configured for G.711 and G.729a RTP
connections. For a G.711 RTP connection, HT Technology results in close to a
50% reduction in CPU utilization, equivalent to executing on a dual-processor
configuration. For a G.729a RTP connection, HT Technology results in a 15-20%
reduction in CPU utilization, whereas a dual-processor configuration without HT
Technology results in a 30-40% reduction in CPU utilization. In both test
application configurations, the platform comes close to achieving the
theoretical maximum of a fourfold performance increase, which applies when a
dual-processor (x2), HT Technology-enabled (x2) platform is compared with a
single-processor platform without HT Technology. The difference can be
attributed to more frequent conflicts between the various channel threads
using shared resources such as the execution engine, external memory access,
and cache. If two different threads contend for these resources, the
operations are serialized. A G.711 encode/decode algorithm has a small code
and data footprint that will most likely not result in any cache misses. The
G.729 algorithm has a much larger data footprint and data structure per
channel, so there is likely to be more cache thrashing and contention for
external memory access.
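To relate these utilization figures to channel density, one can assume that capacity scales inversely with CPU utilization at a fixed load. That assumption is made here for illustration only and is not taken from the measurements. Under it, a utilization reduction r corresponds to a capacity factor of 1/(1 - r), as in this small sketch (the function name is hypothetical):

```python
def capacity_factor(reduction):
    """Capacity multiplier implied by a fractional CPU-utilization reduction.

    Assumes capacity is inversely proportional to utilization at fixed load.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for r in (0.50, 0.15, 0.20, 0.30, 0.40):
        print(f"{r:.0%} reduction -> x{capacity_factor(r):.2f}")
```

Under this assumption, the roughly 50% reduction for G.711 corresponds to a factor of about 2, the 15-20% reduction for G.729a with HT Technology to roughly 1.18 to 1.25, and the 30-40% reduction for the dual-processor configuration to roughly 1.43 to 1.67.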

Figure 11: HT Technology and dual-processor performance, G.711 (footnote 6)

Figure 12: HT Technology and dual-processor performance, G.729 (footnote 6)
Dual-Core Processors
Dual-core processors provide increased computing performance by combining two
full processing cores into a single processor. These processors are well suited
to multi-tasking environments because there are two complete execution cores
instead of one, each with an independent interface to the FSB. Unlike HT
Technology, dual-core processors offer complete parallelism of threads through
independent execution units with no contention between threads in accessing
resources such as the execution engine, FSB, and cache. Dual-core processors
also open up a multitude of opportunities for executing completely independent
tasks on each core, such as gaming on one core while running a virus scan
on the other. The dual-processor results in Figure 12 indicate the potential
performance boost due to dual-core processors.
Cache
Intel Architecture processors offer increasing amounts of L2 and
in some cases L3 cache, with significant potential benefits for cache-friendly
applications. Media-processing applications benefit from increasing levels of
cache as CPU-intensive signal-processing algorithms execute more quickly, with
fewer processor stalls associated with fetching instructions and data from
memory. A typical signal-processing chain for an audio play application to a
SIP endpoint involves fetching a block from memory or file, parsing the block
for header and raw data, feeding the raw data into a decoder, and adjusting
gain by a gain block. The output of the gain block is then fed into an encoder,
followed by packetization into RTP packets. A multi-channel voice mail server
repeatedly performs the signal-processing functions associated with audio
play, with the instructions executing out of cache. Data for each channel are
fetched from memory only at the start of processing, and subsequent blocks
access the data out of cache. Large amounts of L2 and L3 cache ensure that
there are fewer cache misses during the entire signal-processing chain. Figure
13 and Figure 14 show the performance boost (reduction of 12-15% in CPU
utilization) due to L3 cache for an audio play application.
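The audio play chain described above can be sketched as a sequence of small stages. This is a minimal sketch under stated assumptions, not the product's implementation: the two-byte length header, the 16-bit PCM placeholder decoder and encoder, and the dynamic payload type 96 are all hypothetical. Only the 12-byte RTP fixed header layout follows RFC 3550.

```python
import struct

HEADER_FMT = "<H"  # hypothetical block header: payload length in bytes


def parse_block(block):
    """Split a stored block into header and raw data; return the raw data."""
    (length,) = struct.unpack_from(HEADER_FMT, block, 0)
    start = struct.calcsize(HEADER_FMT)
    return block[start:start + length]


def decode(raw):
    """Placeholder decoder: 16-bit little-endian PCM bytes to integers."""
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))


def apply_gain(samples, gain):
    """Gain block: scale each sample and clip to the 16-bit range."""
    return [max(-32768, min(32767, int(s * gain))) for s in samples]


def encode(samples):
    """Placeholder encoder: 16-bit PCM in network byte order."""
    return struct.pack("!%dh" % len(samples), *samples)


def packetize(payload, seq, timestamp, ssrc, payload_type=96):
    """Prepend the 12-byte RTP fixed header (version 2, no CSRC list)."""
    header = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def play_block(block, gain, seq, timestamp, ssrc):
    """Run one block through parse, decode, gain, encode, and packetize."""
    pcm = apply_gain(decode(parse_block(block)), gain)
    return packetize(encode(pcm), seq, timestamp, ssrc)


if __name__ == "__main__":
    raw = struct.pack("<3h", 1000, -2000, 30000)
    block = struct.pack(HEADER_FMT, len(raw)) + raw
    print(play_block(block, 2.0, seq=1, timestamp=160, ssrc=0x1234).hex())
```

Each channel runs the same short sequence of functions on every block, which is why the instructions and per-channel state tend to stay resident in cache across blocks.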

Figure 13: Cache performance, G.729a (footnote 7)

Figure 14: Cache performance, G.711 (footnote 7)
|