How to Choose between Hardware and Software Prefetch on 32-Bit Intel® Architecture

Last Modified On: May 7, 2008 1:42 AM PDT

Challenge
Determine the effectiveness of software-controlled versus hardware-controlled data prefetch for memory optimization. The Pentium® 4 processor has two mechanisms for data prefetch: software-controlled prefetch and automatic hardware prefetch.

Software-controlled prefetch is enabled using the four prefetch instructions (PREFETCHT0, PREFETCHT1, PREFETCHT2, and PREFETCHNTA) introduced with Streaming SIMD Extensions (SSE). These instructions are hints to bring a cache line of data into various levels and modes of the cache hierarchy.
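
As a rough illustration of how these hints look from C, the sketch below uses the _mm_prefetch intrinsic from <xmmintrin.h> with its four standard hint constants. The wrapper function prefetch_hint and its integer kind codes are hypothetical names introduced only for this example, and the code assumes a compiler targeting x86 with SSE support. The cache levels named in the comments are the nominal ones; where each hint actually places a line is processor-specific.

```c
#include <xmmintrin.h> /* _mm_prefetch and the _MM_HINT_* constants */

/*
 * Hypothetical helper: issue one prefetch hint for the cache line
 * containing p. The kind codes 0..3 are invented for this sketch.
 * Returns 0 if a hint was issued, -1 if kind is not recognized.
 */
int prefetch_hint(const void *p, int kind)
{
    switch (kind) {
    case 0: /* PREFETCHT0: temporal data, nominally all cache levels */
        _mm_prefetch((const char *)p, _MM_HINT_T0);
        return 0;
    case 1: /* PREFETCHT1: temporal data, nominally second level and beyond */
        _mm_prefetch((const char *)p, _MM_HINT_T1);
        return 0;
    case 2: /* PREFETCHT2: temporal data, nominally outer cache levels */
        _mm_prefetch((const char *)p, _MM_HINT_T2);
        return 0;
    case 3: /* PREFETCHNTA: non-temporal data, aims to limit cache pollution */
        _mm_prefetch((const char *)p, _MM_HINT_NTA);
        return 0;
    default:
        return -1;
    }
}
```

Because a prefetch is only a hint, a call such as prefetch_hint(ptr, 0) should change timing at most, never the contents of memory or registers visible to the program.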

The automatic hardware prefetch can bring lines into the unified second-level cache based on prior data misses. The automatic hardware prefetcher attempts to prefetch two cache lines ahead of the prefetch stream. This feature was introduced with the Pentium 4 processor.

Solution
Generally, prefer software-controlled prefetch in situations where all of the following are true: irregular access patterns are present, short arrays must be prefetched, and making changes to existing application code is acceptable. In practice, the individual advantages and disadvantages of hardware and software prefetching must be weighed against the needs of each situation.

Software-controlled prefetch is not intended for prefetching code. Using it for that purpose can incur significant penalties on a multiprocessor system when code is shared.

Software prefetching has the following characteristics:

  • Can handle irregular access patterns, which do not trigger the hardware prefetcher.
  • Can use less bus bandwidth than hardware prefetching.
  • Must be added to new code; does not benefit existing applications.
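
The irregular-access case in the list above can be sketched as an indexed gather, where the addresses read from the table follow no fixed stride. This is a minimal sketch, not a tuned implementation: the function gather_sum and the LOOKAHEAD distance of 8 iterations are assumptions made for illustration, and a suitable distance would have to be found by measurement on the target processor.

```c
#include <stddef.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical look-ahead distance, in loop iterations (an assumption). */
#define LOOKAHEAD 8

/*
 * Sum table[idx[i]] for i in [0, n). The index array is read
 * sequentially, but the table accesses jump around arbitrarily, so
 * they give a stride-based hardware prefetcher nothing to lock onto.
 */
double gather_sum(const double *table, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Hint the table element needed LOOKAHEAD iterations from now,
         * staying inside the bounds of idx. */
        if (i + LOOKAHEAD < n) {
            _mm_prefetch((const char *)&table[idx[i + LOOKAHEAD]],
                         _MM_HINT_T0);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

The software has the index array in hand, so it can compute a future address before the load that needs it, which is the information a miss-driven hardware prefetcher lacks for this pattern.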

Software and hardware prefetching have different strengths and weaknesses on the Pentium 4 processor. The characteristics of hardware prefetching are as follows (compare with the software prefetching characteristics listed above):

  • Works with existing applications.

  • Requires regular access patterns.

  • Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after an array finishes. For short arrays, this overhead can reduce the effectiveness of the hardware prefetcher.

    • The hardware prefetcher requires a couple of misses before it starts operating.
    • Hardware prefetching generates requests for data beyond the end of an array. That data is never used, so the requests waste bus bandwidth. The same behavior also causes a start-up penalty at the beginning of the next array, because the wasted prefetch could instead have hidden the latency of that array's initial data. Software prefetching can recognize and handle these cases.

  • Does not prefetch across a 4-KB page boundary (i.e., the program must initiate demand loads for the new page before the hardware prefetcher starts prefetching from that page).
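
One way software prefetching might handle the short-array cases described above is sketched here: while one short array is being processed, the code hints the first cache line of the next array, and it never requests data past the end of the current one. The struct short_array descriptor and the function sum_short_arrays are hypothetical names invented for this sketch, and prefetching only a single line per array is an assumption that real code would tune.

```c
#include <stddef.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical descriptor for one short array (invented for this sketch). */
struct short_array {
    const float *data;
    size_t len;
};

/*
 * Sum every element of every array. Before array k is processed, the
 * start of array k+1 is hinted so its initial data can arrive while
 * array k is still being summed. No prefetch is issued beyond the end
 * of any array, so no bus bandwidth is spent on unused lines.
 */
float sum_short_arrays(const struct short_array *arrays, size_t count)
{
    float total = 0.0f;
    for (size_t k = 0; k < count; k++) {
        if (k + 1 < count && arrays[k + 1].len > 0) {
            _mm_prefetch((const char *)arrays[k + 1].data, _MM_HINT_T0);
        }
        for (size_t i = 0; i < arrays[k].len; i++) {
            total += arrays[k].data[i];
        }
    }
    return total;
}
```

The explicit hint for the next array's first line stands in for the start-up misses a hardware prefetcher would need, and the bounds check on k + 1 avoids the trailing fetches that the hardware would issue after the last array.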