Hyper-Threading Technology and Write Combining Store Buffers -- Understanding, Detecting and Correcting Performance Issues

Introduction
By Thomas R. Craver

Processors enabled for Hyper-Threading Technology provide multiple logical processors per physical processor package. The state information necessary to support each logical processor is replicated, while the underlying physical processor resources are shared. Many applications underutilize processor resources, so multiple threads running in parallel with Hyper-Threading Technology can achieve higher utilization and increased net throughput.

The write combining store buffers comprise one such shared physical processor resource. This paper describes how to write code that properly uses write combining store buffers to improve application performance on Hyper-Threading Technology enabled processors, and how to detect incorrect use of write combining buffers that can degrade performance. This paper is targeted to Intel® Xeon® processors, and implementation details are subject to change in future processors enabled with Hyper-Threading Technology.

Definitions
Store Buffer – A Store Buffer is an internal processor buffer that holds written data in preparation for depositing it farther out into the memory hierarchy. Each instruction that does a data write fills one or more store buffers.

Write Combining Store Buffer – A write combining store buffer accumulates multiple stores in the same cache line before eventually writing the combined data farther out into the memory hierarchy, for the purpose of accelerating processor write performance. Stores to the same cache line, which have been recorded in store buffers, are copied to a write combining store buffer to free the store buffer for re-use, rather than waiting for the data to be written to the appropriate level of the memory hierarchy. Write combining buffers are also used to accumulate SSE non-temporal stores and stores to uncached memory.
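To make the idea of "combining" concrete, the following is a minimal sketch in Python, not a description of the actual hardware. It groups a list of stores by the cache line they fall in, assuming 64-byte cache lines. The function name combine_stores and the (address, size) tuple format are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache line size in bytes


def combine_stores(stores, line_size=LINE_SIZE):
    """Group (address, size) stores by cache line.

    Returns {line_index: set of byte offsets written within that line}.
    This only illustrates the bookkeeping; it does not model timing.
    """
    lines = defaultdict(set)
    for address, size in stores:
        for byte in range(address, address + size):
            lines[byte // line_size].add(byte % line_size)
    return dict(lines)


# Sixteen 4-byte stores that walk through one line, plus one 8-byte store
# to a different line.
stores = [(0x1000 + 4 * i, 4) for i in range(16)] + [(0x2000, 8)]
combined = combine_stores(stores)
for line, offsets in sorted(combined.items()):
    print(f"line {line:#x}: {len(offsets)} of {LINE_SIZE} bytes written")
```

In this sketch the seventeen stores touch only two cache lines, so a combining buffer would need to send at most two lines farther out into the memory hierarchy instead of seventeen separate writes.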

Background
The Intel NetBurst™ microarchitecture includes a first-level write-through data cache and a second-level unified cache on the processor. Intel Xeon processors MP also include a third-level unified cache and feature Hyper-Threading Technology. These caches, together with the off-processor main memory, form a memory hierarchy. For additional information on caches on Intel Xeon processors MP, see Effects of Shared Cache on Hyper-Threading Technology Enabled Processors, by Phil Kerley.

Data is read from the first-level cache – the fastest cache – if at all possible. If the data is not in that level, the processor attempts to read it from the next level out, and so on. When data is written, if the first-level cache contains the cache line being addressed, the data is written there as well as being "written through" to the second-level cache. If the cache line is not in the first-level cache, the write goes to the second-level cache but not the first level. In either case, a data store operation places data into a "store buffer". If a store buffer is not available, the processor must wait ("stall") until one becomes available.
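The read and write rules above can be sketched as a toy model. This is a simplification under stated assumptions, not the processor's actual implementation: it ignores the third-level cache, cache capacity, and store buffers; it assumes a read miss fills the first-level cache; and it assumes a write always allocates the line in the second-level cache. The class name WriteThroughModel and its methods are hypothetical.

```python
class WriteThroughModel:
    """Toy model of a write-through first-level cache backed by a second level."""

    def __init__(self, line_size=64):
        self.line_size = line_size
        self.l1 = {}  # line index -> last value stored (stand-in for line data)
        self.l2 = {}

    def load(self, address):
        """Return the level that satisfies a read; fill L1 on a miss (assumed)."""
        line = address // self.line_size
        if line in self.l1:
            return "L1"
        level = "L2" if line in self.l2 else "memory"
        value = self.l2.setdefault(line, 0)  # bring the line into L2 if absent
        self.l1[line] = value
        return level

    def store(self, address, value):
        """Return the list of cache levels updated by a write."""
        line = address // self.line_size
        updated = []
        if line in self.l1:
            # Line present in L1: update it there as well.
            self.l1[line] = value
            updated.append("L1")
        # Every write goes to L2, either as the write-through or on its own.
        self.l2[line] = value
        updated.append("L2")
        return updated


model = WriteThroughModel()
print(model.store(0x1000, 1))  # line not in L1: only L2 is updated
print(model.load(0x1000))      # read misses L1 and is satisfied by L2
print(model.store(0x1000, 2))  # line now in L1: written there and written through
```

The point of the sketch is the asymmetry: a write never brings a line into the first-level cache, so whether a store updates the first level depends on whether an earlier read already put the line there.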

There is also a limited number of write combining store buffers, each holding a 64-byte cache line. If a store goes to an address within one of the cache lines of a write combining store buffer, the data can often be quickly transferred to and combined with the data in the write combining store buffer, completing the store operation much faster than writing to the second-level cache. This leaves the store buffer free to be re-used sooner, minimizing the likelihood of entering a state where all the store buffers are full and the processor must stall to wait for a store buffer to become available.
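Why the limited number of write combining buffers matters can be illustrated with a small simulation. This is a sketch under assumptions, not a model of any specific processor: the buffer count passed in is an arbitrary illustrative parameter, the least-recently-used replacement policy is assumed, and the names WriteCombiningModel and run are hypothetical.

```python
from collections import OrderedDict


class WriteCombiningModel:
    """Toy model: a fixed pool of write combining buffers, one cache line each."""

    def __init__(self, num_buffers, line_size=64):
        self.num_buffers = num_buffers  # illustrative value, not a documented count
        self.line_size = line_size
        self.buffers = OrderedDict()    # line index -> number of stores merged
        self.combined = 0               # stores merged into an already-open buffer
        self.evictions = 0              # buffers flushed early to make room

    def store(self, address):
        line = address // self.line_size
        if line in self.buffers:
            self.combined += 1
            self.buffers.move_to_end(line)        # mark as most recently used
        else:
            if len(self.buffers) == self.num_buffers:
                self.buffers.popitem(last=False)  # assumed LRU eviction
                self.evictions += 1
            self.buffers[line] = 0
        self.buffers[line] += 1


def run(num_streams, num_buffers, stores_per_stream=16, stride=4096):
    """Issue 4-byte stores round-robin across num_streams separate output arrays."""
    model = WriteCombiningModel(num_buffers)
    for i in range(stores_per_stream):
        for s in range(num_streams):
            model.store(s * stride + 4 * i)
    return model.combined, model.evictions


print(run(num_streams=2, num_buffers=4))  # fewer streams than buffers
print(run(num_streams=8, num_buffers=4))  # more streams than buffers
```

With two output streams and four modeled buffers, every store after the first to each line combines and nothing is evicted. With eight streams, each store finds that its line has already been pushed out by the other streams, so no combining occurs at all in this model. Real hardware will differ in its details, but the sketch shows why an inner loop that writes to more distinct cache lines than there are write combining buffers can lose the benefit of combining.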
