Architecture of a 3D Software Stack for Peak PentiumŪ III Processor Performance (continued)


Previous Next     Page 5 of 10

Device Driver Optimization

Some conventional 3D libraries implement an offline driver model. Vertices are transformed, lit, and then stored in a temporary location within cacheable memory. The geometry engine then signals the graphics controller's driver that the buffer is ready, and the device driver begins to move the information from the temporary, cacheable memory to local memory controller memory (typically in a write-combinable and uncacheable memory range6 ). Looking back at the top portion of Figure 5, we can easily see that this will hurt the concurrency we are trying to build between memory access and computations. The conventional driver portion of the time is indicated by the "Driver Loop" time bar.

In addition to reduced concurrency within a typical geometry pipeline, offline driver models also have a tendency to upset the utilization of the external bus (the bus between any CPU core and memory). Many cache lines are modified in the process of storing all of the command and vertex information for a primitive (post transform and lighting information). This can easily evaporate all the careful cache utilization work done in the application and the transform and lighting routines by writing unanticipated data to the caches. Quickly, the application finds itself faced with modified cache lines that need eviction prior to pulling fresh cache lines that contain the current data necessary for computation. The modified line evictions cause an unnecessary load on internal and external busses and can significantly hurt algorithm performance.

ArchGE solved both the problem of decreased concurrency between memory and computation and the issue of inefficient cache management by implementing a different driver model. OLD differs from a conventional driver model in one simple area: a large temporary buffer for transformed and lit vertex information is not required. With OLD, vertices are transformed and lit (four at a time in our single pass implementation) and then stored immediately to memory presented by the graphics controller. ArchGE implemented an OLD mechanism for a commercially available high-performance graphics controller. In the ArchGE geometry pipeline, four vertices are transformed, lit (if visible), and deposited directly in the graphics controller's memory. Since only small blocks of vertex and command information are stored directly to memory, we have increased the concurrency between the computation in transform and lighting and have also decreased the effects of excessive cacheable writes.

Figure 8:  Application-level performance achievable with online vs. conventional driver models

Figure 8: Application-level performance achievable with online vs. conventional driver models

Two very significant results were achieved with the implementation of an OLD in ArchGE. The first was the level of application speedup shown by this device driver model. Figure 8 demonstrates some of the possibilities of the online driver methodology. We measured three different scenes of varying complexity. The first scene in the chart, Scene 1, generates a 1.8x speedup over the same scene run with an offline driver in ArchGE. The high level of performance increase is attributable to the fact that the scene is very geometry intensive (approximately 82,000 vertices submitted per frame for processing) and is not bound by graphics controller fill rates (mostly small triangles). Thus, Scene 1 is more sensitive to processor capabilities and available bus bandwidth. With no external limitations on the performance of this workload, OLD is able to show close to peak performance.

The second scene in the chart, labeled Scene 2, is a close-up view of the first one and is much more sensitive to graphics controller fill rates. The somewhat lower speedup, 1.30x over a conventional driver model (vs. 1.8x for Scene 1) reflects the scene's sensitivity to fill rate. Finally, the third scene measured, Scene 3, shows a 1.17x performance delta over an offline driver. The third scene represents what we feel to be the worst-case content for ArchGE, since the geometric complexity of the content is relatively low, and the fill rate requirements are quite high, which make the graphics controller the performance bottleneck.

Figure 8 shows the large range of performance that is possible by implementing an online driver model. Our studies on the content of current games and benchmarks have indicated that results achievable fall between the peak of 1.8x and 1.3x. Tuning of the content for the third scene should yield results that fall into this range.

Increasing the amount of time spent transforming and lighting vertices is an additional effect of an online driver. With all of the additional time spent in code optimized with the PentiumŪ III processor's Internet Streaming SIMD Extension instructions, we are able to get much closer to the theoretical speedup generated by kernel-level optimizations of transformation and lighting (according to Amdahl's Law described previously). Figure 9 clearly displays the additional amount of time spent in meaningful computation in the case of the online driver. The pie-chart on the left of Figure 9 shows that 90% of our time is spent in the ArchGE library transforming and lighting vertices. In contrast, the pie-chart on the right of Figure 9 shows only 50% of our time in transformation and lighting, while we spend 46% of our time in device driver code (copying vertices to the graphics controller's local memory). Both of the profiles shown were generated using ArchGE on a very complex scene, which is not limited by graphics card fill rates.

Figure 9: Effects of an online driver model to time spent in computation vs. data formating and movement

Figure 9: Effects of an online driver model to time spent in computation vs. data formating and movement

The online driver feature translates into more time spent transforming and lighting vertices and less time moving data in and out of the cache hierarchy. The increase in focus on transformation and lighting (coupled with the optimizations possible with the Pentium III processor) allows an application to increase the level of content and generate a more realistic user experience. Our measurements have shown realistic speedups in the range of 1.8x to 1.3x with an online driver.


6 See [9] for more information on PentiumŪ III processor memory type definitions.




Previous Next     Page 5 of 10