Architecture of a 3D Software Stack for Peak PentiumŪ III Processor Performance (continued)


Previous Next     Page 2 of 10

Introduction

The prohibitive cost of applying the algorithms necessary to compute geometry and lighting in a conventional 3D pipeline has long kept 3D in the realm of high-end workstations. Figure 1 illustrates the classic 3D pipeline structure, which consists of several key components. First, the geometry and lighting calculations are performed on the system's host processor1. The application's 3D models are transformed into their virtual worlds, and lighting information is generated. These calculations are done in either a popular 3D library (OpenGL* or Microsoft's Direct3D* for example) or by the application. The generated information is then handed to another component, a 3D graphics controller, for rasterization (conversion into a 2D pixel representation of the image) on the computer screen. Keeping these two components in balance is one of the fundamental challenges that high-performance 3D engine development must address.

Figure 1: Typical 3D pipeline structure and its associated application-level components

Figure 1: Typical 3D pipeline structure and its
associated application-level components

An increase in graphics' controller performance means the 3D libraries built to deliver the geometry information to the cards must also increase in performance to keep the division of work in balance. The Internet Streaming SIMD Extensions were developed, in part, to increase the efficiency and throughput of the geometry and lighting calculations thus realizing higher system performance. However, to achieve peak 3D performance with the PentiumŪ III processor, components beyond the kernel level should also be optimized.

In order to demonstrate the peak 3D performance and usage models for the Pentium III processor, we developed the Architecture Geometry Engine (ArchGE) to fit into the 3D library layer (Figure 1). ArchGE incorporates the Internet Streaming SIMD Extensions to boost the geometry and lighting performance, but also adds two new architectural extensions to a general purpose 3D library:

  1. MultiPrimitive API extension
  2. "Online" driver model

The exact kernel-level speedup achieved with Internet Streaming SIMD Extensions over x87 code varies depending on the type and number of lights (infinite, local, specular, etc.), amount of clipping, primitive type, and other content variables2. When used in typical transformation and lighting kernels, we expect to see 1.4-2.0x the kernel-level performance over optimized x87 floating-point code for single light workloads. Workloads with multiple lights and larger primitive sizes are expected to see in excess of 2.0x the performance over optimized x87 implementations. Additionally, we have measured highly tuned, custom engines that see 2.5x-2.75x kernel-level performance. However, only a percentage of this kernel-level performance translates into application-level performance, the important measure for the consumer. The application performance increase is governed by Amdahl's Law and is typically less at the application level than at the kernel level unless additional optimizations to the software stack are made [2].

The MultiPrimitive API extension allows an application to gather all the primitives (e.g., strips, fans, vertex buffers) that share identical render state information and submit them in a single API call to the library. The MultiPrimitive optimization has been shown to provide 17%-40% of additional performance over conventional (single primitive per call) methods for drawing primitives. The additional performance is a result of the amortization of call overhead and vertex prefetch costs over a greater number of vertices being processed. The reduction of time spent in "startup" cost translates to more time spent in useful geometry and lighting computations that are accelerated by Streaming SIMD Extensions.

The second key extension introduced in ArchGE is an "online" driver (OLD). The OLD mechanism allows the graphics controller's driver to present the final destination buffer for the transformed and lit vertices directly to the geometry pipeline. Typically, in a general purpose API, all the vertices are transformed and lit, then placed in a buffer controlled by the library. The library signals the graphics controller when it is safe to take the buffer and render the information. When the buffer is ready, the device driver must copy the data from the library's buffer into the controller's memory (typically allocated in AGP memory). There are three issues with this methodology. First, moving large batches of transformed and lit vertices between library and graphics memory exercises the processor bus but not its computational throughput, thus leading to inefficient use of available resources. Second, this process typically generates excessive cache write-back activity (moving modified lines from a smaller, faster cache level to a larger and slower cache level or memory). This tends to aggravate the loading of the processor buses, reduce the efficiency of the cache hierarchy and prefetch instructions, and reduce the throughput of geometry computations. Third, the additional copy and formatting of the data by a typical device driver can increase driver execution times by up to 10x that of an OLD approach. This time spent in additional data movement is not time spent doing meaningful computations. OLD solves each of these issues by allowing the graphics pipeline to deposit transformed and lit vertices into the graphics controller's local memory, as they are calculated. The "direct deposit" of vertex information increases the concurrency between the geometry and lighting computations (computation intensive) with the storage of the results (bus intensive). This increased concurrency has been demonstrated to provide an additional application-level performance speedup of 30%-80% relative to a typical offline driver implementation.

The remainder of this paper discusses the following methods of 3D software stack optimizations (see Figure 1) and how these optimizations affect application-level performance:

  • 3D Library/API Layer: batch multiple primitives per drawing command, single pass vs. multiple pass geometry pipeline
  • Device Driver/Graphics Controller Layer: online driver delivery of processed vertices
  • 3D Application Layer: object-level clipping and render state sorting

With all of these optimizations in place, ArchGE is able to display nearly 2x the peak application-level speedups of optimized x87 floating-point pipelines on similarly configured machines running identical workloads. While existing 3D libraries and device drivers are able to perform the computations necessary for real-time 3D graphics, the techniques described in this paper add significantly to the overall performance for such implementations.


1 This paper assumes a basic understanding of 3D graphics. For an in-depth review of this material refer to [1].

2 Kernel-level performance for our study is defined as including transformation, culling, specular lighting, transposition into graphics controller vertex order from a SIMD format, and storing the processed vertices to AGP memory.




Previous Next     Page 2 of 10