Architecture of a 3D Software Stack for Peak Pentium® III Processor Performance (continued)
3D Library And API Optimizations
There are several popular 3D libraries and countless custom engines available to handle most of the details behind manipulating objects in three dimensions and displaying them on a 2D monitor. Existing 3D libraries typically have architectures that may limit the performance an application can realize on a processor like the Pentium® III processor.
Multi-Pass Vertex Processing
Current 3D libraries normally have a multiple-pass structure for operating on input vertex information. In a multi-pass geometry pipeline, the vertices are processed through several individual loops. Each loop runs all the vertices submitted to the pipeline through one stage: first transformation, then backface culling (removal of non-forward-facing triangles), and then lighting (MP half of Figure 3). There are two issues with this approach:
Multi-pass processing depends heavily on cache management, and potentially on breaking the vertex blocks submitted for transformation and lighting into cache-sized increments. After absorbing all of the cache misses incurred during the transformation phase, the pipeline should not also have to service misses during the culling and lighting portions (even if the data stays in the L2 cache, there is still a small penalty to access it).
In addition to the extra programming effort needed to manage cache usage directly in a multi-pass implementation, a small basic code block size makes it difficult to interleave memory accesses and computation effectively. Ideally, the memory leadoff/latency times for a loop should be balanced by the computation time within that loop. In a multi-pass pipe there is rarely enough computation per loop iteration to balance the load/store requirements. This makes our critical code sections memory bound and not very scalable as processor core frequency increases. (System memory performance historically lags behind processor performance.)
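The structural difference between the two halves of Figure 3 can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `Vertex` layout, the function names, and the toy per-vertex stand-ins for transform, cull, and light are all hypothetical and are not taken from ArchGE or any real library.

```c
#include <stddef.h>

/* Hypothetical vertex record; the layout is an assumption for illustration. */
typedef struct {
    float x, y, z;   /* position */
    float nz;        /* z component of the normal, used by the toy stages */
    float lit;       /* output of the toy lighting stage */
    int   visible;   /* output of the toy culling stage */
} Vertex;

/* Toy stand-ins for the real stages (assumptions, not real geometry math). */
static void transform(Vertex *v) { v->x *= 2.0f; v->y *= 2.0f; v->z *= 2.0f; }
static void cull(Vertex *v)      { v->visible = (v->nz < 0.0f); }
static void light(Vertex *v)     { v->lit = v->visible ? -v->nz : 0.0f; }

/* Multi-pass: three loops, each streaming the whole vertex array.
   If the array does not fit in cache, every pass misses on it again. */
void process_multi_pass(Vertex *v, size_t n)
{
    for (size_t i = 0; i < n; ++i) transform(&v[i]);
    for (size_t i = 0; i < n; ++i) cull(&v[i]);
    for (size_t i = 0; i < n; ++i) light(&v[i]);
}

/* Single-pass: one loop with a larger body, so each vertex is brought
   into cache once and all stages run on it while it is still there. */
void process_single_pass(Vertex *v, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        transform(&v[i]);
        cull(&v[i]);
        light(&v[i]);
    }
}
```

Both functions produce the same results; the difference lies only in how often the vertex data is streamed through the cache and in how much computation each loop body offers to overlap with memory accesses.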
Figure 3: Multi-pass (MP) vs. single-pass (SP) geometry pipe
Single-Pass Vertex Processing
In order to address the issues of a multi-pass pipeline structure, ArchGE implements the single-pass (SP) methodology shown in Figure 3. The key difference in this approach is that a few vertices are processed through transform, culling, lighting, and writing to the graphics controller's memory in a single loop³. This eliminates the need to add code to carefully manage cache utilization and also increases the basic block size significantly. The Internet Streaming SIMD Extensions PREFETCH instruction is used to hide memory latency behind the computation performed in the pipe. Data that will be transformed during the next iteration of the loop (x, y, z coordinate information) is brought into the cache while the current vertex is being transformed. The same methodology applies to normals and texture coordinate values during lighting computation. By using PREFETCH instructions and implementing a large basic block, ArchGE is able to significantly increase the concurrency between memory access and computation. Our studies have shown that, as a result of this increase in concurrency, a single-pass pipeline is 20%-30% faster than an optimized multi-pass pipeline. Since the ArchGE pipe tends to be more compute bound than its multi-pass counterpart, it should also scale more effectively with processor frequency.
MultiPrimitive API Extension
How vertices are submitted to a geometry pipe is almost as important as how they are processed. Most 3D libraries support many different ways to pass the application's vertex information through the application programming interface (API⁴). Vertices are grouped together into primitives (typically triangle-based) by the application and then passed to the library for transformation and lighting, followed by rasterization by the graphics controller.
OpenGL*, for instance, supports ten types of these primitives, ranging in complexity from individual points to quadrilateral strips [4]. Since most graphics controllers accept information in a triangle-based format, triangle-based primitives are currently the most popular types. Figure 4 demonstrates three such primitives.
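Part of the appeal of strips and fans is vertex sharing, which a short C helper can make concrete. The enum and function names below are hypothetical, and the choice of these three primitive types is an assumption, since Figure 4 is not reproduced here. The counts themselves follow from the standard definitions of independent triangle lists, triangle strips, and triangle fans.

```c
/* Hypothetical names for three common triangle-based primitive types. */
typedef enum { TRI_LIST, TRI_STRIP, TRI_FAN } PrimType;

/* Number of vertices that must be submitted to draw `triangles` triangles. */
unsigned vertices_needed(PrimType type, unsigned triangles)
{
    if (triangles == 0)
        return 0;
    switch (type) {
    case TRI_LIST:              /* every triangle carries its own 3 vertices */
        return 3 * triangles;
    case TRI_STRIP:             /* each new triangle reuses the previous 2 */
    case TRI_FAN:               /* each new triangle reuses the hub and the previous vertex */
        return triangles + 2;
    }
    return 0;
}
```

For example, four triangles need 12 vertices as an independent list but only 6 as a strip or fan, which is one reason typical primitives end up containing relatively few vertices.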
Figure 4: Different triangle primitive types and the vertices necessary to draw them
Most existing graphics libraries allow an application to submit only one primitive at a time for processing. This means that an application can only process one triangle strip, one triangle fan, or one indexed list of vertices per function call (or whatever primitives are supported by the library). Since most primitives are composed of relatively few vertices, the overhead involved in just making the function call to process each individual primitive becomes significant⁵.
The overhead for processing a single primitive can be broken into two parts: additional instructions outside of geometry computations, and memory de-pipelining. The obvious source of additional work is the added instructions and cycles necessary to push and pop parameters, set up transform matrices and lighting information, validate parameters, and so on. This overhead was measured to be on the order of a thousand cycles per call in some popular libraries, which is a very significant amount of time if the application is submitting a small number of vertices per call.
In the ideal case, the Pentium III processor with Internet Streaming SIMD Extensions allows for almost complete overlap of memory accesses and computation. This is achieved by fully pipelining memory accesses using the PREFETCH instruction (lower portion of Figure 5).
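A back-of-the-envelope model shows why a fixed per-call cost matters so much for short primitives. The function below is a hypothetical helper, not part of any library. The 1,000-cycle call cost is the order of magnitude quoted above, while the 100 cycles per vertex used in the note that follows is purely an assumed figure for illustration, not a measurement.

```c
/* Fraction of total cycles spent on call overhead when each call submits
   `prims_per_call` primitives of `verts_per_prim` vertices each.
   All inputs are model parameters supplied by the caller. */
double overhead_fraction(double call_cycles, double cycles_per_vertex,
                         unsigned verts_per_prim, unsigned prims_per_call)
{
    double work  = cycles_per_vertex * (double)verts_per_prim
                                     * (double)prims_per_call;
    double total = call_cycles + work;
    return total > 0.0 ? call_cycles / total : 0.0;
}
```

With the assumed 100 cycles per vertex, a 10-vertex primitive submitted on its own spends half of its cycles in call overhead (1000 / 2000). If ten such primitives could share one call, the overhead share would drop to about 9% (1000 / 11000). The model ignores memory effects entirely, so it covers only the first of the two overhead sources described above.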
Figure 5: Ideal picture of increased memory and computational concurrency within a 3D pipeline
Each box in Figure 5 represents a block of processing time in a simplified 3D pipeline. The top portion of the figure is a conventional pipe, with serial memory and computation (a simplification since even older processor families allow for a small amount of concurrency between memory and computation). The bottom portion of Figure 5 shows what can be achieved by utilizing the PREFETCH and streaming store features of the Pentium III processor.
Practically, however, an effect we refer to as "memory de-pipelining" occurs at primitive boundaries, causing the total time in our ideal case to stretch somewhat [8]. For example, there can be "startup costs" associated with prefetching the first several vertices of a primitive during which computation is effectively stalled waiting for the data. For nested loops, memory de-pipelining can occur during the interval between the last iteration of an inner loop and the next iteration of its associated outer loop.
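The overlap pictured in the lower portion of Figure 5 can be approximated in C by requesting a later vertex while computing on the current one. The sketch below is a minimal illustration, not ArchGE code. It uses the GCC/Clang builtin `__builtin_prefetch` as a portable stand-in for the PREFETCH instruction, and the `Vec3` type, the function name, the 3x3 matrix form, and the look-ahead distance of four vertices are all assumptions.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;   /* hypothetical vertex position */

#define PREFETCH_AHEAD 4   /* look-ahead distance in vertices; an assumed tuning value */

/* out[i] = m * in[i], where m is a row-major 3x3 matrix. */
void transform_with_prefetch(const float m[9], const Vec3 *in,
                             Vec3 *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Ask for a later vertex now so that its load overlaps the
           arithmetic below. A prefetch is only a hint, and the bounds
           check keeps the address inside the array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&in[i + PREFETCH_AHEAD], 0, 3);

        Vec3 v = in[i];
        out[i].x = m[0] * v.x + m[1] * v.y + m[2] * v.z;
        out[i].y = m[3] * v.x + m[4] * v.y + m[5] * v.z;
        out[i].z = m[6] * v.x + m[7] * v.y + m[8] * v.z;
    }
}
```

Note that the first `PREFETCH_AHEAD` vertices are never prefetched by this loop. Their loads are not hidden behind computation, which corresponds to the startup cost described above and is paid again for every primitive processed this way.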
Figure 6: Memory de-pipelining between two short primitives
Figure 6 shows a graphical example of the effects of memory de-pipelining. In the figure, the large boxes represent the amount of time to do normal computation plus the time spent waiting for initial PREFETCH instructions to return data to the cache (which delays completion for several of the initial iterations of geometry processing). The smaller boxes represent the amount of time necessary to complete computation in the steady state.
The recommended technique to alleviate the performance issue of memory de-pipelining is "prefetch concatenation." Concatenation can bridge the execution pipeline bubbles at the boundary between an inner loop and its associated outer loop by using the PREFETCH instruction to "look ahead" to the next outer loop iteration. In the example outlined in Figure 6, the geometry pipeline "looks ahead" across primitive boundaries. It is clear that if an API only allows an application to submit a single primitive per call, this technique cannot be used at primitive boundaries to amortize the memory start-up costs for each primitive submitted for processing. This is especially important when dealing with primitives containing relatively few vertices (fewer than 100).
In order to reduce both the impact of an application calling through the API layer for every primitive and the memory de-pipelining effects, ArchGE implements a MultiPrimitive method for passing primitives to the geometry engine. This allows an application to pass a list of primitives and a corresponding list of primitive lengths to ArchGE with one call. MultiPrimitive generates a 40% increase in application-level performance for the ArchGE/Scene Manager software stack. Figure 7 shows details of the sensitivity of MultiPrimitive to primitive size and the number of primitives in a batch.
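A batched entry point of the kind described here, combined with prefetch concatenation, might look roughly like the following C sketch. The actual ArchGE interface is not shown in this article, so the function name, its signature, the `Vec3` type, and the uniform scale used as a stand-in for transform and lighting are all hypothetical. As a further assumption, `__builtin_prefetch` (GCC/Clang) stands in for the PREFETCH instruction.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;   /* hypothetical vertex position */

#define LOOKAHEAD 4   /* prefetch distance in vertices; an assumed tuning value */

/* Hypothetical batched call: prims[p] points at lengths[p] vertices.
   Scales every vertex in place (a stand-in for the real geometry work)
   and returns the total number of vertices processed. */
size_t process_multi_primitive(Vec3 *const *prims, const size_t *lengths,
                               size_t prim_count, float scale)
{
    size_t total = 0;
    for (size_t p = 0; p < prim_count; ++p) {
        Vec3  *v = prims[p];
        size_t n = lengths[p];
        for (size_t i = 0; i < n; ++i) {
            size_t ahead = i + LOOKAHEAD;
            if (ahead < n) {
                /* Steady state: look ahead within the current primitive. */
                __builtin_prefetch(&v[ahead], 1, 3);
            } else if (p + 1 < prim_count && ahead - n < lengths[p + 1]) {
                /* Prefetch concatenation: near the end of this primitive,
                   look ahead into the start of the next one instead. */
                __builtin_prefetch(&prims[p + 1][ahead - n], 1, 3);
            }
            v[i].x *= scale;
            v[i].y *= scale;
            v[i].z *= scale;
            ++total;
        }
    }
    return total;
}
```

Because the whole batch is visible inside one call, the loop can start fetching the next primitive before the current one finishes, and the per-call setup cost is paid once per batch rather than once per primitive. This sketch only looks into the immediately following primitive; a fuller version would also handle next primitives shorter than the look-ahead distance.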
In the best case of small primitives with many primitives in each call, MultiPrimitive achieves over 400% of the performance of a single-primitive API. At the low end of the spectrum, with very large primitives (65 to 120 vertices per primitive) and only two per call, MultiPrimitive still achieves a 20% increase in application-level performance. Based on studies of existing games and benchmarks, we anticipate that this feature could potentially generate a 30%-40% application-level speedup for typical workloads.
The results in Figure 7 were generated by varying two variables: the maximum vertices per primitive and the number of primitives per batch. As we increase the maximum number of vertices allowed per primitive, the average number of vertices per primitive does not necessarily increase in a linear fashion (along the x-axis). The scene used for the experiment documented by Figure 7 had an original structure of 66 vertices per primitive on average. As we get closer to the original maximum average primitive size, moving right along the x-axis in Figure 7, a compression in the x-axis result occurs because additional vertices on a per-primitive basis do not affect the end average to any great extent.
3 In the case of ArchGE, "a few vertices" is actually four, which correlates nicely with the Pentium® III processor's Internet Streaming SIMD Extensions register width.
4 The API is the set of function calls a program can make to interact with a library.
5 This observation is based on a study of several current games and benchmarks. A similar observation was made