The technology of computer gaming is undergoing a major conceptual shift: a move to multi-threaded engines running on multi-core processors. Multi-core processors power the next generation of PCs and gaming consoles, and game developers want to target as many of these platforms as possible. Unfortunately, threaded execution and cross-platform support are non-trivial, and many developers find it difficult to bring these features into their own code. This article aims to smooth the transition by exploring both features in the context of a simple demo application. By building an understanding of these technologies from the ground up, game developers can implement them in their own projects.
The demo application created for this article can be found here. The demo application has a Microsoft Visual Studio* 2005 solution file for building and running on Windows, and a makefile for building on Linux. When run, the demo application opens a window and draws an OpenGL scene (Figure 1). This demo application and the code that comprises it will be used for examples throughout this article.
Figure 1: Demo application at launch
Multi-threaded software design is a study unto itself, but the basic principles are easily understood. Conventional, single-threaded software executes all of its code in a serial fashion. For games, this generally means there is a central loop that executes all of the game's tasks (handling input, updating game world, rendering, etc), once for each frame rendered to the screen. This serial execution model is sufficient for single-core processors, but multi-core processors have additional processing resources that will go unused. The unused processing resources can be reclaimed by splitting the game tasks into independent "threads" that can be executed on any of the logical cores in a multi-core processor. This is called a parallel or threaded execution model. Parallel execution is key to achieving high performance gaming on modern processors, which are increasingly moving to multi-core.
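The difference between the two models can be sketched in a few lines of C++ using std::thread. The task functions and counters here are hypothetical stand-ins for real game work, not code from the demo:

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins for a game's per-frame tasks. Each simply
// counts how many times it has been called.
static std::atomic<int> g_inputFrames(0), g_updateFrames(0), g_renderFrames(0);

void handleInput() { ++g_inputFrames;  }
void updateWorld() { ++g_updateFrames; }
void renderScene() { ++g_renderFrames; }

// Serial model: every task runs once per frame, all on one thread.
void runSerial(int frames) {
    for (int i = 0; i < frames; ++i) {
        handleInput();
        updateWorld();
        renderScene();
    }
}

// Parallel model: each task runs in its own thread at its own rate.
void runParallel(int frames) {
    std::thread input([=]  { for (int i = 0; i < frames; ++i) handleInput(); });
    std::thread update([=] { for (int i = 0; i < frames; ++i) updateWorld(); });
    std::thread render([=] { for (int i = 0; i < frames; ++i) renderScene(); });
    input.join();
    update.join();
    render.join();
}
```

In the serial model every task is throttled to the frame rate of the whole loop; in the parallel model each task's call rate is limited only by its own workload and the cores available to run it.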
The demo application has three threads: one that runs a basic event loop, one that updates the positions of objects in the game world, and one that renders the game world to the window. A display at the bottom of the window shows how often the task assigned to each thread is executing, in units of Calls Per Second (Figure 2). This metric is equivalent to Frames Per Second for the rendering task, but the more general name applies to all threads that execute a task repeatedly. The display also reports whether the tasks are being executed in serial or in parallel; pressing the Tab key toggles this setting so its effect on each task's CPS can be observed.
The render task and the update task can have their workloads increased or decreased interactively: press the Z and X keys for the render task, and the "." (period) and "/" (slash) keys for the update task. By adjusting these workloads, the demo can simulate games whose running speed is limited by either graphics or calculation.
Some developers may find it confusing to have each task's execution rate be completely decoupled – do we really need to process events faster than we can update the game world? This is a reasonable question - it's closely related to the question of how many threads a game should have. The answer is different for each game and each processing environment. A high-performing game should be able to take advantage of all available logical cores without suffering undue slowdown when running on a single-core processor. For this reason, some games attempt to detect the number of logical cores before deciding how many threads to spawn. With modern thread-scheduling systems, this optimization very rarely provides significant gains over a generalized division of processing into independent tasks. The demo application demonstrates this by allowing the user to serialize all the tasks onto one thread - on a single-core processor this demonstrates the absence of significant overhead from having more threads than logical cores.
Figure 2: Demo application with increased workload
The demo program consists of four C++ classes and some glue code to hold them together. The classes are:
ThreadManager: This class manages a thread pool - a group of threads that are created when the program begins and assigned tasks throughout the program's lifetime. Creating the threads only once saves paying the overhead of thread creation and destruction over and over again during program execution. This thread pool implementation "dedicates" each thread to a game task; the thread will call the same function repeatedly after it has been started. This is a convenient scheme for allowing the thread to be started and stopped easily, and for calculating the Calls Per Second. There are other schemes for managing a thread pool which are discussed in the Future Work section.
The ThreadManager class also provides some convenience methods for thread-related tasks. Each thread can be assigned local storage, which is a way to associate a variable with the thread itself, such that a piece of code can behave differently depending upon which thread is executing it. The demo application uses this feature to ensure that the rendering function will continue to work when it is serialized with all the other tasks on the primary thread, or when the rendering context has changed (when transitioning to or from fullscreen). The ThreadManager class also defines methods that allow a thread to sleep for a period of time, or yield to other threads if any are waiting to execute.
The demo application uses a subclass of ThreadManager called ThreadManagerSerial. This subclass has additional methods the demo application uses to move tasks from dedicated threads to and from the primary thread.
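A minimal sketch of the dedicated-thread scheme can be written with C++11 threads. The class and method names below are illustrative, not the demo's actual ThreadManager interface:

```cpp
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <thread>
#include <vector>

// Sketch of a "dedicated thread" pool: each thread repeatedly calls one
// task function until the pool is stopped. The per-thread call counter
// is the raw data behind a Calls Per Second display.
class DedicatedPool {
public:
    void start(std::function<void()> task) {
        counts_.emplace_back(0);                 // deque keeps references stable
        std::atomic<long>* count = &counts_.back();
        threads_.emplace_back([this, task, count] {
            while (running_) {
                task();                          // same function, over and over
                ++*count;
            }
        });
    }
    void stop() {
        running_ = false;
        for (std::thread& t : threads_) t.join();
        threads_.clear();
    }
    long calls(std::size_t i) const { return counts_[i].load(); }

private:
    std::atomic<bool> running_{true};
    std::vector<std::thread> threads_;
    std::deque<std::atomic<long>> counts_;
};
```

Dividing each thread's call count by elapsed time gives the CPS figure the demo displays.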
CriticalSection: This is a helper class used in ThreadManager and elsewhere in the demo application to create and manage critical sections, which are used to prevent multiple threads from attempting to read and modify shared data at the same time.
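The same pattern can be sketched cross-platform with std::mutex. The CriticalSection and ScopedLock classes below are illustrative, not the demo's actual code:

```cpp
#include <mutex>
#include <thread>

// A cross-platform stand-in for a critical-section helper, built on
// std::mutex.
class CriticalSection {
public:
    void enter() { mutex_.lock(); }
    void leave() { mutex_.unlock(); }
private:
    std::mutex mutex_;
};

// RAII guard: the section is held exactly as long as this object is in
// scope, so early returns and exceptions cannot leave it locked.
class ScopedLock {
public:
    explicit ScopedLock(CriticalSection& cs) : cs_(cs) { cs_.enter(); }
    ~ScopedLock() { cs_.leave(); }
private:
    CriticalSection& cs_;
};

// Example: a shared counter protected by a critical section.
CriticalSection g_counterLock;
int g_counter = 0;

void incrementShared() {
    ScopedLock lock(g_counterLock);  // released automatically on return
    ++g_counter;
}
```

Without the lock, two threads incrementing g_counter simultaneously could lose updates; with it, every increment is applied.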
OpenGLWindow: This class uses SDL, the Simple DirectMedia Layer library1, to provide a cross-platform, fixed-resolution rendering context, and smooths over some non-obvious rough spots in OpenGL rendering. Rendering can be done in a window or in fullscreen mode. OpenGL rendering contexts become invalid when a window is recreated (when going to fullscreen mode, for example). To handle the regeneration of rendering contexts, the class provides a way for threads to determine whether they have a valid rendering context. This is necessary because OpenGL requires each rendering thread to be associated with a single rendering context, and threads are not automatically notified if their context becomes invalid.
World: This class is specific to the demo application. It manages a static background of triangles and a set of moving points in the foreground which are updated and rendered by threaded tasks in the demo. The background triangles are used to provide a workload for the rendering task. The foreground points are modeling an "n-body" problem - a showcase of the law of universal gravitation. Imagine each point is a planet or asteroid in space, pulling on all other bodies, and combining when they collide. There is also an invisible black hole at the center of the space, which absorbs everything that collides with it, but eventually "overflows" and spills out new bodies. The n-body problem is the workload for the update task.
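The heart of that update task can be sketched as a single integration step over all pairs of bodies. The constants, struct layout, and softening term below are illustrative, not the demo's actual code:

```cpp
#include <cmath>
#include <vector>

// One step of a simple 2-D n-body simulation: every body accelerates
// toward every other body with magnitude proportional to m / r^2
// (universal gravitation), then positions advance by velocity * dt.
struct Body { double x, y, vx, vy, mass; };

void stepNBody(std::vector<Body>& bodies, double g, double dt) {
    // Accumulate accelerations using the positions from the start of
    // the step, so the result doesn't depend on update order.
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        double ax = 0.0, ay = 0.0;
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double r2 = dx * dx + dy * dy + 1e-9;  // softened: avoids divide-by-zero
            double r  = std::sqrt(r2);
            double a  = g * bodies[j].mass / r2;   // acceleration toward body j
            ax += a * dx / r;
            ay += a * dy / r;
        }
        bodies[i].vx += ax * dt;
        bodies[i].vy += ay * dt;
    }
    // Advance positions after all velocities are updated.
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        bodies[i].x += bodies[i].vx * dt;
        bodies[i].y += bodies[i].vy * dt;
    }
}
```

The O(n²) inner loop is what makes this a useful tunable workload: each batch of 50 extra bodies increases the cost of every update noticeably.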
Experiments with the Demo
The demo can be manipulated to simulate the performance of games with commonly-occurring runtime characteristics. Some games spend most of their running time drawing a complicated scene. Some games spend more time computing a complicated transformation of the game world. By adjusting the workload of the tasks in the demo, these runtime situations can be simulated and the benefits of threaded execution in those situations can be evaluated.
The key to interpreting the output of the demo application is whether or not it is running on a multi-core processor. Threaded applications run just fine on machines with single-core processors. But without a multi-core processor, a program with threaded tasks running in parallel is, in most cases, no more efficient than a program running those same tasks serially. One common exception to this rule is workloads that block heavily on network or disk I/O: even on single-core processors, threading those tasks allows other computation to proceed while data is being retrieved.
For each scenario below, an expected result will be listed separately for both single-core and multi-core execution. Note that on single-core, changing the workload on any task is going to affect the CPS of all tasks. On multi-core, changing one task's workload will have little to no effect on the other tasks' CPS.
When the demo launches, it shows 10000 triangles in the background and 50 points in the foreground, and is running in windowed mode. All of the scenarios below assume this starting state. Triangles can be adjusted 10000 at a time by pressing Z and X. Bodies can be adjusted 50 at a time by pressing "." (period) and "/" (slash). Pressing Tab switches between serial and parallel assignment of tasks.
Scenario 1: Calculation-bound execution. Adjust the update task by pressing the /(slash) key to add bodies to the n-body problem. Add enough so that the update CPS is significantly less than the render CPS, but not lower than 5 (if possible). On a single-core machine this can be tricky, because increasing the load on the update task will also tend to drop the CPS of the other two tasks. A multi-core machine should be able to get into this state without difficulty. In this state the bodies seem to move in waves - this is a side-effect of the render task and update task operating on the same data. Since the interaction is not thread-safe, the rendering task shows a partially-complete update each frame. A thread-safe approach to rendering is detailed in the Future Work section.
Note the CPS of the update thread, then press the Tab key to switch to serial execution. On single-core, all of the tasks drop their CPS to that same low number (or perhaps slightly higher). On multi-core, all of the tasks drop to an even lower number. Why? The single-core case is already running all the tasks on the same logical core, so switching to serial execution merely limits the speedier tasks to the rate of the slowest task, leaving that slow task slightly less work to compete against. On multi-core, the slow task is probably already on its own logical core (though there are no guarantees; see Future Work), so it is not competing with any other tasks on that core. When the tasks are serialized, that slow task suddenly shares a core with the other work, and everything gets slower. The corollary: on multi-core, threading a game makes all tasks faster.
Scenario 2: Graphics-bound execution. Adjust the render task by pressing the X key to add triangles to the background. Add enough so that the render CPS is somewhere between 5 and 10. The update CPS should be significantly higher (maybe not much higher on single-core). What is occurring here is that, although the rendering is very slow, the update task simulating the n-body problem is being called frequently and therefore maintaining high accuracy. Even with the low rendering CPS/FPS, it is possible to track the path of individual bodies.
Press the Tab key to switch to serial execution. On both single-core and multi-core machines, the render task will maintain the same CPS, but the update task will have a lower CPS. The update task is now being called much less frequently, making the simulation of the n-body problem less accurate. The effect is that even though the application renders the same number of frames each second, the action on the screen is more chaotic and harder to follow. Bodies will accelerate onto a collision course with the black hole, but instead of being sucked in, they will "tunnel" through and continue moving at high speed. This tunneling behavior is a sign of a too-slow world update. In a different type of game, this problem could appear as shots passing through enemies, or players getting stuck in walls. Corollary: even graphics-bound games can benefit from more frequent world/physics updates.
Computer gaming has always been a high-performance enterprise. With the shift that has occurred in processor technology, games need to be properly threaded in order to take advantage of all of the host platform's power. Also, the market for a game can be expanded by targeting more than just one platform. With proper abstraction at key places in the code, cross-platform development can be relatively straightforward. Both threading and cross-platform development represent significant technological opportunities for all game titles, from high-profile to homebrew. The techniques described here can be applied to an existing code base or can be used to start development of the next great game.
This demo straddles the line between being an experimental tool, and being a viable starting point for game development using modern techniques. The ideas for future work tend to emphasize one of these paths over the other.
Add more platforms: MacOS* is an obvious choice and would be a fairly straightforward addition. It is of course possible to add other platforms (consoles, handhelds, etc), but broader platform strategies require more thought to interface, control, etc.
Create synchronization primitives for ordering tasks: On calculation-bound projects, it's unnecessary to render duplicate frames. The render thread can be made to wait on a condition variable (another commonly implemented threading API feature) which can be set by the update thread. This can be abstracted in the ThreadManager class so it remains cross-platform.
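A sketch of that ordering with std::condition_variable (the names and globals here are hypothetical, and a real ThreadManager would wrap them behind its own cross-platform interface):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The update thread publishes a frame number; the render thread blocks
// until a frame newer than the one it last drew exists, so it never
// renders the same world state twice.
std::mutex g_frameMutex;
std::condition_variable g_frameReady;
long g_frameNumber = 0;

// Called by the update thread after it finishes a world update.
void publishFrame() {
    {
        std::lock_guard<std::mutex> lock(g_frameMutex);
        ++g_frameNumber;
    }
    g_frameReady.notify_one();
}

// Called by the render thread; blocks until a frame newer than
// lastDrawn is available, then returns its number.
long waitForNewFrame(long lastDrawn) {
    std::unique_lock<std::mutex> lock(g_frameMutex);
    g_frameReady.wait(lock, [lastDrawn] { return g_frameNumber > lastDrawn; });
    return g_frameNumber;
}
```

The predicate form of wait() handles both spurious wakeups and the case where the update thread publishes before the render thread starts waiting.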
Provide a fixed frequency for the world update: The n-body problem is a sensitive physical simulation whose accuracy is affected by the length of the world update frequency. Choosing a fixed frequency for the world update task can eliminate variations in accuracy. After completing a world update, the task can wait until a fixed amount of time has passed.
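One way to sketch this, using std::chrono's steady_clock (the helper name is hypothetical):

```cpp
#include <chrono>
#include <thread>

// Fixed-frequency task loop: each iteration does its work, then sleeps
// until the next tick boundary. As long as the work fits within one
// period, updates occur at a steady rate regardless of how long the
// work itself takes.
template <typename Work>
void runFixedFrequency(Work work, int iterations, std::chrono::milliseconds period) {
    auto next = std::chrono::steady_clock::now() + period;
    for (int i = 0; i < iterations; ++i) {
        work();
        std::this_thread::sleep_until(next);  // absorb the period's leftover time
        next += period;
    }
}
```

Sleeping until an absolute deadline (sleep_until) rather than for a relative duration (sleep_for) keeps the tick rate from drifting when the work time varies.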
Double-buffer the world update to achieve thread-safe rendering while increasing FPS: If a project is calculation-bound, the world state can be split into static and dynamic halves. With two copies of the dynamic half, the world thread can update one while the render thread can render the other (along with the static world data). This is how many commercial games gain rendering speedups on multi-core processors.
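A minimal sketch of the double-buffer idea, assuming a single writer and a flip performed while the reader is between frames (a production version also needs to keep the reader off the buffer being flipped, for example with a lock or a third buffer; all names here are illustrative):

```cpp
#include <atomic>
#include <vector>

// The dynamic half of the world state, duplicated so the update task
// can write one copy while the render task reads the other.
struct DynamicState { std::vector<float> positions; };

class DoubleBuffer {
public:
    DynamicState&       back()  { return buffers_[1 - front_.load()]; }  // update writes here
    const DynamicState& front() { return buffers_[front_.load()]; }      // render reads here
    void flip() { front_.store(1 - front_.load()); }                     // publish the update
private:
    DynamicState buffers_[2];
    std::atomic<int> front_{0};
};
```

Because the render task only ever sees a fully completed update, the "waves" artifact described in Scenario 1 disappears, without either thread blocking on the other.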
Use platform-specific thread scheduling APIs: Threads can be assigned to any of the logical cores available - there's no guarantee that any two threads will definitely be executing at the same time. Each platform has its own policy on how this assignment is made and on how the total running time is distributed to each thread. Some platforms provide an API to affect the assignment and scheduling policy. These APIs could be abstracted and exposed in the ThreadManager class.
Add more threaded tasks: AI loops, dynamic content generation, audio, networking, etc. are all common features in game projects. Additional threaded tasks can take better advantage of future multi-core processors with more than two logical cores. Alternatively, tasks could be modified to take advantage of data parallelism (using threads to break up a large task into smaller parallel pieces). Data parallelism is an avenue by which games with a small number of tasks can take advantage of multi-core processors. Likewise, as multi-core technology evolves and the number of logical cores comes to exceed the number of parallelizable tasks, data parallelism will become key to maximizing use of the processing resources.
Abandon serialization and use ThreadManager instead of ThreadManagerSerial: Doing this would allow tasks to use more efficient blocking calls while they wait for conditions to be met. As an example, on Windows the runWindowLoop function could call WaitMessage instead of PeekMessage, blocking until it had an event to process. The render function could call glFinish instead of glFlush, blocking until the frame was completely drawn. Alternatively, ThreadManager could be modified to have a one-time dispatch thread pool policy (a producer-consumer queue), which can perfectly match the number of threads with the number of logical cores, in theory providing the absolute minimal threading overhead possible.
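A sketch of such a one-time-dispatch (producer-consumer) pool, built from C++11 primitives; all names here are illustrative:

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A fixed set of worker threads consumes tasks from a shared queue.
// The thread count can be matched to the hardware, e.g. with
// std::thread::hardware_concurrency().
class TaskQueuePool {
public:
    explicit TaskQueuePool(unsigned threads) {
        for (unsigned i = 0; i < threads; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~TaskQueuePool() {  // drains remaining tasks, then joins the workers
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_all();
        for (std::thread& t : workers_) t.join();
    }
    void dispatch(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return done_ || !queue_.empty(); });
                if (queue_.empty()) return;  // done_ set and queue drained
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();  // run outside the lock so dispatch isn't blocked
        }
    }
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> queue_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

Unlike the dedicated-thread scheme, here any worker can pick up any task, which keeps all cores busy even when individual tasks have very different costs.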
Appendix: Building the Demo Application
Unpack the TCPGD.zip archive, which will create the TCPGD directory.
Windows: Launch Microsoft Visual Studio* 2005 and open the TCPGD.sln solution file in the TCPGD directory. Choose the Release configuration and build and run the solution. When building the solution, the GLF project2 may generate some warnings - this will not prevent the application from building and running successfully.
Linux: Enter the TCPGD directory. To build the demo application, you must separately build the GLF and the Main projects. The following commands will build and run the program:
- http://www.libsdl.org* SDL - Simple DirectMedia Layer library. SDL is used to create windows and OpenGL rendering contexts in a cross-platform fashion.
- http://www.opengl.org/resources/features/fontsurvey/#glf* GLF - OpenGL font rendering library. GLF is used to display the text in the demo application.
- OpenGL Programming Guide, Fifth Edition, Addison-Wesley, 2005. The "Red Book" is an invaluable resource for all levels of OpenGL development.