Monday, January 12, 2009

The world is not a SceneTree

I wanted to expand on something I thought of while writing my thoughts on the SceneTree. Before I made the argument that transform update and animation should not be tightly coupled with rendering. Now I want to touch on something else: it may be useful to think of the SceneTree as conceptually a tree, but the API and implementation should not be implemented as a global tree data structure.

I jumped the gun on posting this entry, so those of you with RSS feeds may have seen an earlier draft go by. It was pointed out to me that perhaps I was trying to tackle too much in one entry, so I decided to split it up. The win for the reader is (hopefully) less jumbled thoughts from me. The win for me is three days' worth of entries. Win-win!

The title of this entry was shamefully stolen from Tom Forsyth's article on scene graphs (just say no):
The world is not a big tree - a coffee mug on a desk is not a child of the desk, or a sibling of any other mug, or a child of the house it's in or the parent of the coffee it contains or anything - it's just a mug. It sits there, bolted to nothing. Putting the entire world into a big tree is not an obvious thing to do as far as the real world is concerned.
Very succinctly put. Tom was talking about scene graphs, but the same criticism applies to the SceneTree introduced in the previous entry. Summing it up, we replace the scene graph with three systems: the SpatialGraph, the RenderQueue, and the SceneTree. The SpatialGraph deals with visibility determination. The RenderQueue is constructed from the SpatialGraph and efficiently issues render commands. The SceneTree handles animation and transform update. It is certainly a better breakdown that scene graphs, which try to do it all. Unfortunately, Tom's criticism of scene graphs still applies to the idea of a SceneTree.

I have no idea if the original author of the SceneTree concept is using a global tree structure. I just want to demonstrate that a straightforward implementation of SceneTree which used such a structure suffers from many of the same problems that a scene graph does. Given the amount of scene graphs I've seen over time, it is obvious that some people might be tempted to choose a global tree structure to implement a SceneTree.

First, let's take a step back and think about what we really need to accomplish with the SceneTree:
  • We have a number of objects that are in the world.
  • These objects have transforms describing where they are in the world.
  • For a relatively small number of objects, their transforms may be relative to some other frame of reference (i.e. a gun may be attached to a character).
  • There may be different types of attachment -- you may have situations where you want translation to be relative and rotation absolute, or vice versa. Some attachments may be accomplished through physical constraints simulated in the physics system rather than perfect constraints in the update system.
  • When a transform changes due to animation, physics, or other means, we want those changes to propagate to relative objects in the most efficient manner without doing redundant work.
  • We want to have a large number of objects in the world with a large number of characters receiving animation.
What you'll find is while we have the concept of a global hierarchy, in reality we have a mostly boring common case and a relatively small number of objects that actually have hierarchical relationships.

The boring case is what Tom describes. Look around the room you are in right now, and you will find the vast majority of objects are just "there". They do not have any sort of child, parent, or sibling relationship with each other. While most will have physical constraints (gravity pushing down, whatever they are on pushing up), this is due to physical interaction, not a familial relationship. In most games, this sort of interaction is either static or simulated by a physics system.

Having one root node for the entire world and having all these objects have be children of that root node makes no sense, or at the very least conveys no useful information. What does it mean to be a child of "the world" ? In a game, most of these things are going to be static -- they are the literal immovable object. Why clutter up our SceneTree with such things? Sure we might want to attach something to them (destructible lamp on a wall, for example), but that's the same as attaching it to the root frame of reference if the wall doesn't move. There may be some other special cases specific to a game (the wall can not move but it can disappear), but that's just it -- they are special cases that can be handled at a higher level.

There may be no static object at all to put in a tree. Many games treat all the static objects in an atomically streamable unit as one giant blob. There's no reason to deal with individual objects if you can just submit a precomputed command list to the renderer, or an efficiently packed polygon soup to the physics system. 

Most of the same arguments apply with dynamic things -- the familial relationship with "the world" is not useful. What we've really got is a lot of potential root nodes for attachment -- a list or array of conceptual trees. But if nothing is attached to an object, does it need a hierarchical update? Probably not. Now, for code simplicity, it may be simpler to just treat those as a 1-node hierarchy, but if your number of objects which have attachments is relatively low to everything else, you may be better off using a separate data structure which only holds things that have an actual hierarchy. 

Dynamic objects provide other interesting opportunities for optimization which do not map well to a generic tree structure. For objects that are physically simulated, the code can take advantage of the physics system's knowledge of what actually moved that frame, and only update those objects. This is considerably more efficient than traversing an entire SceneTree and checking if an object needs to update. Obviously none of this precludes having a generic tree structure, it just brings into question what good it does.

We've established that for the grand majority of our objects, we do not need a generic tree representation to process transform updates. We still have cases that do indeed have hierarchical relationships such as characters or attachments. Tomorrow, I will discuss why a generic tree structure is not a good choice for characters and bones.

Friday, January 9, 2009

Joe Duffy on CAS operations and performance

Interesting results in a blog entry from Joe Duffy about the scalability of compare-and-swap instructions across x86 architectures. I found this via the gdalgorithms list. The money graph:

The most common occurrence of a CAS is upon lock entrance and exit. Although a lock can be built with a single CAS operation, CLR monitors use two (one for Enter and another for Exit). Lock-free algorithms often use CAS in place of locks, but due to memory reordering such algorithms often need explicit fences that are typically encoded as CAS instructions. Although locks are evil, most good developers know to keep lock hold times small. As a result, one of the nastiest impediments to performance and scaling has nothing to do with locks at all; it has to do with the number, frequency, and locality of CAS operations.

One thing that has been talked about on the recent gdalgorithms thread is understanding that your only options are not between lock-free and heavy operating system lock primitives. Some operating systems provide lighter weight synchronization primitives (critical sections on Win32 with a spincount come to mind). In the past on some consoles I've come up with faster mutex implementations than what the operating system provides, although this is not something I'd recommend to the weak of heart -- it is unbelievably easy to get this stuff wrong, even for very experienced developers.

I think the lesson from this entry is to always measure. Maybe for a specific application a fancy lock-free algorithm may be faster, but maybe not -- as we see from above it can depend on a lot of factors. The other thing to avoid is premature optimization -- get your algorithm correct and without races first using locks, and then evaluate whether a lock-free algorithm is appropriate and faster.

Sunday, January 4, 2009

Animation and physics sitting in a tree

Over at Diary of a Graphics Programmer, Wolfgang Engel points out a gamedev.net post introducing some nomenclature for the systems involved with rendering scene geometry. 

Instead of a scene graph approach (just say no), they propose the SpatialGraph (for visibility determination), the SceneTree (for animation and transform update), and the RenderQueue (filled from the SpatialGraph, just draws stuff fast). It is a division that makes much more sense than a scene graph, which tries to handle all of the above. 

One of my biggest dislikes about scene graphs is how they misled us all to think that animation and transform update belongs in the renderer in an engine design. The renderer just deals with the results of these operations -- it doesn't much care how the transforms were created, it just needs them. Coupling these two things doesn't really make much sense.

If anything, animation/transform update is much more tightly coupled with physics and collision. Conceptually animation is saying "here's where I would like to go" and the physics system says "here's where you can go." It is even more intertwined than that, because the physics system has complete control over things like rigid bodies and ragdolls -- saying both where they would like to go and where they can go. 

If you are doing any sort of ragdoll blending between animation and ragdoll simulation, you have a feedback loop. The animation system figures out its desired pose, and the ragdoll figures out its desired pose. There is a blending step but its not always obvious where that should go. Traditionally the animation system is responsible for blending transforms, but there's an argument that the physics simulation should do it because it knows the constraints and limitations on where the bones can be physically placed. 

I haven't gotten into other interesting issues such as vehicles, which can be a blend of physical simulation (the car moving) and animation (the door being opened) similar to ragdolls, and when you add attachments to the mix (gun turret), keeping the animation and physics systems in sync can be a challenge. 

I'm starting to think that animation and physics are two sides of the same coin. I'm calling the combined thing "simulation." Obviously different games are going to have different complexity of physics, and I'm not sure coupling these two things so tightly is a one size fits all thing. What I do know is that coupling animation/transform update with the renderer is almost never the right thing to do, even though there are still a large number of scene graph based rendering libraries available. 

Saturday, January 3, 2009

Good Middleware revisited

I was having a conversation with a good friend about the Good Middleware entry, and he made a very interesting point. He pointed out that in the future, good middleware is going to allow you to hook their multicore job scheduling just like you can hook memory allocation.

Let's start out with a quick review of why we hook memory allocation, as it will help us understand why we're going to want to hook job scheduling.
  1. Consoles are fixed memory environments. It is critical to balance overall memory utilization for your title. If you can provide five more megabytes for game data over the built-in system allocator, then that allows you to have more stuff in your game, which generally can lead to a better game.
  2. Different titles have different memory needs. A 2D side-scroller is going to have different memory requirements than a world game. Some types of games lend themselves to pre-allocation strategies, where you know ahead of time how many assets of what type you are going to have. Every game I've worked on has spent time tuning the memory allocator to squeeze out as much memory as possible, and what works in one title may not be applicable to another. 
  3. The system allocator may perform badly, particularly when accessed from multiple threads. Often, console operating systems will have poor allocator implementations that have a lot of overhead. Sometimes they will use allocators that are designed for a much wider range of application than games, or designed for systems with virtual memory. Others explicitly punt on providing anything but systems designed for large, bulk allocations and expect you to handle smaller allocations on your own. Finally, no matter how good an allocator may be, it does not have knowledge of your application's memory needs or usage patterns, which are things you can use to optimize your allocator
  4. The system allocator may not provide good leak detection or heap debugging tools. It may provide none at all. If you're doing a console game, I know of no standalone leak detection software such as BoundsChecker that is available. Often you have to provide these facilities yourself.

 CPU core usage can be thought of as a system resource that is arbitrated just like memory. Most middleware that does use jobs spread across multiple cores I've seen so far usually has its own internal scheduler running on top of a couple hardware threads. In the early days of multicore middleware, you couldn't even set the CPU affinity of these vendor threads without recompiling the library, something which is key on the 360. Increasingly, I think games are going to want to own their job scheduling and not just leave it up to the operating system to arbitrate CPU scheduling at the thread level. 
  1. Consoles have a fixed number of cores. Goes without saying but worth touching on. You have a fixed amount of CPU processing you can do in a 30hz/60hz frame. Ideally you want to maximize the amount of CPU time spent on real game work and minimize the amount spent on bookkeeping/scheduling. 
  2. You know more than the operating system does about your application. One approach to scheduling jobs would be to just create a thread per job and let them go. You could do that, but you would then watch your performance tank. Forget the fact that creating/destroying operating system threads is not a cheap task, and focus on the cost of context switches. Thread context switches are very expensive and with CPU architectures adding more and more registers, they are not going to get any cheaper. A lot of game work are actually small tasks more amenable to an architecture where a thread pool sits and just executes these jobs one after another. In this architecture, you do not have any thread context switches between jobs or while jobs are running, which means you are spending more time doing real game work and less time on scheduling.
  3. It is difficult to balance between competing job schedulers. Imagine a world with 16 to 32 cores, and both your middleware and your game need to schedule numerous jobs to a thread pool. It is going to be very difficult to avoid stalls or idle time due to the fact that you have two job schedulers in your system -- one from the middleware and one from your game. Either you are "doubling up" thread allocation on cores and paying the cost of context switches from one thread pool to the next, or you allocate each thread pool to a different set of cores and hope that the work balances out. Unfortunately, I don't think the latter is going to scale very well and you will end up with a lot of unused CPU capacity.
The third reason is the clincher as to why good middleware in the future will allow you to replace their job scheduler with your own. It's the only solution I can think of that will allow you to balance workloads across many, many cores. It is possible in the future that operating systems will provde good enough thread pool/microjob APIs that we won't have to be writing our own, but even in that case, you'd still want the hooks for debugging or profiling purposes. 

I haven't yet seen any middleware that really allows you to hook this stuff easily. Have you?



Friday, January 2, 2009

RTCD on Particle Rendering

Optimizing the rendering of a particle system.

Real Time Collision Detection has an entry on optimizing particle rendering which is a good survey of many techniques you can use to make your particles fly. Don't forget to read the comments, there are some good suggestions in there, too!

Over the last five+ years or so people have been a lot more open about this kind of stuff, and I like it. There's just too much you have to deal with these days to write a good game to have to figure it out on your own. I don't know about other people's experiences but earlier in my career there seemed to be much more of a secretive culture in game development than other types of software.

Whether it is entries like the above or things like Insomniac's Nocturnal initiative, people are much more open about ideas these days. Ultimately, we all have to execute, and doing that well is hard enough on its own. Being more open is healthy and ultimately will lead to better games across the board.

Tuesday, December 30, 2008

XNA Studio 3.0: First impressions

I've been doing some graphics prototyping in my spare time lately, and I decided to see if there was a better way to go about it.

I've used RenderMonkey in the past, and while it certainly has its uses, ultimately it left me dissapointed. For a straightforward shader it's fine, but when you start getting into more complicated techniques it starts to break down. After clicking on a billion buttons to get my render targets and passes set up the way I needed I really wished I could just write some code. The lack of any ability to compute anything on the CPU side is what really frustrated me. What I ended up doing was computing a lot of things shader-side that in a real application would be done on the CPU, and it needlessly complicated the shaders.

Another avenue I've pursued is using the DirectX sample framework or an OpenGL sample, and working from there. Even in these simplified environments I find you end up doing a lot more bookkeeping then actual code. Additionally, OpenGL seems very unstable on my laptop -- even vanilla samples are crashing in the Nvidia DLLs. 

The last couple days I've been playing around with Microsoft's XNA Studio 3.0 as a graphics prototyping tool, and my impression so far is very favorable. The API is pretty straightforward, and for the most part the abstractions seem in the right place. Porting the current thing I'm working on from C++ to C# took no time at all, and so far I've spent much more time writing meaningful code rather than working on scaffolding. 

The GameComponent architecture they have is interesting -- for example, I found myself needing an orbit camera. Rather than write one myself, I just grabbed a component someone else has written. It was one of those few times where the code just dropped in. The only drawback is they haven't implemented taking input from the 360 controller, but that's easy enough for me to write and a lot less involved than doing the whole thing.

I was also surprised how easy it was to get my little project up and running on my Xbox 360. I didn't have to do any code changes, and it all Just Works. The debugging is solid but experienced Xbox 360 developers will miss all the nice tools you get with the real SDK. I'd be really nice if Microsoft released a lightweight version of PIX for XNA that worked with the 360, but I guess you can't have everything.

There have been some minor annoyances. Some of the C# syntax for dealing with vertex arrays can be cumbersome -- new'ing a Vector3 never feels right to me. You also lose access to some hardware features. For example, for some reason it thinks I can't create a floating point depth buffer on my laptop, when I'm very certain the GPU I have can handle that. 

On the 360 side, they simplify a lot of the hardware details, but this has limitations. I can't find any way to resolve the depth buffer on the 360 to a texture in XNA. While these limitations are understandable given the intended audience of XNA, it is somewhat annoying.

All in all, I think it is a pretty good framework so far. I don't think you are going to max out the hardware with it, but for a large category of games it will work really well. 

Monday, December 29, 2008

Interesting GDC Talk #2: Hitting 60hz with the Unreal Engine

Hitting 60Hz with the Unreal Engine: Inside the Tech of MORTAL KOMBAT vs DC UNIVERSE

This is another talk I will be attending. I work at the same studio as Jon and have had more than a few conversations with him about this very topic, and he has given internal presentations on the subject. I don't want to give too much about the talk away, but it is a very interesting case study in taking a graphics pipeline completely designed for a 30hz game, and modifying it to meet a game's performance goals without sacrificing a whole bunch of functionality. Just one example: when we originally got our hands on Unreal, the sum of all post processing took nearly half of a 60hz frame. While Epic has optimized this over time, this gives an idea of some of the challenges on the GPU. It is easy enough to remove functionality until it runs fast enough, but the real trick is to preserve quite a bit of that functionality and still get it under time.

The section on porting the Unreal particle system to the SPU is definitely worth paying attention to - this was a collaborative effort done by a few tech group guys at Chicago. It underscores the difficulty of dealing with legacy code bases that are not designed for NUMA architectures, and the ingenuity and hard work required to overcome that hurdle. This talk should definitely be on your list.