Wednesday, September 23, 2009

Safety

I recently came across two articles that tangentially talk about the same thing -- technologies that are safe. Safe as in usable and unlikely to get you into trouble.

The first was 30 years of C over at DadHacker. The second is a Joel on Software article (nice to see him actually writing about technology instead of pimping FogBugz or whatever he's selling these days) called The Duct Tape Programmer.

Anyway, I thought I'd write some of my opinions of the language features mentioned in these two articles. For those of you who've known me a while, it just may surprise you where my thoughts have evolved over the years.

Let's cover the C++ features:

Exceptions - While I have no problems with exceptions in a language like Java or C#, in C++ they just don't work well. In games we turn them off for code size and performance reasons, but I would tend to avoid them in C++ even if there were zero hit in either area. It is just too difficult to write exception-safe code in C++. It takes extra, deliberate work, and the things that can break are often very subtle. Most importantly, the entire culture and ecosystem built around the language is not exception-friendly. Rare is the library that is exception-safe, in my experience. So just say no.
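
To make the subtlety concrete, here's a minimal sketch (Foo and LoadConfig are made-up names) of the kind of code that looks fine until you remember that nearly anything can throw:

#include <memory>

struct Foo { /* ... */ };

// Hypothetical function that may throw (say, on a parse error).
void LoadConfig(Foo* foo);

// Nothing here *looks* wrong until you consider exceptions.
void Process()
{
    Foo* foo = new Foo;
    LoadConfig(foo);  // if LoadConfig throws, foo leaks -- delete is never reached
    delete foo;
}

// Exception safety is the extra work -- RAII on every resource:
void ProcessSafe()
{
    std::auto_ptr<Foo> foo(new Foo);  // std::unique_ptr in C++11 and later
    LoadConfig(foo.get());            // foo is freed even if this throws
}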

RTTI - Not very useful in practice. Again, there are overhead concerns in games, and most games I've seen end up rolling their own anyway. But the base implementation is rather inflexible -- it is reflection of only the most basic sort, and often in the places you do need run-time type information, you need a lot more than just class ids. It's a feature with its heart in the right place, but it just doesn't come together very well. I think part of the problem is its all-or-nothing nature -- usually only portions of my architecture need any sort of reflection, and I don't want to pay for it on all the other classes.
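
For illustration, here's a minimal sketch of the kind of opt-in scheme I mean -- hypothetical names, but representative of what games roll themselves. Only the classes that need type information pay for it:

class TypeInfo
{
public:
    explicit TypeInfo(const TypeInfo* parent) : m_parent(parent) {}

    bool IsDerivedFrom(const TypeInfo& base) const
    {
        // walk up the chain of type infos
        for (const TypeInfo* t = this; t; t = t->m_parent)
            if (t == &base)
                return true;
        return false;
    }

private:
    const TypeInfo* m_parent;
};

class GameObject
{
public:
    static const TypeInfo ms_type;
    virtual const TypeInfo& GetType() const { return ms_type; }
    virtual ~GameObject() {}
};
const TypeInfo GameObject::ms_type(0);  // no parent

class Light : public GameObject
{
public:
    static const TypeInfo ms_type;
    virtual const TypeInfo& GetType() const { return ms_type; }
};
const TypeInfo Light::ms_type(&GameObject::ms_type);

// usage: if (obj->GetType().IsDerivedFrom(Light::ms_type)) { ... }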

Operator Overloading - Rarely useful outside of math libraries. I'm not even a huge fan of the iostreams model, to tell the truth.

Multiple inheritance - Only with pure virtual interfaces, and even then it should be used rarely and avoided where possible. Sharing implementation via inheritance goes awry often enough with single inheritance; adding more base class chains just makes the problem worse.
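
In other words, the only form I'm comfortable with looks something like this (illustrative names):

// Pure virtual interfaces carry no implementation, so there is
// nothing to share and nothing to go awry.
class ISerializable
{
public:
    virtual ~ISerializable() {}
    virtual void Serialize(class Stream& stream) = 0;
};

class ITickable
{
public:
    virtual ~ITickable() {}
    virtual void Tick(float dt) = 0;
};

// At most one concrete base; everything else is an interface.
class Pickup : public ISerializable, public ITickable
{
public:
    virtual void Serialize(Stream& stream);
    virtual void Tick(float dt);
};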

Templates - The big one. I'll admit to having a love affair with templates in my twenties. What can I say? They were fun and a shiny new toy. I sure had some excesses, but even my worst one (a cross-platform file system library) shipped in multiple products. Even then I hid them all behind a straight-C API, so only programmers who had to either debug or extend the library innards had to deal with the templates. If I had to do it again, I'd probably do it differently, but I could say that about any code I've written in my career, whatever the language. I do know that it was an improvement over the previous file system library that was in use, because the new one actually worked.

I can say with a degree of certainty that template metaprogramming is a bust for practical use. There are a few major problems with it: the language isn't really built for it (it's more a clever side effect than anything), there is no good way to debug it, and functional programming isn't very ingrained in the game development culture. Ironically, I think the last part is going to have to change as parallel programming creeps into larger and larger sections of the architecture, but that won't make template metaprogramming practical.
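
For anyone who hasn't run into it, the canonical toy example gives the flavor -- computation by abusing template instantiation. Now imagine debugging a thousand-line version of this with compiler error spew as your only tool:

// Compile-time factorial: recursion via instantiation,
// termination via specialization. Clever, and opaque to any debugger.
template <unsigned N>
struct Factorial
{
    static const unsigned value = N * Factorial<N - 1>::value;
};

template <>
struct Factorial<0>
{
    static const unsigned value = 1;
};

static const unsigned kPermutations = Factorial<6>::value;  // 720, computed by the compiler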

In any case, these days templates are just a tool in the toolbox, and not one I reach for that often. The code bases I've been working in recently all roll their own template container libraries* (provided for us by external vendors), and they do the job. My experience with code sharing via templates is that more often than not it isn't worth the trouble, but sometimes it is. Like anything we do, it is a tradeoff, and not one I feel particularly passionate about either way.

*A somewhat amusing side note: I've done performance and code generation tests with one of the hand-rolled template container libraries I've encountered versus STL. STL came out on top for a lot of simple things like loop iteration overhead and sorting, on all the platforms I was interested in. Of course, I'm not about to rewrite hundreds of thousands of lines of code to use STL, and STL is still horrible for memory management. But I suppose that underscores the point "30 years of C" made -- even something as simple as a container library is hard to get right, even for experts. Which library I'm talking about shall remain anonymous, for its own protection.

The Other Cost of Code Bloat

The other day I almost wrote a redundant version of the exact same class that someone else on my project had written. In fact, if I hadn't asked this person a couple of general C# questions, and he hadn't put two and two together, I probably would have written that redundant class. Good detective work on his part, and shame on me for not searching the code base to see if someone else had already tackled this problem. While I've got a pretty good feel for the C++ which makes up the majority of code in our engine/tools, I haven't looked at the C# side as much as I probably should have.

As the code bases we write get larger and larger, and the team sizes we deal with get larger and larger, the question of how to avoid this scenario becomes an important one. Ideally you hire programmers who perform the necessary code archeology to get a feel for where things are in the code base, or who will ask questions of people more familiar with the code when unsure. Getting a code base of a million or more lines "in your head" takes time, though. I've been working with our licensed engine for about four years now, and there are still nooks and crannies that are unfamiliar to me.

Better documentation should help, but in practice documentation is rarely read -- usually it is either nonexistent or horribly out of date. With a licensed engine, you are at the mercy of the little documentation you are provided, and at the end of the day, the code itself is the best documentation.

A sensible architecture with clear delineation of what should go where is often a bigger help. Knowing [where to look] is half the battle, as a Saturday morning cartoon show put it. But with a licensed engine, you are again at the mercy of what you are provided. Finding existing functionality usually comes down to experience with the code base and code archeology skills.

Recently, Adrian Stone has been writing an excellent series on minimizing code bloat. While the techniques he describes are about eliminating redundant generated and compiled code rather than source code itself, the mindset is the same when you are removing actual lines of code. Aside from the important compile time, link time, and executable size benefits, there is another benefit to removing as much code as you possibly can -- the code will occupy less "head space."

Unused or dead code makes it that much harder to do code archeology, and it can make lower level changes to the engine or architecture more difficult -- it is one more support burden and implementation constraint. In the past, removing large legacy systems (whether written internally or externally) has had unexpected benefits in simplifying the overall architecture -- often there are lower level features that only exist to support that one dormant system.

One of my favorite things to do is delete a lot of code without the tools or game losing any functionality. It's not only the cleaning out of actual lines of code, but the freeing of the corresponding head space that is a wonderful feeling -- "I will never have to think about X again." With the scale of the code bases we deal with today, we don't have the brain power to spare on things we don't need.

Monday, September 21, 2009

Rogue Programming

Gamasutra had an interesting article today titled Gaming the System: How to Really Get Ahead in the Game Industry. I found it had more to say about the political dysfunction that can often accompany game development than about how to be successful. To put it another way: if you find yourself having to follow the sneakier guidelines in this article too much, you might want to consider a change in where you work.

The programming section is titled "Just Do It" and does have some truth to it. One of my leads and I came up with the term "rogue programming" for what he describes, which was half-joke, half-serious. Here's a quote:

As a programmer, it's not uncommon to see problems that you think should be fixed, or to see an opportunity to improve some piece of code, or speed up a process that takes a lot of time. It's also not uncommon for your suggestion to be ignored, or dismissed with an "it's not broke, so let's not fix it" response...

What should you do? You should just do it -- on your own time.

This is advice fraught with risk, because here's a hard-earned lesson for you: you don't always know best. I know, I know, you're a superstar hotshot programmer, and you see something that is broken, so it must be fixed. Sure, it is not in the schedule, but it'll just take a few hours -- what's the harm? The code base will be so much better when you're done, or the artists and designers will have a feature they didn't have before. How can that not make the project better?

Let me give a cold-water splash of reality: when it is all said and done at the end of the project, you're going to ship with a lot of broken code. I'm not talking about obvious bugs in the shipped project, I just mean nasty, hack-filled, just-get-it-out-the-door brokenness in the code base, and some of that code will be code that you wrote. If this wasn't true, then a long-lived project like the Linux kernel wouldn't still have thousands of developers contributing to it -- obviously, there is still stuff that is "broken" and can be improved!

So in the big picture, a single section of brokenness is not going to make or break your project. Usually, there are bigger fish to fry on any given day, and it's best to fry them. If your project is cancelled because a major feature was late, will it matter that you cleaned up the way you calculated checksums for identifiers?

That said, if after all of this, you still think something is worth doing, let me tell you how to successfully rogue program:

First, and most importantly, let's define "on your own time." On your own time means you are hitting your scheduled work on schedule, and that will not change if you fix or implement this one thing. If you're behind on your scheduled work, you shouldn't be doing any rogue programming at all. Whether not impacting your schedule means you work a Saturday, do some exploration at home, or just exploit some slack in your schedule, if you don't complete the tasks you are supposed to be working on, you've done more damage than whatever you built on the side could possibly offset.

Additionally, you need co-conspirators. These days, programming is a very collaborative process, and for the most part, the cowboy mentality is a dying thing. If you talk to your lead or other engineers about a problem ("Hey, X is pretty f'd up") and no one else agrees, or you can't make the case, then hey, maybe X really isn't that important! You want to be working with a group of people who can be convinced by sound arguments that something is a problem, and a lot of the time a little discussion about a problem can turn into a scheduled task -- and no rogue programming.

Often you'll be faced with something that everybody agrees *should* be done, but there's no time to do it. In these cases, I've found with good leads (which I've been blessed with the last few years), you can get tacit approval to do something "on your own time." This often takes trust -- I wouldn't recommend going off and "just doing it" until you've earned that trust.

If you've gotten to this point, you're in pretty good shape to go off and do some "rogue programming" -- because at this point (and this is where the joke comes in), it really isn't rogue at all.

Now if you're at a company where you constantly feel like you need to "go around people" to "just do things," then maybe you really do need a change of venue, because that is not a healthy team. I happen to know someone who is hiring.

Thursday, August 27, 2009

Big O doesn't always matter

The other day I was optimizing a bit of debug code which verifies the integrity of objects in memory. The details aren't super-important, but the gist is that it's a function which runs periodically and makes sure that objects that should have been garbage collected were indeed purged from memory.

I don't make a general habit of optimizing debug code, but this was a special case -- before, this process only ran in the debug build. Artists and designers run a "development" build, which is an optimized build that still includes assertions and many other development checks.

We recently ran into a bug that would have been detected much earlier if this process had been running in the development build. While programmers run the debug build almost exclusively, we tend to stick to simpler test levels. Trying to debug an issue on a quick-loading, small level is much easier than on a full-blown one.

The algorithm is pretty simple -- objects have somewhat of a tree structure, but for various reasons they only have parent links and not child links. For objects at the top-level of the tree, we know for sure whether they should be in memory or not. Objects at the bottom of the tree keep all their parents in memory if they are connected to the reference graph. So the debug check looks at each object and verifies that it is not parented (via an arbitrarily long parent chain) to an object which should have been purged.
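
A sketch of the shape of it, with made-up types (std::vector standing in for our own containers):

#include <cassert>
#include <vector>

struct Object
{
    Object* parent;  // objects only have parent links, not child links
};

void VerifyNoLeakedObjects(const std::vector<Object*>& allObjects,
                           const std::vector<Object*>& shouldBePurged)
{
    for (size_t i = 0; i < allObjects.size(); ++i)
    {
        // walk the arbitrarily long parent chain
        for (Object* p = allObjects[i]->parent; p; p = p->parent)
        {
            // linear scan of the purge list
            for (size_t j = 0; j < shouldBePurged.size(); ++j)
                assert(p != shouldBePurged[j] && "purged object still referenced");
        }
    }
}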

First thing I did was measure how long the process was taking, and did some lower level profiling to get an idea of where time was spent. Most importantly, I also saw where I was running into cache misses.

The first pass of optimization: the original loop was doing a lot of work per object that was simply unnecessary, because it was using a generalized iterator with more functionality than this specific case needed -- for most operations, particularly at editor time, that extra overhead is not a big deal. Removing the extra work sped up the process; it now took about 90% of the original time.

I then tried some high-level optimizations. There were two things I tried. First, the inner loop linearly checked each high-level object against an unsorted array of objects we know should be purged; I replaced this with a hash table from our container library. Second, I realized that a memoizing approach should help here -- since I'm dealing with a tree, I could use a bit array to remember whether I had already processed a parent object and deemed it OK. This would allow me to cut off traversal of the parent chain, which should eliminate a lot of work. Or so I thought.
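
Here's roughly what the memoized version looked like, extending the sketch above with an index field (std::set stands in for the container library's hash table):

#include <cassert>
#include <set>
#include <vector>

struct Object
{
    Object* parent;
    size_t index;  // position in the global object table
};

void VerifyNoLeakedObjectsMemoized(const std::vector<Object*>& allObjects,
                                   const std::set<Object*>& shouldBePurged)
{
    // one bit per object: "this object's parent chain already checked out"
    std::vector<bool> verified(allObjects.size(), false);
    for (size_t i = 0; i < allObjects.size(); ++i)
    {
        for (Object* p = allObjects[i]->parent; p; p = p->parent)
        {
            if (verified[p->index])
                break;  // this chain was already walked for an earlier object
            assert(shouldBePurged.count(p) == 0 && "purged object still referenced");
            verified[p->index] = true;
        }
    }
}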

The new algorithm was faster, but not by much -- only 85% of the original running time. The additional complexity was not worth 5% of running time, so I went back to the simpler approach. This isn't unusual in optimization -- you can often try something you think will be a big help that turns out not to matter much. I've made the mistake in the past of sticking with a more complicated implementation for a marginal gain -- it was not worth it, and it made other optimizations with potentially bigger impact harder to do.

As for why the gain wasn't bigger: the unsorted array was relatively small (a handful of elements), so a linear search was faster because it was simpler and had better cache behavior than the hash table implementation I was using. And the tree structure of the objects was broad but not deep, so it's obvious in hindsight why memoization was not a win.

Now, one thing that is nice to have is a decent container and algorithm library. I have that at my disposal, so implementing these two changes was a matter of minutes instead of hours. With that kind of arsenal, it is easy to try out algorithmic changes, even if they end up not working out.

At this point, I took another look at the cache behavior from my profiling tools. I tried something incredibly simple -- prefetching the next object into the cache while processing the current one. This got the process running in 50% of the original time -- a 2x speedup, and likely fast enough for me to enable it in the development build. I'm going to measure again, and see if there are any other easy wins like this to be had.
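
The change itself was about two lines (hypothetical names; __builtin_prefetch is the GCC-style intrinsic, and each console compiler has its own equivalent):

for (size_t i = 0; i < allObjects.size(); ++i)
{
    if (i + 1 < allObjects.size())
        __builtin_prefetch(allObjects[i + 1]);  // start the memory fetch now
    VerifyObject(allObjects[i]);  // this work overlaps the fetch latency
}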

The processors we use are fast -- incredibly fast, and even with branch penalties on the in-order processors of today's consoles, they can still do a lot of work in the time it takes to retrieve data from main memory. So while on paper I'm using "slow" algorithms with worse big-O complexity, in practice your memory access patterns can easily drown out any extra computation. The key, as always, is to measure and test your theories, not just assume that any given approach will make something faster.

Monday, August 24, 2009

Stencil states for rendering light volumes

In the ShaderX 7 article "Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer", the author describes a number of approaches for rendering the lights into the lighting buffer. These are all pretty standard approaches for any deferred technique, but I thought the description of using stencil didn't explain very clearly how to set up the stencil states, probably due to space constraints.

The way it is worded implies that you still need to change the depth comparison function. This is not the case, and that is most of the point of the technique. As the article points out, changing the depth test makes many GPUs take their early-Z rejection and go home.

I'm sure you can find this detail elsewhere on the net, but my cursory searches did not find anything, and hopefully this will save at least one person some time. Standard caveats apply: I haven't extensively tested this stuff.

Assuming convex light volumes, this is what I found worked well:


// render backfaces so that only pixels in front of the backface have stencil incremented
AlphaBlendEnable = false
StencilEnable = true
ColorWriteChannels = None
CullMode = Clockwise
DepthBufferEnable = true
StencilFunction = Always
StencilPass = Keep
// if a pixel is in front of the volume backface, we want it lit
StencilDepthBufferFail = Increment

// render volume

// render frontfaces so that any pixels behind them have stencil decremented
CullMode = CounterClockwise
// pass the stencil test if reference value < buffer, so we only process pixels marked above.
// The reference value is 0. This is not strictly necessary, but it is an optimization.
StencilFunction = Less
// if a pixel is in front of the volume frontface, it is not inside the volume
StencilDepthBufferFail = Decrement

// render volume

AlphaBlendEnable = true
ColorWriteChannels = RGB
// only process pixels with 0 < buffer
StencilFunction = Less
// zero out the stencil so we don't need a separate clear for the next volume
StencilPass = Zero
// don't do anything if we fail the depth test
StencilDepthBufferFail = Keep

// render a screen-space rectangle scissored to the projection of the light volume


Note that unlike shadow volumes, the light volume intersecting the near plane is not a concern here. We are rendering the frontfaces to find pixels that are in front of the light volume -- if parts of the light volume are in front of the near plane, then by definition any pixels we're rendering are behind those parts. So there is no need to render a cap in this case.

The light volume intersecting the far plane is a concern. One way to handle this case is to use a projection matrix with an infinite far plane, like shadow volumes do. Another way to handle it would be to detect this case and not use the stencil approach at all, instead rendering a screen space rectangle scissored to the light bounds.

Finally, I've had better luck switching to rendering the backfaces without depth testing when the camera is inside the light volume, instead of using a screen space rectangle. But I think this has more to do with a bug in my scissoring code than with any fundamental problem!

Sunday, August 23, 2009

XNA: at times awesome, at times frustrating

I'm not sure why, but Microsoft seems intent on crippling XNA for the 360. Perhaps they want to sell more dev kits.

I recently had some more time to work on my little toy project. After some work, I've now got a deferred lighting implementation on the PC.

For the lighting buffer construction, at first I was using a tiled approach similar to Uncharted, which did not require blending during the lighting stage. It did work for the most part, and allowed me to use LogLUV for encoding the lighting information, which was faster. But it had issues - I didn't have any lighting target ping-ponging set up, so I was stuck with a fixed limit of seven lights per tile. Also, even with smallish tiles, you end up doing a lot of work on pixels not actually affected by the lights in question. So I wanted to compare it to a straightforward blending approach, switched back to an FP16 target, and rendered the light volumes directly (using the stencil approach detailed in ShaderX7's Light Pre-Pass article).

So this all worked great and my little toy is rendering 100 lights. Of course, on the 360, there's a problem. Microsoft, in its infinite wisdom, decided that the FP10 buffer format on 360 would blow people's minds and it is not supported in XNA. They are using an actual FP16 target, which does not support blending.

So I guess it is going to be back to alternate lighting buffer encoding schemes, bucketing, and render target ping-ponging for me. It's not a huge deal, but it is frustrating.

It is a real shame that XNA gives the impression that the 360 GPU is crippled, when in reality it is anything but. Couple the lack of FP10 support with the inability to sample the z-buffer directly and the lack of control over XNA's use of EDRAM, and they've managed to turn the 360 into a very weak, very old PC.

Least common denominator approaches generally haven't fared that well over the years. An XBLA title implemented in XNA is going to be at a fundamental disadvantage -- I don't think you are going to see anything approaching the richness of Shadow Complex, for example.

At the end of the day, Microsoft needs to figure out where they are going with XNA. If they are going to dumb it down and keep it as a toy for people who can't afford a real development kit (people who've been bumping into these low ceilings much longer than me), then they should keep on their current path.

The potential for XNA is really much more, though. Today I wrote a pretty decent menu system in about 45 minutes that handles gamepad, keyboard, and mouse input seamlessly. I don't think I could write that in C++/DirectX anywhere near as fast. If you start looking down the road to future generations of hardware, I'm not worried about the overhead of C# being fundamentally limiting. Games today already use much less efficient scripting languages than C#, and while you are limited to the heavy lifting Microsoft has chosen to implement for you today, who is to say that a future version of XNA couldn't allow shelling out to C++ for the really performance-intensive stuff?

XNA has a chance to become something really great that would be very powerful for a large class of games. It remains to be seen if Microsoft will let it.

Wednesday, August 19, 2009

One has to have had inflated expectations to experience disillusionment

A colleague sent along this item, which asks if Transactional Memory is beyond the "trough of disillusionment".

I've never had any expectations that STM would be some silver-bullet solution to concurrency, and from the get-go viewed it as just another tool in the toolbox. Granted, it is a technique I haven't had much practical experience with yet -- it's on my TODO list. Others might disagree with me, but I'm not even sure how major a factor it is going to be in writing games. Of course, if some major piece of middleware is built around it, I suppose a lot of people will end up using STM, but that doesn't necessarily make it a good idea.

The latest piece of evidence against STM as a silver bullet comes from conversations I've had with colleagues and friends who have a lot of experience building highly-scalable web or network servers. STM advocates hail transactions as a technique with decades of research, implementation, and use. About this they are correct. The programming model is stable, and the problems are well known. But what has struck me is how often my colleagues with much more experience in highly-scalable network servers try to avoid traditional transactional databases. If data can be stored outside of a database reliably, they do so. There are large swaths of open source software devoted to avoiding transactions with the database. The main thrust is to keep each layer independent and simple, and talk to a database as little as possible. The reasons? Scalability and cost. Transactional databases are costly to operate and very costly to scale to high load.

I found the link above a little too dismissive of the costs of STM, particularly with memory bandwidth. I've already discussed the memory wall before, but I see this as a serious problem down the road. We're already in a situation where memory access is a much more serious cost to performance than the actual computation we're doing, and that's with a small number of cores. I don't see this situation improving when we have 16 or more general-purpose cores.

A digression about GPUs. GPUs are often brought up as a counter-argument to the memory wall as they already have a very large number of cores. GPUs also have a very specialized memory access pattern that allow for this kind of scalability - for any given operation (i.e. draw call), they generally have a huge amount of read-only data and a relatively small amount of data they write to compared to the read set. Those two data areas are not the same within a draw call. With no contention between reads and writes, they avoid the memory issues that a more general purpose processor would have.

STM does not follow this memory access model, and I do not dismiss the concerns of having to do multiple reads and writes for a transaction. Again, we are today in a situation where just a single read or write is already hideously slow. If your memory access patterns are already bad, spreading it out over more cores and doubling or tripling the memory bandwidth isn't really going to help. Unlike people building scalable servers, we can't just spend some money on hardware -- we've got a fixed platform and have to use it the best we can.

I don't think that STM should be ignored -- some problems are simpler to express with transactions than with the alternatives (functional programming, stream processing, message passing, traditional locks). But I wouldn't design a game architecture around the idea that all game code will use STM for all of its concurrency problems. To be fair, Tim Sweeney isn't proposing that either: his is a layered design that uses multiple techniques for different types of calculations.

What I worry about, though, is that games are often written in a top-down fashion, with the needs at the gameplay level dictating the system support required. If at that high level the only tool on offer is STM, with the expectation that it is always appropriate, it will be easy to end up in a situation where refactoring that code to use other methods -- for performance or fragility reasons -- is far more difficult and expensive than if the problem had been tackled with a more general toolbox in the first place.

Concurrency is hard, and day to day I'm still dealing with the problems of the now, rather than four or five years down the road. So I will admit I have no fully thought out alternative to offer.

The one thing I think we underestimate is the ability of programmers to grow and tackle new challenges. The problems we deal with today are much harder and much more complex than those of just a decade ago. Yes, the tools for dealing with those problems are better -- but the current set of tools for dealing with concurrency is weak.

That means we need to write better tools -- and more importantly, a better toolbox. Writing a lock-free sw/sr queue is much harder than using one. What I want is a bigger toolbox that includes a wide array of solutions for tackling concurrency (including STM), not a fruitless search for a silver bullet that I don't think exists, and not a rigid definition of what tools are appropriate for different types of game problems.
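
As a closing illustration of the "harder to write than to use" point, here is a minimal single-writer/single-reader queue, sketched with C++11-style atomics for clarity. It's thirty-odd lines, and every memory-ordering annotation is a chance to get it subtly wrong -- but once it's in the toolbox, using it is trivial.

#include <atomic>
#include <cstddef>

// Minimal SPSC (single-producer/single-consumer) lock-free ring buffer.
// Exactly one thread may call Push, and exactly one may call Pop.
// Holds at most Capacity - 1 items (one slot distinguishes full from empty).
template <typename T, size_t Capacity>
class SpscQueue
{
public:
    SpscQueue() : m_head(0), m_tail(0) {}

    bool Push(const T& item)  // producer thread only
    {
        const size_t tail = m_tail.load(std::memory_order_relaxed);
        const size_t next = (tail + 1) % Capacity;
        if (next == m_head.load(std::memory_order_acquire))
            return false;  // full
        m_items[tail] = item;
        m_tail.store(next, std::memory_order_release);  // publish the item
        return true;
    }

    bool Pop(T& item)  // consumer thread only
    {
        const size_t head = m_head.load(std::memory_order_relaxed);
        if (head == m_tail.load(std::memory_order_acquire))
            return false;  // empty
        item = m_items[head];
        m_head.store((head + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T m_items[Capacity];
    std::atomic<size_t> m_head;  // next slot to read
    std::atomic<size_t> m_tail;  // next slot to write
};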