Tuesday, December 22, 2009

More Stencil States for Light Volume Rendering

A while back I wrote a short entry on stencil states for light volumes. The method I posted works, but relies on a zfail stencil operation. Shortly afterward I discovered that it ran considerably slower on ATI cards than on the Nvidia card I had been developing on, and I have been meaning to post an update.

On certain hardware -- specifically ATI PC hardware -- using anything but Keep for the zfail operation can disable early stencil rejection, and this caused quite a slowdown.

The solution I figured out (and I'm sure others have as well) was to switch to a method that relies only on stencil pass operations:

AlphaBlendEnable = false
StencilEnable = true
ColorWriteChannels = None
DepthBufferEnable = true
StencilDepthBufferFail = Keep

// render frontfaces so that any pixels in back of them have their stencil incremented
CullMode = CounterClockwise
StencilFunction = Always
// If a pixel is in back of the volume frontface, then it is potentially inside the volume
StencilPass = Increment

// render volume

// render backfaces so that only pixels in back of the backface have their stencil decremented
CullMode = Clockwise
// pass stencil test if reference value < buffer, so we only process pixels marked above.
// Reference value is 0. This is not strictly necessary but an optimization
StencilFunction = Less
// If a pixel is in back of the volume backface, then it is outside of the volume, and should not be considered
StencilPass = Decrement

// render volume

AlphaBlendEnable = true
ColorWriteChannels = RGB
// only process pixels with 0 < buffer
StencilFunction = Less
// zero out pixels so we don't need a separate clear for the next volume
StencilPass = Zero

// render a screen space rectangle scissored to the projection of the light volume
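
For anyone following along in XNA, the states above map more or less directly onto the XNA 3.x RenderState properties. Here's a rough sketch of the first (frontface) pass -- written from memory rather than pulled from working code, so double-check the names against the framework:

// Sketch: frontface pass of the stencil-pass method, XNA 3.x style.
// using Microsoft.Xna.Framework.Graphics;
void SetFrontfacePassStates(GraphicsDevice device)
{
    RenderState rs = device.RenderState;
    rs.AlphaBlendEnable = false;
    rs.StencilEnable = true;
    rs.ColorWriteChannels = ColorWriteChannels.None;
    rs.DepthBufferEnable = true;
    rs.DepthBufferWriteEnable = false;            // light volumes shouldn't write depth
    rs.CullMode = CullMode.CullCounterClockwiseFace;
    rs.ReferenceStencil = 0;
    rs.StencilFunction = CompareFunction.Always;
    rs.StencilDepthBufferFail = StencilOperation.Keep;
    rs.StencilPass = StencilOperation.Increment;  // pixels in back of the frontface may be inside
    // ...render the light volume...
}

The other two passes follow the same pattern, swapping the cull mode, stencil function, and stencil pass operation as listed above.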


There is a problem with this method -- if the light volume intersects the near plane, it won't work, because the portion of the light volume in front of the near plane will never increment the stencil buffer.

My solution to this was pretty simple -- if the light volume intersects the near plane, I use the zfail method from the earlier post; otherwise, I use the stencil pass method above. For most lights, we're using the fastest path on both major brands of cards. I briefly scanned some papers and articles on shadow volumes (a very similar problem), hoping to find an alternate way to cap volumes intersecting the near plane, but didn't see anything that looked particularly easy to implement or likely to perform well, and this hybrid approach brought performance on ATI and Nvidia cards roughly on par.
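
Choosing between the two paths is just a bound-versus-near-plane test each frame. A minimal sketch, assuming the light volume is bounded by a sphere and using made-up names:

// Sketch: returns true if a sphere-bounded light volume straddles the near plane.
// viewSpaceCenter is the light center transformed into view space (camera looks down -Z).
bool IntersectsNearPlane(Vector3 viewSpaceCenter, float radius, float nearPlaneDistance)
{
    float distanceToNearPlane = Math.Abs(viewSpaceCenter.Z + nearPlaneDistance);
    return distanceToNearPlane < radius;
}

// if (IntersectsNearPlane(center, radius, near)) -> use the zfail method; otherwise use the pass method.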

What about two-sided stencil? This is a mode in DX9 where you can render both backfaces and frontfaces in one pass, with separate stencil operations for each. Because the stencil increment/decrement operations wrap around (i.e. decrementing 0 becomes 255, incrementing 255 becomes 0), ordering doesn't really matter (although you have to set StencilFunction to Always on both). I did some quick tests using two-sided stencil, and my initial results showed it was actually slower than rendering the two passes separately. I didn't spend much time on it, so it is possible I simply screwed something up; I plan to revisit it at some point.
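
For reference, the two-sided variant I tested was set up roughly like this (a from-memory sketch, not production code):

// Sketch: two-sided stencil version of the pass-only method, XNA 3.x style.
// With CullMode.None both windings draw in one pass; the plain Stencil* states apply to
// clockwise triangles and the CounterClockwiseStencil* states to counterclockwise ones.
// Which winding is your volume's "frontface" depends on how the mesh is authored -- verify.
void SetTwoSidedStencilStates(GraphicsDevice device)
{
    RenderState rs = device.RenderState;
    rs.TwoSidedStencilMode = true;
    rs.CullMode = CullMode.None;
    rs.ReferenceStencil = 0;
    rs.StencilFunction = CompareFunction.Always;                 // frontfaces
    rs.StencilPass = StencilOperation.Increment;
    rs.CounterClockwiseStencilFunction = CompareFunction.Always; // backfaces
    rs.CounterClockwiseStencilPass = StencilOperation.Decrement;
    // ...render the light volume once...
}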

Saturday, December 12, 2009

Screen Space Spherical Harmonic Lighting

A while ago Jon Greenberg brought up the idea of accumulating lighting in screen space using spherical harmonics, in a blog entry entitled "Has anyone tried this before?"

I've been doing deferred lighting experiments in XNA, and decided to give this technique a try. Please note I'm not doing any antialiasing and all screenshots are the definition of "programmer art" cobbled together from various freely available assets.

Screenshots show full resolution deferred lighting on top and the screen space SH technique at the bottom in the Sponza Atrium by Marko Dabrovic:



The technique was pretty simple to get up and going, and produces some interesting results. The above images are with 27 point lights, 3 spot lights, and 1 directional. The directional is the only light evaluated at full resolution per-pixel, in the apply lighting stage.

The basic idea is to use a quarter-sized lighting buffer (thus, in this case, 640x360) to accumulate 4-coefficient spherical harmonics. The nice thing is you only need the depth information to do so. I used 3 FP16 buffers to accumulate the SH constants. Points and spots are evaluated by rendering the light geometry into the scene and evaluating the SH coefficients for the light direction via cube map lookup, and then attenuating as normal. For the directional light, I evaluate that in the apply lighting shader. I'm not rendering any shadows.
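
To make the accumulation concrete, here's a CPU-side sketch of the per-light math -- the standard constants for the first two SH bands, scaled by light color and attenuation and summed into the three render targets. (In the actual shaders the basis comes from the cube map lookup; all names here are just for illustration.)

// Sketch: project one light onto a 4-coefficient (L0 + L1) SH basis and accumulate.
// dirToLight is normalized; shR/shG/shB are the three running Vector4 totals for this pixel.
static void AccumulateLight(Vector3 dirToLight, Vector3 color, float attenuation,
                            ref Vector4 shR, ref Vector4 shG, ref Vector4 shB)
{
    const float c0 = 0.282095f;                  // Y(0,0)
    const float c1 = 0.488603f;                  // Y(1,-1), Y(1,0), Y(1,1)
    Vector4 basis = new Vector4(c0,
                                c1 * dirToLight.Y,
                                c1 * dirToLight.Z,
                                c1 * dirToLight.X);
    shR += basis * (color.X * attenuation);
    shG += basis * (color.Y * attenuation);
    shB += basis * (color.Z * attenuation);
}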

For diffuse lighting, it works pretty well, although due to the low number of SH coefficients, you will get some lighting wrapping around onto backfaces, which in practice just tends to give you "softer" lighting. That may or may not be desirable.

Even though the lighting buffer is quarter-sized, you don't really lose any normal detail, since SH accumulates the lighting from all directions. In my test scene, the earth models are the only ones with normal maps (deferred on the left, SH on the right):



I found that naively upsampling the lighting buffer during the apply lighting stage produced halos around the edges of objects. I fixed this with a bilateral filter that takes depth discontinuities into account.
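
The filter itself is the usual depth-aware weighting: when blending the four low-resolution texels around a full-resolution pixel, scale each bilinear weight by how close that texel's depth is to the full-resolution depth, then renormalize. A sketch with illustrative names (the real thing lives in the apply-lighting shader):

// Sketch: depth-aware (bilateral) upsample of the quarter-resolution SH buffer.
// shSamples / sampleDepths are the four surrounding low-res texels and their depths,
// bilinearWeights are the normal bilinear weights, pixelDepth is the full-res depth.
static Vector4 DepthAwareUpsample(Vector4[] shSamples, float[] sampleDepths,
                                  float[] bilinearWeights, float pixelDepth)
{
    Vector4 result = Vector4.Zero;
    float totalWeight = 0.0f;
    for (int i = 0; i < 4; i++)
    {
        // Weight collapses as the low-res sample's depth diverges from this pixel's depth,
        // so lighting doesn't bleed across silhouettes.
        float depthWeight = 1.0f / (0.001f + Math.Abs(sampleDepths[i] - pixelDepth));
        float w = bilinearWeights[i] * depthWeight;
        result += shSamples[i] * w;
        totalWeight += w;
    }
    return result / totalWeight;
}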

I was able to fake specular by extracting a dominant light direction from the SH, dotting that with the half vector, raising it to the specular power, and multiplying the result by the diffuse lighting. It doesn't really give you great results, but it looks specular-ish. I also tried looking up the lighting along the reflected view vector, but found that gave worse results.
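
For the curious, extracting the dominant direction just uses the linear band of the accumulated SH -- roughly the following, where the coefficient layout matches the (L0, L1y, L1z, L1x) ordering above and the names are illustrative:

// Sketch: extract an approximate dominant light direction from the linear SH band.
static Vector3 DominantLightDirection(Vector4 shR, Vector4 shG, Vector4 shB)
{
    // Combine channels by approximate luminance, then map the linear band back to (x, y, z).
    Vector4 lum = shR * 0.3f + shG * 0.59f + shB * 0.11f;
    Vector3 dir = new Vector3(lum.W, lum.Y, lum.Z);
    return dir.LengthSquared() > 1e-6f ? Vector3.Normalize(dir) : Vector3.Up;
}

// Specular-ish term: pow(saturate(dot(dominantDir, halfVector)), specularPower) * diffuseLighting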

Performance-wise, in my little XNA program, which I'd hardly call optimized, the SH lighting is about the same as deferred lighting when I store specular luminance instead of a specular lighting color in the lighting buffer. Here are some screenshots showing 388 lights (384 points, 3 spots, and 1 directional):


Note that there is at least one major optimization that could be performed when calculating the SH coefficients for a light. Currently, my SH lookup cube map is in world space, but my light vectors are calculated in view space for points and spots. This forces a matrix multiply by the inverse view matrix in all the lighting shaders. It could probably be sped up quite a bit by generating the SH lookup cubemap in view space each frame.

All in all, it is an interesting technique. I'm not very happy with the specular results at all, and the softness of the lighting could be a benefit or a drawback depending on the look you are going for. Jon also points out that the lighting calculations could easily be moved to the CPU on some platforms, since they only depend on depth information. I'm probably not going to explore the technique much further but thought I'd post what I'd found from the limited investigation I did.

Friday, December 4, 2009

A Production Irradiance Volume Implementation Described

On a previous title I worked on, the dynamic lighting system we had could best be described as "an emergency hack." We found ourselves approaching an E3 demo without a viable dynamic lighting system -- the one in the engine we were licensing required re-rendering geometry for each light. Even using a completely separate lighting rig for dynamic objects (with a much smaller number of lights), this was not practical on the hardware and ran too slow. The engine in question would eventually have a much better dynamic lighting system, but that would not come for some time, and we needed something that worked right away.

The solution was to limit the number of lights that could affect dynamic objects and render 3 lights in a single pass. The three lights were chosen based on the strength of their contribution at the object's center point, and hysteresis was used to avoid light popping. Shadows darkened the scene instead of blocking a specific light, which is an old technique, but worked "well enough."

I was never very happy with this solution, but it was good enough for us to ship with. It was too difficult for artists to light dynamic and static objects consistently due to the separate lighting rigs, and often the dynamic lighting would not match the static lighting very well. Dynamic lights did not take occlusion into account so you could often get bleeding through walls, which would require painful light placement and light channel tweaking.

After that project shipped, I very much wanted to make a better system that would solve most of the problems. I wanted consistent results between static and dynamic lighting, I wanted a single lighting rig, and I wanted a better shadowing solution.

A colleague on another title at the same studio was getting some good results with spherical harmonic-based lighting, albeit in a completely different genre. I had also recently read Natalya Tatarchuk's Irradiance Volumes for Games presentation, and I felt that this was a viable approach that would help achieve my goals.

The way it worked is artists placed arbitrary irradiance volumes in the map. An irradiance volume stores a point cloud of spherical harmonic samples describing incoming light. In the paper, they use an octree to store these samples, but I found that was not desirable since you had to subdivide in all three axes simultaneously -- thus if you needed more sampling detail in X and Z you were forced to also subdivide in Y. Our levels weren't very vertical, so those extra samples in Y were unnecessary and just took up memory.

Instead, I used a kd-tree, which allowed me to stop subdividing an axis once it had reached an artist-specified minimum resolution.

Another problem was what heuristic to use for choosing a sample set. The original paper used a GPU-based solution that rendered depth to determine if a cell contained geometry, and if so, subdivided. The idea is that places with geometry are going to have more lighting variation. The preexisting static lighting pipeline I was working in did not lend itself to a GPU-based solution, so I took a similar approach using a CPU-side geometry database to determine whether cells contained geometry. In practice, it was pretty fast.

I would subdivide in a breadth-first manner until either I hit an artist-controlled minimum sampling resolution or I hit the memory budget for that irradiance volume. This gave me a fixed memory budget for my irradiance data, and the technique would basically produce as much detail as would fit in that budget for the volume. I also rendered a preview of the sampling points the heuristic would produce, allowing artists to visualize this before actually building lighting.
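
The build loop itself was conceptually just a breadth-first queue over cells -- something like the sketch below, where KdCell and the helper functions are hypothetical stand-ins for the real pipeline code:

// Sketch: breadth-first subdivision until cells reach the artist-set minimum resolution
// or the volume's sample budget is exhausted. KdCell, ChooseSplitAxis, ContainsGeometry,
// and CountNewSamplePoints are hypothetical stand-ins.
void BuildVolume(KdCell rootCell, float minCellSize, int sampleBudget)
{
    int sampleCount = 8;                                  // the root cell's corner samples
    Queue<KdCell> open = new Queue<KdCell>();
    open.Enqueue(rootCell);
    while (open.Count > 0 && sampleCount < sampleBudget)
    {
        KdCell cell = open.Dequeue();
        int axis = ChooseSplitAxis(cell, minCellSize);    // -1 if all axes are at min resolution
        if (axis < 0 || !ContainsGeometry(cell))
            continue;                                     // leave this cell as a leaf
        KdCell left, right;
        cell.Split(axis, out left, out right);
        sampleCount += CountNewSamplePoints(left, right); // new corner samples introduced by the split
        open.Enqueue(left);
        open.Enqueue(right);
    }
}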

Once I had a set of points, I sent it off to Beast to calculate both direct and indirect lighting at each sample point. Once I had the initial SH dataset, I performed some postprocessing.

The first step was to window the lighting samples to reduce ringing artifacts (see Peter-Pike Sloan's Stupid Spherical Harmonic Tricks). The amount of windowing was exposed to artists as a "smoothing" parameter. I set up the toolchain so that in the editor I stored both the original Beast-produced SH samples (which took a minute or so to generate) and the postprocessed values. This allowed the artists to change the various postprocessing variables without recomputing the lighting, allowing for faster iteration.

The next step was to remove redundant lighting samples. Within the KD-tree, the lighting samples are arranged as a series of 3D boxes -- finding the lighting at any arbitrary point within a box is done via trilinear interpolation. Because of the hierarchical nature of the KD-tree, each level splits its box in two along one of the three axes. I would compare the value at a "leaf" box point with the interpolated value from the parent box -- if the difference between these two SH coefficient sets was within a certain threshold, I would remove the leaf sample. After this process is done, we are only storing lighting samples for areas that actually have varying lighting.
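
The comparison amounted to a per-coefficient tolerance check, conceptually something like this (illustrative names, one flat coefficient array per sample):

// Sketch: a leaf sample is redundant if trilinearly interpolating the parent box's corner
// samples reproduces it to within a tolerance.
static bool IsRedundant(float[] leafCoefficients, float[] interpolatedFromParent, float tolerance)
{
    for (int i = 0; i < leafCoefficients.Length; i++)
    {
        if (Math.Abs(leafCoefficients[i] - interpolatedFromParent[i]) > tolerance)
            return false;
    }
    return true;
}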

Samples were referenced by index into a sample array at each node of the KD-tree, which allowed me to further combine samples that were nearly identical. Finally, I encoded the sample coefficients as FP16s, to further save on memory. I was later going to revisit this encoding, as it had some decoding expense at runtime, and there probably were cheaper, better options out there.

At runtime, each dynamically lit object would keep track of what irradiance volume it was in when it moved. Transitions between volumes were handled by having the artists make the volumes overlap when placing them -- since the sample data would essentially be the same in the areas of overlap, when you transitioned there would be no pop.

A dynamically lit object would not just sample one point for lighting, but several. I would take the object's bounding box, shrink it by a fixed percentage, and sample the centers of each face. I would also sample the center point. Dynamic lights would be added into the SH coefficient set analytically. I then extracted a dominant directional light from the SH set, and constructed a linear (4 coefficient) SH gradient + center sample. Rendering a directional light + a linear SH set achieves results similar to rendering a full 9 coefficient set, and is much faster on the GPU. Bungie used this same trick on Halo 3.
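
Concretely, the gradient is just a central difference of the face samples across the shrunken bounding box -- something like the sketch below, with illustrative names and one flat coefficient array per sample:

// Sketch: build a linear gradient of the SH coefficients from samples taken at the centers
// of the shrunken bounding box faces (posX = +X face, negX = -X face, etc.).
static float[][] BuildShGradient(float[] posX, float[] negX,
                                 float[] posY, float[] negY,
                                 float[] posZ, float[] negZ, Vector3 boxSize)
{
    int n = posX.Length;
    float[][] gradient = { new float[n], new float[n], new float[n] };
    for (int i = 0; i < n; i++)
    {
        gradient[0][i] = (posX[i] - negX[i]) / boxSize.X;   // d(SH)/dx
        gradient[1][i] = (posY[i] - negY[i]) / boxSize.Y;   // d(SH)/dy
        gradient[2][i] = (posZ[i] - negZ[i]) / boxSize.Z;   // d(SH)/dz
    }
    return gradient;
}

At render time the per-pixel SH is then the center sample plus the gradient dotted with the offset from the object's center, which is where the first-order variation across the object comes from.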

The gradient allowed me to get a first order approximation of changing lighting across the dynamic object, which was a big improvement in the quality of the lighting and really helped make the dynamic lighting consistent with the static lighting. Evaluating a 4 SH gradient + directional light was about the same cost as if I'd evaluated a full 9 coefficient SH on the GPU, but produced much higher quality.

The SH set for a dynamic object was constructed on another thread, and only happened if the object moved or its set of dynamic lights changed. This allowed us to support rendering a large number of dynamic objects.

Sometimes the kd-tree subdivision heuristic would not generate high enough sampling detail for a specific area -- for these cases I allowed the artists to place "irradiance detail volumes", which let them override the sampling parameters for a specific area of the irradiance volume -- either forcing more detail or using a smaller minimum sample resolution.

Finally, for shadows, in outdoor areas we used a cascaded shadow map solution for the sun, and for interior areas, supported spotlights that cast shadows. The artists had to be careful placing these spotlights as we could not support a large number of shadow casting lights simultaneously. At the time we were rendering these lights as a separate geometry pass, but I had plans to support one shadow casting light + the SH lighting in a single pass.

The end result was that for anything car-sized or smaller, lit by statically placed lights using the same rig that produced the lightmaps, you would have a very difficult time telling which objects were static and which were dynamic. One interesting side effect that was technically a "bug" but actually helped produce good results was that samples underneath the floor would almost always be black, since no light reached them. When constructing the gradient, these samples would usually be used for the bottom face of the bounding box. In practice, though, this just made the object gradually get a little darker toward the floor -- which was not unpleasant, helped ground the object in the scene, and acted as a kind of fake AO. In ShaderX 7, the article about Crackdown's lighting describes a similar AO hack, although theirs was intentional. But we decided to keep the happy accident.

The biggest issue with the system was it didn't deal very well with very large dynamic objects, since a single gradient is not enough if your object spans tens of irradiance volume cells. For that game this wasn't a huge problem, but it might be for other games. Additionally, it still didn't solve the problem of things like muzzle flashes requiring multiple passes of geometry for statically lit items, and at the time I was starting to look to deferred lighting approaches to use for transient, high-frequency dynamic lights in general.

The artists were very happy with the lighting, particularly on characters, and we were producing good results. But at about this time, the plug on the project was pulled and I was shifted off to other duties, and eventually the company would go bankrupt and I would move on to 2K Boston. But I felt that lighting approach was viable in a production environment, and I've since seen other games making presentations on various irradiance volume systems.

Saturday, October 17, 2009

Where is the game architecture research?

I was reading this paper on EA's internal STL implementation, and it got me thinking -- where is the game architecture research?

There is a large amount of academic research poured into real-time graphics, experimental gameplay and entertainment, AI, and even MMO server design. But I find there are a number of architecture issues unique to games that are lacking in any sort of research. I've done searches and not come up with a whole lot, maybe I'm just not using the correct keywords.

Memory is not a solved problem for game consoles
Most if not all garbage collection research is focused on desktop or server memory usage patterns, and assumes virtual memory paging. Many GC algorithms are impractical in a fixed-memory environment where utilization needs to be close to 100%. While some game engines use garbage collection, the algorithms are primitive compared to the state-of-the-art generational collectors found in desktop environments, and the waste of memory resources is often 10-20% of total memory. Games generally cannot afford large mark or sweep phases, as they must execute at a smooth frame rate. Fragmentation can still be an issue in a fixed-memory environment, although many allocator strategies exist to combat it.

Multicore architectures for games 
While this is still an active area of research for desktop and server applications, too, I've found exactly one paper that attempts some research in this area for game architectures. This is a particularly fruitful area for research since there are many competing ideas out there (message passing! software transactional memory! functional programming!), but very few researchers testing any of them in the context of building a game. It is difficult enough to make a game by itself, let alone test multiple techniques for exploiting multiple cores. I find this somewhat interesting because aside from servers and scientific processing, games are pushing the state of the art in multicore programming more than anything else.

Automated testing
This is something the EA STL paper brings up -- traditional automated testing techniques break down pretty quickly beyond unit testing lower level libraries. So much of the end result of game code is subjective and emergent that determining how to test even basic functionality automatically is a huge unsolved problem. This results in a large amount of manpower being used for game testing, particularly in the area of regression testing.

This research is being done as a part of production by many companies inside the industry. But it is always going to be the case that in a production environment, you just aren't going to have the time and resources to, say, try three different approaches to multicore architecture and compare them. Generally you make an educated guess and hope for the best. Additionally, because much of this research is done as part of product development, rarely are the results published, which means we're all out there doing the same work over and over.

Sunday, October 4, 2009

An Ode to the GPU. Hopefully not an Epitaph.

The last entry got me thinking about one area of game programming that has gotten unequivocally better over the last ten or fifteen years: graphics programming. From the advent of the GPU to programmable pipelines to the debugging and profiling tools available, things are for the most part way easier today than they were even five years ago.

I am not a graphics programmer. I'm a generalist who often finds himself programming graphics. So there are certainly gaps in the last ten or fifteen years where I wasn't really writing anything significant in graphics. There's a large gap between fixed-function GPUs and when HLSL was introduced -- I don't think I've ever done assembly-level pixel shaders, for example.

While I do remember doing a lot of OpenGL in the early days of fixed-function, I didn't do much multipass rendering on fixed function hardware, where companies like Id essentially faked a programmable pixel pipeline with texture and blend ops. Frankly, I thought during that era it was more about fighting the hardware than interesting techniques -- the amount of bs you had to put up with made the area unattractive to me at the time.

Languages like HLSL and Cg piqued my interest in graphics again, and when you think about it, are a pretty impressive feat. They allow a programmer to harness massively parallel hardware without having to think about the parallelism much at all, and the last few years have been more about interesting algorithms and more efficient operations than about fighting hardware capabilities. Sure, you still run up against the remaining fixed function parts of the pipeline (namely, blending and texture filtering), but those can be worked around.

The tools have improved year over year. On the PC, things like PerfHUD have slowly gotten better, with more tools like it being made all the time. The gold standard still remains PIX on the 360 -- so much so that many programmers I know will do an implementation of a new graphics technique first on the 360 just because it is so easy to debug when things go wrong.

So let me just praise the GPU engineers, tools makers, and language and API designers who have done such a good job of taking a hard problem and making it constantly easier to deal with. I think it is rare to get such productivity gains for programmers in any area, and we shouldn't take it for granted when it happens.

This is also why the dawn of fully programmable graphics hardware makes me nervous. Nvidia recently announced the Fermi architecture, which will allow the use of C++ on the GPU. Nvidia, AMD/ATI, and Intel are all converging on GPU architectures that allow more and more general computing, but is C++ really the answer here?

HLSL and its ilk make concurrent programming easy. The same can not be said for C++. While an architecture where the underlying threading architecture of a GPU is more open certainly will allow for a wider array of approaches, what is the cost? Are we blinded so much by the possibilities that we forget that the DirectX/OpenGL model is one of the few successes of hiding concurrency for programmers?

I have not really done much with CUDA or compute shaders, so perhaps I am being hasty in judgement. But when I see Intel or Nvidia touting that you can use C++ on their GPUs, I get a little worried. I am not sure that this will make things better, and in fact, may make things very much worse.

Am I just paranoid?

Saturday, October 3, 2009

I'm Afraid the Grass is not Greener

I started reading Coders At Work, and wow, it's rare that you run across a book about programming that's a page-turner, but this is it. I'm not very far into it, but a quote from Brad Fitzpatrick (LiveJournal, memcached, PerlBal) caught my attention. The context is he is bemoaning how it seems like computers are worse than they were ten years ago, that they feel slower even though under the hood they are faster, etc. Then this question and answer comes up:

Seibel: So maybe things are not as fast as they ought to be given the speed of computers. But ten years ago there was no way to do what people, as users, can do today with Google.

Fitzpatrick: Yeah. So some people are writing efficient code and making use of it. I don't play any games, but occasionally I'll see someone playing something and I'm like, "Holy shit, that's possible?" It just blows me away. Obviously, some people are doing it right.

We are? The funny thing is, I'm not sure a lot of game programmers would feel that we are doing things right. We work with imperfect middleware and engines, with hacks upon hacks piled upon them, all until the game we are working on is no longer broken and actually fun to play. We have code in our games we would describe as "shit" or "crap" or "I can't believe we shipped with that." When I was just starting out, I thought maybe it was just the games I was working on that had this problem, but any time you talk to anyone at other companies, it is the same story -- from the most successful games to the smallest ones, we can all list a huge litany of problems in the code bases in which we work or have written.

It's interesting reading this book because at least the first two programmers I've read are in totally different worlds than game development. Not better or worse, just different. The problems and constraints they have are somewhat alien to me.

I've often heard game developers say things like "game development is five to ten years behind the state of the art in 'straight' programming", referring to process or my least favorite term, "software engineering." I may have even said it myself.

The game industry often does a lot of navel gazing (like this entry!). We are constantly comparing ourselves to movies, theme parks, or how the rest of programmers work. Maybe we've got it all wrong. Maybe all along we've been figuring out how programming for games needs to work. If the world that Brad Fitzpatrick lives in feels alien to me and vice versa, then why would we ever think that the processes or techniques that work in one are automatically going to work for the other?

Food for thought.

Wednesday, September 23, 2009

Safety

I recently came across two articles that tangentially talk about the same thing -- technologies that are safe. Safe as in usable and not likely to get yourself in trouble.

The first was 30 years of C over at DadHacker. The second is a Joel on Software article (nice to see him actually writing about technology instead of pimping FogBugz or whatever he's selling these days) called The Duct Tape Programmer.

Anyway, I thought I'd write some of my opinions of the language features mentioned in these two articles. For those of you who've known me a while, it just may surprise you where my thoughts have evolved over the years.

Let's cover the C++ features:

Exceptions - While I have no problems with exceptions in a language like Java or C#, in C++ they just don't work well. In games we turn them off for code size and performance reasons, but I would tend to avoid them in C++ even if there was zero hit in either area. It is just too difficult to write exception-safe code in C++. You have to do extra work to do it, and the things that can break are sometimes very subtle. Most importantly, the entire culture and ecosystem built around the language is not exception-friendly. Rare is the library that is exception-safe in my experience. So just say no.

RTTI - Not very useful in practice. Again, there are overhead concerns in games, although most games I've seen end up rolling their own. But the base implementation is rather inflexible -- it is reflection of only the most basic sort, and often in the places you do need run-time type information, you need a lot more than just class ids. It's a feature with its heart in the right place but it just doesn't come together very well. I think part of the problem is its all-or-nothing nature -- usually only portions of my architecture need any sort of reflection, and I don't want to pay for it on all the other classes.

Operator Overloading - Rarely useful outside of math libraries. I'm not even a huge fan of the iostreams model, to tell the truth.

Multiple inheritance - Only with pure virtual interfaces, and even then it should be used rarely and avoided if possible. Sharing implementation via inheritance goes awry often enough in single inheritance; adding more base class chains just makes the problem worse.

Templates - The big one. I'll admit to having a love affair with templates in my twenties. What can I say? They were fun and a shiny new toy. I sure had some excesses, but even my worst one (a cross-platform file system library) shipped in multiple products. Even then I hid them all behind a straight-C API, so only programmers who had to either debug or extend the library innards had to deal with the templates. If I had to do it again, I'd probably do it differently, but I could say that about any code I've written in my career, whatever the language. I do know that it was an improvement over the previous file system library that was in use, because the new one actually worked.

I can say with a degree of certainty that template metaprogramming is a bust for practical use. There are a few major problems with it: the language isn't really built for it (it's more a clever side effect than anything), there is no good way to debug it, and functional programming isn't very ingrained in the game development culture. Ironically, I think the last part is going to have to change as parallel programming creeps into larger and larger sections of the architecture, but that won't make template metaprogramming practical.

In any case, these days templates are just a tool in the toolbox, and not one I reach for that often. The code bases I've been working in recently all roll their own template container libraries* (provided for us by external vendors), and they do the job. My experience with code sharing via templates is that more often than not it isn't worth the trouble, but sometimes it is. Like anything we do, it is a tradeoff, and not one I feel particularly passionate about either way.

*A somewhat amusing side note: I've done performance and code generation tests with one of the hand-rolled template container libraries I've encountered versus STL. STL came out on top for a lot of simple things like loop iteration overhead or sorting, on all the platforms I was interested in. Of course, I'm not about to rewrite hundreds of thousands of lines of code to use STL, and STL still is horrible for memory management. But I suppose that underscores the point "30 years of C" made -- even something as simple as a container library is hard to get right, even for experts. Which library I'm talking about shall remain anonymous for its own protection.

The Other Cost of Code Bloat

The other day I almost wrote a redundant version of the exact same class that someone else on my project had already written. In fact, if I hadn't asked this person a couple of general C# questions, and he hadn't put two and two together, I probably would have written that redundant class. Good detective work on his part, and shame on me for not searching the code base to see if someone else had already tackled this problem. While I've got a pretty good feel for the C++ which makes up the majority of code in our engine/tools, I haven't looked at the C# side as much as I probably should have.

As the code bases we write get larger and larger, and the team sizes we deal with get larger and larger, the question of how to avoid this scenario becomes an important one. Ideally you hire programmers who perform the necessary code archeology to get a feel for where things are in the code base, or who will ask questions of people more familiar with the code when unsure. Getting a code base of a million or more lines "in your head" takes time, though. I've been working with our licensed engine for about four years now, and there are still nooks and crannies that are unfamiliar to me.

Better documentation should help, but in practice it is rarely read, if it even exists. Usually such documentation is either nonexistent or horribly out of date. With a licensed engine, you are at the mercy of the little documentation you are provided, and at the end of the day, the code itself is the best documentation.

A sensible architecture with clear delineation of what should go where is often a bigger help. Knowing [where to look] is half the battle, said a Saturday morning cartoon show. Again, with a licensed engine, you are at the mercy of what you are provided. Finding existing functionality usually comes down to experience with the code base and code archeology skills.

Recently, Adrian Stone has been writing an excellent series on minimizing code bloat. While the techniques he describes are about eliminating redundant generated and compiled code rather than actual source code, the mindset is the same when you are removing actual lines of code. Aside from the important compile time, link time, and executable size benefits, there is another benefit to removing as much code as you possibly can -- the code will occupy less "head space."

Unused or dead code makes it that much harder to do code archeology. Dead code certainly can make it more difficult to make lower level changes to the engine or architecture, as it is one more support burden and implementation difficulty. In the past, removing large legacy systems (whether written internally or externally) has had unexpected benefits in simplifying the overall architecture -- often there are lower level features that only exist to support that one dormant system.

One of my favorite things to do is delete a lot of code without the tools or game losing any functionality. It's not only cleaning out the actual lines of code, but the corresponding head space, that is a wonderful feeling -- "I will never have to think about X again." With the scale of the code bases we deal with today, we don't have the brain power to spare on things we don't need.

Monday, September 21, 2009

Rogue Programming

Gamasutra had an interesting article today titled Gaming the System: How to Really Get Ahead in the Game Industry. I found it probably had more to say about the political dysfunction that can often accompany game development rather than a how-to on being successful. To put it another way: if you find yourself having to follow the sneakier guidelines in this article too much, then you might want to consider a change in where you work.

The programming section is titled "Just Do It" and does have some truth to it. One of my leads and I came up with the term "rogue programming" for what he describes, which was half-joke, half-serious. Here's a quote:

As a programmer, it's not uncommon to see problems that you think should be fixed, or to see an opportunity to improve some piece of code, or speed up a process that takes a lot of time. It's also not uncommon for your suggestion to be ignored, or dismissed with an "it's not broke, so let's not fix it" response...

What should you do? You should just do it -- on your own time.

This is advice which is fraught with a lot of risk, because here's a hard-earned lesson for you: you don't always know best. I know, I know, you're a superstar hotshot programmer, and you see something that is broken, so it must be fixed. Sure, it is not in the schedule, but it'll just take a few hours, what's the harm? The code base will be so much better when you're done, or the artists and designers will have a feature they didn't have before. How can that not make the project better?

Let me give a cold-water splash of reality: when it is all said and done at the end of the project, you're going to ship with a lot of broken code. I'm not talking about obvious bugs in the shipped project, I just mean nasty, hack-filled, just-get-it-out-the-door brokenness in the code base, and some of that code will be code that you wrote. If this wasn't true, then a long-lived project like the Linux kernel wouldn't still have thousands of developers contributing to it -- obviously, there is still stuff that is "broken" and can be improved!

So in the big picture, a single section of brokenness is not going to make or break your project, and usually, there are bigger fish to fry on any given day, and it's best to fry them. Because if your project is cancelled because a major feature was late, will it matter that you cleaned up the way you calculated checksums for identifiers?

That said, if after all of this, you still think something is worth doing, let me tell you how to successfully rogue program:

First, and most importantly, let's define "on your own time." On your own time means you are hitting your scheduled work on schedule, and that will not change if you fix or implement this one thing. If you're behind on your scheduled work, then you really shouldn't be doing any rogue programming. Whether not impacting your schedule means you work a Saturday, do some exploration at home, or just have some slack in your schedule you'd like to exploit, if you don't complete the tasks you are supposed to be working on, you've done more damage than whatever improvement you're making on the side can offset.

Additionally, you need co-conspirators. These days, programming is a very collaborative process, and for the most part the cowboy mentality is a dying thing. If you talk to your lead or other engineers about a problem ("Hey, X is pretty f'd up") and no one else agrees, or you can't make the case, then hey, maybe X really isn't that important! You really want to be working with a group of people you can convince with sound arguments that something is a problem, and a lot of the time a little discussion about a problem can turn into a scheduled task -- and no rogue programming.

Often you'll be faced with something that everybody agrees *should* be done, but there's no time to do it. In these cases, I've found with good leads (which I've been blessed with the last few years), you can get tacit approval to do something "on your own time." This often takes trust -- I wouldn't recommend going off and "just doing it" until you've earned that trust.

If you've gotten to this point, you're in pretty good shape to go off and do some "rogue programming" -- because at this point (and this is where the joke comes in), it really isn't rogue at all.

Now if you're at a company where you constantly feel like you need to "go around people" to "just do things," then maybe you really do need a change of venue, because that is not a healthy team. I happen to know someone who is hiring.

Thursday, August 27, 2009

Big O doesn't always matter

The other day I was optimizing a bit of debug code which verifies the integrity of objects in memory. The details aren't super-important, but the gist is it is a function which runs periodically and makes sure that objects that should have been garbage collected were indeed purged from memory.

I don't make a general habit of optimizing debug code, but this was a special case -- before, this process only ran in the debug build. Artists and designers run a "development" build, which is an optimized build that still includes assertions and many other development checks.

We recently ran into a bug that would have been detected much earlier if this process had been running in the development build. While programmers run the debug build almost exclusively, we tend to stick to simpler test levels. Trying to debug an issue on a quick loading, small level is much easier than on a full-blown one.

The algorithm is pretty simple -- objects have somewhat of a tree structure, but for various reasons they only have parent links and not child links. For objects at the top-level of the tree, we know for sure whether they should be in memory or not. Objects at the bottom of the tree keep all their parents in memory if they are connected to the reference graph. So the debug check looks at each object and verifies that it is not parented (via an arbitrarily long parent chain) to an object which should have been purged.
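
In sketch form, the check is just a walk up each object's parent chain (the types and names here are hypothetical -- the real code is engine-specific C++, but the algorithm is the same):

// Sketch: verify that no object is kept alive by (i.e. parented somewhere under) an object
// that should have been purged. GameObject and its Parent link are hypothetical stand-ins.
static void VerifyPurgedObjects(IEnumerable<GameObject> allObjects,
                                HashSet<GameObject> shouldBePurged)
{
    foreach (GameObject obj in allObjects)
    {
        for (GameObject parent = obj.Parent; parent != null; parent = parent.Parent)
        {
            Debug.Assert(!shouldBePurged.Contains(parent),
                "A purged object is still reachable through a parent chain");
        }
    }
}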

First thing I did was measure how long the process was taking, and did some lower level profiling to get an idea of where time was spent. Most importantly, I also saw where I was running into cache misses.

The first pass of optimization: the original loop was doing a lot of work per-object that was simply unnecessary. This was because it was using a generalized iterator that had more functionality than needed for this specific case -- for most operations, particularly at editor time, this extra overhead is not a big deal. Removing this extra work sped up the process, and it now took about 90% of the original time.

I then tried some high-level optimizations. There were two things I tried. First, the inner loop linearly checked each high-level object against an unsorted array of objects we know should be purged; I replaced this with a hash table from our container library. Second, I realized that a memoizing approach should help here -- since I'm dealing with a tree, I could use a bit array to remember whether I had already processed a parent object and deemed it OK. This would let me cut off traversal of the parent chain, which should eliminate a lot of work. Or so I thought.

The new algorithm was faster, but not by much -- it ran in about 85% of the original time. The additional complexity was not worth 5% of the running time, so I went back to the simpler approach. This isn't unusual in optimization -- you can often try something you think will be a big help that turns out not to matter much. I've made mistakes in the past where I stuck with a more complicated implementation for a marginal gain -- it was not worth it, and it made other optimizations that might have had a bigger impact harder to do.

As far as why the gain wasn't that much: the unsorted array was relatively small (a handful of elements), so a linear search was faster because it was simpler and had better cache behavior than the hash table implementation I was using. The tree structure of the objects was broad but not deep, so it's obvious in hindsight why memoization was not a win.

Now, one thing that is nice to have is a decent container and algorithm library. I have that at my disposal, so implementing these two changes was a matter of minutes instead of hours. With that kind of arsenal, it is easy to try out algorithmic changes, even if they end up not working out.

At this point, I took another look at the cache behavior from my profiling tools. I tried something incredibly simple -- prefetching the next object into the cache while I was processing the current. This resulted in the process now running at 50% of the time of the original -- a 2X speedup, and likely fast enough for me to enable this in the development build. I'm going to measure again, and see if there are any other easy wins like this to be had.

The processors we use are fast -- incredibly fast, and even with branch penalties on the in-order processors of today's consoles, they can still do a lot of work in the time it takes to retrieve data from main memory. So while on paper, I'm using "slow" algorithms with worse O(n) times, in practice, your memory access patterns can easily drown out any extra calculation. The key, as always, is to measure and test your theories, and not just assume that any given approach will make something faster.

Monday, August 24, 2009

Stencil states for rendering light volumes

In the ShaderX 7 article "Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer", the author describes a number of approaches for rendering the lights into the lighting buffer. These are all pretty standard approaches for any deferred technique, but I thought the description of using stencil did not explain very clearly how to set up the stencil states. This was probably due to space constraints.

The way it is worded implies that you still need to change the depth comparison function. This is not the case -- and avoiding that is most of the point of the technique. As the article points out, changing the depth test makes many GPUs take their early-Z rejection and go home.

I'm sure you can find this detail elsewhere on the net, but my cursory searches did not find anything, and hopefully this will save at least one person some time. Standard caveats apply: I haven't extensively tested this stuff.

Assuming convex light volumes, this is what I found worked well:


// render backfaces so that only pixels in front of the backface have their stencil incremented
AlphaBlendEnable = false
StencilEnable = true
ColorWriteChannels = None
CullMode = Clockwise
DepthBufferEnable = true
StencilFunction = Always
StencilPass = Keep
// If a pixel is in front of the volume backface, then we want it lit
StencilDepthBufferFail = Increment

// render volume

// render frontfaces so that any pixels in back of them have their stencil decremented
CullMode = CounterClockwise
// pass stencil test if reference value < buffer, so we only process pixels marked above.
// Reference value is 0. This is not strictly necessary but an optimization
StencilFunction = Less
// If a pixel is in front of the volume frontface, then it is not inside the volume
StencilDepthBufferFail = Decrement

// render volume

AlphaBlendEnable = true
ColorWriteChannels = RGB
// only process pixels with 0 < buffer
StencilFunction = Less
// zero out pixels so we don't need a separate clear for the next volume
StencilPass = Zero
// don't want to do anything if we fail the depth test
StencilDepthBufferFail = Keep

// render a screen space rectangle scissored to the projection of the light volume


Note that unlike shadow volumes, the light volume intersecting the near plane is not a concern here. We are rendering the frontfaces to find pixels that are in front of the light volume -- if parts of the light volume are in front of the near plane, by definition any pixels we're rendering are in back of those parts. So there is no need to render a cap in this case.

The light volume intersecting the far plane is a concern. One way to handle this case is to use a projection matrix with an infinite far plane, like shadow volumes do. Another way to handle it would be to detect this case and not use the stencil approach at all, instead rendering a screen space rectangle scissored to the light bounds.

Finally, I've had better luck switching to rendering the backfaces without depth testing when the camera is inside the light volume, instead of using a screen space rectangle. But I think this has more to do with a bug in my scissoring code than with any fundamental problem!

Sunday, August 23, 2009

XNA: at times awesome, at times frustrating

I'm not sure why, but Microsoft seems intent on crippling XNA for the 360. Perhaps they want to sell more dev kits.

I recently had some more time to work on my little toy project. After some work, I've now got a deferred lighting implementation on the PC.

For the lighting buffer construction, at first I was using a tiled approach similar to Uncharted, which did not require blending during the lighting stage. It did work for the most part, and allowed me to use LogLuv for encoding the lighting information, which was faster. But it had issues -- I didn't have any lighting target ping-ponging set up, so I was stuck with a fixed limit of seven lights per tile. Also, even with smallish tiles, you end up doing a lot of work on pixels not actually affected by the lights in question. So I wanted to compare it to a straightforward blending approach, switched back to an FP16 target, and rendered the light volumes directly (using the stencil approach detailed in ShaderX7's Light Pre-Pass article).

So this all worked great and my little toy is rendering 100 lights. Of course, on the 360, there's a problem. Microsoft, in its infinite wisdom, decided that the FP10 buffer format on the 360 would blow people's minds, so it is not supported in XNA. Instead, XNA uses an actual FP16 target, which does not support blending.

So I guess it is going to be back to alternate lighting buffer encoding schemes, bucketing, and render target ping-ponging for me. It's not a huge deal, but it is frustrating.

It is a real shame that XNA gives the impression that the 360 GPU is crippled, when in reality it is anything but. Couple the lack of FP10 support with the inability to sample the z-buffer directly and the lack of control over XNA's use of EDRAM, and they've managed to turn the 360 into a very weak, very old PC.

Least common denominator approaches generally haven't fared that well over the years. An XBLA title implemented in XNA is going to be at a fundamental disadvantage -- I don't think you are going to see anything approaching the richness of Shadow Complex, for example.

At the end of the day, Microsoft needs to figure out where they are going with XNA. If they are going to dumb it down and keep it as a toy for people who can't afford a real development kit (people who've been bumping into these low ceilings much longer than me), then they should keep on their current path.

The potential for XNA is really much more, though. Today I wrote a pretty decent menu system in about 45 minutes, that handles gamepad, keyboard, and mouse input seamlessly. I don't think I could write that in C++/DirectX anywhere near as fast. If you start looking down the road to future generations of hardware, I'm not worried about the overhead of C# being fundamentally limiting. Games today already use much less efficient scripting languages than C#, and while you are limited to the heavy lifting Microsoft has chosen to implement for you today, who is to say that a future version of XNA couldn't allow shelling out to C++ for really performance intensive stuff?

XNA has a chance to become something really great that would be very powerful for a large class of games. It remains to be seen if Microsoft will let it.

Wednesday, August 19, 2009

One has to have had inflated expectations to experience disillusionment

A colleague sent along this item, which asks if Transactional Memory is beyond the "trough of disillusionment".

I've never had any expectations that STM would be some silver-bullet solution to concurrency, and from the get-go just viewed it as just another tool in the toolbox. Granted, it is a technique that I haven't had much practical experience with yet -- it's on my TODO list. Others might disagree with me, but I'm not even sure how much of a major factor it is going to be in writing games. Of course, if some major piece of middleware is built around it, I suppose a lot of people will end up using STM, but that doesn't necessarily make it a good idea.

The latest piece of evidence against STM as a silver bullet comes from conversations I've had with colleagues and friends who have a lot of experience building highly-scalable web or network servers. STM advocates hail transactions as a technique with decades of research, implementation, and use. About this they are correct. The programming model is stable, and the problems are well known. But what has struck me is how often my colleagues with much more experience in highly-scalable network servers try to avoid traditional transactional databases. If data can be stored outside of a database reliably, they do so. There are large swaths of open source software devoted to avoiding transactions with the database. The main thrust is to keep each layer independent and simple, and talk to a database as little as possible. The reasons? Scalability and cost. Transactional databases are costly to operate and very costly to scale to high load.

I found the link above a little too dismissive of the costs of STM, particularly with memory bandwidth. I've already discussed the memory wall before, but I see this as a serious problem down the road. We're already in a situation where memory access is a much more serious cost to performance than the actual computation we're doing, and that's with a small number of cores. I don't see this situation improving when we have 16 or more general-purpose cores.

A digression about GPUs. GPUs are often brought up as a counter-argument to the memory wall as they already have a very large number of cores. GPUs also have a very specialized memory access pattern that allow for this kind of scalability - for any given operation (i.e. draw call), they generally have a huge amount of read-only data and a relatively small amount of data they write to compared to the read set. Those two data areas are not the same within a draw call. With no contention between reads and writes, they avoid the memory issues that a more general purpose processor would have.

STM does not follow this memory access model, and I do not dismiss the concerns of having to do multiple reads and writes for a transaction. Again, we are today in a situation where just a single read or write is already hideously slow. If your memory access patterns are already bad, spreading it out over more cores and doubling or tripling the memory bandwidth isn't really going to help. Unlike people building scalable servers, we can't just spend some money on hardware -- we've got a fixed platform and have to use it the best we can.

I don't think that STM should be ignored -- some problems are simpler to express with transactions than with alternatives (functional programming, stream processing, message passing, traditional locks). But I wouldn't design a game architecture around the idea that all game code will use STM for all of its concurrency problems. To be fair, Sweeney isn't proposing that either, as he proposes a layered design that uses multiple techniques for different types of calculations.

What I worry about, though, is that games are often written in a top-down fashion, with the needs at the gameplay level dictating the system support required. If at that high level the only tool being offered is STM, with the expectation that it is always appropriate, it will be easy to find yourself in a situation where refactoring that code to use other methods for performance or fragility reasons is far more difficult and expensive than if the problem had been tackled with a more general toolbox in the first place.

Concurrency is hard, and day to day I'm still dealing with the problems of the now, rather than four or five years down the road. So I will admit I have no fully thought out alternative to offer.

The one thing I think we underestimate is the ability of programmers to grow and tackle new challenges. The problems we deal with today are much harder and much more complex than those of just a decade ago. Yes, the tools for dealing with those problems are better, but the current set of tools for dealing with concurrency is weak.

That means we need to write better tools -- and more importantly, a better toolbox. Writing a lock-free single-writer/single-reader queue is much harder than using one. What I want is a bigger toolbox that includes a wide array of solutions for tackling concurrency (including STM), not a fruitless search for a silver bullet that I don't think exists, and not a rigid definition of which tools are appropriate for which types of game problems.

Thursday, August 13, 2009

Diminishing returns

One thing that has been on my mind since SIGGRAPH is the problem diminishing returns poses: when do you switch from an approach, algorithm, or model because any gains to be had are increasingly diminishing?

The specific thing that has got me thinking about this is the rapid approach of fully programmable GPUs. So far this is not looking like it will be another evolutionary change to the venerable D3D/OpenGL programming model, and will in fact be a radical change in the way we program graphics. Which is just another way for saying it will be a change in the way we *think* about graphics.

At SIGGRAPH there was a panel of various industry and academic luminaries discussing the ramifications -- is the OpenGL/D3D model dead? (not yet), what will be the model that replaces it? (no one knows), is this an interesting time to be a graphics programmer? (yes). A colleague pointed out that the members of the panel lacked a key constituency -- a representative from a game studio that's just trying to make a game without a huge graphics programming team. The old model is on its last legs, the new world is so open that to call it a "model" would be an insult to programming models. If you're an academic or an engine maker, this doesn't present a problem, in fact, it is a huge opportunity -- back to the old-school, software renderer days. Anything's possible!

But for your average game developer, it could mean you are one poor middleware choice away from disaster. You don't have the resources of the engine creators, so being ripped away from the warm embrace of the familiar D3D/OpenGL model can be a little terrifying. To put it another way: the beauty of a model like D3D/OpenGL is that no matter what engine or middleware you use, when it comes to the renderer there is a lot of commonality. In this new world, there will be a bunch of competing models and approaches -- that's part of the point. Engine creators will have a bevy of approaches to choose from, but if you're just trying to get a game done, and you find your engine's choice doesn't match what you need to do, well, you've suddenly got a lot of work on your hands.

But we face these choices in software development all the time: when to abandon an algorithm or model because of diminishing returns. Change too soon and you've done a lot of extra work you could have avoided by just refining the existing code. Change too late and you miss opportunities that could have differentiated your offerings. We like to pretend that doing a cost/benefit analysis on this kind of thing is easy, as if we were comparing a Volvo against a Toyota at the car dealer. But often the issues are quite complex, and the fallout quite unexpected.

It's a cliché, but we live in interesting times.

Sunday, August 9, 2009

Fresh from SIGGRAPH

You don't know what heat is until you spend a week in New Orleans in August.

Here's a quick list of my favorites:

Favorite course:

Advances in Real-Time Rendering.

Favorite technical paper:

Inferred Lighting

Favorite somewhat technical talk:

Immersive and Impressive: The Impressionistic Look of Flower on the PS3

Favorite non-technical talk:

Making Pixar's "Partly Cloudy": A Director's Vision

Wednesday, July 8, 2009

Things to do when writing a tutorial

Something I've learned over the years is that if you are writing a tutorial for a tool you've written, it pays dividends to actually perform the steps of the tutorial yourself as you write it.

This seems obvious, but while developing a tool you often end up building a set of test data as you go, and features or changes you add later can break functionality that you only exercised early on. It can be tempting to just plow through writing the tutorial, since you know how all the features work -- why bother actually doing them?

If you actually perform the operations without having any existing data, you can uncover a lot of bugs, or features that don't work particularly well. Lately, I've gone even further: Take your fully developed test data, tear it down, and then build it back up again. This tests both the creation code paths and the destruction code paths.

If nothing else, doing the above saves the embarrassment of releasing a tool to your artists and designers and having it crash the first time they try the most basic of things, because you haven't exercised that code path in a week or two. One last regression test is worth the extra effort.

Sunday, June 21, 2009

Leaky abstractions in XNA

So continuing my exploration of XNA, this weekend I did some more work on my little toy project.

The first thing I did was get it running on the 360. I was happy to see that XNA was able to figure out how to deal with my various render targets, including one MRT, without too much trouble, and performance was far superior on the 360 compared to my laptop: I get about 200 fps on the 360 vs. 60 on the laptop.

There was one issue worth noting.

First, some background on the deferred lighting approach I am using:

  1. Render normal + depth into a G buffer for all primitives. Depth writes and tests are enabled in this step.
  2. Render the lights into a lighting buffer using the G buffer. Depth writes and tests are disabled for this pass.
  3. Apply the lighting to each primitive using the lighting from step 2 while computing albedo and (eventually) other material properties on the fly. Depth tests are enabled but not writes.
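
Roughly, at the XNA API level the frame looks something like the sketch below. The render target names and Draw helpers are placeholders for my actual code, and the formats are just examples.

using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

class DeferredRenderer
{
    // Created elsewhere; names and formats are illustrative.
    RenderTarget2D normalTarget, depthTarget, lightingTarget, sceneTarget;

    void RenderFrame(GraphicsDevice device)
    {
        // 1. G-buffer: normal + depth, depth test and write enabled.
        device.SetRenderTarget(0, normalTarget);
        device.SetRenderTarget(1, depthTarget);
        device.Clear(ClearOptions.Target | ClearOptions.DepthBuffer, Color.Black, 1.0f, 0);
        device.RenderState.DepthBufferEnable = true;
        device.RenderState.DepthBufferWriteEnable = true;
        DrawSceneGeometry("RenderGBuffer", null);

        // 2. Lighting buffer: render the lights using the G-buffer,
        //    depth writes and tests disabled.
        device.SetRenderTarget(1, null);
        device.SetRenderTarget(0, lightingTarget);
        device.Clear(ClearOptions.Target, Color.Black, 1.0f, 0);
        device.RenderState.DepthBufferEnable = false;
        device.RenderState.DepthBufferWriteEnable = false;
        DrawLights(normalTarget.GetTexture(), depthTarget.GetTexture());

        // 3. Apply lighting: re-render the geometry, computing albedo on the fly
        //    and sampling the lighting buffer. Depth test on, depth write off.
        device.SetRenderTarget(0, sceneTarget);
        device.Clear(ClearOptions.Target, Color.Black, 1.0f, 0);
        device.RenderState.DepthBufferEnable = true;
        device.RenderState.DepthBufferWriteEnable = false;
        DrawSceneGeometry("ApplyLighting", lightingTarget.GetTexture());
    }

    // Placeholder hooks for the real scene and light-volume drawing.
    void DrawSceneGeometry(string technique, Texture2D lighting) { }
    void DrawLights(Texture2D normals, Texture2D depth) { }
}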


So the first problem on the 360 is that XNA blows away the depth buffer I lay down in step 1 by the time I get to step 3. After some searching on the internets, I discovered this is expected behavior.

I tried setting my render targets to PreserveContents, which does work, but is completely wasteful since I don't give a hoot about restoring the actual color contents of any of these buffers. This dipped performance down to 150fps.
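For reference, the PreserveContents experiment is just a matter of how the render targets are created -- something along these lines, with the size and format being whatever the buffer actually uses:

// RenderTargetUsage.PreserveContents asks XNA to resolve and restore the target's
// contents across SetRenderTarget calls, which on the 360 means extra EDRAM resolves.
RenderTarget2D lightingTarget = new RenderTarget2D(
    device,
    backBufferWidth,    // placeholder size
    backBufferHeight,
    1,                  // mip levels
    SurfaceFormat.Color,
    RenderTargetUsage.PreserveContents);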

My next attempt was to restore the depth buffer manually from my G-buffer. But this exhibited z-fighting, possibly due to slightly different methods of Z calculation for my G-buffer vs. the depth buffer, leading to small differences in the computed Z values. I didn't feel that messing around with z biasing would be a robust solution, so I abandoned this effort.

The solution I ended up choosing was to just clear the z buffer again and reconstruct it during step #3. Since my scenes are so simple this gets me back to just slightly under 200 fps.
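In code the workaround is small -- at the start of step 3 it amounts to something like this (names are placeholders):

// Clear only the depth buffer, then re-lay depth while applying lighting.
device.SetRenderTarget(0, sceneTarget);
device.Clear(ClearOptions.DepthBuffer, Color.Black, 1.0f, 0);
device.RenderState.DepthBufferEnable = true;
device.RenderState.DepthBufferWriteEnable = true;   // depth is rebuilt during this pass
DrawSceneGeometry("ApplyLighting", lightingTarget.GetTexture());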

It's not an ideal solution, since I had in mind some uses for a stencil buffer laid down in step #1 that would accelerate step #2 (mainly, masking off unlit pixels for the skybox).

XNA's EDRAM handling is a great example of a leaky abstraction. Only having a 10 MB EDRAM buffer does make render target management trickier, but in Microsoft's attempt to completely hide it from XNA programmers, I think they've just made things more frustrating. The concept of a limited buffer for render targets is not that hard to get your head around. You have to understand EDRAM anyway since techniques in XNA that work perfectly on Windows (like what I was doing) will break on the 360. Even worse, you have no real good idea *why* it's breaking unless you understand the limitations of EDRAM and take a guess at what Microsoft is doing under the hood. So what is really being saved here? Just let me deal with EDRAM myself.

Sunday, June 14, 2009

Adventures in XNA continued

This weekend I played around in XNA a little bit more (completely personal stuff, nothing to do with work, opinions are my own, etc.). I still find it very fun for the most part, but the lack of access to the metal can be frustrating at times.

For the most part I've just been experimenting with deferred lighting. As far as what I'm trying to accomplish, I view this stuff like a musician doing scales. Good practice, but the goal is to get familiar with the techniques rather than produce anything "real".

I'd already built up a quick and dirty deferred lighting implementation a couple of months before. This weekend I removed some of the hacks I had, added HDR + bloom, threw in some simple terrain, played around with LogLuv encoding, and fixed some artifacts from my first-pass implementation.

I suppose that sounds like a lot, but the nice thing about XNA is there are a bazillion samples out there. The deferred lighting is the thing I'm really concentrating on, so for the other stuff I just grabbed anything I could find. Terrain, HDR, and bloom came pretty much as-is from samples/demos, as did an FPS counter.

As far as the deferred lighting goes, I finally got the half-texel offset stuff cleared up. In Direct3D 9, pixel centers and texel centers don't line up -- they are offset by half a pixel -- so when sampling something like a normal buffer or lighting buffer, if you don't compensate you'll be sampling the lighting from the wrong texel. This entry by Wolfgang Engel was a big help here.
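The fix itself is tiny once you know it's needed: compute a half-pixel offset from the render target size on the CPU, hand it to the shader, and nudge the full-screen quad's texture coordinates (or clip-space position) by that amount. Something like this, with the parameter name being whatever your effect uses:

// Half-pixel offset for D3D9-style texel alignment when sampling screen-sized buffers.
Vector2 halfPixel = new Vector2(0.5f / renderTargetWidth, 0.5f / renderTargetHeight);
lightingEffect.Parameters["HalfPixel"].SetValue(halfPixel);
// ...and in the quad's vertex shader: texCoord += HalfPixel;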

Reading Engel's ShaderX7 article, I also came to understand why the specular lighting has to be multiplied by n dot l, and fixed up some artifacts I was getting because of that (mainly specular appearing on backfaces).

My first pass at HDR used FP16 render targets for everything. I changed the final apply-lighting pass to encode into LogLuv, and then implemented the lighting-buffer encoding suggested by Pat Wilson's article in ShaderX7. A side effect of the very simple material model I'm using is that I could use an 8:8:8:8 buffer for this and still allow for high range when accumulating lighting. I currently don't have separate diffuse and specular albedo, so when I apply lighting the equation looks like this:

albedo*(diffuseLighting  + specularLighting)


This is part of the joy of a small demo -- no artists telling me they have to have a separate specular albedo :). Anyway, I realized that I can just add those two terms together before writing to the lighting buffer, and do a straightforward encoding of the result in LogLuv space. I do want to put control of the glossiness in the material model, but that will require encoding depth into 24 bits of a render target and packing an 8-bit specular power into the remaining channel. (I have to render depth to a separate target because XNA provides no facility for sampling the depth buffer.)
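For what it's worth, the packing I have in mind would look something like this -- shown as plain C# math for clarity, though in practice it would live in the shader, and the helper name is made up:

// 24 bits of normalized depth across RGB, 8-bit specular power in alpha.
static Color PackDepthAndSpecPower(float normalizedDepth, byte specPower)
{
    uint d = (uint)(normalizedDepth * 16777215.0f);   // 2^24 - 1
    return new Color(
        (byte)(d >> 16),
        (byte)((d >> 8) & 0xFF),
        (byte)(d & 0xFF),
        specPower);
}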

In the process of all of this I wasn't quite getting the performance I wanted. I'm doing this on my Dell M1330 laptop, which, while no slouch, has trouble running World of Warcraft at a decent frame rate. Still, given such a simple scene, I was just shy of 60 fps, so I decided to fire up NVPerfHUD and see what was going on. You can run NVPerfHUD with XNA apps, but a side effect I discovered is that all vertex processing is done in software.

This is a bummer since it greatly throws off timings (a draw call for one mesh took 5 ms on the CPU with an unbelievably simple vertex shader), but I was able to find some GPU hotspots, some of which I improved by pulling work out of the shader and onto the CPU.
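The typical example of that kind of change: anything constant across a draw call, like a composed world-view-projection matrix, can be computed once on the CPU and set as an effect parameter instead of being rebuilt per vertex in the shader. The parameter name here is a placeholder:

// Compose once per draw call on the CPU rather than multiplying three matrices per vertex.
Matrix worldViewProjection = world * view * projection;
effect.Parameters["WorldViewProjection"].SetValue(worldViewProjection);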

Anyway, I'm not sure how much I'll be working on this stuff, but when I do I'll try to put the odd post up here. I haven't tried running my latest incarnation on the 360, which will probably be my next step. I think I've got the render targets set up so it should work, assuming the XNA runtime isn't doing anything silly under the hood. But without PIX on the 360 it'll be hard to really dig into that.

Saturday, May 23, 2009

Enjoy being a beginner

That's a quote that stuck with me from an interview I read about Nolan Bushnell, and I've always thought it was something to aspire to. Because I find that as I get older, it can be easier to just stick to what you already know professionally. Inertia can settle in if you're not careful, and if a particular task is already in your wheelhouse, you tend to gravitate toward it.

The last year at Midway I was exclusively doing graphics tasks, and it was nice to completely focus on one area for an extended period after having hopped around to whichever system fire was highest priority on Stranglehold. Even then, I'd done a fair amount of graphics work over my career. Some of the things I worked on were areas I wasn't familiar with at the time (spherical harmonics), and in the process I learned a lot.

Now, though, I'm doing something completely different at the new gig. Unfortunately the project is still top secret and I can't get into any detail lest I inadvertently give something away. But unlike graphics, this is an area where aside from a handful of toy projects over the years, I haven't done anything before.

These kinds of opportunities are a big reason why I like working on games -- the range of work available is really wide, from the lowest-level shaders to the highest-level tools. I'm sure there are some other programming gigs that have this kind of range, but I can't think of many. Sure, the expectations in terms of what can be done, how long it takes, and how many people it will take can be pretty high, but every once in a while it pays to step back and remember that the work can be very rewarding in its scope and variety.

Tuesday, April 14, 2009

Tuesday, March 31, 2009

The Future is Now

Engadget brings us news of the Zeebo -- a game console that only offers games available for download. Sound familiar?

It is supposed to be released in Brazil in 2009. It is not intended to compete with the front-line consoles such as the PS3, 360 or Wii, and is targeted at emerging markets. Now this may be vapor, and there is some confusion as to the retail price - $599 seems pretty damn steep for anywhere, let alone emerging markets.

It is probably not surprising that the innovation in abandoning the boxed retail model is coming not from the big console manufacturers, but from smaller, more nimble players. I hope for Microsoft, Sony, and Nintendo's own sakes that they are not too entrenched in the boxed retail mindset to realize that it is only a matter of when, not if, boxed retail dies.

Monday, March 30, 2009

Please Get Your Physics Off My GPU

I hope to have some more substantial thoughts on GDC, but one nice trend was the number of talks focused on moving graphics work off the GPU and onto CPUs (in this case, the SPUs on the PS3).

For the last couple of years there has been a major push by the graphics card manufacturers to get non-graphics work onto the GPU. Cloth, physics -- heck, I'm sure there are even some GPU AI examples out there somewhere. These are things that the console game developers I know don't particularly want or need.

The lion's share of console games are GPU bound. The last thing I want to do is put more stuff on the GPU. So even if your cloth or physics solution runs really fast on the GPU, I'm not going to use it, because there is no room at the inn. Even if a CPU solution is slower, it won't matter, since I've got spare processing capacity from waiting on the GPU, or processing elements that go unused during a frame.

What I want to do is offload as much as possible to the CPU, since most games still probably are not maxing out the CPU capabilities of the PS3 or 360. It was nice to see some talks focusing on doing hybrid GPU/CPU solutions to things such as lighting or post processing, and I imagine this trend will continue.

Monday, March 23, 2009

The Problem with GDC

It always seems like I'll have one day with 4 or 5 sessions I want to see booked in the same time slot, and then another day with time slots where nothing interests me. Obviously you can't please all of the people all of the time, but Wednesday definitely looks like the busy day.

Anyway, here's what my current schedule is looking like:

Session Title Date Start Time End Time
Discovering New Development Opportunities 03-25-2009 9:00 AM 10:00 AM
Hitting 60Hz with the Unreal Engine: Inside the Tech of MORTAL KOMBAT vs DC UNIVERSE 03-25-2009 10:30 AM 11:30 AM
Next-Gen Tech, but Last-Gen Looks? Tips to Make your Game Look Better - That Don't Include Bloom and Motion Blur. 03-25-2009 12:00 PM 1:00 PM
Out of Order: Making In-Order Processors Play Nicely 03-25-2009 2:30 PM 3:30 PM
Deferred Lighting and Post Processing on PlayStation(R)3 03-25-2009 4:00 PM 5:00 PM
The Unique Lighting in MIRROR'S EDGE: Experiences with Illuminate Labs Lighting Tools 03-26-2009 9:00 AM 10:00 AM
From Pipe Dream to Open World: The Terraforming of FAR CRY 2 03-26-2009 1:30 PM 2:30 PM
Morpheme & PhysX: A New Approach to Combining Character Animation and Simulation 03-26-2009 4:30 PM 5:30 PM
The Cruise Director of AZEROTH: Directed Gameplay within WORLD OF WARCRAFT 03-26-2009 3:00 PM 4:00 PM
Fast GPU Histogram Analysis and Scene Post-Processing 03-27-2009 9:00 AM 9:20 AM
Mixed Resolution Rendering 03-27-2009 10:30 AM 10:50 AM
Rendering Techniques in GEARS OF WAR 2 03-27-2009 2:30 PM 3:30 PM
Dynamic Walking with Semi-Procedural Animation 03-27-2009 4:00 PM 5:00 PM

Thursday, March 19, 2009

Changes

A short personal update:

Last Friday was my last day at Midway Games. After seven and a half years I decided it was time for me to move on and will soon be pursuing another opportunity. It was sad to leave in some ways, as there are many people there who I have enjoyed working with, but it was the right time for a move.

I'll have updates about the new opportunity soon, and hopefully some more content.

Tuesday, March 10, 2009

What Will the Future Bring?

It's been pretty obvious for a while that boxed retail in games will die someday, but recent news that Amazon will buy and sell used games definitely seems like one of the nails in the coffin.

I can't blame retailers such as Amazon or Best Buy for entering the used game market, but I really wonder when the major players in the game industry will wake up and realize that the day when all games are downloadable is coming sooner rather than later.

Another way to phrase the question: when will a major console manufacturer release a console that does not contain any sort of removable DVD/bluray drive?

Consumers are already used to downloadable games in many other forms - cell phones, the iPhone, web games, Steam, Gametap, XBLA, PSN store, etc. There are even rumors that the next version of the PSP will not contain a UMD drive.

So why not cut the cord? Imagine a console comes out in 2012 that does not contain any sort of optical drive, just a large hard drive. All games, not just ones deemed small enough for a special "arcade" section, are downloadable.

There would be issues. Currently US broadband adoption rates are at 59% of households, well behind Japan or Europe. Even optimistic projections estimate that by 2012 only 77% of US households will have broadband, although it would be interesting to know what percentage of console gamers will have broadband. Still, it would be a definite leap of faith to exclude such a large percentage of households from buying your product.

As far as the user experience goes, I don't think there would be many problems. Even if games took multiple hours to download, I don't see how that is any worse than getting a game from GameFly or Amazon is now. Steam has experimented with allowing users to "predownload" popular titles before their release date, and a similar model could be used for the users that just gotta-have-it the day of release. 

Another advantage of this approach is some savings on cost-of-goods on the consoles themselves -- for example, the bluray drive in the PS3 is probably a big driver of the total cost of that system. 

The only question is how retailers would react. They could threaten to not sell the console hardware, but a colleague of mine had an idea about that: prepaid download codes. Retailers could sell these along with the hardware. It won't be as lucrative as current boxed retail sales, but then again, by pushing used game sales so hard, the retailers are eventually going to force game publishers and console manufacturers' hands.

Sunday, January 25, 2009

C0DE517E on understanding your choices

Link (Too Young)

And when something does not look right, do not fix it, but find out first why it does not.

If it's too saturated, the solution is not to add a desaturation constant, but first to check out why. Are we taking a color to an exponent? What sense that operation has? Why we do that? Is the specular too bright? Why? Is our material normalized, or does it reflect more light than the one it receives? Think.
The entry I'm linking today wanders around a bit but eventually lands in a good spot.

Ultimately with games we're just trying to make something fun, and being visually interesting is part of that. We're not in the business of shipping perfectly accurate simulations of light, nor is that possible anyway. It may not even be desirable, depending on your art style -- I've always felt the ongoing arguments about "graphics not being important" in games are really about "photorealism is not the be-all and end-all."

Photorealism in games is a discussion for another entry some other day. Back to the linked article, if I were to sum it up I would say it is about understanding your choices. A well placed hack is often necessary, but do you understand what the hack does and (often more important) why it is needed?

In rendering we are almost always approximating some ideal, be it the behavior of reflected light on various surfaces or the behavior of skin over muscle, fat, and bone. The ideal may not be something that exists in the real world -- it may be a stylized alternate reality created only in the minds of your artists. Even these alternate realities have consistent rules, ones often based in the physical realities of the real world, or more importantly, based on how humans perceive those physical realities. If you go back to old Walt Disney cartoons, there is a consistency of action and reaction in any given movie, a set of unwritten rules that the animators provided for their audience. 

So as the author suggests, when presented with something that doesn't look right, a little thought into why it doesn't look right can go a long way. What ideal are you trying to approximate? Why does your approximation produce results that look wrong to the naked eye? How can you improve the approximation to produce the desired results? Sometimes you may find a bug that can be fixed, or a better technique for addressing the issue.

It may be the case that the answer is to add a simple constant hack. If you go through the process of determining what is going wrong, you will at least have a much better understanding of why that hack is necessary, and how it interacts with the rest of the system. This is true beyond just rendering; understanding the choices you make is key to good software engineering in general. 

Wednesday, January 21, 2009

The world is... the world

(This is the third and final part of a three part series. Part 1. Part 2. )

We've established that a generic tree implementation is not a good choice for implementing the SceneTree data structure and algorithms. This raises the question: then why call this thing a tree?

A group of people I work with obsess over naming classes, systems, functions, member variables, etc. We view it as a very important part of programming -- the name should describe what it does. If I have trouble coming up with a good name for something, then maybe it isn't clear to me what it does, and if it is not clear to me, how is it going to be clear to others?

The best documentation of the code is always going to be the code itself. Obviously, you want to comment your code, but if you pick clear and concise names for things, there is less you need to communicate via comments. If a user comes across a function named DrawCircle, it is a pretty good bet that function will draw a circle. If it doesn't draw a circle, or it draws a circle and formats the hard drive, that would be quite a surprise.

The name SceneTree implies both an interface and an implementation that is based on trees. We've seen from the previous entries that we don't need or want either. So naming it SceneTree and delivering something else would be a case of bad naming.

I don't have an alternate suggestion off the top of my head. To be honest, I'm not absolutely sure we need a separate name for transform update. The important concept is that we separate transform update from the renderer. I've worked in frameworks where this transform update was part of the World system.

In summary, a generic tree design and implementation is not a good choice for world object transform update and hierarchy composition. This is due to many of the same reasons that make a scene graph a bad choice, and due to the way that gameplay code tends to access transforms and bones. Given that a generic tree is a bad choice, the name SceneTree is a bad name.

Tuesday, January 13, 2009

Characters are a special sort of tree but not a SceneTree

(This is the second part of a three part series, here is the first part. )

At this point we've established that for the vast majority of objects in our world, a SceneTree implemented as a straightforward tree structure is not a good fit. But what about characters or other objects driven by skeletal animation?

At first glance, a straightforward tree structure seems like a really good fit. What is a skeleton if not bones connected by joints, where each bone may have many children but only one parent? Each bone's transform is relative to its parent. To compose the world transforms of each bone in the skeleton, we merely traverse the tree, multiplying each bone's relative transform with its parent's world transform.

If we have a straightforward tree data structure, then we have a base SceneTreeNode class, and anything that we need to be in the tree derives from that node. Well, our bones are a tree, so it makes sense to make a bone a node, right?
class Bone : public SceneTreeNode
{
};
I'm going to describe why the above is not the best decision on a number of fronts. It certainly works, it is simple, and tons of commercial and noncommercial projects have used it -- so what could possibly be the problem?

Let's think about this in terms of our gameplay code, the high level code that makes our game different than every other game out there. We want this code to be as easy to write as is possible. One part of accomplishing this is to hide any complexity the gameplay code doesn't need to know about from it.

Gameplay code doesn't need to know about the hierarchy, and in fact is going to be much happier if it doesn't know about it. It usually just wants to retrieve a small handful of specific bones' transforms, or attach other objects to a bone. With bones as full-fledged citizens in the SceneTree, and a straightforward tree structure as the implementation, how would gameplay code go about this? It would need to traverse the SceneTree to find the bone it is interested in and retrieve the cached world transform. This is not very convenient, so we'd probably add a GetBoneTransform helper to the Character node to hide these details.

We've still got an implementation of GetBoneTransform that hops around a lot of different pieces of memory, causing cache misses all along the way. If this becomes a performance bottleneck, we might decide to use some kind of efficient indexed lookup to cache the bones at the Character node level, and implement GetBoneTransform in terms of that lookup. Attachments can be handled similarly -- rather than use the built-in tree traversal mechanisms, most likely the code will end up caching the list of attachments somewhere else for easy access in gameplay code.

If we're going to abstract the tree away from gameplay code, then what is the advantage to making bones full-fledged citizens in the tree? In fact, there are significant design and performance disadvantages.

Bone hierarchies are mostly static, most of the time. I say mostly because sometimes a game may swap different body parts in, or switch level of detail on the skeletons. Practically, though, the hierarchy doesn't really change in any sort of significant fashion. Given this knowledge, a much better implementation is to lay out our bones in a flat array, with a simple index to their parent. This index may be in a separate array depending on cache usage patterns. The array can be laid out in such a way that compositing relative transforms we get from the animation system into absolute transforms is a simple run through an array. There are tricks you can use to make this more efficient, of course, but the point is an array traversal is going to be much better than hopping all around memory calling virtual functions on a base SceneTreeNode class. The approach also lends itself much better to offloading to another CPU due to better memory locality, or easier DMA if the CPU has NUMA.
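Here is a rough sketch of that layout, in C# to match the XNA experiments elsewhere on this blog. The names are illustrative, and a real implementation would pack the arrays according to its own cache and DMA needs:

using Microsoft.Xna.Framework;

// Bones stored parent-before-child so world transforms compose in one forward pass.
public class Skeleton
{
    private readonly int[] parentIndex;         // parentIndex[i] < i, or -1 for the root
    private readonly Matrix[] localTransforms;  // relative transforms from the animation system
    private readonly Matrix[] worldTransforms;  // composited results

    public Skeleton(int[] parents)
    {
        parentIndex = parents;
        localTransforms = new Matrix[parents.Length];
        worldTransforms = new Matrix[parents.Length];
    }

    // Written by the animation system each frame.
    public Matrix[] LocalTransforms { get { return localTransforms; } }

    // One linear walk through the arrays -- no node hopping, no virtual calls.
    public void ComposeWorldTransforms(Matrix rootWorld)
    {
        for (int i = 0; i < localTransforms.Length; ++i)
        {
            int parent = parentIndex[i];
            worldTransforms[i] = (parent < 0)
                ? localTransforms[i] * rootWorld
                : localTransforms[i] * worldTransforms[parent];
        }
    }

    // What gameplay code actually calls: an O(1) lookup by bone index.
    public Matrix GetBoneTransform(int boneIndex)
    {
        return worldTransforms[boneIndex];
    }
}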

Do we need a tree at all? Not that I can see -- from the previous entry we've got a list of root level objects, some of which may have an array of tightly packed bones, and an array of attachments. Their attachments may also have bones and attachments, so yes, conceptually, we've got a tree that we can traverse. But gameplay code never needs to know anything about this for the most part, as it deals with bones and attachments directly and isn't very concerned about the internals of the hierarchy.

We don't need a base SceneTreeNode class that is exposed to all -- the internals of how we update attachments, bones, and objects are just that: internal. As we've shown, a straightforward tree structure doesn't fit what we want to do very well as an implementation. From experience, you can spend a fair amount of time optimizing your object updates, and keeping those internals hidden makes it very easy to special-case, offload work to other CPUs, or apply any number of other optimizations without breaking gameplay code. A generic tree structure does not provide the API we need for gameplay code, nor does it provide a good implementation at that level.

Tomorrow I will conclude with some thoughts on why naming matters.