tag:blogger.com,1999:blog-2633275295816527132024-03-05T21:04:26.810-06:00Solid AngleRandom thoughts about game development.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.comBlogger53125tag:blogger.com,1999:blog-263327529581652713.post-25154244850017473802023-05-01T17:44:00.002-05:002023-05-01T17:44:42.236-05:00Interview on GameDevAdvice PodcastRecently I sat down with John Podlasek on his <a href="https://www.gamedevadvice.com/">GameDevAdvice</a> podcast. We covered topics ranging from my personal path through the game industry, starting a game studio, <a href="https://disbelief.com/">Disbelief</a>, with my business partner Steve Ellmore, advice for programmers starting out in the industry, projects the studio and I have worked on, anti-crunch culture, and more. Plus we throw in a little Midway reminiscing, both the good and the absurd.<br /><br />You can listen on <a href="https://podcasts.apple.com/us/podcast/starting-your-own-studio-bioshock-infinite-anti-crunch/id1450096263?i=1000611197618">Apple</a>, <a href="https://open.spotify.com/episode/4MSpYCv858srQ3qie1p3lI">Spotify</a>, or <a href="https://www.gamedevadvice.com/">elsewhere</a>.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-45288865729180456682022-12-18T12:44:00.003-06:002023-06-03T21:26:07.171-05:00Where to find me online<p> Since Twitter is setting up a Berlin Wall and banning links to any social media sites, I figured I would post where you can find me online here.</p><p>Mastodon: https://mastodon.gamedev.place/@solidangle</p><p>Bluesky: https://bsky.app/profile/solidangle.bsky.social</p><p>Not really active, but I'm on these sites as well</p><p>Post: https://post.news/solidangle</p><p>Cohost: 
https://cohost.org/solidangle</p>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.comtag:blogger.com,1999:blog-263327529581652713.post-34345742597601992362018-05-21T09:47:00.000-05:002018-05-21T09:47:12.622-05:00Open Salaries at Disbelief<a href="http://disbelief.com/">Disbelief</a> is a little over four years old. One thing I've wanted to do is write a little more about our experiences of bootstrapping and growing a small services company.<br />
<br />
For some background, Disbelief is a tech services company that focuses on problem-solving for game developers. In short, we help people ship their games. Since <a href="http://ellmore.blogspot.com/">Steve Ellmore</a> and I founded Disbelief we've grown from five people working out of their apartments to seventeen people spread across two offices in Cambridge, MA and Chicago, IL.<br />
<br />
Recently, we completed a transition to open salaries at Disbelief. Everyone at Disbelief knows the salary and responsibilities for each role. We've had a goal of fairness and transparency within the company for a long time, and this is one key part of that.<br />
<br />
<h2>
The Business Case for Internal Transparency</h2>
Our early compensation structure was mostly ad-hoc. Steve and I would get together and discuss a number on a per-case basis. While we'd put a lot of thought into it, we weren't happy with this process. It worked fine when the company was a group of people who we had known for years, but it wasn't going to scale and it wasn't particularly transparent to anyone.<br />
<br />
A problem-solving services company relies foremost on good communication and cooperation -- it's literally our business model. If we can't communicate and work well with clients, we are not going to succeed. Building a culture of communication means we have to practice what we preach internally - clearly communicate business goals, current sales activity, and generally be as open as possible so everyone is rowing in the same direction.<br />
<br />
We realized part of this would be more openness when it came to expectations of each position. An ad hoc set of criteria locked in our heads was not sufficient and would not scale. We needed more formal definitions of roles.<br />
<br />
<h2>
Roles and Ladders</h2>
What we came up with was a 'ladders' spreadsheet that defines the roles and responsibilities of each position. Each row is a position (Junior Programmer, Programmer 1-3, Senior Programmer 1-3, etc.), and each column describes the responsibilities and criteria for one area of evaluation. The criteria are specific and attempt to minimize the amount of subjective evaluation and 'gut feelings'.<br />
<br />
One principle behind the criteria is that they are rooted in business needs. This keeps us focused on what is truly important for the business to succeed and what is not.<br />
<br />
For example, one column is "External Communication". Junior Programmers are mostly concentrating on learning the craft of systems programming itself - how to take schooling or self-taught lessons and apply them to real systems in the real world, at the quality level Disbelief demands. Reflecting this, their responsibilities for external communication are minimal - they must be able to report status internally while other, more senior engineers handle the bulk of client communication. At the other end of the scale, Senior Programmers are expected to clearly report status to clients, anticipating and managing expectations, and act as ambassadors for Disbelief as a whole.<br />
<br />
Defining these roles took a long time, and we thought very carefully about it. It has become a very useful tool. Monthly one-on-one meetings with managers have a clear set of criteria for discussing performance. Having the deltas between rows allows us to focus our mentorship and training efforts rather than trying to teach everything at once. Promotions can lay out the exact responsibilities of the new role, both beforehand and after. New hires can be slotted based on specific criteria rather than gut feelings. Managers can be trained to evaluate using these criteria, which allows us to scale. The 'requirements' section of public job descriptions is mostly pre-written. At the highest level this document lays out a core of what it means to be a Disbelief engineer.<div>
<br /><h2>
Flattened Salaries</h2>
<div>
What we did next is still controversial in many business circles - we flattened salaries. For each role, every person in that role is paid the same. You can find many business articles that will argue for this, as it can take a lot of arbitrariness out of compensation and increase fairness. It limits situations where two people are doing the same job and getting wildly different pay. You can also find many that will argue against - it removes incentives for increased performance.</div>
<div>
<br /></div>
<div>
<div>
For us, it came down to a growing confidence that we had nailed down the core aspects of the job in written form. We're not naive - no document will ever capture everything nor handle every situation. We feel we have enough gradations in roles and coverage to reflect people's day-to-day contributions. What we found is the harder cases just made us reflect on and refine our positions and roles. We have enough flexibility to recognize that not every Senior Programmer is the same, but have also made a lot of effort to write down specific criteria for specific roles rather than go with gut feelings. </div>
</div>
<div>
<br /></div>
<div>
Getting to a flattened salary structure required a period of adjustment -- mostly giving people raises to get everyone at the proper level for their role. This took some time as we're completely bootstrapped, and had to make sure our adjustments didn't cause payroll to outstrip our revenue. What made this a little easier is that around the same time, we knew we had to do a series of competitiveness raises to get us closer to the market. These raises were broad-based, so most people at the company got raises around the same time - some for competitiveness, some also for normalization.</div>
<div>
<br /></div>
<div>
So far we haven't seen many of the downsides or gotten a lot of negative feedback. We took our time with the transition, explaining it to the entire company and discussing it one-on-one. More often than not we find this a selling point with candidates.</div>
<div>
<br /></div>
<div>
One reason is that base salary is only one part of our compensation. Aside from our benefits package, we have a bonus structure that is based on the performance of the company as a whole. If the company does better, everyone does better.</div>
<div>
<br /></div>
<div>
Cost-of-living adjustments are done yearly and across the board, completely separate from merit/promotion increases. Additionally, we frequently review our compensation structure and will make competitiveness adjustments to a position's salary if we feel we're not in line with the market.</div>
<div>
<br /></div>
<div>
At the end of the day, compensation is only one factor in job satisfaction. We feel the positives of equitable salaries at each position outweigh the negatives and foster a culture of cooperation. What works for us may not work for you. </div>
<div>
<br /></div>
<h2>
Open Salaries</h2>
<div>
We have had an open 'Ladders' spreadsheet within the company for quite a while, and everyone knew the salary structure was flat at each role. Only recently did we decide to attach salary information - we wanted to make sure we had time to address any negative feedback on the new compensation structure.</div>
<div>
<br /></div>
<div>
So far the reaction internally and externally has been positive. When I mentioned to a couple of friends running their own companies that we were releasing salary info internally, their immediate reaction was 'oh, I've wanted to do that, how did you get there?' That's part of why I wanted to write up our experiences.</div>
<div>
<br /></div>
<h2>
The Future</h2>
We hardly consider ourselves done - the compensation structure and roles are a living document, continually reviewed. For instance, we only have programmer roles defined in this structure. Now that we have started to hire for non-programmer roles, we need to define them and add them to the overall structure.<br />
<br />
An additional problem we're tackling is adding a non-management, individual contributor track for very senior personnel who do not want to be leads but want a path for advancement. For example, recognizing that someone is a domain expert who can mentor and teach the rest of the company about an area. The last thing we want to do is shove people into a management role they are not going to be effective in.<br />
<br />
We've found this a very useful process that helped us sharpen and define our approach to our work and business. If we find something that works better, we won't shy away from change. At the end of the day, whatever route you choose for your company should be rooted in what works for your business. This structure may not work for you, but so far it is working for us.<br />
<br />
<h2>
Join Us</h2>
<div>
If you've read this and thought "I'd like to know more about Disbelief", check out our <a href="http://disbelief.com/careers.html">open positions</a> and drop us a line. </div>
<br /></div>
Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com1tag:blogger.com,1999:blog-263327529581652713.post-45800325386706988792015-11-06T06:26:00.005-06:002015-11-06T06:26:52.429-06:00Interview for PluralSightI recently sat down for an <a href="http://blog.pluralsight.com/behind-the-controller-an-interview-with-codebeasts-co-founder-steve-anichini?es_p=968594">interview</a> with Kurt Williams for PluralSight.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-75876849883847841302015-08-15T10:55:00.000-05:002015-08-15T11:14:37.636-05:00Tips for Navigating Large Game Code Bases<i>(Author's Note: This article is about navigating large game code bases, which can go up to and past the 2 million lines of code mark and are usually mostly C++. I'm sure these tips are applicable to other industries with comparable size projects, but you write what you know)</i><br />
<i><br /></i>
Freshly hired, you sit down at your desk at your hard-won job at Big Game Studio. You're making games! Big ones! Excited to get started, you've gone through the orientation, pulled the source tree to your machine, and fired up the IDE. You wait for it to load... and wait... and wait... just how large is this project? Your producer has already given you a simple task for the studio's editor -- make the FooBaz window remember its position and size between runs of the editor. You have no idea where to start. A sense of panic comes over you as you realize little in your previous experience has prepared you for dealing with a code base this big.<br />
<br />
It is not uncommon these days to land a game programming gig and have to deal with a large legacy code base. It could be all in-house code, it could be a licensed engine, or it could incorporate a lot of open source software. Unless you are working on a small game, more than likely the first thing you are going to have to learn is how to navigate this beast.<br />
<br />
<a name='more'></a><br />
<br />
<b><span style="font-size: large;">Don't Panic</span></b><br />
<br />
The first thing to realize is all of this code was written by people like you. They are smart, they are experienced, but hey, you are pretty smart too -- if you weren't, you wouldn't have the gig. It may have taken years or even over a decade to build this code base up. At the end of the day every single person started out as a beginner in this code base, even the person who wrote the first lines of code for it. There's no magic; it's just programming like you've done in previous smaller projects, but on an industrial scale.<br />
<br />
<span style="font-size: large;"><b>Building</b></span><br />
<span style="font-size: large;"><b><br /></b></span>
The first thing to do is figure out how to build the thing. Every large code base I've dealt with has a certain amount of lore for setting up the build environment. If you're really lucky this setup is automated, but I've never encountered this, because new hires are rare enough that it is usually not worth automating.<br />
<br />
Hopefully it is documented somewhere. That document is likely out of date. A lot of places have the new programmer update the document as they go through the setup. If this document doesn't exist, volunteer to write it as you go through the process of setting things up -- this will impress your lead.<br />
<br />
The build process is something that is often very home-brewed and custom, so it's difficult to have many general tips. But here are a few:<br />
<br />
Make sure you have the correct versions of everything installed. Depending on the organization, you may have had everything set up for you by IT, but in some companies programmers prefer to have IT install just the basics and set up their tools and environment themselves. Either way, you may have just happened to start a week after they updated the compiler or an SDK -- everyone got the email, but you weren't here to get it. Ask someone (your mentor, your next-desk neighbor, etc.) if anything in the build environment has changed recently that may not be documented.<br />
<br />
If it's not building, it's almost always something you missed setting up your environment. It's unlikely the checked-in build is broken -- if it were, you'd probably see some Senior Engineers stomping around looking like a T-Rex in hunt of fresh meat. Here are just a handful of things that may have been omitted from any setup guide: environment variables, compiler updates, SDK updates, or (on Windows) a redistributable installer to set up some .NET component.<br />
<br />
At all points it is ok to ask someone about build setup problems -- no one expects someone off the street to know all this lore. People do appreciate some effort to figure things out on your own, and in general it is a good idea to do so, because you can ask specific questions ("it seems like I don't have OpenEXR installed, where do I get that?") rather than general ones ("it's broken?"). Specific questions get you an answer faster and take up less time for the person answering. A mistake new programmers often make is transitioning from asking about the build setup (which is expected) to asking about every little thing (which is not).<br />
<br />
<span style="font-size: large;"><b>Finding Things on Your Own</b></span><br />
<br />
You've got it building, now you need to find this FooBaz window your task is about.<br />
<br />
You are going to be tempted to ask where to find every little thing in the code. After all, when the senior engineers aren't stomping around like a T-Rex hunting down build breaks, they seem to have an encyclopedic knowledge of the code base. They could answer in five seconds something that might take you five minutes to find.<br />
<br />
Resist this temptation. Senior Engineer has stuff to do, and helping people find which file a function is in is not one of them. What they should be doing is educating you on how to find things yourself (hence this entry or an article my co-founder Steve Ellmore wrote <a href="http://ellmore.blogspot.com/2015/05/the-learning-curve.html">on learning how to learn</a>). With enough practice, you'll be the one with the seemingly encyclopedic knowledge of the code base.<br />
<br />
Because that's the real trick -- Senior Engineers do not have the entire code base in their head. That's impossible beyond a certain amount of code. What they do know is how to look for things.<br />
<br />
I'm not suggesting you never ask any questions or seek help, but learning how to investigate things on your own is a valuable skill that will make you a better programmer. If nothing else, such investigation will lead you to ask better and more specific questions which give you quicker answers.<br />
<br />
<b><span style="font-size: large;">Your Weapons</span></b><br />
<b><br /></b>
The project may have some documentation describing the high-level structure of the code. Again, expect it to be out of date. Still, out-of-date documentation can often be better than none when wrapping your head around the code base -- large code bases are like big ships: they steer slowly, and some of the information is bound to still be relevant.<br />
<br />
Top-down searching will only get you so far, though. For example, you may think to search for the <i>main</i> function (<i>WinMain</i> on Windows). You can do this, but realize that this (by definition) is going to lead you to <i>all</i> the code. Startup and shutdown code can often be messy, particularly on cross-platform projects. The main game loop may or may not be clean and easy to understand. In modern code bases which spread work across multiple cores, there may not be a main game loop at all.<br />
<div>
<br /></div>
<div>
<b>Find in Files</b></div>
<div>
<b><br /></b></div>
<div>
Your number one weapon for finding things is going to be (in Visual Studio) Find in Files. For the Unix inclined, grep. People love to use fancy tools like <a href="http://www.scitools.com/">Understand</a>, and IDEs have all sorts of built in source browsing functionality, but at the end of the day Find in Files or its equivalent is going to help you the most. Figuring out <i>what</i> to search for is the tough part.</div>
<br />
<div>
Often I prefer to search for things bottom-up, because while the high-level details of the code can vary from code base to code base, low-level things like platform APIs and system calls do not. For example, at the end of the day, everyone's renderer is using some graphics API. Searching for specific D3D or OpenGL calls will inevitably give you a starting point for understanding the renderer. You may need to dive down into some open source graphics wrapper or licensed engine code, but you can always work your way back up from that starting point to get the bigger picture.</div>
<div>
<br /></div>
<div>
Other examples are searching for common terms in a specific area. To find the animation system, I'd search for things like "animation", "anim", "bone", "skeleton", "skin", "skinning", etc. For physics "rigidbody", "rigid", "force", "mass", etc. </div>
<div>
<br /></div>
<div>
You want to avoid search terms that are too generic -- "matrix" or "transform" would likely get you hits just about everywhere. </div>
<div>
<br /></div>
<div>
<b>Editor and UI code</b></div>
<div>
<br /></div>
<div>
One trick for searching any kind of editor or even in-game UI code is to search on the terms that show up to the user. Because the editor has non-programmer users, documentation on it can often be in much better shape than what's available to programmers, particularly on licensed engines. Learning how to navigate and use the editor and game itself can help you understand the underlying code.</div>
<div>
<br /></div>
<div>
Going back to our FooBaz example, I'd fire up the editor and find the FooBaz window. I'd then look for some UI string - a menu name or menu item. I'd then go back to the code and Find in Files on that menu item name. It may be in a string table, but those are usually key-value pairs, so you search again on the key. Now you've found the FooBaz window's code.</div>
<div>
<br /></div>
<div>
<b>Player-visible names may not be the names in the code</b></div>
<div>
<br /></div>
<div>
One thing to be aware of is that what a feature is called in the game is not always what it is called in the code. Early on in feature development it is quite common for an internal-only name to be used when writing the initial code, just because the official name hasn't been thought of yet. For example, in <i>BioShock Infinite</i> what eventually became known as Skylines were originally called something else. Very little of the code referenced the word "skyline".</div>
<div>
<br /></div>
<div>
So if you can't find something after searching for a while, don't get frustrated -- this may just be a piece of lore you do not have yet, and a good question to ask about.</div>
<div>
<br /></div>
<div>
<b>Code Navigation Tools</b></div>
<div>
<b><br /></b></div>
<div>
In day to day C++, a useful code navigation tool is being able to flip between definitions and declarations quickly. I primarily use Visual Studio so I'm familiar with the tools there. IntelliSense is "ok" when it works and when it doesn't spontaneously decide to hang your editor, but I find <a href="https://www.wholetomato.com/">Visual Assist</a> to be much more reliable and fast with larger code bases.</div>
<div>
<br /></div>
<div>
Tools that generate class inheritance diagrams and other such navigation aids may be useful to you, particularly when just starting out in a code base, but I've never really found them that helpful. Same with doxygen-generated docs -- they are always out of date and you find yourself cross-referencing the source anyway.</div>
<div>
<br /></div>
<div>
<b>Source Control History</b></div>
<div>
<b><br /></b></div>
<div>
When trying to understand a specific part of the code, a great tool is source control history. Any source control system contains a detailed history of the code base. You can find out what changes were made, when, and by whom. You can trace back when a feature was added and see all the related files that needed to be changed -- this can be very helpful when making a similar change, to make sure you catch all the cases you need to modify. Git has the <a href="http://git-scm.com/docs/git-blame">blame</a> tool. <a href="http://www.perforce.com/">Perforce</a> in particular has a great feature called "Time-Lapse View" which allows you to interactively go back in time in any given source file.</div>
<div>
<br /></div>
<div>
<span style="font-size: large;"><b>Assume It's Been Done Before</b></span></div>
<div>
<br /></div>
<div>
One common mistake I see programmers new to a large code base make is reinventing the wheel. Tasked with implementing a feature, they need some utility routine, let's say a line-plane intersection. Excited to get their feature working, they blaze ahead, write a line-plane intersection function, and then implement their feature. When they go to check it in, the code reviewer says "why didn't you use the existing line-plane intersection function in MathFoo.h?" and they have to go back and update the code. In the process they've wasted some time reinventing the wheel. Even worse, sometimes the code review does not catch the duplication, and now the code base is worse for it.</div>
<div>
<br /></div>
<div>
With a big enough code base, you need to start from a position of assuming someone has already tackled this problem or a similar one before. It's on you to search the code and try to find it before going off and writing a bunch of duplicated functionality.</div>
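To make the example concrete, here is a sketch of the kind of routine at stake -- a line-plane intersection like the one that probably already lives in a shared math header. The Vec3 type and function names are made up for illustration; in a real code base you'd search for and reuse the existing versions.

```cpp
#include <cmath>
#include <optional>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 Add(const Vec3& a, const Vec3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3 Scale(const Vec3& v, float s) { return { v.x * s, v.y * s, v.z * s }; }

// Intersect the line p(t) = origin + t * dir with the plane dot(n, p) = d.
// Returns nullopt when the line is parallel to the plane.
std::optional<Vec3> LinePlaneIntersect(const Vec3& origin, const Vec3& dir,
                                       const Vec3& n, float d)
{
    const float denom = Dot(n, dir);
    if (std::fabs(denom) < 1e-6f)
        return std::nullopt; // parallel: no single intersection point
    const float t = (d - Dot(n, origin)) / denom;
    return Add(origin, Scale(dir, t));
}
```

Ten minutes of Find in Files is almost always cheaper than writing (and debugging, and duplicating) something like this from scratch.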
<div>
<br /></div>
<div>
Returning to our FooBaz window example, I would open other windows in the editor and see if they remember their position and size between runs. If they do, I'd use the UI string trick to find their code, and read the code to figure out how they save and restore this position. I might step through the code in the debugger if I can't figure it out by inspection.</div>
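Once you've found how other windows do it, the actual change is usually small. Here's a minimal sketch of the save/restore pattern, assuming a simple key-value settings store -- the EditorSettings class and key names are hypothetical, standing in for whatever persistence mechanism the editor already has (UE3-era editors typically used .ini files for this):

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the editor's existing settings store.
class EditorSettings {
public:
    void SetInt(const std::string& key, int value) { values_[key] = value; }
    int GetInt(const std::string& key, int fallback) const {
        auto it = values_.find(key);
        return it != values_.end() ? it->second : fallback;
    }
private:
    std::unordered_map<std::string, int> values_;
};

struct WindowRect { int x, y, width, height; };

// On window close: save the placement under a per-window prefix.
void SaveWindowRect(EditorSettings& s, const std::string& prefix, const WindowRect& r) {
    s.SetInt(prefix + ".x", r.x);
    s.SetInt(prefix + ".y", r.y);
    s.SetInt(prefix + ".width", r.width);
    s.SetInt(prefix + ".height", r.height);
}

// On window creation: restore the placement, falling back to defaults on first run.
WindowRect LoadWindowRect(const EditorSettings& s, const std::string& prefix,
                          const WindowRect& defaults) {
    return { s.GetInt(prefix + ".x", defaults.x),
             s.GetInt(prefix + ".y", defaults.y),
             s.GetInt(prefix + ".width", defaults.width),
             s.GetInt(prefix + ".height", defaults.height) };
}
```

The point isn't this exact code -- it's that the existing windows almost certainly already follow some pattern like it, and your job is to find and mirror that pattern.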
<div>
<br /></div>
<div>
<span style="font-size: large;"><b>Coding Style</b></span></div>
<div>
<span style="font-size: large;"><b><br /></b></span></div>
<div>
Everyone has a house coding style, and the rule of thumb when dealing with existing code is to stick to the style that's already there. Don't be that person injecting your personal style and creating style mismatches. </div>
<div>
<br /></div>
<div>
Coding style guidelines vary in formality. You may have something as comprehensive as the <a href="http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml">Google C++ Style Guide</a>, or it may be a short page with some general guidelines. Whatever there is, read it, understand it, and embrace it. You can use your personal style at home, but on the job stick to what's there.<br />
<br /></div>
<div>
</div>
<div>
<span style="font-size: large;"><b>The Map is not the Territory</b></span></div>
<div>
<br /></div>
<div>
Putting aside compiler bugs or platform bugs, what the source actually does is the final word. Documents may be out of date. Comments may not match what the code actually does. Someone you ask about the code may have a misunderstanding of what it actually does.</div>
<div>
<br />
It's an obvious concept, but the answers to "What? How? Where? When?" can always be found in the source code itself. The only thing that can't always be sussed out from the code alone is "Why?" - there you are going to have to track down Senior Engineer and ask them.</div>
<div>
<br />
<b style="font-size: x-large;">Practice Makes Perfect</b><br />
<br />
One advantage new programmers have today is you can find examples of larger game code bases online. <a href="https://www.unrealengine.com/">Unreal Engine 4</a> has its entire source code available to just about anyone, and if you learn to navigate that beast, you can navigate just about anything. id Software has a history of open-sourcing many of its older games (don't forget to navigate the tools, too). There are also open source game engines you can find and dive into.<br />
<br /></div>
<div>
With enough practice of navigating large code bases, you can parachute into any project and be productive right away.</div>
<div>
<br /></div>
Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-74845830279743731912014-03-03T13:23:00.000-06:002014-03-03T13:24:59.919-06:00BioShock Infinite Lighting<div>
<br /></div>
Programmers don't generally have reels, but we do have blogs. I've been explaining the rendering work I did on BioShock Infinite quite a bit due to <a href="http://kotaku.com/bioshock-studio-irrational-games-is-winding-down-all-1525186316">recent events</a>, and I thought it made sense to write some of it down here. For the bulk of development, I was the only on-site graphics programmer. As Principal Graphics Programmer I did quite a bit of implementation, but also coordinated and tasked any offsite rendering work.<br />
<br />
<b><span style="font-size: large;">Goals</span></b><br />
<div>
One of our artists best described Infinite's style as "exaggerated reality." The world of Columbia was colorful, highly saturated, and high contrast. We needed to handle both bright, sunny exteriors and dark, moody interiors simultaneously. We were definitely not going for photorealism.<br />
<br />
The levels were bigger than anything Irrational had attempted before. The previous game Irrational had worked on, BioShock, was more of an intimate corridor shooter. In contrast, we wanted Columbia to feel like a big city in the clouds. This meant much bigger and much more open spaces that still retained the high detail required for environmental storytelling, because much of the storytelling in a BioShock game was done via the world itself.</div>
<div>
<br /></div>
We wanted a streamlined lighting pipeline for level artists. It was obviously possible to get great results out of the stock UE3 forward lighting pipeline, but it was also very time-consuming for artists. Many flags and settings had to be tweaked per-light, per-primitive, or per-material. Irrational's level design was very iterative. Levels would be built and re-built to pre-alpha quality many, many times, and big changes were made as late as possible. As a consequence, the amount of time we had to bring a level from pre-alpha to shipping quality was generally very short, and without a streamlined lighting pipeline this would have been very difficult to accomplish.<br />
<br />
Finally, all of this had to perform well on all of our platforms.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw5ar1uxk5mC-R-D9Kdd8G_tSavza0lVkahaecW9MjbzHRmUyyLL7ZiWdWLKd9nLgJpKqAll3oviz7uO7bEtY5xH-uFVf4KCotN-n-Kb6H8pPd9vpUhuj3WNnNuHay_ZNoMsWahYNUAEM/s1600/endresult.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw5ar1uxk5mC-R-D9Kdd8G_tSavza0lVkahaecW9MjbzHRmUyyLL7ZiWdWLKd9nLgJpKqAll3oviz7uO7bEtY5xH-uFVf4KCotN-n-Kb6H8pPd9vpUhuj3WNnNuHay_ZNoMsWahYNUAEM/s1600/endresult.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The end result</td></tr>
</tbody></table>
<br />
<br />
<b><span style="font-size: large;">Hybrid Lighting System</span></b><br />
<div>
The lighting system we came up with was a hybrid system between baked and dynamic lighting:</div>
<div>
<ul>
<li>Direct lighting was primarily dynamic</li>
<li>Indirect lighting was baked in lightmaps and light volumes</li>
<li>Shadows were a mixture of baked shadows and dynamic shadows</li>
<li>The system handled both stationary and moving primitives.</li>
</ul>
<h2>
</h2>
<b><span style="font-size: large;">Deferred Lighting</span></b><br />
Dynamic lighting was handled primarily with a <a href="http://www.realtimerendering.com/blog/deferred-lighting-approaches/">deferred lighting</a>/<a href="http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html">light pre-pass</a> renderer. This met our goals of high contrast/high saturation -- direct lighting baked into lightmaps tends to be flat, mostly because the specular approximations available were fairly limited. We went with the two-stage deferred lighting approach primarily because the information we needed for our BRDF and baked shadows would not fit in four render targets. We did not want to sacrifice UE3's <a href="http://udn.epicgames.com/Three/MaterialEditorUserGuide.html">per-pixel material parameterization</a>, so something like a material id system to compact the G-Buffers was out of the question. This of course meant two passes on the geometry instead of one, which we dealt with through parallel render dispatch, instancing, and clever art.</div>
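As a rough sketch of the frame flow this implies (purely conceptual -- the pass names and plumbing are illustrative, not Irrational's actual renderer code), a light pre-pass renderer touches the geometry twice, with a lighting accumulation step in between:

```cpp
#include <string>
#include <vector>

// Conceptual sketch of a light pre-pass frame. Each entry stands in for a
// full GPU pass; real code would record command buffers, not strings.
std::vector<std::string> RenderFrame() {
    std::vector<std::string> passes;
    // Geometry pass 1: write a thin G-Buffer (depth, normals, and whatever
    // else the BRDF and baked shadows need).
    passes.push_back("geometry: depth + normals");
    // Lighting: accumulate every dynamic light against that thin G-Buffer
    // into diffuse/specular lighting buffers.
    passes.push_back("lighting: accumulate diffuse/specular");
    // Geometry pass 2: render the geometry again, sampling the lighting
    // buffers and applying full per-pixel material parameters.
    passes.push_back("geometry: materials + lighting resolve");
    return passes;
}
```

The upside is that the lighting pass only needs the thin G-Buffer, so material parameterization stays fully flexible; the cost is the second geometry pass.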
<div>
<br /></div>
<div>
There's been a <a href="http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html">ton</a> <a href="http://www.realtimerendering.com/blog/deferred-lighting-approaches/">written</a> <a href="http://www.slideshare.net/blindrenderer/rendering-tech-of-space-marinekgc-2011">on</a> this <a href="http://www.crytek.com/download/A_bit_more_deferred_-_CryEngine3.ppt">technique</a>, so I'm just going to point out a few wrinkles about our approach.</div>
<div>
<br />
<a name='more'></a><br /></div>
<div>
We used separate specular and diffuse lighting buffers rather than do the combined trick <a href="http://www.crytek.com/download/A_bit_more_deferred_-_CryEngine3.ppt">Crytek</a> used. Aside from getting better results, this was cheaper on all of our platforms. Storing the specular luminance basically requires a FP16 buffer since we need an HDR alpha channel. With separate buffers we used the 10/11 bit FP formats on 360 and PC. We encoded to RGBM and blended in the pixel shader on the PS3. This ends up being equivalent bandwidth to a single FP16 buffer.</div>
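For the curious, an RGBM encode/decode pair looks roughly like this. This is an illustrative sketch, not our shipping shader code, and the range multiplier of 6.0 is an assumption (it's a tunable, and I'm not vouching for the exact value we used):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Illustrative RGBM encoding: RGB stores the color rescaled into [0,1],
// M stores a shared multiplier. Assumes a range multiplier of 6.0.
struct RGBM { float r, g, b, m; };

RGBM EncodeRGBM(float r, float g, float b, float range = 6.0f) {
    float maxC = std::max(std::max(r, g), std::max(b, 1e-6f));
    float m = std::min(maxC / range, 1.0f);
    m = std::ceil(m * 255.0f) / 255.0f; // quantize M upward so RGB stays <= 1
    float scale = 1.0f / (m * range);
    return { r * scale, g * scale, b * scale, m };
}

void DecodeRGBM(const RGBM& e, float range, float out[3]) {
    out[0] = e.r * e.m * range;
    out[1] = e.g * e.m * range;
    out[2] = e.b * e.m * range;
}
```

The key trick is quantizing M up before computing the RGB scale, so the stored RGB never exceeds 1.0 after the multiplier is rounded to 8 bits.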
<div>
<br /></div>
<div>
Doing a limited depth-only pre-pass was still a win on the consoles, but we disabled it on most PC hardware. We only rendered a subset of potential occluders in this pass. Primitives in the depth-only pass had to be static (no skinning), cover a reasonable screen area (nothing small), and require no state changes (simple opaque materials only). The player hands and gun were an exception to the "no skinning" rule, as they always covered a significant amount of screen space and needed to be masked out in stencil anyway. The extra pass was rendered in parallel and was really cheap to do, and on the consoles saved much more GPU than it cost.</div>
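Expressed as code, the occluder filter for the depth-only pass looked conceptually like this. The struct and names are made up for illustration (the real checks lived in UE3's primitive and material code), and the screen-area threshold is an arbitrary placeholder:

```cpp
#include <cassert>

// Hypothetical primitive description for illustration only.
struct PrimitiveDesc {
    bool  isSkinned;           // uses skeletal animation
    bool  isSimpleOpaque;      // opaque material requiring no state changes
    float estimatedScreenArea; // fraction of the screen covered, 0..1
    bool  isPlayerViewModel;   // hands/gun - the one skinned exception
};

// Mirrors the rules above: static, reasonably large, simple opaque
// materials only, with the player view model always included since it
// covers a lot of screen and needs a stencil mask anyway.
bool ShouldRenderInDepthPrepass(const PrimitiveDesc& p,
                                float minScreenArea = 0.01f) {
    if (p.isPlayerViewModel) return true;
    if (p.isSkinned)         return false;
    if (!p.isSimpleOpaque)   return false;
    return p.estimatedScreenArea >= minScreenArea;
}
```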
<div>
<br /></div>
<div>
We supported UE3-style <a href="http://udn.epicgames.com/Three/LightFunctions.html">light functions</a> in the deferred pass by compiling unique light projection shaders for light function materials. This was much cheaper than the stock implementation and our artists used these to great effect.</div>
<div>
<br /></div>
<div>
Finally, our G-Buffer contained the normals and glossiness, as is fairly standard, but we had a second buffer which contained baked shadow information. More on this later.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJejtqOGyvVFgffaJ4QkjnpLUQOL0a0EX04DtvvUVejEsXOgT-q3OdyMuxCMucqLpDNOX3VygAy73Q0vrNpsxpdECT7X2dUqraKots9Egsl6HTsvYsE9BFdbAn5D75uzrZvfbyCSIQPpE/s1600/depthsmall.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJejtqOGyvVFgffaJ4QkjnpLUQOL0a0EX04DtvvUVejEsXOgT-q3OdyMuxCMucqLpDNOX3VygAy73Q0vrNpsxpdECT7X2dUqraKots9Egsl6HTsvYsE9BFdbAn5D75uzrZvfbyCSIQPpE/s1600/depthsmall.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Depth</td></tr>
</tbody></table>
<br />
<div style="text-align: center;">
Imagine a normal/gloss buffer here (image lost due to technical difficulties)</div>
<div style="text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4c9HGnlZX8rGdzo722ot5UhTr-L_Q8gwFgjamwBSnjRbtgERdFSWjdYR_BgMRjqKtDmU373j6YqQWMvPceaDibi2shpXVO9JL5Y0kAomTfx7CmhvDMSHmgOAeFTIpTkZc6fNWHMrqycc/s1600/LightAttenuationSmall.bmp" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4c9HGnlZX8rGdzo722ot5UhTr-L_Q8gwFgjamwBSnjRbtgERdFSWjdYR_BgMRjqKtDmU373j6YqQWMvPceaDibi2shpXVO9JL5Y0kAomTfx7CmhvDMSHmgOAeFTIpTkZc6fNWHMrqycc/s1600/LightAttenuationSmall.bmp" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Light attenuation buffer (baked and dynamic shadows)</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoqMC6oOjEb9TEpeH0P9Af_fikDp8sNDqMLkRT8bpvS59o6tqBOlXuPIfGL0gy6u-8MjscbJjbX8iMWZYAFraoKY0lQP3s3sKRabHjR19VT_wp7L8cQEwOE41LylmiWYmeTnFBWMJAQ7o/s1600/diffuselightingsmall.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoqMC6oOjEb9TEpeH0P9Af_fikDp8sNDqMLkRT8bpvS59o6tqBOlXuPIfGL0gy6u-8MjscbJjbX8iMWZYAFraoKY0lQP3s3sKRabHjR19VT_wp7L8cQEwOE41LylmiWYmeTnFBWMJAQ7o/s1600/diffuselightingsmall.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Diffuse lighting</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjplR2tCqJdwEIfzCBFrdV0W5V887bsud_wFMNhNayVYFcVC_9uzYyiZFWxDfcwqCJdSSM9U9Fp-PKzTFdk7QJKVwMDKz1IyDm7YTQq78P67J78LiUrwRVxvKWBuZqIxnxPsDvAJ9lVeKU/s1600/specularlightingsmall.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjplR2tCqJdwEIfzCBFrdV0W5V887bsud_wFMNhNayVYFcVC_9uzYyiZFWxDfcwqCJdSSM9U9Fp-PKzTFdk7QJKVwMDKz1IyDm7YTQq78P67J78LiUrwRVxvKWBuZqIxnxPsDvAJ9lVeKU/s1600/specularlightingsmall.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Specular lighting</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfPtBQ0KRdeS6GDdeTkA3Ta-rNWmZ1xhOb1b3tf-saTWjEAy7ObX3ETPgyTfGUkvoGx7l3Wz3BiCi-1CN3rc4v9QhdWh1lPehsqnX04P8YkOCbm7XY4ldN_Z8L_FoG_PqE5SDnZIwKLcQ/s1600/colorsmall.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfPtBQ0KRdeS6GDdeTkA3Ta-rNWmZ1xhOb1b3tf-saTWjEAy7ObX3ETPgyTfGUkvoGx7l3Wz3BiCi-1CN3rc4v9QhdWh1lPehsqnX04P8YkOCbm7XY4ldN_Z8L_FoG_PqE5SDnZIwKLcQ/s1600/colorsmall.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">End scene color before post</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<b><span style="font-size: large;">"Physically influenced" BRDF</span></b><br />
Our first BRDF was the legacy Phong model, which has been used in games for ages. When we were putting together our <a href="http://www.youtube.com/watch?v=GXtEAED3zgo">first demo</a>, we had a lot of trouble making materials that looked good in both bright and dark areas, which resulted in a ton of hacks and tweaking per-primitive and per-material.<br />
<div>
<br />
We modified our BRDF to help solve this mid-project. It sounds crazy but the artists were willing. They didn't like having to tweak materials per-primitive in the world and knew it would be impossible to deliver on our quality goals if that state of affairs continued.</div>
<div>
<br /></div>
<div>
The new model used <a href="http://www.rorydriscoll.com/2009/01/25/energy-conservation-in-games/">energy-conserving Phong</a>, switched to using gloss maps, and added <a href="http://renderwonk.com/publications/s2010-shading-course/hoffman/s2010_physically_based_shading_hoffman_b_notes.pdf">environmental specular with IBL</a>. For IBL, artists would place env spec probes throughout the level with volumes which determined their area of effect, and the lighting build generated pre-filtered cubemaps. We used <a href="http://seblagarde.wordpress.com/2012/06/10/amd-cubemapgen-for-physically-based-rendering/">S&eacute;bastien Lagarde's modified AMD cubemapgen</a> to filter the cubemaps. Most primitives used a single probe for their spec, but we also supported blending between two probes for certain primitives such as the player gun to avoid popping when transitioning between cube probes.</div>
<div>
<br /></div>
<div>
For efficiency our geometric term was set to cancel out with the divisor.</div>
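As a concrete sketch, the energy-conserving Phong lobe from the Driscoll article linked above looks like this (illustrative code, not our shipping shader):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Energy-conserving Phong: the (n + 2) / (2*pi) normalization keeps the
// lobe's total reflected energy constant as the specular power n rises,
// so tighter highlights get brighter instead of dimmer.
float PhongSpecular(float rDotV, float specPower) {
    const float kPi = 3.14159265358979f;
    float norm = (specPower + 2.0f) / (2.0f * kPi);
    return norm * std::pow(std::max(rDotV, 0.0f), specPower);
}
```

With this normalization, an artist's gloss map changes the shape of the highlight without also changing the material's apparent overall brightness, which was exactly the per-material consistency we were after.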
<div>
<br /></div>
<div>
<div>
We experimented with switching to a more physically plausible NDF such as Blinn-Phong, but too much content had been built assuming a Phong lobe, and it would have made the transition to the new model too difficult.</div>
<div>
</div>
</div>
<div>
<br />
We could not afford to do per-light Fresnel, but added a material option to use N dot V Fresnel for both env spec and analytic spec. This isn't right, but I'm pretty sure a few other games have done it; unfortunately I can not find the links.<br />
<br /></div>
<div>
</div>
<div>
I would not in a million years say what we did was physically based shading (hence "influenced"), but we did use many of the ideas even if they were applied in an ad hoc fashion. We did get a lot more predictability and consistency of material response in different lighting environments, which was the goal, and did it in a way that minimized the transition pain for a project already in development. If I could have done it all over again, I would have concentrated on the BRDF much, much earlier and used more systemic reference.</div>
<div>
<br /></div>
<b><span style="font-size: large;">Baked Shadows</span></b><br />
UE3 had a built-in <a href="http://udn.epicgames.com/Three/ShadowingReference.html">baked shadow system</a>, but it had some limitations. "Toggleable" lights can't move but they can change brightness and color. The system could bake the occlusion information to a texture for a given light by projecting the shadow into texture space using a unique UV mapping for each primitive. Each primitive-light combination required a unique texture in an atlas. The more primitive-light interactions you had, the more the memory used by these textures would grow.<br />
<br />
We came up with a system that supported baked shadows but put a fixed upper bound on the storage required for baked shadows. The key observation was that if two lights do not overlap in 3D space, they will never overlap in texture space.<br />
<br />
We made a graph of lights and their overlaps. Lights were the vertices in the graph and the edges were present if two lights' falloff shapes overlapped in 3D space. We could then use this graph to do a vertex coloring to assign one of four shadow channels (R,G,B,A) to each light. Overlapping lights would be placed in different channels, but lights which did not overlap could reuse the same channel.<br />
<br />
This allowed us to pack a theoretically infinite number of lights in a single baked shadow texture as long as the graph was 4-colorable. I explained this to artists as "no light can overlap with more than three other lights". Packing non-overlapping lights into the same channel is useful for large surfaces such as floors or hallways. The shadow data was stored in either a DXT1 or DXT5 texture, depending on how many shadow channels were allocated for a primitive, and packed into an atlas. Baked shadows were stored in gamma space rather than linear, as we found this produced much better results; storing in linear resulted in banding in the shadows.<br />
<br />
During rendering we would un-pack the data into the proper global channels, either using texture remap hardware on the consoles or a matrix multiply on the PC. The global shadow channels were rendered during the G-Buffer pass into a light attenuation buffer. Dynamic shadows from toggleable lights would be projected into this buffer using a MIN blend (since this is just storing obscurance, you want the more obscured value). When projecting lights, each light would sample the light attenuation buffer and do a dot product with a shadow channel mask to attain its appropriate shadowing value.<br />
<br />
Some notes on this approach:<br />
<br />
<ol>
<li>Vertex coloring of an arbitrary graph is NP-complete. We used an incremental greedy approximation with a couple of heuristics: first, directional lights had priority for their assigned channel over any other light type; second, if a light already had a shadow channel assigned, we preferred to keep it rather than reassign it.</li>
<li>Because our shadow channel assignment was incremental, we could give artists instant feedback in the editor when they had too many overlaps. </li>
<li>Point/Point and Point/Spot overlap detection is trivial, but for Spot/Spot we generated convex hulls that approximated the spotlight falloff shape and did a convex/convex intersection.</li>
<li>Compression artifacts can occur due to packing independent channels into DXT colors, but in practice this didn't affect the final image much as it was mitigated by the inherent noise in our normal maps and diffuse maps.</li>
<li>The sampling rate used for projecting the shadow in the texture can cause data to overlap when two lights' falloff shapes are close to each other but do not touch. In practice this does not cause an issue because the two lights are generally already attenuated by their falloff in the overlapping areas.</li>
<li>When projecting dynamic shadows on top of the baked shadows, it is important to clip the shadows to the falloff of the light because the shadow projection is a frustum that may go outside of the light's falloff boundary, which can cause incorrect shadowing on nearby lights sharing the same channel.</li>
</ol>
<b><span style="font-size: large;">Baked Shadows on Dynamic Primitives</span></b><br />
One problem with baked shadows was handling static primitives casting shadows on dynamic primitives. In stock UE3, the solution was "preshadowing" which did a dynamic projection of the static caster onto the dynamic primitive, but masked to the dynamic primitive via stencil. This was not sufficient for our needs as the whole point of baked shadows is to avoid the cost of projecting dynamic shadows from static geometry.<br />
<div>
<br /></div>
<div>
Our solution was to bake low-frequency shadowing data from static primitives into a volume texture. These volume textures were streamed into a virtual texture which surrounded the camera. Because we had a global shadow channel assigned per light, we knew that a light's baked shadow data would not conflict with any other lights' shadowing information. </div>
<div>
<br /></div>
<div>
Dynamic primitives just needed to do a single volume texture tap to get their shadowing information during the G-Buffer pass, and wrote it into the light attenuation buffer.</div>
<div>
<br /></div>
<div>
As the camera moved through the world, we streamed in chunks of shadowing information into a single volume texture representing the shadowing information near the camera. We used UVW wrap mode to avoid having to actually shift these chunks around in memory - imagine a 2D tile scroller but in 3D. Far Cry 3 independently developed a similar scheme for moving a virtual volume texture around the world for their <a href="http://www.gdcvault.com/play/1015326/Deferred-Radiance-Transfer-Volumes-Global">deferred radiance transfer volumes</a>, and have a pretty good explanation of the technique.</div>
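The wrap addressing is easiest to see in code. This is a per-lookup sketch of the idea (the real system streams whole chunks and lets the hardware UVW wrap mode do the work; names and units are illustrative):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Map a world position into a wrapping volume texture centered near the
// camera. Because coordinates wrap into [0,1), newly streamed chunks can
// overwrite stale ones in place - nothing ever moves in memory, like a
// 2D tile scroller extended to 3D.
Vec3 WorldToWrappedUVW(const Vec3& worldPos, const Vec3& worldOrigin,
                       const Vec3& volumeSize) {
    auto wrap = [](float p, float o, float s) {
        float u = (p - o) / s;    // unwrapped volume coordinate
        return u - std::floor(u); // wrap into [0,1), as GPU wrap mode does
    };
    return { wrap(worldPos.x, worldOrigin.x, volumeSize.x),
             wrap(worldPos.y, worldOrigin.y, volumeSize.y),
             wrap(worldPos.z, worldOrigin.z, volumeSize.z) };
}
```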
<div>
<br /></div>
<div>
For objects far away from the camera, we kept around an in-memory "average" shadowing volume texture that covered the entire map. To reduce memory consumption, this data was kept ZLIB compressed in memory and sampled once per primitive on the CPU in a bulk job that ran in parallel.</div>
<div>
<br /></div>
<b><span style="font-size: large;">Indirect Lighting</span></b><br />
We stored indirect lighting from "toggleable" lights in lightmaps and light volumes. This could be disabled for certain toggleable lights if their color or brightness was going to be radically modified at runtime. Some fill lights had both their direct and indirect contributions baked, to give artists flexibility in areas that had too many overlapping direct lights to make them all "toggleable", or where they needed extra performance.<br />
<div>
<br /></div>
<div>
</div>
<div>
For static primitives indirect lighting, we used UE3's stock <a href="http://udn.epicgames.com/Three/LightingReference.html#Lightmaps">lightmaps</a> pretty much unmodified, except we generated them with <a href="http://gameware.autodesk.com/beast">Autodesk Beast</a>. UE3's Lightmass GI solver did not exist when we started the project.</div>
<div>
<br /></div>
<div>
For dynamic primitives, we used a similar scheme to baked shadows by storing baked lighting in a volume texture that was streamed around the camera. Light volumes were encoded as <a href="http://www.cs.columbia.edu/~cs4162/slides/spherical-harmonic-lighting.pdf">spherical harmonics</a>. We used a heavily compressed encoding scheme that used two DXT3 textures to store the constant and linear bands of HDR spherical harmonic data. The constant band was stored as RGBM in a single texture. The linear band was stored as a direction in RGB and a scale in alpha in another DXT3. We used DXT3 rather than DXT5 for predictable quantization of the scale terms; we found this led to much less error when very bright samples were next to dark ones.</div>
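A sketch of the direction-plus-scale packing for the linear band follows. One caveat: the post above doesn't spell out how the three color channels share a single stored direction, so treat the single-vector reduction here as an assumption for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Illustrative packing of a linear SH band as direction (RGB) + scale (A).
// maxScale is the HDR range the alpha channel is normalized against.
struct LinearBand { float dirX, dirY, dirZ; float scale; };

LinearBand EncodeLinearSH(float x, float y, float z, float maxScale) {
    float len = std::sqrt(x * x + y * y + z * z);
    if (len < 1e-6f) return { 0.5f, 0.5f, 0.5f, 0.0f };
    // Direction remapped from [-1,1] to [0,1] for storage in RGB.
    return { x / len * 0.5f + 0.5f,
             y / len * 0.5f + 0.5f,
             z / len * 0.5f + 0.5f,
             std::min(len / maxScale, 1.0f) };
}

void DecodeLinearSH(const LinearBand& e, float maxScale, float out[3]) {
    float s = e.scale * maxScale;
    out[0] = (e.dirX * 2.0f - 1.0f) * s;
    out[1] = (e.dirY * 2.0f - 1.0f) * s;
    out[2] = (e.dirZ * 2.0f - 1.0f) * s;
}
```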
<div>
<br /></div>
<div>
The biggest problems were lighting bleeding through surfaces because of the low sampling frequency of the light volumes. This was mitigated by the fact that we primarily used the volumes on dynamic primitives, which did not take part in the GI solution, so there were no problems from self-occlusion. Additionally, when generating the volumes we biased samples above floors (i.e. we bled light from above a floor to beneath it rather than the other way around). </div>
<div>
<br /></div>
<div>
One particular challenge was doors. Our doors were generally closed except for brief moments, and for lighting build purposes we had a static placeholder in the doorway to prevent light from bouncing between rooms. In game though, the door was a dynamic primitive. This meant it often got the indirect lighting from either one room or the other, depending on where it fell in the light volume sampling grid. One solution I considered was pushing out the light volume sample along the geometry normal, similar to <a href="http://www.crytek.com/download/Light_Propagation_Volumes.pdf">Crytek</a>. In the end, it was easier for me to just generate a lightmap for the door mesh since I knew it was mostly going to remain closed. Since direct lighting was dynamic and the door's shadow was dynamic, you still got proper runtime shadowing when the door opened, but indirect lighting would be baked.</div>
<div>
<br /></div>
<b><span style="font-size: large;">Translucent Lighting</span></b><br />
Translucent lighting is always the bane of deferred rendering and we were no different. I considered using <a href="http://dl.acm.org/citation.cfm?id=1581073.1581080">inferred lighting</a>, but my prototypes showed the reduction of lighting resolution with even just one layer of translucency was unacceptable for our use cases. I did not want to maintain a separate forward path, as we didn't have the resources.<br />
<div>
<br /></div>
<div>
The solution we used came out of the prototype I had done for <a href="http://solid-angle.blogspot.com/2009/12/screen-space-spherical-harmonic.html">screen space spherical harmonic lighting</a>. The basic idea was to do something similar to UE3's <a href="https://udn.epicgames.com/Three/LightEnvironments.html">lighting environments</a>, but completely on the GPU. Bungie's Destiny has developed a <a href="http://advances.realtimerendering.com/s2013/Tatarchuk-Destiny-SIGGRAPH2013.pdf">similar translucent lighting approach</a>.</div>
<div>
<br /></div>
<div>
We had three 96x32 FP16 render targets (3072 light environment samples) which would accumulate the lighting in 2-band SH in GPU light environments. Primitives would be assigned a pixel, and write their view space position into another FP16 texture. Each frame we'd project all the visible lights into SH and accumulate them into these render targets. This projection would use the baked shadow volume for shadowing from static primitives. We didn't support dynamic shadows on translucency, although the technique doesn't preclude it. Light volumes would also be projected and accumulated into these render targets. </div>
<div>
<br /></div>
<div>
When a translucent primitive was rendered, it would sample the appropriate pixel for its lighting environment. Even though we were sampling the same pixels over and over, on console we found it was actually faster to have a CPU job convert the SH textures into shader constants after the GPU was done producing them.</div>
<div>
<br /></div>
<div>
The GPU light environments had varying quality levels. The lowest was nondirectional lighting, which used only the constant band but was very fast. At the highest, we would generate a runtime lightmap by rendering a primitive's positions into UV space and allocating a small area of the SH textures as a lightmap. This was mostly used on large water sheets and other large translucent objects.</div>
<div>
<br /></div>
<div>
Our GPU light environments were useful for skin and hair rendering. For skin, we rendered both the standard deferred lighting and a GPU light environment. In the second pass which applied the deferred lighting, we took the GPU light environment, windowed the SH, multiplied it by the transmission color (reddish for skin), and blended it with the deferred lighting. This in effect was a very cheap approximation to blurring the lighting. For hair, we split the light environment into direct and indirect components, and <a href="http://www.ppsloan.org/publications/StupidSH36.pdf">extracted a directional light</a> for each. We then used a hair spec model loosely based on this <a href="http://developer.amd.com/wordpress/media/2012/10/Scheuermann_HairRendering.pdf">AMD presentation</a>.</div>
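A rough sketch of the directional-light extraction: the luminance-weighted linear band gives a dominant direction, and each color channel's linear band is projected onto it for intensity. See Sloan's "Stupid SH Tricks" (linked above) for the proper least-squares constants; the luminance weights, coefficient ordering, and missing scale factors here are simplifying assumptions:

```cpp
#include <cassert>
#include <cmath>

// [band coefficient][rgb]: c[0] = constant band, c[1..3] = linear band
// (x, y, z ordering is an assumption for this sketch).
struct SH2 { float c[4][3]; };

void ExtractDirectionalLight(const SH2& sh, float outDir[3], float outColor[3]) {
    const float lw[3] = { 0.3f, 0.59f, 0.11f }; // placeholder luminance weights
    float d[3] = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 3; ++i)
        for (int ch = 0; ch < 3; ++ch)
            d[i] += sh.c[i + 1][ch] * lw[ch];
    float len = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
    if (len < 1e-6f) {
        outDir[0] = 0.0f; outDir[1] = 0.0f; outDir[2] = 1.0f;
    } else {
        for (int i = 0; i < 3; ++i) outDir[i] = d[i] / len;
    }
    // Per-channel intensity: project that channel's linear band onto the
    // dominant direction (a least-squares fit would rescale this).
    for (int ch = 0; ch < 3; ++ch)
        outColor[ch] = sh.c[1][ch] * outDir[0] +
                       sh.c[2][ch] * outDir[1] +
                       sh.c[3][ch] * outDir[2];
}
```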
<div>
<br /></div>
<b><span style="font-size: large;">Miscellaneous</span></b><br />
A few things that don't warrant their own section:<br />
<div>
<ul>
<li>For SSAO, on consoles and low end PC we used <a href="http://advances.realtimerendering.com/s2010/index.html">Toy Story 3's line integral approach</a>. On high end PC we used <a href="http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/">HDAO</a>. </li>
<li>For fog, we used <a href="http://www.iquilezles.org/www/articles/fog/fog.htm">exponential height fog</a>. Fog settings were put into UE3's post process volume system, so artists could preview and tweak fog per-area easily.</li>
<li>We also placed the main directional light's settings in the post process volume system. Artists would turn the sunlight off via the post process volumes when the player was in a fully interior area, which is a simple but very effective optimization.</li>
<li>We rewrote UE3's stock <a href="http://udn.epicgames.com/Three/ContentBlogArchive.html#Light Shafts">light shafts</a>. <a href="http://twitter.com/mathi_n">Mathi Nagarajan</a> came up with an optimization to do the light shaft radial blur in two passes - a coarse pass and a refinement pass. This allowed us to get many more effective samples, making it practical to use them all the time on console. It does suffer from some stippling when a light source is near the edge of the screen and the viewer is at certain angles. On high end PC we increased the number of samples which handles most of these cases. In hindsight we should have tried increasing the number of samples as a light source got close to the edge of the screen, even on console.</li>
<li>Dynamic shadows were projected at a lower resolution than the screen, using a small number of taps (4 on 360, 4 HW PCF on PS3, and 8 on PC). We then blurred (edge-aware) and up-sampled using a bilateral filter. Even though edge-aware blurs and bilateral filters are not separable, we implemented it as separable after reading this <a href="http://elynxsdk.free.fr/ext-docs/Bilateral/ICME2005_TPLV.pdf">paper</a>, and it worked pretty well.</li>
</ul>
</div>
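The exponential height fog mentioned in the list above follows the closed-form integration from Inigo Quilez's article. Conceptually it looks like this (parameter names are mine, not the shipping code's):

```cpp
#include <cassert>
#include <cmath>

// Fog amount along a view ray through a medium whose density falls off
// exponentially with height: density(y) = density * exp(-falloff * y).
// The integral along the ray has a closed form, so no ray marching needed.
float HeightFogAmount(float density, float falloff,
                      float rayOriginY, float rayDirY, float distance) {
    // Near-horizontal rays: fall back to constant density at the origin
    // height to avoid dividing by a tiny rayDirY.
    if (std::fabs(rayDirY) < 1e-4f)
        return 1.0f - std::exp(-distance * density *
                               std::exp(-falloff * rayOriginY));
    float fogIntegral = (density / falloff) *
                        std::exp(-falloff * rayOriginY) *
                        (1.0f - std::exp(-falloff * rayDirY * distance)) /
                        rayDirY;
    return 1.0f - std::exp(-fogIntegral);
}
```

Because the whole thing reduces to a couple of exponentials per pixel, it is cheap enough that exposing the parameters through the post process volume system for per-area tweaking costs essentially nothing.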
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Acknowledgements</span></b><br />
Big games are always a collaborative undertaking, and Infinite was no exception. Toward the end of the project we probably had 5-6 programmers doing rendering work. I can't list them all here but did want to call out a few people specifically. <a href="http://twitter.com/mathi_n">Mathi Nagarajan</a> was an exclusive contractor to Irrational who was on for the bulk of the project, and a key contributor. <a href="http://www.irongalaxystudios.com/">Iron Galaxy</a> did a lot of platform optimization and bug-fixing, particularly for PS3.<br />
<br />
On the art side there were <a href="http://www.charlesbradbury.com/">so</a> <a href="http://www.gavimage.com/">many</a> <a href="http://cbrait.com/">awesome</a> <a href="http://t.co/L8qwJGMSk7">and</a> <a href="http://www.3dpaul.com/">talented</a> <a href="http://www.laurazimmermann.com/">artists</a> <a href="http://www.hung3d.com/">who</a> <a href="http://www.sincspace.com/">really</a> <a href="http://www.mikesnight.com/">made</a> <a href="http://kotaku.com/in-samus-we-trust-1531088587">the</a> <a href="http://vimeo.com/87484778">game</a> <a href="http://petepaquette.com/2014/02/22/bioshock-infinite-elizabeth-in-game-cinematic-reel-spoilers/">shine</a>, but I want to call out two in particular who worked with me closely on the tech side. <a href="https://twitter.com/speuzer">Spencer Luebbert</a> was our Tech Artist who iterated with me closely on many key lighting features, and did an excellent job of documenting and educating the rest of the team how to get the best out of the engine. <a href="https://twitter.com/Stephen_FX">Stephen Alexander</a> was our Lead Effects Artist but also often pushed the engine to its limits to do things I didn't even think possible.<br />
<br />
Finally, I want to thank all the people who wrote an article, blog post, presentation, or paper about rendering techniques. This sharing was a big help to everything we did, and this entry is only a small down payment on the debt I owe.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com11tag:blogger.com,1999:blog-263327529581652713.post-20819145963289502252013-07-07T10:10:00.002-05:002013-07-07T10:41:27.759-05:00Code reviewsAfter reading Aras' <a href="http://aras-p.info/blog/2013/07/07/reviewing-all-the-code/">Reviewing ALL the CODE</a> entry, I was going to reply describing our process, but it was getting long so I decided to write it up here.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbmBoHGx8T0bRdHpMK0mK56mHJPFUIrGqKZoBjGeOQbNvwxXQp6bEqvYFYWmt2pTTfiHYbDgaJZOBh7WrVWKtZ4XGsqOisjUfUA8u6TCZz16qKcj4S8CGs8Z_LUR8djXP-NbttF_oNIks/s1600/allthecode.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbmBoHGx8T0bRdHpMK0mK56mHJPFUIrGqKZoBjGeOQbNvwxXQp6bEqvYFYWmt2pTTfiHYbDgaJZOBh7WrVWKtZ4XGsqOisjUfUA8u6TCZz16qKcj4S8CGs8Z_LUR8djXP-NbttF_oNIks/s1600/allthecode.jpg" /></a></div>
<br />
Our process is a little more lo-fi but effective. We use Perforce's <a href="http://public.perforce.com/wiki/P4Review">review daemon</a>, and as part of programmer orientation we set new programmers up to subscribe to the source code folder and set up an Outlook filter. That's right, every programmer on the team has a stream of emails for every changelist.<br />
<br />
The emails are set up with the first line of the changelist description in the subject and the email of the changelist author in the reply-to field. The body of the email contains the full changelist comment and diffs of the change up to a certain size to avoid flooding our email system.<br />
<br />
Reviews are handled by replying to the email, and cc'ing a code reviews email list which goes to everyone and is archived. This is so everyone gets the benefit of subsequent discussion.<br />
<br />
We have a few senior engineers who at least read every changelist comment. Personally, I find it is something useful to do while waiting the couple minutes for a big compile to finish. But looking at our code reviews email list, quite a few programmers scan at least some of the changelists, usually looking for changes in code they are most familiar with.<br />
<br />
This only works if you enforce meaningful changelist comments. "Fixed bug in renderer" would not be an acceptable changelist comment and would garner a review email asking for a better one. A changelist comment should describe the problem being solved and how it was solved. It doesn't need to be a novel, but it should contain enough information that someone going back to this changelist 6 months from now can understand what was done and why.<br />
<br />
I've worked on teams where this was the primary code review process, although currently we use it as a second line of defense - each changelist requires a primary reviewer.<br />
<br />
We catch all sorts of issues with this review process - from minor issues such as unclear code all the way to major bugs missed by both the author and the primary reviewer. This is a good method for orienting new programmers to the code base, teaching "code base lore", or pointing out bad naming. One of the things I particularly look for is badly named functions or variables - code without short, concise and meaningful names is usually an indicator of a larger problem. I could do a whole entry on names.<br />
<br />
Beyond the day to day issues, I also find it is a useful toward improving the quality of the code base as a whole. If you see the same mistakes being made or people having trouble with a particular system, it gets you thinking about ways to prevent those mistakes, or make a system easier to use.<br />
<div>
<br /></div>
All in all, I find it useful for getting a feel for how the team as a whole is operating and also learning about parts of the code base you might not normally delve into. It's difficult to give people advice on how to solve problems if you don't have at least a cursory understanding of what they are working on. It is also low ceremony and a way to communicate what's going on across largish teams. If you're not doing it, give it a try.<br />
<br />
<br />Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-2058405996965815052011-02-12T13:00:00.003-06:002011-02-12T14:07:54.232-06:00Virtual Addressing 101If you haven't read Steven Tovey's excellent <a href="http://altdevblogaday.com/42697297">article on alternatives to new and malloc</a>, you should. I'll wait.<br />
<br />
All done? Good. One topic that was beyond the scope of that article is virtual addressing. Understanding virtual addressing is important to anyone implementing memory management on modern hardware. The PC and both next-gen consoles provide facilities for virtual address management, and it is important to understand the benefits and trade-offs of these facilities when doing memory management.<br />
<br />
I am going to simplify many of the details and present a more abstracted view of some made-up hardware. A full discussion of virtual address handling specific to an architecture would be beyond the scope of this entry. The specific details of hardware and OS virtual addressing vary between different architectures, and even different processor generations within the same architecture. In practice, it is always important to read your processor and OS manuals to understand the specific implementation you are working with.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Physical Addressing</span></b><br />
Often we like to think of memory in a machine as one big array, somewhat like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_LyNb2xNwD721SFRZnCuPO5gEJHhzsnldRuxxz63cnagD85ZeCXYPI3Ma22MCbHS39osp8jl6W41BMqysVnZcOLwZFvaUNGTrXHUd9PuHqPXifc5Z3X_2OD1BKzc7EOCvAe1agZgvQQY/s1600/Untitleddrawing+%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_LyNb2xNwD721SFRZnCuPO5gEJHhzsnldRuxxz63cnagD85ZeCXYPI3Ma22MCbHS39osp8jl6W41BMqysVnZcOLwZFvaUNGTrXHUd9PuHqPXifc5Z3X_2OD1BKzc7EOCvAe1agZgvQQY/s320/Untitleddrawing+%25281%2529.png" width="320" /></a></div>This is the physical memory map of the Solid Angle PlayBox, a console so spectacularly unsuccessful you probably have never heard of it (or it may just be the fact I made it up). It has 256 MB of memory, physically addressed from 0x0 to 0x10000000.<br />
<br />
Real hardware doesn't necessarily have one big contiguous lump of physical address space, or may have different physical address ranges mapping to the same memory, with different cache behavior. But again, we're trying to simplify things here.<br />
<br />
So this seems simple enough, but there is a problem: fragmentation. There are actually two types of fragmentation, and it is important to know the difference.<br />
<br />
<b>External Fragmentation</b><br />
<b></b>When you hear the unqualified word "fragmentation", most often what is being referred to is <i>external fragmentation</i>. External fragmentation occurs when memory has been partitioned into small, non-contiguous chunks, such that while the total amount of free memory is large enough for a big allocation, you can't actually fit it anywhere.<br />
<br />
A simple example, using a <a href="http://www.memorymanagement.org/articles/alloc.html#first.fit">first-fit heap</a>. Say someone wrote loading code and didn't really consider memory management while doing so (tsk tsk!). This loading code starts by allocating a large temporary buffer for streaming:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJXeiHvxCY-xAO4xIvgzH_7vEm2E2NrKmSYQflGpoL3siLvHUBk0bj9sqSsqCqSgWRAnUCxRtcjRlfTE8uqnwGLHX6wPnJdHhXwltwtUPHk4nP_jQ5FyZoChcWjw3TyKThbeffYmcwJuQ/s1600/Untitleddrawing+%25283%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJXeiHvxCY-xAO4xIvgzH_7vEm2E2NrKmSYQflGpoL3siLvHUBk0bj9sqSsqCqSgWRAnUCxRtcjRlfTE8uqnwGLHX6wPnJdHhXwltwtUPHk4nP_jQ5FyZoChcWjw3TyKThbeffYmcwJuQ/s320/Untitleddrawing+%25283%2529.png" width="320" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"></div>Then the loading code reads into the temp buffer, and creates a bunch of permanent data structures.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZtma8mqz9YCd7gdxuXXTklsqm_HISY_l_XwlKZ_W8QgrwwFPIxwC4ySq5_HtysJIZuwXRuTYA64PfBufH_hwqm5HXh3mwjesf_oodNbHL75dvkaF-yqweq_JKN2tsAwfzUGuP5NuIMSA/s1600/Untitleddrawing+%25284%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZtma8mqz9YCd7gdxuXXTklsqm_HISY_l_XwlKZ_W8QgrwwFPIxwC4ySq5_HtysJIZuwXRuTYA64PfBufH_hwqm5HXh3mwjesf_oodNbHL75dvkaF-yqweq_JKN2tsAwfzUGuP5NuIMSA/s320/Untitleddrawing+%25284%2529.png" width="320" /></a></div>The loading code then frees the temporary buffer<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyidhI0KrSJsEAM7bxb2BuE9q7FOjhsR9bIutEvPY725jeNVYIJCe96BSC9md19aemKwg0UZv3nJGqw6J5da1IA74oGKIsK-LRnzT46Zh1s4RjD_fI_o1B2Gd9nfY-oz31MhD59pXFTCU/s1600/Memory.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyidhI0KrSJsEAM7bxb2BuE9q7FOjhsR9bIutEvPY725jeNVYIJCe96BSC9md19aemKwg0UZv3nJGqw6J5da1IA74oGKIsK-LRnzT46Zh1s4RjD_fI_o1B2Gd9nfY-oz31MhD59pXFTCU/s320/Memory.png" width="320" /></a></div>Repeated many times, with some varying temporary buffer sizes, we could end up with a heap like this:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhAbSn7M8AXf7NTp3VaU_RBJCLXmc0usoI6BTN0QJBVEvm9P_DLNbnWCk5MkoH7w9v_XLrk4sJgfdrnFAneG5-qtj_ET0XcNaHs3gJFLkEmtrch-pF5fUCok0N8n4HkAClf3Fs2q48nJg/s1600/Memory+%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhAbSn7M8AXf7NTp3VaU_RBJCLXmc0usoI6BTN0QJBVEvm9P_DLNbnWCk5MkoH7w9v_XLrk4sJgfdrnFAneG5-qtj_ET0XcNaHs3gJFLkEmtrch-pF5fUCok0N8n4HkAClf3Fs2q48nJg/s320/Memory+%25281%2529.png" width="320" /></a></div>Now a large allocation comes along, which we have enough memory for, but because memory is partitioned we do not have a large enough contiguous block to fit it. That is external fragmentation, it is fragmentation <i>external</i> to the allocations.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4EW8Gw0255K1eM3HoEEpt4SgcKF5N6eNF2QhmKFn387S5924MjIftTTVd3h_hYXVDsyPKCNS11V0h1cFpUPWdaOvmxCCMH6Nno8RIfbx6fwdz6UDCSpA2dNvymQ8hCTWVz16MX4Wqw-A/s1600/Memory+%25282%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="87" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4EW8Gw0255K1eM3HoEEpt4SgcKF5N6eNF2QhmKFn387S5924MjIftTTVd3h_hYXVDsyPKCNS11V0h1cFpUPWdaOvmxCCMH6Nno8RIfbx6fwdz6UDCSpA2dNvymQ8hCTWVz16MX4Wqw-A/s320/Memory+%25282%2529.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: left;"><b>Internal Fragmentation</b></div><div class="separator" style="clear: both; text-align: left;">Internal fragmentation is the type of fragmentation you don't hear about much, or if you do, it is not usually described as fragmentation. Internal fragmentation occurs when the size of the memory manager's internal allocation is larger than what the application actually requested. This is fragmentation <i>internal</i> to the allocations.</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: left;">An example can be found with fixed-size block allocators. Often you can have a system that makes many allocations, all slightly varying in size. One solution to this is to use a fixed-size block allocator that uses a block size larger than any of your potential allocations. This can lead to a situation where a small amount of memory is unused in each allocation:</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTWJLVEw6R5U-U9snYHAas9wZs3bUPoH_8GBtenh33mH2ziv1fzPDHOKoFK57EF9hq2jF8jtqzFkkjh-C5L9PO-MaSKRj0QtYK9_1w0D3uIeTCCKM8dbPaAMXNEHbu9sPXlj1rM_VDNEw/s1600/Memory+%25283%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="87" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTWJLVEw6R5U-U9snYHAas9wZs3bUPoH_8GBtenh33mH2ziv1fzPDHOKoFK57EF9hq2jF8jtqzFkkjh-C5L9PO-MaSKRj0QtYK9_1w0D3uIeTCCKM8dbPaAMXNEHbu9sPXlj1rM_VDNEw/s320/Memory+%25283%2529.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: left;">Internal fragmentation can occur with other allocators, such as the <a href="http://en.wikipedia.org/wiki/Buddy_memory_allocation">buddy system</a>.</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><b><span class="Apple-style-span" style="font-size: large;">Virtual Addressing</span></b><br />
Most programmers at some point have heard the phrase "All problems in computer science can be solved by another level of indirection", attributed to <a href="http://en.wikipedia.org/wiki/David_Wheeler_(computer_scientist)">David Wheeler</a>. Many haven't heard the corollary "...except for the problem of too many layers of indirection." This is a shame, because I think both together describe the condition of the modern programmer.<br />
<br />
Virtual addressing is the direct application of this idea -- instead of accessing memory through its physical address, we add a level of indirection and access it through a virtual address. This indirection is performed in the hardware, so it is mostly transparent to the programmer, and fast, with caveats. Virtual addressing can mitigate many fragmentation issues.<br />
<br />
First, an important public service announcement.<br />
<br />
<b>Virtual Addressing != Paging to hard drive</b><br />
Do not confuse virtual addressing with virtual memory management systems that may page data to the hard drive (such as Windows or Linux). I think these concepts sometimes become confused because many descriptions lump the two things together into a heading of "virtual memory." They are not the same thing -- paging systems are built on top of virtual addressing, but you do not need to page memory to the hard drive to reap the benefits of virtual addressing. You don't even need a hard drive!<br />
<br />
<b>Virtual Address Space</b><br />
Virtual addressing implementations are very specific to CPU architecture and OS, but they all share some common properties.<br />
<br />
They all have the concept of a virtual address space. The address space may be much larger than the physical memory of the machine -- for example, in our hypothetical console, we may have only 256 MB of physical memory, but with 32 bit pointers we have a 4 GB address space. In practice, architectures and OSes may limit the address space available to applications, either reserving address space for the kernel, or using portions of the address space for different types of memory access (such as non-cached reads/writes). On multi-process operating systems such as Windows or Linux, each process has its own address space.<br />
<br />
Address space is allocated independently from physical memory, and you do not have to have physical memory backing an address space allocation.<br />
<br />
The address space is divided into pages. Page sizes vary depending on architecture/OS, but common sizes are 4K, 64K, and 1 MB. Page sizes are always powers of two, as this simplifies the work of translating a virtual address into a physical one. A CPU/OS may only support a fixed page size, or may allow programmers to pick a page size when pages are allocated.<br />
<br />
<b>The Page Table</b><br />
Virtual addresses are translated into physical addresses via a page table. A page table is a simple mapping between a virtual page and a physical page. Going back to our hypothetical console, which has a page size of 64KB, a page table might look like this (again, real world implementations vary):<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP1i5fmcvIwGtek4u07mjfvtuII6KsqOEtPsUHjgwNCCo9oGAoZ7wTV2AYJazIAGCQ3Zj8nosDeBOUudRSyfE_HKF6LESVC7mMRN7PBJRtGQ8WKeV2FMYDYJnU_O5HOMioFozWz-XC_3U/s1600/VirtualAddress.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP1i5fmcvIwGtek4u07mjfvtuII6KsqOEtPsUHjgwNCCo9oGAoZ7wTV2AYJazIAGCQ3Zj8nosDeBOUudRSyfE_HKF6LESVC7mMRN7PBJRtGQ8WKeV2FMYDYJnU_O5HOMioFozWz-XC_3U/s320/VirtualAddress.png" width="320" /></a></div><br />
Each entry in the page table maps a virtual address page to a physical address page. A virtual address allocation may span multiple contiguous address pages, but does <b>not</b> require contiguous physical pages.<br />
<br />
When the CPU encounters an instruction which accesses a memory address, it must translate the virtual address into a physical address to know where the data is located in physical memory. With a 64KB page size, the upper 16 bits of a 32 bit address specify the page number, and the lower 16 bits the offset into the page. This is why page sizes are a power of 2 -- determining the page number becomes a simple bit mask and shift. The CPU looks up the virtual page entry in the page table, and finds the corresponding physical page number. This is done for <b>every </b>memory access.<br />
<br />
Because this operation happens for every memory access, it needs to be fast and implemented in hardware. There's only one problem: the page table is far too big to be stored on the CPU chip.<br />
<br />
<b>Translation Lookaside Buffers</b><br />
The solution is a special cache for address translation. Because the CPU can not fit the entire page table in on-chip memory, it uses a <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">translation lookaside buffer</a> (TLB), which is a special cache that holds the most recently used page table entries. TLBs can often hold enough page entries for a large amount of address space, usually larger than the amount of memory the L1 or L2 caches can hold.<br />
<br />
Back to our memory access scenario, when the CPU must translate a virtual page into a physical page, it first looks in the TLB. If the page table entry is found, the address translation happens very quickly and the CPU continues on its work. If there is a TLB miss, this can often mean a TLB miss handler is invoked. This is actually a <i>software handler</i> provided by the operating system, as the entire page table is managed by the OS, not the CPU. Thus, TLB misses can be very expensive.<br />
<br />
On most modern processors, the TLB is multi-level, similar to how L1 and L2 caches work. Thus the CPU may check a smaller, faster address translation cache before consulting the larger, slower TLB, before it resorts to the software handler of a full TLB miss.<br />
<br />
The expense of a TLB miss is another reason data locality is very important to performance. If you are hitting data structures willy-nilly across address space, then aside from the cache misses you will incur, you may incur a lot of TLB misses, too. This is a double whammy of not keeping data accesses local!<br />
<br />
<b>Memory Protection</b><br />
Most CPUs also add the capability to specify what kind of access to a page is allowed. Page table entries can be constructed which disallow writes, or disallow code execution on some architectures. The former can be used to make sure application-level code does not overwrite kernel data structures, and the latter can be used to help protect against buffer overrun attacks by not making it possible for the CPU to jump into data-only memory. When invalid accesses occur, a HW exception is raised.<br />
<br />
You can often specify the memory protection for a page with API calls, which can sometimes be useful for debugging tricky memory overwrite problems, by protecting pages against writes and writing a custom HW exception handler.<br />
<br />
Memory protection is also how OSes implement demand-paging of memory from the hard drive. When the OS moves a physical page of memory to the hard drive, it modifies the virtual page table entry to prevent reads and writes. If that page is accessed, a HW exception occurs which the OS handles by loading the appropriate data from the hard drive into a physical page, and setting the page table entry to point to that physical page. Execution of the program then continues from where the exception was fired.<br />
<b><br />
</b><br />
<b><span class="Apple-style-span" style="font-size: large;">Virtual Addressing-Aware Memory Management</span></b><br />
The presence of virtual addressing has a great impact on memory management. While it does not necessarily change the fundamental behavior of many allocator types, it is important to understand when physical memory is actually committed. Physical pages returned to the OS can be used to make up much larger, contiguous allocations, so at the system level, many problems with external fragmentation are severely reduced.<br />
<br />
<b>Direct Page Allocation for Large Blocks</b><br />
For large allocations (> page size), the best memory allocation strategy is sometimes to allocate virtual address space and physical pages directly from the operating system. These types of allocations are often rare, happening when loading data. The advantage is you will not suffer external fragmentation from this allocation strategy, as the OS can always remap physical pages to a contiguous virtual address space if they are available, even if they are not contiguous in physical address space.<br />
<br />
The trade-off for doing this is internal fragmentation. Your large allocation may not be an exact multiple of the page size, leading to wasted memory. First, you want to pick a good threshold for when to do direct page allocation -- this is not a good strategy for things that are not much larger than the page size. Wasted memory can also be mitigated by choosing an appropriate page size for the allocation on architectures that allow this. For example, where waste would be a significant percentage of the allocation, you may want to choose 4K pages rather than 64K pages. The trade-off here is that smaller pages mean many more TLB misses, which can hurt performance.<br />
<br />
<b>Stack Allocators</b><br />
One key thing with virtual addressing is you can allocate large regions of address space without committing physical memory to it. Stack allocators can be implemented by allocating a large region of address space, but only committing physical pages as the stack allocator pointer advances.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3UKYpT2kCBtKYWFB85b2_rxyQl-8UGYaljcYOt2bhqUbVgqakmVptp78D0gVTIWhxyFAz5_ydvvysAsCTqZwYRnSYY3arxQn0Bd1phFbLGkdM46O52OZMib0rQ4j96fjn9QOUS4ZgVgU/s1600/Memory+%25284%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="95" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3UKYpT2kCBtKYWFB85b2_rxyQl-8UGYaljcYOt2bhqUbVgqakmVptp78D0gVTIWhxyFAz5_ydvvysAsCTqZwYRnSYY3arxQn0Bd1phFbLGkdM46O52OZMib0rQ4j96fjn9QOUS4ZgVgU/s320/Memory+%25284%2529.png" width="320" /></a></div>The advantage here is you can choose a large maximum stack size without actually committing physical memory to it. While if you do hit the peak, those physical pages must come from somewhere, it allows for situations where your peak may be at a point where those pages are free from other systems (loading comes to mind).<br />
<br />
It should be noted that the C/C++ call stack on Windows works exactly like this - when you specify a stack size for an application, you are specifying the size of the address space allocation, not the physical allocation. As the stack grows, the runtime allocates physical pages. This is done transparently with a special page called a guard page, which triggers a HW exception when the code accesses it; the OS handler then allocates physical memory for that page and sets the next virtual page as the guard page.<br />
<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><b>Pooled allocators</b></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">For small allocations, fixed-size pools are often a good solution. Virtual addressing can allow us to have multiple pools of different sizes without fragmenting overall memory.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Basically, we implement our fixed-size pool as a linked list of mini-pools, each some multiple of the page size. On our hypothetical console, 64KB may be a good mini-pool size. If a mini-pool is full, we allocate another set of pages from the OS. If a mini-pool becomes empty, we return the page to the OS. Again, because physical pages do not need to be contiguous when mapped to virtual address pages, these freed pages can be used for any size of allocation, from anywhere in the system.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><b>General Advice</b></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">When dealing with virtual allocations, the general rule of thumb is "return physical pages to the operating system whenever you can." If a physical page is allocated but not being used, the OS can not use it for some other, larger allocation that may need to occur. The days of allocating an entire console's memory space in one block and managing it yourself are largely gone, unless you wish to write your own page allocator (which can and has been done). There are some caveats to this, such as with page allocation thrashing, and allocations that are required to be physically contiguous (see below).</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span class="Apple-style-span" style="font-size: large;">Virtual Addressing Problems</span></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><b>Physically Contiguous Requirements</b></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Your particular platform may require certain allocations be performed in contiguous physical memory, such as GPU resources. This is often the case on consoles. Virtual addressing only mitigates external fragmentation for virtual allocations -- for these physical allocations, you still have to deal with fragmentation at the physical page level. Often the way to handle this is to set aside memory for physical resources up front in your application, and manage them separately from your virtual allocations. </div><br />
<b><br />
</b><br />
<b>Page Allocation Thrashing</b><br />
Allocating virtual address space and committing physical pages are not cheap operations. Particularly with stack allocators and pools, you want to avoid thrashing -- cases where a repeated pattern of allocs/frees causes pages to be allocated and freed in rapid succession. This can be worked around by thresholding when you free a physical page to the OS - for example, with a pool, you may require that some percentage of the previous physical page be free before freeing the next, totally free one. Another strategy is to only free pages at specific, known points where the performance hit is predictable.<br />
<br />
<b>Page Size and TLB Misses</b><br />
Page size can have a very important impact on performance. On platforms which allow you to choose the page size when performing a virtual address space allocation, you want to pick the largest page size possible, as larger pages cause far fewer TLB misses. This is often a tricky balance between wasting memory due to internal fragmentation and losing performance due to TLB misses. As always, data locality helps to reduce TLB misses.<br />
<br />
<b>Page Size and Physical Fragmentation</b><br />
On platforms with variable page sizes, you can run into problems where you can not allocate a large page even though the memory is free. This is due to external fragmentation of the physical pages themselves - if you allocate a large number of 4K pages, free them, and then try to allocate a 1 MB page, there may not be enough contiguous physical memory to successfully allocate it. I've even seen some platforms that will not coalesce smaller pages into larger ones even if they are contiguous (i.e. once you've allocated physical memory as a 64KB page, it will never be coalesced into a 1 MB page). This can be mitigated similarly to physical allocation restrictions -- allocate your large pages up front, and write your own page allocator that your other allocators work on top of.<br />
<br />
<b>Address Space Fragmentation</b><br />
It is possible to fragment virtual address space itself. One should be careful of reserving too much virtual address space for things like stack allocators, or leaking address space. While on console the address space is many times larger than the physical memory, and thus usually has enough slack to make up for carelessness, on PC, particularly when writing tools in 32 bit, you can run into situations where you fragment the virtual address space itself.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Summary</span></b><br />
My hope is anyone reading this who did not have a good understanding of virtual addressing now understands a little better what is going on under the hood with memory management, at least at a basic level. As always, platform details differ, and if you are doing any kind of memory management work, you really should read the CPU and OS docs on memory management, virtual addressing, and the TLB for your specific platform.<br />
<br />
Even programmers who are not writing custom memory managers can benefit from understanding how virtual addressing works. Almost every memory access performs address translation -- and this translation is another important reason to keep data accesses local when designing data structures.<br />
<b></b>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com2tag:blogger.com,1999:blog-263327529581652713.post-82837478089346843942011-02-06T12:23:00.000-06:002011-02-06T12:23:25.602-06:00Lazy Logging Parameter Evaluation With Variadic Macros<div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">This entry is not rocket science, and probably won't be that informative to experienced programmers, but I've seen commercial code bases get something as simple as this wrong. It requires compiler support for variadic macros, which have been in Visual C++ for a while and are also supported by later versions of GCC. </div></div><div></div><br />
Most games have some sort of logging system. Debugging by printf is one of the first debugging tools most programmers learn. While there are many other tools in the debugging toolbox, this particular one is usually not that far out of reach. Some problems just lend themselves to being solved by logging.<br />
<div><br />
</div><div>We want to minimize the performance impact of logging code, without having to limit the number of logging statements we place in code. We do not want to constantly recompile different configurations of the game with or without logging enabled. While compile time stripping of logging during development will have the least performance impact, there are many times when you may be at a tester, designer or artist's desk and need to log key information. Providing them with a custom build is a productivity hit for everyone involved. </div><div><br />
</div><div>There are two main performance hits for logging:</div><div><br />
</div><div>1. The cost of the logging itself (writing to the debug window, to a file, to a console, etc)</div><div>2. The cost of parameter evaluation</div><div><br />
</div><div>Anyone who has put a log statement in a piece of code executed many times a frame knows it can absolutely kill performance, just by the fact that logging to a file or debug output can be time consuming itself. This first cost can be solved by a channel system that can selectively enable logging. Even beyond the performance cost, it is useful to enable different types of logging at different times. If you're debugging AI code, you are probably not interested in logging from the streaming system. Log statements specify which channel they are on (say, by integer ID), and the logging code checks if that channel is enabled.</div><div><br />
</div><div>Where should this check occur? I've seen some code bases that do this in the logging function itself. This is a mistake, because even if you do not actually output anything, you are still paying the second cost, the cost of parameter evaluation.<br />
<br />
Logging, by nature, is very string-intensive. Often you will output human-readable debug names for various assets and entities when logging. Strings computed as parameters to log statements often incur performance penalties - memory allocations, string operations, etc. In addition to string overhead, some information you may wish to log may not be particularly cheap to calculate.<br />
<br />
<span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family: Georgia, 'Times New Roman', Times, serif; font-size: 11px;">float ThisFunctionIsVeryExpensiveToEvaluate()</span><span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family: Georgia, 'Times New Roman', Times, serif; font-size: 11px;"><br />
</span><span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family: Georgia, 'Times New Roman', Times, serif; font-size: 11px;"><br />
</span><span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family: Georgia, 'Times New Roman', Times, serif; font-size: 11px;">LogPrintf(LOG_INFO, "Expensive but precious debug info: %g\n", ThisFunctionIsVeryExpensiveToEvaluate());</span></div><div><br />
What we want is for the expensive function to only be evaluated if the LOG_INFO channel is enabled.<br />
<br />
The way to do this is to put the channel check in the macro itself, and only call the log function if the check succeeds. Here's some sample code that accomplishes this using variadic macros:<br />
<pre>// Define this to 0 to disable logging
#define ENABLE_LOGGING 1

const int LOG_ERROR=0x1;
const int LOG_WARNING=0x2;
const int LOG_INFO=0x4;

#if ENABLE_LOGGING
    // Simple channel system (you want IsLoggingChannelEnabled to be cheap,
    // but there are other ways to implement something like this)
    static int GlobalEnabledLogChannels;

    // Make sure your compiler inlines this function, as it will be called many
    // times. You may want to force it to be inlined using whatever
    // compiler-specific syntax is available to you.
    inline bool IsLoggingChannelEnabled(int channel)
    {
        return 0 != (GlobalEnabledLogChannels &amp; channel);
    }

    // This overload is present to handle the case where the channel argument
    // is optional
    inline bool IsLoggingChannelEnabled(const char*)
    {
        return true;
    }

    // Note: I've seen many logging systems which make the log channel optional.
    // I'm going to handle this case to show how it is done, but if you always
    // require a log channel, this code becomes simpler (for instance, you can
    // make format a required argument to the macro, and not need the ## handling)
    void MyLogPrintf(const char* format, ...);
    void MyLogPrintf(int channel, const char* format, ...);

    // The ## is some syntax magic to make GCC ignore the preceding comma
    // if no arguments after channel are present.
    // This can happen if no channel is specified in the log print, as it is optional
    #define LogPrintf(channel, ...) \
        if(!IsLoggingChannelEnabled(channel)) {} else MyLogPrintf(channel, ##__VA_ARGS__)

#else

    #if _MSC_VER
        // __noop is Visual C++ specific syntax for "do nothing".
        #define LogPrintf(...) __noop
    #else
        // Compiler should strip this out - but always look at the disassembly
        // to make sure!
        inline void Noop()
        {
        }

        #define LogPrintf(...) Noop()
    #endif

#endif

// example log statements
void SomeFunction()
{
    LogPrintf("Hello world!\n");
    LogPrintf(LOG_ERROR, "You should see this very important error\n");
    LogPrintf(LOG_INFO, "Expensive info: %s\n",
        ThisFunctionIsVeryExpensiveToEvaluate().c_str());
}</pre><br />
<br />
Hopefully blogger didn't mangle the formatting of all that.<br />
<br />
The key concept is to perform the IsLoggingChannelEnabled() check in the macro itself. The if/else syntax it uses is carefully constructed -- written this way, the macro will not change the semantics of a surrounding if statement that lacks braces. For example:<br />
<br />
<pre>if (rand()%2 == 0)
    LogPrintf(LOG_INFO, "Rand was even\n");
else
    LogPrintf(LOG_INFO, "Rand was odd\n");</pre>
</div>If we did something like this:<br />
<br />
<pre>#define LogPrintf(channel, ...) \
    if(IsLoggingChannelEnabled(channel)) MyLogPrintf(channel, ##__VA_ARGS__)</pre><br />
that would change the meaning of the above if statement: the else would bind to the macro's if(IsLoggingChannelEnabled(channel)), not to the original rand() check!<br />
<br />
A note on why I made the channel optional: In a particular legacy code base I was dealing with, the channel argument was optional on logging statements, and I had to handle that case without changing every log statement in the application. I wanted to show how you could support something like that.<br />
<br />
This approach has two main drawbacks: an increase in executable size, since the channel check is inlined at every log statement, and the runtime cost of performing that check each time a log statement is reached. Whether you are willing to pay these costs in your development/in-house build really depends on your particular game.<br />
<br />
<div></div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com5tag:blogger.com,1999:blog-263327529581652713.post-83290212902995407512010-12-04T08:53:00.000-06:002010-12-04T08:53:50.731-06:00Is Data-Oriented Design a Paradigm?Recently there has been quite the flurry of tweets about OOP (object-oriented programming) and DoD (data-oriented design). If you're unfamiliar with DoD, here's a nice <a href="http://www.slideshare.net/DICEStudio/introduction-to-data-oriented-design">presentation</a>. If you're unfamiliar with OOP, I'd like to know what cave you've been living in for the last few decades.<div><br />
</div><div>DoD has caught on with game programmers because it puts a name to something anyone who has spent time optimizing a game already knew -- your data access patterns have a much bigger impact on your performance than the actual code you execute. I remember many an optimization session on <i>Stranglehold</i> where a reduction in L2 cache misses led to a perfectly correlated reduction in execution time.</div><div><br />
</div><div>DoD goes further in that it presents a set of guidelines for writing code, up front, that performs well given the reality of the processor-memory speed gap. This sets it apart from a simple optimization technique, as it is something you can apply before the fact rather than after. Follow these guidelines and your program will perform better. </div><div><br />
</div><div><a href="http://dinodini.wordpress.com/2010/12/03/beam-me-up-scotty/">Dino Dini</a> argues that this is nothing new, that game programmers have been doing this for decades. He's right. The underlying concepts are not that new, but giving it a name and a simple package of guidelines is new. This has value, I think, because it helps educate programmers about these concepts. I am not discounting anyone's effort in this area, because I think a lot of programmers need to learn these concepts.</div><div><br />
</div><div>That said, I don't think DoD approaches what one would call a programming paradigm. The consensus definition of programming paradigm is a "fundamental style of programming." It certainly is a style of programming, but I don't think it is fundamental.</div><div><br />
</div><div>While I put on my flame retardant, let me explain what I mean. Structured/procedural and OOP are two programming paradigms that historically grew out of the need to manage software complexity. These are paradigms in which you could organize an entire code base. They contain methods for abstraction, and layered design. </div><div><br />
</div><div>DoD says nothing about code complexity. It does not describe how to organize your entire code base. No matter what happens with the processor-memory gap, code complexity is a huge problem for any large project. DoD offers no tools for managing this complexity.</div><div><br />
</div><div>I can imagine a code base completely organized around the structured paradigm (and many exist). The same with OOP. Many real world code bases mix a little bit of both paradigms -- platform APIs tend to be structured, application architecture these days tends to be OOP. </div><div><br />
</div><div>I can see how DoD fits into either of these paradigms. I don't know what a code base completely organized around DoD would look like. I don't think that's even a question that makes sense, as it is not tackling the same set of problems. </div><div><br />
</div><div>This is fine, and does not take away from DoD at all. In fact, I think it frees us to discuss the realities of writing software for today's hardware without having to waste time arguing about OOP vs DoD. They are apples and oranges.</div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com1tag:blogger.com,1999:blog-263327529581652713.post-19729784721688354332010-03-09T09:54:00.000-06:002010-03-09T09:54:35.299-06:00GDCI'll be at GDC this week. My tentative session schedule is thus<br />
<br />
<table><tbody>
<tr> <td style="border: 1px solid grey; font-weight: bold;">Session Title</td> <td style="border: 1px solid grey; font-weight: bold; text-align: center;">Date</td> <td style="border: 1px solid grey; font-weight: bold; text-align: center;">Start Time</td> <td style="border: 1px solid grey; font-weight: bold; text-align: center;">End Time</td> <td style="border: 1px solid grey; font-weight: bold; text-align: center;">Location</td> </tr>
<tr> <td style="border: 1px solid grey;">Designing for Performance, Scalability & Reliability: StarCraft II's Approach</td> <td style="border: 1px solid grey; text-align: center;">2010-03-11</td> <td style="border: 1px solid grey; text-align: center;">09:00:00</td> <td style="border: 1px solid grey; text-align: center;">10:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 306, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Go With the Flow! Fluid and Particle Physics in PixelJunk Shooter</td> <td style="border: 1px solid grey; text-align: center;">2010-03-11</td> <td style="border: 1px solid grey; text-align: center;">15:00:00</td> <td style="border: 1px solid grey; text-align: center;">16:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 306, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">God of War III: Shadows</td> <td style="border: 1px solid grey; text-align: center;">2010-03-11</td> <td style="border: 1px solid grey; text-align: center;">16:30:00</td> <td style="border: 1px solid grey; text-align: center;">17:30:00</td> <td style="border: 1px solid grey; text-align: center;">Room 304, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Code and Complexity: Managing EVE's Expanding Universe</td> <td style="border: 1px solid grey; text-align: center;">2010-03-12</td> <td style="border: 1px solid grey; text-align: center;">09:00:00</td> <td style="border: 1px solid grey; text-align: center;">10:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 130, North Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Taking Fluid Simulation Out of the Box: Particle Effects in Dark Void</td> <td style="border: 1px solid grey; text-align: center;">2010-03-12</td> <td style="border: 1px solid grey; text-align: center;">09:00:00</td> <td style="border: 1px solid grey; text-align: center;">10:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 304, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Light, Perception, and the Modern Shader</td> <td style="border: 1px solid grey; text-align: center;">2010-03-12</td> <td style="border: 1px solid grey; text-align: center;">12:00:00</td> <td style="border: 1px solid grey; text-align: center;">13:00:00</td> <td style="border: 1px solid grey; text-align: center;">Esplanade Lobby, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Creating the Active Cinematic Experience of Uncharted 2: Among Thieves</td> <td style="border: 1px solid grey; text-align: center;">2010-03-12</td> <td style="border: 1px solid grey; text-align: center;">13:30:00</td> <td style="border: 1px solid grey; text-align: center;">14:30:00</td> <td style="border: 1px solid grey; text-align: center;">Room 305, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">The Next Generation of Fighting Games: Physics & Animation in UFC 2009 Undisputed</td> <td style="border: 1px solid grey; text-align: center;">2010-03-12</td> <td style="border: 1px solid grey; text-align: center;">15:00:00</td> <td style="border: 1px solid grey; text-align: center;">16:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 135, North Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">APB: Creating a Powerful Customisation System for a Persistent Online Action Game</td> <td style="border: 1px solid grey; text-align: center;">2010-03-12</td> <td style="border: 1px solid grey; text-align: center;">16:30:00</td> <td style="border: 1px solid grey; text-align: center;">17:30:00</td> <td style="border: 1px solid grey; text-align: center;">Room 135, North Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Three Big Lies: Typical Design Failures in Game Programming</td> <td style="border: 1px solid grey; text-align: center;">2010-03-13</td> <td style="border: 1px solid grey; text-align: center;">09:00:00</td> <td style="border: 1px solid grey; text-align: center;">10:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 125, North Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Texture compression in real-time, using the GPU</td> <td style="border: 1px solid grey; text-align: center;">2010-03-13</td> <td style="border: 1px solid grey; text-align: center;">10:30:00</td> <td style="border: 1px solid grey; text-align: center;">10:55:00</td> <td style="border: 1px solid grey; text-align: center;">Room 132, North Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">R-Trees -- Adapting out-of-core techniques to modern memory architectures</td> <td style="border: 1px solid grey; text-align: center;">2010-03-13</td> <td style="border: 1px solid grey; text-align: center;">11:05:00</td> <td style="border: 1px solid grey; text-align: center;">11:30:00</td> <td style="border: 1px solid grey; text-align: center;">Room 132, North Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">The Rendering Tools and Techniques of Splinter Cell: Conviction</td> <td style="border: 1px solid grey; text-align: center;">2010-03-13</td> <td style="border: 1px solid grey; text-align: center;">13:30:00</td> <td style="border: 1px solid grey; text-align: center;">14:30:00</td> <td style="border: 1px solid grey; text-align: center;">Room 303, South Hall</td> </tr>
<tr> <td style="border: 1px solid grey;">Uncharted 2: HDR Lighting</td> <td style="border: 1px solid grey; text-align: center;">2010-03-13</td> <td style="border: 1px solid grey; text-align: center;">15:00:00</td> <td style="border: 1px solid grey; text-align: center;">16:00:00</td> <td style="border: 1px solid grey; text-align: center;">Room 305, South Hall</td> </tr>
</tbody></table><br />
<div>I believe Irrational folk will be in and out of the bar at the Marriott quite a bit in the evenings, so if you find yourself in the vicinity and see a big guy with glasses there, that's probably me, so stop by and say hi. </div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com4tag:blogger.com,1999:blog-263327529581652713.post-31240336484780245882010-02-20T12:25:00.001-06:002010-02-20T15:25:38.085-06:00Musings on Data-Oriented DesignLately there has been a lot on the interwebs about "Data-Oriented Design." Mike Acton tackles the problems with textbook OOP with the provocative title <a href="http://macton.smugmug.com/gallery/8936708_T6zQX#593426709_ZX4pZ">Typical C++ Bullshit</a>, Sony has an excellent presentation titled <a href="http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf">Pitfalls of Object Oriented Programming</a>, and Games from Within discusses the subject <a href="http://gamesfromwithin.com/data-oriented-design">here</a>. For any programmer wishing to write code that performs well on today's processors, I highly recommend reading all three.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYPGr555qdQYR39DGcQ2jAG-FjBskdk37ZUqo7O4j011hvYPD9B-G_RNanQmBHuOsAIfI4Xo8bOz8UYiKOJL_RzcgV0BrN4DGB7NXd-TbHIh2BQXC-EJdKOffS2A_gVtsP5CkOtxO62gY/s1600-h/mem_cpu_gap.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYPGr555qdQYR39DGcQ2jAG-FjBskdk37ZUqo7O4j011hvYPD9B-G_RNanQmBHuOsAIfI4Xo8bOz8UYiKOJL_RzcgV0BrN4DGB7NXd-TbHIh2BQXC-EJdKOffS2A_gVtsP5CkOtxO62gY/s320/mem_cpu_gap.jpg" /></a></div><div class="separator" style="clear: both; text-align: left;"><br />
</div>The fundamental problem is pretty simple: C++ was designed during the early 80's, when the gap between processor performance and memory performance was small. Now that gap is large. Notice that the vertical scale on that graph is logarithmic -- the gap is nearly one thousand times larger than it was in the early 80's.<br />
<br />
It is understandable that textbook OOP, which came to be under such different hardware performance characteristics, would have performance problems with today's hardware.<br />
<br />
I've been thinking about this problem lately and my conclusion is we need better language and compiler support for the layout and access of data in systems languages. Whether that comes as modifications to C++ or as something new, I'm not going to wade into that swamp today.<br />
<br />
<b>Where we are</b><br />
<b><br />
</b><br />
C itself is really just portable assembly language. It defines an abstract machine model but there is a pretty close mapping between C code and the assembly it generates. C++ kept this ability (as it is mostly a superset of C), but added in abstractions to help deal with large code bases. These abstractions necessarily came at a cost -- you can write C++ code that does not map very closely to the <a href="http://www.rachelslabnotes.com/2009/10/the-hidden-cost-of-c/">assembly it generates</a>.<br />
<br />
My proposition is that the data organization capabilities of both C and C++ are the equivalent of portable assembly language for data: a close mapping between the code and the data layout it generates. While the C++ standard does not actually specify a memory layout, the <i>de facto</i> standard in most compilers is that the declaration of a structure or class, aside from some inserted vtable pointers, corresponds 1:1 to how it is laid out in memory. Most operating system APIs depend on this fact, as you pass structures to them with strict memory layouts.<br />
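As a concrete illustration of that 1:1 mapping (the struct and its field names are hypothetical, not from any particular codebase), the compiler lays members out in declaration order, inserting padding only to satisfy alignment:

```cpp
#include <cstddef>
#include <cstdint>

// A plain-old-data struct: members are laid out in declaration order.
// The compiler never reorders them, and OS and driver APIs rely on
// exactly this property when you hand them structures.
struct Projectile {
    float    position[3];  // bytes 0..11
    float    velocity[3];  // bytes 12..23
    uint32_t ownerId;      // bytes 24..27
};
```

You can confirm the layout with `offsetof`, which is exactly what code that passes such structs across API boundaries implicitly depends on.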
<br />
To see why this is a problem, let me make an analogy with instruction scheduling. As processors became pipelined and then superscalar, the scheduling of instructions to keep all those pipelines full became a big problem. The early C and C++ compilers did a very poor job of it, and people resorted to either reorganizing their code or dropping down to assembly language to take proper advantage. Compilers have gotten a lot better at scheduling instructions over time -- to the point that things like inline assembly hurt the ability of the compiler to reorder instructions. With the advent of compiler intrinsics, which the compiler understands and can schedule along with other instructions, you're better off sticking in C or C++ rather than using inline assembly these days. Even in C (which, again, is portable assembly language), you still run into code for which the compiler does not generate the machine instructions you'd like, but the tools for detecting such problems are quite good, and the fixes are usually localized to a particular function.<br />
<br />
Moving over to the data side, we are constantly stuck in a space equivalent to hand-scheduling instructions. I think this is the challenge of data-oriented techniques: you are forced into a head space where you spend a fair amount of time analyzing data access patterns and rearranging code and data structures, rather than solving the actual problem your code is intended to solve. I'm sure there are people for whom this comes quite naturally (I suspect Mike Acton is one), but for me, at least, this takes a considerable amount of mental effort.<br />
<br />
<b>Where we need to be</b><br />
<b><br />
</b><br />
As I've thought about this more, I've realized that both C and C++ fail in offering any sort of tools to help the programmer tackle the problems of data organization. If the compiler is free to reschedule instructions, should we not let it be free to reorganize our data structures?<br />
<br />
Obviously, the compiler can not do this alone. One recurring theme in these presentations is that textbook OOP tends to focus on singular entities. A class has a virtual function that deals with late dispatch on <i>one</i> object. A class defines the layout for <i>one </i>object. Obviously, you don't <i>have</i> to design your classes this way -- and in fact, the above presentations argue you shouldn't. But if you find yourself fighting with or avoiding the language abstractions rather than using them, what have you gained? In that sense, C++'s abstractions hurt us because they lull us into writing code that will run horribly. We need better abstractions.<br />
<br />
Both of these presentations move away from the model of classes that deal with <i>one</i> thing and move to code that deals with <i>sets</i> of things. If you are going to do a sphere in frustum test, you're going to be doing it on many things, not just one. Even when sets are not homogeneous, we deal with that by sorting them by type, and executing our operations in bulk on each type.<br />
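A sketch of what set-oriented code looks like (the names and the single-plane simplification are mine, not taken from the presentations): a visibility test written against a whole array of spheres rather than one object at a time.

```cpp
#include <cstddef>

struct Sphere { float x, y, z, radius; };
struct Plane  { float nx, ny, nz, d; };  // unit normal; "inside" is the positive side

// Operate on the whole set at once: one tight loop over contiguous data,
// writing results into a parallel output array instead of making a
// per-object virtual call.
void cullSpheres(const Sphere* spheres, std::size_t count,
                 const Plane& plane, bool* visible)
{
    for (std::size_t i = 0; i < count; ++i) {
        const Sphere& s = spheres[i];
        float dist = plane.nx * s.x + plane.ny * s.y + plane.nz * s.z + plane.d;
        visible[i] = dist >= -s.radius;
    }
}
```

The loop body touches only the 16 bytes per object it actually needs, which is the whole point: the memory access pattern is linear and predictable.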
<br />
We need more than sets, though, because different operations need different views on the data. Transform update may only be concerned with the matrix of a game entity, whereas higher level AI code may have a completely different view. We want our data to be laid out optimally for some of our operations, which may mean different data is stored in different places, or we may even have multiple copies of some data in order to support different operations.<br />
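One way to read "different views on the data" (again, an invented illustration rather than anything from the presentations): keep the fields each operation needs in their own contiguous arrays, so transform update streams through hot data without dragging AI or debug state into cache.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical entity storage split by access pattern rather than by object.
// Transform update touches only positions and velocities; AI code would have
// its own arrays elsewhere; an "entity" is just an index shared across views.
struct EntityTransforms {
    std::vector<float> worldX, worldY;
    std::vector<float> velX, velY;
};

void integrate(EntityTransforms& t, float dt)
{
    // One pass over hot data only; cold data never enters the cache here.
    for (std::size_t i = 0; i < t.worldX.size(); ++i) {
        t.worldX[i] += t.velX[i] * dt;
        t.worldY[i] += t.velY[i] * dt;
    }
}
```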
<br />
One of those views is the view we use for debugging. In our head space, we tend to think about single entities in the game world -- <i>this</i> projectile, <i>this </i>character, <i>this </i>mesh. Textbook OOP tends to couple class layout with this debugging head-space, and is part of the attraction -- I don't have to care about what is going on with everything else in the program, I have everything I need to know about this mesh right here.<br />
<br />
The organization the computer needs is much different, though -- when doing frustum culling, for example, what we really want is just a big array of AABBs. When debugging why a specific mesh is being culled, though, it really helps to see all the data about that entity in one place. Otherwise, you spend a lot of time traversing data structures in the watch window, just to find out what state an object got in that caused it to flip its visible bit to false. So the view of the data that humans need is another important piece of the puzzle.<br />
<br />
This is the limit of my current musings. I want to write code that deals with sets of things as a natural part of the language and not just some templated container library. I want to be able to specify multiple views on my data, and have the compiler use this information to generate optimal data layout for certain operations. In the debugger, I want a debugging view which is similar to the textbook OOP view. I want a language that is designed to provide these things, and will tackle data layout as an optimization problem similar to register allocation, instruction scheduling, or inlining.<br />
<br />
Perhaps this is too radical a departure for a low-level language such as C or C++. I would hope there are some research languages out there that do the kind of things I am talking about -- other duties have prevented me from doing anything more than a cursory literature search. Given that the processor-memory gap is only likely to get worse, I'd certainly hope there is.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com11tag:blogger.com,1999:blog-263327529581652713.post-52868339916700278012009-12-22T09:30:00.001-06:002009-12-22T09:31:54.897-06:00More Stencil States for Light Volume RenderingA while back I wrote a short entry on <a href="http://solid-angle.blogspot.com/2009/08/stencil-states-for-rendering-light.html">stencil states for light volumes</a>. The method I posted works but relies on using a zfail stencil operation. Shortly after, I quickly discovered that it ran considerably slower on ATI cards than on the original Nvidia card I had been writing on, and have been meaning to post an update.<br />
<br />
On certain hardware -- specifically, <a href="http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf">ATI PC hardware</a> -- using anything but Keep in the zfail operation can disable early stencil rejection, and this caused quite a slowdown.<br />
<br />
The solution I figured out (and I'm sure others have) was to switch to a method which only relies on stencil pass operations:<br />
<br />
<div style="margin: 0px;"><code style="font-size: 12px;">AlphaBlendEnable = false<br />
StencilEnable = true<br />
ColorWriteChannels = None<br />
DepthBufferEnable = true<br />
StencilDepthBufferFail = Keep<br />
<br />
// render frontfaces so that any pixel in back of them has stencil incremented<br />
CullMode = CounterClockwise<br />
StencilFunction = Always<br />
// if a pixel is in back of the volume frontface, it is potentially inside the volume<br />
StencilPass = Increment<br />
<br />
// render volume<br />
<br />
// render backfaces so that only pixels in back of the backface have stencil decremented<br />
CullMode = Clockwise<br />
// pass the stencil test if reference value &lt; buffer, so we only process pixels marked above.<br />
// The reference value is 0. This is not strictly necessary, but an optimization.<br />
StencilFunction = Less<br />
// if a pixel is in back of the volume backface, it is outside the volume and should not be considered<br />
StencilPass = Decrement<br />
<br />
// render volume<br />
<br />
AlphaBlendEnable = true<br />
ColorWriteChannels = RGB<br />
// only process pixels with 0 &lt; buffer<br />
StencilFunction = Less<br />
// zero out pixels so we don't need a separate clear for the next volume<br />
StencilPass = Zero<br />
<br />
// render a screen space rectangle scissored to the projection of the light volume<br />
</code></div><br />
There is a problem with this method -- if the light volume intersects the near plane, it won't work, because the portion of the light volume in front of the near plane will never increment the stencil buffer.<br />
<br />
My solution to this was pretty simple -- if the light volume intersects the near plane, I use the zfail method from the earlier post. Otherwise, I use the stencil pass operation. For most lights, we're using the fastest path on both the major brands of cards. I briefly scanned some papers and articles on shadow volumes (a very similar problem), hoping to find an alternate way to cap volumes intersecting the near plane, but didn't see anything that looked particularly easy to implement or would necessarily perform that well, and this method got performance on ATIs and Nvidias mostly on par.<br />
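The path selection can be sketched as a simple sphere-versus-near-plane test (the function names are made up; the post does not show this code): if the light volume's bounding sphere crosses the near plane, fall back to the zfail method, otherwise take the faster zpass path above.

```cpp
#include <cmath>

// Conservative test: does a light volume's bounding sphere, given in view
// space (looking down +z), cross the near plane at z = nearZ?
bool crossesNearPlane(float centerZ, float radius, float nearZ)
{
    return std::fabs(centerZ - nearZ) < radius;
}

enum class StencilMethod { ZPass, ZFail };

// Pick the fast zpass path when it is safe, zfail otherwise.
StencilMethod chooseMethod(float centerZ, float radius, float nearZ)
{
    return crossesNearPlane(centerZ, radius, nearZ) ? StencilMethod::ZFail
                                                    : StencilMethod::ZPass;
}
```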
<br />
What about two-sided stencil? This is a mode in DX9 where you can render both backfaces and frontfaces in one pass, with separate stencil operations on each. Because the stencil increment/decrement operations wrap around (i.e. decrementing 0 becomes 255, incrementing 255 becomes 0), ordering doesn't really matter (although you have to make the StencilFunction Always on both). I did some quick tests using two sided stencil and my initial results showed it was actually slower than rendering both passes separately. I didn't spend much time on it so it is possible that I simply screwed something up, and plan to revisit it at some point.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com1tag:blogger.com,1999:blog-263327529581652713.post-51099967913814874162009-12-12T16:11:00.001-06:002009-12-12T16:11:43.072-06:00Screen Space Spherical Harmonic LightingA while ago Jon Greenberg brought up the idea of accumulating lighting in screen space using spherical harmonics, in a blog entry entitled "<a href="http://deadvoxels.blogspot.com/2009/08/has-someone-tried-this-before.html">Has anyone tried this before?</a>"<br />
<br />
I've been doing deferred lighting experiments in XNA, and decided to give this technique a try. Please note I'm not doing any antialiasing and all screenshots are the definition of "programmer art" cobbled together from various freely available assets.<br />
<br />
Screenshots show full resolution deferred lighting on top and the screen space SH technique at the bottom in the <a href="http://hdri.cgtechniques.com/~sponza/files/">Sponza Atrium</a> by Marko Dabrovic:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbEb7iKE7KUD3V1C58H5pQzrlk3hG_H2LGCsOe1ESY_WDuQkuLJSpK1aeNcZPs7LhWY53SrCvjmyTrHyPNefxbHQtebpXX_c_JElKM9q3WfinUUVyIteEAvAcQe_KAZiZSiMR3NvcOzVg/s1600-h/compare.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbEb7iKE7KUD3V1C58H5pQzrlk3hG_H2LGCsOe1ESY_WDuQkuLJSpK1aeNcZPs7LhWY53SrCvjmyTrHyPNefxbHQtebpXX_c_JElKM9q3WfinUUVyIteEAvAcQe_KAZiZSiMR3NvcOzVg/s400/compare.png" /></a><br />
</div><div class="separator" style="clear: both; text-align: left;"><br />
</div>The technique was pretty simple to get up and going, and produces some interesting results. The above images are with 27 point lights, 3 spot lights, and 1 directional. The directional is the only light evaluated at full resolution per-pixel, in the apply lighting stage.<br />
<br />
The basic idea is to use a quarter-sized lighting buffer (thus, in this case, 640x360) to accumulate 4-coefficient spherical harmonics. The nice thing is you only need the depth information to do so. I used 3 FP16 buffers to accumulate the SH constants. Points and spots are evaluated by rendering the light geometry into the scene and evaluating the SH coefficients for the light direction via cube map lookup, and then attenuating as normal. For the directional light, I evaluate that in the apply lighting shader. I'm not rendering any shadows.<br />
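A minimal sketch of the 4-coefficient accumulation for one color channel (the constants are the standard band-0/band-1 SH basis and clamped-cosine convolution terms; the function names and the CPU formulation are mine -- the real thing runs in shaders with a cube map lookup):

```cpp
// 4-coefficient (band 0 + band 1) spherical harmonics, one color channel.
struct SH4 { float c[4] = {0.0f, 0.0f, 0.0f, 0.0f}; };

// Accumulate a light arriving from unit direction (x, y, z) with the given
// attenuated intensity. Basis constants: Y00 = sqrt(1/4pi), Y1m = sqrt(3/4pi).
void addLight(SH4& sh, float x, float y, float z, float intensity)
{
    sh.c[0] += 0.282095f * intensity;
    sh.c[1] += 0.488603f * y * intensity;
    sh.c[2] += 0.488603f * z * intensity;
    sh.c[3] += 0.488603f * x * intensity;
}

// Evaluate diffuse irradiance for unit normal (x, y, z) by convolving with
// the clamped-cosine kernel (A0 = pi, A1 = 2pi/3). For one directional light
// this reduces to intensity * (0.25 + 0.5 * dot(n, l)) -- note it only
// reaches zero past ~120 degrees, which is exactly the soft wrap onto
// backfaces described below.
float evalDiffuse(const SH4& sh, float x, float y, float z)
{
    const float a0 = 3.141593f * 0.282095f;  // A0 * Y00
    const float a1 = 2.094395f * 0.488603f;  // A1 * Y1m
    float e = a0 * sh.c[0]
            + a1 * (sh.c[1] * y + sh.c[2] * z + sh.c[3] * x);
    return e > 0.0f ? e : 0.0f;
}
```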
<br />
For diffuse lighting, it works pretty well, although due to the low number of SH coefficients, you will get some lighting wrapping around onto backfaces, which in practice just tends to give you "softer" lighting. That may or may not be desirable.<br />
<br />
Even though the lighting buffer is quarter-sized, you don't really lose any normal detail since SH accumulates the lighting from all directions. In my test scene, the earth models are the only ones with normal maps (deferred on the left, SH on the right)<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNqFOJbBTR1gI5SJz1jNPekUD4DJo38kPDsZkHVsr7NtyG4FDs-wHfY44mFoMHH-Dh7azm_rloOIB1GkHUXHMnhJ7Nsavmkt9tmpgc8ypgxKPE0g0aU17rFpqv1wbziqV4GDy_vNHU_tg/s1600-h/normals.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNqFOJbBTR1gI5SJz1jNPekUD4DJo38kPDsZkHVsr7NtyG4FDs-wHfY44mFoMHH-Dh7azm_rloOIB1GkHUXHMnhJ7Nsavmkt9tmpgc8ypgxKPE0g0aU17rFpqv1wbziqV4GDy_vNHU_tg/s400/normals.png" /></a><br />
</div><br />
I found that when you upsample the lighting buffer during the apply lighting stage naively, you would get halos around the edges of objects. I fixed this using a bilateral filter aware of depth discontinuities.<br />
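A sketch of the depth-aware weighting (parameter names and the Gaussian falloff are my assumptions; only the depth term is shown, with the bilinear spatial weight omitted for brevity):

```cpp
#include <cmath>
#include <cstddef>

// Weight for one low-res lighting sample: a depth-aware term kills
// contributions from across a depth discontinuity, so lighting does not
// halo around object silhouettes.
float depthWeight(float hiResDepth, float loResDepth, float depthSigma)
{
    float d = hiResDepth - loResDepth;
    return std::exp(-(d * d) / (2.0f * depthSigma * depthSigma));
}

// Combine the four nearest low-res samples with depth-aware weights.
float upsampleLighting(const float light[4], const float depth[4],
                       float hiResDepth, float depthSigma)
{
    float sum = 0.0f, wsum = 0.0f;
    for (std::size_t i = 0; i < 4; ++i) {
        float w = depthWeight(hiResDepth, depth[i], depthSigma);
        sum += light[i] * w;
        wsum += w;
    }
    // Fall back to the nearest sample if every weight collapses to zero.
    return wsum > 1e-6f ? sum / wsum : light[0];
}
```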
<br />
I was able to fake specular by extracting a dominant light direction from the SH, dotting that with the half vector, raising to the specular power, and multiplying that times the diffuse lighting result. It doesn't really give you great results, but it looks specular-ish. I tried using the lighting looked up at the reflected view vector but found that gave worse results.<br />
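The dominant-direction extraction can be sketched as normalizing the linear (band-1) coefficients -- a common trick, and an assumption about the approach rather than the post's actual code:

```cpp
#include <cmath>

// Given the three linear (band-1) SH coefficients for luminance, stored in
// (y, z, x) order -- a common convention, assumed here -- the dominant
// incoming light direction is just the normalized linear band.
void dominantLightDir(float shY, float shZ, float shX,
                      float& dx, float& dy, float& dz)
{
    float len = std::sqrt(shX * shX + shY * shY + shZ * shZ);
    if (len < 1e-6f) { dx = 0.0f; dy = 0.0f; dz = 1.0f; return; }  // no directional signal
    dx = shX / len;
    dy = shY / len;
    dz = shZ / len;
}
```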
<br />
Performance-wise, in my little XNA program, which I'd hardly call optimized, the SH lighting is about the same as deferred lighting when I store specular luminance instead of a specular lighting color in the lighting buffer. Here's some screen shots showing 388 lights (384 points, 3 spots, and 1 directional):<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJE6RClzYp5j_cLifeI6QWPElQLnXdsPtxs_pJ0vys2kstpb26rp5VE3ekpS3IvcrhXesyymdRH0ZVxZmHrF0rf1rHeFm_jYYDEVNnesvp0rIz7r5gimQJvYq1MknCMe_iELutqO8a7tM/s1600-h/compare387.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJE6RClzYp5j_cLifeI6QWPElQLnXdsPtxs_pJ0vys2kstpb26rp5VE3ekpS3IvcrhXesyymdRH0ZVxZmHrF0rf1rHeFm_jYYDEVNnesvp0rIz7r5gimQJvYq1MknCMe_iELutqO8a7tM/s400/compare387.png" /></a><br />
</div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: left;">Note that there is at least one major optimization that could be performed when I'm calculating the SH coefficients for a light. Currently, my SH lookup cube map is in world space, but my light vectors are calculated in view space for points and spots. This causes me to make a matrix multiplication against the inverse view matrix in all the lighting shaders. This could probably be sped up quite a bit by calculating the SH lookup cubemap in view space each frame.<br />
</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: left;">All in all, it is an interesting technique. I'm not very happy with the specular results at all, and the softness of the lighting could be a benefit or a drawback depending on the look you are going for. Jon also points out that the lighting calculations could easily be moved to the CPU on some platforms, since they only depend on depth information. I'm probably not going to explore the technique much further but thought I'd post what I'd found from the limited investigation I did.<br />
</div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com14tag:blogger.com,1999:blog-263327529581652713.post-17284638187608647352009-12-04T15:06:00.002-06:002009-12-06T10:37:06.682-06:00A Production Irradiance Volume Implementation DescribedOn a previous title I worked on, the dynamic lighting system we had could best be described as "an emergency hack." We found ourselves approaching an E3 demo without a viable dynamic lighting system -- the one in the engine we were licensing required re-rendering geometry for each light. Even using a completely separate lighting rig for dynamic objects (with a much smaller number of lights), this was not practical on the hardware and ran too slowly. The engine in question would eventually have a much better dynamic lighting system, but that would not come for some time, and we needed something that worked right away.<br />
<br />
The solution was to limit the number of lights that could affect dynamic objects and render 3 lights in a single pass. The three lights were chosen based on the strength of their contribution at the object's center point, and hysteresis was used to avoid light popping. Shadows darkened the scene instead of blocking a specific light, which is an old technique, but worked "well enough."<br />
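The selection scheme might look something like this (a from-scratch sketch, not the shipped code): score each light by its attenuated contribution at the object's center, and give currently chosen lights a score bonus so a marginally brighter newcomer does not cause popping -- that bonus is the hysteresis.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Light { float x, y, z, intensity; };

// Attenuated contribution of a light at a point (simple inverse-square-ish
// falloff, chosen for illustration only).
float contribution(const Light& l, float px, float py, float pz)
{
    float dx = l.x - px, dy = l.y - py, dz = l.z - pz;
    return l.intensity / (1.0f + dx * dx + dy * dy + dz * dz);
}

// Keep previously chosen lights unless a candidate beats them by 'margin'.
std::vector<int> pickLights(const std::vector<Light>& lights,
                            const std::vector<int>& previous,
                            float px, float py, float pz,
                            std::size_t maxLights, float margin)
{
    std::vector<std::pair<float, int>> scored;
    for (int i = 0; i < (int)lights.size(); ++i) {
        float s = contribution(lights[i], px, py, pz);
        // Held lights get a bonus: hysteresis against light popping.
        bool held = std::find(previous.begin(), previous.end(), i) != previous.end();
        scored.push_back({s + (held ? margin : 0.0f), i});
    }
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<float, int>& a, const std::pair<float, int>& b) {
                  return a.first > b.first;
              });
    std::vector<int> chosen;
    for (std::size_t i = 0; i < scored.size() && i < maxLights; ++i)
        chosen.push_back(scored[i].second);
    return chosen;
}
```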
<br />
I was never very happy with this solution, but it was good enough for us to ship with. It was too difficult for artists to light dynamic and static objects consistently due to the separate lighting rigs, and often the dynamic lighting would not match the static lighting very well. Dynamic lights did not take occlusion into account so you could often get bleeding through walls, which would require painful light placement and light channel tweaking.<br />
<br />
After that project shipped, I very much wanted to make a better system that would solve most of the problems. I wanted consistent results between static and dynamic lighting, I wanted a single lighting rig, and I wanted a better shadowing solution.<br />
<br />
A colleague on another title at the same studio was getting some good results with spherical harmonic-based lighting, albeit in a completely different genre. I had also recently read Natalya Tatarchuk's <i><a href="http://developer.amd.com/media/gpu_assets/Tatarchuk_Irradiance_Volumes.pdf">Irradiance Volumes for Games</a></i> presentation, and I felt that this was a viable approach that would help achieve my goals.<br />
<br />
The way it worked is artists placed arbitrary irradiance volumes in the map. An irradiance volume stores a point cloud of spherical harmonic samples describing incoming light. In the paper, they use an octree to store these samples, but I found that was not desirable since you had to subdivide in all three axes simultaneously -- thus if you needed more sampling detail in X and Z you were forced to also subdivide in Y. Our levels weren't very vertical, so those extra samples in Y were unnecessary and just took up memory.<br />
<br />
Instead, I used a kd-tree, which allowed me to stop subdividing an axis once it had reached an artist-specified minimum resolution.<br />
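The per-axis stopping rule can be sketched like this (types and names invented for illustration): pick the longest axis still above the minimum spacing, and stop refining each axis independently of the others -- which is exactly what an octree cannot do.

```cpp
struct Box { float min[3], max[3]; };

// Choose which axis of a kd-tree cell to split: the longest axis whose
// extent is still above the artist-specified minimum sample spacing.
// Returns -1 when every axis has reached minimum resolution. Unlike an
// octree, a flat level can stop refining in Y while X and Z keep going.
int chooseSplitAxis(const Box& b, float minResolution)
{
    int best = -1;
    float bestExtent = minResolution;
    for (int axis = 0; axis < 3; ++axis) {
        float extent = b.max[axis] - b.min[axis];
        if (extent > bestExtent) {
            best = axis;
            bestExtent = extent;
        }
    }
    return best;
}
```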
<br />
Another problem was what heuristic to use for choosing a sample set. The original paper used a GPU-based solution that rendered depth to determine whether a cell contained geometry, and if so, subdivided. The idea is that places with geometry are going to have more lighting variation. The preexisting static lighting pipeline I was working in did not lend itself to a GPU-based solution, so I took a similar approach using a CPU-side geometry database to determine whether cells contained geometry. In practice, it was pretty fast.<br />
<br />
I would subdivide in a breadth-first manner until either I hit an artist-controlled minimum sampling resolution or we hit the memory budget for that irradiance volume. This allowed me to have a fixed memory budget for my irradiance data, and basically the technique would produce as much detail as would fit in that budget for the volume. I also rendered a preview of the sampling points the heuristic would produce, allowing artists to visualize this before actually building lighting.<br />
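Putting those two paragraphs together, the subdivision might look something like this sketch (a node count stands in for the real memory budget, and all names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

struct Box { float min[3], max[3]; };

// One kd-tree node; leaves hold lighting samples at their corners.
struct KdNode {
    Box bounds;
    int splitAxis;    // -1 while a leaf
    int children[2];  // indices into the node array, -1 = none
};

// Breadth-first subdivision: split the longest axis still above the
// artist-specified minimum resolution, and stop once the node budget
// (standing in for the memory budget) is exhausted.
std::vector<KdNode> buildVolume(const Box& root, const float minCellSize[3],
                                std::size_t maxNodes) {
    std::vector<KdNode> nodes;
    nodes.push_back({root, -1, {-1, -1}});
    std::queue<int> open;
    open.push(0);
    while (!open.empty() && nodes.size() + 2 <= maxNodes) {
        int idx = open.front();
        open.pop();
        // Pick the longest axis whose extent still exceeds its minimum.
        int axis = -1;
        float best = 0.0f;
        for (int a = 0; a < 3; ++a) {
            float extent = nodes[idx].bounds.max[a] - nodes[idx].bounds.min[a];
            if (extent > minCellSize[a] && extent > best) {
                best = extent;
                axis = a;
            }
        }
        if (axis < 0) continue;  // every axis at minimum resolution: a leaf
        float mid = 0.5f * (nodes[idx].bounds.min[axis] +
                            nodes[idx].bounds.max[axis]);
        for (int c = 0; c < 2; ++c) {
            KdNode child = {nodes[idx].bounds, -1, {-1, -1}};
            (c == 0 ? child.bounds.max : child.bounds.min)[axis] = mid;
            int childIdx = (int)nodes.size();
            nodes[idx].children[c] = childIdx;
            nodes.push_back(child);
            open.push(childIdx);
        }
        nodes[idx].splitAxis = axis;
    }
    return nodes;
}
```

Breadth-first order matters here: when the budget runs out, the detail that does exist is spread evenly across the volume rather than concentrated wherever the traversal happened to start.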
<br />
Once I had a set of points, I sent it off to <a href="http://www.illuminatelabs.com/products/beast">Beast</a> to calculate both direct and indirect lighting at each sample point. Once I had the initial SH dataset, I performed some postprocessing.<br />
<br />
The first step was to window the lighting samples to reduce ringing artifacts (see Peter-Pike Sloan's <i><a href="http://www.ppsloan.org/publications/StupidSH36.pdf">Stupid Spherical Harmonic Tricks</a></i>). The amount of windowing was exposed to artists as a "smoothing parameter". I had set up the toolchain so that, in the editor, I stored both the original Beast-produced SH samples (which took a minute or so to generate) and the postprocessed values. This let the artists change the various postprocessing variables without recomputing the lighting, allowing for faster iteration.<br />
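The windowing step reduces to scaling each SH band by a factor driven by the smoothing parameter. This sketch assumes order-2 (9-coefficient) SH and a Tikhonov-style window; Sloan's paper covers several window shapes, and I'm not claiming this exact one was the one used:

```cpp
#include <cassert>
#include <cmath>

// Attenuate each band l of a 9-coefficient (order 2) SH vector. A smoothing
// value of 0 leaves the data untouched; larger values squash the higher
// bands, trading detail for fewer ringing artifacts.
void windowSH(float sh[9], float smoothing) {
    for (int l = 0; l <= 2; ++l) {
        float w = 1.0f / (1.0f + smoothing * (float)(l * l * (l + 1) * (l + 1)));
        for (int m = -l; m <= l; ++m)
            sh[l * (l + 1) + m] *= w;  // linear index of coefficient (l, m)
    }
}
```

Because the window is just a per-band scale, re-running it with a new smoothing value against the cached raw samples is nearly free, which is what makes this kind of postprocess cheap to iterate on.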
<br />
The next step was to remove redundant lighting samples. Within the kd-tree, the lighting samples are arranged as a series of 3D boxes -- finding the lighting at any arbitrary point within a box is done via trilinear interpolation. Because of the hierarchical nature of the kd-tree, each level split its box into two along one of the three axes. I would compare the value at a "leaf" box point with the interpolated value from the parent box -- if the difference between these two SH coefficient sets was within a certain threshold, I would remove the leaf sample. Once this process was done, we were only storing lighting samples for areas that actually had varying lighting.<br />
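The per-sample test reduces to something like this. Since a kd split places the new "leaf" samples at midpoints of parent box edges, the parent's interpolated prediction at such a point is just the average of the two parent corner samples on that edge (the names and the flat 9-float layout are my assumptions):

```cpp
#include <cassert>
#include <cmath>

const int kNumCoeffs = 9;  // order-2 SH, one float per coefficient

// Returns true when the leaf sample is predicted well enough by its parent
// box: lookups can then fall through to interpolating the parent, and the
// leaf sample doesn't need to be stored at all.
bool isRedundant(const float parentA[kNumCoeffs],
                 const float parentB[kNumCoeffs],
                 const float leaf[kNumCoeffs], float threshold) {
    for (int i = 0; i < kNumCoeffs; ++i) {
        float predicted = 0.5f * (parentA[i] + parentB[i]);  // edge midpoint
        if (std::fabs(leaf[i] - predicted) > threshold)
            return false;  // this coefficient carries real information: keep
    }
    return true;
}
```

The threshold is the knob: raise it and more of the volume collapses back onto its parent samples, which is how the memory ends up concentrated in areas where the lighting actually varies.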
<br />
Samples were referenced by index into a sample array at each node of the KD-tree, which allowed me to further combine samples that were nearly identical. Finally, I encoded the sample coefficients as FP16s, to further save on memory. I was later going to revisit this encoding, as it had some decoding expense at runtime, and there probably were cheaper, better options out there.<br />
<br />
At runtime, each dynamically lit object would keep track of what irradiance volume it was in when it moved. Transitions between volumes were handled by having the artists make the volumes overlap when placing them -- since the sample data would essentially be the same in the areas of overlap, when you transitioned there would be no pop.<br />
<br />
A dynamically lit object would not just sample one point for lighting, but several. I would take the object's bounding box, shrink it by a fixed percentage, and sample the center of each face. I would also sample the center point. Dynamic lights would be added into the SH coefficient set analytically. I then extracted a dominant directional light from the SH set and constructed a linear (4-coefficient) SH gradient + center sample. Rendering a directional light + a linear SH set achieves results similar to rendering a full 9-coefficient set, and is much faster on the GPU. Bungie used this same trick on Halo 3.<br />
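The gradient construction amounts to central differences across opposing face samples, and evaluation is a handful of multiply-adds. A sketch with hypothetical names (a real system would do this per color channel and feed the result to the shader):

```cpp
#include <cassert>
#include <cmath>

const int kCoeffs = 4;  // linear SH: 1 DC + 3 linear coefficients

struct SHGradient {
    float center[kCoeffs];
    float ddx[kCoeffs], ddy[kCoeffs], ddz[kCoeffs];
};

// faces[0]/faces[1] hold the -X/+X face-center samples, [2]/[3] the Y pair,
// [4]/[5] the Z pair; the extents are the shrunken bounding box dimensions.
SHGradient buildGradient(const float faces[6][kCoeffs],
                         const float center[kCoeffs],
                         float extentX, float extentY, float extentZ) {
    SHGradient g;
    for (int i = 0; i < kCoeffs; ++i) {
        g.center[i] = center[i];
        g.ddx[i] = (faces[1][i] - faces[0][i]) / extentX;  // central difference
        g.ddy[i] = (faces[3][i] - faces[2][i]) / extentY;
        g.ddz[i] = (faces[5][i] - faces[4][i]) / extentZ;
    }
    return g;
}

// First-order reconstruction of the lighting at an offset from the center.
void evalGradient(const SHGradient& g, float ox, float oy, float oz,
                  float out[kCoeffs]) {
    for (int i = 0; i < kCoeffs; ++i)
        out[i] = g.center[i] + g.ddx[i] * ox + g.ddy[i] * oy + g.ddz[i] * oz;
}
```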
<br />
The gradient allowed me to get a first order approximation of changing lighting across the dynamic object, which was a big improvement in the quality of the lighting and really helped make the dynamic lighting consistent with the static lighting. Evaluating a 4 SH gradient + directional light was about the same cost as if I'd evaluated a full 9 coefficient SH on the GPU, but produced much higher quality.<br />
<br />
The SH set for a dynamic object was constructed on another thread, and only happened if the object moved or its set of dynamic lights changed. This allowed us to support rendering a large number of dynamic objects.<br />
<br />
Sometimes the kd-tree subdivision heuristic would not generate detailed enough sampling for a specific area -- for these cases I let the artists place "irradiance detail volumes", which override the sampling parameters for a specific region of the irradiance volume, either forcing more detail or using a smaller minimum sample resolution.<br />
<br />
Finally, for shadows, in outdoor areas we used a cascaded shadow map solution for the sun, and for interior areas, supported spotlights that cast shadows. The artists had to be careful placing these spotlights as we could not support a large number of shadow casting lights simultaneously. At the time we were rendering these lights as a separate geometry pass, but I had plans to support one shadow casting light + the SH lighting in a single pass.<br />
<br />
The end result was that for anything car-sized or smaller, lit by statically placed lights using the same lighting rig that produced the lightmaps, you would have a very difficult time telling which objects were static and which were dynamic. One interesting side effect that was technically a "bug" but actually helped produce good results was that samples underneath the floor would almost always be black, since no light reached them. When constructing the gradient, these samples would usually be used for the bottom face of the bounding box. In practice, though, this just made the object get gradually darker toward the floor -- which was not unpleasant, helped ground the object in the scene, and acted as a kind of fake ambient occlusion. In ShaderX 7, the article about Crackdown's lighting describes a similar AO hack, although theirs was intentional. We decided to keep the happy accident.<br />
<br />
The biggest issue with the system was it didn't deal very well with very large dynamic objects, since a single gradient is not enough if your object spans tens of irradiance volume cells. For that game this wasn't a huge problem, but it might be for other games. Additionally, it still didn't solve the problem of things like muzzle flashes requiring multiple passes of geometry for statically lit items, and at the time I was starting to look to deferred lighting approaches to use for transient, high-frequency dynamic lights in general.<br />
<br />
The artists were very happy with the lighting, particularly on characters, and we were producing good results. But at about this time, the plug on the project was pulled and I was shifted off to other duties, and eventually the company would go bankrupt and I would move on to 2K Boston. But I felt that lighting approach was viable in a production environment, and I've since seen other games making presentations on various irradiance volume systems.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com5tag:blogger.com,1999:blog-263327529581652713.post-85346385792370300322009-10-17T12:00:00.001-05:002009-10-17T13:23:35.777-05:00Where is the game architecture research?I was reading this paper on <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html">EA's internal STL implementation</a>, and it got me thinking -- where is the game architecture research?<br />
<br />
A large amount of academic research is poured into real-time graphics, <a href="http://www.etc.cmu.edu/projects/currentprojects.php">experimental gameplay and entertainment</a>, AI, and even <a href="http://www.etc.cmu.edu/projects/darkstar/">MMO server design</a>. But I find there are a number of architecture issues unique to games that are lacking in any sort of research. I've done searches and not come up with a whole lot -- maybe I'm just not using the right keywords. <br />
<br />
<b>Memory is not a solved problem for game consoles</b><br />
Most if not all garbage collection research is focused on desktop- or server-based memory usage patterns, which assume virtual memory paging. Many GC algorithms are impractical for a fixed-memory environment where utilization needs to be close to 100%. While some game engines use garbage collection, the algorithms are primitive compared to the state-of-the-art generational collectors found in desktop environments, and the wasted memory is often 10-20% of the total. Games generally cannot afford long mark or sweep phases, as they must execute at a smooth frame rate. Fragmentation can still be an issue in a fixed-memory environment, although many allocator strategies exist to combat it.<br />
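As one example of those allocator strategies: a fixed-size block pool cannot fragment, because every free block satisfies every request. A generic sketch, not tied to any particular engine:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A fixed-size block pool carved out of one up-front allocation -- the whole
// budget is committed at startup, and running out is an explicit failure
// rather than a slide into paging.
class BlockPool {
public:
    BlockPool(std::size_t blockSize, std::size_t blockCount)
        : storage_(blockSize * blockCount) {
        // thread every block onto the free list
        for (std::size_t i = 0; i < blockCount; ++i)
            freeList_.push_back(&storage_[i * blockSize]);
    }
    void* alloc() {
        if (freeList_.empty()) return nullptr;  // budget exhausted
        void* p = freeList_.back();
        freeList_.pop_back();
        return p;
    }
    void release(void* p) { freeList_.push_back(p); }
    std::size_t blocksFree() const { return freeList_.size(); }
private:
    std::vector<unsigned char> storage_;  // the fixed budget
    std::vector<void*> freeList_;
};
```

Real engines layer several of these (per block size, per subsystem), but the principle is the same: trade flexibility for a hard, predictable budget.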
<br />
<b>Multicore architectures for games </b><br />
While this is still an active area of research for desktop and server applications, too, I've found exactly <a href="http://research.microsoft.com/pubs/70655/tr-2008-152.pdf">one paper</a> that attempts some research in this area for game architectures. This is a particularly fruitful area for research since there are many competing ideas out there (message passing! software transactional memory! functional programming!), but very few researchers testing any of them in the context of building a game. It is difficult enough to make a game by itself, let alone test multiple techniques for exploiting multiple cores. I find this somewhat interesting because aside from servers and scientific processing, games are pushing the state of the art in multicore programming more than anything else.<br />
<br />
<b>Automated testing</b><br />
This is something the EA STL paper brings up -- traditional automated testing techniques break down pretty quickly beyond unit testing lower level libraries. So much of the end result of game code is subjective and emergent that determining how to test even basic functionality automatically is a huge unsolved problem. This results in a large amount of manpower being used for game testing, particularly in the area of regression testing.<br />
<br />
This research is being done as a part of production by many companies inside the industry. But it is always going to be the case that in a production environment, you just aren't going to have the time and resources to, say, try three different approaches to multicore architecture and compare them. Generally you make an educated guess and hope for the best. Additionally, because much of this research is done as part of product development, rarely are the results published, which means we're all out there doing the same work over and over.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com5tag:blogger.com,1999:blog-263327529581652713.post-61277750021164621182009-10-04T10:02:00.001-05:002009-10-04T10:08:08.666-05:00An Ode to the GPU. Hopefully not an Epitaph.<div>The <a href="http://solid-angle.blogspot.com/2009/10/im-afraid-grass-is-not-greener.html">last entry</a> got me thinking about one area of game programming that has gotten unequivocally better over the last ten or fifteen years: graphics programming. From the advent of the GPU to programmable pipelines to the debugging and profiling tools available, things are for the most part way easier today than they were even five years ago.<br />
</div><div><br />
</div><div>I am not a graphics programmer. I'm a generalist who often finds himself programming graphics. So there are certainly gaps in the last ten or fifteen years where I wasn't really writing anything significant in graphics. There's a large gap between fixed-function GPUs and the introduction of HLSL -- I don't think I've ever written assembly-level pixel shaders, for example. <br />
</div><div><br />
</div><div>While I do remember doing a lot of OpenGL in the early days of fixed-function, I didn't do much multipass rendering on fixed function hardware, where companies like Id essentially faked a programmable pixel pipeline with texture and blend ops. Frankly, I thought during that era it was more about fighting the hardware than interesting techniques -- the amount of bs you had to put up with made the area unattractive to me at the time.<br />
</div><div><br />
</div><div>Languages like HLSL and Cg piqued my interest in graphics again, and when you think about it, are a pretty impressive feat. They allow a programmer to harness massively parallel hardware without having to think about the parallelism much at all, and the last few years have been more about interesting algorithms and more efficient operations than about fighting hardware capabilities. Sure, you still run up against the remaining fixed function parts of the pipeline (namely, blending and texture filtering), but those can be worked around.<br />
</div><div><br />
</div><div>The tools have improved year over year. On the PC, things like PerfHUD have slowly gotten better, with more tools like it being made all the time. The gold standard still remains PIX on the 360 -- so much so that many programmers I know will do an implementation of a new graphics technique first on the 360 just because it is so easy to debug when things go wrong. <br />
</div><div><br />
</div><div>So let me just praise the GPU engineers, tools makers, and language and API designers who have done such a good job of taking a hard problem and making it constantly easier to deal with. I think it is rare to get such productivity gains for programmers in any area, and we shouldn't take for granted when it happens.<br />
</div><div><br />
</div><div>This is also why the dawn of fully programmable graphics hardware <a href="http://solid-angle.blogspot.com/2009/08/diminishing-returns.html">makes me nervous</a>. Nvidia recently <a href="http://www.pcmag.com/article2/0,2817,2353608,00.asp">announced the Fermi architecture</a>, which will allow the use of C++ on the GPU. Nvidia, AMD/ATI, and Intel are all converging on GPU architectures that allow more and more general computing, but is C++ really the answer here? <br />
</div><div><br />
</div><div>HLSL and its ilk make concurrent programming easy. The same cannot be said for C++. While an architecture that exposes more of the GPU's underlying threading model certainly allows for a wider array of approaches, what is the cost? Are we so blinded by the possibilities that we forget the DirectX/OpenGL model is one of the few successes at hiding concurrency from programmers?<br />
</div><div><br />
</div><div>I have not really done much with CUDA or compute shaders, so perhaps I am being hasty in judgement. But when I see Intel or Nvidia touting that you can use C++ on their GPUs, I get a little worried. I am not sure that this will make things better, and in fact, may make things very much worse. <br />
</div><div><br />
</div><div>Am I just paranoid?<br />
</div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-79140251326249884422009-10-03T14:22:00.003-05:002009-10-03T14:41:53.354-05:00I'm Afraid the Grass is not GreenerI started reading <a href="http://www.codersatwork.com/">Coders At Work</a>, and wow, it's rare that you run across a book about programming that's a page-turner, but this is it. I'm not very far into it, but a quote from Brad Fitzpatrick (LiveJournal, memcached, PerlBal) caught my attention. The context is he is bemoaning how it seems like computers are worse than they were ten years ago, that they feel slower even though under the hood they are faster, etc. Then this question and answer comes up:<div><br /></div><div><blockquote><b>Seibel</b>: So maybe things are not as fast as they ought to be given the speed of computers. But ten years ago there was no way to do what people,as users, can do today with Google.<br /><br /><b>Fitzpatrick:</b> Yeah. So some people are writing efficient code and making use of it. I don't play any games, but occasionally I'll see someone playing something and I'm like, "Holy shit, that's possible?" It just blows me away. Obviously, some people are doing it right. </blockquote><br /></div><div>We are? The funny thing is, I'm not sure a lot of game programmers would feel that we are doing things right. We work with imperfect middleware and engines, with hacks upon hacks piled upon them, all until the game we are working on is no longer broken and actually fun to play. We have code in our games we would describe as "shit" or "crap" or "I can't believe we shipped with that." 
When I was just starting out, I thought maybe it was just the games I was working on that had this problem, but any time you talk to anyone at other companies, it is the same story -- from the most successful games to the smallest ones, we can all list a huge litany of problems in the code bases in which we work or have written. </div><div><br /></div><div>It's interesting reading this book because at least the first two programmers I've read are in totally different worlds than game development. Not better or worse, just different. The problems and constraints they have are somewhat alien to me. </div><div><br /></div><div>I've often heard game developers say things like "game development is five to ten years behind the state of the art in 'straight' programming", referring to process or my least favorite term, "software engineering." I may have even said it myself. </div><div><br /></div><div>The game industry often does a lot of navel gazing (like this entry!). We are constantly comparing ourselves to movies, theme parks, or how the rest of programmers work. Maybe we've got it all wrong. Maybe all along we've been figuring out how programming for <i>games</i> needs to work. If the world that Brad Fitzpatrick lives in feels alien to me and vice versa, then why would we ever think that the processes or techniques that work in one are automatically going to work for the other? </div><div><br /></div><div>Food for thought.</div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com1tag:blogger.com,1999:blog-263327529581652713.post-70765244053370724052009-09-23T22:38:00.006-05:002009-09-23T23:27:35.655-05:00SafetyI recently came across two articles that tangentially talk about the same thing -- technologies that are safe. Safe as in usable and not likely to get yourself in trouble.<div><br /></div><div>The first was <a href="http://www.dadhacker.com/blog/?p=1132">30 years of C</a> over at DadHacker. 
The second is a Joel on Software article (nice to see him actually writing about technology instead of pimping FogBugz or whatever he's selling these days) called <a href="http://www.joelonsoftware.com/items/2009/09/23.html">The Duct Tape Programmer</a>. <div><br /></div><div>Anyway, I thought I'd write some of my opinions of the language features mentioned in these two articles. For those of you who've known me a while, it just may surprise you where my thoughts have evolved over the years.</div><div><br /></div><div>Let's cover the C++ features:</div><div><b><span class="Apple-style-span" style="font-weight: normal;"><br /></span></b></div><div><b>Exceptions</b> - While I have no problems with exceptions in a language like Java or C#, in C++ they just don't work well. In games we turn them off for code size and performance reasons, but I would tend to avoid them in C++ even if there was zero hit in either area. It is just too difficult to write exception-safe code in C++. You have to do extra work to do it, and the things that can break are sometimes very subtle. Most importantly, the entire culture and ecosystem built around the language is not exception-friendly. Rare is the library that is exception-safe in my experience. So just say no.</div><div><br /></div><div><b>RTTI</b> - Not very useful in practice. Again, there are overhead concerns in games, although most games I've seen end up rolling their own. But the base implementation is rather inflexible -- it is reflection of only the most basic sort, and often in the places you do need run-time type information, you need a lot more than just class ids. It's a feature with its heart in the right place but it just doesn't come together very well. I think part of the problem is its all-or-nothing nature -- usually only portions of my architecture need any sort of reflection, and I don't want to pay for it on all the other classes. 
</div><div><br /></div><div><b>Operator Overloading</b> - Rarely useful outside of math libraries. I'm not even a huge fan of the iostreams model, to tell the truth. </div><div><br /></div><div><b>Multiple inheritance</b> - Only with pure virtual interfaces, and even then it should be used rarely and avoided if possible. Sharing implementation via inheritance goes awry enough in single inheritance; adding more base class chains just makes the problem worse.</div><div><br /></div><div><b>Templates</b> - The big one. I'll admit to having a love affair with templates in my twenties. What can I say? They were fun and a shiny new toy. I sure had some excesses, but even my worst one (a cross-platform file system library) shipped in multiple products. Even then I hid them all behind a straight-C API, so only programmers who had to either debug or extend the library innards had to deal with the templates. If I had to do it again, I'd probably do it differently, but I could say that about any code I've written in my career, whatever the language. I do know that it was an improvement over the previous file system library that was in use, because the new one actually worked.</div><div><br /></div><div>I can say with a degree of certainty that template metaprogramming is a bust for practical use. There are a few major problems with it: the language isn't really built for it (it's more a clever side effect than anything), there is no good way to debug it, and functional programming isn't very ingrained in the game development culture. Ironically, I think the last part is going to have to change as parallel programming creeps into larger and larger sections of the architecture, but that won't make template metaprogramming practical.</div><div><br /></div><div>In any case, these days templates are just a tool in the toolbox, and not one I reach for that often. 
The code bases I've been working in recently all roll their own template container libraries* (provided for us by external vendors), and they do the job. My experience with code sharing via templates is that more often than not it isn't worth the trouble, but sometimes it is. Like anything we do, it is a tradeoff, and one I don't necessarily feel particularly passionate about either way.</div><div><br /></div><div>*A somewhat amusing side note: I've done performance and code generation tests with one of the hand-rolled template container libraries I've encountered versus STL. STL came out on top for a lot of simple things like loop iteration overhead or sorting, on all the platforms I was interested in. Of course, I'm not about to rewrite hundreds of thousands of lines of code to use STL, and STL still is horrible for memory management. But I suppose that underscores the point "30 years of C" made -- even something as simple as a container library is hard to get right, even for experts. Which library I'm talking about shall remain anonymous for its own protection.</div></div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-10073461651143398352009-09-23T08:14:00.004-05:002009-09-23T08:43:46.162-05:00The Other Cost of Code BloatThe other day I almost wrote a redundant version of the exact same class that someone else on my project had written. In fact, if I hadn't asked this person a couple of general C# questions, and he hadn't put two and two together, I probably would have written that redundant class. Good detective work on his part, and shame on me for not doing a search of the code base to see if someone else had already tackled this problem. 
While I've got a pretty good feel for the C++ which makes up the majority of code in our engine/tools, I haven't looked at the C# side as much as I probably should have.<div><br /></div><div>As the code bases we write get larger and larger, and the team sizes we deal with get larger and larger, the question of how to avoid this scenario becomes an important one. Ideally you hire programmers who perform the necessary code archeology to get a feel for where things are in the code base, or who will ask questions of people more familiar with the code when unsure. Getting a code base of a million or more lines "in your head" takes time, though. I've been working with our licensed engine for about four years now, and there are still nooks and crannies that are unfamiliar to me. </div><div><br /></div><div>Better documentation should help, but in practice it is rarely read if it even exists. This is because such documentation is usually either nonexistent or, if it does exist, horribly out of date. With a licensed engine, you are at the mercy of the little documentation you are provided, and at the end of the day, the code itself is the best documentation.</div><div><br /></div><div>A sensible architecture with clear delineation of what should go where is often a bigger help. Knowing [where to look] is half the battle, said a Saturday morning cartoon show. Again, with a licensed engine, you are at the mercy of what you are provided. Finding existing functionality usually comes down to experience with the code base and code archeology skills. </div><div><br /></div><div>Recently, Adrian Stone has been writing an excellent series on <a href="http://gameangst.com/?p=46">minimizing code bloat</a>. Now while the techniques he describes aren't really about eliminating actual code, but rather redundant generated and compiled code, the mindset is the same when you are removing actual lines of code. 
Aside from the important compile time, link time, and executable size benefits, there is another benefit to removing as much code as you possibly can -- the code will occupy less "head space." </div><div><br /></div><div>Unused or dead code makes it that much harder to do code archeology. Dead code certainly can make it more difficult to make lower level changes to the engine or architecture, as it is one more support burden and implementation difficulty. In the past, removing large legacy systems (whether written internally or externally) has had unexpected benefits in simplifying the overall architecture -- often there are lower level features that only exist to support that one dormant system. </div><div><br /></div><div>One of my favorite things to do is delete a lot of code without the tools or game losing any functionality. It's not only cleaning out the actual lines of code, but also clearing the corresponding head space, that is a wonderful feeling -- "I will never have to think about X again." With the scale of the code bases we deal with today, we don't have the brain power to spare on things we don't need.</div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com2tag:blogger.com,1999:blog-263327529581652713.post-13170561102718711372009-09-21T08:43:00.003-05:002009-09-21T09:15:49.392-05:00Rogue ProgrammingGamasutra had an interesting article today titled <a href="http://www.gamasutra.com/view/feature/4133/gaming_the_system_how_to_really_.php">Gaming the System: How to Really Get Ahead in the Game Industry</a>. I found it probably had more to say about the political dysfunction that can often accompany game development than about being successful. To put it another way: if you find yourself having to follow the sneakier guidelines in this article too much, then you might want to consider a change in where you work.<div><br /></div><div>The programming section is titled "Just Do It" and does have some truth to it. 
One of my leads and I came up with the term "rogue programming" for what he describes, which was half-joke, half-serious. Here's a quote:</div><div><br /></div><div><blockquote>As a programmer, it's not uncommon to see problems that you think should be fixed, or to see an opportunity to improve some piece of code, or speed up a process that takes a lot of time. It's also not uncommon for your suggestion to be ignored, or dismissed with an "it's not broke, so let's not fix it" response...<br /><br />What should you do? You should just do it -- on your own time. </blockquote><br /></div><div>This advice is fraught with risk, because here's a hard-earned lesson for you: you don't always know best. I know, I know, you're a superstar hotshot programmer, and you see something that is broken, so it must be fixed. Sure, it's not in the schedule, but it'll just take a few hours -- what's the harm? The code base will be so much better when you're done, or the artists and designers will have a feature they didn't have before. How can that <i>not</i> make the project better?</div><div><br /></div><div>Let me give a cold-water splash of reality: when it is all said and done at the end of the project, you're going to ship with a lot of broken code. I'm not talking about obvious bugs in the shipped project, I just mean nasty, hack-filled, just-get-it-out-the-door brokenness in the code base, and some of that code will be code that you wrote. If this weren't true, then a long-lived project like the Linux kernel wouldn't still have thousands of developers contributing to it -- obviously, there is still stuff that is "broken" and can be improved!</div><div><br /></div><div>So in the big picture, a single section of brokenness is not going to make or break your project, and usually, there are bigger fish to fry on any given day, and it's best to fry them. 
Because if your project is cancelled because a major feature was late, will it matter that you cleaned up the way you calculated checksums for identifiers?</div><div><br /></div><div>That said, if after all of this you still think something is worth doing, let me tell you how to successfully rogue program:</div><div><br /></div><div>First, and most importantly, let's define "on your own time." On your own time means you are hitting your scheduled work on schedule, and that will not change if you fix or implement this one thing. If you're behind on your scheduled work, then you really shouldn't be doing any rogue programming. Whether not impacting your schedule means you work a Saturday, do some exploration at home, or just have some slack in your schedule you'd like to exploit, if you don't complete the tasks you are supposed to be working on, you've done more damage than whatever improvement you're making on the side could offset.</div><div><br /></div><div>Additionally, you need co-conspirators. These days, programming is a very collaborative process, and for the most part, the cowboy mentality is a dying thing. If you talk to your lead or other engineers about a problem ("Hey, X is pretty f'd up") and no one else agrees, or you can't make the case, then hey, maybe X really isn't that important! You really want to be working with a group of people you can convince with sound arguments that something is a problem, and a lot of the time a little discussion about a problem can turn into a scheduled task -- and no rogue programming is needed. </div><div><br /></div><div>Often you'll be faced with something that everybody agrees *should* be done, but there's no time to do it. In these cases, I've found that with good leads (which I've been blessed with the last few years), you can get tacit approval to do something "on your own time." This often takes trust -- I wouldn't recommend going off and "just doing it" until you've earned that trust. 
</div><div><br /></div><div>If you've gotten this far, you're in pretty good shape to go off and do some "rogue programming" -- because at this point (and this is where the joke comes in), it really isn't rogue at all. </div><div><br /></div><div>Now if you're at a company where you constantly feel like you need to "go around people" to "just do things," then maybe you really do need a change of venue, because that is not a healthy team. I happen to know <a href="http://www.2kboston.com/jobs.php">someone who is hiring</a>.</div>Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com4tag:blogger.com,1999:blog-263327529581652713.post-36658323143585302942009-08-27T08:36:00.004-05:002009-08-27T10:04:42.181-05:00Big O doesn't always matterThe other day I was optimizing a bit of debug code which verifies the integrity of objects in memory. The details aren't super-important, but the gist is that it's a function that runs periodically and makes sure that objects that should have been garbage collected were indeed purged from memory.<br /><br />I don't make a general habit of optimizing debug code, but this was a special case -- before, this process only ran in the debug build. Artists and designers run a "development" build, which is an optimized build that still includes assertions and many other development checks.<br /><br />We recently ran into a bug that would have been detected much earlier if this process had been running in the development build. While programmers run the debug build almost exclusively, we tend to stick to simpler test levels. Trying to debug an issue on a quick-loading, small level is much easier than on a full-blown one.<br /><br />The algorithm is pretty simple -- objects have somewhat of a tree structure, but for various reasons they only have parent links and not child links. For objects at the top-level of the tree, we know for sure whether they should be in memory or not. 
Objects at the bottom of the tree keep all their parents in memory if they are connected to the reference graph. So the debug check looks at each object and verifies that it is not parented (via an arbitrarily long parent chain) to an object which should have been purged.<br /><br />The first thing I did was measure how long the process was taking, and did some lower-level profiling to get an idea of where time was spent. Most importantly, I also saw where I was running into cache misses.<br /><br />The first pass of optimization -- the original loop was doing a lot of work per-object that was simply unnecessary. This was because it was using a generalized iterator that had more functionality than needed for this specific case -- for most operations, particularly at editor time, this extra overhead is not a big deal. Removing this extra work sped up the process, and it now took about 90% of the time of the original.<br /><br />I then tried some high-level optimizations. There were two things I tried. First, the inner loop linearly checked each high-level object against an unsorted array of objects we know should be purged. I replaced this with a hash table from our container library. Second, I realized that a <a href="http://en.wikipedia.org/wiki/Memoization">memoizing</a> approach should help here -- since I'm dealing with a tree, I could use a bit array to remember if I've already processed a parent object and deemed it OK. This would allow me to cut off traversal of the parent chain, which should eliminate a lot of work. Or so I thought.<br /><br />The new algorithm was faster, but not by much -- only 85% of the original running time. The additional complexity was not worth 5% of running time, so I went back to the simpler approach. This isn't unusual in optimization -- you can often try something you think will be a big help, only to have it turn out not to matter much. 
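For concreteness, here's a rough sketch of the memoized parent-chain check. The names and types are hypothetical, and the bit array described above is simplified to a hash set:

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical object with only a parent link (no child links), as described above.
struct Object {
    const Object* parent = nullptr;
};

// Returns true if obj is parented, via an arbitrarily long chain, to an object
// that should have been purged. 'purged' holds the known-purged top-level
// objects; 'known_ok' memoizes objects already verified so later checks can
// cut off traversal of the parent chain early.
bool chainTouchesPurged(const Object* obj,
                        const std::unordered_set<const Object*>& purged,
                        std::unordered_set<const Object*>& known_ok)
{
    std::vector<const Object*> visited;
    for (const Object* p = obj; p != nullptr; p = p->parent) {
        if (known_ok.count(p))
            break;                 // this subchain was already verified
        if (purged.count(p))
            return true;           // leaked: still chained to a purged object
        visited.push_back(p);
    }
    // Everything we walked checked out; remember it for later objects.
    known_ok.insert(visited.begin(), visited.end());
    return false;
}
```

With a broad-but-shallow tree, the memoization cutoff rarely fires, which is why it bought so little in practice.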
I've made mistakes in the past where I stuck with the more complicated implementation for a marginal gain -- but it was not worth it, and it made other optimizations that might have had a bigger impact harder to do.<br /><br />As far as why the gain wasn't that much: The unsorted array was relatively small (a handful of elements), so a linear search was faster because it was simpler and had better cache behavior than the hash table implementation I was using. The tree structure of the objects was broad but not deep, so it's obvious in hindsight why memoization would not be a win.<br /><br />Now, one thing that is nice to have is a decent container and algorithm library. I have that at my disposal, so implementing these two changes was a matter of minutes instead of hours. With that kind of arsenal, it is easy to try out algorithmic changes, even if they end up not working out.<br /><br />At this point, I took another look at the cache behavior from my profiling tools. I tried something incredibly simple -- prefetching the next object into the cache while I was processing the current one. This resulted in the process now running at 50% of the time of the original -- a 2X speedup, and likely fast enough for me to enable this in the development build. I'm going to measure again, and see if there are any other easy wins like this to be had.<br /><br />The processors we use are fast -- incredibly fast, and even with branch penalties on the in-order processors of today's consoles, they can still do a lot of work in the time it takes to retrieve data from main memory. So while on paper, I'm using "slow" algorithms with worse big-O complexity, in practice, your memory access patterns can easily drown out any extra calculation. 
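For illustration, the prefetch change amounts to only a couple of lines in the loop. A sketch, with a placeholder Object type and validation check standing in for the real ones (__builtin_prefetch is a GCC/Clang intrinsic; other compilers have equivalents such as _mm_prefetch):

```cpp
#include <cstddef>
#include <vector>

struct Object {
    int checksum;  // placeholder for the real per-object state being verified
};

// Walk a list of pointers to objects scattered through memory, prefetching
// object i+1 while validating object i so the load overlaps the current work.
int countInvalid(const std::vector<Object*>& objects)
{
    int bad = 0;
    for (std::size_t i = 0; i < objects.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 1 < objects.size())
            __builtin_prefetch(objects[i + 1]);  // start pulling the next object into cache
#endif
        if (objects[i]->checksum != 0)           // stand-in for the real integrity check
            ++bad;
    }
    return bad;
}
```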
The key, as always, is to measure and test your theories, and not just assume that any given approach will make something faster.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-29938772218752444822009-08-24T00:41:00.005-05:002009-08-26T20:02:07.004-05:00Stencil states for rendering light volumesIn the ShaderX 7 article "Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer", the author describes a number of approaches for rendering the lights into the lighting buffer. These are all pretty standard approaches for any deferred technique, but I thought the description of using stencil did not explain how to set up the stencil states very clearly. This was probably due to space constraints.<br /><br />The way it is worded implies that you still need to change the depth comparison function. This is not the case -- and avoiding that change is most of the point of the technique. As the article points out, changing the depth test makes many GPUs take their early-Z rejection and go home.<br /><br />I'm sure you can find this detail elsewhere on the net, but my cursory searches did not find anything, and hopefully this will save at least one person some time. Standard caveats apply: I haven't extensively tested this stuff. 
<br /><br />Assuming convex light volumes, this is what I found worked well:<br /><br /><code style="font-size: 12px;"><span style="color:black"><br></span><span style="color:green">// render backfaces so that only pixels in front of the backface have stencil incremented<br></span><span style="color:black">AlphaBlendEnable </span><span style="color:blue">= false<br></span><span style="color:black">StencilEnable </span><span style="color:blue">= true<br></span><span style="color:black">ColorWriteChannels </span><span style="color:blue">= </span><span style="color:black">None<br>CullMode </span><span style="color:blue">= </span><span style="color:black">Clockwise<br>DepthBufferEnable </span><span style="color:blue">= true<br></span><span style="color:black">StencilFunction </span><span style="color:blue">= </span><span style="color:black">Always<br>StencilPass </span><span style="color:blue">= </span><span style="color:black">Keep<br></span><span style="color:green">// If a pixel is in front of the volume backface, then we want it lit<br></span><span style="color:black">StencilDepthBufferFail </span><span style="color:blue">= </span><span style="color:black">Increment<br><br></span><span style="color:green">// render volume<br><br>// render frontfaces so that any pixel in back of them has stencil decremented<br></span><span style="color:black">CullMode </span><span style="color:blue">= </span><span style="color:black">CounterClockwise<br></span><span style="color:green">// pass stencil test if reference value < buffer, so we only process pixels marked above. <br>// Reference value is 0. 
This is an optimization rather than a strict requirement<br></span><span style="color:black">StencilFunction </span><span style="color:blue">= </span><span style="color:black">Less<br></span><span style="color:green">// If a pixel is in front of the volume frontface, then it is not inside the volume<br></span><span style="color:black">StencilDepthBufferFail </span><span style="color:blue">= </span><span style="color:black">Decrement</span><span style="color:gray"><br><br></span><span style="color:green">// render volume<br><br></span><span style="color:black">AlphaBlendEnable </span><span style="color:blue">= true<br></span><span style="color:black">ColorWriteChannels </span><span style="color:blue">= </span><span style="color:black">RGB<br></span><span style="color:green">// only process pixels with 0 < buffer<br></span><span style="color:black">StencilFunction </span><span style="color:blue">= </span><span style="color:black">Less<br></span><span style="color:green">// zero out pixels so we don't need a separate clear for the next volume<br></span><span style="color:black">StencilPass </span><span style="color:blue">= </span><span style="color:black">Zero<br></span><span style="color:green">// don't want to do anything if we fail the depth test<br></span><span style="color:black">StencilDepthBufferFail </span><span style="color:blue">= </span><span style="color:black">Keep<br><br></span><span style="color:green">// render a screen space rectangle scissored to the projection of the light volume<br></span></code><br /><br />Note that unlike shadow volumes, the light volume intersecting the near plane is not a concern here. We are rendering the frontfaces to find pixels that are in front of the light volume -- if parts of the light volume are in front of the near plane, by definition any pixels we're rendering are in back of those parts. So there is no need to render a cap in this case.<br /><br />The light volume intersecting the far plane is a concern. 
One way to handle this case is to use a <a href="http://www.terathon.com/gdc07_lengyel.ppt">projection matrix with an infinite far plane</a>, like shadow volumes do. Another way to handle it would be to detect this case and not use the stencil approach at all, instead rendering a screen space rectangle scissored to the light bounds. <br /><br />Finally, I've had better luck switching to rendering the backfaces without depth testing when the camera is inside the light volume, instead of using a screen space rectangle. But I think this has more to do with a bug in my scissoring code than with any fundamental problem!Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com3tag:blogger.com,1999:blog-263327529581652713.post-22432246402458410302009-08-23T23:28:00.006-05:002009-08-24T00:40:46.824-05:00XNA: at times awesome, at times frustratingI'm not sure why, but Microsoft seems intent on crippling XNA for the 360. Perhaps they want to sell more dev kits.<br /><br />I recently had some more time to work on my <a href="http://solid-angle.blogspot.com/2009/06/leaky-abstractions-in-xna.html">little toy project</a>. After some work, I've now got a deferred lighting implementation on the PC.<br /><br />For the lighting buffer construction, at first I was using a tiled approach similar to Uncharted, which did not require blending during the lighting stage. It did work for the most part, and allowed me to use LogLUV for encoding the lighting information, which was faster. But it had issues - I didn't have any lighting target ping-ponging set up, so I was stuck with a fixed limit of seven lights per tile. Also, even with smallish tiles, you end up doing a lot of work on pixels not actually affected by the lights in question. 
So I wanted to compare it to a straightforward blending approach, so I switched back to an FP16 target and rendered the light volumes directly (using the stencil approach detailed in ShaderX7's Light Pre-Pass article).<br /><br />So this all worked great and my little toy is rendering 100 lights. Of course, on the 360, there's a problem. Microsoft, in its infinite wisdom, decided that the FP10 buffer format on the 360 would blow people's minds, so <a href="https://connect.microsoft.com/feedback/ViewFeedback.aspx?FeedbackID=343887&SiteID=226">it is not supported in XNA</a>. They are using an actual FP16 target, which does not support blending. <br /><br />So I guess it is going to be back to alternate lighting buffer encoding schemes, bucketing, and render target ping-ponging for me. It's not a huge deal, but it is frustrating.<br /><br />It is a real shame that XNA gives the impression that the 360 GPU is crippled, when in reality it is anything but. Couple the lack of FP10 support with the inability to sample the z-buffer directly, and the lack of control of XNA's use of EDRAM, and they've managed to turn the 360 into a very weak, very old PC.<br /><br />Least common denominator approaches generally haven't fared that well over the years. An XBLA title implemented in XNA is going to be at a fundamental disadvantage -- I don't think you are going to see anything approaching the richness of Shadow Complex, for example. <br /><br />At the end of the day, Microsoft needs to figure out where they are going with XNA. If they are going to dumb it down and keep it as a toy for people who can't afford a real development kit (people who've been <a href="http://mynameismjp.wordpress.com/category/xna/page/2/">bumping into these low ceilings much longer than me</a>), then they should keep on their current path. <br /><br />The potential for XNA is really much more, though. Today I wrote a pretty decent menu system in about 45 minutes that handles gamepad, keyboard, and mouse input seamlessly. 
I don't think I could write that in C++/DirectX anywhere near as fast. If you start looking down the road to future generations of hardware, I'm not worried about the overhead of C# being fundamentally limiting. Games today already use much less efficient scripting languages than C#, and while you are limited to the heavy lifting Microsoft has chosen to implement for you today, who is to say that a future version of XNA couldn't allow shelling out to C++ for really performance-intensive stuff? <br /><br />XNA has a chance to become something really great that would be very powerful for a large class of games. It remains to be seen if Microsoft will let it.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0tag:blogger.com,1999:blog-263327529581652713.post-23738432734254450882009-08-19T09:01:00.003-05:002009-08-19T09:52:20.680-05:00One has to have had inflated expectations to experience disillusionmentA colleague sent along this item, which asks if <a href="http://blogs.msdn.com/stmteam/archive/2009/07/24/are-we-beyond-the-trough-of-disillusionment.aspx">Transactional Memory is beyond the "trough of disillusionment"</a>.<br /><br />I've never had any expectations that STM would be some silver-bullet solution to concurrency, and from the get-go viewed it as just another tool in the toolbox. Granted, it is a technique that I haven't had much practical experience with yet -- it's on my TODO list. <a href="http://graphics.cs.williams.edu/archive/SweeneyHPG2009/">Others</a> might disagree with me, but I'm not even sure how much of a factor it is going to be in writing games. 
Of course, if some major piece of middleware is built around it, I suppose a lot of people will end up using STM, but that doesn't necessarily make it a good idea.<br /><br />The latest piece of evidence against STM as a silver bullet comes from conversations I've had with colleagues and friends who have a lot of experience building highly-scalable web or network servers. STM advocates hail transactions as a technique with decades of research, implementation, and use. About this they are correct. The programming model is stable, and the problems are well known. But what has struck me is how often my colleagues with much more experience in highly-scalable network servers try to avoid traditional transactional databases. If data can be stored outside of a database reliably, they do so. There are large swaths of open source software devoted to avoiding transactions with the database. The main thrust is to keep each layer independent and simple, and talk to a database as little as possible. The reasons? Scalability and cost. Transactional databases are costly to operate and very costly to scale to high load. <br /><br />I found the link above a little too dismissive of the costs of STM, particularly regarding memory bandwidth. I've already discussed the <a href="http://solid-angle.blogspot.com/2008/12/magical-missteps-and-memory-wall.html">memory wall</a> before, but I see this as a serious problem down the road. We're already in a situation where memory access is a much more serious cost to performance than the actual computation we're doing, and that's with a small number of cores. I don't see this situation improving when we have 16 or more general-purpose cores.<br /><br />A digression about GPUs. GPUs are often brought up as a counter-argument to the memory wall as they already have a very large number of cores. GPUs also have a very specialized memory access pattern that allows for this kind of scalability -- for any given operation (i.e. 
draw call), they generally have a huge amount of read-only data and a relatively small amount of data they write to compared to the read set. Those two data areas are not the same within a draw call. With no contention between reads and writes, they avoid the memory issues that a more general-purpose processor would have.<br /><br />STM does not follow this memory access model, and I do not dismiss the concerns of having to do multiple reads and writes for a transaction. Again, we are today in a situation where just a single read or write is already hideously slow. If your memory access patterns are already bad, spreading them out over more cores and doubling or tripling the memory bandwidth isn't really going to help. Unlike people building scalable servers, we can't just spend some money on hardware -- we've got a fixed platform and have to use it the best we can.<br /><br />I don't think that STM should be ignored -- some problems are simpler to express with transactions than with alternatives (functional programming, stream processing, message passing, traditional locks). But I wouldn't design a game architecture around the idea that all game code will use STM for all of its concurrency problems. To be fair, Sweeney isn't proposing that either, as he proposes a layered design that uses multiple techniques for different types of calculations. <br /><br />What I worry about, though, is that games are often written in a top-down fashion, with the needs at the gameplay level dictating the system support required. 
If at that high level the only tool being offered is STM, with the expectation that it is always appropriate, I think it will be easy to find yourself in a situation where refactoring that code to use other methods for performance or fragility reasons will be far more difficult and expensive than if the problem had been tackled with a more general toolbox in the first place.<br /><br />Concurrency is hard, and day to day I'm still dealing with the problems of the now, rather than four or five years down the road. So I will admit I have no fully thought-out alternative to offer. <br /><br />The one thing I think we underestimate is the ability of programmers to grow and tackle new challenges. The problems we deal with today are much harder and much more complex than those of just a decade ago. Yes, the tools are better for dealing with those problems, and the current set of tools for dealing with concurrency is weak. <br /><br />That means we need to write better tools -- and more importantly, a better toolbox. Writing a lock-free single-writer/single-reader (sw/sr) queue is much harder than using one. What I want is a bigger toolbox that includes a wide array of solutions for tackling concurrency (including STM), not a fruitless search for a silver bullet that I don't think exists, and not a rigid definition of what tools are appropriate for different types of game problems.Stevehttp://www.blogger.com/profile/15315525780089168677noreply@blogger.com0
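As a footnote to the sw/sr queue comment above, here is roughly what such a single-writer/single-reader queue looks like -- a sketch for illustration, not a production implementation (real ones also worry about cache-line padding and power-of-two capacities):

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-writer/single-reader (SPSC) lock-free queue.
// Fixed-capacity ring buffer: exactly one thread may call push(), and
// exactly one thread may call pop(). Holds at most N-1 elements.
template <typename T, std::size_t N>
class SpscQueue {
public:
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                           // full
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);  // publish to consumer
        return true;
    }

    bool pop(T* out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                           // empty
        *out = buffer_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

Even in this tiny sketch, the acquire/release pairing is easy to get subtly wrong -- which is exactly the point about wanting these tools written once, carefully, and put in the toolbox.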