Saturday, December 4, 2010

Is Data-Oriented Design a Paradigm?

Recently there has been quite the flurry of tweets about OOP (object-oriented programming) and DoD (data-oriented design). If you're unfamiliar with DoD, here's a nice presentation. If you're unfamiliar with OOP, I'd like to know what cave you've been living in for the last few decades.

DoD has caught on with game programmers because it puts a name to something anyone who has spent time optimizing a game already knew -- your data access patterns have a much bigger impact on your performance than the actual code you execute. I remember many an optimization session on Stranglehold where a reduction in L2 cache misses led to a perfectly correlated reduction in execution time.

DoD goes further in that it presents a set of guidelines for writing code up front that will perform well given the reality of the memory-processor speed gap. This sets it apart from a simple optimization technique, as it is something you can apply before the fact rather than after. Follow these guidelines and your program will perform better.

Dino Dini argues that this is nothing new, that game programmers have been doing this for decades. He's right. The underlying concepts are not that new, but giving it a name and a simple package of guidelines is new.  This has value, I think, because it helps educate programmers about these concepts. I am not discounting anyone's effort in this area, because I think a lot of programmers need to learn these concepts.

That said, I don't think DoD approaches what one would call a programming paradigm. The consensus definition of programming paradigm is a "fundamental style of programming." It certainly is a style of programming, but I don't think it is fundamental.

While I put on my flame retardant, let me explain what I mean. Structured/procedural and OOP are two programming paradigms that historically grew out of the need to manage software complexity. These are paradigms in which you could organize an entire code base. They provide methods for abstraction and layered design.

DoD says nothing about code complexity. It does not describe how to organize your entire code base. No matter what happens with the processor-memory gap, code complexity is a huge problem for any large project. DoD offers no tools for managing this complexity.

I can imagine a code base completely organized around the structured paradigm (and many exist). The same with OOP.  Many real world code bases mix a little bit of both paradigms -- platform APIs tend to be structured, application architecture these days tends to be OOP. 

I can see how DoD fits into either of these paradigms. I don't know what a code base completely organized around DoD would look like. I don't think that's even a question that makes sense, as it is not tackling the same set of problems. 

This is fine, and does not take away from DoD at all. In fact, I think it frees us to discuss the realities of writing software for today's hardware without having to waste time arguing about OOP vs DoD. They are apples and oranges.

Tuesday, March 9, 2010

GDC

I'll be at GDC this week. My tentative session schedule is as follows:

Thursday, March 11
9:00-10:00 -- Designing for Performance, Scalability & Reliability: StarCraft II's Approach (Room 306, South Hall)
15:00-16:00 -- Go With the Flow! Fluid and Particle Physics in PixelJunk Shooter (Room 306, South Hall)
16:30-17:30 -- God of War III: Shadows (Room 304, South Hall)

Friday, March 12
9:00-10:00 -- Code and Complexity: Managing EVE's Expanding Universe (Room 130, North Hall)
9:00-10:00 -- Taking Fluid Simulation Out of the Box: Particle Effects in Dark Void (Room 304, South Hall)
12:00-13:00 -- Light, Perception, and the Modern Shader (Esplanade Lobby, South Hall)
13:30-14:30 -- Creating the Active Cinematic Experience of Uncharted 2: Among Thieves (Room 305, South Hall)
15:00-16:00 -- The Next Generation of Fighting Games: Physics & Animation in UFC 2009 Undisputed (Room 135, North Hall)
16:30-17:30 -- APB: Creating a Powerful Customisation System for a Persistent Online Action Game (Room 135, North Hall)

Saturday, March 13
9:00-10:00 -- Three Big Lies: Typical Design Failures in Game Programming (Room 125, North Hall)
10:30-10:55 -- Texture compression in real-time, using the GPU (Room 132, North Hall)
11:05-11:30 -- R-Trees -- Adapting out-of-core techniques to modern memory architectures (Room 132, North Hall)
13:30-14:30 -- The Rendering Tools and Techniques of Splinter Cell: Conviction (Room 303, South Hall)
15:00-16:00 -- Uncharted 2: HDR Lighting (Room 305, South Hall)

I believe Irrational folk will be in and out of the bar at the Marriott quite a bit in the evenings, so if you find yourself in the vicinity and see a big guy with glasses there, that's probably me -- stop by and say hi.

Saturday, February 20, 2010

Musings on Data-Oriented Design

Lately there has been a lot on the interwebs about "Data-Oriented Design." Mike Acton tackles the problems of textbook OOP under the provocative title Typical C++ Bullshit, Sony has an excellent presentation titled Pitfalls of Object Oriented Programming, and Games from Within discusses the subject here. For any programmer wishing to write code that performs well on today's processors, I highly recommend reading all three.


The fundamental problem is pretty simple: C++ was designed during the early 80's, when the gap between processor performance and memory performance was small. Now that gap is large. Notice that the vertical scale on that graph is logarithmic -- the gap is nearly one thousand times larger than it was in the early 80's.

It is understandable that textbook OOP, which came to be under such different hardware performance characteristics, would have performance problems with today's hardware.

I've been thinking about this problem lately and my conclusion is we need better language and compiler support for the layout and access of data in systems languages. Whether that comes as modifications to C++ or as something new, I'm not going to wade into that swamp today.

Where we are


C itself is really just portable assembly language. It defines an abstract machine model but there is a pretty close mapping between C code and the assembly it generates. C++ kept this ability (as it is mostly a superset of C), but added in abstractions to help deal with large code bases. These abstractions necessarily came at a cost -- you can write C++ code that does not map very closely to the assembly it generates.

My proposition is that the data organization capabilities of both C and C++ are the equivalent of portable assembly language for data: a close mapping between the code and the data layout it generates. While the C++ standard does not actually specify a memory layout, the de facto standard in most compilers is that structures and classes, minus some inserted vtable pointers, are laid out in memory exactly as declared. Most operating system APIs depend on this fact, as you pass structures to them with strict memory layouts.

To see why this is a problem, let me make an analogy with instruction scheduling. As processors became pipelined and then superscalar, the scheduling of instructions to keep all those pipelines full became a big problem. The early C and C++ compilers did a very poor job of it, and people resorted to either reorganizing their code or dropping down to assembly language to take proper advantage. Compilers have gotten a lot better at scheduling instructions over time -- to the point that things like inline assembly hurt the ability of the compiler to reorder instructions. With the advent of compiler intrinsics, which the compiler understands and can schedule along with other instructions, you're better off sticking with C or C++ rather than using inline assembly these days. Even in C (which again, is portable assembly language), you still run into places where the compiler does not generate the machine instructions you'd like, but the tools to detect such problems are quite good and the fixes are usually localized to a particular function.

Moving over to the data side, we are constantly stuck in a space equivalent to hand-scheduling instructions. I think this is the challenge of data-oriented techniques: you are forced into a head space where you spend a fair amount of time analyzing data access patterns and rearranging code and data structures, rather than solving the actual problem your code is intended to solve. I'm sure there are people for whom this comes quite naturally (I suspect Mike Acton is one), but for me, at least, it takes a considerable amount of mental effort.

Where we need to be


As I've thought about this more, I've realized that both C and C++ fail in offering any sort of tools to help the programmer tackle the problems of data organization. If the compiler is free to reschedule instructions, should we not let it be free to reorganize our data structures?

Obviously, the compiler can not do this alone. One recurring theme in these presentations is that textbook OOP tends to focus on singular entities. A class has a virtual function that deals with late dispatch on one object. A class defines the layout for one object. Obviously, you don't have to design your classes this way -- and in fact, the above presentations argue you shouldn't. But if you find yourself fighting with or avoiding the language abstractions rather than using them, what have you gained? In that sense, C++'s abstractions hurt us because they lull us into writing code that will run horribly. We need better abstractions.

Both of these presentations move away from the model of classes that deal with one thing and move to code that deals with sets of things. If you are going to do a sphere in frustum test, you're going to be doing it on many things, not just one. Even when sets are not homogeneous, we deal with that by sorting them by type, and executing our operations in bulk on each type.

We need more than sets, though, because different operations need different views on the data. Transform update may only be concerned with the matrix of a game entity, whereas higher level AI code may have a completely different view. We want our data to be laid out optimally for some of our operations, which may mean different data is stored in different places, or we may even have multiple copies of some data in order to support different operations.

One of those views is the view we use for debugging. In our head space, we tend to think about single entities in the game world -- this projectile, this character, this mesh. Textbook OOP tends to couple class layout with this debugging head-space, and is part of the attraction -- I don't have to care about what is going on with everything else in the program, I have everything I need to know about this mesh right here.

The organization the computer needs is much different, though -- when doing frustum culling, for example, what we really want is just a big array of AABBs. When debugging why a specific mesh is being culled, though, it really helps to see all the data about that entity in one place. Otherwise, you spend a lot of time traversing data structures in the watch window, just to find out what state an object got in that caused it to flip its visible bit to false. So the view of the data that humans need is another important piece of the puzzle.

This is the limit of my current musings. I want to write code that deals with sets of things as a natural part of the language and not just some templated container library. I want to be able to specify multiple views on my data, and have the compiler use this information to generate optimal data layout for certain operations. In the debugger, I want a debugging view which is similar to the textbook OOP view. I want a language that is designed to provide these things, and will tackle data layout as an optimization problem similar to register allocation, instruction scheduling, or inlining.

Perhaps this is too radical a departure for a low-level language such as C or C++.  I would hope there are some research languages out there that do the kind of things I am talking about -- other duties have prevented me from doing anything more than a cursory literature search. Given that the processor-memory gap is only likely to get worse, I'd certainly hope there is.

Tuesday, December 22, 2009

More Stencil States for Light Volume Rendering

A while back I wrote a short entry on stencil states for light volumes. The method I posted works but relies on using a zfail stencil operation. Shortly after, I discovered that it ran considerably slower on ATI cards than on the original Nvidia card I had been writing on, and have been meaning to post an update.

On certain hardware -- specifically ATI PC hardware -- using anything but Keep in the zfail operation can disable early stencil rejection, and this caused quite a slowdown.

The solution I figured out (and I'm sure others have) was to switch to a method which relies only on stencil pass operations:

AlphaBlendEnable = false
StencilEnable = true
ColorWriteChannels = None
DepthBufferEnable = true
StencilDepthBufferFail = Keep

// Render frontfaces so that any pixel behind them has its stencil incremented.
CullMode = CounterClockwise
StencilFunction = Always
// If a pixel is behind the volume frontface, it is potentially inside the volume.
StencilPass = Increment

// render volume

// Render backfaces so that only pixels behind the backface have their stencil decremented.
CullMode = Clockwise
// Pass the stencil test if reference value < buffer, so we only process pixels
// marked above. The reference value is 0. This is not strictly necessary, but an optimization.
StencilFunction = Less
// If a pixel is behind the volume backface, it is outside the volume and should not be considered.
StencilPass = Decrement

// render volume

// Apply lighting to any pixel still marked as inside the volume.
AlphaBlendEnable = true
ColorWriteChannels = RGB
// Only process pixels with 0 < buffer.
StencilFunction = Less
// Zero out the stencil so we don't need a separate clear for the next volume.
StencilPass = Zero

// Render a screen-space rectangle scissored to the projection of the light volume.


There is a problem with this method -- if the light volume intersects the near plane, it won't work, because the portion of the light volume in front of the near plane will never increment the stencil buffer.

My solution to this was pretty simple -- if the light volume intersects the near plane, I use the zfail method from the earlier post; otherwise, I use the stencil pass method. For most lights, we're using the fastest path on both major brands of cards. I briefly scanned some papers and articles on shadow volumes (a very similar problem), hoping to find an alternate way to cap volumes intersecting the near plane, but didn't see anything that looked particularly easy to implement or likely to perform well, and this method got performance on ATI and Nvidia cards mostly on par.

What about two-sided stencil? This is a mode in DX9 where you can render both backfaces and frontfaces in one pass, with separate stencil operations on each. Because the stencil increment/decrement operations wrap around (i.e. decrementing 0 becomes 255, incrementing 255 becomes 0), ordering doesn't really matter (although you have to set StencilFunction to Always on both). I did some quick tests using two-sided stencil, and my initial results showed it was actually slower than rendering both passes separately. I didn't spend much time on it, so it's possible I simply screwed something up; I plan to revisit it at some point.

Saturday, December 12, 2009

Screen Space Spherical Harmonic Lighting

A while ago Jon Greenberg brought up the idea of accumulating lighting in screen space using spherical harmonics, in a blog entry entitled "Has anyone tried this before?"

I've been doing deferred lighting experiments in XNA, and decided to give this technique a try. Please note I'm not doing any antialiasing and all screenshots are the definition of "programmer art" cobbled together from various freely available assets.

Screenshots show full resolution deferred lighting on top and the screen space SH technique at the bottom in the Sponza Atrium by Marko Dabrovic:



The technique was pretty simple to get up and going, and produces some interesting results. The above images are with 27 point lights, 3 spot lights, and 1 directional. The directional is the only light evaluated at full resolution per-pixel, in the apply lighting stage.

The basic idea is to use a quarter-sized lighting buffer (thus, in this case, 640x360) to accumulate 4-coefficient spherical harmonics. The nice thing is you only need the depth information to do so. I used 3 FP16 buffers to accumulate the SH constants. Points and spots are evaluated by rendering the light geometry into the scene and evaluating the SH coefficients for the light direction via cube map lookup, and then attenuating as normal. For the directional light, I evaluate that in the apply lighting shader. I'm not rendering any shadows.

For diffuse lighting, it works pretty well, although due to the low number of SH coefficients, you will get some lighting wrapping around onto backfaces, which in practice just tends to give you "softer" lighting. That may or may not be desirable.

Even though the lighting buffer is quarter-sized, you don't really lose any normal detail, since SH accumulates the lighting from all directions. In my test scene, the earth models are the only ones with normal maps (deferred on the left, SH on the right):



I found that when you naively upsample the lighting buffer during the apply-lighting stage, you get halos around the edges of objects. I fixed this with a bilateral filter that is aware of depth discontinuities.

I was able to fake specular by extracting a dominant light direction from the SH, dotting that with the half vector, raising to the specular power, and multiplying that times the diffuse lighting result. It doesn't really give you great results, but it looks specular-ish. I tried using the lighting looked up at the reflected view vector but found that gave worse results.

Performance-wise, in my little XNA program, which I'd hardly call optimized, the SH lighting is about the same as deferred lighting when I store specular luminance instead of a specular lighting color in the lighting buffer. Here are some screenshots showing 388 lights (384 points, 3 spots, and 1 directional):


Note that there is at least one major optimization that could be performed when calculating the SH coefficients for a light. Currently, my SH lookup cube map is in world space, but my light vectors are calculated in view space for points and spots. This forces a matrix multiplication by the inverse view matrix in all the lighting shaders. It could probably be sped up quite a bit by calculating the SH lookup cube map in view space each frame.

All in all, it is an interesting technique. I'm not very happy with the specular results at all, and the softness of the lighting could be a benefit or a drawback depending on the look you are going for. Jon also points out that the lighting calculations could easily be moved to the CPU on some platforms, since they only depend on depth information. I'm probably not going to explore the technique much further but thought I'd post what I'd found from the limited investigation I did.

Friday, December 4, 2009

A Production Irradiance Volume Implementation Described

On a previous title I worked on, the dynamic lighting system we had could best be described as "an emergency hack." We found ourselves approaching an E3 demo without a viable dynamic lighting system -- the one in the engine we were licensing required re-rendering geometry for each light. Even using a completely separate lighting rig for dynamic objects (with a much smaller number of lights), this was not practical on the hardware and ran too slow. The engine in question would eventually have a much better dynamic lighting system, but that would not come for some time, and we needed something that worked right away.

The solution was to limit the number of lights that could affect dynamic objects and render 3 lights in a single pass. The three lights were chosen based on the strength of their contribution at the object's center point, and hysteresis was used to avoid light popping. Shadows darkened the scene instead of blocking a specific light, which is an old technique, but worked "well enough."

I was never very happy with this solution, but it was good enough for us to ship with. It was too difficult for artists to light dynamic and static objects consistently due to the separate lighting rigs, and often the dynamic lighting would not match the static lighting very well. Dynamic lights did not take occlusion into account so you could often get bleeding through walls, which would require painful light placement and light channel tweaking.

After that project shipped, I very much wanted to make a better system that would solve most of the problems. I wanted consistent results between static and dynamic lighting, I wanted a single lighting rig, and I wanted a better shadowing solution.

A colleague on another title at the same studio was getting some good results with spherical harmonic-based lighting, albeit in a completely different genre. I had also recently read Natalya Tatarchuk's Irradiance Volumes for Games presentation, and I felt that this was a viable approach that would help achieve my goals.

The way it worked is artists placed arbitrary irradiance volumes in the map. An irradiance volume stores a point cloud of spherical harmonic samples describing incoming light. In the paper, they use an octree to store these samples, but I found that was not desirable since you had to subdivide in all three axes simultaneously -- thus if you needed more sampling detail in X and Z you were forced to also subdivide in Y. Our levels weren't very vertical, so those extra samples in Y were unnecessary and just took up memory.

Instead, I used a kd-tree, which allowed me to stop subdividing an axis once it had reached an artist-specified minimum resolution.

Another problem was what heuristic to use for choosing a sample set. The original paper used a GPU-based solution that rendered depth to determine if a cell contained geometry, and if so, subdivided. The idea is that places with geometry are going to have more lighting variation. The preexisting static lighting pipeline I was working in did not lend itself to a GPU-based solution, so I took a similar approach using a CPU-side geometry database to determine if cells contained geometry. In practice, it was pretty fast.

I would subdivide in a breadth-first manner until I either hit an artist-controlled minimum sampling resolution or hit the memory budget for that irradiance volume. This allowed me to have a fixed memory budget for my irradiance data, and basically the technique would produce as much detail as would fit in that budget for the volume. I also rendered a preview of the sampling points the heuristic would produce, allowing artists to visualize this before actually building lighting.

Once I had a set of points, I sent it off to Beast to calculate both direct and indirect lighting at each sample point. Once I had the initial SH dataset, I performed some postprocessing.

The first step was to window the lighting samples to reduce ringing artifacts (see Peter-Pike Sloan's Stupid Spherical Harmonics (SH) Tricks). The amount of windowing was exposed to artists as a "smoothing parameter." I set up the toolchain so that the editor stored both the original Beast-produced SH samples (which took a minute or so to generate) and the postprocessed values. This allowed the artists to change the various postprocessing variables without recomputing the lighting, allowing for faster iteration.

The next step was to remove redundant lighting samples. Within the kd-tree, the lighting samples are arranged as a series of 3D boxes -- finding the lighting at any arbitrary point within each box is done via trilinear interpolation. Because of the hierarchical nature of the kd-tree, each level split its box into two along one of the three axes. What I would do is compare the value at a "leaf" box point with the interpolated value from the parent box -- if the difference between these two SH coefficient sets was within a certain threshold, I would remove the leaf sample. After this process is done, we are only storing lighting samples for areas that actually have varying lighting.

Samples were referenced by index into a sample array at each node of the KD-tree, which allowed me to further combine samples that were nearly identical. Finally, I encoded the sample coefficients as FP16s, to further save on memory. I was later going to revisit this encoding, as it had some decoding expense at runtime, and there probably were cheaper, better options out there.

At runtime, each dynamically lit object would keep track of what irradiance volume it was in when it moved. Transitions between volumes were handled by having the artists make the volumes overlap when placing them -- since the sample data would essentially be the same in the areas of overlap, when you transitioned there would be no pop.

A dynamically lit object would not just sample one point for lighting, but several. I would take the object's bounding box, shrink it by a fixed percentage, and sample the centers of each face. I would also sample the center point. Dynamic lights would be added into the SH coefficient set analytically. I then extracted a dominant directional light from the SH set, and constructed a linear (4 coefficient) SH gradient + center sample. Rendering a directional light + a linear SH set achieves results similar to rendering a full 9 coefficient set, and is much faster on the GPU. Bungie used this same trick on Halo 3.

The gradient allowed me to get a first order approximation of changing lighting across the dynamic object, which was a big improvement in the quality of the lighting and really helped make the dynamic lighting consistent with the static lighting. Evaluating a 4 SH gradient + directional light was about the same cost as if I'd evaluated a full 9 coefficient SH on the GPU, but produced much higher quality.

The SH set for a dynamic object was constructed on another thread, and only happened if the object moved or its set of dynamic lights changed. This allowed us to support rendering a large number of dynamic objects.

Sometimes the kd-tree subdivision heuristic would not generate detailed enough sampling for a specific area -- for these cases I allowed the artists to place "irradiance detail volumes," which let them override the sampling parameters for a specific area of the irradiance volume, either forcing more detail or using a smaller minimum sample resolution.

Finally, for shadows, in outdoor areas we used a cascaded shadow map solution for the sun, and for interior areas, supported spotlights that cast shadows. The artists had to be careful placing these spotlights as we could not support a large number of shadow casting lights simultaneously. At the time we were rendering these lights as a separate geometry pass, but I had plans to support one shadow casting light + the SH lighting in a single pass.

The end result was that for anything car-sized or smaller, with statically placed lights using the same lighting rig that produced the lightmaps, you would have a very difficult time telling which objects were static and which were dynamic. One interesting side effect that was technically a "bug" but actually helped produce good results was that samples underneath the floor would almost always be black, since no light reached them. When constructing the gradient, these samples would usually be used for the bottom face of the bounding box. In practice, though, this just made the object gradually get a little darker toward the floor -- which was not unpleasant, helped ground the object in the scene, and acted as a kind of fake AO. In ShaderX 7, the article about Crackdown's lighting describes a similar AO hack, although theirs was intentional. We decided to keep the happy accident.

The biggest issue with the system was it didn't deal very well with very large dynamic objects, since a single gradient is not enough if your object spans tens of irradiance volume cells. For that game this wasn't a huge problem, but it might be for other games. Additionally, it still didn't solve the problem of things like muzzle flashes requiring multiple passes of geometry for statically lit items, and at the time I was starting to look to deferred lighting approaches to use for transient, high-frequency dynamic lights in general.

The artists were very happy with the lighting, particularly on characters, and we were producing good results. But at about this time, the plug on the project was pulled and I was shifted off to other duties, and eventually the company would go bankrupt and I would move on to 2K Boston. But I felt that lighting approach was viable in a production environment, and I've since seen other games making presentations on various irradiance volume systems.

Saturday, October 17, 2009

Where is the game architecture research?

I was reading this paper on EA's internal STL implementation, and it got me thinking -- where is the game architecture research?

There is a large amount of academic research poured into real-time graphics, experimental gameplay and entertainment, AI, and even MMO server design. But I find there are a number of architecture issues unique to games that are lacking in any sort of research. I've done searches and not come up with a whole lot -- maybe I'm just not using the correct keywords.

Memory is not a solved problem for game consoles
Most, if not all, garbage collection research is focused on desktop or server memory usage patterns. It assumes virtual memory paging. Many GC algorithms are impractical in a fixed-memory environment where utilization needs to be close to 100%. While some game engines use garbage collection, the algorithms are primitive compared to the state-of-the-art generational collectors found in desktop environments, and the waste of memory resources is often 10-20% of total memory. Games generally cannot afford large mark or sweep phases, as they must execute at a smooth frame rate. Fragmentation can still be an issue in a fixed-memory environment, although in this case many allocator strategies exist to combat it.

Multicore architectures for games 
While this is still an active area of research for desktop and server applications too, I've found exactly one paper that attempts some research in this area for game architectures. This is a particularly fruitful area for research since there are many competing ideas out there (message passing! software transactional memory! functional programming!), but very few researchers testing any of them in the context of building a game. It is difficult enough to make a game by itself, let alone test multiple techniques for exploiting multiple cores. I find this somewhat interesting because aside from servers and scientific processing, games are pushing the state of the art in multicore programming more than anything else.

Automated testing
This is something the EA STL paper brings up -- traditional automated testing techniques break down pretty quickly beyond unit testing lower level libraries. So much of the end result of game code is subjective and emergent that determining how to test even basic functionality automatically is a huge unsolved problem. This results in a large amount of manpower being used for game testing, particularly in the area of regression testing.

This research is being done as a part of production by many companies inside the industry. But it is always going to be the case that in a production environment, you just aren't going to have the time and resources to, say, try three different approaches to multicore architecture and compare them. Generally you make an educated guess and hope for the best. Additionally, because much of this research is done as part of product development, rarely are the results published, which means we're all out there doing the same work over and over.