Monday, March 3, 2014

BioShock Infinite Lighting

Programmers don't generally have reels, but we do have blogs. I've been explaining the rendering work I did on BioShock Infinite quite a bit due to recent events, and I thought it made sense to write some of it down here. For the bulk of development, I was the only on-site graphics programmer. As Principal Graphics Programmer I did quite a bit of implementation, but also coordinated and tasked any offsite rendering work.

One of our artists best described Infinite's style as "exaggerated reality." The world of Columbia was colorful, high saturation, and high contrast. We needed to handle both bright, sunny exteriors and dark, moody interiors simultaneously. We were definitely not going for photorealism.

The size of the levels were bigger than anything Irrational had attempted before. The previous game Irrational had worked on, BioShock, was more of an intimate corridor shooter. In contrast, we wanted Columbia to feel like a big city in the clouds. This meant much bigger and much more open spaces that still retained the high detail required for environmental story telling, because much of the story telling in a BioShock game was done via the world itself.

We wanted a streamlined lighting pipeline for level artists. It was obviously possible to get great results out of the stock UE3 forward lighting pipeline, but it was also very time consuming for artists. Many flags and settings had to be tweaked per-light, per-primitive or per-material. Irrational's level design was very iterative. Levels would be built and re-built to pre-alpha quality many, many times, and big changes were done as late as possible. As a consequence the amount of time we had to bring a level from pre-alpha to shipping quality was generally very short, and without a streamlined lighting pipeline would have been very difficult to accomplish.

Finally, all of this had to perform well on all of our platforms.
The end result

Hybrid Lighting System
The lighting system we came up with was a hybrid system between baked and dynamic lighting:
  • Direct lighting was primarily dynamic
  • Indirect lighting was baked in lightmaps and light volumes
  • Shadows were a mixture of baked shadows and dynamic shadows
  • The system handled both stationary and moving primitives.

Deferred Lighting
Dynamic lighting was handled primarily with a deferred lighting/light-pre pass renderer. This met our goals of high contrast/high saturation -- direct lighting baked into lightmaps tends to be flat, mostly because the specular approximations available were fairly limited. We went with the two-stage deferred lighting approach primarily because the information we needed for our BRDF and baked shadows would not fit in four render targets. We did not want to sacrifice UE3's per-pixel material parameterization, so something like a material id system to compact the G-Buffers was out of the question. This of course meant two passes on the geometry instead of one, which we dealt with by running render dispatch in parallel, instancing, and clever art.

There's been a ton written on this technique, so I'm just going to point out a few wrinkles about our approach.

Sunday, July 7, 2013

Code reviews

After reading Aras' Reviewing ALL the CODE entry, I was going to reply describing our process, but it was getting long so I decided to write it up here.

Our process is a little more lo-fi but effective. We use perforce's review daemon, and as part of programmer orientation we set new programmers up to subscribe to the source code folder and set up an outlook filter. That's right, every programmer on the team has a stream of emails for every changelist.

The emails are set up with the first line of the changelist description in the subject and the email of the changelist author in the reply-to field. The body of the email contains the full changelist comment and diffs of the change up to a certain size to avoid flooding our email system.

Reviews are handled by replying to the email, and cc'ing a code reviews email list which goes to everyone and is archived. This is so everyone gets the benefit of subsequent discussion.

We have a few senior engineers who at least read every changelist comment. Personally, I find it is something useful to do while waiting the couple minutes for a big compile to finish. But looking at our code reviews email list, quite a few programmers scan at least some of the changelists, usually looking for changes in code they are most familiar with.

This only works if you enforce meaningful changelist comments. "Fixed bug in renderer" would not be an acceptable changelist comment and would garner a review email to make a better one. A changelist comment should describe the problem being solved and how it was solved. It doesn't need to be a novel but it should be enough information so someone going back to this changelist 6 months from now can understand what was done and why.

I've worked on teams where this was the primary code review process, although currently we use it as a second line of defense - each changelist requires a primary reviewer.

We catch all sorts of issues with this review process - minor issues such as code that is unclear all the way to major bugs that were uncaught by both the author and the primary reviewer. This is a good method for orienting new programmers to the code base, teaching "code base lore", or pointing out bad naming. One of the things I particularly look for is badly named functions or variables - code without short, concise and meaningful names is usually an indicator of a larger problem. I could do a whole entry on names.

Beyond the day to day issues, I also find it is a useful toward improving the quality of the code base as a whole. If you see the same mistakes being made or people having trouble with a particular system, it gets you thinking about ways to prevent those mistakes, or make a system easier to use.

All in all, I find it useful for getting a feel for how the team as a whole is operating and also learning about parts of the code base you might not normally delve into. It's difficult to give people advice on how to solve problems if you don't have at least a cursory understanding of what they are working on. It is also low ceremony and a way to communicate what's going on across largish teams. If you're not doing it, give it a try.

Saturday, February 12, 2011

Virtual Addressing 101

If you haven't read Steven Tovey's excellent article on alternatives to new and malloc, you should. I'll wait.

All done? Good. One topic that was beyond the scope of that article is virtual addressing. Understanding virtual addressing is important to anyone implementing memory management on modern hardware. The PC and both next-gen consoles provide facilities for virtual address management, and it is important to understand the benefits and trade-offs of these facilities when doing memory management.

I am going to simplify many of the details and present a more abstracted view of some made-up hardware. A full discussion of virtual address handling specific to an architecture would be beyond the scope of this entry. The specific details of hardware and OS virtual addressing vary between different architectures, and even different processor generations within the same architecture. In practice, it is always important to read your processor and OS manuals to understand the specific implementation you are working with.

Physical Addressing
Often we like to think of memory in a machine as one big array, somewhat like this:

This is the physical memory map of the Solid Angle PlayBox, a console so spectacularly unsuccessful you probably have never heard of it (or it may just be the fact I made it up). It has 256 MB of memory, physically addressed from 0x0 to 0x10000000.

Real hardware doesn't necessary have one big contiguous lump of physical address space, or may have different physical address ranges mapping to the same memory, with different cache behavior. But again, we're trying to simplify things here.

So this seems great, but what are the problems? The problem is fragmentation. There are actually two types of fragmentation, and it is important to know the difference.

External Fragmentation
When you hear the unqualified word "fragmentation", most often what is being referred to is external fragmentation. External fragmentation occurs when memory has been partitioned into small, non-contiguous chunks, such that while the total amount of free memory is large enough for a big allocation, you can't actually fit it anywhere.

A simple example, using a first-fit heap. Say someone wrote loading code and didn't really consider memory management while doing so (tsk tsk!). This loading code starts by allocating a large temporary buffer for streaming:

Then the loading code reads into the temp buffer, and creates a bunch of permanent data structures.
The loading code then frees the temporary buffer
Repeated many times, with some varying temporary buffer sizes, we could end up with a heap like this:
Now a large allocation comes along, which we have enough memory for, but because memory is partitioned we do not have a large enough contiguous block to fit it. That is external fragmentation, it is fragmentation external to the allocations.
Internal Fragmentation
Internal fragmentation is the type of fragmentation you don't hear about much, or if you do, it is not usually described as fragmentation. Internal fragmentation occurs when the size of the memory manager's internal allocation is larger than what the application actually requested. This is fragmentation internal to the allocations.

An example can be found with fixed-size block allocators. Often you can have a system that makes many allocations, all slightly varying in size. One solution to this is to use a fixed-size block allocator that uses a block size larger than any of your potential allocations. This can lead to a situation where a small amount of memory is unused in each allocation:

Internal fragmentation can occur with other allocators, such as the buddy system.

Virtual Addressing
Most programmers at some point have heard the phrase "All problems in computer science can be solved by another level of indirection", attributed to Dan Wheeler. Many haven't heard the corollary "...except for the problem of too many layers of indirection." This is a shame because I think both together describe the condition of the modern programmer.

Virtual addressing is the direct application of this idea -- instead of accessing memory through its physical address, we add a level of indirection and access it through a virtual address. This indirection is performed in the hardware, so it is mostly transparent to the programmer, and fast, with caveats. Virtual addressing can mitigate many fragmentation issues.

First, an important public service announcement.

Virtual Addressing != Paging to hard drive
Do not confuse virtual addressing with virtual memory management systems that may page data to the hard drive (such as Windows or Linux). I think these concepts sometimes become confused because many descriptions lump the two things together into a heading of "virtual memory." They are not the same thing -- paging systems are built on top of virtual addressing, but you do not need to page memory to the hard drive to reap the benefits of virtual addressing. You don't even need a hard drive!

Virtual Address Space
Virtual addressing implementations are very specific to CPU architecture and OS, but they all share some common properties.

They all have the concept of a virtual address space. The address space may be much larger than the physical memory of the machine -- for example, in our hypothetical console, we may have only 256 MB of physical memory, but with 32 bit pointers we have a 4 GB address space. In practice, architectures and OSes may limit the address space available to applications, either reserving address space for the kernel, or using portions of the address space to for different types of memory access (such as non-cached reads/writes). On multi-process operating systems such as Windows or Linux, each process has its own address space.

Address space is allocated independently from physical memory, and you do not have to have physical memory backing an address space allocation.

The address space is divided into pages. Page sizes vary depending on architecture/OS, but common sizes are 4K, 64K, and 1 MB. Page sizes are always powers of two, as this simplifies the work of translating a virtual address into a physical one. A CPU/OS may only support a fixed page size, or may allow programmers to pick a page size when pages are allocated.

The Page Table
Virtual addresses are translated into physical addresses via a page table. A page table is a simple mapping between a virtual page and a physical page. Going back to our hypothetical console, which has a page size of 64KB, a page table might look like this (again, real world implementations vary):

Each entry in the page table maps a virtual address page to a physical address page. A virtual address allocation may span multiple contiguous address pages, but does not require contiguous physical pages.

When the CPU encounters an instruction which accesses a memory address, it must translate the virtual address into a physical address to know where the data is located in physical memory. With a 64KB page size, the upper 16 bits of a 32 bit address specify the page number, and the lower 16 bits the offset into the page. This is why page sizes are a power of 2 -- determining the page number becomes a simple bit mask and shift. The CPU looks up the virtual page entry in the page table, and finds the corresponding physical page number. This is done for every memory access.

Because this operation happens for every memory access, it needs to be fast and implemented in hardware.There's only one problem: the page table is far too big to be stored on the CPU chip.

Translation Lookaside Buffers
The solution is a special cache for address translation. Because the CPU can not fit the entire page table in on-chip memory, it uses a translation lookaside buffer (TLB), which is a special cache that holds the most recently used page table entries. TLBs can often hold enough page entries for a large amount of address space, usually larger than the amount of memory the L1 or L2 caches can hold.

Back to our memory access scenario, when the CPU must translate a virtual page into a physical page, it first looks in the TLB. If the page table entry is found, the address translation happens very quickly and the CPU continues on its work. If there is a TLB miss, this can often mean a TLB miss handler is invoked. This is actually a software handler provided by the operating system, as the entire page table is managed by the OS, not the CPU. Thus, TLB misses can be very expensive.

On most modern processors, the TLB is multi-level, similar to how L1 and L2 caches work.  Thus the CPU may check a smaller, faster address translation cache before consulting the larger, slower TLB, before it resorts to the software handler of a full TLB miss.

The expense of a TLB miss is another reason data locality is very important to performance. If you are hitting data structures willy nilly in address space, aside from the cache misses you will incur, you may incur a lot of TLB misses, too. This is a double-whammy of not keeping data accesses local!

Memory Protection
Most CPUs also add the capability to specify what kind of access to a page is allowed. Page table entries can be constructed which disallow writes, or disallow code execution on some architectures. The former can be used to make sure application-level code does not overwrite kernel data structures, and the latter can be used to help protect against buffer overrun attacks by not making it possible for the CPU to jump into data-only memory. When invalid accesses occur, a HW exception is raised.

You can often specify the memory protection for a page with API calls, which can sometimes be useful for debugging tricky memory overwrite problems, by protecting pages against writes and writing a custom HW exception handler.

Memory protection is also how OSes implement demand-paging of memory from the hard drive. When the OS moves a physical page of memory to the hard drive, it modifies the virtual page table entry to prevent reads and writes. If that page is accessed, a HW exception occurs which the OS handles by loading the appropriate data from the hard drive into a physical page, and setting the page table entry to point to that physical page. Execution of the program then continues from where the exception was fired.

Virtual Addressing-Aware Memory Management
The presence of virtual addressing has a great impact on memory management. While it does not necessarily change the fundamental behavior of many allocator types, it is important to understand when physical memory is actually committed. Physical pages returned to the OS can be used to make up much larger, contiguous allocations, so at the system level, many problems with external fragmentation are severely reduced.

Direct Page Allocation for Large Blocks
For large allocations (> page size), the best memory allocation strategy is sometimes to allocate virtual address space and physical pages directly from the operating system. These types of allocations are often rare, happening when loading data. The advantage is you will not suffer external fragmentation from this allocation strategy, as the OS can always remap physical pages to a contiguous virtual address space if they are available, even if they are not contiguous in physical address space.

The trade-off for doing this is internal fragmentation. Your large allocation may not be an exact multiple of page size, leading to memory that is wasted. First, you want to pick a good threshold for when to do direct page allocation -- this is not a good strategy for things that are not much larger than the page size.Wasted memory can be also be mitigated by choosing an appropriate page size for the allocation on architectures that allow this. For example, where waste would be a significant percentage of the allocation, you may want to choose 4K pages rather than 64K pages. The trade-off here is smaller pages mean many more TLB misses, which can hurt performance.

Stack Allocators
One key thing with virtual addressing is you can allocate large regions of address space without committing physical memory to it. Stack allocators can be implemented by allocating a large region of address space, but only committing physical pages as the stack allocator pointer advances.
The advantage here is you can choose a large maximum stack size without actually committing physical memory to it. While if you do hit the peak, those physical pages must come from somewhere, it allows for situations where your peak may be at a point where those pages are free from other systems (loading comes to mind).

It should be noted that the C++/C call stack on Windows works exactly like this - when you specify a stack size for an application, you are specifying the size of the address space allocation, not the physical allocation. As the stack grows, the runtime allocates physical pages. This is done transparently with a special page called a guard page, which triggers a HW exception when it is accessed by the code, which causes an OS handler to execute which allocates physical memory for that page and set the next virtual page as the guard page.

Pooled allocators
For small allocations, fixed-size pools are often a good solution. Virtual addressing can allow us to have multiple pools of different sizes without fragmenting overall memory.

Basically, we implement our fixed-size pool as a linked list of mini-pools, each some multiple of the page size. On our hypothetical console, 64KB may be a good mini-pool size. If a mini-pool is full, we allocate another set of pages from the OS. If a mini-pool becomes empty, we return the page to the OS. Again, because physical pages do not need to be contiguous when mapped to virtual address pages, these freed pages can be used for any size of allocation, from anywhere in the system.

General Advice
When dealing with virtual allocations, the general rule of thumb is "return physical pages to the operating system whenever you can." If a physical page is allocated but not being used, the OS can not use it for some other, larger allocation that may need to occur. The days of allocating an entire console's memory space in one block and managing it yourself are largely gone, unless you wish to write your own page allocator (which can and has been done). There are some caveats to this, such as with page allocation thrashing, and allocations that are required to be physically contiguous (see below).

Virtual Addressing Problems
Physically Contiguous Requirements
Your particular platform may require certain allocations be performed in contiguous physical memory, such as GPU resources. This is often the case on consoles. Virtual addressing only mitigates external fragmentation for virtual allocations -- for these physical allocations, you still have to deal with fragmentation at the physical page level. Often the way to handle this is to set aside memory for physical resources up front in your application, and manage them separately from your virtual allocations. 

Page Allocation Thrashing
Allocating virtual address space and committing physical pages are not cheap operations. Particularly with stack allocators and pools, you want to avoid thrashing -- cases where a repeated pattern of allocs/frees cause pages to be allocated and freed in rapid succession. This can be worked around by thresholding when you free a physical page to the OS - for example, with a pool, you may require that some percentage of the previous physical page be free before freeing the next, totally free one. Additional strategies are only doing page frees at specific, known points where the performance hit is predictable.

Page Size and TLB Misses
Page size can have a very important impact on performance. On platforms which allow you to choose page size when performing a virtual address space allocations, you want to pick the largest page size possible, as larger pages cause far less TLB misses.This is often a tricky balance between wasting memory due to internal fragmentation, and losing performance due to TLB misses. As always, data locality helps to reduce TLB misses.

Page Size and Physical Fragmentation
On platforms with variable page sizes, you can run into problems where you can not allocate a large page even though the memory is free. This is due to external fragmentation of the physical pages themselves - if you allocate a large amount of 4K pages, free them, and try to allocate a 1MB page, it may not have enough contiguous physical memory to successfully allocate a 1 MB page. I've even seen some platforms that will not coalesce smaller pages into larger ones even if they are contiguous (i.e. once you've allocated physical memory as a 64KB page, it will never be coalesced into a 1 MB page). This can be mitigated similar to physical allocation restrictions -- allocate your large pages up front, and do your own page allocator that your other allocators work on top of.

Address Space Fragmentation
It is possible to fragment virtual address space itself. One should be careful of reserving too much virtual address space for things like stack allocators, or leaking address space. While on console the address space is many times larger than the physical memory, and thus usually has enough slack to make up for carelessness, on PC, particularly when writing tools in 32 bit, you can run into situations where you fragment the virtual address space itself.

My hope is anyone reading this who did not have a good understanding of virtual addressing now understands a little better what is going on under the hood with memory management, at least at a basic level. As always, platform details differ, and if you are doing any kind of memory management work, you really should read the CPU and OS docs on memory management, virtual addressing, and the TLB for your specific platform.

Even programmers who are not writing custom memory managers can benefit from understanding how virtual addressing works. Almost every memory access performs address translation -- and this translation is another important reason to keep data accesses local when designing data structures.

Sunday, February 6, 2011

Lazy Logging Parameter Evaluation With Variadic Macros

This entry is not rocket science, and probably won't be that informative to experienced programmers, but I've seen commercial code bases get something as simple as this wrong. It requires compiler support for variadic macros, which have been in Visual C++ for a while and are also supported by later versions of GCC.  

Most games have some sort of logging system. Debugging by printf is one of the first debugging tools most programmers learn. While there are many other tools in the debugging toolbox, this particular one is usually not that far out of reach. Some problems just lend themselves to being solved by logging.

We want to minimize the performance impact of logging code, without having to limit the number of logging statements we place in code. We do not want to constantly recompile different configurations of the game with or without logging enabled. While compile time stripping of logging during development will have the least performance impact, there are many times when you may be at a tester, designer or artist's desk and need to log key information. Providing them with a custom build is a productivity hit for everyone involved. 

There are two main performance hits for logging:

1. The cost of the logging itself (writing to the debug window, to a file, to a console, etc)
2. The cost of parameter evaluation

Anyone who has put a log statement in a piece of code executed many times a frame knows it can absolutely kill performance, just by the fact that logging to a file or debug output can be time consuming itself. This first cost can be solved by a channel system that can selectively enable logging. Even beyond the performance cost, it is useful to enable different types of logging at different times. If you're debugging AI code, you are probably not interested in logging from the streaming system.  Log statements specify which channel they are on (say, by integer ID), and the logging code checks if that channel is enabled.

Where should this check occur? I've seen some code bases that do this in the logging function itself. This is a mistake, because even if you do not actually output anything, you are still paying the second cost, the cost of parameter evaluation.

Logging, by nature, is very string-intensive. Often you will output human-readable debug names for various assets and entities when logging. Strings computed as parameters to log statements often incur performance penalties - memory allocations, string operations, etc. In addition to string overhead, some information you may wish to log may not be particularly cheap to calculate.

float ThisFunctionIsVeryExpensiveToEvaluate()

LogPrintf(LOG_INFO, "Expensive but precious debug info: %g\n", ThisFunctionIsVeryExpensiveToEvaluate());

What we want is for the expensive function to only be evaluated if the LOG_INFO channel is enabled.

The way to do this is to put the channel check in the macro itself, and only call the log function if the check succeeds. Here's some sample code that accomplishes this using variadic macros:

// Define this to 0 to disable logging

const int LOG_ERROR=0x1;
const int LOG_WARNING=0x2;
const int LOG_INFO=0x4;

   // Simple channel system (you want IsLoggingChannelEnabled to be cheap,
   // but there are other ways to implement something like this)
   static int GlobalEnabledLogChannels;

   // Make sure your compiler inlines this function, as it will be called many
   // times
   // You may want to force it to be inlined using whatever compiler-specific
   // syntax is available to you.
   inline bool IsLoggingChannelEnabled(int channel)
       return 0 != (GlobalEnabledLogChannels & channel);

   // This overload is present to handle the case where the channel argument
   // is optional
   inline bool IsLoggingChannelEnabled(const char*)
       return true;

   // Note: I've seen many logging systems which make the log channel optional.
   // I'm going to handle this case to show how it is done, but if you always
   // require a log channel, this code becomes simpler (for instance, you can
   // make format a required argument to the macro, and not need the ## handling)
   void MyLogPrintf(const char* format, ...);
   void MyLogPrintf(int channel, const char * format,...);

   // The ## is some syntax magic to make GCC ignore the preceding
   // comma if no arguments after channel are present
   // This can happen if no channel is specified in the log print, as it is optional
   #define LogPrintf(channel, ...) \
      if(!IsLoggingChannelEnabled(channel)) {} else MyLogPrintf(channel, ##__VA_ARGS__)


   #if _MSC_VER
      // __noop is Visual C++ specific syntax for "do nothing".
      #define LogPrintf(...) __noop
      // Compiler should strip this out - but always look at the disassembly to make sure!      inline void Noop()

      #define LogPrintf(...) Noop()


// example log statements
void SomeFunction()
    LogPrintf("Hello world!\n");
    LogPrintf(LOG_ERROR, "You should see this very important error\n");
    LogPrintf(LOG_INFO, "Expensive info: %s\n",

Hopefully blogger didn't mangle the formatting of all that.

The key concept is to call IsLoggingChannelEnabled() in the macro itself. The if syntax it uses is specially constructed -- done this way it will not change the semantics of an if statement without braces. For example:

if (rand()%2 == 0)
    LogPrintf(LOG_INFO,"Rand was even\n");
    LogPrintf(LOG_INFO, "Rand was odd\n");

If we did something like this:

#define LogPrintf(channel, ...) \
if(IsLoggingChannelEnabled(channel)) MyLogPrintf(channel, ##__VA_ARGS__)

that would change the meaning of the above if statement, and the else case would be on if(IsLoggingChannelEnabled(channel)), not the original rand check!

A note on why I made the channel optional: In a particular legacy code base I was dealing with, the channel argument was optional on logging statements, and I had to handle that case without changing every log statement in the application. I wanted to show how you could support something like that.

The main drawback with this approach is an increase in executable size due to all the log channel checks being inlined and the performance hit of the check itself on each log statement. It really depends on your particular game whether you are willing to pay these costs in your development/in-house build or not.

Saturday, December 4, 2010

Is Data-Oriented Design a Paradigm?

Recently there has been quite the flurry of tweets about OOP (objected oriented programming) and DoD (data oriented design). If you're unfamiliar with DoD, here's a nice presentation. If you're unfamiliar with OOP, I'd like to know what cave you've been living in for the last few decades.

DoD has caught on with game programmers because it puts a name to something anyone who has spent time optimizing a game already knew -- your data access patterns have a much bigger impact on your performance than the actual code you execute. I remember many an optimization session on Stranglehold where a reduction in L2 cache misses led to a perfectly correlated reduction in execution time.

DoD goes farther in that it presents a set of guidelines for writing code up front that will run under the reality of the memory-processor speed gap. This does set it apart from a simple optimization technique as it is something you can use before the fact, rather than after the fact. Follow these guidelines, your program will perform better. 

Dino Dini argues that this is nothing new, that game programmers have been doing this for decades. He's right. The underlying concepts are not that new, but giving it a name and a simple package of guidelines is new.  This has value, I think, because it helps educate programmers about these concepts. I am not discounting anyone's effort in this area, because I think a lot of programmers need to learn these concepts.

That said, I don't think DoD approaches what one would call a programming paradigm. The consensus definition of programming paradigm is a "fundamental style of programming." It certainly is a style of programming, but I don't think it is fundamental.

While I put on my flame retardant, let me explain what I mean. Structured/procedural and OOP are two programming paradigms that historically grew out of the need to manage software complexity. These are paradigms in which you could organize an entire code base. They contain methods for abstraction, and layered design. 

DoD says nothing about code complexity. It does not describe how to organize your entire code base. No matter what happens with the processor-memory gap, code complexity is a huge problem for any large project. DoD offers no tools for managing this complexity.

I can imagine a code base completely organized around the structured paradigm (and many exist). The same with OOP.  Many real world code bases mix a little bit of both paradigms -- platform APIs tend to be structured, application architecture these days tends to be OOP. 

I can see how DoD fits into either of these paradigms. I don't know what a code base completely organized around DoD would look like. I don't think that's even a question that makes sense, as it is not tackling the same set of problems. 

This is fine, and does not take away from DoD at all. In fact, I think it frees us to discuss the realities of writing software for today's hardware without having to waste time arguing about OOP vs DoD. They are apples and oranges.

Tuesday, March 9, 2010


I'll be at GDC this week. My tentative session schedule is thus

Session Title Date Start Time End Time Location
Designing for Performance, Scalability & Reliability: StarCraft II's Approach 2010-03-11 09:00:00 10:00:00 Room 306, South Hall
Go With the Flow! Fluid and Particle Physics in PixelJunk Shooter 2010-03-11 15:00:00 16:00:00 Room 306, South Hall
God of War III: Shadows 2010-03-11 16:30:00 17:30:00 Room 304, South Hall
Code and Complexity: Managing EVE's Expanding Universe 2010-03-12 09:00:00 10:00:00 Room 130, North Hall
Taking Fluid Simulation Out of the Box: Particle Effects in Dark Void 2010-03-12 09:00:00 10:00:00 Room 304, South Hall
Light, Perception, and the Modern Shader 2010-03-12 12:00:00 13:00:00 Esplanade Lobby, South Hall
Creating the Active Cinematic Experience of Uncharted 2: Among Thieves 2010-03-12 13:30:00 14:30:00 Room 305, South Hall
The Next Generation of Fighting Games: Physics & Animation in UFC 2009 Undisputed 2010-03-12 15:00:00 16:00:00 Room 135, North Hall
APB: Creating a Powerful Customisation System for a Persistent Online Action Game 2010-03-12 16:30:00 17:30:00 Room 135, North Hall
Three Big Lies: Typical Design Failures in Game Programming 2010-03-13 09:00:00 10:00:00 Room 125, North Hall
Texture compression in real-time, using the GPU 2010-03-13 10:30:00 10:55:00 Room 132, North Hall
R-Trees -- Adapting out-of-core techniques to modern memory architectures 2010-03-13 11:05:00 11:30:00 Room 132, North Hall
The Rendering Tools and Techniques of Splinter Cell: Conviction 2010-03-13 13:30:00 14:30:00 Room 303, South Hall
Uncharted 2: HDR Lighting 2010-03-13 15:00:00 16:00:00 Room 305, South Hall

I believe Irrational folk will be in and out of the bar at the Marriott quite a bit in the evenings, so if you find yourself in the vicinity and see a big guy with glasses there, that's probably me, so stop by and say hi. 

Saturday, February 20, 2010

Musings on Data-Oriented Design

Lately there has been a lot on the interwebs about "Data-Oriented Design." Mike Acton tackles the problems with textbook OOP with the provocative title Typical C++ Bullshit, Sony has an excellent presentation titled Pitfalls of Object Oriented Programming, and Games from Within discusses the subject here.  For any programmer wishing to write code that performs well on today's processors, I highly recommend reading all three.

The fundamental problem is pretty simple: C++ was designed during the early 80's, when the gap between processor performance and memory performance was small. Now that gap is large. Notice that the vertical scale on that graph is logarithmic -- the gap is nearly one thousand times larger than it was in the early 80's.

It is understandable that textbook OOP, which came to be under such different hardware performance characteristics, would have performance problems with today's hardware.

I've been thinking about this problem lately and my conclusion is we need better language and compiler support for the layout and access of data in systems languages. Whether that comes as modifications to C++ or as something new, I'm not going to wade into that swamp today.

Where we are

C itself is really just portable assembly language. It defines an abstract machine model but there is a pretty close mapping between C code and the assembly it generates. C++ kept this ability (as it is mostly a superset of C), but added in abstractions to help deal with large code bases. These abstractions necessarily came at a cost -- you can write C++ code that does not map very closely to the assembly it generates.

My proposition is that the data organization capabilities of both C and C++ are the equivalent of portable assembly language for data: a close mapping between the code and the data layout it generates. While the C++ standard does not actually specify a memory layout, the truth is the de facto standard in most compilers is the layout of structures or classes, minus some inserted vtable pointers, generally correspond 1-1 to how they are laid out in memory. Most operating system APIs depend on this fact, as you pass structures to them with strict memory layouts.

To see why this is a problem, let me make an analogy with instruction scheduling. As processors became pipelined and then superscalar, the scheduling of instructions to keep all those pipelines full became a big problem. The early C and C++ compilers did a very poor job of it, and people resorted to either reorganizing their code or dropping down to assembly language to take proper advantage. Compilers have gotten a lot better at scheduling instructions over time -- to the point that things like inline assembly hurt the ability of the compiler to reorder instructions. With the advent of compiler intrinsics, which the compiler understands and can schedule along with other instructions, you're better off sticking in C or C++ rather than using inline assembly these days. While even in C (which again, is portable assembly language), you still run into code that the compiler does not generate machine instructions as you'd like, the tools to detect such problems are quite good and the mechanisms to fix them are usually localized to a particular function.

Moving over to the data side, we are constantly stuck in a space equivalent to hand-scheduling instructions. I think this is the challenge of data-oriented techniques, is that you are forced to be in a head space where you are spending a fair amount of time doing analysis of data access and rearranging code and data structures rather than solving the actual problem your code is intended to solve. I'm sure there are people for which this comes quite naturally (I suspect Mike Acton is one), but for me, at least, this takes a considerable amount of mental effort.

Where we need to be

As I've thought about this more, I've realized that both C and C++ fail in offering any sort of tools to help the programmer tackle the problems of data organization. If the compiler is free to reschedule instructions, should we not let it be free to reorganize our data structures?

Obviously, the compiler can not do this alone. One recurring theme in these presentations is that textbook OOP tends to focus on singular entities. A class has a virtual function that deals with late dispatch on one object. A class defines the layout for one object. Obviously, you don't have to design your classes this way -- and in fact, the above presentations argue you shouldn't. But if you find yourself fighting with or avoiding the language abstractions rather than using them, what have you gained? In that sense, C++'s abstractions hurt us because they lull us into writing code that will run horribly. We need better abstractions.

Both of these presentations move away from the model of classes that deal with one thing and move to code that deals with sets of things. If you are going to do a sphere in frustum test, you're going to be doing it on many things, not just one. Even when sets are not homogeneous, we deal with that by sorting them by type, and executing our operations in bulk on each type.

We need more than sets, though, because different operations need different views on the data. Transform update may only be concerned with the matrix of a game entity, whereas higher level AI code may have a completely different view. We want our data to be laid out optimally for some of our operations, which may mean different data is stored in different places, or we may even have multiple copies of some data in order to support different operations.

One of those views is the view we use for debugging. In our head space, we tend to think about single entities in the game world -- this projectile, this character, this mesh. Textbook OOP tends to couple class layout with this debugging head-space, and is part of the attraction -- I don't have to care about what is going on with everything else in the program, I have everything I need to know about this mesh right here.

The organization the computer needs is much different, though -- when doing frustum culling, for example, what we really want is just a big array of AABBs. When debugging why a specific mesh is being culled, though, it really helps to see all the data about that entity in one place. Otherwise, you spend a lot of time traversing data structures in the watch window, just to find out what state an object got in that caused it to flip its visible bit to false. So the view of the data that humans need is another important piece of the puzzle.

This is the limit of my current musings. I want to write code that deals with sets of things as a natural part of the language and not just some templated container library. I want to be able to specify multiple views on my data, and have the compiler use this information to generate optimal data layout for certain operations. In the debugger, I want a debugging view which is similar to the textbook OOP view. I want a language that is designed to provide these things, and will tackle data layout as an optimization problem similar to register allocation, instruction scheduling, or inlining.

Perhaps this is too radical a departure for a low-level language such as C or C++.  I would hope there are some research languages out there that do the kind of things I am talking about -- other duties have prevented me from doing anything more than a cursory literature search. Given that the processor-memory gap is only likely to get worse, I'd certainly hope there is.