Saturday, February 20, 2010

Musings on Data-Oriented Design

Lately there has been a lot on the interwebs about "Data-Oriented Design." Mike Acton tackles the problems with textbook OOP with the provocative title Typical C++ Bullshit, Sony has an excellent presentation titled Pitfalls of Object Oriented Programming, and Games from Within discusses the subject here.  For any programmer wishing to write code that performs well on today's processors, I highly recommend reading all three.

The fundamental problem is pretty simple: C++ was designed during the early 80's, when the gap between processor performance and memory performance was small. Now that gap is large. Notice that the vertical scale on that graph is logarithmic -- the gap is nearly one thousand times larger than it was in the early 80's.

It is understandable that textbook OOP, which came to be under such different hardware performance characteristics, would have performance problems with today's hardware.

I've been thinking about this problem lately and my conclusion is we need better language and compiler support for the layout and access of data in systems languages. Whether that comes as modifications to C++ or as something new, I'm not going to wade into that swamp today.

Where we are

C itself is really just portable assembly language. It defines an abstract machine model but there is a pretty close mapping between C code and the assembly it generates. C++ kept this ability (as it is mostly a superset of C), but added in abstractions to help deal with large code bases. These abstractions necessarily came at a cost -- you can write C++ code that does not map very closely to the assembly it generates.

My proposition is that the data organization capabilities of both C and C++ are the equivalent of portable assembly language for data: a close mapping between the code and the data layout it generates. While the C++ standard does not actually specify a memory layout, the de facto standard in most compilers is that structures and classes, aside from some inserted vtable pointers, correspond one-to-one to how they are laid out in memory. Most operating system APIs depend on this fact, since you pass them structures with strict memory layouts.
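To make that one-to-one mapping concrete, here is a made-up struct whose layout you can observe directly with `offsetof` (the header and its fields are invented for the example, not taken from any real API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A made-up packet header. Under the de facto layout rules, fields land in
// memory in declaration order, with padding inserted only for alignment --
// which is why structs like this can be handed directly to OS APIs.
struct PacketHeader {
    std::uint16_t type;     // offset 0
    std::uint16_t length;   // offset 2
    std::uint32_t sequence; // offset 4
};
```

On every mainstream ABI I'm aware of, `offsetof(PacketHeader, sequence)` is 4 and `sizeof(PacketHeader)` is 8 -- the declaration effectively is the layout.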

To see why this is a problem, let me make an analogy with instruction scheduling. As processors became pipelined and then superscalar, the scheduling of instructions to keep all those pipelines full became a big problem. The early C and C++ compilers did a very poor job of it, and people resorted to either reorganizing their code or dropping down to assembly language to take proper advantage. Compilers have gotten a lot better at scheduling instructions over time -- to the point that things like inline assembly hurt the ability of the compiler to reorder instructions. With the advent of compiler intrinsics, which the compiler understands and can schedule along with other instructions, you're better off sticking with C or C++ rather than using inline assembly these days. Even in C (which, again, is portable assembly language), you still run into code for which the compiler does not generate the machine instructions you'd like, but the tools to detect such problems are quite good, and the fixes are usually localized to a particular function.
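As a small sketch of what intrinsics buy you -- assuming an x86 target with SSE -- here is a four-wide float add the compiler can freely schedule alongside ordinary C++ code, something it cannot do across an inline assembly block:

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics (x86 only)

// Add two arrays of four floats with a single SSE instruction. Because the
// compiler understands _mm_add_ps, it can reorder it with the surrounding
// code, unlike an opaque inline-assembly block.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```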

Moving over to the data side, we are constantly stuck in a space equivalent to hand-scheduling instructions. I think this is the challenge of data-oriented techniques: you are forced into a head space where you spend a fair amount of time analyzing data access and rearranging code and data structures rather than solving the actual problem your code is intended to solve. I'm sure there are people for whom this comes quite naturally (I suspect Mike Acton is one), but for me, at least, it takes a considerable amount of mental effort.

Where we need to be

As I've thought about this more, I've realized that both C and C++ fail to offer any real tools to help the programmer tackle the problems of data organization. If the compiler is free to reschedule instructions, should we not also let it be free to reorganize our data structures?

Obviously, the compiler cannot do this alone. One recurring theme in these presentations is that textbook OOP tends to focus on singular entities. A class has a virtual function that deals with late dispatch on one object. A class defines the layout for one object. Obviously, you don't have to design your classes this way -- and in fact, the above presentations argue you shouldn't. But if you find yourself fighting with or avoiding the language abstractions rather than using them, what have you gained? In that sense, C++'s abstractions hurt us because they lull us into writing code that will run horribly. We need better abstractions.

Both of these presentations move away from the model of classes that deal with one thing and toward code that deals with sets of things. If you are going to do a sphere-in-frustum test, you're going to be doing it on many things, not just one. Even when sets are not homogeneous, we deal with that by sorting them by type and executing our operations in bulk on each type.
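A minimal sketch of what that bulk style looks like for the sphere-in-frustum case -- the structures and the plane test are simplified for illustration, not taken from either presentation:

```cpp
#include <cassert>
#include <cstddef>

struct Sphere { float x, y, z, radius; };
struct Plane  { float nx, ny, nz, d; }; // normal points into the frustum

// Cull a whole array of spheres against a set of planes in one pass,
// writing one visibility flag per sphere. The parallel-array layout
// (spheres in, flags out) is illustrative, not canonical.
void cullSpheres(const Sphere* spheres, std::size_t count,
                 const Plane* planes, std::size_t planeCount,
                 unsigned char* visible) {
    for (std::size_t i = 0; i < count; ++i) {
        const Sphere& s = spheres[i];
        unsigned char in = 1;
        for (std::size_t p = 0; p < planeCount; ++p) {
            float dist = planes[p].nx * s.x + planes[p].ny * s.y +
                         planes[p].nz * s.z + planes[p].d;
            if (dist < -s.radius) { in = 0; break; } // fully outside this plane
        }
        visible[i] = in;
    }
}
```

The interesting property is the loop structure: one tight pass over homogeneous data, rather than a virtual `isVisible()` call scattered across the heap.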

We need more than sets, though, because different operations need different views on the data. Transform update may only be concerned with the matrix of a game entity, whereas higher level AI code may have a completely different view. We want our data to be laid out optimally for some of our operations, which may mean different data is stored in different places, or we may even have multiple copies of some data in order to support different operations.
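As a rough sketch of "different data stored in different places," the per-operation views might look like parallel arrays split by concern (the field names are invented for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The same set of entities, with data split by which operation reads it.
// Field names are illustrative; the point is the layout, not the simulation.
struct EntityData {
    std::vector<float> posX, posY; // read and written by transform update
    std::vector<float> velX, velY; // read by transform update
    std::vector<int>   aiState;    // touched only by higher-level AI code
};

// Transform update streams through four tightly packed arrays;
// aiState never enters the cache during this pass.
void updateTransforms(EntityData& e, float dt) {
    for (std::size_t i = 0; i < e.posX.size(); ++i) {
        e.posX[i] += e.velX[i] * dt;
        e.posY[i] += e.velY[i] * dt;
    }
}
```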

One of those views is the view we use for debugging. In our head space, we tend to think about single entities in the game world -- this projectile, this character, this mesh. Textbook OOP tends to couple class layout with this debugging head space, which is part of the attraction -- I don't have to care about what is going on with everything else in the program; I have everything I need to know about this mesh right here.

The organization the computer needs is much different, though -- when doing frustum culling, for example, what we really want is just a big array of AABBs. When debugging why a specific mesh is being culled, though, it really helps to see all the data about that entity in one place. Otherwise, you spend a lot of time traversing data structures in the watch window, just to find out what state an object got in that caused it to flip its visible bit to false. So the view of the data that humans need is another important piece of the puzzle.

This is the limit of my current musings. I want to write code that deals with sets of things as a natural part of the language and not just some templated container library. I want to be able to specify multiple views on my data, and have the compiler use this information to generate optimal data layout for certain operations. In the debugger, I want a debugging view which is similar to the textbook OOP view. I want a language that is designed to provide these things, and will tackle data layout as an optimization problem similar to register allocation, instruction scheduling, or inlining.

Perhaps this is too radical a departure for a low-level language such as C or C++. I would hope there are some research languages out there that do the kinds of things I am talking about -- other duties have prevented me from doing anything more than a cursory literature search. Given that the processor-memory gap is only likely to get worse, I'd certainly hope they exist.


  1. I noted a slam on references in "Typical C++ Bullshit". Was he being crotchety or is there some penalty for refs that I'm unaware of?

    In a previous job I worked with a Navy Mil spec language that allowed tremendous control over how data was laid out and accessed, particularly tables, which were language supported. Ancient stuff, but it seems somewhat applicable...

    But what do I know. I'm just trying to sound relevant. :P

  2. The only penalty I know with references is you can't use __restrict on them. But I don't necessarily agree with that point -- unlike most of the others it didn't really have any justification.

  3. @Hizzle Partially I was just being crotchety, sure. But the most important point about references is not that they incur any special performance cost per se, but that they invariably (and by design) refer to a single object thus reinforcing the monolithic-object/class model. One can infer that data is likely to be poorly designed or under-designed by the overuse of references.

  4. Interesting point, I hadn't considered that before.

  5. To your point that we need "better language and compiler support" - I think that fundamentally the problem of actually solving for the data appropriately is a very different problem from a code-centric approach of describing a (generalized) process. In particular, it is not solvable statically. At all. Solving the problem requires insight into the data, the access patterns, and context (e.g. debugging vs. various actual transforms). So if anything, I think we could use some real data and statistical analysis tools. E.g. in this context, what's the statistical likelihood that when A is loaded, B will also be needed? If very high, A and B are bundled together. But they may need to be separated in other contexts. Anyway, I could imagine a kind of language toolchain that would include that, but we're a long, long way off from that in practice, I think.

  6. I kind of regret picking the instruction scheduling analogy because I agree with you that this is a fundamentally different problem, and I'm not implying that the compiler can do this analysis on its own. What we have is languages, tools, and profilers that are solving the wrong problem -- they are process oriented vs data-oriented. Doing the analysis is cumbersome because profilers today are still code-centric, and expressing solutions to those problems feels like you are fighting the language rather than it helping you.

    I look to the GPU side for inspiration. Shaders, while a very constrained, domain-specific problem, have a structure of INPUT -> XFORM -> OUTPUT. From the language design to the compilers and profilers to the hardware itself, they have explicit knowledge that you are doing this transform on many things, not just one. They present a head space to the programmer where you describe the transform in terms of a single vertex or pixel, and the toolchain turns that into code which executes on vectors of pixels and vertices. The hard choices of exactly what to put in the textures/vertex streams/interpolators/constants/frame buffers are still up to the programmer, but the toolchain has much more knowledge about what you are doing. Additionally, even inexperienced shader programmers are forced to write code that deals with many things instead of just one.

    GPU analysis tools are far more sophisticated than what we have for C or C++ these days. Static analysis of shaders can determine if you are texture-bound, ALU-bound, or output bandwidth bound. These are obviously estimates, but tools like PIX and GPAD can give you close to actual numbers. More importantly, PIX gives you a snapshot of every single transformation done on every piece of input and output data in an entire frame.

    The tools on the C and C++ side are still incredibly code-oriented. You can get raw L2 cache misses and even see where they occur in the code, but I wonder if there are better ways to present this data. As you say, it'd be more interesting to have this data presented on the data structures rather than the code (if A and B were bundled, your code would be X% faster). Would a profiler that told you straight up, similar to PIX, that "this part of your program is read-bound" rather than raw L2 cache misses be more useful? What kind of static analysis can we do if the language allowed us to express data transformations in a way similar to shaders?

    Most compilers are notoriously bad at generating vectorized code. I think this is more a language problem than a backend optimization problem -- the compiler has no idea whether a function will be called on one, a thousand, or a million things, and we have no way to express that to it. Perhaps things like compute shaders are the answer -- I'll admit investigating their capabilities has been on my TODO list for a while. Or perhaps we need more functional aspects in systems languages. I just feel there is a lot more that could be done in this area.
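The shader-style INPUT -> XFORM -> OUTPUT head space from the reply above can be imitated in C++ today -- purely a sketch, and unlike a real shader toolchain it gives the compiler no extra knowledge of data layout, only the same programming model to the human:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mimic the shader model: the programmer writes the transform for ONE
// element; a driver loop applies it across the whole set. The names here
// (transformAll, Xform) are invented for the example.
template <typename In, typename Out, typename Xform>
void transformAll(const std::vector<In>& input, std::vector<Out>& output,
                  Xform xf) {
    output.resize(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        output[i] = xf(input[i]); // the per-element "shader body"
}
```

Calling it with a lambda gives the single-element head space while the loop remains set-oriented, e.g. `transformAll(verts, scaled, [](float v) { return v * 2.0f; });`.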

  7. I've been thinking about this as well, to the point where I'm looking at languages that are by necessity DOD ("pure" functional languages, for example). I feel though that modern game programming really requires a mix of OOP and DOD, both for understandability (the debugging view you mentioned) and performance reasons.

    As I see it, the data that has to be operated on in batches should be held in batches / arrays, but objects that can be held or operated on in a heavyweight thread (the main loop or the PPU) can be OOP. In addition, the OOP side can then be used to create the debug view you were talking about. The OOP side can point to each individual piece of data that it needs and, done correctly, it can do this without allocating memory.

    For example, an actor could have positional data, state data, and other world data in three separate arrays. A final array contains a "lookup" table indexed by actor ID. You can then get an Actor "class" by looking up that actor by number, e.g.

    Actor::Actor(int actorId) {
        m_info = GetActorInfoById(actorId);
        m_posInfo = PositionTable::FromId(m_info->posId);
        m_stateInfo = AIStateTable::FromId(m_info->stateId);
        m_worldInfo = WorldState::FromId(m_info->worldId);
    }

    By using static constructors instead of new, you save on memory allocations and still get a "class-like" interface. Your XXXTables can then become namespaces for the DOD functions that run in parallel.

    Never implemented anything like this, but I feel like it would ease development in a combined OOP and DOD environment.

  8. Great post! This really got me thinking about OO in performance critical designs.

  9. Couple of quick thoughts....

    1) What you're looking for is a dataflow language, I think - you're not interested in data layout, but in how the data traverses your system. Unfortunately, the only one I know myself is VHDL - and I can't see anybody writing a game in that without going nuts ;)

    2) As Mike said, it's not necessarily possible for the compiler to change the layout. Access patterns very much depend on runtime behavior. I agree with him that we need much better tools for analyzing those patterns. Funnily, it's important on a larger scale, too. Optimizing a streaming world means analyzing data access patterns.

    Now I'm strangely tempted to re-read a ton of papers from the 60's/70's, because that was another time when data access was much more expensive than data processing.

    3) For the visibility issues with SoA data (at least as far as I can tell, this is what this is boiling down to), I'd think VS.NET's support for data visualizers could be rather helpful. In general, our debugging tools are woefully inadequate for what we're trying to do, and we'll have to roll a lot ourselves. (I know, spoken like a tools programmer ;)

    4) Jeff's comment gives me weird ideas... To me, this pattern seems to indicate that we might need "array of objects" as a language construct. Or at least a way to dynamically rewrite access patterns so we can flip between SoA/AoS without having to rebuild our structures. After all, we *do* have the type information available at compile time, and would "just" have to change layouts.

    You could probably achieve something to that effect using a reflection system....
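One hypothetical shape for that SoA/AoS flip, using templates rather than reflection: both containers expose the same accessor surface, so client code compiles unchanged against either layout. A reflection system or code generator could emit both from one struct definition; here they are written out by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures layout: x and y interleaved per point.
struct PointsAoS {
    struct P { float x, y; };
    std::vector<P> data;
    void resize(std::size_t n) { data.resize(n); }
    float& x(std::size_t i) { return data[i].x; }
    float& y(std::size_t i) { return data[i].y; }
};

// Structure-of-arrays layout: all x's packed together, all y's together.
struct PointsSoA {
    std::vector<float> xs, ys;
    void resize(std::size_t n) { xs.resize(n); ys.resize(n); }
    float& x(std::size_t i) { return xs[i]; }
    float& y(std::size_t i) { return ys[i]; }
};

// Client code is layout-agnostic: swapping the template argument flips
// every access between AoS and SoA at compile time.
template <typename Points>
float sumX(Points& pts, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += pts.x(i);
    return total;
}
```

This only rewrites access syntax, of course -- deciding *which* layout is right for a given access pattern is still the runtime-analysis problem discussed above.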

  10. Reading posts on component based entity systems got me reading posts on data oriented design. I also had the idea that we need a language that is OOP and Functional, but compiles down to very data-oriented code.
    I'm going to experiment with creating such a language (compiled with LLVM). But this is my first attempt at language design and creating a compiler so it will probably take many failed attempts before I have something to share. My current thinking is to allow specifying data layout(s) for each class and semi-automatically creating pools for allocation.

  11. +1 that there's a programming language problem to be solved. I tried to design a small game in a completely data-oriented manner, and I quickly ran into a couple fundamental language limitations:

    If I only want to iterate over an array once per frame (to avoid loading it multiple times), the body of the 'for' loop must perform every transformation that uses the array as input.

    This scales well in terms of efficiency, but poorly in terms of code complexity and separation of concerns. All consumers of an array have to be merged into a single loop.

    Another problem arises when you change your mind about which data to iterate over in the outer loop; that is, when you decide that it's more important to iterate over BigHugeArray exactly once, while allowing multiple iterations over ModeratelySizedArray in an inner loop. This can be a substantial refactor. (Traditional programming styles also have this problem, but it's a fundamental operation in DoD that needs better support.)

    To address these problems, I'm working on a mini-language (really just a preprocessor) where you define data nodes, and write transformations in C++ to consume inputs and produce outputs. The programmer provides hints about an optimal order of operations, and the compiler produces the 'for' loops, manages data dependencies, and merges loops where possible.

    That way, multiple transformations can use the same input data without disturbing each other, and swapping your inner and outer loops can be done by providing a different hint to the compiler.
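A minimal sketch of the fused-loop shape this comment describes -- the two consumers (a lava-damage rule and a bounds check) are invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two logically separate consumers of `positions`, merged into one loop so
// the array is streamed from memory only once. This is the tension the
// comment describes: good cache behavior, but every new consumer must be
// hand-merged into this same body.
void updateFused(const std::vector<float>& positions,
                 std::vector<float>& health,
                 std::vector<bool>& outOfBounds,
                 float lavaLine, float worldEdge) {
    outOfBounds.assign(positions.size(), false);
    for (std::size_t i = 0; i < positions.size(); ++i) {
        if (positions[i] < lavaLine) health[i] -= 1.0f; // consumer 1: damage
        outOfBounds[i] = positions[i] > worldEdge;      // consumer 2: bounds
    }
}
```

The preprocessor described above would let each consumer be written separately and generate this merged loop mechanically.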