Saturday, October 17, 2009
I was reading this paper on EA's internal STL implementation, and it got me thinking -- where is the game architecture research?
There is a large amount of academic research poured into real-time graphics, experimental gameplay and entertainment, AI, and even MMO server design. But I find there are a number of architecture issues unique to games that lack any sort of research. I've done searches and not come up with a whole lot; maybe I'm just not using the correct keywords.
Memory is not a solved problem for game consoles
Most, if not all, garbage collection research is focused on desktop- or server-based memory usage patterns, and it assumes virtual memory paging. Many GC algorithms are impractical for a fixed-memory environment where utilization needs to be close to 100%. While some game engines use garbage collection, the algorithms are primitive compared to the state-of-the-art generational collectors found in desktop environments, and the waste of memory resources is often 10-20% of total memory. Games generally cannot afford large mark or sweep phases, as they must execute at a smooth frame rate. Fragmentation can still be an issue in a fixed-memory environment, although many allocator strategies exist to combat it.
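To make the allocator point concrete, here's a minimal sketch of one such strategy -- a fixed-size block pool, where every block is the same size, so freeing and reallocating can never fragment the region. The names are mine, not from any particular engine, and a real version would add alignment, thread safety, and debug tracking:

```cpp
#include <cassert>
#include <cstddef>

// Minimal fixed-size block pool: all blocks are the same size, so
// freeing and reallocating can never fragment the region.
class BlockPool {
public:
    BlockPool(void* region, size_t regionSize, size_t blockSize)
        : m_freeList(nullptr) {
        assert(blockSize >= sizeof(void*));
        char* p = static_cast<char*>(region);
        size_t count = regionSize / blockSize;
        // Thread every block in the region onto the free list.
        for (size_t i = 0; i < count; ++i)
            Free(p + i * blockSize);
    }

    void* Allocate() {
        if (!m_freeList)
            return nullptr;  // pool exhausted; no paging to save us
        void* block = m_freeList;
        m_freeList = *static_cast<void**>(m_freeList);
        return block;
    }

    void Free(void* block) {
        // Reuse the block's own memory as the free-list link.
        *static_cast<void**>(block) = m_freeList;
        m_freeList = block;
    }

private:
    void* m_freeList;
};
```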
Multicore architectures for games
While this is still an active area of research for desktop and server applications too, I've found exactly one paper that attempts any research in this area for game architectures. This is a particularly fruitful area for research, since there are many competing ideas out there (message passing! software transactional memory! functional programming!) but very few researchers testing any of them in the context of building a game. It is difficult enough to make a game by itself, let alone test multiple techniques for exploiting multiple cores. I find this somewhat interesting because, aside from servers and scientific processing, games are pushing the state of the art in multicore programming more than anything else.
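To illustrate just one of those competing ideas, here's a minimal message-passing sketch (modern C++, all names hypothetical): systems own no shared state and communicate only through mailboxes.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical message passed between game systems; a real engine would
// use a tagged union or variant of many message types.
struct Message { int type; float value; };

// A simple thread-safe mailbox: systems share no state, only messages.
class Mailbox {
public:
    void Send(const Message& m) {
        { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push(m); }
        m_cv.notify_one();
    }
    Message Receive() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        Message m = m_queue.front();
        m_queue.pop();
        return m;
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<Message> m_queue;
};

int main() {
    Mailbox physicsInbox;
    // The "physics" system consumes messages; the main thread produces them.
    std::thread physics([&] {
        for (int i = 0; i < 3; ++i) {
            Message m = physicsInbox.Receive();
            std::printf("physics got message type %d, value %f\n", m.type, m.value);
        }
    });
    for (int i = 0; i < 3; ++i)
        physicsInbox.Send(Message{ i, i * 0.5f });
    physics.join();
    return 0;
}
```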
Automated testing
This is something the EA STL paper brings up -- traditional automated testing techniques break down pretty quickly beyond unit testing lower level libraries. So much of the end result of game code is subjective and emergent that determining how to test even basic functionality automatically is a huge unsolved problem. This results in a large amount of manpower being used for game testing, particularly in the area of regression testing.
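One partial answer some teams reach for is deterministic input replay -- record a session's inputs once, replay them against new builds, and compare a checksum of the resulting state. A toy sketch of the idea, with all types hypothetical and the simulation reduced to a stand-in:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-frame input snapshot.
struct InputFrame { uint32_t buttons; float stickX, stickY; };

// Hypothetical game simulation, assumed fully deterministic given inputs.
struct GameState {
    uint64_t hash;  // running checksum of whatever state matters
    void Step(const InputFrame& in) {
        // Fold the inputs into the checksum as a stand-in for real simulation.
        hash = (hash * 1099511628211ull) ^ in.buttons;
    }
};

// Replay a recorded input stream and return the final state checksum.
// If a code change alters this value, something observable changed --
// which is at least a tripwire, even if it can't judge "fun."
uint64_t ReplayChecksum(const std::vector<InputFrame>& recording) {
    GameState state{ 14695981039346656037ull };
    for (const InputFrame& frame : recording)
        state.Step(frame);
    return state.hash;
}
```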
This research is being done as a part of production by many companies inside the industry. But it is always going to be the case that in a production environment, you just aren't going to have the time and resources to, say, try three different approaches to multicore architecture and compare them. Generally you make an educated guess and hope for the best. Additionally, because much of this research is done as part of product development, rarely are the results published, which means we're all out there doing the same work over and over.
Sunday, October 4, 2009
An Ode to the GPU. Hopefully not an Epitaph.
The last entry got me thinking about one area of game programming that has gotten unequivocally better over the last ten or fifteen years: graphics programming. From the advent of the GPU to programmable pipelines to the debugging and profiling tools available, things are for the most part way easier today than they were even five years ago.
I am not a graphics programmer. I'm a generalist who often finds himself programming graphics. So there are certainly gaps in the last ten or fifteen years where I wasn't really writing anything significant in graphics. There's a large gap between fixed-function GPUs and when HLSL was introduced -- I don't think I've ever done assembly-level pixel shaders, for example.
While I do remember doing a lot of OpenGL in the early days of fixed-function, I didn't do much multipass rendering on fixed-function hardware, where companies like id Software essentially faked a programmable pixel pipeline with texture and blend ops. Frankly, I thought that era was more about fighting the hardware than about interesting techniques -- the amount of BS you had to put up with made the area unattractive to me at the time.
Languages like HLSL and Cg piqued my interest in graphics again, and when you think about it, they are a pretty impressive feat. They allow a programmer to harness massively parallel hardware without having to think about the parallelism much at all, and the last few years have been more about interesting algorithms and more efficient operations than about fighting hardware capabilities. Sure, you still run up against the remaining fixed-function parts of the pipeline (namely, blending and texture filtering), but those can be worked around.
The tools have improved year over year. On the PC, things like PerfHUD have slowly gotten better, with more tools like it being made all the time. The gold standard remains PIX on the 360 -- so much so that many programmers I know will implement a new graphics technique first on the 360 just because it is so easy to debug when things go wrong.
So let me just praise the GPU engineers, tools makers, and language and API designers who have done such a good job of taking a hard problem and making it constantly easier to deal with. It is rare to get such productivity gains for programmers in any area, and we shouldn't take it for granted when it happens.
This is also why the dawn of fully programmable graphics hardware makes me nervous. Nvidia recently announced the Fermi architecture, which will allow the use of C++ on the GPU. Nvidia, AMD/ATI, and Intel are all converging on GPU architectures that allow more and more general computing, but is C++ really the answer here?
HLSL and its ilk make concurrent programming easy. The same cannot be said for C++. While exposing more of a GPU's underlying threading model will certainly allow a wider array of approaches, what is the cost? Are we so blinded by the possibilities that we forget the DirectX/OpenGL model is one of the few successes in hiding concurrency from programmers?
I have not really done much with CUDA or compute shaders, so perhaps I am being hasty in judgment. But when I see Intel or Nvidia touting that you can use C++ on their GPUs, I get a little worried. I am not sure this will make things better; in fact, it may make things very much worse.
Am I just paranoid?
Labels: C++, concurrency, DirectX, game programming, GPU, HLSL, Intel, Nvidia, Nvidia Fermi, OpenGL
Saturday, October 3, 2009
I'm Afraid the Grass is not Greener
I started reading Coders At Work, and wow -- it's rare that you run across a book about programming that's a page-turner, but this is it. I'm not very far into it, but a quote from Brad Fitzpatrick (LiveJournal, memcached, PerlBal) caught my attention. The context is that he is bemoaning how computers seem worse than they were ten years ago -- they feel slower even though under the hood they are faster. Then this question and answer comes up:
Seibel: So maybe things are not as fast as they ought to be given the speed of computers. But ten years ago there was no way to do what people, as users, can do today with Google.
Fitzpatrick: Yeah. So some people are writing efficient code and making use of it. I don't play any games, but occasionally I'll see someone playing something and I'm like, "Holy shit, that's possible?" It just blows me away. Obviously, some people are doing it right.
We are? The funny thing is, I'm not sure a lot of game programmers would feel that we are doing things right. We work with imperfect middleware and engines, with hacks piled upon hacks, until the game we are working on is no longer broken and actually fun to play. We have code in our games we would describe as "shit" or "crap" or "I can't believe we shipped with that." When I was just starting out, I thought maybe it was just the games I was working on that had this problem, but any time you talk to anyone at other companies, it is the same story -- from the most successful games to the smallest ones, we can all list a huge litany of problems in the code bases we work in or have written.
It's interesting reading this book because at least the first two programmers I've read are in totally different worlds than game development. Not better or worse, just different. The problems and constraints they have are somewhat alien to me.
I've often heard game developers say things like "game development is five to ten years behind the state of the art in 'straight' programming", referring to process or my least favorite term, "software engineering." I may have even said it myself.
The game industry often does a lot of navel gazing (like this entry!). We are constantly comparing ourselves to movies, theme parks, or how the rest of programmers work. Maybe we've got it all wrong. Maybe all along we've been figuring out how programming for games needs to work. If the world that Brad Fitzpatrick lives in feels alien to me and vice versa, then why would we ever think that the processes or techniques that work in one are automatically going to work for the other?
Food for thought.
Wednesday, September 23, 2009
Safety
I recently came across two articles that tangentially talk about the same thing -- technologies that are safe. Safe as in usable and not likely to get yourself in trouble.
The first was 30 years of C over at DadHacker. The second is a Joel on Software article (nice to see him actually writing about technology instead of pimping FogBugz or whatever he's selling these days) called The Duct Tape Programmer.
Anyway, I thought I'd write some of my opinions of the language features mentioned in these two articles. For those of you who've known me a while, it just may surprise you where my thoughts have evolved over the years.
Let's cover the C++ features:
Exceptions - While I have no problems with exceptions in a language like Java or C#, in C++ they just don't work well. In games we turn them off for code size and performance reasons, but I would tend to avoid them in C++ even if there was zero hit in either area. It is just too difficult to write exception-safe code in C++. You have to do extra work to do it, and the things that can break are sometimes very subtle. Most importantly, the entire culture and ecosystem built around the language is not exception-friendly. Rare is the library that is exception-safe in my experience. So just say no.
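A classic illustration of how subtle it gets (the type names here are hypothetical, but the trap is the well-known one):

```cpp
#include <cstdio>

struct Texture { Texture() { std::puts("texture loaded"); } };
struct Shader  { Shader()  { /* may throw on compile failure */ } };

void BindMaterial(Texture* t, Shader* s) { /* takes ownership of both */ }

void Subtle() {
    // Looks harmless, but if `new Shader` (or its constructor) throws
    // after `new Texture` has already succeeded, the Texture leaks:
    // nothing owns it yet, and no catch block can even name it.
    // Writing this safely requires smart pointers and, before C++17,
    // careful attention to unspecified argument evaluation order.
    BindMaterial(new Texture, new Shader);
}
```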
RTTI - Not very useful in practice. Again, there are overhead concerns in games, although most games I've seen end up rolling their own. But the base implementation is rather inflexible -- it is reflection of only the most basic sort, and often in the places you do need run-time type information, you need a lot more than just class ids. It's a feature with its heart in the right place but it just doesn't come together very well. I think part of the problem is its all-or-nothing nature -- usually only portions of my architecture need any sort of reflection, and I don't want to pay for it on all the other classes.
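Here's a minimal sketch of the sort of hand-rolled, opt-in type ID scheme I mean -- only the classes that need it participate, so the rest of the architecture pays nothing. All names are hypothetical, and a real version would also track parent IDs so casts work across an inheritance chain:

```cpp
#include <cstdint>

class GameObject {
public:
    virtual ~GameObject() {}
    static const uint32_t kTypeId = 1;
    virtual uint32_t TypeId() const { return kTypeId; }
};

class Projectile : public GameObject {
public:
    static const uint32_t kTypeId = 2;
    uint32_t TypeId() const override { return kTypeId; }
};

// Checked downcast built on the hand-rolled IDs. Note this simple form
// only matches exact types; real engines walk a parent-ID chain here.
template <typename T>
T* TypeCast(GameObject* obj) {
    return (obj && obj->TypeId() == T::kTypeId) ? static_cast<T*>(obj) : nullptr;
}
```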
Operator Overloading - Rarely useful outside of math libraries. I'm not even a huge fan of the iostreams model, to tell the truth.
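For what it's worth, the math case really is the exception, because the overloaded code reads like the notation on paper. A minimal hypothetical sketch:

```cpp
struct Vec3 {
    float x, y, z;
};

inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return Vec3{ a.x + b.x, a.y + b.y, a.z + b.z };
}

inline Vec3 operator*(const Vec3& v, float s) {
    return Vec3{ v.x * s, v.y * s, v.z * s };
}

// Reads like the math it implements: p' = p + v * dt
inline Vec3 Integrate(const Vec3& p, const Vec3& v, float dt) {
    return p + v * dt;
}
```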
Multiple inheritance - Only with pure virtual interfaces, and even then it should be used sparingly and avoided if possible. Sharing implementation via inheritance goes awry often enough with single inheritance; adding more base class chains just makes the problem worse. A sketch of the acceptable pattern follows.
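Because the interfaces carry no data and no implementation, there are no diamond or layout surprises (names hypothetical):

```cpp
// Pure-virtual interfaces: no data members, no implementation.
class IUpdatable {
public:
    virtual ~IUpdatable() {}
    virtual void Update(float dt) = 0;
};

class IRenderable {
public:
    virtual ~IRenderable() {}
    virtual void Render() const = 0;
};

// One concrete class for the actual implementation, any number of
// pure interfaces on the side.
class Door : public IUpdatable, public IRenderable {
public:
    void Update(float dt) override { m_angle += dt; }
    void Render() const override { /* draw the door at m_angle */ }
private:
    float m_angle = 0.0f;
};
```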
Templates - The big one. I'll admit to having a love affair with templates in my twenties. What can I say? They were fun and a shiny new toy. I sure had some excesses, but even my worst one (a cross-platform file system library) shipped in multiple products. Even then I hid them all behind a straight-C API, so only programmers who had to either debug or extend the library innards had to deal with the templates. If I had to do it again, I'd probably do it differently, but I could say that about any code I've written in my career, whatever the language. I do know that it was an improvement over the previous file system library that was in use, because the new one actually worked.
I can say with a degree of certainty that template metaprogramming is a bust for practical use. There are a few major problems with it: the language isn't really built for it (it's more a clever side effect than anything), there is no good way to debug it, and functional programming isn't very ingrained in the game development culture. Ironically, I think the last part is going to have to change as parallel programming creeps into larger and larger sections of the architecture, but that won't make template metaprogramming practical.
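The canonical example shows why: computing at compile time by abusing template recursion. It works, but the "program" lives in specializations, errors arrive as pages of instantiation backtraces, and there's no debugger to step through any of it:

```cpp
// Compile-time factorial via template recursion -- the classic
// template-metaprogramming demo.
template <unsigned N>
struct Factorial {
    static const unsigned value = N * Factorial<N - 1>::value;
};

// Recursion terminates through an explicit specialization.
template <>
struct Factorial<0> {
    static const unsigned value = 1;
};

static_assert(Factorial<5>::value == 120, "evaluated entirely at compile time");
```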
In any case, these days templates are just a tool in the toolbox, and not one I reach for that often. The code bases I've been working in recently all roll their own template container libraries* (provided for us by external vendors), and they do the job. My experience with code sharing via templates is that more often than not it isn't worth the trouble, but sometimes it is. Like anything we do, it is a tradeoff, and one I don't feel particularly passionate about either way.
*A somewhat amusing side note: I've done performance and code generation tests with one of the hand-rolled template container libraries I've encountered versus STL. STL came out on top for a lot of simple things like loop iteration overhead or sorting, on all the platforms I was interested in. Of course, I'm not about to rewrite hundreds of thousands of lines of code to use STL, and STL still is horrible for memory management. But I suppose that underscores the point "30 years of C" made -- even something as simple as a container library is hard to get right, even for experts. Which library I'm talking about shall remain anonymous for its own protection.
Labels: C++, exceptions, game programming, RTTI, STL, templates
The Other Cost of Code Bloat
The other day I almost wrote a redundant version of the exact same class that someone else on my project had written. In fact, if I hadn't asked this person a couple of general C# questions, and he hadn't put two and two together, I probably would have written that redundant class. Good detective work on his part, and shame on me for not searching the code base to see if someone else had already tackled this problem. While I've got a pretty good feel for the C++ which makes up the majority of code in our engine/tools, I haven't looked at the C# side as much as I probably should have.
As the code bases we write get larger and larger, and the team sizes we deal with get larger and larger, the question of how to avoid this scenario becomes an important one. Ideally you hire programmers who perform the necessary code archeology to get a feel for where things are in the code base, or who will ask questions of people more familiar with the code when unsure. Getting a code base of a million or more lines "in your head" takes time, though. I've been working with our licensed engine for about four years now, and there are still nooks and crannies that are unfamiliar to me.
Better documentation should help, but in practice such documentation is usually either nonexistent or horribly out of date, and it is rarely read even when it does exist. With a licensed engine, you are at the mercy of the little documentation you are provided, and at the end of the day, the code itself is the best documentation.
A sensible architecture with clear delineation of what should go where is often a bigger help. Knowing is half the battle, as a Saturday morning cartoon used to say -- in this case, knowing where to look. Again, with a licensed engine, you are at the mercy of what you are provided. Finding existing functionality usually comes down to experience with the code base and code archeology skills.
Recently, Adrian Stone has been writing an excellent series on minimizing code bloat. While the techniques he describes are about eliminating redundant generated and compiled code rather than actual source code, the mindset is the same when you are removing actual lines of code. Aside from the important compile time, link time, and executable size benefits, there is another benefit to removing as much code as you possibly can -- the code will occupy less "head space."
Unused or dead code makes it that much harder to do code archeology. Dead code certainly can make it more difficult to make lower level changes to the engine or architecture, as it is one more support burden and implementation difficulty. In the past, removing large legacy systems (whether written internally or externally) has had unexpected benefits in simplifying the overall architecture -- often there are lower level features that only exist to support that one dormant system.
One of my favorite things to do is delete a lot of code without the tools or game losing any functionality. It's not only the cleaned-out lines of code but the reclaimed head space that is a wonderful feeling -- "I will never have to think about X again." With the scale of the code bases we deal with today, we don't have the brain power to spare for things we don't need.
Monday, September 21, 2009
Rogue Programming
Gamasutra had an interesting article today titled Gaming the System: How to Really Get Ahead in the Game Industry. I found it probably had more to say about the political dysfunction that can often accompany game development rather than a how-to on being successful. To put it another way: if you find yourself having to follow the sneakier guidelines in this article too much, then you might want to consider a change in where you work.
The programming section is titled "Just Do It" and does have some truth to it. One of my leads and I came up with the term "rogue programming" for what he describes, which was half-joke, half-serious. Here's a quote:
As a programmer, it's not uncommon to see problems that you think should be fixed, or to see an opportunity to improve some piece of code, or speed up a process that takes a lot of time. It's also not uncommon for your suggestion to be ignored, or dismissed with an "it's not broke, so let's not fix it" response...
What should you do? You should just do it -- on your own time.
This is advice fraught with risk, because here's a hard-earned lesson for you: you don't always know best. I know, I know, you're a superstar hotshot programmer, and you see something that is broken, so it must be fixed. Sure, it's not in the schedule, but it'll just take a few hours -- what's the harm? The code base will be so much better when you're done, or the artists and designers will have a feature they didn't have before. How can that not make the project better?
Let me give a cold-water splash of reality: when it is all said and done at the end of the project, you're going to ship with a lot of broken code. I'm not talking about obvious bugs in the shipped project; I mean nasty, hack-filled, just-get-it-out-the-door brokenness in the code base, and some of that code will be code that you wrote. If this weren't true, then a long-lived project like the Linux kernel wouldn't still have thousands of developers contributing to it -- obviously, there is still stuff that is "broken" and can be improved!
So in the big picture, a single section of brokenness is not going to make or break your project, and usually there are bigger fish to fry on any given day -- it's best to fry them. Because if your project is cancelled because a major feature was late, will it matter that you cleaned up the way you calculated checksums for identifiers?
That said, if after all of this, you still think something is worth doing, let me tell you how to successfully rogue program:
First, and most importantly, let's define "on your own time." On your own time means you are hitting your scheduled work on schedule, and that will not change if you fix or implement this one thing. If you're behind on your scheduled work, then you really shouldn't be doing any rogue programming. Whether not impacting your schedule means you work a Saturday, do some exploration at home, or just exploit some slack in your schedule, if you don't complete the tasks you are supposed to be working on, you've done more damage than whatever improvement you're making on the side could offset.
Additionally, you need co-conspirators. These days, programming is a very collaborative process, and for the most part the cowboy mentality is a dying thing. If you talk to your lead or other engineers about a problem ("Hey, X is pretty f'd up") and no one else agrees, or you can't make the case, then hey, maybe X really isn't that important! You want to be working with a group of people you can convince with sound arguments that something is a problem, and a lot of the time a little discussion about a problem can turn into a scheduled task -- and no rogue programming.
Often you'll be faced with something that everybody agrees *should* be done, but there's no time to do it. In these cases, I've found with good leads (which I've been blessed with the last few years), you can get tacit approval to do something "on your own time." This often takes trust -- I wouldn't recommend going off and "just doing it" until you've earned that trust.
If you've gotten to this point, you're in pretty good shape to go off and do some "rogue programming" -- because at this point (and this is where the joke comes in), it really isn't rogue at all.
Now if you're at a company where you constantly feel like you need to "go around people" to "just do things," then maybe you really do need a change of venue, because that is not a healthy team. I happen to know someone who is hiring.
Thursday, August 27, 2009
Big O doesn't always matter
The other day I was optimizing a bit of debug code which verifies the integrity of objects in memory. The details aren't super-important, but the gist is that it's a function which runs periodically and makes sure that objects that should have been garbage collected were indeed purged from memory.
I don't make a general habit of optimizing debug code, but this was a special case -- before, this process only ran in the debug build. Artists and designers run a "development" build, which is an optimized build that still includes assertions and many other development checks.
We recently ran into a bug that would have been detected much earlier if this process had been running in the development build. While programmers run the debug build almost exclusively, we tend to stick to simpler test levels. Trying to debug an issue on a quick loading, small level is much easier than on a full-blown one.
The algorithm is pretty simple -- objects have somewhat of a tree structure, but for various reasons they only have parent links and not child links. For objects at the top-level of the tree, we know for sure whether they should be in memory or not. Objects at the bottom of the tree keep all their parents in memory if they are connected to the reference graph. So the debug check looks at each object and verifies that it is not parented (via an arbitrarily long parent chain) to an object which should have been purged.
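Here's a sketch of that check with hypothetical stand-in types -- the real code operates on engine objects, but the shape of the traversal is the same:

```cpp
#include <cassert>
#include <vector>

struct Object {
    Object* parent;       // objects know their parents, not their children
    bool shouldBePurged;  // known definitively for top-level objects
};

// Walk each object's (arbitrarily long) parent chain and assert that it
// never reaches an object that should have been purged from memory.
void VerifyNoPurgedParents(const std::vector<Object*>& allObjects) {
    for (Object* obj : allObjects) {
        for (Object* p = obj->parent; p != nullptr; p = p->parent) {
            // A live object keeping a purged parent in memory is exactly
            // the bug this debug check exists to catch.
            assert(!p->shouldBePurged);
        }
    }
}
```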
First thing I did was measure how long the process was taking, and did some lower level profiling to get an idea of where time was spent. Most importantly, I also saw where I was running into cache misses.
The first pass of optimization: the original loop was doing a lot of work per object that was simply unnecessary, because it was using a generalized iterator with more functionality than this specific case needed -- for most operations, particularly at editor time, that extra overhead is not a big deal. Removing this extra work sped up the process; it now took about 90% of the original time.
I then tried some high-level optimizations. There were two things I tried. First, the inner loop linearly checked each high-level object against an unsorted array of objects we know should be purged; I replaced this with a hash table from our container library. Second, I realized that a memoizing approach should help here -- since I'm dealing with a tree, I could use a bit array to remember whether I've already processed a parent object and deemed it OK. This would allow me to cut off traversal of the parent chain, which should eliminate a lot of work. Or so I thought.
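For reference, a sketch of the memoized traversal, again with hypothetical types; it assumes each object carries a stable index into the object table:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Object {
    Object* parent;
    bool shouldBePurged;
    size_t index;  // assumed stable index into the object table
};

// Once an object's chain has been verified, mark it in a bit array so
// later traversals stop as soon as they reach any verified ancestor.
void VerifyNoPurgedParentsMemoized(const std::vector<Object*>& allObjects) {
    std::vector<bool> verified(allObjects.size(), false);
    for (Object* obj : allObjects) {
        // Start at the object itself so it gets memoized too.
        for (Object* p = obj; p != nullptr; p = p->parent) {
            if (verified[p->index])
                break;  // this ancestor chain was already checked
            assert(!p->shouldBePurged);
            verified[p->index] = true;
        }
    }
}
```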
The new algorithm was faster, but not by much -- it ran in about 85% of the original time. The additional complexity was not worth 5% of the running time, so I went back to the simpler approach. This isn't unusual in optimization -- you can often try something you think will be a big help that turns out not to matter much. I've made the mistake in the past of sticking with a more complicated implementation for a marginal gain -- it was not worth it, and it made other optimizations with potentially bigger impact harder to do.
As far as why the gain wasn't that much: the unsorted array was relatively small (a handful of elements), so a linear search was faster because it was simpler and had better cache behavior than the hash table implementation I was using. And the tree structure of the objects was broad but not deep, so it's obvious in hindsight why memoization would not be a win.
Now, one thing that is nice to have is a decent container and algorithm library. I have that at my disposal, so implementing these two changes was a matter of minutes instead of hours. With that kind of arsenal, it is easy to try out algorithmic changes, even if they end up not working out.
At this point, I took another look at the cache behavior from my profiling tools. I tried something incredibly simple -- prefetching the next object into the cache while I was processing the current. This resulted in the process now running at 50% of the time of the original -- a 2X speedup, and likely fast enough for me to enable this in the development build. I'm going to measure again, and see if there are any other easy wins like this to be had.
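The change itself is almost embarrassingly small. A sketch, using __builtin_prefetch as the GCC/Clang spelling of the hint (console toolchains have their own equivalents):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Object { Object* parent; bool shouldBePurged; };

// While walking object i's parent chain, start pulling object i+1
// toward the cache so the next iteration doesn't stall on main memory.
void VerifyWithPrefetch(const std::vector<Object*>& objects) {
    for (size_t i = 0; i < objects.size(); ++i) {
        if (i + 1 < objects.size())
            __builtin_prefetch(objects[i + 1]);
        for (Object* p = objects[i]->parent; p != nullptr; p = p->parent)
            assert(!p->shouldBePurged);
    }
}
```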
The processors we use are fast -- incredibly fast -- and even with branch penalties on the in-order processors of today's consoles, they can still do a lot of work in the time it takes to retrieve data from main memory. So while on paper I'm using "slow" algorithms with worse big-O complexity, in practice memory access patterns can easily drown out any extra computation. The key, as always, is to measure and test your theories, and not just assume that any given approach will make something faster.