
Saturday, February 12, 2011

Virtual Addressing 101

If you haven't read Steven Tovey's excellent article on alternatives to new and malloc, you should. I'll wait.

All done? Good. One topic that was beyond the scope of that article is virtual addressing. Understanding virtual addressing is important to anyone implementing memory management on modern hardware. The PC and both next-gen consoles provide facilities for virtual address management, and it is important to understand the benefits and trade-offs of these facilities when doing memory management.

I am going to simplify many of the details and present a more abstracted view of some made-up hardware. A full discussion of virtual address handling specific to an architecture would be beyond the scope of this entry. The specific details of hardware and OS virtual addressing vary between different architectures, and even different processor generations within the same architecture. In practice, it is always important to read your processor and OS manuals to understand the specific implementation you are working with.

Physical Addressing
Often we like to think of memory in a machine as one big array, somewhat like this:

This is the physical memory map of the Solid Angle PlayBox, a console so spectacularly unsuccessful you probably have never heard of it (or it may just be the fact I made it up). It has 256 MB of memory, physically addressed from 0x0 to 0x10000000.

Real hardware doesn't necessarily have one big contiguous lump of physical address space, and may have different physical address ranges mapping to the same memory with different cache behavior. But again, we're trying to simplify things here.

So this seems great, but what are the problems? The problem is fragmentation. There are actually two types of fragmentation, and it is important to know the difference.

External Fragmentation
When you hear the unqualified word "fragmentation", most often what is being referred to is external fragmentation. External fragmentation occurs when memory has been partitioned into small, non-contiguous chunks, such that while the total amount of free memory is large enough for a big allocation, you can't actually fit it anywhere.

A simple example, using a first-fit heap. Say someone wrote loading code and didn't really consider memory management while doing so (tsk tsk!). This loading code starts by allocating a large temporary buffer for streaming:

Then the loading code reads into the temp buffer, and creates a bunch of permanent data structures.
The loading code then frees the temporary buffer.
Repeated many times, with some varying temporary buffer sizes, we could end up with a heap like this:
Now a large allocation comes along. We have enough total memory for it, but because memory is partitioned, we do not have a large enough contiguous block to fit it. That is external fragmentation: fragmentation external to the allocations.
Internal Fragmentation
Internal fragmentation is the type of fragmentation you don't hear about much, or if you do, it is not usually described as fragmentation. Internal fragmentation occurs when the size of the memory manager's internal allocation is larger than what the application actually requested. This is fragmentation internal to the allocations.

An example can be found with fixed-size block allocators. Often you can have a system that makes many allocations, all slightly varying in size. One solution to this is to use a fixed-size block allocator that uses a block size larger than any of your potential allocations. This can lead to a situation where a small amount of memory is unused in each allocation:

Internal fragmentation can occur with other allocators, such as the buddy system.

Virtual Addressing
Most programmers at some point have heard the phrase "All problems in computer science can be solved by another level of indirection", attributed to David Wheeler. Many haven't heard the corollary "...except for the problem of too many layers of indirection." This is a shame, because I think the two together describe the condition of the modern programmer.

Virtual addressing is the direct application of this idea -- instead of accessing memory through its physical address, we add a level of indirection and access it through a virtual address. This indirection is performed in the hardware, so it is mostly transparent to the programmer, and fast, with caveats. Virtual addressing can mitigate many fragmentation issues.

First, an important public service announcement.

Virtual Addressing != Paging to hard drive
Do not confuse virtual addressing with virtual memory management systems that may page data to the hard drive (such as Windows or Linux). I think these concepts sometimes become confused because many descriptions lump the two things together into a heading of "virtual memory." They are not the same thing -- paging systems are built on top of virtual addressing, but you do not need to page memory to the hard drive to reap the benefits of virtual addressing. You don't even need a hard drive!

Virtual Address Space
Virtual addressing implementations are very specific to CPU architecture and OS, but they all share some common properties.

They all have the concept of a virtual address space. The address space may be much larger than the physical memory of the machine -- for example, our hypothetical console has only 256 MB of physical memory, but with 32 bit pointers we have a 4 GB address space. In practice, architectures and OSes may limit the address space available to applications, either reserving address space for the kernel, or using portions of the address space for different types of memory access (such as non-cached reads/writes). On multi-process operating systems such as Windows or Linux, each process has its own address space.

Address space is allocated independently from physical memory, and you do not have to have physical memory backing an address space allocation.

The address space is divided into pages. Page sizes vary depending on architecture/OS, but common sizes are 4K, 64K, and 1 MB. Page sizes are always powers of two, as this simplifies the work of translating a virtual address into a physical one. A CPU/OS may only support a fixed page size, or may allow programmers to pick a page size when pages are allocated.

The Page Table
Virtual addresses are translated into physical addresses via a page table. A page table is a simple mapping between a virtual page and a physical page. Going back to our hypothetical console, which has a page size of 64KB, a page table might look like this (again, real world implementations vary):


Each entry in the page table maps a virtual address page to a physical address page. A virtual address allocation may span multiple contiguous address pages, but does not require contiguous physical pages.

When the CPU encounters an instruction which accesses a memory address, it must translate the virtual address into a physical address to know where the data is located in physical memory. With a 64KB page size, the upper 16 bits of a 32 bit address specify the page number, and the lower 16 bits the offset into the page. This is why page sizes are a power of 2 -- determining the page number becomes a simple bit mask and shift. The CPU looks up the virtual page entry in the page table, and finds the corresponding physical page number. This is done for every memory access.

Because this operation happens for every memory access, it needs to be fast and implemented in hardware. There's only one problem: the page table is far too big to be stored on the CPU chip.

Translation Lookaside Buffers
The solution is a special cache for address translation. Because the CPU can not fit the entire page table in on-chip memory, it uses a translation lookaside buffer (TLB), which is a special cache that holds the most recently used page table entries. TLBs can often hold enough page entries for a large amount of address space, usually larger than the amount of memory the L1 or L2 caches can hold.

Back to our memory access scenario, when the CPU must translate a virtual page into a physical page, it first looks in the TLB. If the page table entry is found, the address translation happens very quickly and the CPU continues on its work. If there is a TLB miss, this can often mean a TLB miss handler is invoked. This is actually a software handler provided by the operating system, as the entire page table is managed by the OS, not the CPU. Thus, TLB misses can be very expensive.

On most modern processors, the TLB is multi-level, similar to how L1 and L2 caches work.  Thus the CPU may check a smaller, faster address translation cache before consulting the larger, slower TLB, before it resorts to the software handler of a full TLB miss.

The expense of a TLB miss is another reason data locality is very important to performance. If you are hitting data structures willy nilly in address space, aside from the cache misses you will incur, you may incur a lot of TLB misses, too. This is a double-whammy of not keeping data accesses local!

Memory Protection
Most CPUs also add the capability to specify what kind of access to a page is allowed. Page table entries can be constructed which disallow writes, or disallow code execution on some architectures. The former can be used to make sure application-level code does not overwrite kernel data structures, and the latter can be used to help protect against buffer overrun attacks by not making it possible for the CPU to jump into data-only memory. When invalid accesses occur, a HW exception is raised.

You can often specify the memory protection for a page with API calls, which can sometimes be useful for debugging tricky memory overwrite problems, by protecting pages against writes and writing a custom HW exception handler.

Memory protection is also how OSes implement demand-paging of memory from the hard drive. When the OS moves a physical page of memory to the hard drive, it modifies the virtual page table entry to prevent reads and writes. If that page is accessed, a HW exception occurs which the OS handles by loading the appropriate data from the hard drive into a physical page, and setting the page table entry to point to that physical page. Execution of the program then continues from where the exception was fired.


Virtual Addressing-Aware Memory Management
The presence of virtual addressing has a great impact on memory management. While it does not necessarily change the fundamental behavior of many allocator types, it is important to understand when physical memory is actually committed. Physical pages returned to the OS can be used to make up much larger, contiguous allocations, so at the system level, many problems with external fragmentation are severely reduced.

Direct Page Allocation for Large Blocks
For large allocations (> page size), the best memory allocation strategy is sometimes to allocate virtual address space and physical pages directly from the operating system. These types of allocations are often rare, happening when loading data. The advantage is you will not suffer external fragmentation from this allocation strategy, as the OS can always remap physical pages to a contiguous virtual address space if they are available, even if they are not contiguous in physical address space.

The trade-off for doing this is internal fragmentation. Your large allocation may not be an exact multiple of the page size, leading to wasted memory. First, pick a good threshold for when to do direct page allocation -- this is not a good strategy for allocations that are not much larger than the page size. Wasted memory can also be mitigated by choosing an appropriate page size for the allocation on architectures that allow this. For example, where the waste would be a significant percentage of the allocation, you may want to choose 4K pages rather than 64K pages. The trade-off here is that smaller pages mean many more TLB misses, which can hurt performance.

Stack Allocators
One key thing with virtual addressing is you can allocate large regions of address space without committing physical memory to it. Stack allocators can be implemented by allocating a large region of address space, but only committing physical pages as the stack allocator pointer advances.
The advantage here is you can choose a large maximum stack size without actually committing physical memory to it. If you do hit the peak, those physical pages must come from somewhere, but this allows for situations where your peak occurs at a point where those pages are free from other systems (loading comes to mind).

It should be noted that the C/C++ call stack on Windows works exactly like this -- when you specify a stack size for an application, you are specifying the size of the address space allocation, not the physical allocation. As the stack grows, the runtime allocates physical pages. This is done transparently with a special page called a guard page: when code touches it, a HW exception fires, and an OS handler allocates physical memory for that page and sets the next virtual page as the new guard page.


Pooled allocators
For small allocations, fixed-size pools are often a good solution. Virtual addressing can allow us to have multiple pools of different sizes without fragmenting overall memory.

Basically, we implement our fixed-size pool as a linked list of mini-pools, each some multiple of the page size. On our hypothetical console, 64KB may be a good mini-pool size. If a mini-pool is full, we allocate another set of pages from the OS. If a mini-pool becomes empty, we return the page to the OS. Again, because physical pages do not need to be contiguous when mapped to virtual address pages, these freed pages can be used for any size of allocation, from anywhere in the system.

General Advice
When dealing with virtual allocations, the general rule of thumb is "return physical pages to the operating system whenever you can." If a physical page is allocated but not being used, the OS can not use it for some other, larger allocation that may need to occur. The days of allocating an entire console's memory space in one block and managing it yourself are largely gone, unless you wish to write your own page allocator (which can and has been done). There are some caveats to this, such as with page allocation thrashing, and allocations that are required to be physically contiguous (see below).

Virtual Addressing Problems
Physically Contiguous Requirements
Your particular platform may require certain allocations be performed in contiguous physical memory, such as GPU resources. This is often the case on consoles. Virtual addressing only mitigates external fragmentation for virtual allocations -- for these physical allocations, you still have to deal with fragmentation at the physical page level. Often the way to handle this is to set aside memory for physical resources up front in your application, and manage them separately from your virtual allocations. 



Page Allocation Thrashing
Allocating virtual address space and committing physical pages are not cheap operations. Particularly with stack allocators and pools, you want to avoid thrashing -- cases where a repeated pattern of allocs/frees causes pages to be allocated and freed in rapid succession. This can be worked around by thresholding when you free a physical page to the OS -- for example, with a pool, you may require that some percentage of the previous physical page be free before freeing the next, totally free one. Another strategy is to only do page frees at specific, known points where the performance hit is predictable.

Page Size and TLB Misses
Page size can have a very important impact on performance. On platforms which allow you to choose page size when performing a virtual address space allocation, you want to pick the largest page size possible, as larger pages cause far fewer TLB misses. This is often a tricky balance between wasting memory due to internal fragmentation and losing performance due to TLB misses. As always, data locality helps to reduce TLB misses.

Page Size and Physical Fragmentation
On platforms with variable page sizes, you can run into problems where you can not allocate a large page even though the memory is free. This is due to external fragmentation of the physical pages themselves -- if you allocate a large number of 4K pages, free them, and try to allocate a 1 MB page, there may not be enough contiguous physical memory to satisfy it. I've even seen some platforms that will not coalesce smaller pages into larger ones even if they are contiguous (i.e. once you've allocated physical memory as a 64KB page, it will never be coalesced into a 1 MB page). This can be mitigated similarly to physically contiguous allocation restrictions -- allocate your large pages up front, and write your own page allocator that your other allocators work on top of.

Address Space Fragmentation
It is possible to fragment virtual address space itself. One should be careful of reserving too much virtual address space for things like stack allocators, or leaking address space. While on console the address space is many times larger than the physical memory, and thus usually has enough slack to make up for carelessness, on PC, particularly when writing tools in 32 bit, you can run into situations where you fragment the virtual address space itself.

Summary
My hope is anyone reading this who did not have a good understanding of virtual addressing now understands a little better what is going on under the hood with memory management, at least at a basic level. As always, platform details differ, and if you are doing any kind of memory management work, you really should read the CPU and OS docs on memory management, virtual addressing, and the TLB for your specific platform.

Even programmers who are not writing custom memory managers can benefit from understanding how virtual addressing works. Almost every memory access performs address translation -- and this translation is another important reason to keep data accesses local when designing data structures.

Sunday, February 6, 2011

Lazy Logging Parameter Evaluation With Variadic Macros

This entry is not rocket science, and probably won't be that informative to experienced programmers, but I've seen commercial code bases get something as simple as this wrong. It requires compiler support for variadic macros, which have been in Visual C++ for a while and are also supported by later versions of GCC.  

Most games have some sort of logging system. Debugging by printf is one of the first debugging tools most programmers learn. While there are many other tools in the debugging toolbox, this particular one is usually not that far out of reach. Some problems just lend themselves to being solved by logging.

We want to minimize the performance impact of logging code, without having to limit the number of logging statements we place in code. We do not want to constantly recompile different configurations of the game with or without logging enabled. While compile time stripping of logging during development will have the least performance impact, there are many times when you may be at a tester, designer or artist's desk and need to log key information. Providing them with a custom build is a productivity hit for everyone involved. 

There are two main performance hits for logging:

1. The cost of the logging itself (writing to the debug window, to a file, to a console, etc)
2. The cost of parameter evaluation

Anyone who has put a log statement in a piece of code executed many times a frame knows it can absolutely kill performance -- writing to a file or to debug output is time consuming in itself. This first cost can be solved by a channel system that selectively enables logging. Even beyond the performance cost, it is useful to enable different types of logging at different times. If you're debugging AI code, you are probably not interested in logging from the streaming system. Log statements specify which channel they are on (say, by integer ID), and the logging code checks if that channel is enabled.

Where should this check occur? I've seen some code bases that do this in the logging function itself. This is a mistake, because even if you do not actually output anything, you are still paying the second cost, the cost of parameter evaluation.

Logging, by nature, is very string-intensive. Often you will output human-readable debug names for various assets and entities when logging. Strings computed as parameters to log statements often incur performance penalties - memory allocations, string operations, etc. In addition to string overhead, some information you may wish to log may not be particularly cheap to calculate.

float ThisFunctionIsVeryExpensiveToEvaluate();

LogPrintf(LOG_INFO, "Expensive but precious debug info: %g\n", ThisFunctionIsVeryExpensiveToEvaluate());

What we want is for the expensive function to only be evaluated if the LOG_INFO channel is enabled.

The way to do this is to put the channel check in the macro itself, and only call the log function if the check succeeds. Here's some sample code that accomplishes this using variadic macros:


// Define this to 0 to disable logging
#define ENABLE_LOGGING 1

const int LOG_ERROR=0x1;
const int LOG_WARNING=0x2;
const int LOG_INFO=0x4;

#if ENABLE_LOGGING
   // Simple channel system (you want IsLoggingChannelEnabled to be cheap,
   // but there are other ways to implement something like this)
   static int GlobalEnabledLogChannels;

   // Make sure your compiler inlines this function, as it will be called many
   // times
   // You may want to force it to be inlined using whatever compiler-specific
   // syntax is available to you.
   inline bool IsLoggingChannelEnabled(int channel)
   {
       return 0 != (GlobalEnabledLogChannels & channel);
   }

   // This overload is present to handle the case where the channel argument
   // is optional
   inline bool IsLoggingChannelEnabled(const char*)
   {
       return true;
   }

   // Note: I've seen many logging systems which make the log channel optional.
   // I'm going to handle this case to show how it is done, but if you always
   // require a log channel, this code becomes simpler (for instance, you can
   // make format a required argument to the macro, and not need the ## handling)
   void MyLogPrintf(const char* format, ...);
   void MyLogPrintf(int channel, const char * format,...);

   // The ## is some syntax magic to make GCC ignore the preceding
   // comma if no arguments after channel are present
   // This can happen if no channel is specified in the log print, as it is optional
   #define LogPrintf(channel, ...) \
      if(!IsLoggingChannelEnabled(channel)) {} else MyLogPrintf(channel, ##__VA_ARGS__)


#else

   #if _MSC_VER
      // __noop is Visual C++ specific syntax for "do nothing".
      #define LogPrintf(...) __noop
   #else
      // Compiler should strip this out - but always look at the disassembly
      // to make sure!
      inline void Noop()
      {
      }

      #define LogPrintf(...) Noop()
   #endif

#endif

// example log statements
void SomeFunction()
{
    LogPrintf("Hello world!\n");
    LogPrintf(LOG_ERROR, "You should see this very important error\n");
    LogPrintf(LOG_INFO, "Expensive info: %g\n",
        ThisFunctionIsVeryExpensiveToEvaluate());
}

Hopefully blogger didn't mangle the formatting of all that.

The key concept is to call IsLoggingChannelEnabled() in the macro itself. The if syntax it uses is specially constructed -- done this way it will not change the semantics of an if statement without braces. For example:

if (rand()%2 == 0)
    LogPrintf(LOG_INFO,"Rand was even\n");
else
    LogPrintf(LOG_INFO, "Rand was odd\n");

If we did something like this:

#define LogPrintf(channel, ...) \
if(IsLoggingChannelEnabled(channel)) MyLogPrintf(channel, ##__VA_ARGS__)

that would change the meaning of the above if statement, and the else case would be on if(IsLoggingChannelEnabled(channel)), not the original rand check!

A note on why I made the channel optional: In a particular legacy code base I was dealing with, the channel argument was optional on logging statements, and I had to handle that case without changing every log statement in the application. I wanted to show how you could support something like that.

The main drawbacks with this approach are an increase in executable size, due to all the log channel checks being inlined, and the performance hit of the check itself on each log statement. It really depends on your particular game whether you are willing to pay these costs in your development/in-house build or not.

Tuesday, December 22, 2009

More Stencil States for Light Volume Rendering

A while back I wrote a short entry on stencil states for light volumes. The method I posted works but relies on using a zfail stencil operation. Shortly after, I quickly discovered that it ran considerably slower on ATI cards than on the original Nvidia card I had been writing on, and have been meaning to post an update.

On certain hardware -- specifically, ATI PC hardware -- using anything but Keep in the zfail operation can disable early stencil rejection, and this caused quite a slowdown.

The solution I figured out (and I'm sure others have) was to switch to a method which only relies on stencil pass operations:

AlphaBlendEnable = false
StencilEnable = true

ColorWriteChannels None

DepthBufferEnable = true

StencilDepthBufferFail Keep

// render frontfaces so that any pixels in back of them have stencil incremented
CullMode CounterClockwise
StencilFunction Always

// If a pixel is in back of the volume frontface, then it is potentially inside the volume
StencilPass Increment;

// render volume



// render backfaces so that only pixels in back of the backface have stencil decremented
CullMode Clockwise
// pass stencil test if reference value < buffer, so we only process pixels marked above.
// Reference value is 0. This is not strictly necessary but an optimization
StencilFunction Less
// If a pixel is back of the volume backface, then it is outside of the volume, and should not be considered
StencilPass Decrement

// render volume
AlphaBlendEnable = true
ColorWriteChannels RGB
// only process pixels with 0 < buffer
StencilFunction Less
// zero out pixels for so we don't need a separate clear for next volume
StencilPass Zero

//render a screen space rectangle scissored to the projection of the light volume


There is a problem with this method -- if the light volume intersects the near plane, it won't work, because the portion of the light volume in front of the near plane will never increment the stencil buffer.

My solution to this was pretty simple -- if the light volume intersects the near plane, I use the zfail method from the earlier post. Otherwise, I use the stencil pass operation. For most lights, we're using the fastest path on both the major brands of cards. I briefly scanned some papers and articles on shadow volumes (a very similar problem), hoping to find an alternate way to cap volumes intersecting the near plane, but didn't see anything that looked particularly easy to implement or would necessarily perform that well, and this method got performance on ATIs and Nvidias mostly on par.

What about two-sided stencil? This is a mode in DX9 where you can render both backfaces and frontfaces in one pass, with separate stencil operations on each. Because the stencil increment/decrement operations wrap around  (i.e. decrementing 0 becomes 255, incrementing 255 becomes 0), ordering doesn't really matter (although you have to make the StencilFunction Always on both). I did some quick tests using two sided stencil and my initial results showed it was actually slower than rendering both passes separately. I didn't spend much time on it so it is possible that I simply screwed something up, and plan to revisit it at some point.

Saturday, October 17, 2009

Where is the game architecture research?

I was reading this paper on EA's internal STL implementation, and it got me thinking -- where is the game architecture research?

There is a large amount of academic research poured into real-time graphics, experimental gameplay and entertainment, AI, and even MMO server design. But I find there are a number of architecture issues unique to games that are lacking in any sort of research. I've done searches and not come up with a whole lot, maybe I'm just not using the correct keywords.

Memory is not a solved problem for game consoles
Most if not all garbage collection research is focused on desktop or server based memory usage patterns. They assume virtual memory paging. Many gc algorithms are impractical for a fixed memory environment where utilization needs to be close to 100%. While some game engines use garbage collection, the algorithms are primitive compared to the state of the art generational collectors found in desktop environments, and the waste of memory resources is often 10-20% of total memory. Games generally can not afford large mark or sweep phases as they must execute at a smooth frame rate. Fragmentation can still be an issue in a fixed memory environment, although in this case many allocator strategies exist to combat this.

Multicore architectures for games 
While this is still an active area of research for desktop and server applications, too, I've found exactly one paper that attempts some research in this area for game architectures. This is a particularly fruitful area for research since there are many competing ideas out there (message passing! software transactional memory! functional programming!), but very few researchers testing any of them in the context of building a game. It is difficult enough to make a game by itself, let alone test multiple techniques for exploiting multiple cores. I find this somewhat interesting because aside from servers and scientific processing, games are pushing the state of the art in multicore programming more than anything else.

Automated testing
This is something the EA STL paper brings up -- traditional automated testing techniques break down pretty quickly beyond unit testing lower level libraries. So much of the end result of game code is subjective and emergent that determining how to test even basic functionality automatically is a huge unsolved problem. This results in a large amount of manpower being used for game testing, particularly in the area of regression testing.

This research is being done as a part of production by many companies inside the industry. But it is always going to be the case that in a production environment, you just aren't going to have the time and resources to, say, try three different approaches to multicore architecture and compare them. Generally you make an educated guess and hope for the best. Additionally, because much of this research is done as part of product development, rarely are the results published, which means we're all out there doing the same work over and over.

Sunday, October 4, 2009

An Ode to the GPU. Hopefully not an Epitaph.

The last entry got me thinking about one area of game programming that has gotten unequivocally better over the last ten or fifteen years: graphics programming. From the advent of the GPU to programmable pipelines to the debugging and profiling tools available, things are for the most part way easier today than they were even five years ago.

I am not a graphics programmer. I'm a generalist who often finds himself programming graphics. So there are certainly gaps in the last ten or fifteen years where I wasn't really writing anything significant in graphics. There's a large gap between fixed-function gpus and when HLSL was introduced -- I don't think I've ever done assembly-level pixel shaders, for example.

While I do remember doing a lot of OpenGL in the early days of fixed-function, I didn't do much multipass rendering on fixed-function hardware, where companies like id Software essentially faked a programmable pixel pipeline with texture and blend ops. Frankly, I thought that era was more about fighting the hardware than about interesting techniques -- the amount of bs you had to put up with made the area unattractive to me at the time.

Languages like HLSL and Cg piqued my interest in graphics again, and when you think about it, they are a pretty impressive feat. They allow a programmer to harness massively parallel hardware without having to think about the parallelism much at all, and the last few years have been more about interesting algorithms and more efficient operations than about fighting hardware capabilities. Sure, you still run up against the remaining fixed-function parts of the pipeline (namely, blending and texture filtering), but those can be worked around.

The tools have improved year over year. On the PC, things like PerfHUD have slowly gotten better, with more tools like it being made all the time. The gold standard still remains PIX on the 360 -- so much so that many programmers I know will do an implementation of a new graphics technique first on the 360 just because it is so easy to debug when things go wrong.

So let me just praise the GPU engineers, tools makers, and language and API designers who have done such a good job of taking a hard problem and making it constantly easier to deal with. I think it is rare to get such productivity gains for programmers in any area, and we shouldn't take for granted when it happens.

This is also why the dawn of fully programmable graphics hardware makes me nervous. Nvidia recently announced the Fermi architecture, which will allow the use of C++ on the GPU. Nvidia, AMD/ATI, and Intel are all converging on GPU architectures that allow more and more general computing, but is C++ really the answer here?

HLSL and its ilk make concurrent programming easy. The same cannot be said for C++. While an architecture where the underlying threading model of a GPU is more open will certainly allow for a wider array of approaches, what is the cost? Are we so blinded by the possibilities that we forget that the DirectX/OpenGL model is one of the few successes in hiding concurrency from programmers?

I have not really done much with CUDA or compute shaders, so perhaps I am being hasty in judgement. But when I see Intel or Nvidia touting that you can use C++ on their GPUs, I get a little worried. I am not sure that this will make things better, and in fact, may make things very much worse.

Am I just paranoid?

Saturday, October 3, 2009

I'm Afraid the Grass is not Greener

I started reading Coders At Work, and wow, it's rare that you run across a book about programming that's a page-turner, but this is it. I'm not very far into it, but a quote from Brad Fitzpatrick (LiveJournal, memcached, PerlBal) caught my attention. The context is he is bemoaning how it seems like computers are worse than they were ten years ago, that they feel slower even though under the hood they are faster, etc. Then this question and answer comes up:

Seibel: So maybe things are not as fast as they ought to be given the speed of computers. But ten years ago there was no way to do what people, as users, can do today with Google.

Fitzpatrick: Yeah. So some people are writing efficient code and making use of it. I don't play any games, but occasionally I'll see someone playing something and I'm like, "Holy shit, that's possible?" It just blows me away. Obviously, some people are doing it right.

We are? The funny thing is, I'm not sure a lot of game programmers would feel that we are doing things right. We work with imperfect middleware and engines, with hacks upon hacks piled upon them, all until the game we are working on is no longer broken and actually fun to play. We have code in our games we would describe as "shit" or "crap" or "I can't believe we shipped with that." When I was just starting out, I thought maybe it was just the games I was working on that had this problem, but any time you talk to anyone at other companies, it is the same story -- from the most successful games to the smallest ones, we can all list a huge litany of problems in the code bases in which we work or have written.

It's interesting reading this book because at least the first two programmers I've read are in totally different worlds than game development. Not better or worse, just different. The problems and constraints they have are somewhat alien to me.

I've often heard game developers say things like "game development is five to ten years behind the state of the art in 'straight' programming", referring to process or my least favorite term, "software engineering." I may have even said it myself.

The game industry often does a lot of navel gazing (like this entry!). We are constantly comparing ourselves to movies, theme parks, or how the rest of programmers work. Maybe we've got it all wrong. Maybe all along we've been figuring out how programming for games needs to work. If the world that Brad Fitzpatrick lives in feels alien to me and vice versa, then why would we ever think that the processes or techniques that work in one are automatically going to work for the other?

Food for thought.

Wednesday, September 23, 2009

Safety

I recently came across two articles that tangentially talk about the same thing -- technologies that are safe. Safe as in usable and not likely to get yourself in trouble.

The first was 30 years of C over at DadHacker. The second is a Joel on Software article (nice to see him actually writing about technology instead of pimping FogBugz or whatever he's selling these days) called The Duct Tape Programmer.

Anyway, I thought I'd write some of my opinions of the language features mentioned in these two articles. For those of you who've known me a while, it just may surprise you where my thoughts have evolved over the years.

Let's cover the C++ features:

Exceptions - While I have no problems with exceptions in a language like Java or C#, in C++ they just don't work well. In games we turn them off for code size and performance reasons, but I would tend to avoid them in C++ even if there was zero hit in either area. It is just too difficult to write exception-safe code in C++. You have to do extra work to do it, and the things that can break are sometimes very subtle. Most importantly, the entire culture and ecosystem built around the language is not exception-friendly. Rare is the library that is exception-safe in my experience. So just say no.
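To make the subtlety concrete, here is a minimal made-up example (the `Texture` class and the failure flag are purely illustrative, not from any real code base) of code that looks fine but leaks when an exception is thrown, next to the RAII shape that survives it:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical resource; counts live instances so a leak is observable.
struct Texture {
    static int liveCount;
    Texture()  { ++liveCount; }
    ~Texture() { --liveCount; }
};
int Texture::liveCount = 0;

// Not exception-safe: if the second allocation throws, `diffuse` leaks.
void LoadUnsafe(bool failSecond) {
    Texture* diffuse = new Texture();
    if (failSecond)
        throw std::runtime_error("oom");   // stands in for the second `new` failing
    Texture* normal = new Texture();
    delete normal;
    delete diffuse;
}

// Exception-safe version: ownership lives in an object whose destructor
// runs during stack unwinding (RAII), so nothing leaks.
struct TextureHolder {
    Texture* p;
    TextureHolder() : p(new Texture()) {}
    ~TextureHolder() { delete p; }
};

void LoadSafe(bool failSecond) {
    TextureHolder diffuse;
    if (failSecond)
        throw std::runtime_error("oom");
    TextureHolder normal;
}
```

The point isn't that this particular leak is hard to fix -- it's that every pair of operations in a large code base has to be audited this way once exceptions are in play.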

RTTI - Not very useful in practice. Again, there are overhead concerns in games, although most games I've seen end up rolling their own. But the base implementation is rather inflexible -- it is reflection of only the most basic sort, and often in the places you do need run-time type information, you need a lot more than just class ids. It's a feature with its heart in the right place but it just doesn't come together very well. I think part of the problem is its all-or-nothing nature -- usually only portions of my architecture need any sort of reflection, and I don't want to pay for it on all the other classes.
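As a sketch of what "rolling your own" often looks like -- the names and layout here are invented, not taken from any particular engine -- an opt-in type ID scheme can be as small as this:

```cpp
#include <cassert>

// Each class that opts in gets a unique TypeId; classes that don't opt in pay nothing.
typedef const void* TypeId;

// The address of a function-local static is unique per class, since each
// class gets its own StaticType() function.
#define DECLARE_TYPE(Class)                                      \
    static TypeId StaticType() { static char tag; return &tag; } \
    virtual TypeId GetType() const { return Class::StaticType(); }

struct Component {
    DECLARE_TYPE(Component)
    virtual ~Component() {}
};

struct RenderComponent : Component {
    DECLARE_TYPE(RenderComponent)
};

// A checked downcast, like dynamic_cast but with no per-class string
// tables and no library support required.
template <typename T>
T* TypeCast(Component* c) {
    return (c && c->GetType() == T::StaticType()) ? static_cast<T*>(c) : 0;
}
```

This exact-match cast doesn't handle casting to a base class -- real systems usually chain parent type IDs -- but unlike built-in RTTI, only the classes that need it pay for it.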

Operator Overloading - Rarely useful outside of math libraries. I'm not even a huge fan of the iostreams model, to tell the truth.

Multiple inheritance - Only with pure virtual interfaces, and even then it should be used rarely and avoided if possible. Sharing implementation via inheritance goes awry enough in single inheritance; adding more base class chains just makes the problem worse.
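To be concrete about the one acceptable shape, here's a minimal sketch (the interface names are made up):

```cpp
#include <cassert>

// Pure virtual interfaces: no data members and no implementation,
// so there is nothing to collide between base classes.
struct ISerializable {
    virtual void Serialize() = 0;
    virtual ~ISerializable() {}
};

struct ITickable {
    virtual void Tick(float dt) = 0;
    virtual ~ITickable() {}
};

// One concrete class implementing any number of interfaces is tolerable;
// inheriting from two concrete bases that share implementation is where
// things go wrong.
class GameObject : public ISerializable, public ITickable {
public:
    GameObject() : ticks(0) {}
    virtual void Serialize() {}
    virtual void Tick(float) { ++ticks; }
    int ticks;
};
```

Since neither interface carries data or implementation, the usual diamond and layout headaches never come up.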

Templates - The big one. I'll admit to having a love affair with templates in my twenties. What can I say? They were fun and a shiny new toy. I sure had some excesses, but even my worst one (a cross-platform file system library) shipped in multiple products. Even then I hid them all behind a straight-C API, so only programmers who had to either debug or extend the library innards had to deal with the templates. If I had to do it again, I'd probably do it differently, but I could say that about any code I've written in my career, whatever the language. I do know that it was an improvement over the previous file system library that was in use, because the new one actually worked.

I can say with a degree of certainty that template metaprogramming is a bust for practical use. There are a few major problems with it: the language isn't really built for it (it's more a clever side effect than anything), there is no good way to debug it, and functional programming isn't very ingrained in the game development culture. Ironically, I think the last part is going to have to change as parallel programming creeps into larger and larger sections of the architecture, but that won't make template metaprogramming practical.

In any case, these days templates are just a tool in the toolbox, and not one I reach for that often. The code bases I've been working in recently all roll their own template container libraries* (provided for us by external vendors), and they do the job. My experience with code sharing via templates is that more often than not it isn't worth the trouble, but sometimes it is. Like anything we do, it is a tradeoff, and one I don't necessarily feel particularly passionate about either way.

*A somewhat amusing side note: I've done performance and code generation tests with one of the hand-rolled template container libraries I've encountered versus STL. STL came out on top for a lot of simple things like loop iteration overhead or sorting, on all the platforms I was interested in. Of course, I'm not about to rewrite hundreds of thousands of lines of code to use STL, and STL still is horrible for memory management. But I suppose that underscores the point "30 years of C" made -- even something as simple as a container library is hard to get right, even for experts. Which library I'm talking about shall remain anonymous for its own protection.

The Other Cost of Code Bloat

The other day I almost wrote a redundant version of the exact same class that someone else on my project had written. In fact, if I hadn't asked this person a couple of general C# questions, and he hadn't put two and two together, I probably would have written that redundant class. Good detective work on his part, and shame on me for not searching the code base to see if someone else had already tackled this problem. While I've got a pretty good feel for the C++ which makes up the majority of code in our engine/tools, I haven't looked at the C# side as much as I probably should have.

As the code bases we write get larger and larger, and the team sizes we deal with get larger and larger, the question of how to avoid this scenario becomes an important one. Ideally you hire programmers who perform the necessary code archeology to get a feel for where things are in the code base, or who will ask questions of people more familiar with the code when unsure. Getting a code base of a million or more lines "in your head" takes time, though. I've been working with our licensed engine for about four years now, and there are still nooks and crannies that are unfamiliar to me.

Better documentation should help, but in practice it is rarely read -- usually such documentation is either nonexistent or horribly out of date. With a licensed engine, you are at the mercy of the little documentation you are provided, and at the end of the day, the code itself is the best documentation.

A sensible architecture with clear delineation of what should go where is often a bigger help. Knowing [where to look] is half the battle, as a Saturday morning cartoon show put it. Again, with a licensed engine, you are at the mercy of what you are provided. Finding existing functionality usually comes down to experience with the code base and code archeology skills.

Recently, Adrian Stone has been writing an excellent series on minimizing code bloat. While the techniques he describes are about eliminating redundant generated and compiled code rather than actual lines of source, the mindset is the same when removing real code. Aside from the important compile time, link time, and executable size benefits, there is another benefit to removing as much code as you possibly can -- the code will occupy less "head space."

Unused or dead code makes it that much harder to do code archeology. Dead code certainly can make it more difficult to make lower level changes to the engine or architecture, as it is one more support burden and implementation difficulty. In the past, removing large legacy systems (whether written internally or externally) has had unexpected benefits in simplifying the overall architecture -- often there are lower level features that only exist to support that one dormant system.

One of my favorite things to do is delete a lot of code without the tools or game losing any functionality. It's not only cleaning out the actual lines of code, but freeing the corresponding head space that is a wonderful feeling -- "I will never have to think about X again." With the scale of the code bases we deal with today, we don't have the brain power to spare on things we don't need.

Monday, September 21, 2009

Rogue Programming

Gamasutra had an interesting article today titled Gaming the System: How to Really Get Ahead in the Game Industry. I found it probably had more to say about the political dysfunction that can often accompany game development rather than a how-to on being successful. To put it another way: if you find yourself having to follow the sneakier guidelines in this article too much, then you might want to consider a change in where you work.

The programming section is titled "Just Do It" and does have some truth to it. One of my leads and I came up with the term "rogue programming" for what he describes, which was half-joke, half-serious. Here's a quote:

As a programmer, it's not uncommon to see problems that you think should be fixed, or to see an opportunity to improve some piece of code, or speed up a process that takes a lot of time. It's also not uncommon for your suggestion to be ignored, or dismissed with an "it's not broke, so let's not fix it" response...

What should you do? You should just do it -- on your own time.

This is advice which is fraught with a lot of risk, because here's a hard-earned lesson for you: you don't always know best. I know, I know, you're a superstar hotshot programmer, and you see something that is broken, so it must be fixed. Sure, it's not in the schedule, but it'll just take a few hours -- what's the harm? The code base will be so much better when you're done, or the artists and designers will have a feature they didn't have before. How can that not make the project better?

Let me give a cold-water splash of reality: when it is all said and done at the end of the project, you're going to ship with a lot of broken code. I'm not talking about obvious bugs in the shipped project, I just mean nasty, hack-filled, just-get-it-out-the-door brokenness in the code base, and some of that code will be code that you wrote. If this wasn't true, then a long-lived project like the Linux kernel wouldn't still have thousands of developers contributing to it -- obviously, there is still stuff that is "broken" and can be improved!

So in the big picture, a single section of brokenness is not going to make or break your project. Usually, there are bigger fish to fry on any given day, and it's best to fry them. Because if your project is cancelled because a major feature was late, will it matter that you cleaned up the way you calculated checksums for identifiers?

That said, if after all of this, you still think something is worth doing, let me tell you how to successfully rogue program:

First, and most importantly, let's define "on your own time." On your own time means you are hitting your scheduled work on schedule, and that will not change if you fix or implement this one thing. If you're behind on your scheduled work, then you really shouldn't be doing any rogue programming. Whether not impacting your schedule means you work a Saturday, do some exploration at home, or just have some slack in your schedule you'd like to exploit, if you don't complete the tasks you are supposed to be working on, you've done more damage than whatever improvement you're making on the side could offset.

Additionally, you need co-conspirators. These days, programming is a very collaborative process, and for the most part, the cowboy mentality is a dying thing. If you talk to your lead or other engineers about a problem ("Hey, X is pretty f'd up") and no one else agrees, or you can't make the case, then hey, maybe X really isn't that important! You really want to be working with a group of people you can convince with sound arguments that something is a problem, and a lot of the time a little discussion about a problem can turn into a scheduled task -- and no rogue programming.

Often you'll be faced with something that everybody agrees *should* be done, but there's no time to do it. In these cases, I've found with good leads (which I've been blessed with the last few years), you can get tacit approval to do something "on your own time." This often takes trust -- I wouldn't recommend going off and "just doing it" until you've earned that trust.

If you've gotten to this point, you're in pretty good shape to go off and do some "rogue programming" -- because at this point (and this is where the joke comes in), it really isn't rogue at all.

Now if you're at a company where you constantly feel like you need to "go around people" to "just do things," then maybe you really do need a change of venue, because that is not a healthy team. I happen to know someone who is hiring.

Thursday, August 27, 2009

Big O doesn't always matter

The other day I was optimizing a bit of debug code which verifies the integrity of objects in memory. The details aren't super-important, but the gist is it is a function which runs periodically and makes sure that objects that should have been garbage collected were indeed purged from memory.

I don't make a general habit of optimizing debug code, but this was a special case -- before, this process only ran in the debug build. Artists and designers run a "development" build, which is an optimized build that still includes assertions and many other development checks.

We recently ran into a bug that would have been detected much earlier if this process had been running in the development build. While programmers run the debug build almost exclusively, we tend to stick to simpler test levels. Trying to debug an issue on a quick-loading, small level is much easier than on a full-blown one.

The algorithm is pretty simple -- objects have somewhat of a tree structure, but for various reasons they only have parent links and not child links. For objects at the top-level of the tree, we know for sure whether they should be in memory or not. Objects at the bottom of the tree keep all their parents in memory if they are connected to the reference graph. So the debug check looks at each object and verifies that it is not parented (via an arbitrarily long parent chain) to an object which should have been purged.

First thing I did was measure how long the process was taking, and did some lower level profiling to get an idea of where time was spent. Most importantly, I also saw where I was running into cache misses.

The first pass of optimization: the original loop was doing a lot of work per object that was simply unnecessary. This was because it was using a generalized iterator with more functionality than this specific case needed -- for most operations, particularly at editor time, this extra overhead is not a big deal. Removing the extra work sped up the process, which now took about 90% of the original time.

I then tried some high-level optimizations. There were two things I tried. First, the inner loop linearly checked each high-level object against an unsorted array of objects we know should be purged; I replaced this with a hash table from our container library. Second, I realized that a memoizing approach should help here -- since I'm dealing with a tree, I could use a bit array to remember whether I've already processed a parent object and deemed it OK. This would let me cut off traversal of the parent chain, which should eliminate a lot of work. Or so I thought.

The new algorithm was faster, but not by much -- only 85% of the original running time. The additional complexity was not worth 5% of running time, so I went back to the simpler approach. This isn't unusual in optimization -- you can often try something you think will be a big help that turns out not to matter much. I've made mistakes in the past where I stuck with a more complicated implementation for a marginal gain -- it was not worth it, and it made other optimizations with potentially bigger impact harder to do.

As far as why the gain wasn't that much: the unsorted array was relatively small (a handful of elements), so a linear search was faster because it was simpler and had better cache behavior than the hash table implementation I was using. The tree structure of the objects was broad but not deep, so it's obvious in hindsight why memoization would not be a win.

Now, one thing that is nice to have is a decent container and algorithm library. I have that at my disposal, so implementing these two changes was a matter of minutes instead of hours. With that kind of arsenal, it is easy to try out algorithmic changes, even if they end up not working out.

At this point, I took another look at the cache behavior from my profiling tools. I tried something incredibly simple -- prefetching the next object into the cache while I was processing the current. This resulted in the process now running at 50% of the time of the original -- a 2X speedup, and likely fast enough for me to enable this in the development build. I'm going to measure again, and see if there are any other easy wins like this to be had.

The processors we use are fast -- incredibly fast, and even with branch penalties on the in-order processors of today's consoles, they can still do a lot of work in the time it takes to retrieve data from main memory. So while on paper I'm using "slow" algorithms with worse big-O complexity, in practice memory access patterns can easily drown out any extra computation. The key, as always, is to measure and test your theories, and not just assume that any given approach will make something faster.

Monday, August 24, 2009

Stencil states for rendering light volumes

In the ShaderX 7 article "Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer", the author describes a number of approaches for rendering the lights into the lighting buffer. These are all pretty standard approaches for any deferred technique, but I thought the description of using stencil does not explain how to set up the stencil states very clearly. This was probably due to space constraints.

The way it is worded implies that you still need to change the depth comparison function. This is not the case, and is most of the point of the technique. As the article points out, changing the depth test makes many GPUs take their early-Z rejection and go home.

I'm sure you can find this detail elsewhere on the net, but my cursory searches did not find anything, and hopefully this will save at least one person some time. Standard caveats apply: I haven't extensively tested this stuff.

Assuming convex light volumes, this is what I found worked well:


// Render backfaces so that only pixels in front of the backface have stencil incremented
AlphaBlendEnable = false
StencilEnable = true
ColorWriteChannels = None
CullMode = Clockwise
DepthBufferEnable = true
StencilFunction = Always
StencilPass = Keep
// If a pixel is in front of the volume backface, then we want it lit
StencilDepthBufferFail = Increment

// render volume

// Render frontfaces so that any pixels behind them have stencil decremented
CullMode = CounterClockwise
// Pass the stencil test if reference value < buffer, so we only process pixels marked above.
// The reference value is 0. This is not strictly necessary, but is an optimization.
StencilFunction = Less
// If a pixel is in front of the volume frontface, then it is not inside the volume
StencilDepthBufferFail = Decrement

// render volume

AlphaBlendEnable = true
ColorWriteChannels = RGB
// Only process pixels with 0 < buffer
StencilFunction = Less
// Zero out passing pixels so we don't need a separate stencil clear for the next volume
StencilPass = Zero
// Don't want to do anything if we fail the depth test
StencilDepthBufferFail = Keep

// render a screen-space rectangle scissored to the projection of the light volume

Note that unlike shadow volumes, the light volume intersecting the near plane is not a concern here. We are rendering the frontfaces to find pixels that are in front of the light volume -- if parts of the light volume are in front of the near plane, by definition any pixels we're rendering are in back of those parts. So there is no need to render a cap in this case.

The light volume intersecting the far plane is a concern. One way to handle this case is to use a projection matrix with an infinite far plane, like shadow volumes do. Another way to handle it would be to detect this case and not use the stencil approach at all, instead rendering a screen space rectangle scissored to the light bounds.
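For reference, the infinite-far-plane matrix is just the limit of the standard perspective projection as the far distance goes to infinity. A sketch assuming a row-major, left-handed, D3D-style [0,1] depth convention (those conventions are an assumption, not something the article specifies):

```cpp
#include <cassert>
#include <cmath>

struct Mat4 { float m[4][4]; };

// Standard D3D-style perspective has m[2][2] = zf/(zf-zn) and
// m[3][2] = -zn*zf/(zf-zn); taking zf -> infinity gives 1 and -zn.
Mat4 PerspectiveInfiniteFar(float fovY, float aspect, float zn) {
    float ys = 1.0f / std::tan(fovY * 0.5f);
    Mat4 r = {};
    r.m[0][0] = ys / aspect;
    r.m[1][1] = ys;
    r.m[2][2] = 1.0f;      // limit of zf/(zf-zn)
    r.m[2][3] = 1.0f;      // w' = z
    r.m[3][2] = -zn;       // limit of -zn*zf/(zf-zn)
    return r;
}

// Post-divide depth for a view-space z: (z - zn) / z = 1 - zn/z.
float ProjectedDepth(const Mat4& p, float z) {
    return (p.m[2][2] * z + p.m[3][2]) / (p.m[2][3] * z);
}
```

The near plane still maps to depth 0, and depth approaches but never reaches 1 as z grows, so far-plane clipping can never reject the volume's geometry.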

Finally, I've had better luck switching to rendering the backfaces without depth testing when the camera is inside the light volume, instead of using a screen space rectangle. But I think this has more to do with a bug in my scissoring code than with any fundamental problem!