
Friday, January 9, 2009

Joe Duffy on CAS operations and performance

Interesting results in a blog entry from Joe Duffy about the scalability of compare-and-swap instructions across x86 architectures. I found this via the gdalgorithms list. The money paragraph:

The most common occurrence of a CAS is upon lock entrance and exit. Although a lock can be built with a single CAS operation, CLR monitors use two (one for Enter and another for Exit). Lock-free algorithms often use CAS in place of locks, but due to memory reordering such algorithms often need explicit fences that are typically encoded as CAS instructions. Although locks are evil, most good developers know to keep lock hold times small. As a result, one of the nastiest impediments to performance and scaling has nothing to do with locks at all; it has to do with the number, frequency, and locality of CAS operations.

One point raised on the recent gdalgorithms thread is that your only choice is not between lock-free algorithms and heavyweight operating system lock primitives. Some operating systems provide lighter-weight synchronization primitives (critical sections with a spin count on Win32 come to mind). In the past, on some consoles, I've come up with faster mutex implementations than what the operating system provides, although this is not something I'd recommend to the faint of heart -- it is unbelievably easy to get this stuff wrong, even for very experienced developers.
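As a rough sketch of the kind of lighter-weight primitive I mean: a lock that spins on a single compare-and-swap for a while before falling back to yielding the thread, in the spirit of a Win32 critical section with a spin count. The class name and the spin count are made up, and C++11 std::atomic is used here purely as a portable way to write the CAS:

    #include <atomic>
    #include <thread>

    class SpinCountLock {
    public:
        void lock() {
            // Fast path: try to take the lock with a single CAS for a while.
            for (int spins = 0; spins < kSpinCount; ++spins) {
                bool expected = false;
                if (m_held.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire))
                    return;
            }
            // Contended path: stop burning the core and let the OS run something else.
            bool expected = false;
            while (!m_held.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire)) {
                expected = false;
                std::this_thread::yield();
            }
        }

        void unlock() { m_held.store(false, std::memory_order_release); }

    private:
        static constexpr int kSpinCount = 4000;  // assumed tuning value; measure for your title
        std::atomic<bool> m_held{false};
    };

A real console implementation would fall back to an OS event or semaphore rather than yield in a loop -- and that fallback is exactly the part that is easy to get wrong.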

I think the lesson from this entry is to always measure. For a specific application a fancy lock-free algorithm may be faster, but maybe not -- as we see above, it depends on a lot of factors. The other lesson is to avoid premature optimization -- get your algorithm correct and race-free using locks first, and then evaluate whether a lock-free algorithm is appropriate and actually faster.
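"Always measure" is cheap advice, so here is the shape of a measurement harness -- not from Duffy's post, just a sketch. It compares a mutex-protected counter against an atomic counter under contention; the thread and iteration counts are arbitrary assumptions, and on real hardware the winner can flip with core count, contention, and false sharing:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Run the same worker body on several threads and return elapsed milliseconds.
    template <typename Fn>
    static double timeWorkers(int numThreads, Fn body) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int i = 0; i < numThreads; ++i)
            workers.emplace_back(body);
        for (auto& t : workers)
            t.join();
        std::chrono::duration<double, std::milli> ms =
            std::chrono::steady_clock::now() - start;
        return ms.count();
    }

    int main() {
        const int kThreads = 4;              // assumption: tune to your target hardware
        const int kItersPerThread = 1 << 20;

        long long lockedCounter = 0;
        std::mutex counterMutex;
        double lockedMs = timeWorkers(kThreads, [&] {
            for (int i = 0; i < kItersPerThread; ++i) {
                std::lock_guard<std::mutex> hold(counterMutex);
                ++lockedCounter;
            }
        });

        std::atomic<long long> atomicCounter{0};
        double atomicMs = timeWorkers(kThreads, [&] {
            for (int i = 0; i < kItersPerThread; ++i)
                atomicCounter.fetch_add(1, std::memory_order_relaxed);
        });

        std::printf("mutex: %.1f ms, atomic: %.1f ms\n", lockedMs, atomicMs);
        return 0;
    }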

Saturday, January 3, 2009

Good Middleware revisited

I was having a conversation with a good friend about the Good Middleware entry, and he made a very interesting point. He pointed out that in the future, good middleware is going to allow you to hook its multicore job scheduling just like you can hook its memory allocation.

Let's start out with a quick review of why we hook memory allocation, as it will help us understand why we're going to want to hook job scheduling.
  1. Consoles are fixed-memory environments. It is critical to balance overall memory utilization for your title. If your allocator frees up five more megabytes for game data than the built-in system allocator does, you can have more stuff in your game, which generally leads to a better game.
  2. Different titles have different memory needs. A 2D side-scroller is going to have different memory requirements than an open-world game. Some types of games lend themselves to pre-allocation strategies, where you know ahead of time how many assets of each type you are going to have. Every game I've worked on has spent time tuning the memory allocator to squeeze out as much memory as possible, and what works in one title may not be applicable to another.
  3. The system allocator may perform badly, particularly when accessed from multiple threads. Console operating systems often ship allocator implementations with a lot of overhead. Sometimes they use allocators designed for a much wider range of applications than games, or for systems with virtual memory. Others explicitly punt on anything but large, bulk allocations and expect you to handle smaller allocations on your own. Finally, no matter how good a system allocator may be, it has no knowledge of your application's memory needs or usage patterns, which are exactly the things you can exploit to optimize your own.
  4. The system allocator may not provide good leak detection or heap debugging tools, or any at all. For console games, I know of no standalone leak-detection software such as BoundsChecker. Often you have to provide these facilities yourself.
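Concretely, "hooking memory allocation" usually means the middleware routes every allocation through a pair of callbacks you install at startup. The sketch below is hypothetical -- AllocHooks, MiddlewareSetAllocator, and the rest are made-up names, since every vendor has its own flavor of this -- but the shape is representative:

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    struct AllocHooks {
        void* (*alloc)(std::size_t size, std::size_t align, void* user);
        void  (*free)(void* ptr, void* user);
        void* user;  // opaque pointer back to the game's heap/tracker
    };

    // --- "middleware" side: all of its allocations go through the installed hooks ---
    static AllocHooks g_hooks;
    void MiddlewareSetAllocator(const AllocHooks& hooks) { g_hooks = hooks; }
    void* MiddlewareAlloc(std::size_t n) { return g_hooks.alloc(n, 16, g_hooks.user); }
    void  MiddlewareFree(void* p)        { g_hooks.free(p, g_hooks.user); }

    // --- game side: route into the title's own heap (trivially stubbed out here) ---
    static void* GameAlloc(std::size_t n, std::size_t /*align*/, void* /*user*/) {
        std::printf("game heap: %zu bytes for middleware\n", n);  // tagging/leak tracking goes here
        return std::malloc(n);
    }
    static void GameFree(void* p, void* /*user*/) { std::free(p); }

    int main() {
        AllocHooks hooks = { GameAlloc, GameFree, nullptr };
        MiddlewareSetAllocator(hooks);
        void* p = MiddlewareAlloc(256);
        MiddlewareFree(p);
        return 0;
    }

The user pointer is what lets you route middleware allocations into a dedicated, tagged heap so they show up in your budgets and leak reports.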

CPU core usage can be thought of as a system resource that is arbitrated just like memory. Most middleware I've seen so far that spreads jobs across multiple cores has its own internal scheduler running on top of a couple of hardware threads. In the early days of multicore middleware, you couldn't even set the CPU affinity of these vendor threads without recompiling the library, something that is key on the 360. Increasingly, I think games are going to want to own their job scheduling rather than leave it up to the operating system to arbitrate CPU scheduling at the thread level.
  1. Consoles have a fixed number of cores. Goes without saying but worth touching on. You have a fixed amount of CPU processing you can do in a 30 Hz/60 Hz frame. Ideally you want to maximize the amount of CPU time spent on real game work and minimize the amount spent on bookkeeping/scheduling.
  2. You know more than the operating system does about your application. One approach to scheduling jobs would be to create a thread per job and let them go. You could do that, but you would then watch your performance tank. Forget the fact that creating/destroying operating system threads is not cheap, and focus on the cost of context switches. Thread context switches are very expensive, and with CPU architectures adding more and more registers, they are not going to get any cheaper. A lot of game work actually consists of small tasks that are more amenable to an architecture where a pool of persistent threads sits and executes these jobs one after another (see the sketch after this list). In this architecture, you do not have any thread context switches between jobs or while jobs are running, which means you are spending more time doing real game work and less time on scheduling.
  3. It is difficult to balance between competing job schedulers. Imagine a world with 16 to 32 cores, where both your middleware and your game need to schedule numerous jobs to a thread pool. It is going to be very difficult to avoid stalls or idle time because you have two job schedulers in your system -- one from the middleware and one from your game. Either you are "doubling up" thread allocation on cores and paying the cost of context switches from one thread pool to the next, or you allocate each thread pool to a different set of cores and hope that the work balances out. Unfortunately, I don't think the latter is going to scale very well, and you will end up with a lot of unused CPU capacity.
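Here is the worker-loop pattern from point 2 above, as a minimal sketch. The names are mine, and std::thread plus a locked queue are used purely to keep it portable and short; a console title would pin one worker per core and would likely want a lock-free or per-core queue:

    #include <condition_variable>
    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class JobPool {
    public:
        explicit JobPool(unsigned workerCount) {
            for (unsigned i = 0; i < workerCount; ++i)
                m_workers.emplace_back([this] { workerLoop(); });
        }
        ~JobPool() {
            {
                std::lock_guard<std::mutex> hold(m_mutex);
                m_quit = true;
            }
            m_wake.notify_all();
            for (auto& t : m_workers) t.join();  // workers drain remaining jobs, then exit
        }
        void submit(std::function<void()> job) {
            {
                std::lock_guard<std::mutex> hold(m_mutex);
                m_jobs.push(std::move(job));
            }
            m_wake.notify_one();
        }

    private:
        void workerLoop() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> hold(m_mutex);
                    m_wake.wait(hold, [this] { return m_quit || !m_jobs.empty(); });
                    if (m_quit && m_jobs.empty()) return;
                    job = std::move(m_jobs.front());
                    m_jobs.pop();
                }
                job();  // no thread context switch between consecutive jobs on this worker
            }
        }

        std::vector<std::thread> m_workers;
        std::queue<std::function<void()>> m_jobs;
        std::mutex m_mutex;
        std::condition_variable m_wake;
        bool m_quit = false;
    };

    int main() {
        JobPool pool(4);  // assumption: roughly one worker per available core
        for (int i = 0; i < 8; ++i)
            pool.submit([i] { std::printf("job %d ran\n", i); });
        return 0;
    }

The workers are created once and never destroyed; a job is just a function call on an already-running thread, so there is no per-job thread creation and no context switch between jobs.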
Of those, the third reason is the clincher as to why good middleware in the future will allow you to replace its job scheduler with your own. It's the only solution I can think of that will allow you to balance workloads across many, many cores. It is possible that in the future operating systems will provide good enough thread pool/microjob APIs that we won't have to write our own, but even in that case, you'd still want the hooks for debugging or profiling purposes.
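And just to make "replace its job scheduler with your own" concrete, the hook could be as small as the sketch below. Everything here is hypothetical -- I haven't seen an SDK that exposes exactly this -- but it mirrors the allocator hooks above: the middleware describes its jobs, and your scheduler decides where and when they run.

    // Hypothetical job-scheduler hooks a game could install into middleware.
    typedef void (*JobFunction)(void* data);

    struct JobSchedulerHooks {
        // Middleware submits every job through this; the game's scheduler
        // decides which core/worker runs it and when.
        void (*submit)(JobFunction fn, void* data, int priority, void* user);
        // Block until all jobs previously submitted by this middleware are done.
        void (*waitAll)(void* user);
        void* user;  // opaque pointer to the game's scheduler instance
    };

    // Hypothetical middleware entry point, called once at startup:
    // void MiddlewareSetJobScheduler(const JobSchedulerHooks& hooks);

With something like this, the engine owns a single pool of workers sized to the machine, and middleware jobs are just more entries in the same queues -- one scheduler making the balancing decisions instead of two fighting each other.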

I haven't yet seen any middleware that really allows you to hook this stuff easily. Have you?



Friday, December 26, 2008

Magical missteps and the memory wall

Two links today: an IEEE Spectrum article on a Sandia National Labs study of multicore scaling, and the Magical Wasteland article on the PS3.

I've spent a lot of time thinking about where multicore hardware is going and how that is going to affect software architectures. My current thinking is that designing software for UMA (uniform memory access) is not going to scale. This is hardly a revolutionary statement.

Let's start with the IEEE Spectrum article first. It details a study by Sandia National Labs showing that the performance of general-purpose multicore processors really starts to deteriorate for certain applications beyond 8-16 cores, due to the memory wall. The "memory wall" is the fact that while our ability to cram transistors onto a die keeps increasing, memory bandwidth grows at a piddling rate. Eventually, you've got a large number of cores starving for data.

Now, an important caveat is that these results only apply to certain types of workloads. For instance, simulating a nuclear bomb does not hit the memory wall, because the data and calculations can be partitioned into small amounts of memory corresponding to small spatial regions. You don't use much memory bandwidth because all of your data is in cache, and you are happy.

What the study says is that for certain types of applications -- ones that traverse massive amounts of data and cannot be sufficiently partitioned -- performance goes out the window. It then goes on to talk about fancy hardware (stacked memory) that may fix the problem. That hardware is far enough in the future that I'm not going to bank on it.

The challenge for game programmers with future multicore designs will be to architect our systems and engines around well-partitioned, locally coherent memory access. This will only become more important as the number of cores increases. Right now, with the 360 having only three cores (six hardware threads) and consumer PCs only having four cores, memory bottlenecks, while troublesome, are not at the severity they will be with 16, 32, or 64 cores.

Which brings me back to why designing for UMA is troublesome. UMA makes the promise that you can access any memory you need at any time -- which is true, but it doesn't make any promises about how fast that access will be. Cache logic is not going to be sufficient to hide the latencies of poorly designed algorithms and systems. Obviously, this has been true for years even on single-core designs, but the point is that the problem is about to get many times worse.

If you design algorithms such that they can run on processors with a small local memory, they will also run well on UMA designs. The reverse is not true. Local-memory designs force you to make good decisions about memory access, and they also prepare you for architectures that are not UMA -- which, for all we know right now, the next crop of consoles may very well be.
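To make that concrete, here is the shape of a loop written for a small local memory. The sizes and the Particle type are made-up assumptions; on Cell the memcpy calls would be DMAs into and out of local store, and on a UMA machine the same structure simply keeps the working set small and cache-resident:

    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct Particle { float x, y, z, vx, vy, vz; };

    constexpr std::size_t kChunkBytes = 16 * 1024;   // assumed local-memory budget
    constexpr std::size_t kChunkCount = kChunkBytes / sizeof(Particle);

    void integrate(std::vector<Particle>& particles, float dt) {
        Particle local[kChunkCount];  // everything the inner loop touches lives here

        for (std::size_t base = 0; base < particles.size(); base += kChunkCount) {
            std::size_t n = std::min(particles.size() - base, kChunkCount);

            // Bring the chunk in (a DMA on a local-store machine, a copy that
            // stays in cache on a UMA machine).
            std::memcpy(local, &particles[base], n * sizeof(Particle));

            for (std::size_t i = 0; i < n; ++i) {
                local[i].x += local[i].vx * dt;
                local[i].y += local[i].vy * dt;
                local[i].z += local[i].vz * dt;
            }

            // Write the chunk back out.
            std::memcpy(&particles[base], local, n * sizeof(Particle));
        }
    }

The win isn't the copy itself; it's that the algorithm is forced to declare exactly what data each piece of work needs, and that property survives no matter what the memory system underneath looks like.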

So what does all of this have to do with the Magical Wasteland article on the PS3? Due to NDAs, I won't comment on the article beyond saying that I think its characterization of the PS3 is fair. But I will say this: it may very well be the case that five or ten years from now we realize the Cell architecture was ahead of its time. Maybe not in the specific implementation, but in keeping alive the idea that, hey, maybe touching every cache line in the system in a garbage collection algorithm isn't the best of ideas. Because sooner or later, we'll all be programming with a local memory model, even if we don't actually have local memory.