The other day I was optimizing a bit of debug code which verifies the integrity of objects in memory. The details aren't super-important, but the gist is it is a function which runs periodically and makes sure that objects that should have been garbage collected were indeed purged from memory.
I don't make a general habit of optimizing debug code, but this was a special case -- before, this process only ran in the debug build. Artists and designers run a "development" build, which is an optimized build that still includes assertions and many other development checks.
We recently ran into a bug that would have been detected much earlier if this process had been running in the development build. While programmers run the debug build almost exclusively, we tend to stick to simpler test levels. Trying to debug an issue on a quick loading, small level is much easier than on a full-blown one.
The algorithm is pretty simple -- objects have somewhat of a tree structure, but for various reasons they only have parent links and not child links. For objects at the top-level of the tree, we know for sure whether they should be in memory or not. Objects at the bottom of the tree keep all their parents in memory if they are connected to the reference graph. So the debug check looks at each object and verifies that it is not parented (via an arbitrarily long parent chain) to an object which should have been purged.
First thing I did was measure how long the process was taking, and did some lower level profiling to get an idea of where time was spent. Most importantly, I also saw where I was running into cache misses.
The first pass of optimization -- the original loop was doing a lot of work per-object that was simply unnecessary. This was because it was using a generalized iterator that had more functionality than needed for this specific case -- for most operations, particularly at editor time, this extra overhead is not a big deal. Removing this extra work sped up the process and it was now took about 90% of the time of the original.
I then tried some high-level optimizations. There were two things I tried - one, the inner loop linearly checked each high-level object against an unsorted array of objects we know should be purged. I replaced this with a hash table from our container library. Finally, I realized that a memoizing approach should help here -- since I'm dealing with a tree, I could use a bit array to remember if I've already processed a parent object and deemed it OK. This would allow me to cut off traversal of the parent chain, which should eliminate a lot of work. Or so I thought.
The new algorithm was faster, but not by much - only 85% of the original running time. The additional complexity was not worth 5% of running time, so I went back to the simpler approach. This isn't unusual in optimization -- you often can try something you think will be a big help but turns out not to matter much. I've made mistakes in the past where I stuck with the more complicated implementation for a marginal gain -- but it was not worth it, and it made other optimizations that may have bigger impact harder to do.
As far as why the gain wasn't that much: The unsorted array was relatively small (a handful of elements), so a linear search was faster because it was simpler and had better cache behavior than the hash table implementation I was using. The tree structure of the objects was broad but not deep, so its obvious in hindsight why memoization would not be a win.
Now, one thing that is nice to have is a decent container and algorithm library. I have that at my disposal, so implementing these two changes was a matter of minutes instead of hours. With that kind of arsenal, it is easy to try out algorithmic changes, even if they end up not working out.
At this point, I took another look at the cache behavior from my profiling tools. I tried something incredibly simple -- prefetching the next object into the cache while I was processing the current. This resulted in the process now running at 50% of the time of the original -- a 2X speedup, and likely fast enough for me to enable this in the development build. I'm going to measure again, and see if there are any other easy wins like this to be had.
The processors we use are fast -- incredibly fast, and even with branch penalties on the in-order processors of today's consoles, they can still do a lot of work in the time it takes to retrieve data from main memory. So while on paper, I'm using "slow" algorithms with worse O(n) times, in practice, your memory access patterns can easily drown out any extra calculation. The key, as always, is to measure and test your theories, and not just assume that any given approach will make something faster.