The most common occurrence of a CAS is upon lock entrance and exit. Although a lock can be built with a single CAS operation, CLR monitors use two (one for Enter and another for Exit). Lock-free algorithms often use CAS in place of locks, but due to memory reordering such algorithms often need explicit fences that are typically encoded as CAS instructions. Although locks are evil, most good developers know to keep lock hold times small. As a result, one of the nastiest impediments to performance and scaling has nothing to do with locks at all; it has to do with the number, frequency, and locality of CAS operations.
One thing that has been talked about on the recent gdalgorithms thread is understanding that your only options are not between lock-free and heavy operating system lock primitives. Some operating systems provide lighter weight synchronization primitives (critical sections on Win32 with a spincount come to mind). In the past on some consoles I've come up with faster mutex implementations than what the operating system provides, although this is not something I'd recommend to the weak of heart -- it is unbelievably easy to get this stuff wrong, even for very experienced developers.
I think the lesson from this entry is to always measure. Maybe for a specific application a fancy lock-free algorithm may be faster, but maybe not -- as we see from above it can depend on a lot of factors. The other thing to avoid is premature optimization -- get your algorithm correct and without races first using locks, and then evaluate whether a lock-free algorithm is appropriate and faster.