There are a few things you need to know about gpus. The first is that if they're discrete devices they typically have their own memory controller and video memory. The second is that there's an impressive number of ways that you can end up touching that memory. The third is that they tend to get upset if something tries to touch that memory and the memory controller is in the middle of reclocking at the time.
The first and most obvious use of video memory is by the gpu itself. Accelerated operations on radeon are carried out by sending a command packet to the command processor. This is achieved by sharing a ring buffer between the cpu and the gpu, with the gpu reading packets out of that ring buffer and performing the operations contained within them. Many of these operations will touch video memory (that being the nature of most things you want a gpu to do), and if that happens in the middle of a reclock, bad things occur. Like the card locking up and potentially taking your PCI bus with it.
So, obviously, we don't want that to happen. The first thing we do is take a mutex that blocks any further accelerated operations from being submitted by userspace. Then we wait until we get an interrupt from the gpu telling us that the graphics engine has gone idle. The problem here is that we don't have a terribly good idea of how many more operations there are to complete and we don't know how long each of those operations is going to take, but this is less bad than some of the alternatives. Jerome Glisse has some ideas on how to improve this to require less waiting, but the effects should still be pretty much invisible to the average user.
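The "take a mutex, then wait for idle" step can be sketched as a user-space toy in Python. This is only a model of the idea, not driver code: `ToyCommandProcessor`, `reclock_when_idle` and the timeout are hypothetical names and values, and a thread finishing its work stands in for the idle interrupt.

```python
import threading

class ToyCommandProcessor:
    """Toy stand-in for the gpu command processor (hypothetical, not radeon code)."""

    def __init__(self):
        self.submit_lock = threading.Lock()  # held to block new submissions
        self.idle = threading.Event()        # set by the pretend "idle interrupt"
        self.idle.set()
        self._state_lock = threading.Lock()
        self._pending = 0

    def submit(self, work):
        # Userspace path: blocked for as long as a reclock holds submit_lock.
        with self.submit_lock:
            with self._state_lock:
                self._pending += 1
                self.idle.clear()
        threading.Thread(target=self._run, args=(work,)).start()

    def _run(self, work):
        work()
        with self._state_lock:
            self._pending -= 1
            if self._pending == 0:
                self.idle.set()  # models the interrupt saying the engine is idle

def reclock_when_idle(cp, do_reclock, timeout=1.0):
    """Block new submissions, wait for queued work to drain, then reclock."""
    with cp.submit_lock:
        if not cp.idle.wait(timeout):
            return False  # engine never went idle; give up rather than risk it
        do_reclock()
        return True
```

Note that the wait has no idea how much work is still queued, which mirrors the problem described above: all the reclock path can do is sit there until the idle signal arrives or the timeout expires.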
So we've stopped the command processor touching ram. Everything's good, right?
Well, not really. The obvious problem is that users typically want to display something, so there's a separate chunk of chip that's repeatedly copying video memory over to your monitor. That's got to go too. Thankfully, there's a convenient bit in the crtc registers that lets us turn that off, but the pretty unsurprising downside is that your screen goes blank while that's happening. So we don't want to do that. Instead, we try to perform the reclock while there's nothing being displayed on the screen - that is, while we're in the screen region where a crt's electron gun would be scanning back from the bottom of the screen to the top. It turns out that rather a lot of display assumptions depend on this happening even if there's no crt, no electron gun and no thick sheet of glass with a decent approximation of vacuum behind it, so we get to do this even if we're displaying to an LVDS. And we have about 400-500 microseconds to do it - an almost generous amount of time.
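As a rough check on the size of that window, the length of vblank follows from the mode timings: the number of blanking lines multiplied by the time it takes to scan one line. The sketch below works this out for a 1920x1200 reduced-blanking mode. The timing numbers are an illustrative assumption (the commonly quoted CVT reduced-blanking modeline for that resolution), not values taken from the driver, and the function name is made up.

```python
def vblank_microseconds(pixel_clock_hz, htotal, vdisplay, vtotal):
    """Length of the vertical blanking interval in microseconds."""
    line_time_s = htotal / pixel_clock_hz  # time to scan one full line
    blank_lines = vtotal - vdisplay        # lines that carry no picture
    return blank_lines * line_time_s * 1e6

# Assumed timings: 154 MHz pixel clock, 2080 pixels per line including
# blanking, 1200 visible lines out of 1235 total.
window = vblank_microseconds(154_000_000, 2080, 1200, 1235)
print(f"vblank lasts about {window:.0f} microseconds")
```

For these assumed timings the result lands a little under 500 microseconds. Modes with more generous blanking give a longer window, so the exact figure depends on what the panel is being driven at.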
So we ask the hardware to generate an interrupt when we enter vblank and then we reclock. Except the hardware has an irritating habit of lying - sometimes we get the interrupt a line or two before vblank, sometimes we get it after we've already gone out the other side. Vexing, and not entirely solved yet - so sometimes you'll still get a single blank frame during reclock. But there are plans, and they'll probably even work.
At this point the acceleration hardware isn't touching the memory and the scanout hardware isn't touching the memory. Except it still crashes under some workloads. This one took me longer to track down, but the answer turned out to be pretty straightforward. Not all operations are accelerated. When they're not accelerated they have to be done in software. That means that the CPU has to write to the video memory itself. I'm sure you can see where this is going. This was fixed without too much trouble once I'd finished picking through the driver to work out every location where objects might be mapped into the CPU's address space, at which point it's a simple matter of unmapping them and blocking the fault handler from remapping them until the reclock is finished. Linux, thankfully, has lots of synchronisation primitives. And now everything works.
Except when it doesn't. This took a final period of head scratching, followed by the discovery that ttm (the memory allocator used by radeon) has a background thread that would occasionally fire to clean up objects. And by clean up objects, I mean change their mapping - which means updating their status in the gart, which means touching video memory. So, let's block that. And that tipped me off to the fact that even if userspace couldn't submit new commands, the CPU could still create or destroy objects - with the same consequences.
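The common shape of these last few fixes (the fault handler, the ttm background thread, object creation and destruction) is a gate that every CPU-side path has to pass through before touching video memory. Here is a toy Python model of that shape. `VramGate` and its methods are hypothetical, and the real driver uses kernel synchronisation primitives rather than anything like this.

```python
import threading

class VramGate:
    """Toy gate: anything that touches video memory waits out a reclock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._users = 0           # paths currently touching video memory
        self._reclocking = False

    def enter(self):
        # Called by a fault handler, cleanup worker, object create/destroy...
        with self._cond:
            while self._reclocking:
                self._cond.wait()
            self._users += 1

    def leave(self):
        with self._cond:
            self._users -= 1
            self._cond.notify_all()

    def begin_reclock(self):
        # Stop new users, then wait for existing ones to finish.
        with self._cond:
            self._reclocking = True
            while self._users:
                self._cond.wait()

    def end_reclock(self):
        with self._cond:
            self._reclocking = False
            self._cond.notify_all()
```

The hard part in practice is not the gate itself but finding every path that needs to go through it, which is what the hunting described above was about.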
So, once all of these are blocked, video memory is quiescent and we can do what we want. And we do, at least once I'd sorted out the bits where I was taking locks in the wrong order and deadlocking. Depending on the powerplay tables available on your card we'll choose different rates and so your power savings will vary heavily depending on the values that your vendor provided, but the card I'm testing on sees a handy 30W drop at idle. Right now we're only changing clocks and not dropping voltage so there's potentially a little more to come.
While getting this stable was pretty miserable, the documented entry points for clock changing made a lot of this easier than it would otherwise have been. It's also probably worth noting that Intel's clock configuration registers are entirely missing from any public documentation and the driver Intel submitted to make them work in their latest chips appeared to have been deliberately obfuscated, so thanks to AMD for making all of this possible.
A note on waiting for the engine to go idle: it's possible to insert various command packets that either indicate when they've passed or stall until a register value gets updated, but these either cause awkward problems with the cp locking or mean that the gui idle functionality never goes idle, so they're not ideal either.