This is clearly less than ideal.
Recent Radeons all support dynamic clock gating, a technology where the clocks to various bits of the chip are turned off when not in use. Unfortunately it seems that this is generally already enabled by the BIOS on most hardware, so playing with that didn't give me any power savings. Next I looked at Powerplay, the AMD technology for reducing clocks and voltages. It turns out that my desktop hardware doesn't provide any Powerplay tables, so no joy there either. What next?
Radeons all carry a ROM containing a bunch of tables and scripts written in a straightforward bytecode language called Atom. The idea is that OS-specific drivers can call the Atom tables to perform tasks that are hardware dependent, even without knowledge of the specific low-level nature of the hardware they're driving. You can use Atom to do several things, from card initialisation through mode setting to (crucially) setting the clock frequencies. Jerome Glisse wrote a small utility called Atomtools that lets you execute Atom scripts and set the core and RAM frequencies. Playing with this showed that it was possible to save the best part of 5W by underclocking the graphics core, and about the same again by reducing the memory clock. A total saving of 9-10W was pretty significant.
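For the curious, a ROM image can be sanity-checked for an Atom BIOS fairly easily. The offsets below follow the layout used by the Linux radeon driver (PCI option ROM signature 0x55 0xAA, a 16-bit pointer at offset 0x48 locating the Atom ROM header, and the "ATOM" firmware signature 4 bytes into that header) — treat them as assumptions rather than gospel, and the image here is synthetic:

```python
# Minimal check for an AtomBIOS image. Offsets are assumptions based on
# the layout the Linux radeon driver uses, not taken from this article.

import struct

def is_atombios(rom):
    # PCI option ROMs begin with the 0x55 0xAA signature.
    if len(rom) < 0x4a or rom[0:2] != b"\x55\xaa":
        return False
    # 16-bit pointer at 0x48 gives the Atom ROM header offset.
    hdr = struct.unpack_from("<H", rom, 0x48)[0]
    # The "ATOM" firmware signature sits 4 bytes into the header.
    return rom[hdr + 4:hdr + 8] == b"ATOM"

# Synthetic 256-byte image, just to exercise the check.
fake = bytearray(256)
fake[0:2] = b"\x55\xaa"
struct.pack_into("<H", fake, 0x48, 0x80)   # header at 0x80
fake[0x84:0x88] = b"ATOM"
print(is_atombios(bytes(fake)))  # True
```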
The main problem with reducing the memory clock was that doing it while the screen is being scanned out results in memory corruption, showing up as big ugly graphical artifacts on the screen. I'm a fan of doing power management as aggressively as possible, which means reclocking the memory whenever the system is idle. Turning the screen off to reclock the memory would avoid the graphical corruption but introduce irritating flicker, so that wasn't really an option. The next plan was to synchronise the memory reclocking to the vertical refresh interval, the period of time between the bottom of a frame and the top of the next frame being drawn. Unfortunately setting the memory frequency took somewhere between 2 and 20 milliseconds, far too long to finish inside that time period.
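To see why 2-20 milliseconds is hopeless, it's worth doing the arithmetic on how long the blanking interval actually is. The mode timings below are illustrative (a 1080p-ish mode with 1125 total scanlines, 1080 of them visible, at 60Hz), not measured from any particular card:

```python
# Rough vblank budget calculation. The timing numbers are illustrative
# assumptions for a typical mode, not read from real hardware.

def vblank_ms(total_lines, active_lines, refresh_hz):
    """Duration of the vertical blanking interval in milliseconds."""
    frame_ms = 1000.0 / refresh_hz
    return frame_ms * (total_lines - active_lines) / total_lines

budget = vblank_ms(1125, 1080, 60)
print("vblank budget: %.2f ms" % budget)   # well under a millisecond

# Atom took 2-20 ms to set the memory frequency -- nowhere near fitting.
print("fits in vblank:", budget >= 2.0)    # False
```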
So. Just using Atom was clearly not going to be possible. The next step was to try writing the registers directly. Looking at the R500 register documentation showed that the MPLL_FUNC_CNTL register contained the PLL dividers for the memory clock. Simply smacking a new value in here would allow changing the frequency of the memory clock with a single register write. It even worked. Almost. I could change the frequency within small ranges, but going any further resulted in increasingly severe graphical corruption. Unlike the sort I got with the Atom approach to changing the frequency, this corruption manifested itself as a range of effects from shimmering on the screen down to blocks of image gradually disappearing in an impressively trippy (though somewhat disturbing) way.
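The arithmetic behind that single register write looks something like the following. The divider model (output = reference × feedback ÷ (reference divider × post divider)) is the usual shape for these PLLs; the actual MPLL_FUNC_CNTL bit layout lives in the R500 register documentation and isn't reproduced here, and the divider values below are made up:

```python
# Sketch of the PLL arithmetic behind a memory clock change. The divider
# model is the conventional one for PLLs of this era; the specific
# values are illustrative assumptions.

def pll_out_mhz(ref_mhz, feedback_div, ref_div, post_div):
    return ref_mhz * feedback_div / (ref_div * post_div)

# A 27 MHz reference crystal is common; the dividers are made up.
full = pll_out_mhz(27.0, 70, 2, 1)   # 945.0 MHz memory clock
half = pll_out_mhz(27.0, 70, 2, 2)   # doubling post_div halves the clock
print(full, half)
```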
Next step was to perform a register dump before and after changing the frequencies via Atom, and compare them to the registers I was programming. MC_ARB_RATIO_CLK_SEQ was consistently different, which is where things got interesting. The AMD docs helpfully describe this register as "Magic field, please use the excel programming guide. Sets the hclk/sclk ratio in the arbiter", about as helpful as being told that the register contents are defined by careful examination of a series of butterflies kept somewhere in Taiwan. Now what?
Back to Atomtools. Enabling debugging let me watch a dump of the Atom script as it ran. The relevant part of the dump is here. The most significant point was:

MOVE_REG @ 0xBC09 src: ID[0x0000+B39E].[31:0] -> 0xFF7FFF7F dst: REG[0xFE16].[31:0] <- 0xFF7FFF7F

showing that the value in question was being read out of a table in the video BIOS (ID[0x0000+B39E] indicating the base of the ROM plus 0xB39E). Looking further back showed that WS[0x40] contained a number that was used as an index into the table. Grepping the header files gave 0x40 as ATOM_WS_QUOTIENT, containing the quotient of a division operation immediately beforehand. Working back from there showed that the value was derived from a formula involving the divider frequencies of the memory PLL and the source PLL. Reimplementing that was trivial, and now I could program the same register values. Hurrah!
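In rough outline, the reimplemented lookup amounts to something like this. To be clear, everything here is a hypothetical reconstruction: the real formula and the table contents vary per BIOS, and the values below are stand-ins rather than anything read from hardware:

```python
# Hypothetical sketch of the MC_ARB_RATIO_CLK_SEQ lookup: the script
# divides one PLL frequency by the other, the quotient lands in the
# workspace register ATOM_WS_QUOTIENT (WS[0x40]), and that quotient
# indexes a table of register values in the ROM. Table contents and
# the exact formula here are made up for illustration.

ratio_table = [0xFF7FFF7F, 0xFF6FFF6F, 0xFF5FFF5F, 0xFF4FFF4F]  # made up

def arb_ratio(mpll_khz, spll_khz, table):
    quotient = mpll_khz // spll_khz       # what ends up in WS[0x40]
    quotient = min(quotient, len(table) - 1)
    return table[quotient]

print(hex(arb_ratio(400000, 200000, ratio_table)))
```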
It didn't work, of course. These things never do. It looked like modifying this value didn't actually do anything unless the memory controller was reinitialised. Looking through the Atom dump showed that this was achieved by calling the MemoryDeviceInit script. Reimplementing this from scratch was one option, but it had a bunch of branches and frankly I'm lazy and that's why I work on this Linux stuff rather than getting a proper job. This particular script was fast, so there was no real reason to do it by hand instead of just using the interpreter. Timing showed that it could easily complete within the vblank interval. This time, it even worked.
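Putting the pieces together, the full reclock is really about ordering. The sketch below shows one plausible sequence with stub functions standing in for the DRM internals; it isn't the actual driver code, just the sequencing the article describes:

```python
# Order-of-operations sketch of a vblank-synced memory reclock. The
# helpers are stubs standing in for DRM internals; only the ordering
# is meaningful, and it is an assumption rather than the actual code.

log = []
def wait_for_vblank():            log.append("vblank")
def write_mpll_dividers():        log.append("mpll")
def write_mc_arb_ratio():         log.append("arb")
def run_atom_MemoryDeviceInit():  log.append("mem_init")

def reclock_memory():
    wait_for_vblank()            # start inside the blanking interval
    write_mpll_dividers()        # new memory PLL dividers
    write_mc_arb_ratio()         # recomputed arbiter ratio
    run_atom_MemoryDeviceInit()  # reinitialise the memory controller
                                 # via the Atom interpreter

reclock_memory()
print(log)
```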
I've done a proof of concept that involved wedging this into the Radeon DRM code with extreme prejudice, but it needs some rework. However, it demonstrates that it's possible to downclock the memory whenever the screen is idle without there being any observable screen flicker. Combine that with GPU downclocking and we can save about 10W without any noticeable degradation in performance or output. Victory!
I gave the code to someone with an X1300 and it promptly corrupted their screen and locked their machine up. Oh well. Turns out that they have a different memory controller or some such madness.
So, obviously, there's more work to be done on this. I've put some test code here. It's a small program that should be run as root. It should reprogram an Atom-based discrete graphics card to half its memory clock. Running it again will halve it again. I don't recommend doing that. You'll need to reboot to get the full clock back. This isn't vblank synced, so it may introduce some graphical corruption. If the corruption is static (ie, isn't moving or flickering) then that's fine. If it's moving then I (and/or the docs) suck and there's still work to be done. If your machine hangs then I'm interested in knowing what hardware you have and may have some further debugging code to be run. Unless you have an X1300, in which case it's known to break and what were you thinking running this code you crazy mad fool.
Once this is stable it shouldn't take long to integrate it into the DRM and X layers. I'm also trying to get hold of some mobile AMD hardware to test what impact we can have on laptops.
Shockingly enough, it's somewhat harder to underclock graphics memory on a shared-memory system.