Matthew Garrett (mjg59) wrote,

Radeon reclocking

Alex released another set of Radeon power management patches over the weekend, and I've been adding my own code on top of that (Alex's patches go on top of drm-next, mine go on top of his). I've left it stress-testing for a couple of hours without it falling over, which tells me that it's stable enough that I can feel smug. This is a pleasing counterpoint to the previous experiences I've been having, which have been rife with a heady mixture of chip lockups and kernel deadlocks. It turns out that gpus are hard.

There's a few things you need to know about gpus. The first is that if they're discrete devices they typically have their own memory controller and video memory. The second is that there's an impressive number of ways that you can end up touching that memory. The third is that they tend to get upset if something tries to touch that memory and the memory controller is in the middle of reclocking at the time.

The first and most obvious use of video memory is by the gpu itself. Accelerated operations on radeon are carried out by sending a command packet to the command processor. This is achieved by sharing a ring buffer between the cpu and the gpu, with the gpu reading packets out of that ring buffer and performing the operations contained within them. Many of these operations will touch video memory (that being the nature of most things you want a gpu to do), and if that happens in the middle of a reclock, bad things occur. Like the card locking up and potentially taking your PCI bus with it.
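To make the ring buffer idea concrete, here is a toy model of a producer/consumer command ring. It is only a sketch: the class and field names (`CommandRing`, `wptr`, `rptr`) are made up for illustration and do not reflect the actual radeon driver structures or packet format.

```python
class CommandRing:
    """Toy model of a command ring shared by a CPU and a GPU (hypothetical names)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.wptr = 0  # next slot the CPU (producer) will write
        self.rptr = 0  # next slot the GPU (consumer) will read

    def free_slots(self):
        # One slot is always left empty so that wptr == rptr means "empty".
        return (self.rptr - self.wptr - 1) % self.size

    def submit(self, packet):
        """CPU side: queue a packet, or return False if the ring is full."""
        if self.free_slots() == 0:
            return False
        self.slots[self.wptr] = packet
        self.wptr = (self.wptr + 1) % self.size
        return True

    def fetch(self):
        """GPU side: take the next packet, or return None if the ring is empty."""
        if self.rptr == self.wptr:
            return None
        packet = self.slots[self.rptr]
        self.rptr = (self.rptr + 1) % self.size
        return packet
```

The point for reclocking is that anything already sitting between `rptr` and `wptr` will still be executed by the consumer, so stopping new submissions is not enough on its own.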

So, obviously, we don't want that to happen. The first thing we do is take a mutex that blocks any further accelerated operations from being submitted by userspace. Then we wait until we get an interrupt from the gpu telling us that the graphics engine has gone idle. The problem here is that we don't have a terribly good idea of how many more operations there are to complete and we don't know how long each of those operations is going to take, but this is less bad than some of the alternatives[1]. Jerome Glisse has some ideas on how to improve this to require less waiting, but the effects should still be pretty much invisible to the average user.
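The "take a mutex, then wait for idle" pattern can be sketched in userspace terms as follows. This is an illustration under assumptions rather than the driver code: `AccelEngine`, `on_packet_done` and `quiesced` are hypothetical names, and a condition variable stands in for the idle interrupt.

```python
import threading
from contextlib import contextmanager


class AccelEngine:
    """Hypothetical stand-in for the submission path and the idle interrupt."""

    def __init__(self):
        self.submit_lock = threading.Lock()  # held for the whole reclock
        self.idle = threading.Condition()    # signalled when work drains
        self.pending = 0

    def submit(self, packet):
        # Userspace submissions block here while a reclock holds submit_lock.
        with self.submit_lock:
            with self.idle:
                self.pending += 1

    def on_packet_done(self):
        # Stands in for the GPU finishing a packet and, once nothing is
        # left, raising its "gone idle" interrupt.
        with self.idle:
            self.pending -= 1
            if self.pending == 0:
                self.idle.notify_all()

    @contextmanager
    def quiesced(self, timeout=1.0):
        """Block new submissions, then wait for outstanding work to drain.

        Yields True if the engine went idle within the timeout, else False.
        """
        with self.submit_lock:
            with self.idle:
                drained = self.idle.wait_for(lambda: self.pending == 0, timeout)
            yield drained
```

A caller would wrap the clock change in `with engine.quiesced() as drained:` and skip the reclock when `drained` is false, since there is no way to know in advance how long the outstanding work will take.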

So we've stopped the command processor touching ram. Everything's good, right?

Well, not really. The obvious problem is that users typically want to display something, so there's a separate chunk of chip that's repeatedly copying video memory over to your monitor. That's got to go too. Thankfully, there's a convenient bit in the crtc registers that lets us turn that off, but the pretty unsurprising downside is that your screen goes blank while that's happening. So we don't want to do that. Instead, we try to perform the reclock while there's nothing being displayed on the screen - that is, while we're in the screen region where a crt's electron gun would be scanning back from the bottom of the screen to the top. It turns out that rather a lot of display assumptions depend on this happening even if there's no crt, no electron gun and no thick sheet of glass with a decent approximation of vacuum behind it, so we get to do this even if we're displaying to an LVDS. And we have about 400-500 microseconds to do it - an almost generous amount of time.
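As a rough check on that time budget, the length of the vertical blanking interval follows from the mode timings: the number of blank lines multiplied by the time taken to scan one line. The numbers below are an assumption on my part (a commonly quoted reduced-blanking 1920x1200 mode at 60Hz with a 154MHz pixel clock), not something taken from the patches.

```python
def vblank_microseconds(pixel_clock_hz, htotal, vactive, vtotal):
    """Length of the vertical blanking interval in microseconds.

    htotal and vtotal include blanking; vactive is the visible line count.
    """
    line_time_s = htotal / pixel_clock_hz  # time to scan one full line
    blank_lines = vtotal - vactive         # lines with nothing displayed
    return blank_lines * line_time_s * 1e6


# Assumed example mode: 1920x1200 at 60Hz, reduced blanking.
budget = vblank_microseconds(154_000_000, htotal=2080, vactive=1200, vtotal=1235)
print(f"{budget:.1f} microseconds of vblank per frame")
```

Modes with more generous blanking give a longer window, so the figure depends on the panel and mode in use.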

So we ask the hardware to generate an interrupt when we enter vblank and then we reclock. Except the hardware has an irritating habit of lying - sometimes we get the interrupt a line or two before vblank, sometimes we get it after we've already gone out the other side. Vexing, and not entirely solved yet - so sometimes you'll still get a single blank frame during reclock. But there are plans, and they'll probably even work.

At this point the acceleration hardware isn't touching the memory and the scanout hardware isn't touching the memory. Except it still crashes under some workloads. This one took me longer to track down, but the answer turned out to be pretty straightforward. Not all operations are accelerated. When they're not accelerated they have to be done in software. That means that the CPU has to write to the video memory itself. I'm sure you can see where this is going. This was fixed without too much trouble once I'd finished picking through the driver to work out every location where objects might be mapped into the CPU's address space, at which point it's a simple matter of unmapping them and blocking the fault handler from remapping them until the reclock is finished. Linux, thankfully, has lots of synchronisation primitives. And now everything works.
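The "unmap everything and hold off the fault handler" step looks roughly like this when modelled with a condition variable. Again this is a sketch with hypothetical names (`MappingGate`, `fault`, `begin_reclock`, `end_reclock`); the real driver works with kernel mappings and fault handlers rather than a Python set.

```python
import threading


class MappingGate:
    """Hypothetical model of CPU mappings that are torn down during a reclock."""

    def __init__(self):
        self.cond = threading.Condition()
        self.reclocking = False
        self.mapped = set()  # objects currently mapped into the CPU's address space

    def fault(self, obj, timeout=None):
        """CPU touched an unmapped object: wait out any reclock, then remap.

        Returns False if the reclock did not finish within the timeout.
        """
        with self.cond:
            if not self.cond.wait_for(lambda: not self.reclocking, timeout):
                return False
            self.mapped.add(obj)
            return True

    def begin_reclock(self):
        # Drop every mapping so the next CPU access has to fault, and make
        # those faults wait until end_reclock() is called.
        with self.cond:
            self.reclocking = True
            self.mapped.clear()

    def end_reclock(self):
        with self.cond:
            self.reclocking = False
            self.cond.notify_all()
```

Checking the flag and adding the mapping under the same lock matters: otherwise a fault that raced with `begin_reclock` could remap an object just after everything had been unmapped.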

Except when it doesn't. This took a final period of head scratching, followed by the discovery that ttm (the memory allocator used by radeon) has a background thread that would occasionally fire to clean up objects. And by clean up objects, I mean change their mapping - which means updating their status in the gart, which means touching video memory. So, let's block that. And that tipped me off to the fact that even if it couldn't submit new commands, the CPU could still create or destroy objects - with the same consequences.

So, once all of these are blocked, video memory is quiescent and we can do what we want. And we do, at least once I'd sorted out the bits where I was taking locks in the wrong order and deadlocking. Depending on the powerplay tables available on your card we'll choose different rates and so your power savings will vary heavily depending on the values that your vendor provided, but the card I'm testing on sees a handy 30W drop at idle. Right now we're only changing clocks and not dropping voltage so there's potentially a little more to come.
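One common way to avoid the lock-ordering deadlocks mentioned above is to agree on a single order and have every path take the locks in that order. The sketch below does that for the four things that need blocking; the lock names and the `Quiescer` class are hypothetical, and the real driver's locks and their order may well differ.

```python
import threading
from contextlib import ExitStack

# Hypothetical lock names, listed in the one order every code path must use.
LOCK_ORDER = ("submission", "cpu_mappings", "ttm_cleanup", "object_lifetime")


class Quiescer:
    def __init__(self):
        self.locks = {name: threading.Lock() for name in LOCK_ORDER}

    def reclock(self, set_clocks):
        """Take every lock in the agreed order, then change the clocks.

        ExitStack releases the locks in reverse order on the way out,
        including when set_clocks raises.
        """
        with ExitStack() as stack:
            for name in LOCK_ORDER:
                stack.enter_context(self.locks[name])
            # Nothing guarded by these locks can touch video memory now.
            return set_clocks()
```

Any other code path that needs two of these locks has to respect the same order, which is exactly the bit that is easy to get wrong.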

While getting this stable was pretty miserable, the documented entry points for clock changing made a lot of this easier than it would otherwise have been. It's also probably worth noting that Intel's clock configuration registers are entirely missing from any public documentation and the driver Intel submitted to make them work in their latest chips appeared to have been deliberately obfuscated, so thanks to AMD for making all of this possible.

[1] It's possible to insert various command packets that either indicate when they've passed or stall until a register value gets updated, but these either cause awkward problems with the cp locking or mean that the gui idle functionality never goes idle, so they're not ideal either.
Tags: advogato, fedora
So when is Portal going to work on my N900?


April 28 2010, 09:17:46 UTC

What a horrible HW implementation ! If the memory controller would just support back pressure (ie. not accept any requests and stall the initiator) during reclocking, you would only need to be sure the CPU is not touching the memory and there is enough data in the display fifo to continue refreshing the screen while the memory is unavailable.
Back pressure could be quite problematic if the CPU can write directly to it - are you really going to stall the CPU entirely? How long does it take to reclock? I get the impression it's many thousands of cycles; if it were instantaneous these issues wouldn't arise.

(Anyone know what the reclocking involves? Does it soft-reset the chip? Does it trigger a slow ramp?)

Vaguely relatedly, I saw this morning that there exists an Asus 890GX motherboard for the AMD 6-core cpu that apparently includes "Onboard ATI Radeon HD 4290 Graphics Card". It claims to support DirectX 10.1, but what sort of performance does it have?
Would be great if we could see this on :)

This is impressive progress btw. Congrats and thank you all!
Well, or persisted somewhere in docbook or something.



April 28 2010, 19:16:50 UTC

Unfortunately I still see corruption on clock changes on my M76 :(
It seems that the reclock misses the vblank most of the time...


April 28 2010, 22:42:16 UTC

Currently testing this on my HD4770. Works pretty good so far (engine stays in low-clock mode when doing usual desktop work and watching movies) and upclocks once 3D apps kick in.

However I still have the problem that the rendering somehow "lags". It's not exactly lagging but running around in ioquake3 feels like being connected to some kind of rubber band (in the sense that sometimes your ingame speed goes from normal to slowmo, just to speed up some moments later to return to normal speed again).

Again, great work on the PM patches!

Oh yes, I was wondering. Is it (with the current patchset) possible to select just the power state but not the entry "inside" the power state? As far as I can see you always have to write some value like "x.y" to power_state, x encoding the actual power state and y being the entry I mentioned above. Is it supposed to work if I just write x into power_state?




May 7 2010, 00:02:33 UTC

phoronix mentioned your blog and i'd really like to read it regularly.
is there a rss feed hidden somewhere?
