Matthew Garrett ([info]mjg59) wrote,
@ 2008-05-14 02:14:00
Entry tags:advogato

As noted in the comments here, one reason to forcibly slow down the CPU on a system (and waste power in the long run) is to deal with cases where failing to do so can result in the machine overheating. Part of the problem is that Linux still tends to be less power efficient than Windows, and so various components will be generating more heat than when the machine was tested. The ACPI spec includes support for passive cooling of processors, which allows the system to reduce heat generation by slowing the CPU down. The problem is that not all vendors include a passive trip point in their firmware, and in the absence of one Linux will happily let the temperature rise until it hits a critical level and the machine shuts down.

I've written a patch that generates a passive cooling trip point if your firmware doesn't provide one. Since there won't be a firmware event when the temperature crosses the trip point, it also forcibly enables polling (the spec actually requires this, so I've no idea why Linux doesn't do it anyway). The default polling interval is quite long - you can adjust it in /proc/acpi/thermal_zone/*/polling by just echoing in a value in seconds. I've gone for a long delay in order to reduce any possible power consumption issues caused by this, but in the long run I'd like it to reduce the polling interval if the temperature trend is upwards and increase it again if the trend is downwards.
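The trend-based polling idea could look something like the sketch below. It is not part of the patch: the function name, the halving/doubling rule and the 1 to 30 second bounds are all assumptions picked purely for illustration.

```python
def next_polling_interval(current_s, previous_temp_c, current_temp_c,
                          min_s=1, max_s=30):
    """Pick the next polling interval in seconds (hypothetical policy).

    Poll more often while the temperature is rising and back off
    again while it is falling. The bounds are illustrative defaults.
    """
    if current_temp_c > previous_temp_c:
        # Heating up: halve the interval, but never go below min_s.
        return max(min_s, current_s // 2)
    if current_temp_c < previous_temp_c:
        # Cooling down: double the interval, but never exceed max_s.
        return min(max_s, current_s * 2)
    # No change in temperature: keep the current interval.
    return current_s
```

Starting from a 10 second interval, a reading that rises from 60 to 65 degrees would shorten the interval to 5 seconds, and a reading that falls back would lengthen it to 20.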

The patch makes the assumption that you're never going to get within 5 degrees of the critical temperature in normal use, which I think is pretty reasonable. The probability that the machine will reach equilibrium at that point is fairly small, so if you get that close you'll almost certainly end up with a powered down machine unless we do something about it (like forcibly downclocking your CPU). I don't have any affected machines, so I've only been able to test it by artificially lowering the trip point - if people find that it still lets the processor get over temperature without attempting to slow it down first, then I'll probably need to implement the more advanced polling policy.
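As a rough illustration of the logic (not the kernel code itself), the fallback can be sketched as follows. It assumes the generated trip point sits exactly 5 degrees below the critical temperature; the names here are hypothetical and do not come from the patch.

```python
PASSIVE_MARGIN_C = 5  # assumed margin below the critical temperature


def effective_passive_trip(critical_c, firmware_passive_c=None):
    """Return the passive trip point in degrees C.

    Use the firmware's passive trip point when it provides one;
    otherwise synthesise one just below the critical temperature.
    """
    if firmware_passive_c is not None:
        return firmware_passive_c
    return critical_c - PASSIVE_MARGIN_C


def should_throttle(temp_c, critical_c, firmware_passive_c=None):
    """True once the temperature reaches the passive trip point."""
    return temp_c >= effective_passive_trip(critical_c, firmware_passive_c)
```

With a critical temperature of 100 C and no firmware trip point, throttling would start at 95 C. Because the temperature is only sampled when polled, a long polling interval could let it overshoot that point between samples.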

In any case, I'm interested in feedback.






[info]jmspeex
2008-05-14 10:01 am UTC (link)
I'm the (not so) happy owner of a Dell D820 laptop, which has really crappy cooling. When I run something compute-intensive (e.g. a codec automated quality test) for several minutes it tends to overheat. As it reaches 95 degrees, something (BIOS? kernel? magic?) lowers the frequency automatically. If that's not enough it also starts throttling, sometimes going very far in the "throttling modes". Despite that, I've never seen the machine shut down, although I've seen it being slow as hell with a temperature around 90 degrees. In a perfect world, it would be nice if Linux detected that my CPU can't run the job without overheating and did the throttling "per process" instead, slowing down the CPU-hungry processes without having to make the entire machine feel like a 486.



[info]mjg59
2008-05-14 10:05 am UTC (link)
In theory, the scheduler could be made temperature and throttling aware. It's an interesting idea.



[info]jmspeex
2008-05-14 10:25 am UTC (link)
Actually, it's not so much about making the scheduler temperature-aware (I guess the temperature part could be handled in userspace), but about allowing the scheduler to go idle even when there are processes waiting. Could be as simple as slowing down all nice -19 processes, though detecting non-interactive processes would be even better.


AND
(Anonymous)
2008-10-09 09:42 am UTC (link)
Can't AND work in this case?
It's userlevel, but it can limit the CPU/IO use of any app

http://manpages.ubuntu.com/manpages/intrepid/en/man8/and.html


Re: AND
[info]mjg59
2008-10-09 09:57 am UTC (link)
No. Setting nice values lets you choose which process gets the processor, but if there's enough overall demand for CPU time then you'll be at 100% usage anyway. You'll just have shifted which processes get to run first.



[info]James Henstridge [gnome.org]
2008-05-14 10:38 am UTC (link)
Should the EXPORT_SYMBOL(processor_list); bit read acpi_processor_list instead?



[info]mjg59
2008-05-14 10:40 am UTC (link)
Whoops! Yes. Fixed.



[info]jbailey
2008-05-14 05:04 pm UTC (link)
Is there a userspace way to see if our laptops have the trip point? I'll cheerfully do some testing if mine does.

I have a lovely Toshiba that never gets past C2 because of the Atheros chip, and the is on almost constantly. I suspect it doesn't need to be for my belle's intense games of Nethack :)



[info]jbailey
2008-05-14 05:05 pm UTC (link)
the fan is on almost constantly. English hard.


[info]mjg59
2008-05-14 05:07 pm UTC (link)
cat /proc/acpi/thermal_zones/*/trip_points and see if any of them define a passive zone.



[info]jbailey
2008-05-14 05:20 pm UTC (link)
Joy. /proc/acpi/thermal_zones is empty. I double checked and thermal is loaded.


/proc/acpi/thermal_zone/*/trip_points (no plural)
(Anonymous)
2008-05-23 01:58 pm UTC (link)
(no plural for zone, but there is for points)

$ grep -H . /proc/acpi/thermal_zone/*/trip_points
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 100 C



[info]reddragdiva
2008-05-15 11:14 am UTC (link)
Heh. I found out about mine (Compaq N410c, which I love in every other way) by installing ktemperature and watching it switch itself off when it hit 95°C.

Thank you for this, Matthew, it's just what we needed :-)

ubuntu probs
(Anonymous)
2008-05-14 07:17 pm UTC (link)
The patch fails against the ubuntu 2.6.24 source. FYI.

patching file drivers/acpi/thermal.c
Hunk #1 succeeded at 788 (offset -58 lines).
Hunk #2 succeeded at 1230 with fuzz 2 (offset -346 lines).
patching file drivers/acpi/scan.c
Hunk #2 succeeded at 990 (offset -57 lines).
patching file drivers/acpi/thermal.c
Hunk #1 succeeded at 119 (offset 3 lines).
Hunk #2 FAILED at 423.
Hunk #3 FAILED at 443.
2 out of 3 hunks FAILED -- saving rejects to file drivers/acpi/thermal.c.rej


Re: ubuntu probs
(Anonymous)
2008-05-14 07:18 pm UTC (link)
2.6.24-16-generic to be exact.


first vacuum-clean the fans... then switch to conservative
(Anonymous)
2008-05-14 10:34 pm UTC (link)
Geezz, I've spent an entire week reading blogs/forums/howtos/readmes/mans like this one to save my overheating CPU from melting (mobile p4 3.07 GHz, stuffed in a laptop)

And finally I did the ONE and ONLY right thing to do... borrow the vacuum cleaner from my wife... keyboard used to burn fingers, now it's sooooo cool


Re: first vacuum-clean the fans... then switch to conservative
[info]reddragdiva
2008-05-15 11:15 am UTC (link)
:-D

Note that you should take great care with vacuum-cleaning PCs - the static is terrible. Earthed metal nozzle only.


Re: first vacuum-clean the fans... then switch to conservative
(Anonymous)
2008-05-15 05:46 pm UTC (link)
Sounds scary.
Do you mean that I risk frying the motherboard, or that I will just spike my fingers with the next metal object I touch? Or both?
A metal nozzle (let alone an earthed one) is not in my wife's toolbox. Would I get the same protection if I left the laptop power cord plugged in (and earthed) during vacuum-cleaning?


Re: first vacuum-clean the fans... then switch to conservative
[info]reddragdiva
2008-05-15 07:12 pm UTC (link)
hmm. Alfoil around the nozzle and earth that?


Linux power efficiency
(Anonymous)
2008-05-22 08:21 pm UTC (link)
> Part of the problem is that Linux still tends to be less power efficient than Windows

Where in detail is the difference/problem here, and what can we do about it?


Dell blues
(Anonymous)
2008-06-17 07:24 pm UTC (link)
Hi Matt,

I also was the displeased owner of a Dell, a D620 core2 duo. (Let's not even mention the fan air intake on the bottom surface, so it can't actually be placed on my lap.) Up until recently everything (else) seemed fine. Then, recently, if I did anything slightly CPU intensive, ACPI would throttle it back to T7 (12%), and ignore any contrary command, and clock speed stuck at 1GHz (50%). CPU usage would show 100% if a browser was running, just to keep up with polling. CPU temperature reported never exceeded 62C.

I used a can of blaster to blow the dust out, and now the reported temperature goes as high as 81C on a kernel build, but it doesn't throttle any more. Evidently dust buildup doesn't just interfere with cooling, it also interferes with the temperature reporting, at least in some machines. This boggles me: who would put a sensor in a spot where dust could cause it to under-report CPU temperature? And how did something know the real temperature to throttle back the CPU?

Nathan Myers
ncm@cantrip.org


