Matthew Garrett ([info]mjg59) wrote,
@ 2009-06-25 22:27:00
Entry tags:advogato, fedora

I've had hard drives die before, but this one seems to be doing it in an especially pathological way. It started with the kernel throwing IRQ timeout errors and then stepped up into actual read errors, culminating in a corrupt journal, a read-only block device and a forced fsck. It's been behaving a little better since then, though occasional I/O stalls (without any kernel error) suggest that it's having to repeatedly retry some sectors. SMART says it's all fine, so obviously I'm backing it all up now before a new disk arrives tomorrow and I can sort out access to the data centre. Email might be a bit spotty for me until then.

Turns out that running a 2.5" PATA drive for approximately 4 straight years may not have been the best of ideas. Who'd have guessed?




(15 comments) - (Post a new comment)

Power supply?
(Anonymous)
2009-06-26 12:18 am UTC (link)
I had similar problems a couple of years back and changed the drive quickly, but the problem persisted with the new drive.
Then, while checking the motherboard BIOS settings, I noticed that the measured 5V value fluctuated a lot.
Since I changed the power supply, the 'faulty' drive has been working without errors.
If SMART sees nothing wrong, it's better to check that the power supply is OK.

Re: Power supply?
[info]lionsphil
2009-06-26 09:31 am UTC (link)
While I'm not disputing that the fault may be elsewhere, IME SMART is near-useless, and ISTR the Google hard drive study backing that one up. I've had drives that have actually failed, click-whirr, click-whirr, crunch, still telling me that they're healthy.

Marketing, and a poor understanding of the factors worth measuring, got to it.

Re: Power supply?
[info]thargol
2009-06-26 09:52 am UTC (link)
I wouldn't say it's useless, but didn't the Google study show that it only predicted failure in around N% of actual failures (where N was around 50, IIRC)? Or maybe it was the other drive study that came out at around the same time. So all that means is that SMART can be a helpful guide, but you shouldn't rely on it. But we already knew that anyway.

Re: Power supply?
[info]lionsphil
2009-06-26 10:05 am UTC (link)
I thought the %age was lower, but I don't have the time to go rummaging now alas. I leave BIOS SMART reporting on, yeah---I'll take every warning I can get, but my expectation of it ever producing a useful one is balanced only by the fact that the test is basically free.

Re: Power supply?
(Anonymous)
2009-06-26 11:36 am UTC (link)
The "overall-health self-assessment test" is useless.

Useful info is lifetime records of:
- error rates (if != 0, replace drive).
- reallocated sector count (replace drive soon)
- udma CRC errors (cable errors)
- lifetime temperature readings and cycle counts
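
A minimal sketch of how the rules of thumb in the list above could be applied to raw SMART values. The attribute names are assumptions modelled on the labels smartctl usually prints; real names and raw-value encodings vary by vendor (some drives report non-zero raw error rates when perfectly healthy), so treat this as an illustration only. The input dict and the `triage` function are hypothetical.

```python
# Assumed attribute names, modelled on typical smartctl output.
# Real names and raw-value encodings differ between drive vendors.
ERROR_RATE_KEYS = ("Raw_Read_Error_Rate", "Seek_Error_Rate")
REALLOCATED_KEY = "Reallocated_Sector_Ct"
CRC_KEY = "UDMA_CRC_Error_Count"


def triage(raw):
    """Apply the commenter's rules of thumb to a dict of raw SMART values.

    raw: mapping of attribute name -> raw integer value (hypothetical input).
    Returns a list of advice strings, empty if nothing looks wrong.
    """
    advice = []
    # Rule: any non-zero error rate means replace the drive.
    if any(raw.get(key, 0) != 0 for key in ERROR_RATE_KEYS):
        advice.append("replace drive")
    # Rule: reallocated sectors mean replace the drive soon.
    if raw.get(REALLOCATED_KEY, 0) > 0:
        advice.append("replace drive soon")
    # Rule: UDMA CRC errors point at the cable, not the disk.
    if raw.get(CRC_KEY, 0) > 0:
        advice.append("check cable")
    return advice
```

Temperature readings and cycle counts are left out because the list gives no rule for them.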

SMART stats
(Anonymous)
2009-06-26 02:09 pm UTC (link)
From what I gathered the rules are:

If SMART says a disk is NOT OK you are almost certainly already in trouble and probably only have days left.
SMART rarely declares the disk BAD before the disk actually goes bad (as someone else said, the thresholds are too high), so waiting for it is not a plan.
You can sometimes tell a disk is going bad by monitoring all the SMART settings and noticing one moving too far even though it is under the threshold.
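
One way to read the last rule is as a trend check on a single attribute. The sketch below assumes the usual SMART convention that normalised values start high and the drive is flagged once the value falls to the vendor threshold. The function name, the `fraction` parameter and its default are hypothetical choices, not anything taken from smartmontools or the papers.

```python
def drifting(readings, threshold, fraction=0.5):
    """Return True if a normalised SMART value has used up more than
    `fraction` of its initial headroom without yet reaching `threshold`.

    readings:  normalised values, oldest first (higher is healthier).
    threshold: vendor failure threshold for this attribute.
    """
    if len(readings) < 2:
        return False  # no trend to speak of yet
    first, latest = readings[0], readings[-1]
    headroom = first - threshold
    if headroom <= 0 or latest <= threshold:
        # Already at or past the threshold: SMART itself reports that case.
        return False
    return (first - latest) > fraction * headroom
```

For example, a value that has fallen from 100 to 60 against a threshold of 36 has used 40 of its 64 points of headroom and would be flagged, while a drop from 100 to 99 would not.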

The Google paper is summarised nicely here: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/ . The other paper (which won best paper at FAST 2007) is mentioned here: http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/ .


[info]teferi
2009-06-26 02:16 am UTC (link)
I'm told that marketing people frequently make the engineers wind the SMART thresholds up to the point where the drive will die long before the failure indicators get tripped so that they can advertise longer MTBFs.


[info]cowbutt
2009-06-26 11:05 am UTC (link)
I've been using HDDs since about 1992, and haven't actually had one outright fail on me yet, personally or professionally.

On personal machines, I have smartd monitor changes and email me, I run long or selective tests daily (when the machine is otherwise idle), and I have Linux md check my RAID1 arrays daily (echo check >/sys/block/md3/md/sync_action).
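
A rough Python equivalent of the md check above, in case it needs to run from a script rather than a shell one-liner. It uses the same `sync_action` file plus md's `mismatch_cnt` file for the result. The function names and the `sysfs` parameter (there so the code can be pointed at a test directory) are assumptions of this sketch, and writing to the real file needs root.

```python
from pathlib import Path


def start_md_check(array="md3", sysfs=Path("/sys/block")):
    """Ask md to scrub `array`; same effect as the echo command above."""
    (sysfs / array / "md" / "sync_action").write_text("check\n")


def md_state(array="md3", sysfs=Path("/sys/block")):
    """Return (current sync action, mismatch count) for `array`."""
    md = sysfs / array / "md"
    action = (md / "sync_action").read_text().strip()
    mismatches = int((md / "mismatch_cnt").read_text())
    return action, mismatches
```

A non-zero mismatch count once the action has returned to `idle` is what would be worth an email.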

I've had a few sectors go unreadable, but they've mostly just been occupied by log files, so I just deleted the log files and filled up /var with a dummy file, forcing the drive to reallocate the failed sectors. I think I've lost one "real" file due to a sector going bad, but it was just something I could re-download.


[info]cowbutt
2009-06-26 11:07 am UTC (link)
Oh, and I usually retire drives at 2-3 years old and use them only as scratch drives from that point on.


[info]simont
2009-06-26 11:32 am UTC (link)
You're luckier than I am, then! I've had four complete drive-deaths (at least that I can remember off the top of my head) in the period from 1994 to now, just from my home computer use. Three of them were within the first year of life.

Contacts
(Anonymous)
2009-06-26 11:22 am UTC (link)
I've seen a drive do this kind of stuff (random 'freezes') on a Windows system. I didn't check SMART values on it, but took away the electronics to find that the board's contacts leading to the head/actuator were quite badly corroded. Tapped them with a soldering iron, worked like a charm.


[info]Dave Holland [org.uk]
2009-06-26 12:10 pm UTC (link)
Is/was it mounted horizontally or vertically?

My work Dell SX260 has a vertical 2.5" drive, and it eats a drive about every two years.


[info]mjg59
2009-06-26 10:07 pm UTC (link)
Horizontally - it's a Mac mini.

meme time
(Anonymous)
2009-06-27 09:12 am UTC (link)
Over the last few weeks I've heard a lot of people complain about bad disks, all of them Western Digital: 500 GB ones (2.5" and 3.5") and one 1 TB 3.5". Is yours another 500 GB WDC disk?

Re: meme time
[info]lure [launchpad.net]
2009-06-29 11:56 am UTC (link)
I am also losing trust in WD after my 1TB Black Edition started to return media errors on read after only a month...

