I've had hard drives die before, but this one seems to be doing it in an especially pathological way. It started with the kernel throwing IRQ timeout errors and then stepped up into actual read errors, culminating in a corrupt journal, a read-only block device and a forced fsck. It's been behaving a little better since then, though occasional I/O stalls (without any kernel error) suggest that it's having to repeatedly retry some sectors. SMART says it's all fine, so obviously I'm backing it all up now before a new disk arrives tomorrow and I can sort out access to the data centre. Email might be a bit spotty for me until then.
Turns out that running a 2.5" PATA drive for approximately 4 straight years may not have been the best of ideas. Who'd have guessed?
Anonymous
June 26 2009, 00:18:55 UTC 2 years ago
Power supply?
I had similar problems a couple of years back and changed the drive quickly, but the problem persisted with the new drive. Then, when checking the motherboard BIOS settings, I noticed that the measured 5V value fluctuated a lot.
After changing the power supply, the 'faulty' drive has been working without errors.
If SMART sees nothing wrong, it's better to check that the power supply is OK.
June 26 2009, 09:31:53 UTC 2 years ago
Re: Power supply?
While I'm not disputing that the fault may be elsewhere, IME SMART is near-useless, and ISTR the Google hard drive study backing that one up. I've had drives that have actually failed, click-whirr, click-whirr, crunch, still telling me that they're healthy. Marketing, and poor understanding of the factors worth measuring, got to it.
June 26 2009, 09:52:59 UTC 2 years ago
Re: Power supply?
I wouldn't say it's useless, but didn't the Google study show that it only predicted failure in around N% of actual failures (where N was around 50, IIRC)? Or maybe it was the other drive study that came out at around the same time. So all that means is that SMART can be a helpful guide, but you shouldn't rely on it. But we already knew that anyway.
June 26 2009, 10:05:57 UTC 2 years ago
Re: Power supply?
I thought the %age was lower, but I don't have the time to go rummaging now alas. I leave BIOS SMART reporting on, yeah---I'll take every warning I can get, but my expectation of it ever producing a useful one is balanced only by the fact that the test is basically free.
Anonymous
June 26 2009, 11:36:52 UTC 2 years ago
Re: Power supply?
The "overall-health self-assessment test" is useless.
Useful info is lifetime records of:
- error rates (if != 0, replace drive).
- reallocated sector count (replace drive soon)
- udma CRC errors (cable errors)
- lifetime temperature readings and cycle counts
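As a rough illustration of reading those counters rather than the overall verdict, here is a minimal Python sketch that scans text shaped like the attribute table printed by `smartctl -A` and flags non-zero raw values for two of the attributes listed above. The sample table is made up for the example, the attribute IDs 5 and 199 are the conventional ones but not guaranteed on every drive, and the advice strings just paraphrase the list. Raw error-rate attributes are deliberately left out, since some vendors pack other data into that field and a non-zero raw value there is not always a real error count.

```python
import re

# Advice keyed by SMART attribute ID (assumed conventional IDs, check your drive).
ADVICE = {
    5: "reallocated sectors present, plan to replace the drive soon",
    199: "UDMA CRC errors present, suspect the cable or connector",
}

def parse_attributes(text):
    """Return {id: (name, raw_value)} from smartctl -A style table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # The raw column may carry extras such as "34 (Min/Max 20/45)".
        match = re.match(r"\d+", fields[9])
        if match:
            attrs[int(fields[0])] = (fields[1], int(match.group()))
    return attrs

def warnings(text):
    """List advice for watched attributes whose raw value is above zero."""
    attrs = parse_attributes(text)
    return [
        f"{attrs[attr_id][0]}={attrs[attr_id][1]}: {advice}"
        for attr_id, advice in ADVICE.items()
        if attr_id in attrs and attrs[attr_id][1] > 0
    ]

# Invented sample in the usual column layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (Min/Max 20/45)
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for message in warnings(SAMPLE):
        print(message)
```

In practice the text would come from running `smartctl -A` against the device; the parsing is kept separate so it can be tried on saved output first.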
Anonymous
June 26 2009, 14:09:43 UTC 2 years ago
SMART stats
From what I gathered the rules are:
If SMART says a disk is NOT OK you are almost certainly already in trouble and probably only have days left.
SMART rarely declares the disk BAD before the disk has actually gone bad (as someone else said, the thresholds are too high).
You can sometimes tell a disk is going bad by monitoring all the SMART settings and noticing one moving too far even though it is under the threshold.
The Google paper is summarised nicely here: http://storagemojo.com/2007/02/19/google
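The "notice one moving" rule can be sketched roughly as follows. This is a minimal Python example under stated assumptions: the snapshots are plain dictionaries of attribute name to raw value that you have collected yourself (oldest first), the function name `rising_attributes` is hypothetical, and the numbers are invented. A tool like smartd can already report attribute changes, so this only illustrates the idea.

```python
def rising_attributes(snapshots):
    """Return {name: (first_value, latest_value)} for raw counters that grew.

    snapshots: list of {attribute_name: raw_value} dicts, oldest first.
    """
    rising = {}
    for older, newer in zip(snapshots, snapshots[1:]):
        for name, value in newer.items():
            if name in older and value > older[name]:
                # Keep the value seen before the first increase as the baseline.
                first = rising.get(name, (older[name], value))[0]
                rising[name] = (first, value)
    return rising

# Invented history: reallocations creep up while staying far below any threshold.
history = [
    {"Reallocated_Sector_Ct": 0, "UDMA_CRC_Error_Count": 0},
    {"Reallocated_Sector_Ct": 2, "UDMA_CRC_Error_Count": 0},
    {"Reallocated_Sector_Ct": 7, "UDMA_CRC_Error_Count": 0},
]

if __name__ == "__main__":
    for name, (first, latest) in rising_attributes(history).items():
        print(f"{name} rose from {first} to {latest}")
```

Comparing raw values only makes sense for attributes that are plain counters; temperature, for instance, moves up and down normally.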
June 26 2009, 02:16:45 UTC 2 years ago
June 26 2009, 11:05:31 UTC 2 years ago
On personal machines, I have smartd monitor changes and email me, I run long or selective tests daily (when the machine is otherwise idle), and I have Linux md check my RAID1 arrays daily (echo check >/sys/block/md3/md/sync_action).
I've had a few sectors go unreadable, but they've mostly just been occupied by log files, so I just deleted the log files and filled up /var with a dummy file, forcing the drive to reallocate the failed sectors. I think I've lost one "real" file due to a sector going bad, but it was just something I could re-download.
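For anyone wanting to script the md check mentioned above, a minimal Python sketch might look like this. It writes `check` to the array's `sync_action` file, as in the `echo` command, and reads the `mismatch_cnt` file that md exposes alongside it. The function names and the `sysfs_root` parameter are hypothetical, added so the code can be pointed at a scratch directory for testing; on a real system it needs root.

```python
from pathlib import Path

def start_check(array, sysfs_root="/sys/block"):
    """Ask md to start a consistency check of the given array, e.g. "md3"."""
    (Path(sysfs_root) / array / "md" / "sync_action").write_text("check\n")

def mismatch_count(array, sysfs_root="/sys/block"):
    """Return the mismatch counter md reports for the array."""
    text = (Path(sysfs_root) / array / "md" / "mismatch_cnt").read_text()
    return int(text.strip())

if __name__ == "__main__":
    start_check("md3")
    print("mismatches reported so far:", mismatch_count("md3"))
```

The counter is only meaningful once the check has finished, so a cron job would typically start the check and read the result on a later run.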
June 26 2009, 11:07:40 UTC 2 years ago
June 26 2009, 11:32:07 UTC 2 years ago
Anonymous
June 26 2009, 11:22:25 UTC 2 years ago
Contacts
I've seen a drive do this kind of stuff (random 'freezes') on a Windows system. I didn't check SMART values on it, but took away the electronics to find that the board's contacts leading to the head/actuator were quite badly corroded. Tapped them with a soldering iron, worked like a charm.
June 26 2009, 12:10:21 UTC 2 years ago
My work Dell SX260 has a vertical 2.5" drive, and it eats a drive about every two years.
June 26 2009, 22:07:48 UTC 2 years ago
Anonymous
June 27 2009, 09:12:21 UTC 2 years ago
meme time
Over the last few weeks, I've heard a lot of people complain about bad disks, all of them Western Digital ones with 500 GB (2.5" and 3.5") and one 1 TB 3.5". Is yours another 500 GB WDC disk?
June 29 2009, 11:56:14 UTC 2 years ago
Re: meme time
I am also losing trust in WD after my 1TB Black Edition started to return media errors on read after only a month...