Matthew Garrett ([info]mjg59) wrote,
I've had hard drives die before, but this one seems to be doing it in an especially pathological way. It started with the kernel throwing irq timeout errors and then stepped up into actual read errors, culminating in a corrupt journal, a read-only block device and a forced fsck. It's been behaving a little better since then, though occasional io stalls (without any kernel error) suggest that it's having to repeatedly retry some sectors. SMART says it's all fine, so obviously I'm backing it all up now before a new disk arrives tomorrow and I can sort out access to the data centre. Email might be a bit spotty for me until then.

Turns out that running a 2.5" PATA drive for approximately 4 straight years may not have been the best of ideas. Who'd have guessed?
Tags: advogato, fedora

  • 15 comments

Anonymous

June 26 2009, 00:18:55 UTC 2 years ago

Power supply?

I had similar problems a couple of years back and changed the drive quickly, but the problem persisted with the new drive.
Then I noticed, when checking the motherboard BIOS settings, that the measured 5V value fluctuated a lot.
After changing the power supply, the 'faulty' drive has been working without errors.
If SMART sees nothing wrong, it's better to check that the power supply is OK.

[info]lionsphil

June 26 2009, 09:31:53 UTC 2 years ago

Re: Power supply?

While I'm not disputing that the fault may be elsewhere, IME SMART is near-useless, and ISTR the Google hard drive study backing that one up. I've had drives that have actually failed, click-whirr, click-whirr, crunch, still telling me that they're healthy.

Marketing, and poor understanding of the factors worth measuring, got to it.

[info]thargol

June 26 2009, 09:52:59 UTC 2 years ago

Re: Power supply?

I wouldn't say it's useless, but didn't the Google study show that it only predicted failure in around N% of actual failures (where N was around 50, IIRC)? Or maybe it was the other drive study that came out at around the same time. So all that means is that SMART can be a helpful guide, but you shouldn't rely on it. But we already knew that anyway.

[info]lionsphil

June 26 2009, 10:05:57 UTC 2 years ago

Re: Power supply?

I thought the %age was lower, but I don't have the time to go rummaging now alas. I leave BIOS SMART reporting on, yeah---I'll take every warning I can get, but my expectation of it ever producing a useful one is balanced only by the fact that the test is basically free.

Anonymous

June 26 2009, 11:36:52 UTC 2 years ago

Re: Power supply?

The "overall-health self-assessment test" is useless.

Useful info is lifetime records of:
- error rates (if != 0, replace drive).
- reallocated sector count (replace drive soon)
- udma CRC errors (cable errors)
- lifetime temperature readings and cycle counts
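
As a rough illustration of the checklist above, here is a minimal Python sketch that reads text in the column layout printed by `smartctl -A` and applies two of the rules. The sample values are invented, and the mapping from attribute names to advice is an assumption based on the list above; attribute names and the meaning of raw values vary by vendor, so treat it as a starting point only.

```python
# Sketch only: applies the rules of thumb from the list above to text in
# the layout printed by `smartctl -A`. The sample values are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   034   048   000    Old_age   Always       -       34
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Attribute name -> advice. This mapping is an assumption, not a standard.
ADVICE = {
    "Reallocated_Sector_Ct": "replace drive soon",
    "UDMA_CRC_Error_Count": "check the cable",
}


def parse_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attrs


def warnings(attrs):
    """List advice for every watched attribute whose raw value is non-zero."""
    return [
        f"{name}: raw value {attrs[name]}, {advice}"
        for name, advice in ADVICE.items()
        if attrs.get(name, 0) != 0
    ]


if __name__ == "__main__":
    for message in warnings(parse_attributes(SAMPLE)):
        print(message)
```

On a real machine the text would come from running `smartctl -A` against the device rather than from the embedded sample.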

Anonymous

June 26 2009, 14:09:43 UTC 2 years ago

SMART stats

From what I gathered the rules are:

If SMART says a disk is NOT OK you are almost certainly already in trouble and probably only have days left.
SMART rarely declares the disk BAD before the disk has actually gone bad (as someone else said, the thresholds are too high), so waiting for it is not a plan.
You can sometimes tell a disk is going bad by monitoring all the SMART settings and noticing one moving too far even though it is under the threshold.
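
The last rule can be sketched in Python as a check on a series of normalized attribute readings collected over time. The function name, the `margin` parameter and its default are hypothetical; the only assumption about SMART itself is that a normalized value counts down towards its threshold and the attribute fails once the value is at or below it.

```python
def drifting(history, thresh, margin=10):
    """Return True if an attribute is still passing but has moved noticeably.

    history: normalized VALUE readings for one attribute, oldest first.
    thresh:  the THRESH reported for that attribute.
    margin:  hypothetical cut-off for how large a drop counts as "too far".
    """
    if len(history) < 2:
        return False  # not enough readings to see a trend
    latest = history[-1]
    still_passing = latest > thresh           # SMART would not flag this yet
    dropped = history[0] - latest >= margin   # but it has slid since the first reading
    return still_passing and dropped


if __name__ == "__main__":
    print(drifting([100, 100, 97, 85], thresh=36))
    print(drifting([100, 100, 99], thresh=36))
```

A sensible `margin` depends on the attribute and the drive model, so any single default is a guess.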

The Google paper is summarised nicely here: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/ . The other paper (which won best paper at FAST 2007) is mentioned here: http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/ .

[info]teferi

June 26 2009, 02:16:45 UTC 2 years ago

I'm told that marketing people frequently make the engineers wind the SMART thresholds up to the point where the drive will die long before the failure indicators get tripped so that they can advertise longer MTBFs.

[info]cowbutt

June 26 2009, 11:05:31 UTC 2 years ago

I've been using HDDs since about 1992, and haven't actually had one outright fail on me yet, personally or professionally.

On personal machines, I have smartd monitor changes and email me, I run long or selective tests daily (when the machine is otherwise idle), and I have Linux md check my RAID1 arrays daily (echo check >/sys/block/md3/md/sync_action).
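
For anyone who wants to script the md check mentioned above, here is a minimal Python sketch. It assumes the array's sysfs directory (for example `/sys/block/md3/md`, as in the command above) contains the `sync_action` and `mismatch_cnt` files; the function names are made up for illustration, and writing to sysfs needs root.

```python
from pathlib import Path


def start_check(md_dir):
    """Request a consistency check, like `echo check > .../sync_action`."""
    Path(md_dir, "sync_action").write_text("check\n")


def read_status(md_dir):
    """Return (current sync action, mismatch count) for the array."""
    action = Path(md_dir, "sync_action").read_text().strip()
    mismatches = int(Path(md_dir, "mismatch_cnt").read_text().strip())
    return action, mismatches


if __name__ == "__main__":
    md_dir = "/sys/block/md3/md"  # array name taken from the comment above
    start_check(md_dir)
    print(read_status(md_dir))
```

Once the action reads `idle` again, a non-zero mismatch count is worth investigating.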

I've had a few sectors go unreadable, but they've mostly just been occupied by log files, so I just deleted the log files and filled up /var with a dummy file, forcing the drive to reallocate the failed sectors. I think I've lost one "real" file due to a sector going bad, but it was just something I could re-download.

[info]cowbutt

June 26 2009, 11:07:40 UTC 2 years ago

Oh, and I usually retire drives at 2-3 years old and use them only as scratch drives from that point on.

[info]simont

June 26 2009, 11:32:07 UTC 2 years ago

You're luckier than I am, then! I've had four complete drive-deaths (at least that I can remember off the top of my head) in the period from 1994 to now, just from my home computer use. Three of them were within the first year of life.

Anonymous

June 26 2009, 11:22:25 UTC 2 years ago

Contacts

I've seen a drive do this kind of stuff (random 'freezes') on a Windows system. I didn't check SMART values on it, but took away the electronics to find that the board's contacts leading to the head/actuator were quite badly corroded. Tapped them with a soldering iron, worked like a charm.

[info]Dave Holland [org.uk]

June 26 2009, 12:10:21 UTC 2 years ago

Is/was it mounted horizontally or vertically?

My work Dell SX260 has a vertical 2.5" drive, and it eats a drive about every two years.

[info]mjg59

June 26 2009, 22:07:48 UTC 2 years ago

Horizontally - it's a mac mini.

Anonymous

June 27 2009, 09:12:21 UTC 2 years ago

meme time

Over the last few weeks, I've heard a lot of people complain about bad disks, all of them Western Digital ones with 500 GB (2.5" and 3.5") and one 1 TB 3.5". Is yours another 500 GB WDC disk?

[info]lure [launchpad.net]

June 29 2009, 11:56:14 UTC 2 years ago

Re: meme time

I am also losing trust in WD after my 1TB Black Edition started to return media errors on read after only a month...