bellstar
A Disk Story, and More
Nov 11 09:22 UTC 2011
So, here is a story and some questions.
I have a bunch of hard disks ranging from 40 GB to 2 TB. (Smaller capacity
ones have either died or been given away long ago.) Their power-on hours range
from upwards of 80K down to a few score. One 2 TB HDD, a WD Caviar Green 2 TB
EADS series, started acting up. Writes were as fast as ever. Reads would start
out fast at around 80 MB/s but then, after a few seconds, slow down to 1-2 MB/s
or even less. It hardly mattered which files were being read. The drive had roughly
60 GB of free space on the single NTFS-formatted partition that spanned it.
For serious work I detached the drive and moved it to my FreeBSD machine.
First course of action, of course, was to do
# smartctl -A /dev/pass2
The drive is controlled by mfi(4) and is configured as a RAID0 volume.
Passthrough provided by mfip.ko.
Results indicated 0 reallocated sectors but 269 sectors pending reallocation
and an extremely high MZER/WER.
Despite the recent HDD price spike and because the data on that disk was
valuable to me I purchased a replacement, stressed it a bit, and
# ddrescue -v /dev/mfid0 /dev/mfid2 log
After roughly 16 hours the process was complete with zero lost bytes.
With my data in a safe place I began tinkering with the old drive.
# mfiutil drive clear /dev/mfid2 start
This supposedly initiated a zeroing of the drive which I expected would
trigger reallocation of pending sectors.
With the clearing finished,
# smartctl -A /dev/pass2
This showed the sectors pending reallocation were gone but the number of
reallocated sectors was still at zero.
Suspicious, I tried
# smartctl -t long /dev/pass2
After some four hours of waiting the test came back with a failure. Sectors
pending reallocation became non-zero again. This time it reported only 1 bad
sector. The LBA of the first bad sector was in the output of
# smartctl -l selftest /dev/pass2
Following a smartmontools HOWTO I did
# dd if=/dev/zero of=/dev/mfid2 bs=512 count=8 seek=3702951801
To be sure, I started at (LBA_of_first_error - 4) and wrote 8 sectors' worth
of zeros to the disk. (This is an AF disk, naturally, but it emulates 512-byte
sectors.)
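The offset arithmetic behind the dd command above can be sketched in shell.
The first function reproduces the (LBA_of_first_error - 4) start that was
used. The second is an alternative that assumes 4096-byte physical sectors
exposed as eight 512-byte logical sectors with no alignment offset, and rounds
down to the start of the physical sector containing the bad LBA. The function
names are made up for illustration, and the LBA 3702951805 is only inferred
from the seek value above (3702951801 + 4).

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any tool.

# Start LBA as used in the post: a fixed number of sectors before the bad LBA.
start_before() {
    # $1 = LBA of first error, $2 = number of sectors to back up
    echo $(( $1 - $2 ))
}

# Assumption: 8 logical sectors per physical sector, aligned at LBA 0.
# Rounds down to the first logical sector of the containing physical sector.
start_aligned() {
    # $1 = LBA of first error
    echo $(( $1 / 8 * 8 ))
}

lba=3702951805   # inferred: seek value from the post plus 4
echo "seek used in the post: $(start_before "$lba" 4)"
echo "4K-aligned seek:       $(start_aligned "$lba")"
```

With count=8 either start covers the reported LBA. Under the stated
assumption only the aligned start rewrites exactly one whole physical sector;
whether the drive needs that to clear a pending sector is not something this
post establishes.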
Another invocation of smartctl and sectors pending reallocation went back to
zero. Still, reallocated sector count was yet again at zero.
Meanwhile MZER/WER (attribute 200) had gone from an extremely high value (>
10000) to a low value of 8. I assume the counter has wrapped around.
For a final test the disk is now under
# badblocks -v -w /dev/mfid2
from the e2fsprogs port, with a few hours of the operation remaining. So far,
after roughly 28 hours, both the reallocated sector count and the sectors
pending reallocation count have stayed at zero.
My questions are:
1. Why do sectors pending reallocation disappear without any reallocated
sectors appearing?
2. What was the disk's problem in the first place?
3. Could this be a disk controller problem rather than media problem? How do
I narrow it down to the controller?
4. Supposing the disk is still under warranty, what rationale should I present
to have it replaced? It is likely that a full format and subsequent writes and
reads will fail to demonstrate its problem.
5. Will it be safe to use this disk as it is for actual data?
Here is current smartctl output for the disk:
> smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH UPDATED  RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   198   198   051    Always   202264
>   3 Spin_Up_Time            0x0027   149   148   021    Always   9516
>   4 Start_Stop_Count        0x0032   100   100   000    Always   36
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Always   0
>   7 Seek_Error_Rate         0x002e   200   200   000    Always   0
>   9 Power_On_Hours          0x0032   089   089   000    Always   8045
>  10 Spin_Retry_Count        0x0032   100   253   000    Always   0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Always   0
>  12 Power_Cycle_Count       0x0032   100   100   000    Always   32
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Always   15
> 193 Load_Cycle_Count        0x0032   180   180   000    Always   61451
> 194 Temperature_Celsius     0x0022   117   102   000    Always   35
> 196 Reallocated_Event_Count 0x0032   200   200   000    Always   0
> 197 Current_Pending_Sector  0x0032   200   200   000    Always   0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Offline  0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Always   0
> 200 Multi_Zone_Error_Rate   0x0008   200   001   000    Offline  8
(I removed the 'TYPE' and 'WHEN_FAILED' columns so that the output fits 80
columns. 'TYPE' is the same as 'TYPE' for these attributes on other drives. No
'WHEN_FAILED' row shows a failure status at any time in the life of the
drive.)
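For comparing counters between runs, the raw values can be pulled out of an
attribute table like the one above with a small filter. This is only a sketch.
It assumes the raw value is the last whitespace-separated field of the row,
which holds for the trimmed table here and should hold for plain numeric
attributes such as 5, 197 and 200 in full smartctl -A output. The function
name raw_value is made up for illustration.

```sh
#!/bin/sh
# Sketch only: raw_value is a hypothetical helper, not a smartctl feature.
# Reads an attribute table on stdin and prints the raw value for one ID.
raw_value() {
    # $1 = attribute ID, e.g. 5, 197 or 200
    awk -v id="$1" '
        { sub(/^> */, "") }           # drop a leading "> " quote marker
        $1 == id { print $NF; exit }  # last field assumed to be RAW_VALUE
    '
}

# Usage (device name taken from the post):
#   smartctl -A /dev/pass2 | raw_value 197
```

Saving the values for attributes 5 and 197 before and after each step would
make it easier to see exactly when a pending sector vanished.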
bellstar
response 2 of 2:
Nov 11 19:16 UTC 2011
About the categorization of this into systems or hardware, I'm not sure.
I regularly roll Linux installations as small as 2 MB but those are for
specific purposes. I store... stuff. Poke a guess at it :-) Current water mark
is at roughly 10 TB. There'll be a surge to 15 TB very soon.
These are all SATA drives, except those left from years ago, which are all
smaller than 250 GB in capacity. IDE is obsolete, except for niche uses. The
controller I wonder about is the drive's own controller, not the RAID
controller it is attached to (which is new and okay).
Yes, it has been tested with four different controllers on the far end of the
cable. ICH7 (chipset), ICH10R (chipset), SiI3114 (entry level SATA
controller), and LSI 2108 (enterprise RAID controller). I've tried different
cables, too.