So, here is a story and some questions.

I have a bunch of hard disks ranging from 40 GB to 2 TB. (Smaller capacity ones have either died or been given away long ago.) Their power-on hours range from upwards of 80K down to a few score.

One 2 TB HDD, a WD Caviar Green 2 TB EADS series, started acting up. Writes were as fast as ever. Reads would start out fast at around 80 MB/s but after a few seconds slow down to 1-2 MB/s or even less. It hardly mattered which files were being read. The drive had roughly 60 GB of free space on the single NTFS-formatted partition that spanned it.

For serious work I detached the drive and moved it to my FreeBSD machine. The first course of action, of course, was to do

  # smartctl -A /dev/pass2

The drive is controlled by mfi(4) and is configured as a RAID0 volume. Passthrough is provided by mfip.ko. Results indicated 0 reallocated sectors but 269 sectors pending reallocation and an extremely high MZER/WER.

Despite the recent HDD price spike, and because the data on that disk was valuable to me, I purchased a replacement, stressed it a bit, and ran

  # ddrescue -v /dev/mfid0 /dev/mfid2 log

After roughly 16 hours the process was complete with zero lost bytes. With my data in a safe place I began tinkering with the old drive.

  # mfiutil drive clear /dev/mfid2 start

This supposedly initiated a zeroing of the drive, which I expected would trigger reallocation of the pending sectors. With the clearing finished,

  # smartctl -A /dev/pass2

showed that the sectors pending reallocation were gone but the number of reallocated sectors was still at zero. Suspicious, I tried

  # smartctl -t long /dev/pass2

Some four hours of waiting and the test came back with a failure. Sectors pending reallocation became non-zero again. This time it reported only 1 bad sector.
The LBA of the first bad sector was in the output of

  # smartctl -l selftest /dev/pass2

Following a smartmontools HOWTO I did

  # dd if=/dev/zero of=/dev/mfid2 bs=512 count=8 seek=3702951801

To be sure, I started at (LBA_of_first_error - 4) and wrote 8 sectors' worth of zeros to the disk. (This is an AF disk, naturally, but it emulates 512-byte sectors.) Another invocation of smartctl and sectors pending reallocation went back to zero. Still, the reallocated sector count was yet again at zero. Meanwhile MZER/WER (attribute 200) had gone from an extremely high value (> 10000) to a low value of 8. I assume it has rolled over.

For a final test the disk is now under

  # badblocks -v -w /dev/mfid2

from the e2fsprogs port, with a few hours of the operation remaining. So far, after roughly 28 hours, both the reallocated sector count and the sectors pending reallocation count are fixed at zero.

My questions are:

1. Why are sectors pending reallocation disappearing but no reallocated sectors appear?
2. What was the disk's problem in the first place?
3. Could this be a disk controller problem rather than a media problem? How do I narrow it down to the controller?
4. Supposing the disk is still under warranty, what rationale should I present to have it replaced? It is likely that a full format and subsequent writes and reads will fail to demonstrate its problem.
5. Will it be safe to use this disk as it is for actual data?
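The offset arithmetic behind the dd command above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, the failing LBA of 3702951805 is inferred from seek=3702951801 being (LBA_of_first_error - 4), and the 4096-byte physical sector size is an assumption based on the drive being described as AF. The aligned variant further assumes that logical LBA 0 sits on a physical sector boundary.

```python
LOGICAL_SECTOR = 512    # bytes per emulated sector, as used with bs=512
PHYSICAL_SECTOR = 4096  # assumption: 4 KiB physical sectors (AF drive)


def window_around(lba, before=4, count=8):
    """Window used in the post: start a few sectors before the bad LBA."""
    return lba - before, count


def aligned_window(lba, logical=LOGICAL_SECTOR, physical=PHYSICAL_SECTOR):
    """Window covering exactly the physical sector that holds the bad LBA.

    Assumes logical LBA 0 is aligned to a physical sector boundary.
    """
    per_physical = physical // logical
    start = (lba // per_physical) * per_physical
    return start, per_physical


def dd_command(device, seek, count, bs=LOGICAL_SECTOR):
    """Build the dd command line as a string; nothing is executed."""
    return f"dd if=/dev/zero of={device} bs={bs} count={count} seek={seek}"


first_bad_lba = 3702951805  # inferred: seek 3702951801 + 4

seek, count = window_around(first_bad_lba)
print(dd_command("/dev/mfid2", seek, count))

seek, count = aligned_window(first_bad_lba)
print(dd_command("/dev/mfid2", seek, count))
```

Under those assumptions, the window from the post (sectors 3702951801 through 3702951808) straddles two physical sectors, while the aligned window (3702951800 through 3702951807) rewrites exactly one, so the drive does not have to read back a partially overwritten sector first.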
Here is the current smartctl output for the disk:

> smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH UPDATED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f 198   198   051    Always  202264
>   3 Spin_Up_Time            0x0027 149   148   021    Always  9516
>   4 Start_Stop_Count        0x0032 100   100   000    Always  36
>   5 Reallocated_Sector_Ct   0x0033 200   200   140    Always  0
>   7 Seek_Error_Rate         0x002e 200   200   000    Always  0
>   9 Power_On_Hours          0x0032 089   089   000    Always  8045
>  10 Spin_Retry_Count        0x0032 100   253   000    Always  0
>  11 Calibration_Retry_Count 0x0032 100   253   000    Always  0
>  12 Power_Cycle_Count       0x0032 100   100   000    Always  32
> 192 Power-Off_Retract_Count 0x0032 200   200   000    Always  15
> 193 Load_Cycle_Count        0x0032 180   180   000    Always  61451
> 194 Temperature_Celsius     0x0022 117   102   000    Always  35
> 196 Reallocated_Event_Count 0x0032 200   200   000    Always  0
> 197 Current_Pending_Sector  0x0032 200   200   000    Always  0
> 198 Offline_Uncorrectable   0x0030 200   200   000    Offline 0
> 199 UDMA_CRC_Error_Count    0x0032 200   200   000    Always  0
> 200 Multi_Zone_Error_Rate   0x0008 200   001   000    Offline 8

(I removed the 'TYPE' and 'WHEN_FAILED' columns so that the output fits 80 columns. 'TYPE' is the same as for these attributes on other drives. No 'WHEN_FAILED' row shows a failure status at any time in the life of the drive.)
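To compare these counters from one run to the next without eyeballing the table, the attribute lines can be parsed with a short script. What follows is a minimal sketch, not a smartmontools tool: the names SmartAttr, parse_attributes and nonzero_watched are hypothetical, and the parser assumes the trimmed eight-column layout shown above (with the full smartctl -A output, which keeps 'TYPE' and 'WHEN_FAILED', the raw value sits two columns further right).

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: int


# A few rows copied from the trimmed table in the post.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH UPDATED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Always  0
  9 Power_On_Hours          0x0032 089 089 000 Always  8045
196 Reallocated_Event_Count 0x0032 200 200 000 Always  0
197 Current_Pending_Sector  0x0032 200 200 000 Always  0
198 Offline_Uncorrectable   0x0030 200 200 000 Offline 0
200 Multi_Zone_Error_Rate   0x0008 200 001 000 Offline 8
"""

# Attributes discussed in the post: reallocation, pending, and MZER/WER.
WATCHED = (5, 196, 197, 198, 200)


def parse_attributes(text):
    """Parse trimmed attribute rows into a dict keyed by attribute ID."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not an attribute row.
        if len(parts) < 8 or not parts[0].isdigit():
            continue
        attrs[int(parts[0])] = SmartAttr(
            id=int(parts[0]),
            name=parts[1],
            value=int(parts[3]),
            worst=int(parts[4]),
            thresh=int(parts[5]),
            raw=int(parts[7]),  # assumption: trimmed layout, raw is column 8
        )
    return attrs


def nonzero_watched(attrs, watched=WATCHED):
    """Return IDs of watched attributes whose raw value is above zero."""
    return [i for i in watched if i in attrs and attrs[i].raw > 0]


attrs = parse_attributes(SAMPLE)
for attr_id in nonzero_watched(attrs):
    a = attrs[attr_id]
    print(f"{a.id} {a.name}: raw={a.raw} worst={a.worst} thresh={a.thresh}")
```

Saving the parsed values after each test (clear, long self-test, dd, badblocks) would make it easy to see exactly when the pending count changes while the reallocated count stays put.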
2 responses total.
Doesn't this belong in a hardware conference? I just gave someone a computer with a 6GB drive and two operating systems and still have a 10 and a 13 to put in the next two free computers, and two 20s. What do you store? Do you have two controllers? We have had computers where the primary went bad and the secondary still worked. I have no experience with SATA. Can you test your drives in another computer?
About the categorization of this into systems or hardware, I'm not sure. I regularly roll Linux installations as small as 2 MB, but those are for specific purposes. I store... stuff. Poke a guess at it :-) The current water mark is at roughly 10 TB. There'll be a surge to 15 TB very soon. These are all SATA drives, except those left from years ago, which are all smaller than 250 GB in capacity. IDE is obsolete, except for niche uses. The controller I wonder about is the drive's own controller, not the RAID controller it is attached to (which is new and okay). Yes, it has been tested with four different controllers on the far end of the cable: ICH7 (chipset), ICH10R (chipset), SiI3114 (entry-level SATA controller), and LSI 2108 (enterprise RAID controller). I've tried different cables, too.