

Grex Systems Item 123: A Disk Story, and More
Entered by bellstar on Fri Nov 11 09:22:51 UTC 2011:

So, here is a story and some questions.

I have a bunch of hard disks ranging from 40 GB to 2 TB. (Smaller capacity
ones have either died or been given away long ago.) Their power-on hours range
from upwards of 80K down to a few score. One 2 TB HDD, a WD Caviar Green 2 TB
EADS series, started acting up. Writes were as fast as ever. Reads would start
out fast at around 80 MB/s but then after a few seconds slow down to 1-2 MB/s
or even less. It hardly mattered which files were being read. The drive had
roughly
60 GB of free space on the single NTFS-formatted partition that spanned it.
For serious work I detached the drive and moved it to my FreeBSD machine.

First course of action, of course, was to do 

# smartctl -A /dev/pass2

The drive is controlled by mfi(4) and is configured as a RAID0 volume.
Passthrough provided by mfip.ko. 

Results indicated 0 reallocated sectors but 269 sectors pending reallocation
and an extremely high MZER/WER.

Despite the recent HDD price spike and because the data on that disk was
valuable to me I purchased a replacement, stressed it a bit, and

# ddrescue -v /dev/mfid0 /dev/mfid2 log

After roughly 16 hours the process was complete with zero lost bytes.
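
As a rough sanity check on that figure, the average copy rate can be
estimated from the drive's capacity and the elapsed time. This is only a
sketch: it assumes the full nominal 2 TB (2 x 10^12 bytes) was copied in
exactly 16 hours, and the function name avg_mb_per_s is made up for
illustration.

```shell
#!/bin/sh
# Average throughput in whole decimal MB/s, given bytes copied and
# elapsed hours. Integer arithmetic, so the result is truncated.
avg_mb_per_s() {
    bytes=$1
    hours=$2
    echo $(( bytes / (hours * 3600) / 1000000 ))
}

# Assumed figures: nominal 2 TB and roughly 16 hours, as stated above.
avg_mb_per_s 2000000000000 16
```

Under those assumptions the average comes out at about 34 MB/s, well under
the roughly 80 MB/s seen at the start of a read.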

With my data in a safe place I began tinkering with the old drive.

# mfiutil drive clear /dev/mfid2 start

This supposedly initiated a zeroing of the drive which I expected would
trigger reallocation of pending sectors.

With the clearing finished,

# smartctl -A /dev/pass2

This showed the sectors pending reallocation were gone but the number of
reallocated sectors was still at zero.

Suspicious, I tried 

# smartctl -t long /dev/pass2

Some four hours of waiting and the test came back with a failure. Sectors
pending reallocation became non-zero again. This time it reported only 1 bad
sector. The LBA of the first bad sector was in the output of

# smartctl -l selftest /dev/pass2

Following a smartmontools HOWTO I did

# dd if=/dev/zero of=/dev/mfid2 bs=512 count=8 seek=3702951801

To be sure I started at (LBA_of_first_error - 4) and wrote 8 sectors' worth
of zeros to the disk. (This is an AF disk, naturally, but it emulates 512-byte
sectors.)
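
Since an AF drive has 4096-byte physical sectors behind its 512-byte
emulation, an alternative to the "LBA - 4" window is to round the failing LBA
down to a multiple of 8, so that the write covers exactly one whole physical
sector. A minimal sketch of that arithmetic follows. It assumes the 4K
physical sectors are aligned at LBA 0 (not confirmed for this drive), takes
the failing LBA as the seek value above plus 4, and uses a made-up helper
name, aligned_start. It only prints a dd command; it does not run it.

```shell
#!/bin/sh
# Round a 512-byte LBA down to the start of its 4096-byte physical sector
# (8 logical sectors per physical sector, assumed aligned at LBA 0).
aligned_start() {
    n=$1
    echo $(( n / 8 * 8 ))
}

lba=3702951805                 # seek value used above (3702951801) + 4
start=$(aligned_start "$lba")

# Print, rather than execute, the dd command that would overwrite
# exactly that one physical sector.
echo "dd if=/dev/zero of=/dev/mfid2 bs=512 count=8 seek=$start"
```

For that LBA the sketch prints seek=3702951800. Under the same alignment
assumption, the window written above (3702951801 through 3702951808)
straddles two physical sectors.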

Another invocation of smartctl and sectors pending reallocation went back to
zero. Still, reallocated sector count was yet again at zero.

Meanwhile MZER/WER (attribute 200) had gone from an extremely high value (>
10000) to a low value of 8. I assume it has rolled over.

For a final test the disk is now under

# badblocks -v -w /dev/mfid2

from the e2fsprogs port, with a few hours of the operation remaining. So far,
after roughly 28 hours, both the reallocated sector count and the sectors
pending reallocation count are fixed at zero.

My questions are:

1. Why are sectors pending reallocation disappearing but no reallocated
sectors appear?

2. What was the disk's problem in the first place?

3. Could this be a disk controller problem rather than a media problem? How do
I narrow it down to the controller?

4. Supposing the disk is still under warranty, what rationale should I present
to have it replaced? It is likely that a full format and subsequent writes and
reads will fail to demonstrate its problem.

5. Will it be safe to use this disk as it is for actual data?

Here is current smartctl output for the disk:

> smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH UPDATED  RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   198   198   051    Always   202264
>   3 Spin_Up_Time            0x0027   149   148   021    Always   9516
>   4 Start_Stop_Count        0x0032   100   100   000    Always   36
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Always   0
>   7 Seek_Error_Rate         0x002e   200   200   000    Always   0
>   9 Power_On_Hours          0x0032   089   089   000    Always   8045
>  10 Spin_Retry_Count        0x0032   100   253   000    Always   0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Always   0
>  12 Power_Cycle_Count       0x0032   100   100   000    Always   32
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Always   15
> 193 Load_Cycle_Count        0x0032   180   180   000    Always   61451
> 194 Temperature_Celsius     0x0022   117   102   000    Always   35
> 196 Reallocated_Event_Count 0x0032   200   200   000    Always   0
> 197 Current_Pending_Sector  0x0032   200   200   000    Always   0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Offline  0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Always   0
> 200 Multi_Zone_Error_Rate   0x0008   200   001   000    Offline  8

(I removed the 'TYPE' and 'WHEN_FAILED' columns so that the output fits 80
columns. 'TYPE' is the same as 'TYPE' for these attributes on other drives. No
'WHEN_FAILED' row shows a failure status at any time in the life of the
drive.)
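
To keep an eye on just the reallocation-related attributes (5, 196, 197 and
198) between tests, the table can be filtered down to those rows. The sketch
below is one way to do it, not part of smartmontools: the function name
realloc_attrs is made up, and it assumes the raw value is the last field on
the line and contains no spaces, which holds for the rows shown above. The
sample input is copied from the output above, quote markers included.

```shell
#!/bin/sh
# Print ID, name and raw value for attributes 5, 196, 197 and 198.
# Reads smartctl -A style output on stdin; leading "> " quote markers
# are stripped first so quoted output can be fed in as well.
realloc_attrs() {
    sed 's/^> *//' |
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $1, $2, $NF }'
}

# Sample input: rows copied from the output above.
realloc_attrs <<'EOF'
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Always   0
> 196 Reallocated_Event_Count 0x0032   200   200   000    Always   0
> 197 Current_Pending_Sector  0x0032   200   200   000    Always   0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Offline  0
> 200 Multi_Zone_Error_Rate   0x0008   200   001   000    Offline  8
EOF
```

On a live system the same function could be fed directly, for example with
smartctl -A /dev/pass2 | realloc_attrs.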

2 responses total.



#1 of 2 by keesan on Fri Nov 11 13:27:10 2011:

Doesn't this belong in a hardware conference?  I just gave someone a computer
with a 6GB drive and two operating systems and still have a 10 and a 13 to
put in the next two free computers, and two 20s.  What do you store?

Do you have two controllers?  We have had computers where the primary went
bad and the secondary still worked.  I have no experience with SATA.  Can you
test your drives in another computer?


#2 of 2 by bellstar on Fri Nov 11 19:16:49 2011:

About the categorization of this into systems or hardware, I'm not sure.

I regularly roll Linux installations as small as 2 MB but those are for
specific purposes. I store... stuff. Poke a guess at it :-) Current water mark
is at roughly 10 TB. There'll be a surge to 15 TB very soon.

These are all SATA drives, except for those left from years ago, which are all
smaller than 250 GB in capacity. IDE is obsolete, except for niche uses. The
controller I wonder about is the drive's own controller, not the RAID
controller it is attached to (which is new and okay).

Yes, it has been tested with four different controllers on the far end of the
cable. ICH7 (chipset), ICH10R (chipset), SiI3114 (entry level SATA
controller), and LSI 2108 (enterprise RAID controller). I've tried different
cables, too.

