

Grex Mnet Item 19: M-Net Down Because of Bad Magic Numbers
Entered by dpc on Fri Dec 5 22:04:03 UTC 1997:

M-Net is down because of some problems involving our file systems.
We crashed about 9:00 this morning.  I got to Supreme HQ at 10:30.
The following unusual messages were on the console:

possible probe of account anirudh from host 202.54.10.234
bha0: SCSI bus hung attempting reset

I did control-alt-delete and was told:

rebooting: syncing disks...

Then nothing.  So I pressed the reset button.  A normal reboot began.
Part way through, the System claimed:

sd2:sn0: not ready, cause not reportable.

Then it said that there were unexpected inconsistencies in the following
file systems:

/dev/rsd2a
/dev/rsd2h
/dev/rsd2g

I ran fsck -y on these file systems.  The problems were cured, I rebooted
again, and all was well.

This state of nirvana lasted until about 1:30 this afternoon, when the
System crashed and burned again.  I got to HQ about 4:20.  On the
console were these messages:

bha0: bad mbox status 0x4
panic: ab_prmbox
syncing disks...bha0: bad mbox status 0x4
panic: ab_prmbox
dumping to dev 18,0,1, offset 0
dump 128

at which point the little darling had frozen.  I did control-alt-delete,
and a reboot began.  I got the same messages as before:

sd2:sn0: not ready, cause not reportable
Unexpected inconsistency in /dev/rsd2g, /dev/rsd2h, and /dev/rsd2a.

So I again ran fsck -y, but this time the problems were *not* cured.
Instead I was told:

** /dev/rsd2h 
BAD SUPERBLOCK:MAGIC NUMBER WRONG
** /dev/rsd2a 
BAD SUPERBLOCK:MAGIC NUMBER WRONG
** /dev/rsd2g
The following disk sectors could not be read: 2031648, 2031649, etc.,
through 2031663

at which point the System froze.  So I "halted" it.

Mark Bobak is calling the appropriate Unix gurus.  Any ideas what all
this means, if anything?

4 responses total.



#1 of 4 by tpryan on Fri Dec 5 23:51:21 1997:

        It's dim Jed.


#2 of 4 by kaplan on Sat Dec 6 00:19:51 1997:

I think it means you should have done a backup recently.


#3 of 4 by mdw on Sat Dec 6 00:32:31 1997:

So far as I can tell, the mbox status message refers to the mailbox
status area for an Adaptec 1542 host controller or compatible.  Value 4
should be AHA_MBI_ERROR or MBOX_I_ERROR, and means the ccb completed
with an error.  This should be handled by case logic, so the most likely
possibility is that the mbox status changed between the time the case
statement saw it and the time the printf fetched the value.
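
To make that guess concrete, here is a minimal sketch of the double-read
pattern, simulated in Python rather than taken from any real driver.
FakeMailbox, handle_racy, handle_single_read, and the value 1 for "ok"
are all hypothetical names and values made up for illustration; only the
value 4 and the message text come from the console output.

```python
class FakeMailbox:
    """Hypothetical stand-in for a status byte that hardware may change between reads."""

    def __init__(self, values):
        self._values = list(values)

    def read_status(self):
        # Each read returns the next queued value; the last value sticks.
        if len(self._values) > 1:
            return self._values.pop(0)
        return self._values[0]


MBI_OK = 1      # assumed value for "completed without error"
MBI_ERROR = 4   # "ccb completed with an error", per the discussion above


def handle_racy(mbox):
    """Case logic reads the status once, the error message reads it again."""
    status = mbox.read_status()
    if status == MBI_OK:
        return "ok"
    if status == MBI_ERROR:
        return "ccb error"
    # Second read: may no longer match what the case logic saw.
    return "bad mbox status 0x%x" % mbox.read_status()


def handle_single_read(mbox):
    """Same logic, but the message reports the value the case logic saw."""
    status = mbox.read_status()
    if status == MBI_OK:
        return "ok"
    if status == MBI_ERROR:
        return "ccb error"
    return "bad mbox status 0x%x" % status
```

If the status reads 0 first and 4 a moment later, handle_racy reports
"bad mbox status 0x4" even though the case logic never saw a 4, which
would explain a "handled" value showing up in the unhandled branch.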

It looks to me like you have some sort of scsi problem.  Before you get
too involved with this, it is *very* worthwhile to cycle power on
*EVERYTHING* - both *all* disks, and the computer.  Even though a reset
signal *ought* to suffice to reset the computer, sometimes there can be
logic that gets into strange "can't happen" states and will continue to
misbehave even after a reset.  Modern scsi drives tend to have lots of
on-drive logic to deal with bad block mappings and such - if this
information gets corrupted, random parts of the disk can become
inaccessible even though the drive *appears* to be otherwise operating
perfectly.

The next thing to check is the cabling: is it all in good shape?  Do you
have one and only one terminator on the bus?  Is it at the *end* of the
bus?  The next thing to check after that is the drives.  A failing drive
can cause this kind of problem (hung scsi bus).  Sometimes, drives can
start misbehaving because they are running too hot.  So it's worth
checking their temperature while in operation.  A good rule of thumb is
that no part of the drive should be too hot to touch.  Ideally, all
external surfaces should be at most "barely" warm.

Another thing to check is the power levels.  Voltage levels to the drive
should be +5 and +12, to within 1/10 V, but ideally, they should be
either exact or if not, a hair above.  It's also worth checking the
voltage with an oscilloscope - viewed this way, in normal full-load
operation (i.e., the disk is spinning and being accessed), the voltage
should still be rock solid, with no spikes, ripples, or other noise.  If
there is any noise, find another power supply.
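
As a rough illustration of the tolerance check only, here is a small
sketch that tests two meter readings against the +5 and +12 rails with
the 1/10 V margin mentioned above.  The function names are hypothetical,
and a check like this says nothing about spikes or ripple, which still
need the oscilloscope.

```python
def rail_ok(measured, nominal, tolerance=0.1):
    """True if a measured voltage is within `tolerance` volts of nominal."""
    return abs(measured - nominal) <= tolerance


def check_drive_power(v5, v12):
    """Check readings taken at the drive's power connector (hypothetical helper)."""
    return {"+5": rail_ok(v5, 5.0), "+12": rail_ok(v12, 12.0)}
```

For example, check_drive_power(5.05, 11.7) passes the +5 rail and flags
the +12 rail, since 11.7 V is 0.3 V low.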

Oh yes, another thing that can cause results such as you're experiencing
is bad memory.  Make sure all the memory is properly seated.  You may
want to consider taking out some of the memory, and re-arranging the
rest, to see if it makes a difference.

Another thing to try is to start taking out as many cards and drives as
you can.  If you have spare controller cards and such, you should try
swapping some in if you can't find a minimal system that will boot and
is stable.  Basically, what you want to do is the classical "divide and
conquer" strategy.  Once you find something that is stable, you can
start adding things back in until things start to fail.

A good easy test to "exercise" a disk is this:

 dd if=/dev/rsd0c of=/dev/null bs=16k

This won't do much seeking, but it will do lots of reading, and it will
read the whole disk, reasonably quickly.  If the drive doesn't pass this
test, you should not use it for production.
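
If you want to know *where* a drive fails rather than just whether it
fails, the same front-to-back read can be sketched in Python.  This is
an illustration under assumptions, not a replacement for dd:
sequential_read_test and its max_errors limit are hypothetical, the
16k block size simply mirrors the dd command, and whether reads can
continue past an error depends on the device and driver.

```python
import os


def sequential_read_test(path, block_size=16 * 1024, max_errors=100):
    """Read `path` front to back; return (bytes_read, failed_offsets)."""
    bytes_read = 0
    failed_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while len(failed_offsets) < max_errors:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # Note the unreadable block and skip past it.
                failed_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:
                break  # end of device or file
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, failed_offsets
```

You would point it at the raw device (for example /dev/rsd0c, as root);
an empty failed_offsets list corresponds to dd finishing without errors.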

As for the other messages, they look like normal noise to me.  I don't
know what "possible probe of account anirudh from host 202.54.10.234"
means, but presumably that's something you have programmed into the
system, somehow.  The disk errors you get on power-up sound mostly like
the normal consequences of unstable disk hardware.  The ones about a bad
superblock magic # do sound scary.  I do hope you've been doing regular
disk backups.

Past this, well, it's really a black art, to fix the system.  This is
where it's helpful to have the best people you can find around to help
you get things going again.


#4 of 4 by dpc on Sat Dec 6 02:45:10 1997:

M-Net was brought back up about 8:15 this evening.  Thanx *very* much
for the helpful hints, mdw!  I'm going to post your comments in the
M-Net System Problems item--assuming we're still up, that is.   8-)

