M-Net is down because of some problems involving our file systems. We crashed about 9:00 this morning. I got to Supreme HQ at 10:30. The following unusual messages were on the console:

    possible probe of account anirudh from host 202.54.10.234
    bha0: SCSI bus hung attempting reset

I did control-alt-delete and was told:

    rebooting: syncing disks...

Then nothing. So I pressed the reset button. A normal reboot began. Part way through, the System claimed:

    sd2:sn0: not ready, cause not reportable.

Then it said that there were unexpected inconsistencies in the following file systems:

    /dev/rsd2a
    /dev/rsd2h
    /dev/rsd2g

I ran fsck -y on these file systems. The problems were cured, I rebooted again, and all was well.

This state of nirvana lasted until about 1:30 this afternoon, when the System crashed and burned again. I got to HQ about 4:20. On the console were these messages:

    bha0: bad mbox status 0x4
    panic: ab_prmbox
    syncing disks...bha0: bad mbox status 0x4
    panic: ab_prmbox
    dumping to dev 18,0,1, offset 0
    dump 128

at which point the little darling had frozen. I did control-alt-delete, and a reboot began. I got the same messages as before:

    sd2:sn0: not ready, cause not reportable
    Unexpected inconsistency in /dev/rsd2g, /dev/rsd2h, and /dev/rsd2a.

So I again ran fsck -y, but this time the problems were *not* cured. Instead I was told:

    ** /dev/rsd2h
    BAD SUPERBLOCK:MAGIC NUMBER WRONG
    ** /dev/rsd2a
    BAD SUPERBLOCK:MAGIC NUMBER WRONG
    ** /dev/rsd2g
    The following disk sectors could not be read: 2031648, 2031649, etc., through 2031663

at which point the System froze. So I "halted" it. Mark Bobak is calling the appropriate Unix gurus. Any ideas what all this means, if anything?
4 responses total.
It's dim Jed.
I think it means you should have done a backup recently.
So far as I can tell, the mbox status message refers to the mbox status area for an Adaptec 1542 host controller or compatible. Value 4 should be AHA_MBI_ERROR or MBOX_I_ERROR, and means the ccb completed with an error. This should be handled by case logic, so the most likely possibility is that the mbox status changed between the time the case statement saw it and the time the printf fetched the value.

It looks to me like you have some sort of scsi problem. Before you get too involved with this, it is *very* worthwhile to cycle power on *EVERYTHING* - both *all* disks, and the computer. Even though a reset signal *ought* to suffice to reset the computer, sometimes there can be logic that gets into strange "can't happen" states and will continue to misbehave even after a reset. Modern scsi drives tend to have lots of on-drive logic to deal with bad block mappings and such - if this information gets corrupted, random parts of the disk can become inaccessible even though the drive *appears* to be otherwise operating perfectly.

The next thing to check is the cabling: is it all in good shape? Do you have one and only one terminator on the bus? Is it at the *end* of the bus?

The next thing to check after that is the drives. A failing drive can cause this kind of problem (hung scsi bus). Sometimes, drives can start misbehaving because they are running too hot. So it's worth checking their temperature while in operation. A good rule of thumb is that no part of the drive should be too hot to touch. Ideally, all external surfaces should be at most "barely" warm.

Another thing to check is the power levels. Voltage levels to the drive should be +5 V and +12 V, each to within 1/10 V; ideally they should be exact or, failing that, a hair above. It's also worth checking the voltage with an oscilloscope - viewed this way, in normal full-load operation (ie, the disk is spinning and being accessed), the voltage should still be rock solid, with no spikes, ripples, or other noise. If there is any noise, find another power supply.

Oh yes, another thing that can cause results such as you're experiencing is bad memory. Make sure all the memory is properly seated. You may want to consider taking out some of the memory, and re-arranging the rest, to see if it makes a difference.

Another thing to try is to start taking out as many cards and drives as you can. If you can't find a minimal system that will boot and is stable, and you have spare controller cards and such, you should try swapping some in. Basically, what you want to do is the classical "divide and conquer" strategy. Once you find something that is stable, you can start adding things back in until things start to fail.

A good easy test to "exercise" a disk is this:

    dd if=/dev/rsd0c of=/dev/null bs=16k

This won't do much seeking, but it will do lots of reading, and it will read the whole disk, reasonably quickly. If the drive doesn't pass this test, you should not use it for production.

As for the other messages, they look like normal noise to me. I don't know what "possible probe of account anirudh from host 202.54.10.234" means, but presumably that's something you have programmed into the system, somehow. The disk errors you get on power-up sound mostly like the normal consequences of unstable disk hardware. The ones about a bad superblock magic # do sound scary. I do hope you've been doing regular disk backups.

Past this, well, fixing the system is really a black art. This is where it's helpful to have the best people you can find around to help you get things going again.
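
If you want to run that dd read test over several drives in one go, something like the following might do. It's only a rough sketch: the function names read_test and check_disks are made up for illustration, and the device names in the trailing comment are guesses that you would replace with your own raw whole-disk devices. It only reads, so it should be safe to run, but a drive that hangs the bus may well hang the script too.

```shell
#!/bin/sh
# Sketch of a sequential read test, built around the dd command above.
# read_test and check_disks are hypothetical helper names, not part of
# any standard tool.

# read_test DEVICE
# Read DEVICE from start to end in 16k blocks and throw the data away.
# Returns dd's exit status: zero if every block could be read.
read_test() {
    dd if="$1" of=/dev/null bs=16k 2>/dev/null
}

# check_disks DEVICE...
# Run read_test on each argument and print PASS or FAIL for it.
# Returns non-zero if any device failed.
check_disks() {
    failed=0
    for dev in "$@"; do
        if read_test "$dev"; then
            echo "PASS $dev"
        else
            echo "FAIL $dev"
            failed=1
        fi
    done
    return "$failed"
}

# Example call (device names are assumptions, substitute your own):
#   check_disks /dev/rsd0c /dev/rsd2c
```

A FAIL here means dd stopped on a read error (or couldn't open the device at all), which is the "doesn't pass this test" case: don't trust that drive for production until you know why.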
M-Net was brought back up about 8:15 this evening. Thanx *very* much for the helpful hints, mdw! I'm going to post your comments in the M-Net System Problems item--assuming we're still up, that is. 8-)