

Grex Mnet Item 73: M-Net Downtime Status, August 23, 2002
Entered by tonster on Fri Aug 23 20:53:14 UTC 2002:

Okay, new item.

M-Net crashed sometime around 2pm today.  From what I can tell, we've 
either got a corrupted filesystem, or the vinum filesystem is 
corrupt.  Either way, it's going to take some time to repair.  I'm 
unsure how it crashed.  I think it just rebooted itself, as I was on 
at the time and there were no shutdown messages.

Here is what I saw when I had it booting into a serial console:

fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
IP packet filtering initialized, divert disabled, rule-based forwarding enabled, default to deny, logging disabled
ad0: DMA limited to UDMA33, non-ATA66 cable or device
ad0: 8297MB <Maxtor 90871U2> [16858/16/63] at ata0-master UDMA33
acd0: CDROM <ATAPI CDROM> at ata1-master PIO4
Waiting 15 seconds for SCSI devices to settle
sa0 at ahc0 bus 0 target 6 lun 0
sa0: <ARCHIVE Python 28388-XXX 5.45> Removable Sequential Access SCSI-2 device
sa0: 7.812MB/s transfers (7.812MHz, offset 15)
da0 at ahc0 bus 0 target 0 lun 0
da0: <SEAGATE ST39216W 0010> Fixed Direct Access SCSI-3 device
da0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da0: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
da1 at ahc0 bus 0 target 1 lun 0
da1: <WDIGTL WDE9100 1.50> Fixed Direct Access SCSI-2 device
da1: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
da1: 8683MB (17783204 512 byte sectors: 255H 63S/T 1106C)
vinum: loaded
Mounting root from ufs:/dev/ad0s1a
vinum: reading configuration from /dev/da1s1b
vinum: updating configuration from /dev/da0s1b
vinum: /dev is mounted read-only, not rebuilding /dev/vinum
Warning: defective objects

P var.p1              C State: faulty   Subdisks:     1 Size:        969 MB
P home.p1             C State: faulty   Subdisks:     1 Size:       3767 MB
P usr.p1              C State: faulty   Subdisks:     1 Size:        545 MB
P usrlocal.p1         C State: faulty   Subdisks:     1 Size:        969 MB
P usrbbs.p1           C State: faulty   Subdisks:     1 Size:        827 MB
S var.p1.s0             State: stale    PO:        0  B Size:        969 MB
S home.p1.s0            State: stale    PO:        0  B Size:       3767 MB
S usr.p1.s0             State: stale    PO:        0  B Size:        545 MB
S usrlocal.p1.s0        State: stale    PO:        0  B Size:        969 MB
S usrbbs.p1.s0          State: stale    PO:        0  B Size:        827 MB
swapon: adding /dev/ad0s1b as swap device
Automatic boot in progress...
/dev/ad0s1a: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/ad0s1a: clean, 444301 free (485 frags, 55477 blocks, 0.1% fragmentation)
/dev/vinum/usr: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/vinum/usr: clean, 331415 free (8815 frags, 40325 blocks, 1.6% fragmentation)
/dev/vinum/usrlocal: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/vinum/usrlocal: clean, 134816 free (1040 frags, 16722 blocks, 0.1% fragmentation)
/dev/vinum/binsuid: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/vinum/binsuid: clean, 84764 free (28 frags, 10592 blocks, 0.0% fragmentation)
/dev/vinum/var: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/vinum/var: clean, 532829 free (861 frags, 66496 blocks, 0.1% fragmentation)
/dev/vinum/varmail: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/vinum/varmail: clean, 519218 free (12226 frags, 63374 blocks, 0.8% fragmentation)
/dev/vinum/usrbbs: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/vinum/usrbbs: clean, 394402 free (10770 frags, 47954 blocks, 1.3% fragmentation)
/dev/vinum/roothome: FILESYSTEM CLEAN; SKIPPING CHECKS
/etc/rc.shutdown: /usr/bin/logger: not found
Shutting down daemon processes:.
Saving firewall state tables:.

When it gets to the firewall line above, it just hangs.
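
For anyone reading a listing like the "defective objects" block above, here is a rough sketch of picking out the objects vinum is complaining about. It is only an illustration: the function name `defective_objects` is made up for this example, the sample lines are copied from the boot log, and it assumes that "up" is the healthy state and that each object line carries a `State:` field as shown.

```python
import re

# Matches vinum object lines such as:
#   P var.p1      C State: faulty   Subdisks: 1 Size: 969 MB
#   S var.p1.s0     State: stale    PO: 0  B Size: 969 MB
OBJECT_LINE = re.compile(
    r"^(?P<kind>[DVPS])\s+(?P<name>\S+)\s.*?State:\s+(?P<state>\S+)"
)

KIND_NAMES = {"D": "drive", "V": "volume", "P": "plex", "S": "subdisk"}


def defective_objects(listing, healthy="up"):
    """Return (kind, name, state) for every object not in the healthy state.

    Assumption: "up" is the state a working object reports.
    """
    found = []
    for line in listing.splitlines():
        match = OBJECT_LINE.match(line.strip())
        if match and match.group("state") != healthy:
            kind = KIND_NAMES[match.group("kind")]
            found.append((kind, match.group("name"), match.group("state")))
    return found


# Sample lines taken from the boot log in this item.
SAMPLE = """\
P var.p1              C State: faulty   Subdisks:     1 Size:        969 MB
P home.p1             C State: faulty   Subdisks:     1 Size:       3767 MB
S var.p1.s0             State: stale    PO:        0  B Size:        969 MB
S home.p1.s0            State: stale    PO:        0  B Size:       3767 MB
"""

for kind, name, state in defective_objects(SAMPLE):
    print(f"{kind:8} {name:12} {state}")
```

Every plex in the log is faulty and every subdisk under it is stale, which fits the idea that the volumes themselves need attention before any fsck result can be trusted.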

Needless to say, I'm going to run to WWNet later tonight and retrieve 
the box so I can work on it.

Visit http://down.arbornet.org for more current information.  I'm not 
sure how often I'll remember to actually log on to grex.

108 responses total.



#1 of 108 by cyklone on Fri Aug 23 22:20:25 2002:

I'll expect a full report on my desk tomorrow morning.


#2 of 108 by cyklone on Fri Aug 23 22:21:17 2002:

<Just kidding. Good luck>


#3 of 108 by polytarp on Fri Aug 23 22:29:46 2002:

Yeah, go tonster, &c.


#4 of 108 by jp2 on Fri Aug 23 23:19:23 2002:

This response has been erased.



#5 of 108 by polytarp on Fri Aug 23 23:25:02 2002:

jp2; you owe me 20-USD.


#6 of 108 by jp2 on Fri Aug 23 23:52:20 2002:

This response has been erased.



#7 of 108 by lelande on Sat Aug 24 00:14:30 2002:

wow
computers are COMPLIKATED!


#8 of 108 by ric on Sat Aug 24 00:28:55 2002:

Thanks for the update, Tony!


#9 of 108 by tonster on Sat Aug 24 02:06:27 2002:

We've been offered a hardware RAID controller for free from WWNet.  I 
think we should take it.  I'm going to see what I can do about the 
vinum problem in the meantime.  M-Net is currently in the back of my 
truck.  I'll work on it later tonight.


#10 of 108 by jp2 on Sat Aug 24 02:11:31 2002:

This response has been erased.



#11 of 108 by tonster on Sat Aug 24 04:09:20 2002:

They've got a number of them.  We could likely pick.  The one I know 
they have a lot of would be an AMI MegaRaid.


#12 of 108 by jor on Sat Aug 24 12:14:19 2002:

        To think that the entire known M-Net universe could be
        in the back of someone's truck.



#13 of 108 by iggy on Sat Aug 24 12:18:30 2002:

hey guido! wanna buy a BBS CHEAP?
how about a watch? or a TV?
you like women? i could set you up.


#14 of 108 by cyklone on Sat Aug 24 13:40:25 2002:

"They are very clean"

        - quote from a Tijuana cabbie


#15 of 108 by lelande on Sat Aug 24 17:44:42 2002:

This response has been erased.



#16 of 108 by ric on Sun Aug 25 01:53:49 2002:

"Money for Nothing" by Dire Straits..

I want me....
I want me M dash Net...


#17 of 108 by jor on Sun Aug 25 02:46:47 2002:

        Money for nothin'



#18 of 108 by cyklone on Sun Aug 25 02:49:51 2002:

WRITE THE CHECK!


#19 of 108 by tod on Sun Aug 25 03:20:30 2002:

This response has been erased.



#20 of 108 by lelande on Sun Aug 25 04:23:43 2002:

pussy for free.


#21 of 108 by iggy on Sun Aug 25 04:39:07 2002:

i have 2!


#22 of 108 by tonster on Sun Aug 25 05:49:09 2002:

posted August 25, 2002 01:44
--------------------------------------------------------------------------------
Okay, good news and bad news.
The good news is that I've finally successfully repaired all of m-
net's partitions, including /bin/suid. I finally found out how to make 
it work by reading more closely the man page for vinum, though it's 
not clear exactly why it required what it did. Anyway, all the 
partitions come up cleanly after they're fsck'd.

The problems arise when it attempts to start sendmail. At that point, 
the kernel panics with this:

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x0
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc0196c34
stack pointer = 0x10:0xc8bc8d0c
frame pointer = 0x10:0xc8bc8d14
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 137 (sendmail)
interrupt mask = net 
trap number = 12
panic: page fault

Possibly, sendmail is just corrupt. I'm going to try disabling 
sendmail and starting the box up and see if I can get a login: prompt 
then.
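
As a side note on reading a report like the one above, the sketch below pulls the "name = value" lines into a dictionary and checks the two fields that usually matter first: the faulting address and the current process. This is only an illustration under stated assumptions: `parse_trap` is a hypothetical helper, and the text it parses is the panic message quoted in this response.

```python
TRAP_TEXT = """\
Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x0
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc0196c34
stack pointer = 0x10:0xc8bc8d0c
frame pointer = 0x10:0xc8bc8d14
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 137 (sendmail)
interrupt mask = net
trap number = 12
panic: page fault
"""


def parse_trap(text):
    """Collect the 'name = value' lines of a trap report into a dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        # Skip lines without '=' and continuation lines that start with '='.
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


fields = parse_trap(TRAP_TEXT)
fault_address = int(fields["fault virtual address"], 16)
# The instruction pointer is printed as segment:offset; keep the offset.
instruction_offset = int(fields["instruction pointer"].split(":")[1], 16)
null_read = fault_address == 0 and "read" in fields["fault code"]

print("process:", fields["current process"])
print("fault address:", hex(fault_address))
print("looks like a null pointer read:", null_read)
print("instruction offset:", hex(instruction_offset))
```

A fault address of 0x0 on a supervisor read is consistent with kernel code reading through a null pointer while sendmail happened to be the running process. The instruction offset could in principle be matched against the kernel's symbol table to see which function was executing, though that has not been done here.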

------------------
Tony




#23 of 108 by slynne on Sun Aug 25 13:59:30 2002:

Thanks for the updates :)


#24 of 108 by tod on Sun Aug 25 14:44:45 2002:

This response has been erased.



#25 of 108 by krokus on Sun Aug 25 14:49:40 2002:

I guess this explains why M-net didn't answer when I tried to dial in.


#26 of 108 by tonster on Sun Aug 25 15:17:37 2002:

Yeah, m-net will pretty much ignore you when it's not physically in 
the building. :)


#27 of 108 by tod on Sun Aug 25 15:18:45 2002:

This response has been erased.



#28 of 108 by tonster on Sun Aug 25 15:40:57 2002:

Sun Aug 25 11:39:10 EDT 2002


Welcome to the Once and Future M-Net 
FreeBSD 4.6 (m-net.arbornet.org) (ttyd0)

Enter  newuser  at the login prompt to create a new account 
Enter  upgrade  at the login prompt to find out about increased access

login: 

Welcome to the Once and Future M-Net 
FreeBSD 4.6 (m-net.arbornet.org) (ttyd0)

Enter  newuser  at the login prompt to create a new account 
Enter  upgrade  at the login prompt to find out about increased access

login: 

It's not back at WWNet yet, but I've finally got the box back 
online. :)  Something is broken with sendmail though.


#29 of 108 by jor on Sun Aug 25 15:43:36 2002:

        /does the Slim Pickens ending in Dr. Strangelove

        Yee-haw


#30 of 108 by jp2 on Sun Aug 25 19:01:50 2002:

This response has been erased.



#31 of 108 by polytarp on Sun Aug 25 20:40:19 2002:

jp2; that's silly talk.


#32 of 108 by tod on Sun Aug 25 22:33:25 2002:

This response has been erased.



#33 of 108 by mdw on Mon Aug 26 01:51:58 2002:

Actually, depending on the vintage, configuration, and just how broken
sendmail is, it certainly could cause a kernel panic.  Older vintages
ran as root always, so certainly had the right to open /dev/kmem, poke
around, and generally wreak havoc.  Granted, this wasn't likely.  I think
newer versions try to run as somebody else, but there still has to be a
piece that as root binds to port 25, and there also has to be a
mechanism to write to people's mailboxes and run user specified programs
from .forward as that user.

Even so, a kernel mode page fault is a rather unlikely failure mode, at
least not without deliberate and strange corruption of the sendmail
binary.  More likely possibilities include a kernel bug plus a possibly
corrupted sendmail binary image, or a kernel bug triggered by an odd
combination of events in sendmail.  This condition might repeat as long
as there's a certain mail item in the queue, if that mail item triggers
the odd combination of events (and there may not be anything "wrong"
about the actual mail).  Unless the system was recently upgraded to a new
kernel, I'd discount the possibility of a bug, in favour of some sort of
hardware failure.  The most likely failure is probably a memory problem.
Sendmail may merely be the guy most likely to try to use the bad memory
first.  Bad memory ought to generate a parity fault, but that will only
happen if you have real parity or ECC memory installed and working (not
"virtual" parity memory).  A bad motherboard could cause this fault.  A
bad CPU could cause this fault.

Things to try (if you have the resources):
        Check for any loose cables or chips.
        Check fans, cooling, & temperature inside case.
        Check power supply -- right voltage?  no ripple?
        Try swapping with another known good motherboard or CPU.
        Try swapping memory chips.
        Take out any "extra" peripheral cards not needed,
                and see if the problem goes away.
        Memory diagnostics.
        Any CPU or other diagnostics you have.
        If you haven't got any, try deliberately setting a fork bomb loose as
                root, and see if it crashes or thrashes.
        Check CPU clock, voltage, and bus speed jumpers.
        Check CPU cooling - fan up to speed?  Any hardware logic
                to monitor CPU temperature or fan speed?
For the software,
        Is this the latest kernel?  Are there older "stable" kernels?
                Has anyone else reported this problem?
                What's in CHANGELOG as the latest changes?
        Try another kernel.
        See if the sendmail binary "cmp"s from wherever it was built
                or installed.
        See if libc.so or ld.so or anything else changed - use "cmp" not sum.
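
The "use cmp not sum" advice above can be sketched as a byte-for-byte comparison. This is a minimal illustration, not a description of what was actually run on M-Net: `same_bytes` is a made-up helper, and the two temporary files merely stand in for an installed binary and a known good copy.

```python
import filecmp
import os
import tempfile


def same_bytes(path_a, path_b):
    """Return True only if both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting
    # matching size and modification time from os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demonstration with two throwaway files standing in for the binaries.
with tempfile.TemporaryDirectory() as workdir:
    installed = os.path.join(workdir, "sendmail.installed")
    reference = os.path.join(workdir, "sendmail.reference")
    with open(installed, "wb") as handle:
        handle.write(b"\x7fELF" + b"\x00" * 64)
    with open(reference, "wb") as handle:
        # Same length, but the last byte differs.
        handle.write(b"\x7fELF" + b"\x00" * 63 + b"\x01")
    print("identical:", same_bytes(installed, reference))
```

A full comparison catches any single changed byte, whereas a short checksum can in principle give the same value for two different files, which is presumably why cmp is preferred here.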


#34 of 108 by jp2 on Mon Aug 26 02:19:37 2002:

This response has been erased.



#35 of 108 by tonster on Mon Aug 26 04:32:10 2002:

M-Net is back home at WWNet now.  I'm recompiling the kernel right 
now.  Maybe something was corrupted there.  Once it has recompiled, 
I'll reboot and see where we stand.


#36 of 108 by tonster on Mon Aug 26 05:12:42 2002:

I found libiconv to be in a failed state (the library was there and 
everything, but either not registered properly or corrupt so nothing 
could use it).  I've reinstalled the library and am proceeding with 
recompiling the kernel.


#37 of 108 by tonster on Mon Aug 26 06:10:38 2002:

Okay, that seems to have fixed the problem compiling and booting to 
the kernel.  I'm going to re-enable sendmail and see if it fixed that 
problem as well.


#38 of 108 by tonster on Mon Aug 26 06:11:01 2002:

Everything appears to be working now.  Logoff grex and return to m-
net!  Do it now!


#39 of 108 by twinkie on Mon Aug 26 06:55:46 2002:

I did. It was up, and then it was down. Just under an hour of total uptime.



