Okay, new item. M-Net crashed sometime around 2pm today. From what I can tell, we've either got a corrupted filesystem, or the vinum volumes are corrupt. Either way, it's going to take some time to repair. I'm unsure how it crashed. I think it just rebooted itself, as I was on at the time and there were no shutdown messages. Here is what I saw when I had it booting into a serial console:

    fdc0: FIFO enabled, 8 bytes threshold
    fd0: <1440-KB 3.5" drive> on fdc0 drive 0
    atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
    atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
    kbd0 at atkbd0
    psm0: <PS/2 Mouse> irq 12 on atkbdc0
    psm0: model Generic PS/2 mouse, device ID 0
    vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
    sc0: <System console> at flags 0x100 on isa0
    sc0: VGA <16 virtual consoles, flags=0x100>
    sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
    sio0: type 16550A, console
    sio1 at port 0x2f8-0x2ff irq 3 on isa0
    sio1: type 16550A
    IP packet filtering initialized, divert disabled, rule-based forwarding enabled, default to deny, logging disabled
    ad0: DMA limited to UDMA33, non-ATA66 cable or device
    ad0: 8297MB <Maxtor 90871U2> [16858/16/63] at ata0-master UDMA33
    acd0: CDROM <ATAPI CDROM> at ata1-master PIO4
    Waiting 15 seconds for SCSI devices to settle
    sa0 at ahc0 bus 0 target 6 lun 0
    sa0: <ARCHIVE Python 28388-XXX 5.45> Removable Sequential Access SCSI-2 device
    sa0: 7.812MB/s transfers (7.812MHz, offset 15)
    da0 at ahc0 bus 0 target 0 lun 0
    da0: <SEAGATE ST39216W 0010> Fixed Direct Access SCSI-3 device
    da0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
    da0: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
    da1 at ahc0 bus 0 target 1 lun 0
    da1: <WDIGTL WDE9100 1.50> Fixed Direct Access SCSI-2 device
    da1: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
    da1: 8683MB (17783204 512 byte sectors: 255H 63S/T 1106C)
    vinum: loaded
    Mounting root from ufs:/dev/ad0s1a
    vinum: reading configuration from /dev/da1s1b
    vinum: updating configuration from /dev/da0s1b
    vinum: /dev is mounted read-only, not rebuilding /dev/vinum
    Warning: defective objects

    P var.p1 C State: faulty Subdisks: 1 Size: 969 MB
    P home.p1 C State: faulty Subdisks: 1 Size: 3767 MB
    P usr.p1 C State: faulty Subdisks: 1 Size: 545 MB
    P usrlocal.p1 C State: faulty Subdisks: 1 Size: 969 MB
    P usrbbs.p1 C State: faulty Subdisks: 1 Size: 827 MB
    S var.p1.s0 State: stale PO: 0 B Size: 969 MB
    S home.p1.s0 State: stale PO: 0 B Size: 3767 MB
    S usr.p1.s0 State: stale PO: 0 B Size: 545 MB
    S usrlocal.p1.s0 State: stale PO: 0 B Size: 969 MB
    S usrbbs.p1.s0 State: stale PO: 0 B Size: 827 MB

    swapon: adding /dev/ad0s1b as swap device
    Automatic boot in progress...
    /dev/ad0s1a: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/ad0s1a: clean, 444301 free (485 frags, 55477 blocks, 0.1% fragmentation)
    /dev/vinum/usr: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/vinum/usr: clean, 331415 free (8815 frags, 40325 blocks, 1.6% fragmentation)
    /dev/vinum/usrlocal: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/vinum/usrlocal: clean, 134816 free (1040 frags, 16722 blocks, 0.1% fragmentation)
    /dev/vinum/binsuid: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/vinum/binsuid: clean, 84764 free (28 frags, 10592 blocks, 0.0% fragmentation)
    /dev/vinum/var: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/vinum/var: clean, 532829 free (861 frags, 66496 blocks, 0.1% fragmentation)
    /dev/vinum/varmail: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/vinum/varmail: clean, 519218 free (12226 frags, 63374 blocks, 0.8% fragmentation)
    /dev/vinum/usrbbs: FILESYSTEM CLEAN; SKIPPING CHECKS
    /dev/vinum/usrbbs: clean, 394402 free (10770 frags, 47954 blocks, 1.3% fragmentation)
    /dev/vinum/roothome: FILESYSTEM CLEAN; SKIPPING CHECKS
    /etc/rc.shutdown: /usr/bin/logger: not found
    Shutting down daemon processes:.
    Saving firewall state tables:.

When it gets to the firewall line above, it just hangs. Needless to say, I'm going to run to WWNet later tonight and retrieve the box so I can work on it.
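For what it's worth, the fsck summary lines in that log report free space in fragments. Assuming 8 fragments per block (a common UFS setting and an assumption here, though every line in the log fits it), the "free" figure should equal the free frags plus 8 times the free blocks; for /dev/ad0s1a that's 485 + 8 * 55477 = 444301. A minimal Python sketch that parses one of those lines and checks the identity (the function name check_free_space is hypothetical):

```python
import re

# Matches fsck summary lines such as:
#   /dev/ad0s1a: clean, 444301 free (485 frags, 55477 blocks, 0.1% fragmentation)
FSCK_LINE = re.compile(
    r"^(?P<dev>\S+): clean, (?P<free>\d+) free "
    r"\((?P<frags>\d+) frags, (?P<blocks>\d+) blocks, [\d.]+% fragmentation\)$"
)

# Assumption: 8 fragments per UFS block (consistent with every line in the log).
FRAGS_PER_BLOCK = 8


def check_free_space(line, frags_per_block=FRAGS_PER_BLOCK):
    """Return (device, reported_free, computed_free) for one fsck summary line."""
    m = FSCK_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not an fsck summary line: {line!r}")
    free = int(m["free"])
    # Free space in fragments = loose free fragments + fragments in free whole blocks.
    computed = int(m["frags"]) + frags_per_block * int(m["blocks"])
    return m["dev"], free, computed


if __name__ == "__main__":
    lines = [
        "/dev/ad0s1a: clean, 444301 free (485 frags, 55477 blocks, 0.1% fragmentation)",
        "/dev/vinum/var: clean, 532829 free (861 frags, 66496 blocks, 0.1% fragmentation)",
    ]
    for line in lines:
        dev, free, computed = check_free_space(line)
        print(f"{dev}: reported {free}, computed {computed}, match={free == computed}")
```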
Visit http://down.arbornet.org for more current information. I'm not sure how often I'll remember to actually log on to Grex.
108 responses total.
I'll expect a full report on my desk tomorrow morning.
<Just kidding. Good luck>
Yeah, go tonster, &c.
This response has been erased.
jp2; you owe me 20-USD.
This response has been erased.
wow computers are COMPLIKATED!
Thanks for the update, Tony!
We've been offered a hardware RAID controller for free from WWNet. I think we should take it. I'm going to see what I can do about the vinum problem in the meantime. M-Net is currently in the back of my truck. I'll work on it later tonight.
This response has been erased.
They've got a number of them. We could likely pick. The one I know they have a lot of would be an AMI MegaRAID.
To think that the entire known M-Net universe could be
in the back of someone's truck.
hey guido! wanna buy a BBS CHEAP? how about a watch? or a TV? you like women? i could set you up.
"They are very clean"
- quote from a Tijuana cabbie
This response has been erased.
"Money for Nothing" by Dire Straits.. I want me.... I want me M dash Net...
Money for nothin'
WRITE THE CHECK!
This response has been erased.
pussy for free.
i have 2!
posted August 25, 2002 01:44
--------------------------------------------------------------------------------
Okay, good news and bad news. The good news is that I've finally repaired all of M-Net's partitions, including /bin/suid. I found out how to make it work by reading the vinum man page more closely, though it's not clear exactly why it required what it did. Anyway, all the partitions come up cleanly after they're fsck'd. The bad news is that the problems start when it attempts to start sendmail. At that point, it kernel panics with this:

    Fatal trap 12: page fault while in kernel mode
    fault virtual address = 0x0
    fault code = supervisor read, page not present
    instruction pointer = 0x8:0xc0196c34
    stack pointer = 0x10:0xc8bc8d0c
    frame pointer = 0x10:0xc8bc8d14
    code segment = base 0x0, limit 0xfffff, type 0x1b
                 = DPL 0, pres 1, def32 1, gran 1
    processor eflags = interrupt enabled, resume, IOPL = 0
    current process = 137 (sendmail)
    interrupt mask = net
    trap number = 12
    panic: page fault

Possibly sendmail is just corrupt. I'm going to try disabling sendmail, starting the box up, and seeing whether I can get a login: prompt then.
------------------
Tony
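The "fault code" line in that panic appears to be FreeBSD's wording for the x86 page-fault error code, whose low three bits record whether the page was present, whether the access was a write, and whether the CPU was in user mode. A small Python sketch (constant names chosen for illustration, and only those three bits modelled) maps a code to the same wording; "supervisor read, page not present" comes out as error code 0, which together with a fault virtual address of 0x0 is the usual signature of kernel code reading through a NULL pointer.

```python
# Low bits of the x86 page-fault error code (names chosen for illustration).
PF_PROTECTION = 0x1  # set: protection violation; clear: page not present
PF_WRITE = 0x2       # set: write access; clear: read access
PF_USER = 0x4        # set: fault in user mode; clear: supervisor (kernel) mode


def describe_fault_code(code):
    """Render an error code in the same words as the panic message above."""
    mode = "user" if code & PF_USER else "supervisor"
    access = "write" if code & PF_WRITE else "read"
    cause = "protection violation" if code & PF_PROTECTION else "page not present"
    return f"{mode} {access}, {cause}"


if __name__ == "__main__":
    seen = "supervisor read, page not present"
    # Try all eight combinations of the three low bits and keep the match.
    print([c for c in range(8) if describe_fault_code(c) == seen])  # [0]
```

It doesn't say *which* kernel code did the read, though; that's what the instruction pointer (0xc0196c34) and a kernel symbol table would be for.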
Thanks for the updates :)
This response has been erased.
I guess this explains why M-Net didn't answer when I tried to dial in.
Yeah, m-net will pretty much ignore you when it's not physically in the building. :)
This response has been erased.
    Sun Aug 25 11:39:10 EDT 2002

    Welcome to the Once and Future M-Net
    FreeBSD 4.6 (m-net.arbornet.org) (ttyd0)
    Enter newuser at the login prompt to create a new account
    Enter upgrade at the login prompt to find out about increased access

    login:

It's not back at WWNet yet, but I've finally got the box back online. :) Something is broken with sendmail, though.
/does the Slim Pickens ending in Dr. Strangelove
Yee-haw
This response has been erased.
jp2; that's silly talk.
This response has been erased.
Actually, depending on the vintage, configuration, and just how broken
sendmail is, it certainly could cause a kernel panic. Older vintages
always ran as root, so they certainly had the right to open /dev/kmem,
poke around, and generally wreak havoc. Granted, this wasn't likely. I
think newer versions try to run as somebody else, but there still has
to be a piece that binds to port 25 as root, and there also has to be a
mechanism to write to people's mailboxes and to run user-specified
programs from .forward as that user.
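The "bind to port 25 as root, then run as somebody else" pattern looks roughly like this. It's a minimal Python sketch, not sendmail's actual code (sendmail is written in C and also handles mailbox delivery and .forward programs, which this ignores); the function name bind_then_drop and the default "nobody" account are assumptions. Passing user=None skips the privilege drop, which makes it safe to try as an ordinary user on an unprivileged port.

```python
import os
import pwd
import socket


def bind_then_drop(port, user="nobody", host="127.0.0.1"):
    """Bind a listening TCP socket first, then give up root if we have it.

    Ports below 1024 (such as SMTP's 25) traditionally need root to bind,
    so the daemon binds while privileged and switches users afterwards.
    Pass user=None to skip the drop (e.g. when testing unprivileged).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    if user is not None and os.geteuid() == 0:
        pw = pwd.getpwnam(user)
        os.setgroups([])       # drop supplementary groups while still root
        os.setgid(pw.pw_gid)   # change group first: a non-root uid can't do it later
        os.setuid(pw.pw_uid)   # as root this sets real/effective/saved uid: no way back
    return sock


if __name__ == "__main__":
    # Unprivileged demo: port 0 lets the OS pick a free port.
    s = bind_then_drop(0, user=None)
    print("listening on", s.getsockname())
    s.close()
```

A real daemon would call something like bind_then_drop(25) while running as root. The ordering matters: groups and gid are changed before the uid, because once the uid is no longer root the process can't change them.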
Even so, a kernel-mode page fault is a rather unlikely failure mode, at
least not without deliberate and strange corruption of the sendmail
binary. More likely possibilities include a kernel bug plus a possibly
corrupted sendmail binary image, or a kernel bug triggered by an odd
combination of events in sendmail. The latter might repeat as long as a
certain mail item is in the queue, if that item triggers the odd
combination of events (and there may not be anything "wrong" with the
mail itself). Unless the system was recently upgraded to a new kernel,
I'd discount the possibility of a bug in favour of some sort of hardware
failure. The most likely failure is probably a memory problem; sendmail
may merely be the first program to try to use the bad memory. Bad memory
ought to generate a parity fault, but only if real parity or ECC memory
is installed and working; "virtual" (fake) parity memory won't report
it. A bad motherboard could cause this fault, and so could a bad CPU.
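To make the parity point concrete: a parity bit records whether a byte has an odd or even number of 1 bits, so a single flipped bit is caught when the byte is read back, while two flips in the same byte cancel out. The Python sketch below uses even parity and models "virtual" parity as a bit regenerated from the data on every read (an assumption about roughly how such modules behaved), which is why they can never report an error.

```python
def even_parity_bit(byte):
    """Parity bit that makes the total count of 1 bits (data + parity) even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, stored_parity):
    """True if the stored parity bit still matches the data byte."""
    return even_parity_bit(byte) == stored_parity


def fake_parity_read(byte):
    """Model of 'virtual' parity: the bit is computed on read, never stored."""
    return even_parity_bit(byte)


if __name__ == "__main__":
    data = 0b10110010              # four 1 bits, so the parity bit is 0
    p = even_parity_bit(data)      # value stored alongside the byte
    one_flip = data ^ 0b00001000   # simulate a single bad memory cell
    two_flips = data ^ 0b00001100  # two bad bits in the same byte
    print(p, parity_ok(data, p), parity_ok(one_flip, p))  # 0 True False
    print(parity_ok(two_flips, p))                        # True: missed
    print(parity_ok(one_flip, fake_parity_read(one_flip)))  # True: fake parity never complains
```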
Things to try (if you have the resources):
Check for any loose cables or chips.
Check fans, cooling, & temperature inside case.
Check power supply -- right voltage? no ripple?
Try swapping with another known good motherboard or CPU.
Try swapping memory chips.
Take out any "extra" peripheral cards not needed,
and see if the problem goes away.
Memory diagnostics.
Any CPU or other diagnostics you have.
If you haven't got any, try deliberately setting a fork bomb loose as
root, and see if it crashes or thrashes.
Check CPU clock, voltage, and bus speed jumpers.
Check CPU cooling - fan up to speed? Any hardware logic
to monitor CPU temperature or fan speed?
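As a rough idea of what the "memory diagnostics" item involves, here is a toy Python pattern test: write a known pattern into a buffer, read it back, and report mismatches. The pattern set (all zeros, all ones, alternating bits, then "walking ones") is a common choice but an assumption here, and the function name pattern_test is hypothetical. It only exercises whatever memory the OS happens to give this process, mostly through the CPU cache, so it can easily miss a bad chip; standalone testers that boot outside the operating system are the real tool.

```python
# Fixed patterns (all zeros, all ones, alternating bits) plus "walking ones",
# where a single 1 bit moves through each position of the byte.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA) + tuple(1 << k for k in range(8))


def pattern_test(size_bytes, patterns=PATTERNS):
    """Write each pattern to a buffer, read it back, and list any mismatches.

    Returns a list of (offset, expected, actual) tuples; empty means no errors seen.
    """
    buf = bytearray(size_bytes)
    mismatches = []
    for pat in patterns:
        buf[:] = bytes([pat]) * size_bytes     # write phase
        if buf.count(pat) != size_bytes:       # quick read-back check
            mismatches.extend(
                (i, pat, b) for i, b in enumerate(buf) if b != pat
            )
    return mismatches


if __name__ == "__main__":
    size = 1024 * 1024  # 1 MB: an arbitrary illustrative size
    errors = pattern_test(size)
    print(f"{len(errors)} mismatches in {size} bytes")
```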
For the software,
Is this the latest kernel? Are there older "stable" kernels?
Has anyone else reported this problem?
What are the latest changes listed in CHANGELOG?
Try another kernel.
Use "cmp" to check the sendmail binary against the copy it was built
or installed from.
See if libc.so or ld.so or anything else changed - use "cmp", not "sum".
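The reason to prefer "cmp" over "sum" is that sum only produces a short checksum, so two different files can produce the same value, while a byte-for-byte comparison can't be fooled that way. A minimal Python sketch of the cmp-style check (the function name first_difference is hypothetical; unlike cmp, it reports a 0-based byte offset):

```python
import os
import sys
import tempfile


def first_difference(path_a, path_b, chunk_size=65536):
    """Compare two files byte for byte, like cmp(1).

    Returns None if identical, else (offset, kind): kind is "differ" for a
    content mismatch, or "eof" when one file is a prefix of the other.
    Offsets are 0-based (cmp itself counts bytes from 1).
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a == b:
                if not a:
                    return None  # both files ended together: identical
                offset += len(a)
                continue
            n = min(len(a), len(b))
            for i in range(n):
                if a[i] != b[i]:
                    return offset + i, "differ"
            return offset + n, "eof"  # shorter file ran out first


def _demo():
    """Compare two small temporary files that differ in one byte."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.bin")
        b = os.path.join(d, "b.bin")
        with open(a, "wb") as f:
            f.write(b"sendmail binary")
        with open(b, "wb") as f:
            f.write(b"sendmail b1nary")
        print(first_difference(a, a))  # None
        print(first_difference(a, b))  # (10, 'differ')


if __name__ == "__main__":
    if len(sys.argv) == 3:
        result = first_difference(sys.argv[1], sys.argv[2])
        print("identical" if result is None else f"{result[1]} at byte offset {result[0]}")
    else:
        _demo()
```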
This response has been erased.
M-Net is back home at WWNet now. I'm recompiling the kernel right now. Maybe something was corrupted there. Once it has recompiled, I'll reboot and see where we stand.
I found libiconv to be in a failed state (the library was there and everything, but either not registered properly or corrupt so nothing could use it). I've reinstalled the library and am proceeding with recompiling the kernel.
Okay, that seems to have fixed the problems with compiling and booting the kernel. I'm going to re-enable sendmail and see if it fixed that problem as well.
Everything appears to be working now. Log off Grex and return to M-Net! Do it now!
I did. It was up, and then it was down. Just under an hour of total uptime.
- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss