Author Message
19 new of 43 responses total.
cross
response 25 of 43: Sep 29 18:05 UTC 2005

Both.
tod
response 26 of 43: Sep 29 18:26 UTC 2005

re #21
Cold solder joints are a strong possibility.  A rack mount server might be
a logical step just for posterity.
cross
response 27 of 43: Sep 29 18:29 UTC 2005

Nah, it'll never happen.  There was strenuous argument about it the first
time I suggested it, and of course, grex went the wrong direction and got
a tower case.  Oh well.

Buying a rackmount Dell server would be a good idea.  Or an IBM server.
In either case, something that could run FreeBSD.  Hey!  Those things
even come with RAID controllers and hot-swappable disks!
nharmon
response 28 of 43: Sep 29 18:33 UTC 2005

Well, if it came down to only changing one of the two (hardware, or OS),
I would argue for changing the hardware. Nextgrex has only been in use
for 10 months? Does Grex depreciate equipment?
tod
response 29 of 43: Sep 29 20:18 UTC 2005

Back in the day, we used to depreciate M-Net h/w on a 3 year cycle.
Grex could certainly benefit by getting a rack server with a warranty on it
and possibly paths for h/w upgrades as technology/price permits.
cross
response 30 of 43: Sep 29 21:00 UTC 2005

Upgrade?  Whatever would they do that for?  Grex likes living in the
basement.
jp2
response 31 of 43: Sep 30 15:10 UTC 2005

This response has been erased.

albaugh
response 32 of 43: Sep 30 17:45 UTC 2005

Hardware costs moola, and I don't think you're going to inspire the dwindling
grex membership to donate more $ for more, possibly unneeded, hardware.
Software requires people's time to install and configure.  That is also in
dwindling supply.
mcnally
response 33 of 43: Sep 30 18:37 UTC 2005

This response has been erased.

mcnally
response 34 of 43: Sep 30 18:38 UTC 2005

 re #32:  So what do we do then?  Roll over and die?

 Ignoring the problem out of existence isn't working.
remmers
response 35 of 43: Sep 30 19:15 UTC 2005

I did some looking for patterns in the "who was logged in when
grex crashed" department but didn't spot anything obvious.  Not
all exploits in the universe depend on being logged in, of course.

We're running an old version of OpenBSD (3.5), so it's certainly due for
upgrade or replacement.  I'm personally neutral on the issue of
whether we stick with OpenBSD or switch to something else like
FreeBSD, but since the procedure for installing Grex under
OpenBSD has been *extremely* well documented, mostly by Jan
Wolter (see /grex/grexdoc), the *FASTEST* thing to do at this point
would be to upgrade to the latest OpenBSD.  

If upgrading fixes our instability problems (and I've heard more than
one opinion to the effect that they stem from problems in 3.5 that have
been fixed in subsequent versions), great.  If not, we have all the more
reason to switch to a different OS and/or investigate hardware issues
more thoroughly.

Since I seem to be the staff member with the most free time at the
moment, I've been working on upgrading to the latest OpenBSD, using a
spare x86 machine in my house to install a mini-grex and editing the
installation procedures to reflect changes between OpenBSD 3.5 and
3.8.  When I'm done, upgrading the "real" Grex should go pretty fast,
hopefully with not much more than a day of downtime.

This effort is pretty well along.  On my mini-grex I've installed
OpenBSD 3.8, a whole slew of stuff from the ports tree, the exim mail
agent, and most of Grex's home-grown software items (e.g. backtalk,
fronttalk, party) and customizations (e.g. packet filtering and disk
quotas).

There's a handful of unresolved issues that I'm hoping other staffers
can advise me on.  Once that's done, we should be ready to upgrade. 
Things are pretty close, so I'm hoping that this can happen in a week or so.
tod
response 36 of 43: Sep 30 22:29 UTC 2005

Blame the liberal media
dpc
response 37 of 43: Oct 12 19:06 UTC 2005

It's great that you're working on a software upgrade, John!
tod
response 38 of 43: Oct 13 15:57 UTC 2005

Patch management is a great first step to finding a resolution.
Thanks R E M M E R (S)
janc
response 39 of 43: Nov 20 04:37 UTC 2005

I think we are all convinced now that Grex is suffering from a hardware
problem.

(1) It crashed twice during the OpenBSD 3.8 install.  Both crashes came
at the rare times during the upgrade when Grex was heavily loaded -
during massive compiles.  The fact that the same crashes continue to
happen on new software tends to favor a hardware explanation.

(2) I dug through the kernel a bit to diagnose one of those crashes. 
The problem occurred in the virtual memory management code, during a
malloc(). An unused block of memory had been pulled off the free memory
list.  All unused memory blocks contain a header containing information
used by the memory management code.  This header also includes a "magic
number" - just a special number stored there to mark this as a free
memory block.  The value used in OpenBSD is the hexadecimal value
"deafbeef".  When taking memory off the free list, the kernel checks
that the magic number has not been meddled with.  A changed value would
be a sign that something is overwriting supposedly unused memory, and a
good cause for panic.  In checking the memory value, it found "deabbeef"
instead of "deafbeef".  It panicked, printing an error message and
halting the system.

"deabbeef" instead of "deafbeef" is a single bit 1->0 error.  I found
record of an identical crash on the previous OS with the same bit going
wrong in a different memory location.  This looks very much like a bad
memory chip on one of Grex's DIMMs.  (Note that each bit of a value
stored in memory is typically stored in a different chip, not altogether
on the same chip like you'd expect.)  Except of course that the memory
test failed to find a problem.

So we have two competing theories:

  - There is a bad memory chip, but memtest86+ failed to find it.

  - The memory is fine, but there is some defect in the motherboard
    that is affecting the particular circuit that brings bit 18 from
    the memory to the CPU.  This only comes into play when the system
    is fairly busy, doing I/O as well as computation.

I consider both of these theories entirely plausible.  Not that my
opinion on these things is particularly expert.  We'll probably investigate
the memory theory first, because it is easier to test.  Grex currently
has three 512 megabyte DIMMs.  We could pull out the ones from slots 2
and 3, leaving behind just the slot 1 one (closest to the CPU).  If the
crashes stop, then probably one of the pulled DIMMs was bad.  If not, we
try swapping the slot 1 DIMM with one of the others.

If the bad DIMM theory doesn't pan out, then we'll want to try a
motherboard swap.  Grex purchased a spare motherboard, which is,
unfortunately, currently misplaced.  Need to look harder.
naftee
response 40 of 43: Nov 20 05:12 UTC 2005

i think it just crashed.

yep.
12:12AM  up 11 mins, 3 users, load averages: 1.53, 1.15, 0.70
root
response 41 of 43: Nov 20 13:13 UTC 2005

I suggest rerunning "make world" after the memory swap.  It seems to 
generate the right load to trigger the problem quickly.
steve
response 42 of 43: Nov 20 19:13 UTC 2005

   Rebuilding the world is a great way to test things, as is compiling
large packages.  I'm heading to Provide now to try some stuff.
jesuit
response 43 of 43: May 17 02:15 UTC 2006

TROGG IS DAVID BLAINE

- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss