Grex Oldcoop Conference

Item 285: hardware testing ?

Entered by khamsun on Sun Sep 25 08:37:41 2005:

On the Grex status page of HVCN, the message was something like:
"Grex will be down until someday for hardware testing"

hardware testing?

hammer? chainsaw? dropping the box from a 20-storey building? immersion
in the deeps of lake Michigan? nuclear radiation?
or the power cord was just loose?

:-)
43 responses total.

#1 of 43 by mary on Sun Sep 25 11:04:20 2005:

Yes.


#2 of 43 by mcnally on Sun Sep 25 19:46:51 2005:

  We had hoped that by bombarding it with gamma rays that we could
  turn it into a Super Grex which would simply grow huge, split out
  of its enclosure, and go on a rampage when stressed instead of
  merely crashing.  Unfortunately it didn't work.  In retrospect
  we should probably have listened to Rane's opinions on the matter
  but we thought Dr. Banner knew what he was talking about.

  Actually it was down for extended memory testing to see whether a
  bad DIMM could have been responsible for the recent crashes.


#3 of 43 by naftee on Sun Sep 25 21:36:23 2005:

i think rane would enjoy playing with gamma rays


#4 of 43 by mcnally on Sun Sep 25 22:51:53 2005:

 Gamma radiation is not a toy!
 <stern look>


#5 of 43 by khamsun on Sun Sep 25 23:43:14 2005:

gamma rays to try to make a recent peecee box as reliable as a twenty-year-old
Sun that one can get on eBay for the price of a six-pack of beer...



#6 of 43 by tod on Mon Sep 26 03:14:46 2005:

What's the verdict from the h/w tests?


#7 of 43 by mcnally on Mon Sep 26 06:40:48 2005:

 No memory defect found.


#8 of 43 by mcnally on Mon Sep 26 06:41:06 2005:

 (to the best of our recollection..)


#9 of 43 by tod on Mon Sep 26 15:57:36 2005:

That's too bad whatever you told me isn't the cause for...uh..what are we
talking about?


#10 of 43 by cross on Mon Sep 26 17:08:21 2005:

The operating system sucking.


#11 of 43 by mcnally on Mon Sep 26 22:06:48 2005:

 I hope that the theories about this being an OS problem are correct and
 that addressing OS deficiencies will fix the problem.  But I'm still
 highly skeptical, since none of the OS-upgrade theories suggested so far
 do much to explain why Grex had several months of stable operation on its
 current operating system and then began experiencing more-than-daily 
 crashes.


#12 of 43 by nharmon on Tue Sep 27 12:44:12 2005:

I seem to recall a lot of software installations being requested and
approved. Not to say this was wrong, but there are a lot of variables here.


#13 of 43 by naftee on Tue Sep 27 14:28:16 2005:

you seem to be unlucky


#14 of 43 by cross on Tue Sep 27 15:28:22 2005:

Most likely, someone figured out how to tickle some bug in OpenBSD
and has fun doing so on a near-daily basis.  I don't think anyone
has ever tried to correlate crash times to who was logged in at the
time (if they did, I'm not sure what the data would look like; there
may be one or more people using different accounts coming from
different ISPs doing whatever they choose to Grex).


#15 of 43 by mcnally on Tue Sep 27 16:45:26 2005:

 re #14:  That is a definite possibility, and one that I've been pretty
 concerned about.  Is the latest version of OpenBSD so exploit-free that
 we can expect it to survive a determined vandal with local shell access?


#16 of 43 by tod on Wed Sep 28 03:42:34 2005:

Is Grex on a rack mount server?


#17 of 43 by mary on Wed Sep 28 11:45:30 2005:

No.  Our co-lo charges us by the server, not by rack space or
footprint.


#18 of 43 by cross on Wed Sep 28 21:34:46 2005:

I think Todd was asking whether the box is a rack mount chassis.  It is
not.  It's a standard tower case.

Regarding #15; I think the reports of OpenBSD's security have been greatly
exaggerated.


#19 of 43 by tod on Wed Sep 28 23:20:52 2005:

Is Cyberspace considering a move to a rack mountable chassis?


#20 of 43 by gelinas on Thu Sep 29 01:14:36 2005:

Depends upon your definition of "considering," Todd.  It's been mentioned a
few times, now and again.


#21 of 43 by khamsun on Thu Sep 29 08:27:14 2005:

I used to run OpenBSD (from 2.6 on, x86) for
gateway/firewall/http/ftp/p2p/ssh, and I recall now that I once
experienced some weird crashes.  After much head-banging on the walls all
around, I found that the solder points on the motherboard were
slightly too pointy, and after I added cardboard between the back of the
mainboard and the case, the problem disappeared.
An occasional bad combination of small vibrations and heating was enough to
generate small electrostatic discharges between the back of the board
and the case.
A hardware failure will usually leave a log entry from some I/O subsystem, and
faulty memory will be diagnosed by segfaults spat out during some compilations.
Grex could enter the contest for the most unreliable OpenBSD box this
side of the Pecos.
But we could also bet by PayPal on the cause of the crashes, with the pot
going to the board to buy a book on unix security.

Ok, i hurry to log out before the next crash occurs.


#22 of 43 by nharmon on Thu Sep 29 13:54:59 2005:

I've always been leery about using a custom-built PC as a production
server. I understand that Grex, being a non-profit and all, has to keep
costs down. But I think we should consider investing money in better
hardware before we pay someone to fix the software.


#23 of 43 by cross on Thu Sep 29 16:32:25 2005:

Installing FreeBSD on a Dell or IBM server would be preferable to what
we're doing now.


#24 of 43 by mcnally on Thu Sep 29 18:01:04 2005:

 Because of the FreeBSD part or because of the Dell or IBM server part?


#25 of 43 by cross on Thu Sep 29 18:05:00 2005:

Both.


#26 of 43 by tod on Thu Sep 29 18:26:37 2005:

re #21
Cold solder joints are a strong possibility.  A rack mount server might be
a logical step just for posterity.


#27 of 43 by cross on Thu Sep 29 18:29:08 2005:

Nah, it'll never happen.  There was strenuous argument about it the first
time I suggested it, and of course, grex went the wrong direction and got
a tower case.  Oh well.

Buying a rackmount Dell server would be a good idea.  Or an IBM server.
In either case, something that could run FreeBSD.  Hey!  Those things
even come with RAID controllers and hot-swappable disks!


#28 of 43 by nharmon on Thu Sep 29 18:33:48 2005:

Well, if it came down to only changing one of the two (hardware, or OS),
I would argue for changing the hardware. Nextgrex has only been in use
for 10 months? Does Grex depreciate equipment?


#29 of 43 by tod on Thu Sep 29 20:18:44 2005:

Back in the day, we used to depreciate M-Net h/w on a 3 year cycle.
Grex could certainly benefit by getting a rack server with a warranty on it
and possibly paths for h/w upgrades as technology/price permits.


#30 of 43 by cross on Thu Sep 29 21:00:54 2005:

Upgrade?  Whatever would they do that for?  Grex likes living in the
basement.


#31 of 43 by jp2 on Fri Sep 30 15:10:32 2005:

This response has been erased.



#32 of 43 by albaugh on Fri Sep 30 17:45:44 2005:

Hardware costs moola; I don't think you're going to inspire the dwindling grex
membership to donate more $ for more, possibly-unneeded hardware.
Software requires people's time to install and configure.  This is also in
much dwindling supply.


#33 of 43 by mcnally on Fri Sep 30 18:37:22 2005:

This response has been erased.



#34 of 43 by mcnally on Fri Sep 30 18:38:30 2005:

 re #32:  So what do we do then?  Roll over and die?

 Ignoring the problem out of existence isn't working..


#35 of 43 by remmers on Fri Sep 30 19:15:49 2005:

I did some looking for patterns in the "who was logged in when
grex crashed" department but didn't spot anything obvious.  Not
all exploits in the universe depend on being logged in, of course.
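A check of this kind can be sketched roughly as below. This is an illustration only, not the procedure actually used on Grex: the session tuples, the crash timestamps, and the function name users_present_at_crashes are all hypothetical, and the data in the usage example is made up. It assumes login records have already been parsed into (user, login_time, logout_time) tuples, with logout_time set to None when the session was cut off without a clean logout.

```python
from collections import Counter
from datetime import datetime


def users_present_at_crashes(sessions, crash_times):
    """Count, per user, how many crashes happened while that user was logged in.

    sessions    -- iterable of (user, login_time, logout_time_or_None) tuples
                   (hypothetical format, assumed to be parsed elsewhere)
    crash_times -- iterable of datetime objects, one per crash
    """
    crashes = sorted(crash_times)
    counts = Counter()
    for crash in crashes:
        present = set()
        for user, start, end in sessions:
            if end is None:
                # No clean logout: assume the session ended at the first
                # crash at or after the login time.
                later = [c for c in crashes if c >= start]
                end = later[0] if later else None
            if end is not None and start <= crash <= end:
                present.add(user)
        # A user with several overlapping sessions counts once per crash.
        counts.update(present)
    return counts


if __name__ == "__main__":
    # Made-up example data, purely for illustration.
    day = lambda h, m: datetime(2005, 9, 20, h, m)
    sessions = [
        ("alice", day(10, 0), None),
        ("bob", day(9, 0), day(9, 30)),
        ("alice", day(12, 0), None),
        ("carol", day(12, 5), day(12, 30)),
    ]
    crashes = [day(10, 30), day(12, 10)]
    for user, n in users_present_at_crashes(sessions, crashes).most_common():
        print(user, n)
```

A user who shows up at nearly every crash is only a lead, not proof, and as noted above, an exploit that does not require a login would not show up in such a count at all.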

We're running an old version of OpenBSD (3.5), so it's certainly due for
upgrade or replacement.  I'm personally neutral on the issue of
whether we stick with OpenBSD or switch to something else like
FreeBSD, but since the procedure for installing Grex under
OpenBSD has been *extremely* well documented, mostly by Jan
Wolter (see /grex/grexdoc), the *FASTEST* thing to do at this point
would be to upgrade to the latest OpenBSD.  

If upgrading fixes our instability problems (and I've heard more than
one opinion to the effect that they stem from problems in 3.5 that have
been fixed in subsequent versions), great.  If not, we have all the more
reason to switch to a different OS and/or investigate hardware issues
more thoroughly.

Since I seem to be the staff member with the most free time at the
moment, I've been working on upgrading to the latest OpenBSD, using a
spare x86 machine in my house to install a mini-grex and editing the
installation procedures to reflect changes between OpenBSD 3.5 and
3.8.  When I'm done, upgrading the "real" Grex should go pretty fast,
hopefully with not much more than a day of downtime.

This effort is pretty well along.  On my mini-grex I've installed
OpenBSD 3.8, a whole slew of stuff from the ports tree, the exim mail
agent, and most of Grex's home-grown software items (e.g. backtalk,
fronttalk, party) and customizations (e.g. packet filtering and disk quotas).

There's a handful of unresolved issues that I'm hoping other staffers
can advise me on.  Once that's done, we should be ready to upgrade. 
Things are pretty close, so I'm hoping that this can happen in a week or so.


#36 of 43 by tod on Fri Sep 30 22:29:48 2005:

Blame the liberal media


#37 of 43 by dpc on Wed Oct 12 19:06:38 2005:

It's great that you're working on a software upgrade, John!


#38 of 43 by tod on Thu Oct 13 15:57:24 2005:

Patch management is a great first step to finding a resolution.
Thanks R E M M E R (S)


#39 of 43 by janc on Sun Nov 20 04:37:54 2005:

I think we are all convinced now that Grex is suffering from a hardware
problem.

(1) It crashed twice during the OpenBSD 3.8 install.  Both crashes were
during the rare times during the upgrade when Grex was heavily loaded -
during massive compiles.  The fact that the same crashes continue to
happen on new software tends to favor a hardware explanation.

(2) I dug through the kernel a bit to diagnose one of those crashes. 
The problem occurred in the virtual memory management code, during a
malloc(). An unused block of memory had been pulled off the free memory
list.  All unused memory blocks contain a header containing information
used by the memory management code.  This header also includes a "magic
number" - just a special number stored there to mark this as a free
memory block.  The value used in OpenBSD is the hexadecimal value
"deafbeef".  When taking memory off the free list, the kernel checks
that the magic number has not been meddled with.  A changed value would be
a sign that something is overwriting supposedly unused memory, and a good
cause for panic.  In checking the memory value, it found "deabbeef" instead
of "deafbeef".  It panicked, printing an error message and halting the system.

"deabbeef" instead of "deafbeef" is a single bit 1->0 error.  I found a
record of an identical crash on the previous OS with the same bit going
wrong in a different memory location.  This looks very much like a bad
memory chip on one of Grex's DIMMs.  (Note that each bit of a value
stored in memory is typically stored in a different chip, not all together
on the same chip like you'd expect.)  Except of course that the memory
test failed to find a problem.
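The single-bit claim is easy to check by XORing the two values quoted above. The following is a minimal sketch, not code from the kernel; the function name flipped_bits is made up for illustration, and bits are numbered from 0 at the least significant end.

```python
EXPECTED = 0xDEAFBEEF  # magic value quoted in the post
OBSERVED = 0xDEABBEEF  # value the kernel reportedly found instead


def flipped_bits(expected, observed):
    """Return a list of (bit_index, direction) for every differing bit."""
    diff = expected ^ observed  # set bits mark the positions that changed
    changes = []
    bit = 0
    while diff:
        if diff & 1:
            # Direction is read from the expected value at this position.
            direction = "1->0" if (expected >> bit) & 1 else "0->1"
            changes.append((bit, direction))
        diff >>= 1
        bit += 1
    return changes


if __name__ == "__main__":
    for bit, direction in flipped_bits(EXPECTED, OBSERVED):
        print(f"bit {bit}: {direction}")
```

Run on these two values, it reports exactly one difference, bit 18 going from 1 to 0.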

So we have two competing theories:

  - There is bad memory, but memtest86+ failed to find it.

  - The memory is fine, but there is some defect in the motherboard
    that is affecting the particular circuit that brings bit 18 from
    the memory to the CPU.  This only comes into play when the system
    is fairly busy, doing I/O as well as computation.

I consider both of these theories entirely plausible.  Not that my
opinion on these things is particularly expert.  We'll probably investigate
the memory theory first, because it is easier to test.  Grex currently
has three 512 megabyte DIMMs.  We could pull out the ones from slots 2
and 3, leaving behind just the slot 1 one (closest to the CPU).  If
crashes stop, then probably one of the pulled DIMMs was bad.  If not, we
try swapping the slot 1 DIMM with one of the others.

If the bad DIMM theory doesn't pan out, then we'll want to try a
motherboard swap.  Grex purchased a spare motherboard, which is,
unfortunately, currently misplaced.  Need to look harder.


#40 of 43 by naftee on Sun Nov 20 05:12:38 2005:

i think it just crashed.

yep.
12:12AM  up 11 mins, 3 users, load averages: 1.53, 1.15, 0.70


#41 of 43 by root on Sun Nov 20 13:13:55 2005:

I suggest rerunning "make world" after the memory swap.  It seems to 
generate the right load to trigger the problem quickly.


#42 of 43 by steve on Sun Nov 20 19:13:51 2005:

   Rebuilding the world is a great way to test things, as is compiling
large packages.  I'm heading to Provide now to try some stuff.


#43 of 43 by jesuit on Wed May 17 02:15:46 2006:

TROGG IS DAVID BLAINE

