On the Grex status page of HVCN, the message was something like: "Grex will be down until someday for hardware testing." Hardware testing? Hammer? Chainsaw? Dropping the box from a 20-storey building? Immersion in the depths of Lake Michigan? Nuclear radiation? Or was the power cord just loose? :-)
43 responses total.
Yes.
We had hoped that by bombarding it with gamma rays we could turn it into a Super Grex which would simply grow huge, split out of its enclosure, and go on a rampage when stressed instead of merely crashing. Unfortunately it didn't work. In retrospect we should probably have listened to Rane's opinions on the matter, but we thought Dr. Banner knew what he was talking about. Actually it was down for extended memory testing to see whether a bad DIMM could have been responsible for the recent crashes.
i think rane would enjoy playing with gamma rays
Gamma radiation is not a toy! <stern look>
gamma rays to try to make a recent peecee box as reliable as a twenty-year-old Sun that one can get on eBay for the price of a six-pack of beer...
What's the verdict from the h/w tests?
No memory defect found.
(to the best of our recollection..)
That's too bad whatever you told me isn't the cause for...uh..what are we talking about?
The operating system sucking.
I hope that the theories about this being an OS problem are correct and that addressing OS deficiencies will fix the problem. But I'm still highly skeptical, since none of the OS-upgrade theories suggested so far do much to explain why Grex had several months of stable operation on its current operating system and then began experiencing more-than-daily crashes.
I seem to recall a lot of software installations being requested and approved. Not to say this was wrong, but there are a lot of variables here.
you seem to be unlucky
Most likely, someone figured out how to tickle some bug in OpenBSD and has fun doing so on a near-daily basis. I don't think anyone has ever tried to correlate crash times to who was logged in at the time (if they did, I'm not sure what the data would look like; there may be one or more people using different accounts, coming from different ISPs, doing whatever they choose to Grex).
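For what it's worth, here is a rough sketch of what such a correlation might look like in Python. It assumes the login sessions and crash times have already been pulled out of the logs into plain tuples; the function name and the sample data are made up for illustration and are not real Grex records.

```python
from collections import Counter

def users_at_crashes(sessions, crash_times):
    """Count, per user, how many crashes happened while that user was logged in.

    sessions: iterable of (user, login_time, logout_time)
    crash_times: iterable of crash timestamps, in the same units as the sessions
    """
    counts = Counter()
    for crash in crash_times:
        # A set, so a user with two overlapping sessions is counted once per crash.
        present = {user for user, start, end in sessions if start <= crash <= end}
        counts.update(present)
    return counts

# Hypothetical sample data (minutes since an arbitrary start), not real logs.
sessions = [
    ("alice", 0, 100),
    ("bob", 50, 120),
    ("alice", 200, 310),
    ("carol", 290, 310),
]
crashes = [100, 310]

counts = users_at_crashes(sessions, crashes)
for user, n in counts.most_common():
    print(f"{user}: present at {n} of {len(crashes)} crashes")
```

An account that shows up at every crash would only be a lead, not proof, and this kind of count would not catch someone rotating through several accounts.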
re #14: That is a definite possibility, and one that I've been pretty concerned about. Is the latest version of OpenBSD so exploit-free that we can expect it to survive a determined vandal with local shell access?
Is Grex on a rack mount server?
No. Our co-lo charges us by the server, not by rack space or footprint.
I think Todd was asking whether the box is a rack mount chassis. It is not. It's a standard tower case. Regarding #15: I think the reports of OpenBSD's security have been greatly exaggerated.
Is Cyberspace considering a move to a rack mountable chassis?
Depends upon your definition of "considering," Todd. It's been mentioned a few times, now and again.
I have run OpenBSD on x86 since 2.6 for gateway/firewall/http/ftp/p2p/ssh, and I recall now that I once experienced some weird crashes. After much head banging on the walls all around, I found that the soldering points on the motherboard were slightly too pointy, and after I added cardboard between the back of the mainboard and the case, the problem disappeared. A bad occasional combination of small vibrations and heating, enough to generate small electrostatic discharges between the back of the board and the case. A hardware failure will leave a log from some I/O subsystem, and faulty memory will be diagnosed by segfaults spat out by some compilations. Grex could run for the contest of the most unreliable OpenBSD box this side of the Pecos. But we could also bet by PayPal on the cause of the crashes, the sum goes to the board to buy a book on unix security. Ok, I hurry to log out before the next crash occurs.
I've always been leery about using a custom built PC as a production server. I understand Grex being a non-profit and all it has to keep costs down. But I think we should consider investing money in better hardware before we pay someone to fix the software.
Installing FreeBSD on a Dell or IBM server would be preferable to what we're doing now.
Because of the FreeBSD part or because of the Dell or IBM server part?
Both.
re #21 Cold solder joints are a strong possibility. A rack mount server might be a logical step just for posterity.
Nah, it'll never happen. There was strenuous argument about it the first time I suggested it, and of course, grex went the wrong direction and got a tower case. Oh well. Buying a rackmount Dell server would be a good idea. Or an IBM server. In either case, something that could run FreeBSD. Hey! Those things even come with RAID controllers and hot-swappable disks!
Well, if it came down to only changing one of the two (hardware, or OS), I would argue for changing the hardware. Nextgrex has only been in use for 10 months? Does Grex depreciate equipment?
Back in the day, we used to depreciate M-Net h/w on a 3 year cycle. Grex could certainly benefit by getting a rack server with a warranty on it and possibly paths for h/w upgrades as technology/price permits.
Upgrade? Whatever would they do that for? Grex likes living in the basement.
This response has been erased.
Hardware costs moola, I don't think you're going to inspire the dwindling grex membership to donate more $ for more, possibly-unneeded hardware. Software requires peoples' time to install and configure. This is also in much dwindling supply.
This response has been erased.
re #32: So what do we do then? Roll over and die? Ignoring the problem out of existence isn't working..
I did some looking for patterns in the "who was logged in when grex crashed" department but didn't spot anything obvious. Not all exploits in the universe depend on being logged in, of course.
We're running an old version of OpenBSD (3.5), so it's certainly due for upgrade or replacement. I'm personally neutral on the issue of whether we stick with OpenBSD or switch to something else like FreeBSD, but since the procedure for installing Grex under OpenBSD has been *extremely* well documented, mostly by Jan Wolter (see /grex/grexdoc), the *FASTEST* thing to do at this point would be to upgrade to the latest OpenBSD. If upgrading fixes our instability problems (and I've heard more than one opinion to the effect that they stem from problems in 3.5 that have been fixed in subsequent versions), great. If not, we have all the more reason to switch to a different OS and/or investigate hardware issues more thoroughly.
Since I seem to be the staff member with the most free time at the moment, I've been working on upgrading to the latest OpenBSD, using a spare x86 machine in my house to install a mini-grex and editing the installation procedures to reflect changes between OpenBSD 3.5 and 3.8. When I'm done, upgrading the "real" Grex should go pretty fast, hopefully with not much more than a day of downtime.
This effort is pretty well along. On my mini-grex I've installed OpenBSD 3.8, a whole slew of stuff from the ports tree, the exim mail agent, and most of Grex's home-grown software items (e.g. backtalk, fronttalk, party) and customizations (e.g. packet filtering and disk quotas). There's a handful of unresolved issues that I'm hoping other staffers can advise me on. Once that's done, we should be ready to upgrade. Things are pretty close, so I'm hoping that this can happen in a week or so.
Blame the liberal media
It's great that you're working on a software upgrade, John!
Patch management is a great first step to finding a resolution. Thanks R E M M E R (S)
I think we are all convinced now that Grex is suffering from a hardware
problem.
(1) It crashed twice during the OpenBSD 3.8 install. Both crashes were
during the rare times during the upgrade when Grex was heavily loaded -
during massive compiles. The fact that the same crashes continue to
happen on new software tends to favor a hardware explanation.
(2) I dug through the kernel a bit to diagnose one of those crashes.
The problem occurred in the virtual memory management code, during a
malloc(). An unused block of memory had been pulled off the free memory
list. All unused memory blocks contain a header containing information
used by the memory management code. This header also includes a "magic
number" - just a special number stored there to mark this as a free
memory block. The value used in OpenBSD is the hexadecimal value
"deafbeef". When taking memory off the free list, the kernel checks
that the magic number has not been meddled with. This would be a sign
that something is overwriting supposedly unused memory, and a good cause
for panic. In checking the memory value, it found "deabbeef" instead of
"deafbeef". It panicked, printing an error message and halting the system.
"deabbeef" instead of "deafbeef" is a single bit 1->0 error. I found a
record of an identical crash on the previous OS with the same bit going
wrong in a different memory location. This looks very much like a bad
memory chip on one of Grex's DIMMs. (Note that each bit of a value
stored in memory is typically stored in a different chip, not all together
on the same chip like you'd expect.) Except of course that the memory
test failed to find a problem.
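The single-bit claim is easy to check by XORing the two values. A minimal sketch in Python, assuming the values are treated as 32-bit words with bits numbered from 0 at the least significant end (the helper name is made up for illustration):

```python
def flipped_bits(expected: int, observed: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the two words differ."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

EXPECTED = 0xDEAFBEEF  # magic number as quoted in the post
OBSERVED = 0xDEABBEEF  # value found at the time of the panic

bits = flipped_bits(EXPECTED, OBSERVED)
for b in bits:
    # If the expected word had a 1 in this position, the error cleared it.
    direction = "1->0" if (EXPECTED >> b) & 1 else "0->1"
    print(f"bit {b}: {direction}")  # prints "bit 18: 1->0"
```

The XOR comes out as 0x00040000, so exactly one bit differs, and it is bit 18.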
So we have two competing theories:
- There is a bad memory chip, but memtest86+ failed to find it.
- The memory is fine, but there is some defect in the motherboard
that is affecting the particular circuit that brings bit 18 from
the memory to the CPU. This only comes into play when the system
is fairly busy, doing I/O as well as computation.
I consider both of these theories entirely plausible. Not that my
opinion on these things is particularly expert. We'll probably investigate
the memory theory first, because it is easier to test. Grex currently
has three 512 megabyte DIMMs. We could pull out the ones from slots 2
and 3, leaving behind just the slot 1 one (closest to the CPU). If
crashes stop, then probably one of the pulled DIMMs was bad. If not, we
try swapping the slot 1 DIMM with one of the others.
If the bad DIMM theory doesn't pan out, then we'll want to try a
motherboard swap. Grex purchased a spare motherboard, which is,
unfortunately, currently misplaced. Need to look harder.
i think it just crashed. yep. 12:12AM up 11 mins, 3 users, load averages: 1.53, 1.15, 0.70
I suggest rerunning "make world" after the memory swap. It seems to generate the right load to trigger the problem quickly.
Rebuilding the world is a great way to test things, as is compiling large packages. I'm heading to Provide now to try some stuff.
TROGG IS DAVID BLAINE