|
|
| Author |
Message |
| 19 new of 43 responses total. |
cross
|
|
response 25 of 43:
|
Sep 29 18:05 UTC 2005 |
Both.
|
tod
|
|
response 26 of 43:
|
Sep 29 18:26 UTC 2005 |
re #21
Cold solder joints are a strong possibility. A rack mount server might be
a logical step just for posterity.
|
cross
|
|
response 27 of 43:
|
Sep 29 18:29 UTC 2005 |
Nah, it'll never happen. There was strenuous argument about it the first
time I suggested it, and of course, grex went the wrong direction and got
a tower case. Oh well.
Buying a rackmount Dell server would be a good idea. Or an IBM server.
In either case, something that could run FreeBSD. Hey! Those things
even come with RAID controllers and hot-swappable disks!
|
nharmon
|
|
response 28 of 43:
|
Sep 29 18:33 UTC 2005 |
Well, if it came down to only changing one of the two (hardware, or OS),
I would argue for changing the hardware. Nextgrex has only been in use
for 10 months? Does Grex depreciate equipment?
|
tod
|
|
response 29 of 43:
|
Sep 29 20:18 UTC 2005 |
Back in the day, we used to depreciate M-Net h/w on a 3 year cycle.
Grex could certainly benefit by getting a rack server with a warranty on it
and possibly paths for h/w upgrades as technology/price permits.
|
cross
|
|
response 30 of 43:
|
Sep 29 21:00 UTC 2005 |
Upgrade? Whatever would they do that for? Grex likes living in the
basement.
|
jp2
|
|
response 31 of 43:
|
Sep 30 15:10 UTC 2005 |
This response has been erased.
|
albaugh
|
|
response 32 of 43:
|
Sep 30 17:45 UTC 2005 |
Hardware costs moola, I don't think you're going to inspire the dwindling grex
membership to donate more $ for more, possibly-unneeded hardware.
Software requires people's time to install and configure. This is also in
much dwindling supply.
|
mcnally
|
|
response 33 of 43:
|
Sep 30 18:37 UTC 2005 |
This response has been erased.
|
mcnally
|
|
response 34 of 43:
|
Sep 30 18:38 UTC 2005 |
re #32: So what do we do then? Roll over and die?
Ignoring the problem out of existence isn't working..
|
remmers
|
|
response 35 of 43:
|
Sep 30 19:15 UTC 2005 |
I did some looking for patterns in the "who was logged in when
grex crashed" department but didn't spot anything obvious. Not
all exploits in the universe depend on being logged in, of course.
We're running an old version of OpenBSD (3.5), so it's certainly due for
upgrade or replacement. I'm personally neutral on the issue of
whether we stick with OpenBSD or switch to something else like
FreeBSD, but since the procedure for installing Grex under
OpenBSD has been *extremely* well documented, mostly by Jan
Wolter (see /grex/grexdoc), the *FASTEST* thing to do at this point
would be to upgrade to the latest OpenBSD.
If upgrading fixes our instability problems (and I've heard more than
one opinion to the effect that they stem from problems in 3.5 that have
been fixed in subsequent versions), great. If not, we have all the more
reason to switch to a different OS and/or investigate hardware issues
more thoroughly.
Since I seem to be the staff member with the most free time at the
moment, I've been working on upgrading to the latest OpenBSD, using a
spare x86 machine in my house to install a mini-grex and editing the
installation procedures to reflect changes between OpenBSD 3.5 and
3.8. When I'm done, upgrading the "real" Grex should go pretty fast,
hopefully with not much more than a day of downtime.
This effort is pretty well along. On my mini-grex I've installed
OpenBSD 3.8, a whole slew of stuff from the ports tree, the exim mail
agent, and most of Grex's home-grown software items (e.g. backtalk,
fronttalk, party) and
customizations (e.g. packet filtering and disk quotas).
There's a handful of unresolved issues that I'm hoping other staffers
can advise me on. Once that's done, we should be ready to upgrade.
Things are pretty close, so I'm hoping that this can happen in a week or so.
|
tod
|
|
response 36 of 43:
|
Sep 30 22:29 UTC 2005 |
Blame the liberal media
|
dpc
|
|
response 37 of 43:
|
Oct 12 19:06 UTC 2005 |
It's great that you're working on a software upgrade, John!
|
tod
|
|
response 38 of 43:
|
Oct 13 15:57 UTC 2005 |
Patch management is a great first step to finding a resolution.
Thanks R E M M E R (S)
|
janc
|
|
response 39 of 43:
|
Nov 20 04:37 UTC 2005 |
I think we are all convinced now that Grex is suffering from a hardware
problem.
(1) It crashed twice during the OpenBSD 3.8 install. Both crashes were
during the rare times during the upgrade when Grex was heavily loaded -
during massive compiles. The fact that the same crashes continue to
happen on new software tends to favor a hardware explanation.
(2) I dug through the kernel a bit to diagnose one of those crashes.
The problem occurred in the virtual memory management code, during a
malloc(). An unused block of memory had been pulled off the free memory
list. All unused memory blocks contain a header containing information
used by the memory management code. This header also includes a "magic
number" - just a special number stored there to mark this as a free
memory block. The value used in OpenBSD is the hexadecimal value
"deafbeef". When taking memory off the free list, the kernel checks
that the magic number has not been meddled with. This would be a sign
that something is overwriting supposedly unused memory, and a good cause
for panic. In checking the memory value, it found "deabbeef" instead of
"deafbeef". It panicked, printing an error message and halting the system.
"deabbeef" instead of "deafbeef" is a single bit 1->0 error. I found
a record of an identical crash on the previous OS with the same bit going
wrong in a different memory location. This looks very much like a bad
memory chip on one of Grex's DIMMs. (Note that each bit of a value
stored in memory is typically stored in a different chip, not all together
on the same chip like you'd expect.) Except of course that the memory
test failed to find a problem.
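
A minimal sketch of the bit arithmetic described above, using only the two values quoted in this post. The function name flipped_bits is made up for illustration and is not anything from the OpenBSD kernel; it XORs the expected and observed values and reports which bit positions differ and in which direction.

```python
# Values as quoted in the post; nothing here is taken from kernel source.
EXPECTED = 0xDEAFBEEF  # magic number expected in a free memory block
FOUND = 0xDEABBEEF     # value actually found when the kernel panicked


def flipped_bits(expected, found):
    """Return a list of (bit_index, direction) for every differing bit.

    bit_index counts from 0 at the least significant bit.
    direction is "1->0" if the bit was set in expected, else "0->1".
    """
    diff = expected ^ found  # set bits mark the positions that changed
    result = []
    bit = 0
    while diff:
        if diff & 1:
            direction = "1->0" if (expected >> bit) & 1 else "0->1"
            result.append((bit, direction))
        diff >>= 1
        bit += 1
    return result


print(flipped_bits(EXPECTED, FOUND))
```

The XOR of the two values is 0x00040000, so the only difference is bit 18 going from 1 to 0, which matches the single-bit reading given above.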
So we have two competing theories:
- There is a bad memory chip, but memtest86+ failed to find it.
- The memory is fine, but there is some defect in the motherboard
that is affecting the particular circuit that brings bit 18 from
the memory to the CPU. This only comes into play when the system
is fairly busy, doing I/O as well as computation.
I consider both of these theories entirely plausible. Not that my
opinion on these things is particularly expert. We'll probably investigate
the memory theory first, because it is easier to test. Grex currently
has three 512 megabyte DIMMs. We could pull out the ones from slots 2
and 3, leaving behind just the slot 1 one (closest to the CPU). If
crashes stop, then probably one of the pulled DIMMs was bad. If not, we
try swapping the slot 1 DIMM with one of the others.
If the bad DIMM theory doesn't pan out, then we'll want to try a
motherboard swap. Grex purchased a spare motherboard, which is,
unfortunately, currently misplaced. Need to look harder.
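
The isolation plan above can be sketched as a small decision function. This is only an illustration: the function name next_suspect and its arguments are hypothetical, and the third branch (blaming the original slot 1 DIMM if crashes stop after the swap) is an assumption that the post implies but does not state.

```python
def next_suspect(crashes_with_slot1_only, crashes_after_slot1_swap=None):
    """Name the next suspect, following the plan described in the post.

    crashes_with_slot1_only: True if crashes continue with the DIMMs in
        slots 2 and 3 pulled and only the slot 1 DIMM installed.
    crashes_after_slot1_swap: True or False once the slot 1 DIMM has been
        swapped with one of the others; None if that has not been tried.
    """
    if not crashes_with_slot1_only:
        # Crashes stopped after pulling two DIMMs.
        return "one of the DIMMs pulled from slots 2 and 3"
    if crashes_after_slot1_swap is None:
        # Still crashing on one DIMM; the swap is the next step.
        return "undetermined: swap the slot 1 DIMM with one of the others"
    if not crashes_after_slot1_swap:
        # Assumption: a clean run after the swap points at the old DIMM.
        return "the DIMM originally in slot 1"
    # The bad DIMM theory did not pan out.
    return "motherboard: try the spare"


print(next_suspect(True))
```

Each step only means something if the machine is put under the same heavy load that triggered the earlier crashes, so a quiet day without a crash does not clear a DIMM.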
|
naftee
|
|
response 40 of 43:
|
Nov 20 05:12 UTC 2005 |
i think it just crashed.
yep.
12:12AM up 11 mins, 3 users, load averages: 1.53, 1.15, 0.70
|
root
|
|
response 41 of 43:
|
Nov 20 13:13 UTC 2005 |
I suggest rerunning "make world" after the memory swap. It seems to
generate the right load to trigger the problem quickly.
|
steve
|
|
response 42 of 43:
|
Nov 20 19:13 UTC 2005 |
Rebuilding the world is a great way to test things, as is compiling
large packages. I'm heading to Provide now to try some stuff.
|
jesuit
|
|
response 43 of 43:
|
May 17 02:15 UTC 2006 |
TROGG IS DAVID BLAINE
|