We have completed an upgrade from OpenBSD 3.5 to OpenBSD 3.8.
Doing upgrades periodically is a necessary fact of life. Security fixes cease
to be available for older releases, so keeping them secure against newly
discovered exploits requires a full upgrade at least every year and a half,
preferably more often.
In this instance, we were also hoping that an upgrade might help with Grex's
problem with crashing about once a day. STeve believes that a weakness in
the 3.5 ethernet driver that has been fixed in 3.8 might have been the cause
of some of our crashes.
However, we've already had two crashes on the new version of the OS, so this
is clearly not the whole story. I did some investigating of the specific
cause of one of those crashes and determined that there was a value that had
been written to memory by the CPU that differed in one bit when it was read
back. This is a clear hardware error. It could be bad memory (though a
memory test we ran before the upgrade did not turn it up) or a defect in the
motherboard. We will be starting a program of tests to identify the exact
problem in a few days. I'm confident that we will be able to fix this. But
it's a bit of a disappointment that the problem was not solved by the OS
upgrade.
The biggest screw-up, of course, was the loss of the /var/mail partition.
The /var partition was backed up, and everyone thought /var/mail was included
in that backup file, but in fact, /var/mail was not mounted when the backup
was made. The mirror scripts are not smart enough to skip unmounted
file systems, and they were not shut down before we started work, so the
mirrored copy was lost too. This wasn't noticed until after the disks had
been reformatted.
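A guard along the following lines, run at the top of both the backup and the mirror scripts, might have caught this. It is only a sketch in Python, not what Grex's scripts actually do; the function names and the list of paths are hypothetical.

```python
import os


def unmounted(required):
    """Return the paths in `required` that are not currently mount points."""
    return [path for path in required if not os.path.ismount(path)]


def require_mounted(required):
    """Raise an error unless every path in `required` is a mount point.

    Meant to be called before a backup or mirror pass starts, so that an
    empty mount-point directory is never copied in place of real data.
    """
    missing = unmounted(required)
    if missing:
        raise RuntimeError("not mounted: " + ", ".join(missing))
```

A backup script would call something like require_mounted(["/var", "/var/mail"]) first and stop if it raises.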
Finally, it just took too long. My goal is to get to the point where we can
do a rebuild of Grex in a day. It took us a week.
Worse, a great deal of the delay was because we as staff really failed to work
together effectively. We ran into deep differences in basic philosophy about
how Grex should be run that cost us extra days. Because we didn't all agree
on what we were going to be doing before we started, our preparation for the
rebuild was not complete. We ended up redoing significant portions of the
job more than once.
For the next rebuild, we need to do the following:
- Agree beforehand on exactly what version we are upgrading to. Upload
all distribution files to Grex. In this case, STeve had done the
uploads, but for a different version than I believed we needed. When
we made the course change, we had to upload everything again, which
cost us a lot of time.
- Carefully and fully update Grexdoc before shutting down Grex. Grexdoc
is a collection of installation instructions, scripts, patches, saved
config files and custom source code that, in theory, contains everything
you need to build a new Grex from the bare machine up. I wrote it and
used it to build the previous 3.5 Grex, but, of course, many small
adjustments would be needed to build a 3.8 Grex. John Remmers did a
great job of testing it out for 3.8. He took a spare machine he had
and used Grexdoc to build a whole Grex, validating all procedures.
This was great, but we missed a step. In the year or so since 3.5
Grex was built, staff has made many changes to Grex. Some, but not
all, of these changes were documented in Grexdoc. What we should have
done BEFORE John built his test machine, was to go through Grexdoc
and compare everything there with the current configuration of Grex,
and make sure it all matched up. Then it's time to check how it all
works for the new OS. Since we missed the update check, I ended up
first building Grex from a slightly obsolete Grexdoc, then spending
a day or so checking all the configuration of the new system against
the configuration of the backup of the old system, copying those
changes into the new system, and checking them into Grexdoc. If
I'd had my act together, then all of that could have been done while
Grex was still up for users, giving much less down time.
- Use ALT partitions. When I originally laid out the partitions for
OpenBSD Grex, I set up a full set of spare partitions called the ALT
partitions. The idea was that when we did a rebuild, we would install
the new system on the ALT partitions, leaving the old OS on the old
partitions. When the new build was done, the ALT partitions would
become the primary partitions. The advantage of this is that if we
ran into problems with the new OS, then we could bring Grex back up
on the old OS while we scratched our heads and thought it over. In
this instance, we wanted to repartition the drives for various good
reasons so we ended up erasing the old partitions before doing the
install. But not having the option of coming back up under the old
OS really cost us. By Sunday it was clear that (1) the upgrade wasn't
going to fix the crash problem and (2) different staff had very
different ideas on how to proceed with the upgrade. If we'd had the
ALT partitions, then the thing to do would have been to reboot on the
old 3.5 system, and take a week or two to revise our plan and get
our heads on the same page. Lacking that option, and with Grex down,
we had to proceed. I ended up just vetoing other staff members, a
move that was not good for staff harmony. Next time, if we want
new partitions, we should get a new disk to put them on. Keep the
old OS, at least until the rebuild is done.
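The Grexdoc check described above (comparing everything saved there with the running system before the shutdown) could be partly automated. Here is a rough Python sketch. It assumes the saved config files sit in a directory tree that mirrors their paths on the live system, which may not be how Grexdoc is really laid out; the function name is made up.

```python
import filecmp
import os


def config_drift(saved_root, live_root):
    """Compare each file under saved_root with the same path under live_root.

    Returns (changed, missing): relative paths whose contents differ, and
    relative paths that exist only in the saved copy.
    """
    changed, missing = [], []
    for dirpath, _dirnames, filenames in os.walk(saved_root):
        for name in filenames:
            saved = os.path.join(dirpath, name)
            rel = os.path.relpath(saved, saved_root)
            live = os.path.join(live_root, rel)
            if not os.path.isfile(live):
                missing.append(rel)
            # shallow=False forces a byte-by-byte comparison instead of
            # trusting matching size and modification time.
            elif not filecmp.cmp(saved, live, shallow=False):
                changed.append(rel)
    return sorted(changed), sorted(missing)
```

This only catches drift in files the documentation already knows about. Changes staff made to files that were never saved would still need a manual pass.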
62 responses total.
This response has been erased.
Thanks to all staff involved. Jan, especially, has been very generous in putting in so much time and energy to keep this community online. Know it's appreciated.
Hello Staff: Just in case you are not aware of it, ftp in and out is not working. Thanks, Ahmet Toprak
Yes, thanks, Jan and John and STeve, for upgrading Grex. It's too bad it was such a hassle. We're all glad you pushed through to the end.
I'd like #0 to include "inform users of Grex beforehand as much as possible." "However, we've already had two crashes on the new version of the OS, so this is clearly not the whole story." Perhaps OpenBSD is a bad idea altogether.
Grex crashes like an old cranky Russian car. That's the fun of it. Who wants to drive a boring, shiny, well-working Japanese car? With Grex you know you are in something full of surprises and unpredictable, exactly like that old Pobeda you rescued from a farmyard somewhere in Finland. Never drive farther than the corner without having some valves and gaskets with you.
(Re#3: Ftp is working again)
GreX has a last name : Silence
Dan, I didn't want to get over technical in an agora item, but the panic that occurred was in the virtual memory system. Free memory blocks are marked with a hexadecimal value of "deafbeef" in the block header. (Funny guys these openbsd folks.) When the virtual memory system reclaims a free memory block with the intention of allocating it to some process that wants memory, it checks that value in the block header. If it's not still "deafbeef", then something has been writing to that supposedly free block of memory, and things are very bad, so it performs a panic, printing a message and halting the computer. This is what we were getting. The memory value was "deabbeef" instead of "deafbeef". I've only seen three crash messages, but two of the three I've seen, one from before the upgrade and one from after the upgrade, were "deabbeef" errors, though on different memory blocks. This suggests routine 1 -> 0 memory errors on one particular bit position across a block of memory. This could be a bad chip on one of the DIMMs, or it could be a motherboard fault. Grex is currently running with three 512Meg DIMMs. I'm inclined to remove two of them and see if we still have crashes. If we do, I'd suggest swapping the remaining one with one of the removed ones. A nicer strategy would be if we could figure out which DIMM the memory locations we saw errors at were located on. But that would require more information about the motherboard than I have been able to find. For what it's worth, it's an ASUS A7V8X/GBL/13. I'm pretty sure that if you take out two DIMMs, you want to leave behind the one in the slot closest to the CPU, though the manual I found wasn't exactly illuminating. It says: "To enhance system performance, utilize dual-channel feature when installing additional DIMMs. Install the DIMMs in any of the following sequences: Sockets 1 & 3 or Sockets 2 & 3 or Sockets 1, 2 & 3." Really clears up the matter, doesn't it?
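For anyone who wants to check the one-bit claim, XORing the expected marker with the value from the panic message leaves only the flipped bit set. A small Python illustration; the function name is just for this example.

```python
def flipped_bits(expected, observed):
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ observed
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


expected = 0xDEAFBEEF  # free-block marker quoted above
observed = 0xDEABBEEF  # value seen in the panic messages

bits = flipped_bits(expected, observed)
print(bits)                                  # [18]
print([(expected >> b) & 1 for b in bits])   # [1], so the bit was set and read back clear
```

Exactly one bit differs (bit 18, counting from zero at the low end), and it went from 1 to 0, which fits the stuck-bit theory.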
If pulling memories doesn't work, then we can try swapping in our spare motherboard. Except that we've misplaced our spare motherboard. Sigh.
Oh, please include Bruce Howard in your list of folks to thank. He's done quite a lot. Thanks.
Thank you Bruce Howard.
Thanks, Bruce Coward.
This response has been erased.
I plan on getting to the hardware sometime after 1:30 to play swapping-around games with the memory. I do not think that our memory is at fault, but this does need to be checked. We can remove two DIMMs and still have 512M of RAM, which is still pretty good.
When are you going to apologize for sending a false accusatory e-mail about me to Gmail?
1. What did he say about you? 2. What's Gmail?
Dear sir. Basically, some guy sent an E-mail saying that I was a staff member and that my E-mail address was polytarp@gmail.com and that the guy who was receiving the E-mail ought to send any complaints about some rude statements he made to my address. I had nothing to do with either of these guys. Steve caught hold of the E-mail (I forget; maybe even because I posted a copy of it?) and decided to E-mail www.gmail.com and have my account there closed, claiming I was the one who had sent the original E-mail. I lost many E-mails that were very important to me. There're, uh, contemporaneous items about it hanging around coop. I'll try to find them!
Yeah. Items 243, 244, 250, 259 in coop. 152 in agora52. A bit of my silly posturing, a few sad admissions, a few hilarious moments (like when Eric Bassey (who, by the way, does NOT have a foreskin) threatened to have me thrown in jail, noting that Canadian authorities tend to be very cooperative with extradition requests regarding computer fraud) and, at its heart, a terrible example of Drex's staff at its best.
NOW THAT WAS JUST RUDE, SCHOLAR. SERIOUSLY, MOST OF THE STAFF IS JUST FINE.
-bash-3.00$ tel steve Telegram to steve on ttyp4... Msg: You are a bad man. SENT I fully support that tel, though, and I'm sure it will help the situation.
Like, uh. I understand the whole thing about OH WELL YOU'VE CAUSED SO MUCH TROUBLE ON THIS SYSTEM WHAT DID YOU EXPECT. But I'll tell you what I expect: SOME SORT OF INTEGRITY. If I'm fucking around with Grex, ban me. Fine. I can deal with that. In fact, I wouldn't be particularly perturbed if Grex banned me right now, even though I haven't fucked with the system in quite a while. However, I DON'T support sending UNTRUE e-mails to other systems that have nothing to do with Grex and that I have not abused in the slightest. There have been plenty of reasons to nail me over the years. That, though, doesn't make it right to send false E-mails to other systems in order to inflict more 'damage' on me than they could by simply banning me from Grex, though if you read the items I posted you'll see many people (cowards, I think) claim that this sort of stuff is justified.
Anyway, yeah. I like having a shoulder to cry on. And your shoulder, md. It's so much comfier than most people's.
Thanks Bruce!
Are you in possession of other's foreskin, polytarp?
Thanks y'all.
Re. 23: Thanks, Bruce! Re. 24: I firmly believe other's foreskin ascended to the heavens and now forms the rings of Saturn.
This response has been erased.
I'm a bit upset that md ignored me crying on his shoulder. Dear Mr. Expert Witness, I would be willing to hire you for a small cash fee to testify in any legal cases I may or may not bring against Mr. ANdre. I wonder if that will get his attention.
You just try that, 'scholar'.
Okay! I will indeed try to either bring a legal case against you or not bring a legal case against you. No joke.
Grex is now running with 512M of RAM. There are three DIMM slots on the motherboard. I couldn't see any markings for which was slot 0 (of 0-2), so I guessed that they'd be numbered left to right. That turned out to be right. I took all three out and put the second DIMM (slot 1) into slot 0. Given that this is all a crap shoot, I figured that moving everything around was the best idea. If Grex becomes stable we still don't know where the problem lies, exactly. It could be the RAM. I plan on calling Crucial tomorrow to see about sending the two unused DIMMs to them for testing. While I believe the memtest86 program results, Crucial has the best hardware to test RAM. Grex has been running for almost an hour now with a couple of programs compiling in the background. We'll see if it continues...
"a value that had been written to memory by the CPU, that differed in one bit when it was read back" I had that problem back in 1958 or so on a Datatron.
We've now been running for 45 minutes running normal things, plus two infinite loops of compiling a 100,000 line C program. These last two items are causing us to swap. Grex is definitely busy at the moment. I'm going to let this run as long as I can.
After 20 or so minutes of a few programs compiling, I brought it up to 10, started swapping like mad and got the load average up to 68 for several minutes. Grex didn't crash after more than 30 minutes of this.
This response has been erased.
Dan, as far as I know, no one has looked into recompiling grexsoft or anything else outside of the grexdoc structure bar possibly a few things manually installed from the ports tree. We're flying into Annapolis early tomorrow but I may be able to take a look into it once things have settled down later this week.
(35 slipped in) It looks like it was a controlled reboot and Steve logged back in right after the system came up. I suspect it was part of the system memory swapping and testing he's doing this evening.
Grex rebooted 32 minutes ago because I had to bring it down in order to put it back in its little home. Grex lives in the Attic. I gotta get some pictures of this place. It definitely looks like an Attic. Grex lives in the bottom of an empty 19in rack mount. You can't stare at the hardware when it's there, so it has to be brought out and hooked up on a table next to the rack. The test I ran earlier tonight put real strain on the system. I've never seen an OpenBSD system with a load average of 68. I kept the system stressed for at least 30 minutes, to see what would happen. I think we're in better shape now.
steVE's wife is glandular.
- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss