

Grex Helpers Item 145: Upgrading Grex to 3.8 - a postmortem
Entered by janc on Sat Nov 19 20:05:05 UTC 2005:

We have completed an upgrade from OpenBSD 3.5 to OpenBSD 3.8.

Doing upgrades periodically is a necessary fact of life.  Security fixes cease
to be available for older releases, so keeping them secure against newly
discovered exploits requires a full upgrade at least every year and a half,
preferably more often.

In this instance, we were also hoping that an upgrade might help with Grex's
problem with crashing about once a day.  STeve believes that a weakness in
the 3.5 ethernet driver that has been fixed in 3.8 might have been the cause
of some of our crashes.

However, we've already had two crashes on the new version of the OS, so this
is clearly not the whole story.  I did some investigating of the specific
cause of one of those crashes and determined that there was a value that had
been written to memory by the CPU, that differed in one bit when it was read
back.  This is a clear hardware error.  It could be bad memory (though a
memory test we ran before the upgrade did not turn it up) or a defect in the
motherboard.  We will be starting a program of tests to identify the exact
problem in a few days.  I'm confident that we will be able to fix this.  But
it's a bit of a disappointment that the problem was not solved by the OS
upgrade.

The biggest screw-up, of course, was the loss of the /var/mail partition. 
The /var partition was backed up, and everyone thought /var/mail was included
in that backup file, but in fact, /var/mail was not mounted when the backup
was made.  The mirror scripts are not smart enough not to mirror unmounted
file systems, and they were not shut down before starting work, so that copy
was lost too.  This wasn't noticed until after the disks had been reformatted.
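
A check along the following lines, run before a backup or mirror pass, would
have caught the unmounted /var/mail.  This is only a minimal sketch: the
function names are hypothetical and are not part of Grex's actual backup or
mirror scripts.

```python
import os


def unmounted(mount_points):
    """Return the paths in mount_points that are not currently mount points."""
    return [path for path in mount_points if not os.path.ismount(path)]


def safe_to_back_up(mount_points):
    """Report every expected file system that is missing; True if none are."""
    missing = unmounted(mount_points)
    for path in missing:
        print(f"refusing to back up: {path} is not mounted")
    return not missing
```

A backup or mirror script could call safe_to_back_up(["/var", "/var/mail"])
first and stop if it returns False, rather than copying an empty directory
over the last good copy.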

Finally, it just took too long.  My goal is to get to the point where we can
do a rebuild of Grex in a day.  It took us a week.

Worse, a great deal of the delay was because we as staff really failed to work
together effectively.  We ran into deep differences in basic philosophy about
how grex should be run that cost us extra days.  Because we didn't all agree
on what we were going to be doing before we started, our preparation for the
rebuild was not complete.  We ended up redoing significant portions of the
job more than once.

For the next rebuild, we need to do the following:

  -  Agree beforehand on exactly what version we are upgrading to.  Upload
     all distribution files to Grex.  In this case, STeve had done the
     uploads, but for a different version than I believed we needed.  When
     we made the course change, we had to upload everything again, which
     cost us a lot of time.

  -  Carefully and fully update Grexdoc before shutting down Grex.  Grexdoc
     is a collection of installation instructions, scripts, patches, saved
     config files and custom source code that, in theory, contains everything
     you need to build a new Grex from the bare machine up.  I wrote it and
     used it to build the previous 3.5 Grex, but, of course, many small
     adjustments would be needed to build a 3.8 Grex.  John Remmers did a
     great job of testing it out for 3.8.  He took a spare machine he had
     and used Grexdoc to build a whole Grex, validating all procedures.
     This was great, but we missed a step.  In the year or so since 3.5
     Grex was built, staff has made many changes to Grex.  Some, but not
     all of these changes were documented in Grexdoc.  What we should have
     done BEFORE John built his test machine, was to go through Grexdoc
     and compare everything there with the current configuration of Grex,
     and make sure it all matched up.  Then it's time to check how it all
     works for the new OS.  Since we missed the update check, I ended up
     first building Grex from a slightly obsolete Grexdoc, then spending
     a day or so checking all the configuration of the new system against
     the configuration of the backup of the old system, copying those
     changes into the new system, and checking them into Grexdoc.  If
     I'd had my act together, then all of that could have been done while
     Grex was still up for users, giving much less down time.

  -  Use ALT partitions.  When I originally laid out the partitions for
     OpenBSD grex, I set up a full set of spare partitions called the ALT
     partitions.  The idea was that when we did a rebuild, we would install
     the new system on the ALT partitions, leaving the old OS on the old
     partitions.  When the new build was done, the ALT partitions would
     become the primary partitions.  The advantage of this is that if we
     ran into problems with the new OS, then we could bring Grex back up
     on the old OS while we scratched our heads and thought it over.  In
     this instance, we wanted to repartition the drives for various good
     reasons so we ended up erasing the old partitions before doing the
     install.  But not having the option of coming back up under the old
     OS really cost us.  By Sunday it was clear that (1) the upgrade wasn't
     going to fix the crash problem and (2) different staff had very
     different ideas on how to proceed with the upgrade.  If we'd had the
     ALT partitions, then the thing to do would have been to reboot on the
     old 3.5 system, and take a week or two to revise our plan and get
     our heads on the same page.  Lacking that option, and with Grex down,
     we had to proceed.  I ended up just vetoing other staff members, a
     move that was not good for staff harmony.  Next time, if we want
     new partitions, we should get a new disk to put them on.  Keep the
     old OS, at least until the rebuild is done.
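
The Grexdoc update check described in the second point above (compare every
saved config file with what is actually on the running system) could be
partly automated.  The sketch below assumes a hypothetical layout in which
the saved files sit under one directory tree that mirrors the live paths.
The real Grexdoc layout may well differ.

```python
import filecmp
import os


def config_drift(doc_root, live_root):
    """Compare each file saved under doc_root with its copy under live_root.

    Returns (changed, missing): relative paths whose contents differ, and
    relative paths that exist in the documentation tree but not on the
    live system.
    """
    changed, missing = [], []
    for dirpath, _dirs, files in os.walk(doc_root):
        for name in files:
            saved = os.path.join(dirpath, name)
            rel = os.path.relpath(saved, doc_root)
            live = os.path.join(live_root, rel)
            if not os.path.isfile(live):
                missing.append(rel)
            elif not filecmp.cmp(saved, live, shallow=False):
                # shallow=False forces a byte-by-byte content comparison.
                changed.append(rel)
    return sorted(changed), sorted(missing)
```

Anything in either list is a change that still has to be reconciled by hand
before the documentation can be trusted for a rebuild.  The sketch does not
detect files that exist only on the live system.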

62 responses total.



#1 of 62 by cross on Sat Nov 19 21:20:55 2005:

This response has been erased.



#2 of 62 by mary on Sat Nov 19 22:27:56 2005:

Thanks to all staff involved.  Jan, especially, has been very 
generous in putting in so much time and energy to keep this
community online.  Know it's appreciated.


#3 of 62 by trh on Sat Nov 19 23:59:51 2005:

Hello Staff:

Just in case you are not aware of it, ftp in and out is not working. 

Thanks,

Ahmet Toprak


#4 of 62 by aruba on Sun Nov 20 00:02:00 2005:

Yes, thanks, Jan and John and STeve, for upgrading Grex.  It's too bad it
was such a hassle.  We're all glad you pushed through to the end.


#5 of 62 by tod on Sun Nov 20 00:12:26 2005:

I'd like #0 to include "inform users of Grex beforehand as much as possible."

 However, we've already had two crashes on the new version of the OS, so this
 is clearly not the whole story.
Perhaps OpenBSD is a bad idea altogether.


#6 of 62 by khamsun on Sun Nov 20 01:08:51 2005:

grex crashes like an old cranky russian car.
That's the fun of it. Who wants to drive a boring shiny well-working
japanese car?
with grex you know you are in something full of surprises and
unpredictable, exactly like that old Pobeda you rescued from a farm yard
somewhere in Finland. 
Never drive farther than the corner without having with you some valves
and gaskets.


#7 of 62 by bhoward on Sun Nov 20 01:34:29 2005:

(Re#3: Ftp is working again)


#8 of 62 by naftee on Sun Nov 20 02:08:09 2005:

GreX has a last name : Silence


#9 of 62 by janc on Sun Nov 20 03:28:25 2005:

Dan, I didn't want to get over technical in an agora item, but the panic
that occurred was in the virtual memory system.  Free memory blocks are
marked with a hexadecimal value of "deafbeef" in the block header. 
(Funny guys these openbsd folks.)  When the virtual memory system
reclaims a free memory block with the intention of allocating it to some
process that wants memory, it checks that value in the block header.  If
it's not still "deafbeef", then something has been writing to that
supposedly free block of memory, and things are very bad, so it performs
a panic, printing a message and halting the computer.  This is what we
were getting.  The memory value was "deabbeef" instead of "deafbeef". 
I've only seen three crash messages, but two of the three I've seen,
one from before the upgrade and one from after the upgrade, were
"deabbeef" errors, though on different memory blocks.  This suggests
routine 1 -> 0 memory errors on one particular bit position across a
block of memories.  This could be a bad chip on one of the DIMMs, or it
could be motherboard fault.
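
The one-bit claim can be checked by XORing the two values quoted above.  This
is just an illustrative sketch: the helper name is made up, and bit 0 is
taken to be the least significant bit.

```python
def flipped_bits(expected, observed):
    """Return (bit_position, direction) for every bit that differs."""
    diff = expected ^ observed
    result = []
    bit = 0
    while diff:
        if diff & 1:
            # The direction is read from the expected value's bit.
            direction = "1->0" if (expected >> bit) & 1 else "0->1"
            result.append((bit, direction))
        diff >>= 1
        bit += 1
    return result


print(flipped_bits(0xdeafbeef, 0xdeabbeef))  # prints [(18, '1->0')]
```

The XOR is 0x00040000, so exactly one bit (bit 18) was set when written and
clear when read back.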

Grex is currently running with three 512Meg DIMMs.  I'm inclined to
remove two of them and see if we still have crashs.  If we do, I'd
suggest swapping the remaining one with one of the removed ones.

A nicer strategy would be if we could figure out what DIMM the memory
locations we saw errors at were located on.  But that would require more
information about the motherboard than I have been able to find.  For
what it's worth, it's an ASUS A7V8X/GBL/13.  I'm pretty sure that if you
take out two memories, you want to leave behind the one in the slot
closest to the CPU.  Though the manual I found wasn't exactly
illuminating.  It says

   To enhance system performance, utilize dual-channel feature when
   installing additional DIMMs.  Install the DIMMs in any of the
   following sequences:  Sockets 1 & 3 or Sockets 2 & 3 or Sockets 1,
   2 & 3.

Really clears up the matter, doesn't it?

If pulling memories doesn't work, then we can try swapping in our spare
motherboard.  Except that we've misplaced our spare motherboard.  Sigh.


#10 of 62 by janc on Sun Nov 20 03:29:17 2005:

Oh, please include Bruce Howard in your list of folks to thank.  He's
done quite a lot.  Thanks.


#11 of 62 by nharmon on Sun Nov 20 04:10:05 2005:

Thank you Bruce Howard.


#12 of 62 by naftee on Sun Nov 20 04:33:18 2005:

Thanks, Bruce Coward.


#13 of 62 by cross on Sun Nov 20 06:10:08 2005:

This response has been erased.



#14 of 62 by steve on Sun Nov 20 07:11:13 2005:

   I plan on getting to the hardware sometime after 1:30 to play swapping
around games with the memory.  I do not think that our memory is at
fault, but this does need to be checked.  We can remove two dimms and
still have 512M of ram which is still pretty good.


#15 of 62 by scholar on Sun Nov 20 09:46:46 2005:

When are you going to apologize for sending a false accusatory e-mail about
me to Gmail?


#16 of 62 by md on Sun Nov 20 13:46:08 2005:

1. What did he say about you?

2. What's Gmail?


#17 of 62 by scholar on Sun Nov 20 14:06:08 2005:

Dear sir.

Basically, some guy sent an E-mail saying that I was a staff member and that
my E-mail address was polytarp@gmail.com and that the guy who was receiving
the E-mail ought to send any complaints about some rude statements he made
to my address.

I had nothing to do with either of these guys.

Steve caught hold of the E-mail (I forget; maybe even because I posted a copy
of it?) and decided to E-mail www.gmail.com and have my account there closed,
claiming I was the one who had sent the original E-mail.

I lost many E-mails that were very important to me.

There're, uh, contemporaneous items about it hanging around coop.

I'll try to find them!


#18 of 62 by scholar on Sun Nov 20 14:17:14 2005:

Yeah.

Items 243, 244, 250, 259 in coop.

152 in agora52.

A bit of my silly posturing, a few sad admissions, a few hilarious moments
(like when Eric Bassey (who, by the way, does NOT have a foreskin) threatened
to have me thrown in jail, noting that Canadian authorities tend to be very
cooperative with extradition requests regarding computer fraud) and, at its
heart, a terrible example of Drex's staff at its best.


#19 of 62 by scholar on Sun Nov 20 14:23:04 2005:

NOW THAT WAS JUST RUDE, SCHOLAR.

SERIOUSLY, MOST OF THE STAFF IS JUST FINE.


#20 of 62 by scholar on Sun Nov 20 14:27:27 2005:

-bash-3.00$ tel steve
Telegram to steve on ttyp4...
Msg: You are a bad man.
SENT


I fully support that tel, though, and I'm sure it will help the situation.


#21 of 62 by scholar on Sun Nov 20 14:35:14 2005:

Like, uh.

I understand the whole thing about OH WELL YOU'VE CAUSED SO MUCH TROUBLE ON
THIS SYSTEM WHAT DID YOU EXPECT.

But I'll tell you what I expect:  SOME SORT OF INTEGRITY.

If I'm fucking around with Grex, ban me.

Fine.

I can deal with that.

In fact, I wouldn't be particularly perturbed if Grex banned me right now,
even though I haven't fucked with the system in quite a while.

However, I DON'T support sending UNTRUE e-mails to other systems that have
nothing to do with Grex and that I have not abused in the slightest.

There have been plenty of reasons to nail me over the years.  That, though,
doesn't make it right to send false E-mails to other systems in order to
inflict more 'damage' on me than they could by simply banning me from Grex,
though if you read the items I posted you'll see many people (cowards, I
think) claim that this sort of stuff is justified.


#22 of 62 by scholar on Sun Nov 20 14:41:55 2005:

Anyway, yeah.

I like having a shoulder to cry on.

And your shoulder, md.

It's so much comfier than most people's.


#23 of 62 by aruba on Sun Nov 20 15:43:58 2005:

Thanks Bruce!


#24 of 62 by naftee on Sun Nov 20 16:02:24 2005:

Are you in possession of other's foreskin, polytarp ?


#25 of 62 by eprom on Sun Nov 20 19:32:12 2005:

Thanks y'all.


#26 of 62 by scholar on Sun Nov 20 19:44:00 2005:

Re. 23:  Thanks, Bruce!

Re. 24:  I firmly believe other's foreskin ascended to the heavens and now
forms the rings of Saturn.


#27 of 62 by cross on Sun Nov 20 19:49:39 2005:

This response has been erased.



#28 of 62 by scholar on Sun Nov 20 19:55:19 2005:

I'm a bit upset that md ignored me crying on his shoulder.

Dear Mr. Expert Witness,

I would be willing to hire you for a small cash fee to testify in any legal
cases I may or may not bring against Mr. Andre.

I wonder if that will get his attention.


#29 of 62 by steve on Sun Nov 20 20:20:00 2005:

  You just try that, 'scholar'.


#30 of 62 by scholar on Sun Nov 20 20:22:55 2005:

Okay!

I will indeed try to either bring a legal case against you or not bring a
legal case against you.

No joke.


#31 of 62 by steve on Sun Nov 20 22:11:28 2005:

   Grex is now running with 512M of ram.  There are three DIMM
slots on the motherboard.  I couldn't see any markings for which
was slot 0 (of 0-2) so guessed that they'd be numbered left to
right.  That turned out to be right.  I took all three out and
put the second DIMM (slot 1) into slot 0.  Given that this is
all a crap shoot I figured that moving everything around was
the best idea.

   If Grex becomes stable we still don't know where the
problem lies, exactly.  It could be the ram.  I plan on
calling Crucial tomorrow and see about sending the two
unused DIMMs to them for testing.  While I believe the
memtest86 program results, Crucial has the best hardware
to test ram.  

   Grex has been running for almost an hour now with
a couple of programs compiling in the background.  We'll
see if it continues...


#32 of 62 by rcurl on Sun Nov 20 22:37:30 2005:

"a value that had been written to memory by the CPU, that differed in one 
bit when it was read back"

I had that problem back in 1958 or so on a Datatron.


#33 of 62 by steve on Sun Nov 20 22:38:43 2005:

   We've now been running for 45 minutes running normal
things, plus two infinite loops of compiling a 100,000
line C program.  These last two items are causing us to
swap.  Grex is definitely busy at the moment.  I'm going
to let this run as long as I can.


#34 of 62 by steve on Mon Nov 21 01:02:03 2005:

   After 20 or so minutes of a few programs compiling, I brought it
up to 10, started swapping like mad and got the load average up to
68 for several minutes.  Grex didn't crash after more than 30 minutes
of this.
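
For anyone who wants to repeat this kind of test, a load generator can be as
simple as several workers each rerunning one command.  The following is a
rough sketch, not the commands actually used on Grex: the function name and
its parameters are made up, and the command to run (a compile, say) is
whatever you supply.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def stress(cmd, workers, rounds):
    """Run cmd `rounds` times in each of `workers` parallel loops.

    Returns the total number of runs that exited with a nonzero status.
    """
    def loop(_worker_id):
        failures = 0
        for _ in range(rounds):
            done = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
            if done.returncode != 0:
                failures += 1
        return failures

    # Each thread only waits on a child process, so threads are enough here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(loop, range(workers)))
```

Something like stress(["make", "clean", "all"], workers=10, rounds=50) run in
a source directory would approximate the ten compiles described above.  On
Unix, os.getloadavg() can be polled alongside it to watch the load average.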


#35 of 62 by cross on Mon Nov 21 01:20:08 2005:

This response has been erased.



#36 of 62 by bhoward on Mon Nov 21 01:24:35 2005:

Dan, as far as I know, no one has looked into recompiling grexsoft
or anything else outside of the grexdoc structure bar possibly a
few things manually installed from the ports tree.

We're flying into Annapolis early tomorrow but I may be able to
take a look into it once things have settled down later this week.


#37 of 62 by bhoward on Mon Nov 21 01:28:17 2005:

(35 slipped in)

It looks like it was a controlled reboot and Steve logged back in
right after the system came up.  I suspect it was part of the system
memory swapping and testing he's doing this evening.


#38 of 62 by steve on Mon Nov 21 01:28:35 2005:

   Grex rebooted 32 minutes ago because I had to bring it down in order
to put it back in its little home.  Grex lives in the Attic.  I gotta
get some pictures of this place.  It definitely looks like an Attic.
Grex lives in the bottom of an empty 19in rack mount.  You can't stare
at the hardware when it's there, so it has to be brought out and hooked
up on a table next to the rack.

   The test I ran earlier tonight put real strain on the system. I've
never seen an OpenBSD system with a load average of 68.  I kept the
system stressed for at least 30 minutes, to see what would happen.
I think we're in better shape now.


#39 of 62 by naftee on Mon Nov 21 04:04:57 2005:

steVE's wife is glandular.



- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss