Grex Oldcoop Conference

Item 294: Why Grex lost its mail partition

Entered by steve on Sat Nov 19 00:43:39 2005:

   In the process of making backups of Grex, I bungled it, such
that when I thought I'd backed up the entire /var partition, a
piece was missing, namely /var/mail.  The /var/mail partition is
a separate thing from /var, and when I did the backup for /var
I thought /var/mail was mounted.

   Oops.

   Because of this we lost all the mail that was sitting there
for people.  Perhaps the least affected are the users who use
Grex for mail every day--if you dealt with new mail on Friday
or Saturday before we went down, then you probably didn't lose
much.  Sporadic users lost the most, sigh.

   I lost mail as well, so I'm afraid I know that bad feeling
when one realizes that mail is lost.

   I'm sorry.  I bungled that part of the backup.
176 responses total.

#1 of 176 by naftee on Sat Nov 19 20:02:04 2005:

That's OK, steVE !


#2 of 176 by scholar on Sun Nov 20 00:06:26 2005:

Speaking of mail,

Did you send any libelous mail about users to anyone this week?

A quote from dearly departed staff member Daniel Cross of the MARINES:

"Steve wrongly sent an email to gmail claiming that polytarp
 did something he didn't actually do.  He did not retract it even when it
 was demonstrated to him that he was wrong.  In fact, as I recall, he
 argued that he wasn't wrong, despite clear evidence to the contrary."



#3 of 176 by tod on Sun Nov 20 00:09:00 2005:

Thanks for backing up Grex when you didn't announce it was going offline for
a month.  


#4 of 176 by scholar on Sun Nov 20 00:13:06 2005:

You're welcome, Todd!


#5 of 176 by naftee on Sun Nov 20 00:16:42 2005:

wasn't it only a week ?

actually, i'm surprised how fast that week went past !


#6 of 176 by tod on Sun Nov 20 00:17:47 2005:

I actually got a lot of things done.  THANK YOU!


#7 of 176 by scholar on Sun Nov 20 00:18:06 2005:

Thanks, Jan!


#8 of 176 by ric on Mon Nov 21 16:19:33 2005:

(good reason not to rely on email provided by an organization with no paid
employees)


#9 of 176 by albaugh on Mon Nov 21 19:02:07 2005:

Yes, my mistake.  And it will be the last.


#10 of 176 by glenda on Tue Nov 22 03:16:03 2005:

We have said all along not to keep anything on Grex that you cannot afford
to lose.  I don't even keep stuff on my own website without having it backed
up on my home system (and that also gets backed up frequently).


#11 of 176 by scholar on Tue Nov 22 03:27:09 2005:

That doesn't mean Grex should go out of its way to destroy people's data.


#12 of 176 by nharmon on Tue Nov 22 03:38:33 2005:

Why not? It would prevent complacency.


#13 of 176 by cross on Tue Nov 22 03:42:11 2005:

I doubt that grex went out of its way to delete anyone's data.  They
just screwed up.  Hey, shit happens.  That said, every guide to upgrading
*I've* ever read says to back up to stable media first (i.e., a tape).


#14 of 176 by steve on Tue Nov 22 04:57:42 2005:

  You think we went out of our way to destroy people's data?  You're
delusional.

  I agree about stable media -- we had (have) copies of the partition
data in tar files that were stored in two places: Grex's /mirror on
the IDE disk, and a travelstar disk on my laptop which I tested by
ensuring the checksums of the tar files were the same in both places.
If that isn't stable, what is?

   My blunder was that I did not check that /var/mail was mounted,
a specific check for that.  I won't make that mistake again.
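
The specific check described in #14 could be automated along the following lines.  This is only a sketch in Python, not what Grex staff actually ran; the function names and the list of mount points are assumptions made for illustration.

```python
import hashlib
import os


def unmounted(mount_points):
    """Return the paths from mount_points that are not currently mount points."""
    return [p for p in mount_points if not os.path.ismount(p)]


def require_mounted(mount_points):
    """Raise before any backup starts if an expected filesystem is missing."""
    missing = unmounted(mount_points)
    if missing:
        raise RuntimeError("not mounted, refusing to back up: " + ", ".join(missing))


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large tar files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(path_a, path_b):
    """True when two copies of an archive have the same checksum."""
    return sha256_of(path_a) == sha256_of(path_b)
```

A backup script might call require_mounted(["/var", "/var/mail"]) before running tar.  Note that matching checksums only show that the two copies are identical, not that the archive was complete in the first place.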



#15 of 176 by cross on Tue Nov 22 05:03:07 2005:

Regarding #14, First paragraph; Who are you talking to?  I said you
*didn't* go out of your way to destroy people's data, just made a mistake.
Were you referring to me, or polytarp in #11?  Please specify.

What I was really getting at is that something other than tar should've
been used to do the backups.  Like, e.g., dump.


#16 of 176 by scholar on Tue Nov 22 05:06:37 2005:

I'd like to make it clear that I also didn't say Grex went out of its way to
destroy data.


#17 of 176 by scholar on Tue Nov 22 05:08:49 2005:

(What I'm trying to say here is that, even though people SHOULDN'T keep
unbacked up data on Grex, fact is they DO and the staff ought to take that
into consideration when they do things, which they obviously didn't do in this
case.)


#18 of 176 by steve on Tue Nov 22 05:19:40 2005:

   Come on Dan -- we all know that you didn't say that.  I was responding
to the eternal noise machine that infests Grex these days.

   In the case of the error I made, dump would have given the same results.
dump is useful at times, but a tar file you can rip apart with vi to extract
things.  Not so with dump.

   Re #17: if we didn't take into account that people do have data here,
why would we have even bothered to try at all?  Really, your comment here
is simply absurd.  Grex staff has had a long history of *saving* data on
bad disks, such that disaster was avoided.  This is the worst data loss
that Grex has ever had in its 14 year history.


#19 of 176 by scholar on Tue Nov 22 05:21:27 2005:

And it's because of negligence.

YOUR negligence.


#20 of 176 by steve on Tue Nov 22 06:06:46 2005:

   You really are having fun, aren't you.

   Well, I can't expect anything else from you.  You don't have the
ability to be creative, constructive or helpful.  You simply snarl
at things.

   As pissed as I am about you, I feel more compassion in the end:
you are a sad unhappy person.

   Please continue.  It is all you can do.


#21 of 176 by mcnally on Tue Nov 22 06:14:09 2005:

 Are there *any* restorable backup tapes of /var/mail?
 I had personal messages going back a long way in my spool file
 and some recovery would be better than none.  I'm sure there are
 other people in the same boat.


#22 of 176 by steve on Tue Nov 22 06:22:08 2005:

   There will be some backups on 8mm tape, which are pretty old.
I'd say at least a year?  I have the tape box.  I'll look for
the latest tape that mentions /var.


#23 of 176 by cross on Tue Nov 22 13:22:17 2005:

Regarding #18; Just clarifying.

However, dump *wouldn't* have given the same result: dump works by
interpreting the filesystem data on the raw disk devices itself, which
means that it doesn't have to be mounted (in fact, it's somewhat better
if it *isn't*).  And you *can* rip apart a dump file to pull things out
with, e.g., a text editor.


#24 of 176 by steve on Tue Nov 22 17:24:10 2005:

   Right, but since /var is separate from /var/mail, dump wouldn't have
included it.  Given the choice of tar or dump for tearing apart, I'll
take tar.  It also has the advantage of working on Windows systems.  I
should have said "not reasonably" with dump.  This gets more into
philosophical areas.  The problem was I made an error overlooking
the partitions and I don't think my error would have been different
with dump.


#25 of 176 by cross on Tue Nov 22 17:31:36 2005:

Presumably, you'd use dump on every filesystem on the system!  That's the
whole point!  /var/mail was missed because it was unmounted when you ran
tar.  Dump doesn't care; you just tell it to dump a filesystem and it does
it, regardless of whether that filesystem is online at the time.  That's
the big difference.  With tar, it *has* to be online, with dump, it doesn't
(and in some ways, it's better if it's not.  Dump favors a quiescent
filesystem).

I'm not sure, in this case, why moving the data to a Windows system could
have been useful, though I can see the portability of tar as being an asset
in more general situations.
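
The "dump every filesystem" approach in #25 can be sketched as follows.  This Python fragment only builds the commands and runs nothing; the fstab contents, device names, and destination directory are hypothetical, and only the widely available -0 (full dump) and -f (output file) options of dump are assumed.

```python
# Hypothetical fstab contents; the real Grex layout is not given in this item.
SAMPLE_FSTAB = """\
/dev/sd0a /         ffs rw               1 1
/dev/sd0e /var      ffs rw,nodev,nosuid  1 2
/dev/sd0f /var/mail ffs rw,nodev,nosuid  1 2
"""


def local_filesystems(fstab_text, fs_types=("ffs", "ufs", "ext2", "ext3")):
    """Return (device, mount_point) pairs for local filesystems in fstab text."""
    found = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[2] in fs_types:
            found.append((fields[0], fields[1]))
    return found


def dump_commands(filesystems, dest_dir):
    """Build one level-0 dump command per device; nothing is executed here."""
    commands = []
    for device, mount_point in filesystems:
        name = mount_point.strip("/").replace("/", "_") or "root"
        commands.append(["dump", "-0", "-f", f"{dest_dir}/{name}.dump", device])
    return commands
```

Because the list comes from fstab rather than from what happens to be mounted, /var/mail gets its own dump command whether or not it is online at the time.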


#26 of 176 by tod on Tue Nov 22 17:41:10 2005:

re #21
Thanks, Mike and STeve.


#27 of 176 by naftee on Tue Nov 22 19:15:45 2005:

re 20 You're pissed at polytarp about him making fun of your mistake(s) ?
I guess that is something to get mildly ticked off at, but oh well@!


#28 of 176 by ric on Tue Nov 22 19:21:18 2005:

Umm.. Steve.. stop responding to him/them.


#29 of 176 by tod on Tue Nov 22 20:22:08 2005:

Be sure to send that RAM in for the lifetime guarantee replacement, STeve.


#30 of 176 by steve on Tue Nov 22 20:29:21 2005:

  I'm going to send these two in to Crucial for testing.  Indeed they do
have a good warranty.  Never had to use it before.  We'll see.


#31 of 176 by scholar on Tue Nov 22 23:18:58 2005:

What two?!

Me and naftee?!


#32 of 176 by steve on Tue Nov 22 23:24:20 2005:

   I wouldn't bother.  We know you're defective.


#33 of 176 by naftee on Tue Nov 22 23:34:03 2005:

haha zing


#34 of 176 by slynne on Fri Nov 25 20:31:02 2005:

Ah well. I had some personal mail that I would have liked to have kept 
but I guess it isn't the end of the world to lose it either. 


#35 of 176 by tod on Fri Nov 25 21:14:58 2005:

I also noticed some items from the parenting cf have come up missing.
Can you restore them, STeve?


#36 of 176 by naftee on Fri Nov 25 23:10:47 2005:

yeah, steVE.  it's bugging us.


#37 of 176 by cross on Sat Nov 26 03:11:35 2005:

Badda bing!


#38 of 176 by tsty on Tue Nov 29 04:36:06 2005:

grex did not go out of its way to destroy any data. borg/staff went out
of its way to favor its phony  invincibility - not an unknown arrogance
inside consensus-only leaderless organizations which eschew 'outsiders'
of any stripe. 
  
grex has, sometimes, 'taken in' 'strays' but only until something can
be trumped up into a 'scandal' and then ... poof! 
  
different thoughts are prohibited inside the inner navel-gaze.
  
it's sadly predictable but no one would listen - invincible arrogance or
something similar. maybe bad history teachers helped. maybe no
systems analysts/engineers permitted
  
it's not STeve .. it's borg.
  



#39 of 176 by tsty on Tue Nov 29 04:37:02 2005:

hey, cross, how ya doin?


#40 of 176 by janc on Tue Nov 29 15:08:31 2005:

I think it is ridiculous to say that Grex lost the mail partition
because of some great organizational fault.  Sure Grex has
organizational issues, and I believe some of them contributed directly
to the excessively long down time for the upgrade, but I don't think
that the mail partition problem was particularly caused by this.

The backups were performed by STeve Andre and John Remmers.  Probably
STeve was typing and talking about what he was doing, and John was
looking over his shoulder (this is the mode they were working in when I
dropped by later).  This is a pretty good way to do things like this,
because the second person has a good chance of catching the first
person's errors.  In this case the error was subtle - I think they had
both forgotten that /var/mail was unmounted because it was done some
time before.  When they tar'ed up /var, they got a huge file.  STeve did
a partial listing of its contents to see if the right sort of thing was
in it, but he didn't look far enough.  He saved a copy on Grex's IDE
disk, and uploaded another to his laptop.  There was supposed to be an
additional safety net - the mirror drive.  Unfortunately, my mirror
scripts weren't smart enough to not mirror an unmounted drive and nobody
remembered to turn them off, so we lost the mirror copy of /var/mail too.

So we had several safety nets in place, and all of them failed.  It's
pathetic and unfortunate, but it's not an organizational failure, and
it's not a failure based on a phoney sense of invincibility.  Like all
computer professionals, we are well aware of our capacity to screw up
and take precautions to protect ourselves.  But sometimes the
precautions fail.  That's life.
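
The partial listing described in #40 "didn't look far enough".  A check that scans the whole archive for entries under a given directory might look like the sketch below.  It is written in Python with the standard tarfile module; the function name and the example prefix are assumptions for illustration, not the procedure staff used.

```python
import tarfile


def count_entries_under(tar_path, prefix):
    """Count archive members whose path lies strictly below prefix.

    The directory entry for prefix itself is not counted, so an unmounted
    mount point (archived as an empty directory) yields 0.
    """
    wanted = prefix.strip("/") + "/"
    count = 0
    with tarfile.open(tar_path) as archive:
        for member in archive:
            name = member.name
            while name.startswith("./"):  # normalise './var/mail/x' style names
                name = name[2:]
            name = name.lstrip("/")
            if name.startswith(wanted):
                count += 1
    return count
```

A result of 0 for "var/mail" on a /var archive would have flagged the missing mail spool before the disk was repartitioned.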


#41 of 176 by tod on Tue Nov 29 16:49:57 2005:

 I think it is ridiculous to say that Grex lost the mail partition
 because of some great organizational fault.
Let's step back for a second.  There was a time when it was considered polite
to notify users of intended downtime due to upgrades.  That was so users could
ensure they have their precious data moved offline if they felt the need. 
It was also so they'd know not to plan on being online at that time.
Organizational fault is written all over the last "upgrade."  The Board is
slacking off by letting a bunch of part time hobbyhorse types take the system
offline without forewarning.  The board has a fiduciary responsibility to the
members to keep the system around and available.  No accountability in this
organization, imo.  My recommendation is that the board seek some fresh blood
for the staff and also take some lessons in diplomacy and accountability.
No need to chant "volunteer organization" at me, neither.  That's a dead horse
not worth beating and everyone is tired of that excuse.


#42 of 176 by keesan on Tue Nov 29 17:25:21 2005:

Todd, why don't you start your own bbs and do it right?  Our volunteers are
not perfect, they admit to this, they accept suggestions, and they don't need
more complaints.  A polite request for a few days notice next time grex is
going down would accomplish more than #41.   Are you looking for someone to
volunteer as full-time staff?


#43 of 176 by nharmon on Tue Nov 29 17:31:00 2005:

The function of the staff should be to advise the BoD on technical
issues, implement the decisions made by the BoD, and intervene on their
own initiative in some circumstances.  I say should be, because with the
exception of a few board motions, there is not a lot that defines what
staff's duties, responsibilities, requirements, etc. are.

Ideally, there should be one person appointed by the BoD responsible for
supervising the staff. This person would be accountable to the BoD, and
the other staffers accountable to him/her.

Right now it seems there is a hash of staff members who do a good job of
working together but without any real guidance or direction. I think
finding someone with the time and drive to give direction is something
Grex needs desperately.


#44 of 176 by nharmon on Tue Nov 29 17:32:56 2005:

I think it is a shame that someone takes the time to voice their
recommendations on how to improve Grex only to be told to leave and start
their own BBS if they don't like how Grex is run.

That sort of attitude is exactly what ruins good organizations like this.


#45 of 176 by tod on Tue Nov 29 18:10:58 2005:

#42 of 44: by Sindi Keesan (keesan) on Tue, Nov 29, 2005 (12:25):
 Todd, why don't you start your own bbs and do it right?  
I'm a member of Cyberspace, Inc.  I like this BBS (when it's online.)

 Our volunteers are not perfect, they admit to this, they accept suggestions,
 and they don't need more complaints.  
How do we provide a community service without community feedback?

 A polite request for a few days notice next time grex is
 going down would accomplish more than #41.
I would request politely if I thought the Board would listen.  I suspect
that the Board is 2nd fiddle to janc and STeve's whims, though.  
Therefore, I'm using a harsher tone in hopes of a constructive response
and action.

   Are you looking for someone to volunteer as full-time staff?
I think the Board seriously should be.  The downtimes of the recent past
have been at the full mercy of staffers with Grex nowhere near the
top of their priorities.  I don't fault them for it but I do fault the
Board for not seeking additional available and willing staff.
Retaining staff should also include a bit more diplomacy in the
way it treats existing volunteers and members.

#43 of 44: by Nathan Harmon (nharmon) on Tue, Nov 29, 2005 (12:31):
 Ideally, there should be one person appointed by the BoD responsible for
 supervising the staff. This person would be accountable to the BoD, and
 the other staffers accountable to him/her.
I agree with you, Nathan.  I'd also interject that the BoD is
ultimately responsible.  As members, we should not be told to shut up
when we ask the Board why no one is being accountable for Grex.


#46 of 176 by steve on Tue Nov 29 18:25:50 2005:

   Grex has never operated with a chief staff person.  It's always been
more of a collective thing.  It's worked out at least as well as
workplaces I've been at which had an official structure.

   Tod, you know as well as I do that there isn't going to be a
full-time staff person.



#47 of 176 by mcnally on Tue Nov 29 18:39:57 2005:

>  I think it is ridiculous to say that Grex lost the mail partition
>  because of some great organizational fault.

I don't think that's ridiculous at all.

I think the mirroring scheme was set up by someone other than the person
who did the repartitioning, and the person doing the repartitioning didn't
fully understand the implications of the backup scheme, namely that
unmounted partitions aren't backed up.

Furthermore I think that there are organizational issues that led to
the mail disaster in other ways, too.  I've refrained from commenting
because I haven't had a good idea how to separate criticism of the
upgrade from criticism of the people who performed the upgrade, but I
personally think it was a very bad idea to upgrade and restore in place.
If we had a spare SCSI disk (or perhaps set of disks) [which we should
have anyway, for disaster recovery] the entire upgrade could have been
performed without ever risking the data on the disk(s) the system had
been running on.  As I understand it Grex has got a not excessive, but
still reasonable amount of money in the bank.  Perhaps we should
invest in preventing exactly this sort of behavior the next time around.

And while I would never suggest that anyone jettisoned the mail on
purpose, I suspect a contributing factor in the mail loss is that
none of the people involved depend on the mail system here in any
way that's truly important to them.  They shouldn't *have to* to
administer the system but it does tend to focus one's attention
when you've got something to lose.


#48 of 176 by tod on Tue Nov 29 18:42:02 2005:

I agree to disagree with STeve about a full-time staff person.  The Board can
at least make an attempt to find such person(s) with flexibility in
availability and accountability.  A staff of several persons with dedicated
timeslots would be ideal but needs to happen by someone taking that task as
the lead.  "We never did it before" and "isn't going to be" are empty excuses,
imo.  Why is improving Grex uptime and maintenance so painful a concept?


#49 of 176 by steve on Tue Nov 29 18:46:07 2005:

   And how do we pay for this person?


#50 of 176 by steve on Tue Nov 29 18:50:36 2005:

   Well Mike, I would have *liked* to have had spare disks for the
upgrade.  Here at work I keep entire spare machines such that an 
upgrade is done on the next machine, with data transfers done onto
the new machine, and a switchover of IP addresses.  I've had as
little as 4 seconds of downtime for such upgrades.

   But I did not succeed in getting the board to move on getting
the PC Weasel because of costs.  yes, I thought of asking for
money for at least one more 36G scsi disk, but I didn't want to
go through that, dealing about money again.

   I lost mail of mine too, Mike, so I felt the pain as well...


#51 of 176 by mcnally on Tue Nov 29 19:19:20 2005:

> But I did not succeed in getting the board to move on getting
> the PC Weasel because of costs.  yes, I thought of asking for
> money for at least one more 36G scsi disk, but I didn't want to
> go through that, dealing about money again.

Right.  Which supports my counter-argument against Jan's statement
and suggests that maybe the recent trouble points to some organizational
problems that we can remedy before the next time we reach a crisis.  


#52 of 176 by tod on Tue Nov 29 19:23:50 2005:

re #49
    And how do we pay for this person?
We can pay them double what you're getting from Grex.  ;)


#53 of 176 by steve on Tue Nov 29 19:24:39 2005:

   I'm not so sure.  The management of Grex has always been prudent
about financial things, to keep the system healthy. Perhaps one
could say there is too much of that at times, but that's life.  No
organization is perfect.  Overall I think Grex does things pretty
well, and, in spite of my thoughts on things, I'd rather have
this organization than even a lot of "professional" organizations
in terms of how they work.  This is not to say that we couldn't
stand improvement, just that we're less screwed up than most
business places, from my point of view.


#54 of 176 by steve on Tue Nov 29 20:12:12 2005:

   Re #40:  It was *my* fault that we lost the mail partition, and mine
alone.


#55 of 176 by nharmon on Tue Nov 29 20:31:19 2005:

Thank you for not displacing blame and taking responsibility Steve. I'm
sure we all know you didn't do it deliberately and will probably not
make the same mistake again. Thank you for being honest.


#56 of 176 by mcnally on Tue Nov 29 20:32:39 2005:

 re #54:  Your error was the proximate cause, but that doesn't mean that
 there weren't contributing issues that we should address in anticipation
 of future mistakes -- anytime people are involved mistakes are inevitable
 but through proper planning and procedures you can make a huge difference
 in the outcomes.

 I think you're seeing this as a discussion about blame, which is probably
 not an unreasonable way to look at it from your standpoint, especially as
 there are still a lot of people who want to discuss blame.  I'd much rather
 try to figure out how best to keep it from happening again, which requires
 some degree of understanding what happened and why, but is a question to
 which blame is pretty much irrelevant.


#57 of 176 by steve on Tue Nov 29 20:38:27 2005:

   No, I'm not looking at it from a blame standpoint, just the truth.

   I do however fully agree with you that we need to be able to look at
things and do some things better.  That's a good thing to do.


#58 of 176 by naftee on Tue Nov 29 22:04:13 2005:

hey tod.  how do you say 'fiduciary' in Romanian ?


#59 of 176 by bhoward on Tue Nov 29 22:28:28 2005:

Re#50: It should be noted for the record, as of the last board
meeting, we have reconfirmed that the cost of a PC Weasel is being
covered by an anonymous donation.

Has the order been placed?

Once this and other post-mortem discussions have run their course,
I suggest we (staff) should take the points made, write up a summary
of how we'll go about the next one, making certain the process
described will address the shortcomings identified in the last one.


#60 of 176 by other on Tue Nov 29 23:05:30 2005:

I have a suggestion.

Documentation tends to be written and then squirreled away in any of a
number of places where it may or may not ever be read or seen again.

I propose that Grex operational documentation be kept in a single file
or directory, and that the contents be tagged (XML, perhaps?) and the
relevant scripts/programs modified so that those scripts and programs,
when run, can access and echo to the screen of the calling user any
information which they should have in mind before any actions are
performed.  Ideally, they would require an acknowledgement before
continuing.

The advantages are: easy updating of documentation (all in the same
location); improvement of documentation (since it would constantly be
appearing, chances are it would be written or rewritten to better
communicate important information); and less time wasted either writing
useless documentation or because of lack of documentation where and when
it was needed.

This is a fairly easy to implement suggestion, and will make it much
easier to have new staff trained in the vagaries of Grex whenever there
is new staff to train.

This may represent a significant allocation of time, but for those who
have already spent lots of time writing documentation, it shouldn't be
hard to see why this is necessary.  It should be prioritized as highly
as any other staff responsibility including keeping the system running
and secure, because it will make both of those goals easier and faster.

Lastly, if anyone thinks they don't need to have stuff documented
because either "everyone knows it" or "I'm the only one who does this
and I know it," those are the persons who most need to be doing this.
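
The scheme proposed in #60, where a script echoes its documentation and requires an acknowledgement, could be sketched roughly like this.  The example is in Python and keeps the notes in a plain dictionary rather than the tagged (XML) files suggested above; the script name "backup-var" and the note texts are hypothetical.

```python
# Hypothetical notes keyed by script name; a real setup might load these
# from tagged files in a single documentation directory instead.
NOTES = {
    "backup-var": [
        "Confirm /var/mail is mounted before archiving /var.",
        "Turn off the mirror scripts while any partition is unmounted.",
    ],
}


def notes_for(script_name, notes=NOTES):
    """Return the list of notes registered for a script (empty if none)."""
    return list(notes.get(script_name, []))


def confirm_notes(script_name, ask=input, show=print, notes=NOTES):
    """Show each note and require an explicit 'yes' before continuing.

    Returns True only if every note was acknowledged.
    """
    for note in notes_for(script_name, notes):
        show("NOTE: " + note)
        if ask("Type 'yes' to acknowledge: ").strip().lower() != "yes":
            return False
    return True
```

A maintenance script would call confirm_notes("backup-var") at startup and exit if it returns False.  Passing ask and show as parameters keeps the prompt testable without a terminal.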


#61 of 176 by other on Tue Nov 29 23:10:25 2005:

By the way, this scheme easily allows for pointing to additional tools
and documentation to supplement the echoed information.  

For that matter, both tools/scripts and documentation might be collected
in a keyword searchable database (using the same tagged source
documents) for anyone needing to know how to perform a certain function
on the system.

The more of this kind of thing that gets done, the less the system is
dependent on a few individuals with highly specialized knowledge to do
most of the things necessary to keep the system running properly.


#62 of 176 by tod on Tue Nov 29 23:28:57 2005:

re #58
 hey tod.  how do you say 'fiduciary' in Romanian ?
demn de incredere


#63 of 176 by steve on Tue Nov 29 23:41:10 2005:

   We have a good start at documentation in the /grexdoc directory
and in the staff conference.  Both need more work, but we do have
a good start for it.


#64 of 176 by aruba on Wed Nov 30 01:14:56 2005:

Re #59: I haven't ordered the PC Weasel yet, but I will soon.  Someone needs
to find out from Provide Net what it's going to cost us per month to have a
separate machine running, which is, as I understand it, what we will need in
order to make the PC Weasel work.

Tod said the board should be trying harder to get more staff.  Well, I'm not
on the board right now, but I think I speak for them when I say, they're
open to suggestions.


#65 of 176 by steve on Wed Nov 30 01:39:16 2005:

   I'll send mail to John A again about the cost.


#66 of 176 by glenda on Wed Nov 30 02:02:21 2005:

Re #41:  You keep saying that it would have been nice if notice had been given
before the upgrade.  It was in the motd for several days beforehand that the
upgrade would happen that weekend if at all possible.  The upgrade happening
as soon as STeve got everything together and could get with John was discussed
in at least one item for a couple of weeks before it was done.  How much
warning do you need, or did you expect a personal email?


#67 of 176 by keesan on Wed Nov 30 02:25:51 2005:

The problem with messages in the motd is that people forget to change them
when they get outdated so we tend to ignore them.  If there were only relevant
messages there I would read them.  I don't really want to know that grex was
down two weeks ago for a day.


#68 of 176 by nharmon on Wed Nov 30 04:04:01 2005:

Things like maintaining the MOTD are tasks to give to people who want to
join staff as a way of seeing how they handle it. Start him/her off
here, and then go up from there.


#69 of 176 by glenda on Wed Nov 30 04:09:37 2005:

If not the motd, where?  It was also discussed in at least one item here and
in Agora.  Short of sending email to every account on Grex, what are we
supposed to do?  I manage to glance at the motd every time I log on enough
to notice if something new is posted.  It only takes a couple of seconds, it
isn't that long and it has been rather up to date lately.  If you choose to
ignore it, that is more your problem than it is staff's.  Yes, I agree that
outdated things should be removed, but let's get real here.


#70 of 176 by nharmon on Wed Nov 30 04:11:17 2005:

What would be the impact of sending an e-mail message to every account
on Grex?

OR, better yet, what about an opt-in mailing list for people who would
like to get system announcements?


#71 of 176 by steve on Wed Nov 30 04:31:43 2005:

  Now *that* is a good idea, a mailing list for announcements of system
work, downtime, etc.  Excellent.

  The impact of staff sending out mail to every account on Grex would
be 1) to take about 20 minutes of system pounding to deliver about
29,000 emails, 2) consume about 50M of /var/mail space, and 3) would
likely generate a couple hundred emails back with 1/2 asking if this
was real, and the other half asking about why and when the system
would be back up, regardless of what we said in the mail. ;-)
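
The figures in #71 imply a per-message size and a delivery rate, which the short Python calculation below works out.  It assumes "50M" means 50 MiB and takes the 20 minutes and 29,000 accounts at face value; these are rough estimates from the post, not measurements.

```python
ACCOUNTS = 29_000                 # "about 29,000 emails"
SPOOL_BYTES = 50 * 1024 * 1024    # "about 50M", assumed to mean 50 MiB
DELIVERY_SECONDS = 20 * 60        # "about 20 minutes"

# Average space each announcement would take in /var/mail.
bytes_per_message = SPOOL_BYTES / ACCOUNTS

# Average delivery rate the mail system would need to sustain.
messages_per_second = ACCOUNTS / DELIVERY_SECONDS

print(f"{bytes_per_message:.0f} bytes per message")
print(f"{messages_per_second:.1f} messages per second")
```

That comes to a little under 2 KB per message and roughly two dozen deliveries per second.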


#72 of 176 by naftee on Wed Nov 30 05:08:01 2005:

 ;)

GreX should provide an escort service


#73 of 176 by bhoward on Wed Nov 30 12:20:11 2005:

Re#67 Sindi, we can certainly remove motd messages more aggressively.
Notices for things such as recent outages tend to stay in motd for
at least a week to ensure that folks not regularly logging in or
reading the conferences still will have some idea why the system
may have recently crashed or otherwise been unavailable.

A week's notice for major notices is a (hopefully) reasonable balance
between those who log in daily and those who hit the system at least
weekly (arguably it is a balance between those who only need to be
told once and those for whom the message may not register until
they've seen it several times).


#74 of 176 by steve on Wed Nov 30 13:26:37 2005:

   Sorry Bruce but I don't think we should commit to that.  Access
to Grex's hardware is simply too limiting.  Back when Grex was starting
to crash every day some months ago, I wanted to get to Grex and do
things for a week, every day, and simply couldn't get there in time
to be able to do anything with the 10pm curfew we live under now.

   Yes, it's a *good thing* to give advance notice on shutdowns, I
fully agree.  But let's not lock ourselves to it.


#75 of 176 by bhoward on Wed Nov 30 14:08:15 2005:

Steve, I was referring to Sindi's complaint that motd has messages
about past crashes and outages too long *after* the event.  I made
no comment as to how much warning there should be before there is
an outage.

I think our current routine of announcing system down time several
days in advance for scheduled downtime, and best effort warning for
anything else is sufficient.

On a related note, I think we should commit to updating the hvcn
page with current system status *before* commencing any system work,
emergency or otherwise, that will keep grex down or unavailable for
more than a few minutes.


#76 of 176 by steve on Wed Nov 30 14:51:32 2005:

  Sigh.  OK, upon rereading this I see what you mean.  Don't type
before coffee should be my mantra on these rare days when I'm up
before 8am.   Yes, putting an announcement on the hvcn page is
something we need to do.


#77 of 176 by tsty on Wed Nov 30 14:55:26 2005:

an operation as large as an os upgrade ought to have had a 
written checklist - and that checkllist could have been discussed
looong before hand in public, agora &/or coop.
  
but then shoulda/coulda/woulda only has recrimination value after the fact.
  
since this type of operation isn't about to happen too often, the
disaster is just  lurking around - AS ALWAYS - waiting for memories
to fade .
  
the previous upgrades were not as complicated, had much more notice,
and were thought through with more precision - might even have had
a written checklist handy!
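
A written checklist of the kind #77 asks for can also be made executable.  The sketch below, in Python, runs a list of named checks and reports the ones that fail; the helper name and the sample items in the usage note are assumptions for illustration only.

```python
def run_checklist(items):
    """Run (description, check) pairs and return descriptions of failed checks.

    Each check is a zero-argument callable returning a truthy value on
    success.  A check that raises is treated as a failure, not a crash.
    """
    failures = []
    for description, check in items:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failures.append(description)
    return failures
```

An upgrade script could build items such as ("/var/mail is mounted", lambda: os.path.ismount("/var/mail")) and refuse to proceed unless run_checklist returns an empty list.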
  



#78 of 176 by tsty on Wed Nov 30 15:34:05 2005:

re hvcn ... i went there for info but had to call mary and ask where
it was on hvcn .. there was nothing (at that time) that would have
led anyone to know where to click --- unless you already knew in 
advance and book marked it.
  
re  #74 -- what 10pm curfew???   i thought 24/7/365.25 was the deal?
  
we got out of ken's wharehouse for the same curfew problem, adn now we
are back into another curfew?  guess i wasn't payig enough attention.
  
btw, back on the checklist thought ... at least those with military
training wold have, by default, created their own instructions sheet.
  
not that you have to have had military training to figger that out
but it helps. and systems engineering 101, remedial, would have demanded
a check list ... system analysis 099, non-remedial, would have had
checklist provisions built-in to the coourse.
  
hell, the repetitive event sequence of starting up an airplane is
done by two pople with a checklist! 
  
hell, back when *i* halted grex, by accident, i wanted that non-existent
checklist to provide some thoughtful path. wasn't one. left grex 'as is'.
nothing was damaged, nothing was lost except my staff responsibilities - 
backups (how ironic). at least one of the ppl (me) who foresaw precisely
this disaster *sometime* in the future and volunteered to backstop it
from being the disaster it now is, was 'offed' and dissed in the process.
  
future borg and future staff *could* have an agenda item: monthly
backup accomplished? checklist agenda item: yes/no 
  
first things first. secure your environment, what is contemporary, before
wandering off into the unknown future with NO ROUTE HOME IN PLACE.
  
it is only in the last few years that i no longer enter a new 
environment without already knowing THE OTHER WAY OUT, just in case.
 
catastrophe theory (my masters subject) says, 'you can't get back from here.'
  
therefore you prepare everything so that you will never *HAVE* to go
back where you can't get to. 
 
if you can't get back and you can't cover your ass, you fscking STAY PUT
until another path is found/created to prevent exactly this sort
of catastrophe.   my explanation of that, simplified, back when i
halted grex was apparently unintelligible to the recipient(s).
 
or, it was forgotten over time, which is more of what i think, just
damn forgotten. blithely erased from the cache of collected wisdom.
  
not much cache in that account anymore, eh?


#79 of 176 by steve on Wed Nov 30 18:06:36 2005:

   I'm not really sure how to respond to what you are saying.  I
don't think you understand the nature of the upgrade, in that 
there was no path back.  The op system had problems with both
the networking card and filesystem issues, and as an extra
treat hardware problems.

   This all dances around the critical issue that no one should
keep valuable mail on Grex only.  Keeping valuable mail in the
/var/spool area is even worse; it's an active filesystem, the
most active one on Grex and as such is more prone to failures
than anything else.


#80 of 176 by naftee on Wed Nov 30 18:44:03 2005:

I don't think anyone can coherently respond to one of tsty's posts :(


#81 of 176 by ric on Wed Nov 30 19:11:14 2005:

(hah)


#82 of 176 by aruba on Thu Dec 1 00:23:09 2005:

Re #79: You know, STeve, I just don't think that's a good enough answer. 
Yes Grex is run by volunteers, and yes people can't expect the same
accountability from Grex that they can from someone they're paying to keep
their data safe.

But if Grex is to be anything people care about, then the board and staff
have to themselves care enough to do their best for Grex.

Frankly, I'm glad some people are really pissed about losing their mail.  I
wish more people were.  If no one gave a damn, well, then Grex would really
be nearing the end of its life as a viable community.

I agree with tsty that backing up the system ought not to be an ad hoc thing
that someone works out as he goes along.  There ought to be a procedure in
the GrexDoc which gets followed each time.  If the system changes so that it
gets done differently, then the changes ought to go into the documentation.
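
(A minimal sketch of one step such a written procedure might include,
assuming the backup is driven by a script.  It refuses to start unless
every filesystem the backup is supposed to cover is really mounted, which
is the slip described at the top of this item.  The function names and the
example paths are hypothetical and are not taken from the GrexDoc; the only
real library call is Python's os.path.ismount.)

```python
import os


def unmounted(required, ismount=os.path.ismount):
    """Return the paths in `required` that are not mount points right now."""
    return [path for path in required if not ismount(path)]


def preflight(required, ismount=os.path.ismount):
    """Raise before any backup runs if an expected filesystem is missing."""
    missing = unmounted(required, ismount)
    if missing:
        raise RuntimeError(
            "not mounted, refusing to back up: " + ", ".join(missing)
        )
```

(A backup script could call something like preflight(["/var", "/var/mail"])
as its first step; the `ismount` parameter exists only so the check can be
exercised without touching real partitions.)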


#83 of 176 by steve on Thu Dec 1 02:15:17 2005:

   And you don't think I care about Grex?  Good God Mark, I made a mistake.
WHERE do you think that I a) don't feel badly about missing it, and b)
that I don't care?


#84 of 176 by aruba on Thu Dec 1 04:02:40 2005:

I didn't say that or mean it, STeve.  But I don't think telling people 
that they shouldn't ever store anything that matters on Grex is a good 
enough answer.  I believe Bruce when he says the mistake you made could 
have happened to anyone.  *But I don't think it should have been you and 
John alone in the room doing the backup and upgrade*.  I think there 
should be an institutional procedure in place for these things, so that 
the collective knowledge and experience of all the staff is brought to 
bear on the procedure.  Grex shouldn't be dependent on one person's 
judgement.

And actually, I thought there *was* a procedure in place, in the person of 
the GrexDoc.  And I thought you promised the board you would follow the 
GrexDoc when you did the upgrade.

Maybe the doc wasn't complete - I don't know if it covered the backup
part of the upgrade.  Did it?


#85 of 176 by scg on Thu Dec 1 04:17:08 2005:

I haven't logged into Grex in the last year or two, but I've been
lurking on the staff mailing list.  If I lost anything in this backup
error, it wasn't anything I cared about.  That said, I, too, am somewhat
puzzled at the procedure that was followed.

I got my start doing Internet stuff as a member of the Grex staff more
than ten years ago, so I remember the constraints we had to work under
then.  Grex was a rare piece of ancient Sun hardware, disks were really
expensive, and none of it was any more reliable than most of the other
stuff running on the Internet in those days.  When something needed to
be done, it often meant taking the system offline, sometimes for a full
weekend in the case of a few major upgrades or disk crashes.  We had a
much bigger staff back then, and for many of us whose social lives
revolved around the Grex community it was a pretty high priority, so
when something needed to be done there were typically lots of people
around to work on it.  I can certainly see how doing things as we did
them then, but with a smaller and less focused on Grex staff, would lead
to long periods of downtime.

But I'm puzzled about why I see the same methods being used on Grex now,
when hardware is considerably cheaper and staff time appears to be a
much scarcer resource.  My perspective is arguably a bit skewed.  The
non-profit where I'm now a paid full-time staff member is pretty
impoverished, but still has a budget a couple of orders of magnitude
higher than Grex's, and I tend to come at systems stuff as a manager
rather than as a hands-on sysadmin these days.  Still, it doesn't look
to me like the problems that are being talked about here are difficult
to solve.

If I recall correctly, Grex is now running on PC hardware that's at
least two or three years old.  In other words, getting some equivalent
systems should be cheap (or free, given that that's replacement age at a
lot of places, and Grex is 501(c)3).  Installing new software versions
on new hardware, testing, and then copying over whatever is dynamic at
the last minute, seems pretty obvious.  Falling back to the old system
at that point if something doesn't work is at most a matter of moving an
ethernet cable.  Likewise, having spare systems ready to copy whatever
is dynamic onto is a good way of dealing with hardware failures.

This really, I think, comes down to whether anybody still cares enough
about Grex to make it worth dealing with.  My own view is that the
community I once cared about seems to have gone on to other things, and
the services Grex is providing aren't anything special anymore.  But if
people care about keeping Grex operating, it looks like something needs
to change.



#86 of 176 by eprom on Thu Dec 1 05:58:04 2005:

I'm upset about the mail debacle too but unfortunately, anytime any 
constructive criticism is offered, it's somehow taken as a personal 
attack on that person.

Back when I was in the AF, every section had a recall roster, and a 
binder with documentation of a basic contingency plan and checklist. 
I think that would be a good idea for grex. Too bad that would be
construed as tying staff's hands or micromanaging. 

I also think the hierarchy of operation should change from every staff
operating as equals to someone volunteering as a main sys-admin, who
is accountable and reports to the board. It seems that the current way
of doing things is broken.


#87 of 176 by krj on Thu Dec 1 18:29:52 2005:

I think SCG (hi, Steve!) hits a few important points in his resp:85.
 
In particular, I would like to stress the need to evolve our thinking
from "This computer is Grex," to "This internet service is Grex, and 
these are the hardware components we have to support our service."
Right now, *everything* is a single point of failure for Grex, 
and as we just learned, staff can't back out of an upgrade because
the upgrade is done on top of the old disks.
 
Amazon.com and LiveJournal don't go dark for a week while they 
do upgrades; they acquire the hardware they need so that upgrades
can be rolled into production with a minimum of disruption.

----

Longer term: all of the community-building services Grex offers
are now offered, for free, by large organizations with professional
support staffs.  The one thing which isn't common is the open access
to a shell prompt; but that's also one thing which creates huge 
social/behavior management problems.  It's also unclear to me if that's 
a core function of Cyberspace Communications as it was organized,
rather than the tool towards the community-building goals which 
was available 14 years ago.


#88 of 176 by nharmon on Thu Dec 1 18:43:39 2005:

Amazon.com and LiveJournal have massive capital investments in hardware
and engineers that Grex simply can not and will not ever provide.
Further, the financial impact of outages is different for Grex than it
is for those two.


#89 of 176 by mcnally on Thu Dec 1 18:59:28 2005:

 You're right that Grex doesn't have as much money to spend as Amazon
 or LiveJournal, though I don't think that point escaped anyone even
 before you explicitly stated it.  A more salient point is that Grex
 has enough money in the bank to afford a backup disk.  We just didn't
 plan to use it for that. 


#90 of 176 by nharmon on Thu Dec 1 19:13:45 2005:

...Or the money could be spent on a colo that would give us 24/7 access
to the machine, thus giving staff a larger window to recover from outages.

You see, I think this is the sort of direction that some have been
saying Grex lacks. We're not sure what takes precedence.

Another suggestion:

Grex has security goals, why not have overall system goals? Maybe even a
mission statement? These goals could be put in order from most important
to least important...they could be things like: "Maintain a conference
system void of censorship", or "provide for limited dialup internet
access in the ann arbor area", or "provide for user data integrity
through fault-tolerant disk storage and regular backups".

Then, when it came to making decisions on expending resources, everyone
would be on the same page as to what problems took priority.


#91 of 176 by mcnally on Thu Dec 1 19:58:20 2005:

 I know that because of his work hours and long commute, physical access
 during the day and the early evening is not feasible for STeve, but when
 24/7 access is suggested nobody ever says who's hypothetically going to
 be fixing the system at 3 AM, so I'm not sure access hours are the real
 issue.


#92 of 176 by ric on Thu Dec 1 20:29:10 2005:

I care about Grex and M-Net (for different reasons).

And I still think that anyone who uses either system with the expectation that
their files are safe OR secure is a fool, and I don't have any sympathy for
people who lost important email they had stored on grex.


#93 of 176 by mcnally on Thu Dec 1 20:35:15 2005:

 Won't ric be surprised when he finds out I used my staff access to
 delete his home directory, conference participation files, and uid!

 Just kidding, of course, but if he thinks users shouldn't expect their
 e-mail to be safe from sudden disappearance I'm not sure what else on
 the system ought to be sacrosanct..


#94 of 176 by nharmon on Thu Dec 1 20:38:27 2005:

If most of the users agree with you mcnally, then that should be one of
Grex's goals.


#95 of 176 by krj on Thu Dec 1 20:57:16 2005:

Mike in resp:91 :: before Grex left the Pumpkin, there were numerous
times when I dropped Steve off there after we got back from work, 
and he worked on Grex for some hours in the very late evening or 
early morning.


#96 of 176 by glenda on Fri Dec 2 02:14:09 2005:

I seem to remember a few times that you dropped him off at the Pumpkin when
you got back into town, and picked him up there in the morning to go back to
work.

For those advocating having an equivalent system for doing upgrades and
recoveries:  where do we store it?  The colo charges for space.  If a staffer
stores it we still have problems with access unless that staffer is the ONE
doing the upgrade/recovery.


#97 of 176 by ric on Fri Dec 2 15:22:23 2005:

re 93 - I would be surprised, and I'd probably ask for your removal if you
did that on purpose without good reason.  But it wouldn't really bother me
much.  I'd just create a new account.  I participate in two conferences - coop
and agora.  And I have used the forget statement on all but one item in agora.

I don't have any files in my home directory that are important to me.

The only thing that might upset me is if I was unable to get my username "ric"
back, since I'm pretty much been known as "ric" in the mnet/grex world since
1986.

(Though I think there was a period of time in the mid 90s where someone else
had that ID on Grex cuz I got reaped)


#98 of 176 by slynne on Sat Dec 3 00:10:47 2005:

Even though we are just a small organization, there is nothing wrong 
with us doing the best we can in all situations. I also think that 
criticism is ok although I sometimes think that some people around here 
have trouble presenting their criticism in the best possible way. It is 
pretty easy to start feeling defensive about things. 

As for the email loss. It was a mistake. It can't be undone and that is 
that. No one did it out of malice. And even the most competent 
technical people make mistakes sometimes and email sometimes gets lost 
even at for-profit firms. 

As for what we can do to prevent such a loss in the future...Well, 
there are a lot of good ideas being presented here. I don't know what 
the answer is. Our finances aren't great and I know that there is a 
reluctance to spend a lot of money. However, exploring backup options 
is really something we should do. 


#99 of 176 by keesan on Sat Dec 3 05:06:45 2005:

I just found the info that I had saved in a recent email and it is actually
nice not to have to go through all 200 or so old mails deciding if there was
anything important in them, so I am actually grateful now, and pine starts
up so much faster with an empty inbox.  I wish spamassassin would work again.


#100 of 176 by bhoward on Sat Dec 3 07:56:37 2005:

Have you tried in the last few days?  I reinstalled spamassassin and spamd
a day or so ago.


#101 of 176 by keesan on Sat Dec 3 14:41:00 2005:

I had been using a copy in someone else's account, because he said he updated
it more often.  I will switch to the grex version, thanks.  I had gone back
to my old filter, which is about 10 pages long and lets some things through.


#102 of 176 by tsty on Sun Dec 4 07:35:11 2005:

re #79 ...   excuse me! it would seem, that *i* understand "the nature of the
upgrade," one helluva lot better than either you or other staff or other borg!

shit!  
  
"there was no path back" --- that is *precisely* the sysadmin situation 
for which i have been * t r a i n e d * !!
  
whether it is an air defense missile system or a fscking os upgrade - the 
cover-your-ass attributes are identical. 
  
somewhere along the line i copied this:
  
Worse, a great deal of the delay was because we as staff really failed to work
 together effectively.  We ran into deep differences in basic philosophy about
 how grex should be run that cost us extra days.  Because we didn't all agree
 on what we were going to be doing before we started, our preparation for the
 rebuild was not complete.  We ended up redoing significant portions of the
 job more than once.

i don't know, at this moment, where it came from, but i did *not* write it.
  
some rooty-tooty (not sTeve) did .. and borg & staff are intimately
responsible for the fsck-up.
  
mostly borg!
  
  'in-place' .... WTF!
  


#103 of 176 by cross on Sun Dec 4 15:29:31 2005:

I believe that Jan said that.


#104 of 176 by naftee on Sun Dec 4 16:33:30 2005:

StEve
steVE
sTeve
STeVE


#105 of 176 by cross on Sun Dec 4 17:35:23 2005:

Quick!  What's 5 choose 2?  Answer: (5!)/((5-2)!(2)!) = (5!)/((3!)(2!)) =
(5*4)/2 = 5*2 = 10.  Think of the possible capitalizations of the
letters in Steve's name this way: Given a string of 5 characters, taken
from the alphabet {0, 1}, how many ways may I write such a string with
exactly two 1's?  Clearly there are 5 choose 2 such ways, and as we have
seen, that means 10 possibilities.  Now, I take a 1 to mean a capital
letter and a 0 to be a lowercase letter and enumerate:

STeve = 11000
StEve = 10100
SteVe = 10010
StevE = 10001
sTEve = 01100
sTeVe = 01010
sTevE = 01001
stEVe = 00110
stEvE = 00101
steVE = 00011

These make up the set of capitalizations of Steve's name with his preferred
number of capitals (though his preferred choice is one specific element).
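
(The enumeration above can be reproduced mechanically.  This is a small
sketch using Python's standard itertools.combinations and math.comb; the
function name `capitalizations` is made up for illustration.  It picks
every set of 2 positions out of 5 and capitalizes exactly those letters.)

```python
from itertools import combinations
from math import comb


def capitalizations(name, n_caps):
    """Return every way to write `name` with exactly `n_caps` capital letters."""
    lowered = name.lower()
    results = []
    # Each combination is the set of positions that get a capital (the 1 bits).
    for positions in combinations(range(len(lowered)), n_caps):
        chars = [c.upper() if i in positions else c
                 for i, c in enumerate(lowered)]
        results.append("".join(chars))
    return results


variants = capitalizations("steve", 2)
# The count matches 5 choose 2.
assert len(variants) == comb(5, 2)
```

(The variants come out in the same order as the table above, starting with
STeve and ending with steVE.)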


#106 of 176 by glenda on Sun Dec 4 21:22:52 2005:

His preferred choice came about by accident.  When he first started using
conferencing systems, he didn't release the shift key fast enough.  He went
to the National Computer Conference and while in a conversation someone asked
him if he was the S T eve.  He laughed and replied that he was and they had
a great time talking.  He decided that it was a good thing to keep and has
used it, purposely, ever since.


#107 of 176 by scholar on Sun Dec 4 21:57:42 2005:

I purposely pervert his choice by being a wiseguy and taking the other oddity
of his name ('), which is applied to his last name, and applying it to his
first name.

I didn't do this because I had a great time talking.


#108 of 176 by naftee on Sun Dec 4 22:49:51 2005:

re 105 I prefer using my calculator to solve those types of problems, but
really; i was just goofing around !@


#109 of 176 by sholmes on Mon Dec 5 08:19:13 2005:

that's also the number of handshakes in a party with 5 ppl, if everyone shakes
hands with everyone else.


#110 of 176 by cross on Mon Dec 5 17:57:23 2005:

That's true.  Think of each string as one handshake, with the two 1 bits
marking the two people shaking hands.


#111 of 176 by janc on Mon Dec 5 21:50:17 2005:

I agree 100% that we shouldn't have rebuilt the system by overwriting
the old disk partitions.  One of the recommendations I made in my
post-mortem item immediately after the new system came up was to never
do that again.  Alas, I did not make that recommendation before the
rebuild - though that was certainly part of the upgrade method defined
in Grexdoc - that's why the ALT partitions exist.  But I'm really not an
experienced system administrator anyway.  I'm not sure that the need to
avoid a destructive rebuild was as clear in my head before this fiasco
as it was afterwards.  Live and learn.  In any case, I wasn't around to
give any recommendations.

Before the upgrade, John was really the only active staff member.  He
was doing the reboots.  He was debugging grexdoc on another machine.  He
was reluctant to undertake the rebuild by himself though.  My impression
was that there was something of a panic at the board meeting.  Grex was
crashing regularly, and there wasn't much of a staff plan to do anything
about it.  STeve, a board member and a staff member, responded to the
emergency by committing his next weekend to a Grex upgrade.

I had been neglecting Grex so completely that I didn't even know about
it until I talked to John and Mary on the Grex walk the morning before
the upgrade. There was never really any staff meeting to discuss the
upgrade.  If there had been, we might have given it enough thought to
realize that there were alternatives to doing a destructive rebuild.  In
fact, I think we have a spare (rebuilt) 18G drive lying around.  I
think with that we could have managed the rebuild without buying a new
disk.  But buying a disk would have made sense too.  We rushed into the
upgrade.  It felt like Grex was in crisis.  If we had held a staff
meeting first, I'm not sure anyone except John would have shown up.


#112 of 176 by tod on Mon Dec 5 22:45:15 2005:

 Tod said the board should be trying harder to get more staff.  Well, I'm not
 on the board right now, but I think I speak for them when I say, they're
 open to suggestions.
Could have fooled me.  I see nothing but excuses being made and "we do/did
ENOUGH already"
Excuse me for asking for something more than an MOTD, decent backup, and
effort to find staff with more availability.  How dare me for making
suggestions. Shame shame.


#113 of 176 by naftee on Mon Dec 5 23:07:03 2005:

shame on you.


#114 of 176 by tod on Mon Dec 5 23:50:42 2005:

THanks Michael Moore!


#115 of 176 by naftee on Tue Dec 6 00:23:02 2005:

thanks tod :L)(


#116 of 176 by ric on Tue Dec 6 15:03:31 2005:

re 100 - I'm not suggesting that staff doesn't do the best they possibly can
to avoid email loss and other such things.  I'm suggesting that we as users
should not expect or demand anything more.  The fact is, if this were a
commercial organization, there would be daily tape backups, stored off site,
our hardware would probably be more "enterprise" level and all sorts of such
things - policies in place to prevent such occurrences, and paid employees
whose PRIMARY responsibility is maintenance of the server(s).

Grex is nobody's primary responsibility.  I'm pretty sure it's nobody's
secondary responsibility - at the very best, I would expect Grex to come
somewhere after job and family.


#117 of 176 by tod on Tue Dec 6 16:45:37 2005:

 Grex is nobody's primary responsibility.
Grex is the fiduciary responsibility of all elected volunteer board members.
If someone is not willing to be responsible for Grex's operation, they
shouldn't be on the Cyberspace board of directors.  It's that simple.


#118 of 176 by mcnally on Tue Dec 6 17:29:26 2005:

 re #117:  Are you seriously arguing that the board has an *obligation*
 to ensure that Grex is run at the same level of service and reliability
 as a commercial service?

 If not, what *does* your statement imply?


#119 of 176 by tod on Tue Dec 6 18:52:23 2005:

re #118
Obligation: "ensure that Grex is run"-ning for such purposes as "public
education and scientific endeavor through interaction with computers, and
humans via computers, using computer conferencing.." because "The Corporation
assumes all liability to any person other than the Corporation or its members
for all acts or omissions of a volunteer director incurred in good faith
performance of their duty as an officer"

I'm not saying people are going to get sued or that businesses will crumble
as a result of downtime.  What I am saying is that "good faith effort" should
be a minimum goal of any director of Cyberspace Communications when assuring
Grex stays online and maintained.


#120 of 176 by other on Tue Dec 6 21:37:25 2005:

And who says that it isn't? You're talking about an obligation which is so
vaguely defined that in legal terms, someone would have to be actively
subverting the system or sabotaging it to be provably NOT complying with your
demand. It is a volunteer organization, with a volunteer staff, and a volunteer
board. As such, the reality is that it will get whatever benefits of goodwill
it gets in terms of money and time, and that's it. You can't make it something
it isn't, and something that it isn't is a service with the possibility of
being held to the standard of performance of a commercial service provider with
contractual obligations.


#121 of 176 by tod on Wed Dec 7 01:18:52 2005:

re #120
I'm not making demands.  I'm simply reflecting on the current status of
Cyberspace.  A status that lacks some leadership in the management of
Grex when it craps out.  Is that so much to ask?
These cries of pay-for-service levels are spin.
We've had numerous outages and waited days on end before someone could
get to Grex.  And then when they did, it was ad-hoc, and files were lost.
I'm simply looking for a lil assurance from the Board that somebody is in
charge and that everyone knows who that is.
Who is accountable next time Grex goes offline for a week?  Answer that.


#122 of 176 by cross on Wed Dec 7 02:26:29 2005:

No one.  It's all volunteer.  But then, you're saying that's a problem (and
so it is).


#123 of 176 by slynne on Wed Dec 7 04:27:15 2005:

It is a problem but it isn't one I see an easy answer to. I am not going
to demand that a volunteer give more time than they offer to give. I try
to remember to let them know I appreciate their efforts but I am
admittedly not the best at that. I do really appreciate all the volunteer
time that goes into running this place though. And frankly, if someone
with more energy than me were to step up to do a better job, I would
gladly step out of their way to let them do it. 

So who is accountable the next time Grex goes offline for a week? I don't
know. Whichever staff person steps up. We are pretty lucky that we have
anyone at all really. Maybe next time no one will do anything and then
the board will have to scramble to figure something out although I hope
it never comes to that because I honestly don't have any idea what I
would do in such a situation. 



#124 of 176 by tod on Wed Dec 7 04:29:22 2005:

I refuse to believe that Cyberspace's elected directors can't do a better job
with staffing Grex.


#125 of 176 by nharmon on Wed Dec 7 04:32:05 2005:

The organizations I volunteer with do not accept the excuse "I'm
only a volunteer". If you ask me, that's a learned attitude in an
organization.


#126 of 176 by naftee on Wed Dec 7 05:08:27 2005:

lol do u volunteer for gay fags 4 america

lol u probably do and ur excuse is im straigt lol


#127 of 176 by other on Wed Dec 7 12:44:01 2005:

When it is necessary for your volunteers to bring with them a certain and
specific skill set, and there are not large numbers of people from which to
choose to fill volunteer positions, then you have to accept less commitment.
That's just reality. You can't change it by wishing it away or declaiming it.
The other thing is that just because Grex has been more stable in the past than
it has been recently doesn't mean anyone was any more accountable then or that
anything has changed in the organization. The only thing that is substantively
different is that the machine is less accessible when it is convenient for
those who can do something with it to do so, and those volunteers who are able
to do something may be less available for whatever reason now than they may
have been in the past. This too may pass. Bitching about the situation and
blaming the existing volunteers for having lives and responsibilities other
than Grex only serves to make those volunteers feel less like the efforts they
do make are appreciated and that very likely has the natural consequence of
making their efforts here a lower priority in their lives than other things
they may find more rewarding. This has been a particularly wordy way of saying
"There's really nothing that can be done about it, so get over it and stop
potentially making it worse."


#128 of 176 by mary on Wed Dec 7 12:55:16 2005:

For some people here, Grex IS their life.  Check it out - barely an hour 
or two can go by without their jumping in with commentary.  They live 
here. I'm not surprised they have a hard time seeing that not everyone 
sets the same priorities.  But I certainly wouldn't wish that level of 
involvement on anyone who wasn't being paid to do a job and then get on 
with real life.


#129 of 176 by nharmon on Wed Dec 7 16:55:08 2005:

Hey Tod, I'm going to walk across the street and ask the volunteer
firemen what would happen if they showed up to fight fires whenever they
wanted.


#130 of 176 by mcnally on Wed Dec 7 17:12:19 2005:

 Also be sure to ask them what would happen if some person chewed
 them out for not going back into their burning house to save their
 photo album, and if a bunch of people joined in and started piling
 on about how poor the firefighting had been lately and how they
 really needed to commit themselves more "or else."


#131 of 176 by tod on Wed Dec 7 17:35:09 2005:

Or just do something simple and go in and ask "Who's in charge?"
I haven't seen any "or else" demands.  That's just more spin.
Mary is right.  Some of us take downtime a little more seriously.  I
appreciate Lynne's participation in this discussion because she is honest
without throwing rocks.  I ask who is in charge and she says "the first staff
person to step up."  That sounds logical.  Every disaster's first incident
responder is obviously the first person on the scene.  Now, how incidents are
handled after that are where things could probably improve.  There needs to
be a "go to" so when the system goes down, the rest of the Board knows who
to call for a status..and then the members can ask any Board member available
and get some sort of decent response.  I'm not saying it has to be chinese
fire drills and all corporate red tape but at least just some sort of formal
person that shows up at board meetings to represent staff.  If that's STeve
or Remmers or whoever, great.  I'd at least like to see the Board address it
at their next meeting and come up with something.


#132 of 176 by nharmon on Wed Dec 7 17:46:03 2005:

I'm not trying to pick on staff too much Mike. You guys really do a
fantastic job.


#133 of 176 by mcnally on Wed Dec 7 20:27:37 2005:

 No, actually, lately we don't, which is clearly a problem.

 I'm not trying to sugar-coat what happened or shut down criticism.
 What I would like to do, however, is promote a pragmatic view of
 the situation.  We do have a problem, but we also have very limited
 resources with which to fix it.  Arguing about what "should" happen
 is kind of pointless at this point unless it's something that also
 *could* happen.  Until/unless a proposed solution is possible with
 the constraints we have to deal with it's kind of a waste of time
 to spend a lot of time arguing about it.


#134 of 176 by cross on Wed Dec 7 22:31:05 2005:

Regarding #128; Maybe you should encourage some of them to become staff.
Oh, wait....

You know, something the board *could* do is advertise a position for a
staff liaison person; something that someone could run for if they chose.
Them taking that position would sort of make them the chief staffer, but
also make them accountable.  If circumstances in their life changed so
that they couldn't handle it anymore, they could resign.  Since they
volunteered for that position, with the additional responsibilities it
entails, there really shouldn't be much of a problem with asking them to
do whatever extra it entails.


#135 of 176 by naftee on Wed Dec 7 22:46:28 2005:

i'm proud to call GreX my home!


#136 of 176 by scg on Wed Dec 7 23:44:38 2005:

I'm seeing a lot of comments here about how things work in commercial
environments, making it sound as if there's one way of doing things in
such places.  In fact, from what I've seen, there's a pretty wide
spectrum.  Commercial organizations have a wide variety of experience,
budgets, resource constraints, contractual obligations, perceived levels
of importance, and operational philosophies, even if they're providing
services that may look quite similar from the outside.

It seems non-useful for people to say, "commercial content providers do
X, therefore Grex should too."  It likewise seems non-useful to say,
"Grex isn't a commercial organization, so it can't do what commercial
organizations do."

It's perhaps worth taking a look at change management procedures in some
of the slowest changing but most stable network operators -- traditional
phone companies.  At the one I worked in the web hosting division of,
nothing could be done without filling out lots of change management
documentation: extensive documentation about the change procedure,
including exact commands that would be entered, test procedures, backout
plans, justification of why the change was needed, who was going to be
involved, when it was going to happen, what the impacts were going to be
and to which customers, and so forth.  This all had to go through a
committee, which might approve it a couple of weeks after it was
submitted.  It wasn't fun.  Nobody did anything just because they
thought it might make some small incremental improvement.  Problems were
often left alone until they became emergencies, because the bureaucracy
involved in fixing them would become somewhat easier then.  But at the
same time, human error-caused outages became pretty rare.  The committee
that reviewed these things didn't really know how to do anything other
than see if the questions had been answered, but answering the questions
forced people to think through things carefully.  Adopting a very
stripped down version of that protocol, asking people to answer a list
of standard questions to their own satisfaction before diving into major
changes, gets a lot of the same benefits and doesn't cost much.
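
A stripped-down version of that list of standard questions might look
something like the Python sketch below.  The questions are taken from the
description above; the names CHANGE_QUESTIONS and unanswered are
hypothetical, and nothing here reflects a procedure Grex actually uses.

```python
# Standard questions to answer before a major change, taken from the
# change management description above.
CHANGE_QUESTIONS = [
    "What exact commands will be entered?",
    "How will the change be tested?",
    "What is the backout plan?",
    "Why is the change needed?",
    "Who will be involved?",
    "When will it happen?",
    "What is the impact, and on which users?",
]


def unanswered(answers):
    """Return the questions that do not yet have a non-blank answer.

    `answers` maps a question string to the answer written for it.
    """
    return [q for q in CHANGE_QUESTIONS if not answers.get(q, "").strip()]
```

The idea is only that a change does not start until unanswered() comes
back empty; nobody else has to review the answers for the exercise to
force some careful thought.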

There are also the comments I've seen here about enterprise-class
hardware that Grex can't afford.  A lot of commercial sites also can't
afford it, or decide it's not worth the cost.  A lot of services without
which the Internet would be perceived as not working -- some of the
root and top level DNS infrastructure, Akamai caches, Google, etc. --
involve standard off-the-shelf hardware deployed in large enough
numbers that if some piece of it breaks, end users won't notice in the
few days it may take to fix it.  What sort of hardware to use, how much
of it, and how much support to provide in case it breaks, are
interrelated decisions with costs associated, and different
organizations come up with different answers.

Managing volunteers is different than managing employees.  Managing
employees who are paid less than they could earn elsewhere is different
than managing employees who are paid more than they could earn
elsewhere.  A general question to ask is, "are we getting more out of
this person than we're paying them?"  I've dealt with employees who have
been hard to deal with, but who were occasionally doing things that were
really important, and they've seemed worth keeping.  I think I've even
been such an employee at a few former jobs.  At my current non-profit
employer, I've "fired" volunteers who were taking more of my time to
manage than it would have taken to do the work they were doing.  At the same
time, if somebody isn't doing anything, is known to not be doing
anything, and isn't costing anything, telling them to go away probably
isn't all that useful.  Having volunteers who occasionally do something
that wouldn't otherwise get done can be a very useful thing.  Telling
anybody to go away before you're sure you want them gone can have some
less than desirable consequences.  On the other hand, having somebody be
in charge, with at least the authority to tell volunteers what not to do,
may have more positive impact than its cost in ruffled feathers.


#137 of 176 by ric on Tue Dec 13 20:07:10 2005:

Tod - Yes, the elected officers have a fiduciary responsibility to manage
Grex.

It's still not their primary responsibility.  I'd feel sad for anyone who felt
running Grex was the most important thing in their life.

Aren't you on the arbornet board?  Seems to me that it is YOUR fiduciary
responsibility to have the annual meeting that was required by law which still
has not occurred.  But you see, Arbornet is not your primary responsibility,
is it?  It's not even your secondary responsibility.  I bet your family and
job come first.  I bet there's a lot of things you consider more important
than your obligations as a volunteer on the Arbornet Board of directors.


#138 of 176 by cross on Tue Dec 13 20:35:44 2005:

Please, that's just deflecting responsibility.  Someone really does need to
be "in charge" of Grex.

Besides, arbornet not having its annual meeting isn't necessarily Todd's
fault.


#139 of 176 by tod on Wed Dec 14 07:13:57 2005:

System downtime vs. annual meeting
Shall we take a poll on order of importance?  Governance has not been an issue
for Arbornet, nor has accountability of staff and system maintenance.
Let's talk about Grex since this is where we are.


#140 of 176 by naftee on Wed Dec 14 23:21:53 2005:

a very Romanian response.


#141 of 176 by ric on Mon Dec 19 14:21:06 2005:

re 138 - people are "in charge" of grex.  Where did I say they weren't?  Nor
did I say it was Todd's "fault" that Arbornet hasn't had its legally required
annual meeting.  It's the Arbornet Board of Directors' "fault".

People are in charge of Grex.  People are responsible for Grex.  But those
people have more important things in their life than Grex, and I don't blame
anyone for that.  

I have a responsibility to my job because without it, I can't provide for my
family.

What is Steve's responsibility to Grex?  He does these things as a volunteer,
but you can be sure that his job and his family are more important to him than
Grex.  (Speak up, Steve, if I am wrong).

That being said, if Grex is down for 3 days because Steve (or any other staff
member) doesn't have time to fix it because of family and job obligations,
I think it is ridiculous to criticize them for those decisions.

And if a MISTAKE is made during the operation of Grex, what are you going to
do, fire the staffer who made the mistake?  I don't see a huge line of people
volunteering to run these organizations.  Most of M-Net's volunteers left for
Grex or left the conferencing world entirely.  It doesn't look like there's
a ton of volunteers here on Grex either, so you take what you can get.

The fact that either of these systems still exists is nothing short of amazing.


#142 of 176 by scholar on Mon Dec 19 17:09:17 2005:

Being volunteers doesn't remove them from the responsibility to do quality
work when they decide to use the powers over the system they're given.

The whole backup thing was terribly poor work.  Even the most novice,
inexperienced of system administrators knows how important backups are.  The
people involved in the mail mishap are apparently a gaggle of fools with FAKE
pocket protectors.


#143 of 176 by ric on Mon Dec 19 18:45:01 2005:

I, for one, appreciate the volunteer efforts of anyone willing to do such
jobs.  And I realistically understand that these people are volunteers and
have many other more important responsibilities in other areas of their life.

I choose to not rely on systems operated by such people, and therefore, I've
never lost anything important due to such issues.

You may choose to rely on systems operated by volunteers.  You may try to hold
someone responsible for mistakes leading to loss of data or anything else that
may arise from system downtime.  You'd be a fool to do so and you probably
won't get anywhere trying.


#144 of 176 by scholar on Mon Dec 19 19:01:21 2005:

The loss of data wasn't caused by system downtime.

It was caused by people not making proper backups.

Even in a volunteer organization, there must be some work ethic.


#145 of 176 by ric on Mon Dec 19 19:15:25 2005:

What do you intend to do to force that?

Have them all removed?


#146 of 176 by scholar on Mon Dec 19 19:23:27 2005:

I don't have to be able to "force" something for it to be the right thing.


#147 of 176 by tod on Mon Dec 19 19:36:22 2005:

Such adamant defenses for complacency.
I'm glad none of these folks work for larger non-profits.


#148 of 176 by glenda on Mon Dec 19 23:57:18 2005:

And how do you suppose you could do better?  Backups were made.  A listing
was made of the said backups to see that all the files were there; the listing
reported that the mail directory and files were there, it just didn't say how big
it was.  Is the person doing the backups supposed to go in and look at all
the 100s of thousands of files individually to make sure that the sizes are
correct?  When I do backups, I do listings to see that the major files exist,
I usually don't unzip them and look at the size, with that many files there
just isn't enough time to do so, especially when there are time limitations.


#149 of 176 by naftee on Tue Dec 20 00:16:11 2005:

ric is like richard, except he types better


#150 of 176 by scholar on Tue Dec 20 00:30:42 2005:

It's not particularly difficult to compare the size of files in an archive
to the size of files in a directory, though the fact you think it is difficult
speaks to your ignorance of Unix.

It's also not particularly difficult to make sure the backup is done right
in the first place.


#151 of 176 by cross on Tue Dec 20 02:38:21 2005:

Regarding #148: That's impossible.  If Steve's account was accurate,
none of the spool files would have shown up in the file listing.

Regarding #141: Oh please.  Call a spade a spade.  No one is saying
that people need to make grex the primary focus of their life.  But
someone needs to be accountable for it, and no one is.  No one takes
the responsibility for making sure grex is running.  If they did,
it wouldn't stay down for a week at a time.

Now, I'm not saying people shouldn't make the decisions they do,
just that grex needs to solicit someone to step up to the plate
when no one else does.

Of course, I expect I'll be flamed to pieces for challenging the
status quo and not being an apologist.  The grexists are a lot like
the neocons when it comes to questioning things.  They just don't
like it when anyone challenges anything.  Sad, really.

And people wonder why grex isn't as popular as it once was.


#152 of 176 by mcnally on Tue Dec 20 03:06:06 2005:

 re #148:  
 >  And how do you suppose you could do better?

 I've tried to refrain from criticizing STeve's mistake for a number
 of reasons -- (1) it doesn't get the deleted mail back, (2) I suspect
 he feels (or felt) bad enough, and (3) nobody else was stepping up to
 volunteer to get the job done and it's unfair how much of the
 responsibility has devolved onto STeve, but..

 Your defense, while commendable from a family loyalty standpoint,
 is wholly misguided from a technical standpoint.  A couple of really
 serious mistakes were made (chiefly, the backup was badly botched
 *and* the decision had been made to repartition in place.)  The results
 turned out to be a minor disaster for many of us, and it's insulting
 to pretend that there was no way it could have been prevented..

 >  Backups were made.
  
 As it turned out, some were, some weren't.  That's the issue.

 >  Is the person doing the backups supposed to go in and look at all
 >  the 100s of thousands of files individually to make sure that the
 >  sizes are correct?

 Actually, it's not that hard to write a program to do that, but even
 if you don't want to go to that much trouble one can get a pretty good
 idea by comparing the size taken up by the backup with the size taken
 up by the originals.
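
 As a rough illustration of the check described above, here is a minimal
 Python sketch that compares a tar archive against the directory it was
 supposed to capture.  It assumes the archive was written with paths
 relative to the source directory; the function name compare_backup and
 its arguments are hypothetical, not anything Grex staff actually ran.

```python
import os
import tarfile


def compare_backup(archive_path, source_dir):
    """Return (missing, mismatched, disk_total, archive_total).

    missing       - files on disk that are absent from the archive
    mismatched    - files whose archived size differs from the disk size
    disk_total    - total bytes of regular files under source_dir
    archive_total - total bytes of regular files recorded in the archive
    """
    # Map each regular file in the archive to its recorded size.
    archived = {}
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if member.isfile():
                archived[os.path.normpath(member.name)] = member.size

    missing, mismatched = [], []
    disk_total = 0
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            # Only compare regular files; skip symlinks and special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            disk_total += size
            rel = os.path.normpath(os.path.relpath(path, source_dir))
            if rel not in archived:
                missing.append(rel)
            elif archived[rel] != size:
                mismatched.append(rel)
    return missing, mismatched, disk_total, sum(archived.values())
```

 Even without the per-file lists, the two totals give the quick sanity
 check: an archive of an empty mount point would report an archive total
 of zero against a disk total of however much mail was really there.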



#153 of 176 by tod on Tue Dec 20 05:22:28 2005:

re #152
Thanks, Mike.  I didn't even want to go there but you present a pretty simple
guideline for next time.


#154 of 176 by cross on Tue Dec 20 06:37:25 2005:

Actually, this would have been avoided had Steve used the dump program
instead of tar to do the backups, as I suggested.  Steve wrote something
somewhere that I thought was funny that seemed to indicate he thought it
wouldn't have made a difference; actually, it would.  Dump doesn't go
through the filesystem to get the data it backs up; rather, it looks at
the filesystem data on the raw disk devices.  Tar goes through the file
system; hence it's sensitive to whether the disk was mounted at the
time.  A better way to do the backups would have been to use dump.
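
For anyone who does stay with tar, the mount dependence can at least be
checked before the backup starts.  A minimal Python sketch, assuming the
backup is driven from a script; the helper name require_mounted is
hypothetical:

```python
import os


def require_mounted(path):
    """Return `path` if it is a mount point, otherwise raise.

    This keeps a file-level backup from silently archiving the empty
    directory that sits underneath an unmounted filesystem.
    """
    if not os.path.ismount(path):
        raise RuntimeError(f"{path} is not mounted; refusing to back it up")
    return path
```

Calling require_mounted("/var/mail") ahead of the tar step would turn the
unmounted case into an error up front instead of an empty backup.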

But I really don't want to beat up on Steve about this.  I've done the
exact same thing myself (luckily, I only deleted the mail spool of one
user, but he was still pretty pissed off).  Hey, live and learn.

My major concern is with grex as a whole, and the idea that no one really
seems to be in charge, despite claims to the contrary.


#155 of 176 by ric on Tue Dec 20 14:22:59 2005:

Again, I'm not saying it could not have been prevented, and I'm not suggesting
that people don't try to do better "next time".

I'm just saying that we all know how Grex operates, and we should set our
expectations accordingly.


#156 of 176 by cross on Tue Dec 20 15:45:04 2005:

If we all know how grex operates, and should set our expectations accordingly,
then you *are* suggesting that people don't try to do better next time.  You
are, without a doubt, saying that the status quo is perfectly fine.  I am not.


#157 of 176 by tod on Tue Dec 20 16:53:42 2005:

Dan,
Don't you realize that most Grex folk get seasick if there is even the
slightest boat rocking?


#158 of 176 by cross on Tue Dec 20 17:42:49 2005:

Oh, sorry.  My bad.


#159 of 176 by ric on Tue Dec 20 17:51:00 2005:

To be quite honest, yes - the status quo works for me because I don't rely
on Grex for anything.  If my participation files get hosed, I'll get over it
pretty quickly.  I don't rely on Grex for email either because in my opinion,
nobody should rely on email hosted by an organization with no employees and
nobody whose primary job responsibility is maintaining that system.

I haven't seen anything suggested here that would make things on Grex any
better - other than simple acknowledgement of mistakes made, and some hope
that lessons have been learned.

I don't know what YOU got out of this "situation" but for me, it's just an
affirmation that relying on grex for anything is foolish.


#160 of 176 by cross on Tue Dec 20 18:26:48 2005:

Well, I made a suggestion that grex solicit a staff member to be `in
charge' in the case of a failure.  Others suggested that a written plan
be made prior to a major change (such as an upgrade).  Both of those
seem like suggestions that could make things better.


#161 of 176 by tod on Tue Dec 20 18:45:46 2005:

I asked for a "go to" from the Board.  I think it makes sense to ask who we
should contact for status updates when Grex horks.  Ric doesn't care either
way and that's fine for him.  I don't see what his point is other than "this
place is unreliable".


#162 of 176 by mcnally on Tue Dec 20 19:40:07 2005:

> To be quite honest, yes - the status quo works for me because I don't
> rely on Grex for anything. 

How nice for you..  What does it have to do with the rest of us?


#163 of 176 by ric on Tue Dec 20 19:40:07 2005:

Actually, it's "This place is unreliable and if you lose anything important
that is stored here it's your own damn fault"


#164 of 176 by ric on Tue Dec 20 19:40:14 2005:

(but you were close)


#165 of 176 by ric on Tue Dec 20 19:42:53 2005:

I take it back, I rely on Grex for this sad and pathetic form of social
interaction that I've grown accustomed to over the last 20 years.  (On grex
it's more party than BBS; on m-net it's more BBS than party.)  So I donate
financially to Grex and M-Net in hopes that they will both remain "alive"...
it's nice having both so that when one goes down (because they are both
unreliable) I can go hang out on the other.


#166 of 176 by slynne on Tue Dec 20 21:13:23 2005:

I don't like the status quo on grex either. Like other people, I wish 
there were lots of people with lots of energy running things. Instead, 
we have a pretty good board (in my opinion) with weak leadership. A lot 
of that has to do with the amount of time I am willing to put into 
grex. 

I don't know the solution to the issue except to say that I also wish 
things were different. I wish we had a lot more financial support. I 
wish that more people were willing to become voting members and 
participate more actively in grex. I wish there were so many great 
people running for board that I got voted out. Wishing for things to be 
different doesn't change anything. 



#167 of 176 by tod on Tue Dec 20 21:29:20 2005:

Change the board of directors titles to Official Eor


#168 of 176 by tsty on Fri Dec 30 05:46:46 2005:

what ever happened to that 24/7/365.25 access concept at provide?
 
  
what are the hours/days now?


#169 of 176 by aruba on Fri Dec 30 06:24:04 2005:

Provide Net never offered us all night access.


#170 of 176 by tod on Fri Dec 30 16:56:21 2005:

Not that we'd have someone on staff running over there sooner than 48 hours
anyhow.


#171 of 176 by tsty on Sat Jan 14 07:04:56 2006:

fwiw, i live about a spit and a holler from hewitt and mich ave. 
  
and if you notice i have a reasonable amount of free time at all sorts
of odd times of the day/week.
  
i've pissed off a whole buncha folks - including myself now and then - over
the years but when shit needs to get done, i get it done.
 

 fwiw.


#172 of 176 by tsty on Sat Jan 21 14:59:07 2006:

also, whilst thoughts of the above reverberate ....
  
how is the 8mm tape recovery coming? mcnally and i and a *bunch* of us
are curious, seriously interested, hopefully awaiting, etc.


#173 of 176 by tsty on Mon Jan 30 15:20:48 2006:

now that grex has returned, maybe some action on 168-173 ??


#174 of 176 by cross on Tue Jan 31 00:36:36 2006:

How 'bout shutting off the idle daemon?


#175 of 176 by tsty on Tue Feb 21 18:47:49 2006:

hmmmm, seems neither borg nor staff has interpreted 168-173 ... yet.??


#176 of 176 by jesuit on Wed May 17 02:16:00 2006:

TROGG IS DAVID BLAINE

