Author Message
25 new of 176 responses total.
other
response 60 of 176: Nov 29 23:05 UTC 2005

I have a suggestion.

Documentation tends to be written and then squirreled away in any of a
number of places where it may or may not ever be read or seen again.

I propose that Grex operational documentation be kept in a single file
or directory, and that the contents be tagged (XML, perhaps?) and the
relevant scripts/programs modified so that those scripts and programs,
when run, can access and echo to the screen of the calling user any
information which they should have in mind before any actions are
performed.  Ideally, they would require an acknowledgement before
continuing.
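
A minimal sketch of this idea in Python, assuming a hypothetical XML layout in which each note is tagged with the name of the script it belongs to. The tag names, the sample notes, and the function names are illustrative assumptions only, not anything that exists on Grex.

```python
import xml.etree.ElementTree as ET

# Hypothetical documentation source; in practice this would be one shared file.
DOCS_XML = """
<grexdoc>
  <note script="backup" title="Before you back up">
    Check that the target disk is mounted and has free space.
  </note>
  <note script="upgrade" title="Before you upgrade">
    Confirm that a verified backup exists.
  </note>
</grexdoc>
"""


def notes_for(script_name, xml_text=DOCS_XML):
    """Return (title, text) pairs for every note tagged with script_name."""
    root = ET.fromstring(xml_text)
    return [
        (note.get("title", ""), " ".join((note.text or "").split()))
        for note in root.iter("note")
        if note.get("script") == script_name
    ]


def show_and_acknowledge(script_name, ask=input, out=print):
    """Echo the notes for a script, then require an explicit 'yes' to continue."""
    for title, text in notes_for(script_name):
        out(f"{title}: {text}")
    return ask("Type 'yes' to continue: ").strip().lower() == "yes"
```

A script would call something like show_and_acknowledge("backup") as its first step and exit if it returns False.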

The advantages are: easy updating of documentation (all in the same
location); improvement of documentation (since it would constantly be
appearing, chances are it would be written or rewritten to better
communicate important information); and less time wasted either writing
useless documentation or because of lack of documentation where and when
it was needed.

This suggestion is fairly easy to implement, and it will make it much
easier to train new staff in the vagaries of Grex whenever there
is new staff to train.

This may represent a significant allocation of time, but for those who
have already spent lots of time writing documentation, it shouldn't be
hard to see why this is necessary.  It should be prioritized as highly
as any other staff responsibility including keeping the system running
and secure, because it will make both of those goals easier and faster.

Lastly, if anyone thinks they don't need to have stuff documented
because either "everyone knows it" or "I'm the only one who does this
and I know it," those are the persons who most need to be doing this.
other
response 61 of 176: Nov 29 23:10 UTC 2005

By the way, this scheme easily allows for pointing to additional tools
and documentation to supplement the echoed information.  

For that matter, both tools/scripts and documentation might be collected
in a keyword searchable database (using the same tagged source
documents) for anyone needing to know how to perform a certain function
on the system.
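
A keyword search over tagged documents could be sketched roughly as follows, in Python. The XML layout and the sample notes are hypothetical, and the index is a plain in-memory word map rather than a real database.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tagged source documents, for illustration only.
DOCS_XML = """
<grexdoc>
  <note title="Rotating the motd">Remove outdated motd entries after one week.</note>
  <note title="Running backups">Mount the backup disk, then run the backup script.</note>
  <note title="Restoring mail">Restore mail spools from the latest backup.</note>
</grexdoc>
"""


def build_index(xml_text):
    """Map each lowercase word to the set of note titles that contain it."""
    index = defaultdict(set)
    for note in ET.fromstring(xml_text).iter("note"):
        title = note.get("title", "")
        words = re.findall(r"[a-z0-9]+", f"{title} {note.text or ''}".lower())
        for word in words:
            index[word].add(title)
    return index


def search(index, *keywords):
    """Return the sorted titles of notes that contain every keyword."""
    if not keywords:
        return []
    matches = [index.get(keyword.lower(), set()) for keyword in keywords]
    return sorted(set.intersection(*matches))
```

For example, search(build_index(DOCS_XML), "backup", "mail") returns only the notes that mention both words.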

The more of this kind of thing that gets done, the less the system is
dependent on a few individuals with highly specialized knowledge to do
most of the things necessary to keep the system running properly.
tod
response 62 of 176: Nov 29 23:28 UTC 2005

re #58
 hey tod.  how do you say 'fiduciary' in Romanian ?
demn de incredere
steve
response 63 of 176: Nov 29 23:41 UTC 2005

   We have a good start at documentation in the /grexdoc directory
and in the staff conference.  Both need more work, but we do have
a good start for it.
aruba
response 64 of 176: Nov 30 01:14 UTC 2005

Re #59: I haven't ordered the PC Weasel yet, but I will soon.  Someone needs
to find out from Provide Net what it's going to cost us per month to have a
separate machine running, which is, as I understand it, what we will need in
order to make the PC Weasel work.

Tod said the board should be trying harder to get more staff.  Well, I'm not
on the board right now, but I think I speak for them when I say, they're
open to suggestions.
steve
response 65 of 176: Nov 30 01:39 UTC 2005

   I'll send mail to John A again about the cost.
glenda
response 66 of 176: Nov 30 02:02 UTC 2005

Re #41:  You keep saying that it would have been nice if notice had been given
before the upgrade.  It was in the motd for several days beforehand that the
upgrade would happen that weekend if at all possible.  The upgrade happening
as soon as STeve got everything together and could get with John was discussed
in at least one item for a couple of weeks before it was done.  How much
warning do you need, or did you expect a personal email?
keesan
response 67 of 176: Nov 30 02:25 UTC 2005

The problem with messages in the motd is that people forget to change them
when they get outdated so we tend to ignore them.  If there were only relevant
messages there I would read them.  I don't really want to know that grex was
down two weeks ago for a day.
nharmon
response 68 of 176: Nov 30 04:04 UTC 2005

Things like maintaining the MOTD are tasks to give to people who want to
join staff as a way of seeing how they handle it. Start him/her off
here, and then go up from there.
glenda
response 69 of 176: Nov 30 04:09 UTC 2005

If not the motd, where?  It was also discussed in at least one item here and
in Agora.  Short of sending email to every account on Grex, what are we
supposed to do?  I manage to glance at the motd every time I log on, enough
to notice if something new is posted.  It only takes a couple of seconds; it
isn't that long and it has been rather up to date lately.  If you choose to
ignore it, that is more your problem than it is staff's.  Yes, I agree that
outdated things should be removed, but let's get real here.
nharmon
response 70 of 176: Nov 30 04:11 UTC 2005

What would be the impact of sending an e-mail message to every account
on Grex?

OR, better yet, what about an opt-in mailing list for people who would
like to get system announcements.
steve
response 71 of 176: Nov 30 04:31 UTC 2005

  Now *that* is a good idea, a mailing list for announcements of system
work, downtime, etc.  Excellent.

  The impact of staff sending out mail to every account on Grex would
be 1) to take about 20 minutes of system pounding to deliver about
29,000 emails, 2) consume about 50M of /var/mail space, and 3) would
likely generate a couple hundred emails back with 1/2 asking if this
was real, and the other half asking about why and when the system
would be back up, regardless of what we said in the mail. ;-)
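
As a rough check on those figures, the arithmetic can be worked through in Python. This assumes "50M" means 50 MiB and takes the 20 minutes and 29,000 emails at face value; the function names are illustrative.

```python
# Figures quoted in the post above; the MiB interpretation is an assumption.
EMAILS = 29_000
DELIVERY_MINUTES = 20
MAIL_SPACE_BYTES = 50 * 1024 * 1024  # assumes "50M" means 50 MiB


def messages_per_second(emails=EMAILS, minutes=DELIVERY_MINUTES):
    """Average delivery rate needed to finish within the stated time."""
    return emails / (minutes * 60)


def bytes_per_message(total_bytes=MAIL_SPACE_BYTES, emails=EMAILS):
    """Average /var/mail space used by each delivered message."""
    return total_bytes / emails


if __name__ == "__main__":
    print(f"{messages_per_second():.1f} messages per second")
    print(f"{bytes_per_message():.0f} bytes per message")
```

Under those assumptions the numbers work out to roughly 24 messages per second and a bit under 2 KB of spool space per message.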
naftee
response 72 of 176: Nov 30 05:08 UTC 2005

 ;)

GreX should provide an escort service
bhoward
response 73 of 176: Nov 30 12:20 UTC 2005

Re #67: Sindi, we can certainly remove motd messages more aggressively.
Notices for things such as recent outages tend to stay in motd for
at least a week to ensure that folks not regularly logging in or
reading the conferences still will have some idea why the system
may have recently crashed or otherwise been unavailable.

A week's notice for major notices is a (hopefully) reasonable balance
between those who log in daily and those who hit the system at least
weekly (arguably it is a balance between those who only need to be
told once and those for whom the message may not register until
they've seen it several times).
steve
response 74 of 176: Nov 30 13:26 UTC 2005

   Sorry Bruce but I don't think we should commit to that.  Access
to Grex's hardware is simply too limiting.  Back when Grex was starting
to crash every day some months ago, I wanted to get to Grex and do
things for a week, every day, and simply couldn't get there in time
to be able to do anything with the 10pm curfew we live under now.

   Yes, it's a *good thing* to give advance notice on shutdowns, I
fully agree.  But let's not lock ourselves to it.
bhoward
response 75 of 176: Nov 30 14:08 UTC 2005

Steve, I was referring to Sindi's complaint that motd has messages
about past crashes and outages too long *after* the event.  I made
no comment as to how much warning there should be before there is
an outage.

I think our current routine of announcing system down time several
days in advance for scheduled downtime, and best effort warning for
anything else is sufficient.

On a related note, I think we should commit to updating the hvcn
page with current system status *before* commencing any system work,
emergency or otherwise, that will keep grex down or unavailable for
more than a few minutes.
steve
response 76 of 176: Nov 30 14:51 UTC 2005

  Sigh.  OK, upon rereading this I see what you mean.  Don't type
before coffee should be my mantra on these rare days when I'm up
before 8am.   Yes, putting an announcement on the hvcn page is
something we need to do.
tsty
response 77 of 176: Nov 30 14:55 UTC 2005

an operation as large as an os upgrade ought to have had a
written checklist - and that checklist could have been discussed
long beforehand in public, in agora &/or coop.

but then shoulda/coulda/woulda only has recrimination value after the fact.

since this type of operation isn't about to happen too often, the
disaster is just lurking around - AS ALWAYS - waiting for memories
to fade.

the previous upgrades were not as complicated, had much more notice,
and were thought through with more precision - might even have had
a written checklist handy!
  

tsty
response 78 of 176: Nov 30 15:34 UTC 2005

re hvcn ... i went there for info but had to call mary and ask where
it was on hvcn .. there was nothing (at that time) that would have
led anyone to know where to click --- unless you already knew in
advance and bookmarked it.

re #74 -- what 10pm curfew???  i thought 24/7/365.25 was the deal?

we got out of ken's warehouse because of the same curfew problem, and now we
are back under another curfew?  guess i wasn't paying enough attention.

btw, back on the checklist thought ... at least those with military
training would have, by default, created their own instruction sheet.

not that you have to have had military training to figure that out,
but it helps. and systems engineering 101, remedial, would have demanded
a checklist ... system analysis 099, non-remedial, would have had
checklist provisions built into the course.

hell, the repetitive event sequence of starting up an airplane is
done by two people with a checklist!

hell, back when *i* halted grex, by accident, i wanted that non-existent
checklist to provide some thoughtful path. wasn't one. left grex 'as is'.
nothing was damaged, nothing was lost except my staff responsibilities -
backups (how ironic). at least one of the people (me) who foresaw precisely
this disaster *sometime* in the future and volunteered to backstop it
from being the disaster it now is, was 'offed' and dissed in the process.

future borg and future staff *could* have an agenda item: monthly
backup accomplished? checklist agenda item: yes/no

first things first. secure your environment, what is contemporary, before
wandering off into the unknown future with NO ROUTE HOME IN PLACE.

it is only in the last few years that i no longer enter a new
environment without already knowing THE OTHER WAY OUT, just in case.

catastrophe theory (my master's subject) says, 'you can't get back from here.'

therefore you prepare everything so that you will never *HAVE* to go
back where you can't get to.

if you can't get back and you can't cover your ass, you fscking STAY PUT
until another path is found/created to prevent exactly this sort
of catastrophe.   my explanation of that, simplified, back when i
halted grex was apparently unintelligible to the recipient(s).

or, it was forgotten over time, which is more of what i think, just
damn forgotten. blithely erased from the cache of collected wisdom.

not much cache in that account anymore, eh?
steve
response 79 of 176: Nov 30 18:06 UTC 2005

   I'm not really sure how to respond to what you are saying.  I
don't think you understand the nature of the upgrade, in that 
there was no path back.  The op system had problems with both
the networking card and filesystem issues, and as an extra
treat hardware problems.

   This all dances around the critical issue that no one should
keep valuable mail on Grex only.  Keeping valuable mail in the
/var/spool area is even worse; it's an active filesystem, the
most active one on Grex and as such is more prone to failures
than anything else.
naftee
response 80 of 176: Nov 30 18:44 UTC 2005

I don't think anyone can coherently respond to one of tsty's posts :(
ric
response 81 of 176: Nov 30 19:11 UTC 2005

(hah)
aruba
response 82 of 176: Dec 1 00:23 UTC 2005

Re #79: You know, STeve, I just don't think that's a good enough answer.
Yes, Grex is run by volunteers, and yes, people can't expect the same
accountability from Grex that they can from someone they're paying to keep
their data safe.

But if Grex is to be anything people care about, then the board and staff
have to themselves care enough to do their best for Grex.

Frankly, I'm glad some people are really pissed about losing their mail.  I
wish more people were.  If no one gave a damn, well, then Grex would really
be nearing the end of its life as a viable community.

I agree with tsty that backing up the system ought not to be an ad hoc thing
that someone works out as he goes along.  There ought to be a procedure in
the GrexDoc which gets followed each time.  If the system changes so that it
gets done differently, then the changes ought to go into the documentation.
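
One way such a procedure could be enforced is sketched below in Python: an ordered checklist in which each step must be confirmed before the next. The class and the sample steps are placeholders for illustration; the real steps would come from the GrexDoc.

```python
class Checklist:
    """An ordered list of steps that must be confirmed one at a time."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)
        self.confirmed = 0  # number of steps confirmed so far

    def next_step(self):
        """Return the next unconfirmed step, or None when all are done."""
        if self.confirmed < len(self.steps):
            return self.steps[self.confirmed]
        return None

    def confirm(self, step):
        """Confirm a step; refuse anything other than the next step in order."""
        expected = self.next_step()
        if step != expected:
            raise RuntimeError(f"expected {expected!r}, got {step!r}")
        self.confirmed += 1

    def complete(self):
        """True only when every step has been confirmed in order."""
        return self.confirmed == len(self.steps)


# Placeholder steps, not the actual Grex backup procedure.
BACKUP_STEPS = [
    "mount backup disk",
    "run backup",
    "verify backup can be read",
    "record date in log",
]
```

Because steps cannot be skipped or confirmed out of order, a changed procedure has to be reflected in the list itself, which keeps the documentation and the practice together.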
steve
response 83 of 176: Dec 1 02:15 UTC 2005

   And you don't think I care about Grex?  Good God Mark, I made a mistake.
WHERE do you think that I a) don't feel badly about missing it, and b)
that I don't care?
aruba
response 84 of 176: Dec 1 04:02 UTC 2005

I didn't say that or mean it, STeve.  But I don't think telling people 
that they shouldn't ever store anything that matters on Grex is a good 
enough answer.  I believe Bruce when he says the mistake you made could 
have happened to anyone.  *But I don't think it should have been you and 
John alone in the room doing the backup and upgrade*.  I think there 
should be an institutional procedure in place for these things, so that 
the collective knowledge and experience of all the staff is brought to 
bear on the procedure.  Grex shouldn't be dependent on one person's 
judgement.

And actually, I thought there *was* a procedure in place, in the person of 
the GrexDoc.  And I thought you promised the board you would follow the 
GrexDoc when you did the upgrade.

Maybe the doc wasn't complete - I don't know if it covered the backup
part of the upgrade.  Did it?

- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss