|
Grex > Oldcoop > #294: Why Grex lost its mail partition | |
|
steve
|
|
response 50 of 176:
|
Nov 29 18:50 UTC 2005 |
Well Mike, I would have *liked* to have had spare disks for the
upgrade. Here at work I keep entire spare machines such that an
upgrade is done on the next machine, with data transfers done onto
the new machine, and a switchover of IP addresses. I've had as
little as 4 seconds of downtime for such upgrades.
But I did not succeed in getting the board to move on getting
the PC Weasel because of costs. Yes, I thought of asking for
money for at least one more 36G SCSI disk, but I didn't want to
go through that, dealing with money again.
I lost mail of mine too, Mike, so I felt the pain as well...
|
mcnally
|
|
response 51 of 176:
|
Nov 29 19:19 UTC 2005 |
> But I did not succeed in getting the board to move on getting
> the PC Weasel because of costs. Yes, I thought of asking for
> money for at least one more 36G SCSI disk, but I didn't want to
> go through that, dealing with money again.
Right. Which supports my counter-argument against Jan's statement
and suggests that maybe the recent trouble points to some organizational
problems that we can remedy before the next time we reach a crisis.
|
tod
|
|
response 52 of 176:
|
Nov 29 19:23 UTC 2005 |
re #49
And how do we pay for this person?
We can pay them double what you're getting from Grex. ;)
|
steve
|
|
response 53 of 176:
|
Nov 29 19:24 UTC 2005 |
I'm not so sure. The management of Grex has always been prudent
about financial things, to keep the system healthy. Perhaps one
could say there is too much of that at times, but that's life. No
organization is perfect. Overall I think Grex does things pretty
well, and, in spite of my thoughts on things, I'd rather have
this organization than even a lot of "professional" organizations
in terms of how they work. This is not to say that we couldn't
stand improvement, just that we're less screwed up than most
business places, from my point of view.
|
steve
|
|
response 54 of 176:
|
Nov 29 20:12 UTC 2005 |
Re #40: It was *my* fault that we lost the mail partition, and mine
alone.
|
nharmon
|
|
response 55 of 176:
|
Nov 29 20:31 UTC 2005 |
Thank you for taking responsibility and not displacing blame, Steve. I'm
sure we all know you didn't do it deliberately and will probably not
make the same mistake again. Thank you for being honest.
|
mcnally
|
|
response 56 of 176:
|
Nov 29 20:32 UTC 2005 |
re #54: Your error was the proximate cause, but that doesn't mean that
there weren't contributing issues that we should address in anticipation
of future mistakes -- anytime people are involved, mistakes are inevitable,
but through proper planning and procedures you can make a huge difference
in the outcomes.
I think you're seeing this as a discussion about blame, which is probably
not an unreasonable way to look at it from your standpoint, especially as
there are still a lot of people who want to discuss blame. I'd much rather
try to figure out how best to keep it from happening again, which requires
some degree of understanding what happened and why, but is a question to
which blame is pretty much irrelevant.
|
steve
|
|
response 57 of 176:
|
Nov 29 20:38 UTC 2005 |
No, I'm not looking at it from a blame standpoint, just the truth.
I do however fully agree with you that we need to be able to look at
things and do some things better. That's a good thing to do.
|
naftee
|
|
response 58 of 176:
|
Nov 29 22:04 UTC 2005 |
hey tod. how do you say 'fiduciary' in Romanian ?
|
bhoward
|
|
response 59 of 176:
|
Nov 29 22:28 UTC 2005 |
Re#50: It should be noted for the record, as of the last board
meeting, we have reconfirmed that the cost of a PC Weasel is being
covered by an anonymous donation.
Has the order been placed?
Once this and other post-mortem discussions have run their course,
I suggest we (staff) should take the points made, write up a summary
of how we'll go about the next one, making certain the process
described will address the shortcomings identified in the last one.
|
other
|
|
response 60 of 176:
|
Nov 29 23:05 UTC 2005 |
I have a suggestion.
Documentation tends to be written and then squirreled away in any of a
number of places where it may or may not ever be read or seen again.
I propose that Grex operational documentation be kept in a single file
or directory, and that the contents be tagged (XML, perhaps?) and the
relevant scripts/programs modified so that those scripts and programs,
when run, can access and echo to the screen of the calling user any
information which they should have in mind before any actions are
performed. Ideally, they would require an acknowledgement before
continuing.
The advantages are: easy updating of documentation (all in the same
location); improvement of documentation (since it would constantly be
appearing, chances are it would be written or rewritten to better
communicate important information); and less time wasted either writing
useless documentation or because of lack of documentation where and when
it was needed.
This is a fairly easy-to-implement suggestion, and will make it much
easier to have new staff trained in the vagaries of Grex whenever there
is new staff to train.
This may represent a significant allocation of time, but for those who
have already spent lots of time writing documentation, it shouldn't be
hard to see why this is necessary. It should be prioritized as highly
as any other staff responsibility including keeping the system running
and secure, because it will make both of those goals easier and faster.
Lastly, if anyone thinks they don't need to have stuff documented
because either "everyone knows it" or "I'm the only one who does this
and I know it," those are the persons who most need to be doing this.
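A minimal sketch of the idea above, in Python, assuming a tagged
documentation file shaped like the sample below. The tag name "note",
the attribute "script", the script names, and the note texts are all
hypothetical and only there for illustration; nothing here describes
what is actually in /grexdoc. A staff script would call confirm_notes()
first and stop unless it returns True.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged documentation. Tag names, script names and note
# texts are assumptions made up for this sketch.
SAMPLE_DOCS = """\
<grexdoc>
  <note script="upgrade-disk">Back up any partition before changing its layout.</note>
  <note script="upgrade-disk">Confirm which disk device you are about to label.</note>
  <note script="edit-motd">Remove notices older than one week.</note>
</grexdoc>
"""


def notes_for(script_name, xml_text=SAMPLE_DOCS):
    """Return the notes tagged for one script, in document order."""
    root = ET.fromstring(xml_text)
    return [
        note.text.strip()
        for note in root.iter("note")
        if note.get("script") == script_name and note.text
    ]


def confirm_notes(script_name, ask=input, xml_text=SAMPLE_DOCS):
    """Echo each note for the script, then require a typed 'yes'.

    Returns True only if the caller acknowledged the notes.
    """
    for note in notes_for(script_name, xml_text):
        print(f"[{script_name}] {note}")
    answer = ask("Type 'yes' to acknowledge and continue: ")
    return answer.strip().lower() == "yes"
```

Passing the prompt function in as "ask" keeps the check easy to test
without a terminal; a real script would just use the default.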
|
other
|
|
response 61 of 176:
|
Nov 29 23:10 UTC 2005 |
By the way, this scheme easily allows for pointing to additional tools
and documentation to supplement the echoed information.
For that matter, both tools/scripts and documentation might be collected
in a keyword searchable database (using the same tagged source
documents) for anyone needing to know how to perform a certain function
on the system.
The more of this kind of thing that gets done, the less the system is
dependent on a few individuals with highly specialized knowledge to do
most of the things necessary to keep the system running properly.
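A rough sketch of the keyword lookup, again in Python and again using a
made-up tagged source (the "note" tag, the "script" attribute and the
sample texts are assumptions, not anything that exists on Grex). It
builds a simple word index from the same kind of tagged document and
returns the notes that mention a keyword.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tagged documentation, invented for this sketch.
SAMPLE_DOCS = """\
<grexdoc>
  <note script="upgrade-disk">Back up any partition before changing its layout.</note>
  <note script="edit-motd">Remove notices older than one week.</note>
  <note script="newuser">Check partition free space before creating accounts.</note>
</grexdoc>
"""


def build_index(xml_text=SAMPLE_DOCS):
    """Map each lowercase word to the set of (script, note text) pairs using it."""
    index = defaultdict(set)
    for note in ET.fromstring(xml_text).iter("note"):
        text = (note.text or "").strip()
        script = note.get("script", "")
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((script, text))
    return index


def search(index, keyword):
    """Return matching (script, note text) pairs, sorted by script name."""
    return sorted(index.get(keyword.lower(), set()))
```

This matches whole words only; anything smarter (stemming, phrases)
would be a separate decision.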
|
tod
|
|
response 62 of 176:
|
Nov 29 23:28 UTC 2005 |
re #58
hey tod. how do you say 'fiduciary' in Romanian ?
demn de incredere
|
steve
|
|
response 63 of 176:
|
Nov 29 23:41 UTC 2005 |
We have a good start at documentation in the /grexdoc directory
and in the staff conference. Both need more work, but we do have
a good start for it.
|
aruba
|
|
response 64 of 176:
|
Nov 30 01:14 UTC 2005 |
Re #59: I haven't ordered the PC Weasel yet, but I will soon. Someone needs
to find out from Provide Net what it's going to cost us per month to have a
separate machine running, which is, as I understand it, what we will need in
order to make the PC Weasel work.
Tod said the board should be trying harder to get more staff. Well, I'm not
on the board right now, but I think I speak for them when I say, they're
open to suggestions.
|
steve
|
|
response 65 of 176:
|
Nov 30 01:39 UTC 2005 |
I'll send mail to John A again about the cost.
|
glenda
|
|
response 66 of 176:
|
Nov 30 02:02 UTC 2005 |
Re #41: You keep saying that it would have been nice if notice had been given
before the upgrade. It was in the motd for several days beforehand that the
upgrade would happen that weekend if at all possible. The upgrade happening
as soon as STeve got everything together and could get with John was discussed
in at least one item for a couple of weeks before it was done. How much
warning do you need, or did you expect a personal email?
|
keesan
|
|
response 67 of 176:
|
Nov 30 02:25 UTC 2005 |
The problem with messages in the motd is that people forget to change them
when they get outdated so we tend to ignore them. If there were only relevant
messages there I would read them. I don't really want to know that grex was
down two weeks ago for a day.
|
nharmon
|
|
response 68 of 176:
|
Nov 30 04:04 UTC 2005 |
Things like maintaining the MOTD are tasks to give to people who want to
join staff as a way of seeing how they handle it. Start him/her off
here, and then go up from there.
|
glenda
|
|
response 69 of 176:
|
Nov 30 04:09 UTC 2005 |
If not the motd, where? It was also discussed in at least one item here and
in Agora. Short of sending email to every account on Grex, what are we
supposed to do? I manage to glance at the motd every time I log on enough
to notice if something new is posted. It only takes a couple of seconds, it
isn't that long and it has been rather up to date lately. If you choose to
ignore it, that is more your problem than it is staff's. Yes, I agree that
outdated things should be removed, but let's get real here.
|
nharmon
|
|
response 70 of 176:
|
Nov 30 04:11 UTC 2005 |
What would be the impact of sending an e-mail message to every account
on Grex?
OR, better yet, what about an opt-in mailing list for people who would
like to get system announcements?
|
steve
|
|
response 71 of 176:
|
Nov 30 04:31 UTC 2005 |
Now *that* is a good idea, a mailing list for announcements of system
work, downtime, etc. Excellent.
The impact of staff sending out mail to every account on Grex would
be 1) to take about 20 minutes of system pounding to deliver about
29,000 emails, 2) consume about 50M of /var/mail space, and 3) would
likely generate a couple hundred emails back with 1/2 asking if this
was real, and the other half asking about why and when the system
would be back up, regardless of what we said in the mail. ;-)
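For scale, the estimates above can be turned into per-message figures.
This is only arithmetic on the numbers quoted in this response, assuming
"50M" means 50 MiB and that delivery is spread evenly over the 20
minutes; the variable names are just for illustration.

```python
# Figures quoted in the response above.
messages = 29_000
delivery_seconds = 20 * 60            # "about 20 minutes"
spool_bytes = 50 * 1024 * 1024        # assumption: "50M" read as 50 MiB

# Average delivery rate and average spool space per message.
rate_per_second = messages / delivery_seconds
bytes_per_message = spool_bytes / messages

print(f"{rate_per_second:.1f} messages per second")
print(f"{bytes_per_message:.0f} bytes per message")
```

That works out to roughly 24 deliveries a second and a bit under 2 KB
per message, i.e. a short plain-text notice plus headers.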
|
naftee
|
|
response 72 of 176:
|
Nov 30 05:08 UTC 2005 |
;)
GreX should provide an escort service
|
bhoward
|
|
response 73 of 176:
|
Nov 30 12:20 UTC 2005 |
Re#67 Sindi, we can certainly remove motd messages more aggressively.
Notices for things such as recent outages tend to stay in motd for
at least a week to ensure that folks not regularly logging in or
reading the conferences still will have some idea why the system
may have recently crashed or otherwise been unavailable.
A week's notice for major notices is a (hopefully) reasonable balance
between those who log in daily and those who hit the system at least
weekly (arguably it is a balance between those who only need to be
told once and those for whom the message may not register until
they've seen it several times).
|
steve
|
|
response 74 of 176:
|
Nov 30 13:26 UTC 2005 |
Sorry Bruce but I don't think we should commit to that. Access
to Grex's hardware is simply too limiting. Back when Grex was starting
to crash every day some months ago, I wanted to get to Grex and do
things for a week, every day, and simply couldn't get there in time
to be able to do anything with the 10pm curfew we live under now.
Yes, it's a *good thing* to give advance notice on shutdowns, I
fully agree. But let's not lock ourselves to it.
|