|
#294: Why Grex lost its mail partition
|
scg
|
|
response 136 of 176:
|
Dec 7 23:44 UTC 2005 |
I'm seeing a lot of comments here about how things work in commercial
environments, making it sound as if there's one way of doing things in
such places. In fact, from what I've seen, there's a pretty wide
spectrum. Commercial organizations have a wide variety of experience,
budgets, resource constraints, contractual obligations, perceived levels
of importance, and operational philosophies, even if they're providing
services that may look quite similar from the outside.
It seems non-useful for people to say, "commercial content providers do
X, therefore Grex should too." It likewise seems non-useful to say,
"Grex isn't a commercial organization, so it can't do what commercial
organizations do."
It's perhaps worth taking a look at change management procedures in some
of the slowest changing but most stable network operators -- traditional
phone companies. At the one where I worked, in the web hosting division,
nothing could be done without filling out lots of change management
documentation: extensive documentation about the change procedure,
including exact commands that would be entered, test procedures, backout
plans, justification of why the change was needed, who was going to be
involved, when it was going to happen, what the impacts were going to be
and to which customers, and so forth. This all had to go through a
committee, which might approve it a couple of weeks after it was
submitted. It wasn't fun. Nobody did anything just because they
thought it might make some small incremental improvement. Problems were
often left alone until they became emergencies, because the bureaucracy
involved in fixing them would become somewhat easier then. But at the
same time, human error-caused outages became pretty rare. The committee
that reviewed these things didn't really know how to do anything other
than see if the questions had been answered, but answering the questions
forced people to think through things carefully. Adopting a very
stripped down version of that protocol, asking people to answer a list
of standard questions to their own satisfaction before diving into major
changes, gets a lot of the same benefits and doesn't cost much.
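
A stripped-down checklist of that kind could be sketched roughly as follows. This is only an illustration: the field names are hypothetical, paraphrased from the questions listed above, and nothing here describes a procedure Grex or any phone company actually uses.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangePlan:
    """Standard questions for a major change (field names are hypothetical)."""
    commands: str = ""         # exact commands that will be entered
    test_procedure: str = ""   # how to tell whether the change worked
    backout_plan: str = ""     # how to undo it if it did not
    justification: str = ""    # why the change is needed
    people_involved: str = ""  # who will be doing it
    scheduled_time: str = ""   # when it will happen
    user_impact: str = ""      # what users will notice, and which users


def unanswered(plan):
    """Return the names of the questions that are still blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]
```

The point is not the code but the habit: `unanswered(plan)` returning an empty list just means the person has written something down for every question before starting.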
There are also the comments I've seen here about enterprise-class
hardware that Grex can't afford. A lot of commercial sites also can't
afford it, or decide it's not worth the cost. A lot of services which
the Internet would be perceived as not working without -- some of the
root and top level DNS infrastructure, Akamai caches, Google, etc. --
involve standard off-the-shelf hardware deployed in large enough
numbers that if some piece of it breaks, end users won't notice in the
few days it may take to fix it. What sort of hardware to use, how much
of it, and how much support to provide in case it breaks, are
interrelated decisions with costs associated, and different
organizations come up with different answers.
Managing volunteers is different than managing employees. Managing
employees who are paid less than they could earn elsewhere is different
than managing employees who are paid more than they could earn
elsewhere. A general question to ask is, "are we getting more out of
this person than we're paying them?" I've dealt with employees who have
been hard to deal with, but who were occasionally doing things that were
really important, and they've seemed worth keeping. I think I've even
been such an employee at a few former jobs. At my current non-profit
employer, I've "fired" volunteers who were taking more of my time to
manage than it would have taken to do the work they were doing. At the same
time, if somebody isn't doing anything, is known to not be doing
anything, and isn't costing anything, telling them to go away probably
isn't all that useful. Having volunteers who occasionally do something
that wouldn't otherwise get done can be a very useful thing. Telling
anybody to go away before you're sure you want them gone can have some
less than desirable consequences. On the other hand, having somebody be
in charge, with at least the authority to tell volunteers what not to do,
may have more positive impact than its cost in ruffled feathers.
|
ric
|
|
response 137 of 176:
|
Dec 13 20:07 UTC 2005 |
Tod - Yes, the elected officers have a fiduciary responsibility to manage
Grex.
It's still not their primary responsibility. I'd feel sad for anyone who felt
running Grex was the most important thing in their life.
Aren't you on the arbornet board? Seems to me that it is YOUR fiduciary
responsibility to have the annual meeting that was required by law which still
has not occurred. But you see, Arbornet is not your primary responsibility,
is it? It's not even your secondary responsibility. I bet your family and
job come first. I bet there's a lot of things you consider more important
than your obligations as a volunteer on the Arbornet Board of directors.
|
cross
|
|
response 138 of 176:
|
Dec 13 20:35 UTC 2005 |
Please, that's just deflecting responsibility. Someone really does need to
be "in charge" of Grex.
Besides, arbornet not having its annual meeting isn't necessarily Todd's
fault.
|
tod
|
|
response 139 of 176:
|
Dec 14 07:13 UTC 2005 |
System downtime vs. annual meeting
Shall we take a poll on order of importance? Governance has not been an issue
for Arbornet, nor has accountability of staff and system maintenance.
Let's talk about Grex since this is where we are.
|
naftee
|
|
response 140 of 176:
|
Dec 14 23:21 UTC 2005 |
a very Romanian response.
|
ric
|
|
response 141 of 176:
|
Dec 19 14:21 UTC 2005 |
re 138 - people are "in charge" of grex. Where did I say they weren't? Nor
did I say it was Todd's "fault" that Arbornet hasn't had its legally required
annual meeting. It's the Arbornet Board of Directors' "fault".
People are in charge of Grex. People are responsible for Grex. But those
people have more important things in their life than Grex, and I don't blame
anyone for that.
I have a responsibility to my job because without it, I can't provide for my
family.
What is Steve's responsibility to Grex? He does these things as a volunteer,
but you can be sure that his job and his family are more important to him than
Grex. (Speak up, Steve, if I am wrong).
That being said, if Grex is down for 3 days because Steve (or any other staff
member) doesn't have time to fix it because of family and job obligations,
I think it is ridiculous to criticize them for those decisions.
And if a MISTAKE is made during the operation of Grex, what are you going to
do, fire the staffer who made the mistake? I don't see a huge line of people
volunteering to run these organizations. Most of M-Net's volunteers left for
Grex or left the conferencing world entirely. It doesn't look like there's
a ton of volunteers here on Grex either, so you take what you can get.
The fact that either of these systems still exists is nothing short of amazing.
|
scholar
|
|
response 142 of 176:
|
Dec 19 17:09 UTC 2005 |
Being volunteers doesn't remove them from the responsibility to do quality
work when they decide to use the powers over the system they're given.
The whole backup thing was terribly poor work. Even the most novice,
inexperienced of system administrators knows how important backups are. The
people involved in the mail mishap are apparently a gaggle of fools with FAKE
pocket protectors.
|
ric
|
|
response 143 of 176:
|
Dec 19 18:45 UTC 2005 |
I, for one, appreciate the volunteer efforts of anyone willing to do such
jobs. And I realistically understand that these people are volunteers and
have many other more important responsibilities in other areas of their life.
I choose to not rely on systems operated by such people, and therefore, I've
never lost anything important due to such issues.
You may choose to rely on systems operated by volunteers. You may try to hold
someone responsible for mistakes leading to loss of data or anything else that
may arise from system downtime. You'd be a fool to do so and you probably
won't get anywhere trying.
|
scholar
|
|
response 144 of 176:
|
Dec 19 19:01 UTC 2005 |
The loss of data wasn't caused by system downtime.
It was caused by people not making proper backups.
Even in a volunteer organization, there must be some work ethic.
|
ric
|
|
response 145 of 176:
|
Dec 19 19:15 UTC 2005 |
What do you intend to do to force that?
Have them all removed?
|
scholar
|
|
response 146 of 176:
|
Dec 19 19:23 UTC 2005 |
I don't have to be able to "force" something for it to be the right thing.
|
tod
|
|
response 147 of 176:
|
Dec 19 19:36 UTC 2005 |
Such adamant defenses for complacency.
I'm glad none of these folks work for larger non-profits.
|
glenda
|
|
response 148 of 176:
|
Dec 19 23:57 UTC 2005 |
And how do you suppose you could do better? Backups were made. A listing
was made of the said backups to see that all the files were there; the listing
reported that the mail directory and files were there, it just didn't say how big
it was. Is the person doing the backups supposed to go in and look at all
the 100s of thousands of files individually to make sure that the sizes are
correct? When I do backups, I do listings to see that the major files exist,
I usually don't unzip them and look at the size, with that many files there
just isn't enough time to do so, especially when there are time limitations.
|
naftee
|
|
response 149 of 176:
|
Dec 20 00:16 UTC 2005 |
ric is like richard, except he types better
|
scholar
|
|
response 150 of 176:
|
Dec 20 00:30 UTC 2005 |
It's not particularly difficult to compare the size of files in an archive
to the size of files in a directory, though the fact you think it is difficult
speaks to your ignorance of Unix.
It's also not particularly difficult to make sure the backup is done right
in the first place.
|
cross
|
|
response 151 of 176:
|
Dec 20 02:38 UTC 2005 |
Regarding #148: That's impossible. If Steve's account was accurate,
none of the spool files would have shown up in the file listing.
Regarding #141: Oh please. Call a spade a spade. No one is saying
that people need to make grex the primary focus of their life. But
someone needs to be accountable for it, and no one is. No one takes
the responsibility for making sure grex is running. If they did,
it wouldn't stay down for a week at a time.
Now, I'm not saying people shouldn't make the decisions they do,
just that grex needs to solicit someone to step up to the plate
when no one else does.
Of course, I expect I'll be flamed to pieces for challenging the
status quo and not being an apologist. The grexists are a lot like
the neocons when it comes to questioning things. They just don't
like it when anyone challenges anything. Sad, really.
And people wonder why grex isn't as popular as it once was.
|
mcnally
|
|
response 152 of 176:
|
Dec 20 03:06 UTC 2005 |
re #148:
> And how do you suppose you could do better?
I've tried to refrain from criticizing STeve's mistake for a number
of reasons -- (1) it doesn't get the deleted mail back, (2) I suspect
he feels (or felt) bad enough, and (3) nobody else was stepping up to
volunteer to get the job done and it's unfair how much of the
responsibility has devolved onto STeve, but..
Your defense, while commendable from a family loyalty standpoint,
is wholly misguided from a technical standpoint. A couple of really
serious mistakes were made (chiefly, the backup was badly botched
*and* the decision had been made to repartition in place.) The results
turned out to be a minor disaster for many of us, and it's insulting
to pretend that there was no way it could have been prevented.
> Backups were made.
As it turned out, some were, some weren't. That's the issue.
> Is the person doing the backups supposed to go in and look at all
> the 100s of thousands of files individually to make sure that the
> sizes are correct?
Actually, it's not that hard to write a program to do that, but even
if you don't want to go to that much trouble one can get a pretty good
idea by comparing the size taken up by the backup with the size taken
up by the originals.
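
A check along those lines might look something like the sketch below. It is an illustration only, not the procedure anyone on Grex staff used: the function name `compare_backup` is made up, and it assumes the backup is a tar archive whose member names are relative to the directory that was backed up. It reports files whose sizes differ or that exist on only one side, and returns both byte totals so the overall sizes can be compared at a glance.

```python
import os
import tarfile


def compare_backup(archive_path, source_root):
    """Return (problems, archived_bytes, source_bytes) for a tar backup.

    Assumption: member names in the archive are relative to source_root.
    Only regular files are compared; symlinks and special files are skipped.
    """
    problems = []
    archived_bytes = 0
    archived_names = set()

    # Pass 1: every regular file in the archive should exist on disk
    # with the same size.
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if not member.isfile():
                continue
            name = os.path.normpath(member.name)
            archived_names.add(name)
            archived_bytes += member.size
            on_disk = os.path.join(source_root, name)
            if not os.path.isfile(on_disk):
                problems.append(f"in archive but not on disk: {name}")
            elif os.path.getsize(on_disk) != member.size:
                problems.append(f"size differs: {name}")

    # Pass 2: every regular file on disk should have made it into the archive.
    source_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            source_bytes += os.path.getsize(full)
            rel = os.path.normpath(os.path.relpath(full, source_root))
            if rel not in archived_names:
                problems.append(f"on disk but not in archive: {rel}")

    return problems, archived_bytes, source_bytes
```

On a live system some mismatch is expected, since mailboxes change between the backup and the check, but a byte total for the archive that is a tiny fraction of the total on disk is exactly the kind of warning sign described above.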
|
tod
|
|
response 153 of 176:
|
Dec 20 05:22 UTC 2005 |
re #152
Thanks, Mike. I didn't even want to go there but you present a pretty simple
guideline for next time.
|
cross
|
|
response 154 of 176:
|
Dec 20 06:37 UTC 2005 |
Actually, this would have been avoided had Steve used the dump program
instead of tar to do the backups, as I suggested. Steve wrote something
somewhere that I thought was funny that seemed to indicate he thought it
wouldn't have made a difference; actually, it would. Dump doesn't go
through the filesystem to get the data it backs up; rather, it looks at
the filesystem data on the raw disk devices. Tar goes through the file
system; hence it's sensitive to whether the disk was mounted at the
time. A better way to do the backups would have been to use dump.
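
For anyone who does stay with a filesystem-level tool like tar, a pre-flight check can catch the unmounted-disk case before the archive is made. The following is a rough sketch under assumptions: the function name is hypothetical, and it assumes the directory being backed up is supposed to be the mount point of its own partition, which may or may not match how Grex was laid out.

```python
import os


def archive_preflight(path):
    """Heuristic check before archiving `path` through the filesystem.

    Returns a list of warnings; an empty list means nothing suspicious
    was detected. Assumes `path` is meant to be a mount point.
    """
    warnings = []
    if not os.path.isdir(path):
        warnings.append(f"{path} is not a directory")
        return warnings
    if not os.path.ismount(path):
        # An unmounted partition leaves only the bare directory underneath.
        warnings.append(f"{path} is not a mount point; is the partition mounted?")
    if not os.listdir(path):
        warnings.append(f"{path} is empty")
    return warnings
```

This does not replace checking the backup afterwards; it only refuses to trust a directory that looks like an empty placeholder where a mounted partition should be.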
But I really don't want to beat up on Steve about this. I've done the
exact same thing myself (luckily, I only deleted the mail spool of one
user, but he was still pretty pissed off). Hey, live and learn.
My major concern is with grex as a whole, and the idea that no one really
seems to be in charge, despite claims to the contrary.
|
ric
|
|
response 155 of 176:
|
Dec 20 14:22 UTC 2005 |
Again, I'm not saying it could not have been prevented, and I'm not suggesting
that people don't try to do better "next time".
I'm just saying that we all know how Grex operates, and we should set our
expectations accordingly.
|
cross
|
|
response 156 of 176:
|
Dec 20 15:45 UTC 2005 |
If we all know how grex operates, and should set our expectations accordingly,
then you *are* suggesting that people don't try to do better next time. You
are, without a doubt, saying that the status quo is perfectly fine. I am not.
|
tod
|
|
response 157 of 176:
|
Dec 20 16:53 UTC 2005 |
Dan,
Don't you realize that most Grex folk get seasick if there is even the
slightest boat rocking?
|
cross
|
|
response 158 of 176:
|
Dec 20 17:42 UTC 2005 |
Oh, sorry. My bad.
|
ric
|
|
response 159 of 176:
|
Dec 20 17:51 UTC 2005 |
To be quite honest, yes - the status quo works for me because I don't rely
on Grex for anything. If my participation files get hosed, I'll get over it
pretty quickly. I don't rely on Grex for email either because in my opinion,
nobody should rely on email hosted by an organization with no employees and
nobody whose primary job responsibility is maintaining that system.
I haven't seen anything suggested here that would make things on Grex any
better - other than simple acknowledgement of mistakes made, and some hope
that lessons have been learned.
I don't know what YOU got out of this "situation" but for me, it's just an
affirmation that relying on grex for anything is foolish.
|
cross
|
|
response 160 of 176:
|
Dec 20 18:26 UTC 2005 |
Well, I made a suggestion that grex solicit a staff member to be `in
charge' in the case of a failure. Others suggested that a written plan
be made prior to a major change (such as an upgrade). Both of those
seem like suggestions that could make things better.
|