|
Grex > Oldcoop > #294: Why Grex lost its mail partition
|
| Author |
Message |
| 25 new of 176 responses total. |
cross
|
|
response 25 of 176:
|
Nov 22 17:31 UTC 2005 |
Presumably, you'd use dump on every filesystem on the system! That's the
whole point! /var/mail was missed because it was unmounted when you ran
tar. Dump doesn't care; you just tell it to dump a filesystem and it does
it, regardless of whether that filesystem is online at the time. That's
the big difference. With tar, it *has* to be online; with dump, it doesn't
(and in some ways it's better if it's not: dump favors a quiescent
filesystem).
I'm not sure, in this case, why moving the data to a Windows system could
have been useful, though I can see the portability of tar as being an asset
in more general situations.
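A minimal sketch of the kind of pre-flight check that would guard a tar run against this, assuming Python is available on the machine. The function name `missing_mounts` and the example paths are illustrative only, not anything Grex staff actually ran; `os.path.ismount` is the standard-library call that reports whether a path is a mount point.

```python
import os


def missing_mounts(required, ismount=os.path.ismount):
    """Return the paths in `required` that are not currently mount points.

    `ismount` defaults to os.path.ismount; it is a parameter only so the
    check can be exercised without touching real filesystems.
    """
    return [path for path in required if not ismount(path)]
```

Calling `missing_mounts(["/var", "/var/mail"])` just before the tar run, and refusing to continue if the result is non-empty, would have flagged an unmounted /var/mail. The list of required paths is an assumption here; in practice it would come from the system's own filesystem table.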
|
tod
|
|
response 26 of 176:
|
Nov 22 17:41 UTC 2005 |
re #21
Thanks, Mike and STeve.
|
naftee
|
|
response 27 of 176:
|
Nov 22 19:15 UTC 2005 |
re 20 You're pissed at polytarp about him making fun of your mistake(s) ?
I guess that is something to get mildly ticked off at, but oh well@!
|
ric
|
|
response 28 of 176:
|
Nov 22 19:21 UTC 2005 |
Umm.. Steve.. stop responding to him/them.
|
tod
|
|
response 29 of 176:
|
Nov 22 20:22 UTC 2005 |
Be sure to send that RAM in for the lifetime guarantee replacement, STeve.
|
steve
|
|
response 30 of 176:
|
Nov 22 20:29 UTC 2005 |
I'm going to send these two in to Crucial for testing. Indeed they do
have a good warranty. Never had to use it before. We'll see.
|
scholar
|
|
response 31 of 176:
|
Nov 22 23:18 UTC 2005 |
What two?!
Me and naftee?!
|
steve
|
|
response 32 of 176:
|
Nov 22 23:24 UTC 2005 |
I wouldn't bother. We know you're defective.
|
naftee
|
|
response 33 of 176:
|
Nov 22 23:34 UTC 2005 |
haha zing
|
slynne
|
|
response 34 of 176:
|
Nov 25 20:31 UTC 2005 |
Ah well. I had some personal mail that I would have liked to have kept
but I guess it isn't the end of the world to lose it either.
|
tod
|
|
response 35 of 176:
|
Nov 25 21:14 UTC 2005 |
I also noticed some items from the parenting cf have come up missing.
Can you restore them, STeve?
|
naftee
|
|
response 36 of 176:
|
Nov 25 23:10 UTC 2005 |
yeah, steVE. it's bugging us.
|
cross
|
|
response 37 of 176:
|
Nov 26 03:11 UTC 2005 |
Badda bing!
|
tsty
|
|
response 38 of 176:
|
Nov 29 04:36 UTC 2005 |
grex did not go out of its way to destroy any data. borg/staff went out
of its way to favor its phony invincibility - not an unknown arrogance
inside consensus-only leaderless organizations which eschew 'outsiders'
of any stripe.
grex has, sometimes, 'taken in' 'strays' but only until something can
be trumped up into a 'scandal' and then ... poof!
different thoughts are prohibited inside the inner navel-gaze.
it's sadly predictable but no one would listen - invincible arrogance or
something similar. maybe bad history teachers helped. maybe no
systems analysts/engineers permitted.
it's not STeve .. it's borg.
|
tsty
|
|
response 39 of 176:
|
Nov 29 04:37 UTC 2005 |
hey, cross, how ya doin?
|
janc
|
|
response 40 of 176:
|
Nov 29 15:08 UTC 2005 |
I think it is ridiculous to say that Grex lost the mail partition
because of some great organizational fault. Sure Grex has
organizational issues, and I believe some of them contributed directly
to the excessively long down time for the upgrade, but I don't think
that the mail partition problem was particularly caused by this.
The backups were performed by STeve Andre and John Remmers. Probably
STeve was typing and talking about what he was doing, and John was
looking over his shoulder (this is the mode they were working in when I
dropped by later). This is a pretty good way to do things like this,
because the second person has a good chance of catching the first
person's errors. In this case the error was subtle - I think they had
both forgotten that /var/mail was unmounted, because the unmounting had
been done some time before. When they tar'ed up /var, they got a huge
file. STeve did
a partial listing of its contents to see if the right sort of thing was
in it, but he didn't look far enough. He saved a copy on Grex's IDE
disk, and uploaded another to his laptop. There was supposed to be an
additional safety net - the mirror drive. Unfortunately, my mirror
scripts weren't smart enough to not mirror an unmounted drive and nobody
remembered to turn them off, so we lost the mirror copy of /var/mail too.
So we had several safety nets in place, and all of them failed. It's
pathetic and unfortunate, but it's not an organizational failure, and
it's not a failure based on a phoney sense of invincibility. Like all
computer professionals, we are well aware of our capacity to screw up
and take precautions to protect ourselves. But sometimes the
precautions fail. That's life.
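For what it's worth, the "didn't look far enough" step is the sort of thing that can be automated. Below is a rough sketch, assuming Python and its standard `tarfile` module; the function name `archive_has_entries_under` is made up for illustration and is not something that was used during the upgrade. It walks the whole archive and reports whether anything lies strictly below a given directory, which is the case an empty, unmounted mount point fails.

```python
import tarfile


def archive_has_entries_under(archive_path, prefix):
    """Return True if the tar archive holds at least one entry below `prefix`.

    A bare directory entry for `prefix` itself does not count, because an
    unmounted mount point shows up in the archive as an empty directory.
    """
    wanted = prefix.strip("/") + "/"
    with tarfile.open(archive_path) as archive:
        for member in archive:
            name = member.name
            # Normalize names stored as "./var/..." or "/var/...".
            while name.startswith("./"):
                name = name[2:]
            name = name.lstrip("/")
            if name.startswith(wanted) and len(name) > len(wanted):
                return True
    return False
```

Running something like `archive_has_entries_under("var.tar", "/var/mail")` on the saved copy (the file name is hypothetical) before touching the disks would return False for an archive that only contains the empty mount point.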
|
tod
|
|
response 41 of 176:
|
Nov 29 16:49 UTC 2005 |
> I think it is ridiculous to say that Grex lost the mail partition
> because of some great organizational fault.
Let's step back for a second. There was a time when it was considered polite
to notify users of intended downtime due to upgrades. That was so users could
ensure they have their precious data moved offline if they felt the need.
It was also so they'd know not to plan on being online at that time.
Organizational fault is written all over the last "upgrade." The Board is
slacking off by letting a bunch of part time hobbyhorse types take the system
offline without forewarning. The board has a fiduciary responsibility to the
members to keep the system around and available. No accountability in this
organization, imo. My recommendation is that the board seek some fresh blood
for the staff and also take some lessons in diplomacy and accountability.
No need to chant "volunteer organization" at me, neither. That's a dead horse
not worth beating and everyone is tired of that excuse.
|
keesan
|
|
response 42 of 176:
|
Nov 29 17:25 UTC 2005 |
Todd, why don't you start your own bbs and do it right? Our volunteers are
not perfect, they admit to this, they accept suggestions, and they don't need
more complaints. A polite request for a few days notice next time grex is
going down would accomplish more than #41. Are you looking for someone to
volunteer as full-time staff?
|
nharmon
|
|
response 43 of 176:
|
Nov 29 17:31 UTC 2005 |
The function of the staff should be to advise the BoD on technical
issues, implement the decisions made by the BoD, and intervene on their
own initiative in some circumstances. I say should be, because with the
exception of a few board motions, there is not a lot that defines what
staff's duties, responsibilities, requirements, etc. are.
Ideally, there should be one person appointed by the BoD responsible for
supervising the staff. This person would be accountable to the BoD, and
the other staffers accountable to him/her.
Right now it seems there is a hash of staff members who do a good job of
working together but without any real guidance or direction. I think
finding someone with the time and drive to give direction is something
Grex needs desperately.
|
nharmon
|
|
response 44 of 176:
|
Nov 29 17:32 UTC 2005 |
I think it is a shame that someone takes the time to voice their
recommendations on how to improve Grex only to be told to leave and start
their own BBS if they don't like how Grex is run.
That sort of attitude is exactly what ruins good organizations like this.
|
tod
|
|
response 45 of 176:
|
Nov 29 18:10 UTC 2005 |
#42 of 44: by Sindi Keesan (keesan) on Tue, Nov 29, 2005 (12:25):
> Todd, why don't you start your own bbs and do it right?
I'm a member of Cyberspace, Inc. I like this BBS (when it's online.)
> Our volunteers are not perfect, they admit to this, they accept suggestions,
> and they don't need more complaints.
How do we provide a community service without community feedback?
> A polite request for a few days notice next time grex is
> going down would accomplish more than #41.
I would request politely if I thought the Board would listen. I suspect
that the Board is 2nd fiddle to janc and STeve's whims, though.
Therefore, I'm using a harsher tone in hopes of a constructive response
and action.
> Are you looking for someone to volunteer as full-time staff?
I think the Board seriously should be. The downtimes of the recent past
have been at the full mercy of staffers with Grex nowhere near the
top of their priorities. I don't fault them for it but I do fault the
Board for not seeking additional available and willing staff.
Retaining staff should also include a bit more diplomacy in the
way it treats existing volunteers and members.
#43 of 44: by Nathan Harmon (nharmon) on Tue, Nov 29, 2005 (12:31):
> Ideally, there should be one person appointed by the BoD responsible for
> supervising the staff. This person would be accountable to the BoD, and
> the other staffers accountable to him/her.
I agree with you, Nathan. I'd also interject that the BoD is
ultimately responsible. As members, we should not be told to shut up
when we ask the Board why no one is being accountable for Grex.
|
steve
|
|
response 46 of 176:
|
Nov 29 18:25 UTC 2005 |
Grex has never operated with a chief staff person. It's always been
more of a collective thing. It's worked out at least as well as
workplaces I've been at which had an official structure.
Tod, you know as well as I do that there isn't going to be a
full-time staff person.
|
mcnally
|
|
response 47 of 176:
|
Nov 29 18:39 UTC 2005 |
> I think it is ridiculous to say that Grex lost the mail partition
> because of some great organizational fault.
I don't think that's ridiculous at all.
I think the mirroring scheme was set up by someone other than the person
who did the repartitioning, and the person doing the repartitioning didn't
fully understand the implications of the backup scheme, namely that
unmounted partitions aren't backed up.
Furthermore I think that there are organizational issues that led to
the mail disaster in other ways, too. I've refrained from commenting
because I haven't had a good idea how to separate criticism of the
upgrade from criticism of the people who performed the upgrade, but I
personally think it was a very bad idea to upgrade and restore in place.
If we had a spare SCSI disk (or perhaps set of disks) [which we should
have anyway, for disaster recovery] the entire upgrade could have been
performed without ever risking the data on the disk(s) the system had
been running on. As I understand it Grex has got a not excessive, but
still reasonable amount of money in the bank. Perhaps we should
invest in preventing exactly this sort of behavior the next time around.
And while I would never suggest that anyone jettisoned the mail on
purpose, I suspect a contributing factor in the mail loss is that
none of the people involved depend on the mail system here in any
way that's truly important to them. They shouldn't *have to* to
administer the system but it does tend to focus one's attention
when you've got something to lose.
|
tod
|
|
response 48 of 176:
|
Nov 29 18:42 UTC 2005 |
I agree to disagree with STeve about a full-time staff person. The Board can
at least make an attempt to find such person(s) with flexibility in
availability and accountability. A staff of several persons with dedicated
timeslots would be ideal but needs to happen by someone taking that task as
the lead. "We never did it before" and "isn't going to be" are empty excuses,
imo. Why is improving Grex uptime and maintenance so painful a concept?
|
steve
|
|
response 49 of 176:
|
Nov 29 18:46 UTC 2005 |
And how do we pay for this person?
|