Grex > Coop13 > #294: Why Grex lost its mail partition
bhoward
|
|
response 75 of 176:
|
Nov 30 14:08 UTC 2005 |
Steve, I was referring to Sindi's complaint that motd has messages
about past crashes and outages too long *after* the event. I made
no comment as to how much warning there should be before there is
an outage.
I think our current routine of announcing system down time several
days in advance for scheduled downtime, and best effort warning for
anything else is sufficient.
On a related note, I think we should commit to updating the hvcn
page with current system status *before* commencing any system work,
emergency or otherwise, that will keep grex down or unavailable for
more than a few minutes.
|
steve
|
|
response 76 of 176:
|
Nov 30 14:51 UTC 2005 |
Sigh. OK, upon rereading this I see what you mean. Don't type
before coffee should be my mantra on these rare days when I'm up
before 8am. Yes, putting an announcement on the hvcn page is
something we need to do.
|
tsty
|
|
response 77 of 176:
|
Nov 30 14:55 UTC 2005 |
an operation as large as an os upgrade ought to have had a
written checklist - and that checklist could have been discussed
looong beforehand in public, agora &/or coop.
but then shoulda/coulda/woulda only has recrimination value after the fact.
since this type of operation isn't about to happen too often, the
disaster is just lurking around - AS ALWAYS - waiting for memories
to fade.
the previous upgrades were not as complicated, had much more notice,
and were thought through with more precision - might even have had
a written checklist handy!
|
tsty
|
|
response 78 of 176:
|
Nov 30 15:34 UTC 2005 |
re hvcn ... i went there for info but had to call mary and ask where
it was on hvcn .. there was nothing (at that time) that would have
led anyone to know where to click --- unless you already knew in
advance and bookmarked it.
re #74 -- what 10pm curfew??? i thought 24/7/365.25 was the deal?
we got out of ken's warehouse for the same curfew problem, and now we
are back into another curfew? guess i wasn't paying enough attention.
btw, back on the checklist thought ... at least those with military
training would have, by default, created their own instructions sheet.
not that you have to have had military training to figger that out
but it helps. and systems engineering 101, remedial, would have demanded
a check list ... system analysis 099, non-remedial, would have had
checklist provisions built-in to the course.
hell, the repetitive event sequence of starting up an airplane is
done by two people with a checklist!
hell, back when *i* halted grex, by accident, i wanted that nonexistent
checklist to provide some thoughtful path. wasn't one. left grex 'as is'.
nothing was damaged, nothing was lost except my staff responsibilities -
backups (how ironic). at least one of the ppl (me) who foresaw precisely
this disaster *sometime* in the future and volunteered to backstop it
from being the disaster it now is, was 'offed' and dissed in the process.
future borg and future staff *could* have an agenda item: monthly
backup accomplished? checklist agenda item: yes/no
first things first. secure your environment, what is contemporary, before
wandering off into the unknown future with NO ROUTE HOME IN PLACE.
it is only in the last few years that i no longer enter a new
environment without already knowing THE OTHER WAY OUT, just in case.
catastrophe theory (my masters subject) says, 'you can't get back from here.'
therefore you prepare everything so that you will never *HAVE* to go
back where you can't get to.
if you can't get back and you can't cover your ass, you fscking STAY PUT
until another path is found/created to prevent exactly this sort
of catastrophe. my explanation of that, simplified, back when i
halted grex was apparently unintelligible to the recipient(s).
or, it was forgotten over time, which is more of what i think, just
damn forgotten. blithely erased from the cache of collected wisdom.
not much cache in that account anymore, eh?
|
steve
|
|
response 79 of 176:
|
Nov 30 18:06 UTC 2005 |
I'm not really sure how to respond to what you are saying. I
don't think you understand the nature of the upgrade, in that
there was no path back. The op system had problems with both
the networking card and filesystem issues, and as an extra
treat hardware problems.
This all dances around the critical issue that no one should
keep valuable mail on Grex only. Keeping valuable mail in the
/var/spool area is even worse; it's an active filesystem, the
most active one on Grex and as such is more prone to failures
than anything else.
|
naftee
|
|
response 80 of 176:
|
Nov 30 18:44 UTC 2005 |
I don't think anyone can coherently respond to one of tsty's posts :(
|
ric
|
|
response 81 of 176:
|
Nov 30 19:11 UTC 2005 |
(hah)
|
aruba
|
|
response 82 of 176:
|
Dec 1 00:23 UTC 2005 |
Re #79: You know, STeve, I just don't think that's a good enough answer.
Yes Grex is run by volunteers, and yes people can't expect the same
accountability from Grex that they can from someone they're paying to keep
their data safe.
But if Grex is to be anything people care about, then the board and staff
have to themselves care enough to do their best for Grex.
Frankly, I'm glad some people are really pissed about losing their mail. I
wish more people were. If no one gave a damn, well, then Grex would really
be nearing the end of its life as a viable community.
I agree with tsty that backing up the system ought not to be an ad hoc thing
that someone works out as he goes along. There ought to be a procedure in
the GrexDoc which gets followed each time. If the system changes so that it
gets done differently, then the changes ought to go into the documentation.
|
steve
|
|
response 83 of 176:
|
Dec 1 02:15 UTC 2005 |
And you don't think I care about Grex? Good God Mark, I made a mistake.
WHERE do you think that I a) don't feel badly about missing it, and b)
that I don't care?
|
aruba
|
|
response 84 of 176:
|
Dec 1 04:02 UTC 2005 |
I didn't say that or mean it, STeve. But I don't think telling people
that they shouldn't ever store anything that matters on Grex is a good
enough answer. I believe Bruce when he says the mistake you made could
have happened to anyone. *But I don't think it should have been you and
John alone in the room doing the backup and upgrade*. I think there
should be an institutional procedure in place for these things, so that
the collective knowledge and experience of all the staff is brought to
bear on the procedure. Grex shouldn't be dependent on one person's
judgement.
And actually, I thought there *was* a procedure in place, in the person of
the GrexDoc. And I thought you promised the board you would follow the
GrexDoc when you did the upgrade.
Maybe the doc wasn't complete - I don't know if it covered the backup
part of the upgrade. Did it?
|
scg
|
|
response 85 of 176:
|
Dec 1 04:17 UTC 2005 |
I haven't logged into Grex in the last year or two, but I've been
lurking on the staff mailing list. If I lost anything in this backup
error, it wasn't anything I cared about. That said, I, too, am somewhat
puzzled at the procedure that was followed.
I got my start doing Internet stuff as a member of the Grex staff more
than ten years ago, so I remember the constraints we had to work under
then. Grex was a rare piece of ancient Sun hardware, disks were really
expensive, and none of it was any more reliable than most of the other
stuff running on the Internet in those days. When something needed to
be done, it often meant taking the system offline, sometimes for a full
weekend in the case of a few major upgrades or disk crashes. We had a
much bigger staff back then, and for many of us whose social lives
revolved around the Grex community it was a pretty high priority, so
when something needed to be done there were typically lots of people
around to work on it. I can certainly see how doing things as we did
them then, but with a smaller and less Grex-focused staff, would lead
to long periods of downtime.
But I'm puzzled about why I see the same methods being used on Grex now,
when hardware is considerably cheaper and staff time appears to be a
much scarcer resource. My perspective is arguably a bit skewed. The
non-profit where I'm now a paid full-time staff member is pretty
impoverished, but still has a budget a couple of orders of magnitude
higher than Grex's, and I tend to come at systems stuff as a manager
rather than as a hands-on sysadmin these days. Still, it doesn't look
to me like the problems that are being talked about here are difficult
to solve.
If I recall correctly, Grex is now running on PC hardware that's at
least two or three years old. In other words, getting some equivalent
systems should be cheap (or free, given that that's replacement age at a
lot of places, and Grex is 501(c)3). Installing new software versions
on new hardware, testing, and then copying over whatever is dynamic at
the last minute, seems pretty obvious. Falling back to the old system
at that point if something doesn't work is at most a matter of moving an
ethernet cable. Likewise, having spare systems ready to copy whatever
is dynamic onto is a good way of dealing with hardware failures.
This really, I think, comes down to whether anybody still cares enough
about Grex to make it worth dealing with. My own view is that the
community I once cared about seems to have gone on to other things, and
the services Grex is providing aren't anything special anymore. But if
people care about keeping Grex operating, it looks like something needs
to change.
|
eprom
|
|
response 86 of 176:
|
Dec 1 05:58 UTC 2005 |
I'm upset about the mail debacle too but unfortunately, anytime any
constructive criticism is offered, it's somehow taken as a personal
attack on that person.
Back when I was in the AF, every section had a recall roster, and a
binder with documentation of a basic contingency plan and checklist.
I think that would be a good idea for grex. Too bad that would be
construed as tying staff's hands or micromanaging.
I also think the hierarchy of operation should change from every staff
operating as equals to someone volunteering as a main sys-admin, who
is accountable and reports to the board. It seems that the current way
of doing things is broken.
|
krj
|
|
response 87 of 176:
|
Dec 1 18:29 UTC 2005 |
I think SCG (hi, Steve!) hits a few important points in his resp:85.
In particular, I would like to stress the need to evolve our thinking
from "This computer is Grex," to "This internet service is Grex, and
these are the hardware components we have to support our service."
Right now, *everything* is a single point of failure for Grex,
and as we just learned, staff can't back out of an upgrade because
the upgrade is done on top of the old disks.
Amazon.com and LiveJournal don't go dark for a week while they
do upgrades; they acquire the hardware they need so that upgrades
can be rolled into production with a minimum of disruption.
----
Longer term: all of the community-building services Grex offers
are now offered, for free, by large organizations with professional
support staffs. The one thing which isn't common is the open access
to a shell prompt; but that's also one thing which creates huge
social/behavior management problems. It's also unclear to me if that's
a core function of Cyberspace Communications as it was organized,
rather than the tool towards the community-building goals which
was available 14 years ago.
|
nharmon
|
|
response 88 of 176:
|
Dec 1 18:43 UTC 2005 |
Amazon.com and LiveJournal have massive capital investments in hardware
and engineers that Grex simply can not and will not ever provide.
Further, the financial impact of outages is different for Grex than it
is for those two.
|
mcnally
|
|
response 89 of 176:
|
Dec 1 18:59 UTC 2005 |
You're right that Grex doesn't have as much money to spend as Amazon
or LiveJournal, though I don't think that point escaped anyone even
before you explicitly stated it. A more salient point is that Grex
has enough money in the bank to afford a backup disk. We just didn't
plan to use it for that.
|
nharmon
|
|
response 90 of 176:
|
Dec 1 19:13 UTC 2005 |
...Or the money could be spent on a colo that would give us 24/7 access
to the machine, thus giving staff a larger window to recover from outages.
You see, I think this is the sort of direction that some have been
saying Grex lacks. We're not sure what takes precedence.
Another suggestion:
Grex has security goals, why not have overall system goals? Maybe even a
mission statement? These goals could be put in order from most important
to least important...they could be things like: "Maintain a conference
system void of censorship", or "provide for limited dialup internet
access in the Ann Arbor area", or "provide for user data integrity
through fault-tolerant disk storage and regular backups".
Then, when it came to making decisions on expending resources, everyone
would be on the same page as to what problems took priority.
|
mcnally
|
|
response 91 of 176:
|
Dec 1 19:58 UTC 2005 |
I know that because of his work hours and long commute, physical access
during the day and the early evening is not feasible for STeve, but when
24/7 access is suggested nobody ever says who's hypothetically going to
be fixing the system at 3 AM, so I'm not sure access hours are the real
issue.
|
ric
|
|
response 92 of 176:
|
Dec 1 20:29 UTC 2005 |
I care about Grex and M-Net (for different reasons).
And I still think that anyone who uses either system with the expectation that
their files are safe OR secure is a fool, and I don't have any sympathy for
people who lost important email they had stored on grex.
|
mcnally
|
|
response 93 of 176:
|
Dec 1 20:35 UTC 2005 |
Won't ric be surprised when he finds out I used my staff access to
delete his home directory, conference participation files, and uid!
Just kidding, of course, but if he thinks users shouldn't expect their
e-mail to be safe from sudden disappearance I'm not sure what else on
the system ought to be sacrosanct.
|
nharmon
|
|
response 94 of 176:
|
Dec 1 20:38 UTC 2005 |
If most of the users agree with you mcnally, then that should be one of
Grex's goals.
|
krj
|
|
response 95 of 176:
|
Dec 1 20:57 UTC 2005 |
Mike in resp:91 :: before Grex left the Pumpkin, there were numerous
times when I dropped Steve off there after we got back from work,
and he worked on Grex for some hours in the very late evening or
early morning.
|
glenda
|
|
response 96 of 176:
|
Dec 2 02:14 UTC 2005 |
I seem to remember a few times that you dropped him off at the Pumpkin when
you got back into town, and picked him up there in the morning to go back to
work.
For those advocating having an equivalent system for doing upgrades and
recoveries: where do we store it? The colo charges for space. If a staffer
stores it we still have problems with access unless that staffer is the ONE
doing the upgrade/recovery.
|
ric
|
|
response 97 of 176:
|
Dec 2 15:22 UTC 2005 |
re 93 - I would be surprised, and I'd probably ask for your removal if you
did that on purpose without good reason. But it wouldn't really bother me
much. I'd just create a new account. I participate in two conferences - coop
and agora. And I have used the forget statement on all but one item in agora.
I don't have any files in my home directory that are important to me.
The only thing that might upset me is if I was unable to get my username "ric"
back, since I've pretty much been known as "ric" in the mnet/grex world since
1986.
(Though I think there was a period of time in the mid 90s where someone else
had that ID on Grex cuz I got reaped)
|
slynne
|
|
response 98 of 176:
|
Dec 3 00:10 UTC 2005 |
Even though we are just a small organization, there is nothing wrong
with us doing the best we can in all situations. I also think that
criticism is ok although I sometimes think that some people around here
have trouble presenting their criticism in the best possible way. It is
pretty easy to start feeling defensive about things.
As for the email loss. It was a mistake. It can't be undone and that is
that. No one did it out of malice. And even the most competent
technical people make mistakes sometimes and email sometimes gets lost
even at for-profit firms.
As for what we can do to prevent such a loss in the future...Well,
there are a lot of good ideas being presented here. I don't know what
the answer is. Our finances aren't great and I know that there is a
reluctance to spend a lot of money. However, exploring backup options
is really something we should do.
|
keesan
|
|
response 99 of 176:
|
Dec 3 05:06 UTC 2005 |
I just found the info that I had saved in a recent email and it is actually
nice not to have to go through all 200 or so old mails deciding if there was
anything important in them, so I am actually grateful now, and pine starts
up so much faster with an empty inbox. I wish spamassassin would work again.
|