|
Grex > Coop13 > #294: Why Grex lost its mail partition | |
|
glenda
|
|
response 66 of 176:
|
Nov 30 02:02 UTC 2005 |
Re #41: You keep saying that it would have been nice if notice had been given
before the upgrade. It was in the motd for several days beforehand that the
upgrade would happen that weekend if at all possible. The upgrade happening
as soon as STeve got everything together and could get with John was discussed
in at least one item for a couple of weeks before it was done. How much
warning do you need, or did you expect a personal email?
|
keesan
|
|
response 67 of 176:
|
Nov 30 02:25 UTC 2005 |
The problem with messages in the motd is that people forget to change them
when they get outdated so we tend to ignore them. If there were only relevant
messages there I would read them. I don't really want to know that grex was
down two weeks ago for a day.
|
nharmon
|
|
response 68 of 176:
|
Nov 30 04:04 UTC 2005 |
Things like maintaining the MOTD are tasks to give to people who want to
join staff as a way of seeing how they handle it. Start him/her off
here, and then go up from there.
|
glenda
|
|
response 69 of 176:
|
Nov 30 04:09 UTC 2005 |
If not the motd, where? It was also discussed in at least one item here and
in Agora. Short of sending email to every account on Grex, what are we
supposed to do? I manage to glance at the motd every time I log on enough
to notice if something new is posted. It only takes a couple of seconds, it
isn't that long and it has been rather up to date lately. If you choose to
ignore it, that is more your problem than it is staff's. Yes, I agree that
outdated things should be removed, but let's get real here.
|
nharmon
|
|
response 70 of 176:
|
Nov 30 04:11 UTC 2005 |
What would be the impact of sending an e-mail message to every account
on Grex?
OR, better yet, what about an opt-in mailing list for people who would
like to get system announcements.
|
steve
|
|
response 71 of 176:
|
Nov 30 04:31 UTC 2005 |
Now *that* is a good idea, a mailing list for announcements of system
work, downtime, etc. Excellent.
The impact of staff sending out mail to every account on Grex would
be 1) to take about 20 minutes of system pounding to deliver about
29,000 emails, 2) consume about 50M of /var/mail space, and 3) would
likely generate a couple hundred emails back with 1/2 asking if this
was real, and the other half asking about why and when the system
would be back up, regardless of what we said in the mail. ;-)
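A rough back-of-envelope check of the figures in this response, as a sketch only. The inputs are the estimates quoted above (29,000 messages, 20 minutes, 50M of spool space), not measurements, and the function names are made up for illustration. It assumes 1 MB = 1024 KB.

```python
# Back-of-envelope check of the estimates quoted in the response above.
# All three inputs are rough figures from the post, not measured values.
ACCOUNTS = 29_000   # "about 29,000 emails"
MINUTES = 20        # "about 20 minutes of system pounding"
SPOOL_MB = 50       # "about 50M of /var/mail space"


def per_message_kb(total_mb: float, messages: int) -> float:
    """Average spool space per message in KB, assuming 1 MB = 1024 KB."""
    return total_mb * 1024 / messages


def messages_per_second(messages: int, minutes: float) -> float:
    """Average delivery rate implied by the quoted run time."""
    return messages / (minutes * 60)


if __name__ == "__main__":
    print(f"{per_message_kb(SPOOL_MB, ACCOUNTS):.2f} KB per message")
    print(f"{messages_per_second(ACCOUNTS, MINUTES):.1f} messages per second")
```

Under those assumptions the estimate works out to a little under 2 KB per message and roughly two dozen deliveries per second, which is consistent with a short plain-text announcement.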
|
naftee
|
|
response 72 of 176:
|
Nov 30 05:08 UTC 2005 |
;)
GreX should provide an escort service
|
bhoward
|
|
response 73 of 176:
|
Nov 30 12:20 UTC 2005 |
Re#67 Sindi, we can certainly remove motd messages more aggressively.
Notices for things such as recent outages tend to stay in motd for
at least a week to ensure that folks not regularly logging in or
reading the conferences still will have some idea why the system
may have recently crashed or otherwise been unavailable.
A week's notice for major notices is a (hopefully) reasonable balance
between those who log in daily and those who hit the system at least
weekly (arguably it is a balance between those who only need to be
told once and those for whom the message may not register until
they've seen it several times).
|
steve
|
|
response 74 of 176:
|
Nov 30 13:26 UTC 2005 |
Sorry Bruce but I don't think we should commit to that. Access
to Grex's hardware is simply too limiting. Back when Grex was starting
to crash every day some months ago, I wanted to get to Grex and do
things for a week, every day, and simply couldn't get there in time
to be able to do anything with the 10pm curfew we live under now.
Yes, it's a *good thing* to give advance notice on shutdowns, I
fully agree. But let's not lock ourselves to it.
|
bhoward
|
|
response 75 of 176:
|
Nov 30 14:08 UTC 2005 |
Steve, I was referring to Sindi's complaint that motd has messages
about past crashes and outages too long *after* the event. I made
no comment as to how much warning there should be before there is
an outage.
I think our current routine of announcing system down time several
days in advance for scheduled downtime, and best effort warning for
anything else is sufficient.
On a related note, I think we should commit to updating the hvcn
page with current system status *before* commencing any system work,
emergency or otherwise, that will keep grex down or unavailable for
more than a few minutes.
|
steve
|
|
response 76 of 176:
|
Nov 30 14:51 UTC 2005 |
Sigh. OK, upon rereading this I see what you mean. Don't type
before coffee should be my mantra on these rare days when I'm up
before 8am. Yes, putting an announcement on the hvcn page is
something we need to do.
|
tsty
|
|
response 77 of 176:
|
Nov 30 14:55 UTC 2005 |
an operation as large as an os upgrade ought to have had a
written checklist - and that checklist could have been discussed
looong beforehand in public, agora &/or coop.
but then shoulda/coulda/woulda only has recrimination value after the fact.
since this type of operation isn't about to happen too often, the
disaster is just lurking around - AS ALWAYS - waiting for memories
to fade.
the previous upgrades were not as complicated, had much more notice,
and were thought through with more precision - might even have had
a written checklist handy!
|
tsty
|
|
response 78 of 176:
|
Nov 30 15:34 UTC 2005 |
re hvcn ... i went there for info but had to call mary and ask where
it was on hvcn .. there was nothing (at that time) that would have
led anyone to know where to click --- unless you already knew in
advance and bookmarked it.
re #74 -- what 10pm curfew??? i thought 24/7/365.25 was the deal?
we got out of ken's warehouse for the same curfew problem, and now we
are back into another curfew? guess i wasn't paying enough attention.
btw, back on the checklist thought ... at least those with military
training would have, by default, created their own instructions sheet.
not that you have to have had military training to figger that out
but it helps. and systems engineering 101, remedial, would have demanded
a checklist ... system analysis 099, non-remedial, would have had
checklist provisions built into the course.
hell, the repetitive event sequence of starting up an airplane is
done by two people with a checklist!
hell, back when *i* halted grex, by accident, i wanted that non-existent
checklist to provide some thoughtful path. wasn't one. left grex 'as is'.
nothing was damaged, nothing was lost except my staff responsibilities -
backups (how ironic). at least one of the ppl (me) who foresaw precisely
this disaster *sometime* in the future and volunteered to backstop it
from being the disaster it now is, was 'offed' and dissed in the process.
future borg and future staff *could* have an agenda item: monthly
backup accomplished? checklist agenda item: yes/no
first things first. secure your environment, what is contemporary, before
wandering off into the unknown future with NO ROUTE HOME IN PLACE.
it is only in the last few years that i no longer enter a new
environment without already knowing THE OTHER WAY OUT, just in case.
catastrophe theory (my masters subject) says, 'you can't get back from here.'
therefore you prepare everything so that you will never *HAVE* to go
back where you can't get to.
if you can't get back and you can't cover your ass, you fscking STAY PUT
until another path is found/created to prevent exactly this sort
of catastrophe. my explanation of that, simplified, back when i
halted grex was apparently unintelligible to the recipient(s).
or, it was forgotten over time, which is more of what i think, just
damn forgotten. blithely erased from the cache of collected wisdom.
not much cache in that account anymore, eh?
|
steve
|
|
response 79 of 176:
|
Nov 30 18:06 UTC 2005 |
I'm not really sure how to respond to what you are saying. I
don't think you understand the nature of the upgrade, in that
there was no path back. The op system had problems with both
the networking card and filesystem issues, and as an extra
treat hardware problems.
This all dances around the critical issue that no one should
keep valuable mail on Grex only. Keeping valuable mail in the
/var/spool area is even worse; it's an active filesystem, the
most active one on Grex and as such is more prone to failures
than anything else.
|
naftee
|
|
response 80 of 176:
|
Nov 30 18:44 UTC 2005 |
I don't think anyone can coherently respond to one of tsty's posts :(
|
ric
|
|
response 81 of 176:
|
Nov 30 19:11 UTC 2005 |
(hah)
|
aruba
|
|
response 82 of 176:
|
Dec 1 00:23 UTC 2005 |
Re #79: You know, STeve, I just don't think that's a good enough answer.
Yes Grex is run by volunteers, and yes people can't expect the same
accountability from Grex that they can from someone they're paying to keep
their data safe.
But if Grex is to be anything people care about, then the board and staff
have to themselves care enough to do their best for Grex.
Frankly, I'm glad some people are really pissed about losing their mail. I
wish more people were. If no one gave a damn, well, then Grex would really
be nearing the end of its life as a viable community.
I agree with tsty that backing up the system ought not to be an ad hoc thing
that someone works out as he goes along. There ought to be a procedure in
the GrexDoc which gets followed each time. If the system changes so that it
gets done differently, then the changes ought to go into the documentation.
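One way to picture the kind of written procedure being asked for here is a checklist that blocks the upgrade until every required step is signed off. The sketch below is purely illustrative: the step names are hypothetical and are not taken from the GrexDoc, and nothing here describes how Grex staff actually work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    required: bool = True   # optional steps never block the upgrade
    done: bool = False


@dataclass
class Checklist:
    title: str
    steps: List[Step] = field(default_factory=list)

    def mark_done(self, name: str) -> None:
        """Sign off a step by name; raise KeyError if it is not on the list."""
        for step in self.steps:
            if step.name == name:
                step.done = True
                return
        raise KeyError(name)

    def outstanding(self) -> List[str]:
        """Names of required steps that have not been signed off yet."""
        return [s.name for s in self.steps if s.required and not s.done]

    def ready(self) -> bool:
        """True only when no required step is outstanding."""
        return not self.outstanding()


if __name__ == "__main__":
    # Hypothetical step names, for illustration only.
    upgrade = Checklist("OS upgrade", [
        Step("backup covers every mounted filesystem"),
        Step("backup restore tested"),
        Step("motd notice posted"),
        Step("status page updated", required=False),
    ])
    upgrade.mark_done("motd notice posted")
    print("ready:", upgrade.ready())
    print("outstanding:", upgrade.outstanding())
```

The point of the design is that "ready" is computed from the list rather than remembered by whoever is in the room, so a changed procedure means editing the list, which keeps the documentation and the practice in step.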
|
steve
|
|
response 83 of 176:
|
Dec 1 02:15 UTC 2005 |
And you don't think I care about Grex? Good God Mark, I made a mistake.
WHERE do you think that I a) don't feel badly about missing it, and b)
that I don't care?
|
aruba
|
|
response 84 of 176:
|
Dec 1 04:02 UTC 2005 |
I didn't say that or mean it, STeve. But I don't think telling people
that they shouldn't ever store anything that matters on Grex is a good
enough answer. I believe Bruce when he says the mistake you made could
have happened to anyone. *But I don't think it should have been you and
John alone in the room doing the backup and upgrade*. I think there
should be an institutional procedure in place for these things, so that
the collective knowledge and experience of all the staff is brought to
bear on the procedure. Grex shouldn't be dependent on one person's
judgement.
And actually, I thought there *was* a procedure in place, in the person of
the GrexDoc. And I thought you promised the board you would follow the
GrexDoc when you did the upgrade.
Maybe the doc wasn't complete - I don't know if it covered the backup
part of the upgrade. Did it?
|
scg
|
|
response 85 of 176:
|
Dec 1 04:17 UTC 2005 |
I haven't logged into Grex in the last year or two, but I've been
lurking on the staff mailing list. If I lost anything in this backup
error, it wasn't anything I cared about. That said, I, too, am somewhat
puzzled at the procedure that was followed.
I got my start doing Internet stuff as a member of the Grex staff more
than ten years ago, so I remember the constraints we had to work under
then. Grex was a rare piece of ancient Sun hardware, disks were really
expensive, and none of it was any more reliable than most of the other
stuff running on the Internet in those days. When something needed to
be done, it often meant taking the system offline, sometimes for a full
weekend in the case of a few major upgrades or disk crashes. We had a
much bigger staff back then, and for many of us whose social lives
revolved around the Grex community it was a pretty high priority, so
when something needed to be done there were typically lots of people
around to work on it. I can certainly see how doing things as we did
them then, but with a smaller staff less focused on Grex, would lead
to long periods of downtime.
But I'm puzzled about why I see the same methods being used on Grex now,
when hardware is considerably cheaper and staff time appears to be a
much scarcer resource. My perspective is arguably a bit skewed. The
non-profit where I'm now a paid full-time staff member is pretty
impoverished, but still has a budget a couple of orders of magnitude
higher than Grex's, and I tend to come at systems stuff as a manager
rather than as a hands-on sysadmin these days. Still, it doesn't look
to me like the problems that are being talked about here are difficult
to solve.
If I recall correctly, Grex is now running on PC hardware that's at
least two or three years old. In other words, getting some equivalent
systems should be cheap (or free, given that that's replacement age at a
lot of places, and Grex is a 501(c)(3)). Installing new software versions
on new hardware, testing, and then copying over whatever is dynamic at
the last minute, seems pretty obvious. Falling back to the old system
at that point if something doesn't work is at most a matter of moving an
ethernet cable. Likewise, having spare systems ready to copy whatever
is dynamic onto is a good way of dealing with hardware failures.
This really, I think, comes down to whether anybody still cares enough
about Grex to make it worth dealing with. My own view is that the
community I once cared about seems to have gone on to other things, and
the services Grex is providing aren't anything special anymore. But if
people care about keeping Grex operating, it looks like something needs
to change.
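The "copy whatever is dynamic over at the last minute" step suggested in this response could be sketched roughly as below: copy a directory tree to the new system's staging area and compare checksums before cutting over, leaving the source untouched so falling back remains possible. This is an assumption-laden illustration, not a description of any tool Grex uses. The function names are hypothetical, and a real migration would more likely rely on dedicated backup or sync tools.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict


def file_digest(path: Path) -> str:
    """SHA-256 of one file, read in chunks so large mailboxes fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def copy_and_verify(src: Path, dst: Path) -> bool:
    """Copy src to dst (dst must not exist yet) and check the copy matches.

    The source tree is only read, never modified, so the old system
    stays usable as a fallback if verification fails.
    """
    shutil.copytree(src, dst)
    return tree_digests(src) == tree_digests(dst)
```

A caller would cut over to the new system only when `copy_and_verify` returns `True`. Note that this compares file contents only; ownership, permissions, and files changed during the copy are outside what this sketch checks.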
|
eprom
|
|
response 86 of 176:
|
Dec 1 05:58 UTC 2005 |
I'm upset about the mail debacle too but unfortunately, anytime any
constructive criticism is offered, it's somehow taken as a personal
attack on that person.
Back when I was in the AF, every section had a recall roster, and a
binder with documentation of a basic contingency plan and checklist.
I think that would be a good idea for grex. Too bad that would be
construed as tying staff's hands or micromanaging.
I also think the hierarchy of operation should change from every staff
operating as equals to someone volunteering as a main sys-admin, who
is accountable and reports to the board. It seems that the current way
of doing things is broken.
|
krj
|
|
response 87 of 176:
|
Dec 1 18:29 UTC 2005 |
I think SCG (hi, Steve!) hits a few important points in his resp:85.
In particular, I would like to stress the need to evolve our thinking
from "This computer is Grex," to "This internet service is Grex, and
these are the hardware components we have to support our service."
Right now, *everything* is a single point of failure for Grex,
and as we just learned, staff can't back out of an upgrade because
the upgrade is done on top of the old disks.
Amazon.com and LiveJournal don't go dark for a week while they
do upgrades; they acquire the hardware they need so that upgrades
can be rolled into production with a minimum of disruption.
----
Longer term: all of the community-building services Grex offers
are now offered, for free, by large organizations with professional
support staffs. The one thing which isn't common is the open access
to a shell prompt; but that's also one thing which creates huge
social/behavior management problems. It's also unclear to me if that's
a core function of Cyberspace Communications as it was organized,
rather than the tool towards the community-building goals which
was available 14 years ago.
|
nharmon
|
|
response 88 of 176:
|
Dec 1 18:43 UTC 2005 |
Amazon.com and LiveJournal have massive capital investments in hardware
and engineers that Grex simply cannot and will not ever provide.
Further, the financial impact of outages is different for Grex than it
is for those two.
|
mcnally
|
|
response 89 of 176:
|
Dec 1 18:59 UTC 2005 |
You're right that Grex doesn't have as much money to spend as Amazon
or LiveJournal, though I don't think that point escaped anyone even
before you explicitly stated it. A more salient point is that Grex
has enough money in the bank to afford a backup disk. We just didn't
plan to use it for that.
|
nharmon
|
|
response 90 of 176:
|
Dec 1 19:13 UTC 2005 |
...Or the money could be spent on a colo that would give us 24/7 access
to the machine, thus giving staff a larger window to recover from outages.
You see, I think this is the sort of direction that some have been
saying Grex lacks. We're not sure what takes precedence.
Another suggestion:
Grex has security goals, why not have overall system goals? Maybe even a
mission statement? These goals could be put in order from most important
to least important...they could be things like: "Maintain a conference
system void of censorship", or "provide for limited dialup internet
access in the Ann Arbor area", or "provide for user data integrity
through fault-tolerant disk storage and regular backups".
Then, when it came to making decisions on expending resources, everyone
would be on the same page as to what problems took priority.
|