|
Grex > Coop7 > #47: A question on bringing the system back up after it's been off the net | |
|
| Author |
Message |
steve
|
|
A question on bringing the system back up after it's been off the net
|
May 15 23:23 UTC 1995 |
Here's a question for everyone.
Does it make sense to keep telnet sessions (i.e., humans) off
Grex for a little while after Grex has been down a while, in
order to more quickly get caught up on our mail?
A little background on this. When Grex is off the net for
whatever reason, other sites that give us mail try to connect to
us and fail. That's fine; they keep the mail intended for Grex
in their queues, and once we're back up we get that mail when
the foreign system tries contacting us again.
The problem is, when we're off the air for several hours,
A LOT of mail can sit there at 100+ different sites waiting to
get to Grex. I call this a sendmailfest, because even before
Grex allows humans to telnet in, Grex is accepting SMTP (i.e., mail)
connections, and often by the time any human can log in to
Grex, more than a dozen sites are feeding us what they've been
holding.
Today I tried something different. As soon as the system was
up I disabled logins, and in theory any of you who tried getting
in from 6:16 to 7pm Monday night got a little message that was in a
file called /etc/nologin. This kept humans off Grex, and let
sendmail run wild. I've never seen anything like it before. At
one point after Grex had been up for about 10 minutes, there were
more than *70* sendmail connections running, quite merrily. It
seems that this little Sun-3/260 we're on is just about right for
a V.34 (28.8Kbps) modem connection to the net for mail processing!
As I watched, we were getting new pieces of mail sometimes
once a second. As I said, I've never seen anything like this.
After a little while I started processing the mail in the queue as
well, so Grex was accepting mail and processing it, and doing fairly
well.
But, obviously, people couldn't get in during this period of time.
For this 6 hour period of downtime, about 40 minutes was needed
to mostly catch up on things. Do people think it's worth it, keeping
Grex off the net for a while such that it can process mail after a
crash? There are times when we (staff) probably can't do this, but
overall, what makes more sense--
a) bring Grex back up to the world as quickly as possible and let
things be slower;
b) keep Grex unavailable for a while so the mailstorm that occurs
after reboots is less intense when people are on?
|
| 39 responses total. |
robh
|
|
response 1 of 39:
|
May 15 23:35 UTC 1995 |
Well, firstly, when I tried telnetting in via M-Link, I got
the "breakfast food of champions" message, it asked for my login
and password, I gave them, and it promptly disconnected.
Might want to check that script of yours, steve. >8)
Secondly, painful though it is to admit it, I like the idea
of keeping Grex "down" a while longer to take care of the
mail flow. As I type this, the system is ludicrously
slow from all the people trying to connect at once, I hate to
think how much worse it would be if mail was trying to
get through too.
Thirdly, how would having a dedicated mail machine change this?
Would it be able to receive mail while Grex was down?
|
jep
|
|
response 2 of 39:
|
May 16 03:39 UTC 1995 |
I think keeping the connection free for e-mail for a little while is
an excellent idea. I may snag this one for M-Net. (-:
|
nephi
|
|
response 3 of 39:
|
May 16 06:39 UTC 1995 |
Forty-five minutes? That's a long time.
|
tsty
|
|
response 4 of 39:
|
May 16 10:04 UTC 1995 |
That's a lot of mail.
I like the idea, STeve, fwiw .. and i hope a lot.
Perhaps a "standard" amount of time could be dialed into a bootup,
like, say 20 minutes, and which, also, could be killed by the
booter if it doesn't apply to a specific boot.
And as robh noted, this is a painful conclusion, but i think it's
right.
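
A rough sketch of that timed hold, for anyone who wants to picture it. It assumes the usual Unix behaviour STeve described (while the nologin file exists, non-root logins are refused and shown its contents). The function name, the cancel-file idea and the paths are made up for illustration; this is not how Grex's boot scripts actually work.

```python
import os
import time


def hold_logins(path, message, hold_seconds, cancel_path=None,
                poll_seconds=1.0, sleep=time.sleep):
    """Create the nologin file, wait out the hold period, then remove it.

    path         -- nologin file to create (hypothetical; /etc/nologin on a real host)
    message      -- text shown to people who try to log in
    hold_seconds -- how long to keep humans off
    cancel_path  -- optional file whose presence ends the hold early
                    (an assumed way for the person booting to cancel)
    Returns the number of seconds actually waited.
    """
    with open(path, "w") as f:
        f.write(message + "\n")

    waited = 0.0
    try:
        while waited < hold_seconds:
            # The booter can cut the hold short by creating the cancel file.
            if cancel_path and os.path.exists(cancel_path):
                break
            sleep(poll_seconds)
            waited += poll_seconds
    finally:
        # Always reopen logins, even if the wait is interrupted.
        if os.path.exists(path):
            os.remove(path)
    return waited
```

Something like hold_logins("/etc/nologin", "Grex is catching up on mail, try again shortly", 20 * 60) would give the 20 minute default; the finally clause is there so that an interrupted hold does not leave logins disabled.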
|
steve
|
|
response 5 of 39:
|
May 16 12:30 UTC 1995 |
I probably could have cut the 45 minutes down some; perhaps to
30. We need to look around more.
The important question is, do we want to do this for periods
of down time? About a month ago we were off the net for 28 hours
because of problems getting into IC-Net. When we got back up, it
was pretty bad for about 6 hours, with all the mail coming in.
That's the time that it would be really helpful to do this.
|
selena
|
|
response 6 of 39:
|
May 16 12:31 UTC 1995 |
I'm iffy.. not opposed, just iffy. I mean, I'm used to grex being
slow.. but mail isn't even the reason I come here- I come here for the
people! If it's slow, well, what's the diff between that, and some nights,
with 24+ loads?
If you guys think it'd do good, go for it. I'd just as soon be
online, though.
|
selena
|
|
response 7 of 39:
|
May 16 12:37 UTC 1995 |
SIX hours? Is that how long logins would be cut?
|
popcorn
|
|
response 8 of 39:
|
May 16 14:10 UTC 1995 |
The idea is that logins were still available, and that's why it took
6 hours for all the e-mail to come in. If logins had been turned off,
the mail would have arrived a lot faster (maybe in an hour?), and then
Grex would have been back at its normal speed for the rest of the 6
hours. Even if you don't use Grex for mail, other people's mail slows
Grex down for you.
|
steve
|
|
response 9 of 39:
|
May 16 17:19 UTC 1995 |
Email is the single most resource-intensive thing on Grex. I
think it will still be ahead of news, once we have that working
again. The amount of mail we handle, and the amount of IP traffic
it represents is just incredible.
|
adbarr
|
|
response 10 of 39:
|
May 16 22:54 UTC 1995 |
srw - when you get here - sounds remarkably relevant to the
issues we were talking about today. Hmm!
|
selena
|
|
response 11 of 39:
|
May 17 00:52 UTC 1995 |
Right, popcorn.. I know that. That's why I said that I wasn't
opposed, just iffy.. I did say go for it, if you guys felt it would do
good, didn't I?
<Selena tries to be nice, and still gets taken adversely>
*sigh*
|
davel
|
|
response 12 of 39:
|
May 17 01:48 UTC 1995 |
Eh? I don't think so, Selena. They just thought (from your response)
that maybe the issues hadn't been clearly enough stated. STeve, in
posting the thing originally, clearly wasn't offering this as something
we should do, but as something we should consider.
STeve, how does this impact dialins? Just curious.
|
zook
|
|
response 13 of 39:
|
May 17 01:59 UTC 1995 |
I think it's a good idea. If the system is already crashed, what's the big
deal waiting a few more minutes? Especially for long down-times. If it took
45 minutes for 6 hours, how long would it take for a couple days? And if
we didn't set aside some reboot time for mail, how many DAYS would Grex run
slowly to catch up on the mail for a couple-day downtime? (I assume this
delay could be heuristically estimated based on down-time).
Just my $0.02.
|
steve
|
|
response 14 of 39:
|
May 17 03:53 UTC 1995 |
I was thinking of disallowing all dialins, too. That way all of
Grex would be devoted to mail processing.
|
selena
|
|
response 15 of 39:
|
May 17 04:22 UTC 1995 |
Well, that'd be fair, too..
|
steve
|
|
response 16 of 39:
|
May 17 13:10 UTC 1995 |
Yes. The idea was to keep all humans except root off the system
while playing catch up.
|
ajax
|
|
response 17 of 39:
|
May 17 15:32 UTC 1995 |
Is it worth having dialins attach, display the nologin message file
(I assume it would say something like "grex'll be back up at 6pm"),
then hang up? Or would that irritate people who pay by the call?
|
helmke
|
|
response 18 of 39:
|
May 17 16:10 UTC 1995 |
It *would* be nice to know when Grex is going to be back up if you are
trying to decide to try in a little while or go somewhere.
|
steve
|
|
response 19 of 39:
|
May 17 20:32 UTC 1995 |
Something that I've wanted to do for a while now, but have
never thought quite important enough to do, would be to have
a little PC sitting there that is just smart enough to answer
the phone, spew out a message and hang the phone up. For
disasters, that would be wonderful.
|
srw
|
|
response 20 of 39:
|
May 18 10:44 UTC 1995 |
It would be nice to give out more info when Grex is down,
but #19 is tangential to the issue here. The time of Grex's return
to normalcy could be made available during the sendmailfest.
Based on needing 30-40 minutes to recover from a 6 hour downtime,
would you say, STeve, that you would anticipate a sendmailfest duration
of about 10% of the downtime? If so, we would have been in that state for
2.8 hours after the 28 hour downtime. The right way to figure this out is
to know the total flow per unit time of all mail to Grex, and divide
it by the effective link speed.
I am in favor of the sendmailfest when we've been down for 6 hours or more.
It compensates in part for the fact that Grex routinely shuts mail processing
down because of high load averages.
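
To make the arithmetic concrete, here is a small sketch of both estimates: the 10% rule of thumb, and the backlog divided by link speed. The function names are invented for illustration, and nobody has measured Grex's real mail volume here, so any flow numbers you feed it are guesses. The second estimate also ignores new mail that keeps arriving during the catch-up.

```python
def catchup_minutes(downtime_hours, fraction=0.10):
    """Rule of thumb: hold logins for a fixed fraction of the downtime."""
    if downtime_hours < 0:
        raise ValueError("downtime cannot be negative")
    return downtime_hours * 60 * fraction


def catchup_minutes_from_flow(downtime_hours, mail_bytes_per_hour,
                              link_bytes_per_second):
    """Backlog built up while down, divided by what the link can carry.

    mail_bytes_per_hour   -- average inbound mail volume (an assumed figure)
    link_bytes_per_second -- effective link throughput (an assumed figure)
    """
    if link_bytes_per_second <= 0:
        raise ValueError("link speed must be positive")
    backlog_bytes = downtime_hours * mail_bytes_per_hour
    return backlog_bytes / link_bytes_per_second / 60


if __name__ == "__main__":
    for hours in (6, 28):
        print(hours, "hours down ->", catchup_minutes(hours), "minutes")
```

With the 10% figure, 6 hours of downtime works out to about 36 minutes and 28 hours to about 168 minutes, which is the 2.8 hours mentioned above.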
|
steve
|
|
response 21 of 39:
|
May 18 21:27 UTC 1995 |
I think that a longer period of time, like a day, might be worse,
since mail doesn't come in in an exactly linear fashion. 10% might
be a good starting point for estimating downtime playing catch up.
|
nephi
|
|
response 22 of 39:
|
May 30 08:51 UTC 1995 |
Hmm. So, is this going to be implemented?
|
steve
|
|
response 23 of 39:
|
May 30 13:30 UTC 1995 |
We haven't had a long period of downtime since then. Unless
we automate this, it requires staff intervention. So the next
time we get over a period of downtime, someone might not be able
to monitor the situation. It can be done remotely, but it's really
nice to be able to sit there and watch the link's modem's lights
blink--a graphical way to see the incoming traffic. ;-)
|
peacefrg
|
|
response 24 of 39:
|
May 31 00:02 UTC 1995 |
I would rather connect to grex at a slower rate than not at all.
My vote and 2 cents
|