|
Grex > Helpers > #142: Grex System Problems - Summer 2005 | |
|
naftee
|
|
response 240 of 281:
|
Aug 25 18:19 UTC 2005 |
/etc/.
|
albaugh
|
|
response 241 of 281:
|
Aug 25 20:14 UTC 2005 |
User "XXX" on grex had the mail program (sending of mail from the command
prompt) hang/crash. Ever since then, on a daily basis or more frequently,
grex keeps sending the user the following notice, even though the recovery
file was deleted. What is doing this, and how can it be turned off?
(E-mails about this to staff have not met with a response.)
From root@cyberspace.org Thu Aug 25 15:45:55 2005
Envelope-to: XXX@cyberspace.org
Delivery-date: Thu, 25 Aug 2005 15:45:55 -0400
X-vi-recover-file: /tmp/mail.RenjUxL20914
X-vi-recover-path: /var/tmp/vi.recover/vi.UHFdq21332
Reply-To: root@cyberspace.org
From: root@cyberspace.org (Nvi recovery program)
To: XXX@cyberspace.org
Subject: Nvi saved the file mail.RenjUxL20914
Precedence: bulk
Date: Thu, 25 Aug 2005 15:45:55 -0400
On Wed Aug 17 14:08:22 2005, the user XXX was editing a file
named /tmp/mail.RenjUxL20914 on the machine
grex.cyberspace.org, when it was saved for recovery. You can
recover most, if not all, of the changes to this file using
the -r option to vi:
vi -r /tmp/mail.RenjUxL20914
|
drew
|
|
response 242 of 281:
|
Aug 26 21:37 UTC 2005 |
When I'm dialed in direct, it is *still* impossible for me to make any
responses or new items; I get some sort of core dump error from the editor.
Is there a more "automatic" way to get a response entered from a pre-written
file? Something like "respond < filename"?
|
drew
|
|
response 243 of 281:
|
Aug 27 05:19 UTC 2005 |
It appears to be something about the gate editor. I am entering this
direct-dialed with vi.
|
albaugh
|
|
response 244 of 281:
|
Aug 29 22:08 UTC 2005 |
While bbs'ing:
/log: write failed, file system is full
|
albaugh
|
|
response 245 of 281:
|
Aug 30 19:11 UTC 2005 |
Re: resp:241 finally the nags are no longer being sent.
|
rksjr
|
|
response 246 of 281:
|
Aug 31 05:46 UTC 2005 |
Re. #183. Regarding crashing, when I was disconnected shortly after
logging on (a little after 8 p.m.), I became curious as to how frequent
the crashes have been recently, which motivated my composing a log of
recent reboots.
Last login data is included to estimate system down time.
For the privacy of users who happened to be the last users logged-in
immediately prior to a reboot, I have replaced their userids with
"userid" in the data below.
reboot ~ Tue Aug 30 20:28 [8:28pm]
userid ttyq1 157.95.31.174 Tue Aug 30 20:13 - crash
(00:15) [approx. time following last login: 28 - 13 = 15 min.]
.... .... .... .... ....
reboot ~ Tue Aug 30 10:37
userid ttyq3 80.51.51.23 Tue Aug 30 10:18 - 10:21
(00:02) [approx. time following last login: 37 - 18 = 19 min.]
.... .... ....
reboot ~ Tue Aug 30 08:47
userid ttyq3 217.21.35.33 Tue Aug 30 06:50 - crash
(01:56)
[approx. time following last login: 8:47 - 6:50 = 1 hr. 57 min.]
.... .... ....
reboot ~ Mon Aug 29 10:02
userid ttyp2 dialup-4.159.214.153.dial1.chicago1.level3.net
Mon Aug 29 09:45 - crash
(00:16) [approx. time following last login: 10:02 - 9:45 = 17 min.]
.... .... .... .... .... .... ....
reboot ~ Mon Aug 29 00:43
userid ttyqe helix.kaist.ac.kr Mon Aug 29 00:27 - crash
(00:16) [approx. time following last login: 43 - 27 = 16 min.]
.... .... .... .... ....
reboot ~ Sun Aug 28 07:32
userid ttyp7 ACD6D4DE.ipt.aol.com Sun Aug 28 07:16 - 07:16
(00:00) [approx. time following last login: 32 - 16 = 16 min.]
.... .... ....
reboot ~ Sat Aug 27 10:34
userid ttypb ip68-13-188-36.om.om.cox.net
Sat Aug 27 02:09 - 02:11
(00:01)
[approx. time following last login: 10:34 - 02:09 = 8 hrs. 25 min.]
.... .... .... .... .... .... .... .... ....
reboot ~ Thu Aug 25 15:46 (3:46pm)
userid ttyp3 netsun.cl.msu.edu Thu Aug 25 15:30 - crash
(00:15) [approx. time following last login: 46 - 30 = 16 min.]
Mode of accessing the above data:
Step 1: Access the shell prompt.
Step 2: Type: "last | more" (without the quotation marks). (The pipe "|"
is the shifted character on the same key as the backslash "\". Typing a
pipe into an editor screen sometimes produces unpredictable results, but
the shell prompt should accept it.)
Step 3: Type: "/" (without the quotation marks).
Step 4: Type "reboot" (without the quotation marks).
Step 5: To view prior reboots: repeat steps 3 and 4 for each prior reboot.
(You may be able to depress the up arrow key in lieu of retyping
"reboot".)
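The elapsed-time figures in the log above are plain clock subtraction. A minimal sketch of that arithmetic, assuming both times fall on the same day and are written in 24-hour HH:MM form; the function name minutes_between is made up for this illustration and is not a Grex or Unix command:

```shell
# minutes_between EARLIER LATER
# Prints the minutes from EARLIER to LATER, both HH:MM on the same day.
# (Hypothetical helper; awk does the math so that "08" is read as decimal 8.)
minutes_between() {
    echo "$1 $2" | awk '{
        split($1, from, ":")
        split($2, to, ":")
        print (to[1] * 60 + to[2]) - (from[1] * 60 + from[2])
    }'
}

minutes_between 20:13 20:28   # Tue Aug 30 evening crash
minutes_between 06:50 08:47   # Tue Aug 30 morning crash
```

The first call prints 15 and the second prints 117 (1 hr. 57 min.), matching the figures in the log.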
|
remmers
|
|
response 247 of 281:
|
Aug 31 11:37 UTC 2005 |
Re Step 5: Typing "n" instead of "/reboot" will also skip to the next
reboot entry.
Also, if you just want to see a list of recent reboots with other login
information filtered out, you can use the Unix 'grep' utility. Type
this at the shell prompt:
last|grep '^reboot '|more
Using 'awk', you can get a list of reboots, with each reboot followed by
a list of who was logged in at the time of the immediately preceding crash:
last|awk '{if (/^reboot /) print $0; else if (/- crash/) print " "$1}'|more
(Type the preceding command all on one line.)
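To try the awk filter without a live system, it can be run on saved text instead of on the output of 'last'. The sketch below assumes the input looks like the lines in resp:246; the function name reboots_with_crashed_users and the abridged sample are illustrative only:

```shell
# Read `last`-style lines on standard input. Print each reboot line,
# followed by the indented userid of every session that ended in a crash.
reboots_with_crashed_users() {
    awk '/^reboot / { print; next }
         /- crash/  { print "    " $1 }'
}

# Abridged sample taken from the log in resp:246 (userids already masked).
sample='reboot ~ Tue Aug 30 20:28
userid ttyq1 157.95.31.174 Tue Aug 30 20:13 - crash (00:15)
reboot ~ Tue Aug 30 10:37
userid ttyq3 80.51.51.23 Tue Aug 30 10:18 - 10:21 (00:02)'

printf '%s\n' "$sample" | reboots_with_crashed_users
```

On this sample the filter prints both reboot lines and lists "userid" only under the first, because the second session ended with a normal logout rather than a crash.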
These reboots are not planned. For a few days now, Grex has been
crashing a couple of times a day, resulting in downtime of 20 minutes or
so while it reboots itself. At this point, cause unknown. Usually the
reboot is successful; when it's not, somebody (usually somebody at our
colo, and on some occasions me) has to push the reset button manually.
I realize the sporadic outages are annoying. Hopefully we'll get the
problem resolved soon.
|
keesan
|
|
response 248 of 281:
|
Aug 31 14:28 UTC 2005 |
I was logged on twice this week when it happened, I think. Lucky me.
I have been emailing gelinas each time - is this appropriate? Should I email
colo instead? Or phone them?
|
remmers
|
|
response 249 of 281:
|
Aug 31 14:46 UTC 2005 |
As a practical matter, I'm online often enough that most of the time I
notice that Grex is down sooner than another staff member is likely to
notice or to check their email. So for this particular problem, I don't
think emailing someone speeds up the process of getting Grex back up
when it doesn't successfully reboot itself.
You shouldn't contact the colo directly. They are just hosting our
server and don't maintain it. They are willing to do something simple,
like power-cycle it or hit the reset button, but for security reasons
only on the direct request of a Grex staff member who is known to them.
|
albaugh
|
|
response 250 of 281:
|
Aug 31 17:51 UTC 2005 |
Is it known yet whether the reboots are hardware-initiated,
software-initiated, or both?
|
remmers
|
|
response 251 of 281:
|
Sep 1 12:54 UTC 2005 |
Not known to me.
No reboots in two days. (cross fingers)
|
albaugh
|
|
response 252 of 281:
|
Sep 1 16:36 UTC 2005 |
As opposed to "finger cross". ;-)
|
cross
|
|
response 253 of 281:
|
Sep 1 20:32 UTC 2005 |
This response has been erased.
|
mcnally
|
|
response 254 of 281:
|
Sep 2 01:15 UTC 2005 |
Think pretty highly of yourself, don't you?
|
cross
|
|
response 255 of 281:
|
Sep 2 17:03 UTC 2005 |
This response has been erased.
|
tod
|
|
response 256 of 281:
|
Sep 2 17:04 UTC 2005 |
Would you settle for an MRE?
|
cross
|
|
response 257 of 281:
|
Sep 2 17:13 UTC 2005 |
This response has been erased.
|
happyboy
|
|
response 258 of 281:
|
Sep 2 17:16 UTC 2005 |
/send dan a big bucket of popeye's wings and a soady-pop
|
drew
|
|
response 259 of 281:
|
Sep 3 17:47 UTC 2005 |
Now it's refusing to let me enter stuff direct-dialed using vi. Just got a
"nasty error message" or something when I tried to enter a response.
|
richard
|
|
response 260 of 281:
|
Sep 13 18:54 UTC 2005 |
grex is back! thanks to staff for what sounds like a lot of work to
repair the labor day attack.
what exactly happened that caused this mess anyway?
|
aruba
|
|
response 261 of 281:
|
Sep 14 14:16 UTC 2005 |
Thanks to the staff member(s) who got Grex back up. Could we hear the
story?
|
eprom
|
|
response 262 of 281:
|
Sep 14 16:28 UTC 2005 |
The response time was outrageous! We need some accountability here.
People need to be fired or demoted and a contingency plan should be
drafted up just in case this happens again!
|
remmers
|
|
response 263 of 281:
|
Sep 14 16:35 UTC 2005 |
The staff member who got Grex back up was me, aided by Jan Wolter's
life-saving mirroring software and some helpful advice in email from
Marcus Watts. I'm only sorry that I wasn't able to devote much
attention to it sooner, due to other commitments last week.
What happened: Some files in the /etc disk partition (in particular,
the password file) became corrupt, for reasons unknown to me but
probably due to a software glitch (don't know if it was OS software or
application software, either). I made a trip to our colo and was able
to run some tests and verify that the disks and filesystems were
healthy, but didn't have time to investigate further. On a subsequent
trip, I booted into single user mode and took some time to look around
the filesystem, eventually discovering that the password file (and
possibly others) had been corrupted.
Grex's important file systems (system directories, user directories,
bbs) are backed up to a spare hard drive every few hours, thanks to some
mirroring software that Jan Wolter wrote. Because of this, I was able
to restore "good" versions of the files in /etc from the state they were
in about 4 hours before the crash. Thankfully, that's all it took to
get Grex to boot successfully. The most that was lost was whatever new
accounts were created via newuser in that 4-hour period, I think.
Diagnosis of the cause of the problem will have to be left to someone
who knows more about OpenBSD than I do. Until the cause is addressed,
the problem may well recur. If it does, at least we know where to look
now, and Grex should be up a lot sooner. I'm sorry that it all took so
long this time.
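Since the trouble turned out to be a corrupted password file, a quick format check is one way to spot that kind of damage. What follows is only a sketch, not the procedure used in the recovery described above: it assumes the standard seven colon-separated fields of a Unix passwd entry, and check_passwd_format is a name invented here.

```shell
# check_passwd_format FILE
# Report every line of FILE that does not have exactly 7 colon-separated
# fields (the usual passwd layout). Exit status is 1 if any line is bad.
# Hypothetical helper for illustration; it checks shape only, not content.
check_passwd_format() {
    awk -F: '
        NF != 7 {
            printf "line %d: expected 7 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

It could be pointed at a copy of the file, for example check_passwd_format /etc/passwd, and produces no output when every line has the expected shape.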
|
edina
|
|
response 264 of 281:
|
Sep 14 16:43 UTC 2005 |
John, thank you for your assistance. It is appreciated.
|