|
Grex > Helpers > #140: Grex System Problems - Spring 2005 | |
|
| Author |
Message |
mcnally
|
|
response 237 of 457:
|
May 3 22:57 UTC 2005 |
> I really want to know why people take one hypothesis I propose
> (which I clearly stated was a hypothesis) and fixate on that,
My own theory is that it's due to the somewhat confrontational tone
of your messages. Even though I agree with most of your conclusions
I'll admit I'm somewhat put off by the way you've worded your responses.
Presumably it's because you feel strongly about the issue, which I
applaud.
But if Grex wants to know why volunteer staff resources are drying up,
we need look no further than the way this and other "discussions" about
the system develop. Very few people will volunteer to join a flame war
already in progress. Of course it takes more than one party to have a
good fight, but whoever's responsible for the tone, maybe we can all just
back off a little bit and instead of concentrating on what possibly
*should* have been done, figure out what to do now.
> Okay, so here are my major points:
>
> (1) We've had one *major* security hole in OpenBSD.
Agreed, and frankly it's a baffling one for a supposedly
secure OS. Granted the security promise offered by OpenBSD
partisans is usually "Only <n> root exploits since <time t>"
but world-readable ttys is bizarre.
> I don't think it was a configuration issue.
I'm not so sure about this -- it seems incredible to assume
that if this were really the system default it wouldn't be
very widely known, and OpenBSD rightly slammed for it.
I think when I get home tonight I'm going to have to install
OpenBSD on a spare computer and test to see whether this is,
in fact, the way things work out of the box on OpenBSD.
> (2) OpenBSD crashes quite a bit more than I or anyone else am
> comfortable with. It doesn't appear to be because of the network
> driver. It often crashes on filesystem errors or, apparently,
> because the proc table gets full.
Because of the time investment required to change OSes again
and the fact that we don't know for sure that FreeBSD will be
better, I'm inclined to give OpenBSD more time to prove itself
providing:
(a) we can guarantee a fix to the fork-bomb vulnerability, and
(b) we put another disk in place of the one that's erroring.
Of course I'm not sure I even get a vote, but that would be my
recommendation if my advice were solicited.
> (3) Our *application* is not disk intensive, but because things
> like soft metadata updates aren't reliable on OpenBSD, we're
> *making it* disk intensive. If that's chewing up drives, then
> fine, but it's not like grex has such a high volume of *usage* that
> it *has* to be that way. *REAL* high volume usage
>
> Steve, you were one of the people who were adamant about OpenBSD.
> Please respond to these problems. Should I believe that they'll
> be solved in the newest version of OpenBSD? Or did we make a mistake
> going with OpenBSD?
Aren't there disk-usage monitoring tools we can use to get some
sense of what's going on with our disks? And would it help
relieve thrashing if we made /tmp a 512MB RAM disk or picked
something like that?
Something's already seriously wrong if / (containing /etc) is
being written to so often that disk corruption is regularly
bringing the system down. How many things need to write to /
anyway? Newuser? What else? Virtually everything else on the
system seems like it should write to /var, /tmp, or a bbs or
homedir partition.
|
steve
|
|
response 238 of 457:
|
May 3 23:16 UTC 2005 |
sd0a went because of a general problem with it. I seriously doubt
that the runs of newuser caused this. Though /a was on sd0 and that
did have lots of i/o.
|
gull
|
|
response 239 of 457:
|
May 3 23:54 UTC 2005 |
Re resp:234: Okay, I'll make one up and email it to you.
Re resp:237 (1): I took a quick look at an OpenBSD 3.6 system at work.
It's used strictly as a firewall, no local logins except root for
maintenance, but that's not really relevant.
What I found is that pseudo-ttys appear to be world-readable until
they're used. For example, with root logged in on ttyp0:
crw--w---- 1 root tty 5, 0 May 3 19:44 /dev/ttyp0
crw-rw-rw- 1 root wheel 5, 1 Dec 18 13:53 /dev/ttyp1
crw-rw-rw- 1 root wheel 5, 2 Dec 18 13:53 /dev/ttyp2
(etc.)
Now, if I open another ssh connection, again as root:
crw--w---- 1 root tty 5, 0 May 3 19:46 /dev/ttyp0
crw--w---- 1 root tty 5, 1 May 3 19:46 /dev/ttyp1
crw-rw-rw- 1 root wheel 5, 2 Dec 18 13:53 /dev/ttyp2
(etc.)
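For anyone who wants to repeat this check without eyeballing ls output, here is a minimal Python sketch. It has not been run on Grex; the function name `world_accessible` and the default pattern `/dev/ttyp*` are assumptions for illustration only. It reports nodes whose mode grants read or write access to "other":

```python
import glob
import os
import stat


def world_accessible(pattern="/dev/ttyp*"):
    """Return (path, mode string) pairs for nodes readable or writable by 'other'."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        mode = os.stat(path).st_mode
        # S_IROTH / S_IWOTH are the read and write bits for "other".
        if mode & (stat.S_IROTH | stat.S_IWOTH):
            hits.append((path, stat.filemode(mode)))
    return hits


if __name__ == "__main__":
    for path, mode in world_accessible():
        print(mode, path)
```

Running it before and after opening a new session should show whether a pty drops off the list once it is allocated, as in the listings above.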
|
cross
|
|
response 240 of 457:
|
May 4 02:02 UTC 2005 |
This response has been erased.
|
steve
|
|
response 241 of 457:
|
May 4 02:27 UTC 2005 |
Let's reverse the tty problem for a minute--if it wasn't us, then
it was in the release of OpenBSD 3.5. I've been looking for comments
about that and haven't seen any so far. It could be the case that we
didn't do anything, but I tend to think that the collective set of
people who worked on the machine could have done something. I agree
with you that we should dig into things.
We crashed at least twice with a trace leading back to a bge symbol.
I think that's fairly good evidence that it was in the NIC.
I'm obviously pro OpenBSD. I came to be that way after staring at
several Linux flavors, then Net- and FreeBSD, then OpenBSD. Since Oct
1999 I've been using it exclusively and have found it rock stable
except for when hardware problems have messed things up. I know of
no other system that puts security first and takes the pro-active stance of
fixing things and developing enhancements like the write xor execute
system. Grex needs these things. We get hit on by enough people
that we need all the help we can get.
It occurs to me that we ought to order a 3.7 CD set.
|
nharmon
|
|
response 242 of 457:
|
May 4 02:38 UTC 2005 |
> Why don't we take some of the money we have in the bank and buy a
> SCSI hardware RAID controller, and do disks properly, with 0+1
> striping of mirrors, so that in the event one disk dies, we don't
> end up in these situations?
I agree that a RAID setup would provide system continuity until a staff
member can replace a drive. Personally, I prefer a RAID 5 setup with a hot
spare (or RAID 5EE if we don't have a spare drive to spare...but I don't know
if this is an IBM-only thing or what, so it might not be possible). Further,
RAID 5 wouldn't batter the drives as much as RAID 0+1...
BTW, I might be mistaken, but isn't RAID 1+0 more reliable just by the fact
that a multiple-disk failure resulting in catastrophic data loss is
statistically more likely with 0+1?
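To put rough numbers on that question, here is a minimal Python sketch that enumerates every two-disk failure on a four-disk array. The assumptions are mine: disks 0-1 and 2-3 form the two groups, failures are independent and equally likely, and a stripe set is treated as dead as soon as one of its members fails. A real controller may behave differently.

```python
from itertools import combinations

GROUPS = [(0, 1), (2, 3)]  # two groups of two disks each


def raid10_lost(failed):
    # Stripe of mirrors: data is lost when both disks of some mirror pair fail.
    return any(all(d in failed for d in g) for g in GROUPS)


def raid01_lost(failed):
    # Mirror of stripes: data is lost when every stripe set has a failed disk.
    return all(any(d in failed for d in g) for g in GROUPS)


def fatal_fraction(lost):
    """Return (fatal two-disk failures, total two-disk failures)."""
    pairs = list(combinations(range(4), 2))
    fatal = sum(1 for p in pairs if lost(set(p)))
    return fatal, len(pairs)


if __name__ == "__main__":
    print("RAID 1+0:", fatal_fraction(raid10_lost))
    print("RAID 0+1:", fatal_fraction(raid01_lost))
```

Under those assumptions, 2 of the 6 possible two-disk failures are fatal for 1+0 and 4 of the 6 are fatal for 0+1, which is the usual argument for preferring 1+0.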
|
gull
|
|
response 243 of 457:
|
May 4 04:14 UTC 2005 |
Dan, to be honest, I was with you until you started insisting that
OpenBSD had caused a good disk to generate read errors. To me, that
made it seem like you were really reaching for more reasons to dislike
OpenBSD, and I'm having a tough time believing you're really taking an
objective position, now.
|
cross
|
|
response 244 of 457:
|
May 4 11:28 UTC 2005 |
This response has been erased.
|
nharmon
|
|
response 245 of 457:
|
May 4 12:15 UTC 2005 |
Is the disk bad? Have we plugged it into another computer and verified it has
problems?
|
aruba
|
|
response 246 of 457:
|
May 4 14:31 UTC 2005 |
No, the disk is still attached to Grex.
|
steve
|
|
response 247 of 457:
|
May 4 15:03 UTC 2005 |
I should point out that in putting the sd0 disk in some other machine,
it might work. It might appear OK for an hour, or a week. Moving a
damaged disk jostles things. I had a small IDE disk at work which did
exactly this. It was flaky in the machine it was running on, but
ran OK for some while on a test machine I had. Finally, after several
days of pounding on it, the exact same error cropped up. This is rare,
but if the problem involves something in the head or arm mechanics,
anything can happen. I do not believe that will happen in this case
but moving a suspect disk around can lead to unexpected results.
|
cross
|
|
response 248 of 457:
|
May 4 15:32 UTC 2005 |
This response has been erased.
|
tod
|
|
response 249 of 457:
|
May 4 16:02 UTC 2005 |
Let us know how the 3.7 disc works out.
|
twenex
|
|
response 250 of 457:
|
May 4 16:59 UTC 2005 |
If Plan9 has "dd", why not "fsck"? After all, "dd" isn't even (originally)
native to Unix.
|
cross
|
|
response 251 of 457:
|
May 4 17:18 UTC 2005 |
This response has been erased.
|
twenex
|
|
response 252 of 457:
|
May 4 17:23 UTC 2005 |
Yeeees, but you could still call it "fsck"....
|
mcnally
|
|
response 253 of 457:
|
May 4 17:52 UTC 2005 |
They could also call it "scandisk". After all, lots more people are used
to scandisk than fsck, right?
What does it matter to you what they called it?
|
twenex
|
|
response 254 of 457:
|
May 4 18:01 UTC 2005 |
Just seems arbitrary to name Plan9 "dd" after Unix "dd" but not do the same
with fdisk, that's all.
|
gull
|
|
response 255 of 457:
|
May 4 18:06 UTC 2005 |
A lot of such decisions are arbitrary. Heck, on Linux, 'fsck' is really
just a front end that calls any of a number of more specific
filesystem-checking tools, depending on the type of filesystem in question.
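A minimal Python sketch of that dispatch idea, for illustration only. The function names are made up, and looking the helper up on PATH is a simplification of what the real front end does; the one thing taken from Linux is the `fsck.<fstype>` naming of the helpers:

```python
import shutil


def fsck_helper(fstype):
    """Name of the type-specific checker the front end would run, e.g. fsck.ext3."""
    return "fsck." + fstype


def find_helper(fstype):
    """Full path of the helper if one is installed on PATH, else None."""
    return shutil.which(fsck_helper(fstype))


if __name__ == "__main__":
    for fstype in ("ext3", "vfat"):
        print(fstype, "->", find_helper(fstype))
```

So the name 'fsck' on Linux says little about which program actually does the checking.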
|
drew
|
|
response 256 of 457:
|
May 4 21:01 UTC 2005 |
FWIW, I've had a disk *image file* (created with 'dd if=/dev/hdc of=filename')
produce read errors when used in the virtual machine it was attached to.
|
keesan
|
|
response 257 of 457:
|
May 4 21:14 UTC 2005 |
Three times now, with two different modems, we have dialed into grex and got
garbage. The second dial logged us in. Another grexer reports that the modem
on 484-0513 works but the first one does not, from his location. Is there
any other reliable modem that could be switched with the 0512?
|
steve
|
|
response 258 of 457:
|
May 4 23:03 UTC 2005 |
I think first we need to verify that the line and connection are OK,
physically. Sindi, do you know when these problems started? That
would be good to know.
|
cross
|
|
response 259 of 457:
|
May 5 00:53 UTC 2005 |
This response has been erased.
|
keesan
|
|
response 260 of 457:
|
May 5 01:02 UTC 2005 |
The garbage on dialin happened this week, probably in the last three days.
Jim mentioned it to me yesterday but I had already noticed. It might just have
started yesterday. It occurred again this afternoon.
Jim tried switching from 38 to 19K which did not help.
|
steve
|
|
response 261 of 457:
|
May 5 01:30 UTC 2005 |
Is it always the same modem that messes up?
|