|
Grex > Helpers > #140: Grex System Problems - Spring 2005 | |
|
| Author |
Message |
| 25 new of 457 responses total. |
steve
|
|
response 238 of 457:
|
May 3 23:16 UTC 2005 |
sd0a went because of a general problem with it. I seriously doubt
that the runs of newuser caused this. Though /a was on sd0 and that
did have lots of i/o.
|
gull
|
|
response 239 of 457:
|
May 3 23:54 UTC 2005 |
Re resp:234: Okay, I'll make one up and email it to you.
Re resp:237 (1): I took a quick look at an OpenBSD 3.6 system at work.
It's used strictly as a firewall, no local logins except root for
maintenance, but that's not really relevant.
What I found is that pseudo-ttys appear to be world-readable and
world-writable until they're used. For example, with root logged in on ttyp0:
crw--w---- 1 root tty 5, 0 May 3 19:44 /dev/ttyp0
crw-rw-rw- 1 root wheel 5, 1 Dec 18 13:53 /dev/ttyp1
crw-rw-rw- 1 root wheel 5, 2 Dec 18 13:53 /dev/ttyp2
(etc.)
Now, if I open another ssh connection, again as root:
crw--w---- 1 root tty 5, 0 May 3 19:46 /dev/ttyp0
crw--w---- 1 root tty 5, 1 May 3 19:46 /dev/ttyp1
crw-rw-rw- 1 root wheel 5, 2 Dec 18 13:53 /dev/ttyp2
(etc.)
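For anyone who wants to repeat the check on another machine, here is a minimal sketch in Python. It assumes the BSD-style /dev/ttyp* naming shown above; the function names are made up for illustration, and it only reports the permission bits, it does not change anything.

```python
import glob
import os
import stat


def is_world_accessible(mode: int) -> bool:
    """Return True if 'other' has read or write permission on the node."""
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))


def describe(path: str) -> str:
    """Format one device node roughly the way 'ls -l' shows its mode."""
    st = os.stat(path)
    flag = "open to all" if is_world_accessible(st.st_mode) else "restricted"
    return f"{stat.filemode(st.st_mode)} {path} ({flag})"


if __name__ == "__main__":
    # Assumption: pseudo-tty slaves live under /dev/ttyp*, as in the listing above.
    for path in sorted(glob.glob("/dev/ttyp*")):
        print(describe(path))
```

Running it before and after opening a new ssh session should show whether the mode on the next free pty changes the way it did in the listings above.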
|
cross
|
|
response 240 of 457:
|
May 4 02:02 UTC 2005 |
This response has been erased.
|
steve
|
|
response 241 of 457:
|
May 4 02:27 UTC 2005 |
Let's reverse the tty problem for a minute--if it wasn't us, then
it was in the release of OpenBSD 3.5. I've been looking for comments
about that and haven't seen any so far. It could be the case that we
didn't do anything, but I tend to think that the collective set of
people who worked on the machine could have done something. I agree
with you that we should dig into things.
We crashed at least twice with a trace leading back to a bge symbol.
I think that's fairly good evidence that it was in the NIC.
I'm obviously pro OpenBSD. I came to be that way after staring at
several Linux flavors, then Net- and FreeBSD, then OpenBSD. Since Oct
1999 I've been using it exclusively and have found it rock stable
except for when hardware problems have messed things up. I know of
no other system that puts security first and takes the proactive stance
of fixing things and developing enhancements like the write xor execute
system. Grex needs these things. We get hit on by enough people
that we need all the help we can get.
It occurs to me that we ought to order a 3.7 CD set.
|
nharmon
|
|
response 242 of 457:
|
May 4 02:38 UTC 2005 |
> Why don't we take some of the money we have in the bank and buy a
> SCSI hardware RAID controller, and do disks properly, with 0+1
> striping of mirrors, so that in the event one disk dies, we don't
> end up in these situations?
I agree that a RAID setup would provide system continuity until a staff
member can replace a drive. Personally, I prefer a RAID 5 setup with a hot
spare (or RAID 5EE if we don't have a spare drive to spare...but I don't know
if this is an IBM-only thing or what, so it might not be possible). Further,
RAID 5 wouldn't batter the drives as much as RAID 0+1...
BTW, I might be mistaken, but isn't RAID 1+0 more reliable just by the fact
that a multiple-disk failure resulting in catastrophic data loss is
statistically more likely with 0+1?
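The question can be checked by brute force for the smallest case. This is only a sketch under stated assumptions: four disks in two pairs, failures counted as unordered pairs, and a 0+1 controller that takes a whole stripe offline when any disk in it fails. The names below are illustrative, not from any RAID tool.

```python
from itertools import combinations

# Four disks, numbered 0-3, grouped into two pairs.
GROUPS = [{0, 1}, {2, 3}]


def raid10_lost(failed: set) -> bool:
    """RAID 1+0: each group is a mirror; data is lost if a whole mirror fails."""
    return any(group <= failed for group in GROUPS)


def raid01_lost(failed: set) -> bool:
    """RAID 0+1: each group is a stripe; data is lost if every stripe has a failed disk."""
    return all(group & failed for group in GROUPS)


def fatal_pairs(lost) -> int:
    """Count the two-disk failure combinations that lose the array."""
    return sum(lost(set(pair)) for pair in combinations(range(4), 2))


if __name__ == "__main__":
    print("1+0 fatal two-disk failures:", fatal_pairs(raid10_lost), "of 6")
    print("0+1 fatal two-disk failures:", fatal_pairs(raid01_lost), "of 6")
```

Under those assumptions, 2 of the 6 possible two-disk failures are fatal for 1+0 and 4 of 6 are fatal for 0+1, which agrees with the suspicion that 0+1 is the more fragile layout.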
|
gull
|
|
response 243 of 457:
|
May 4 04:14 UTC 2005 |
Dan, to be honest, I was with you until you started insisting that
OpenBSD had caused a good disk to generate read errors. To me, that
made it seem like you were really reaching for more reasons to dislike
OpenBSD, and I'm having a tough time believing you're really taking an
objective position, now.
|
cross
|
|
response 244 of 457:
|
May 4 11:28 UTC 2005 |
This response has been erased.
|
nharmon
|
|
response 245 of 457:
|
May 4 12:15 UTC 2005 |
Is the disk bad? Have we plugged it into another computer and verified it has
problems?
|
aruba
|
|
response 246 of 457:
|
May 4 14:31 UTC 2005 |
No, the disk is still attached to Grex.
|
steve
|
|
response 247 of 457:
|
May 4 15:03 UTC 2005 |
I should point out that in putting the sd0 disk in some other machine,
it might work. It might appear OK for an hour, or a week. Moving a
damaged disk jostles things. I had a small IDE disk at work which did
exactly this. It was flaky in the machine it was running on, but
ran OK for some while on a test machine I had. Finally, after several
days of pounding on it, the exact same error cropped up. This is rare,
but if the problem involves something in the head or arm mechanics,
anything can happen. I do not believe that will happen in this case
but moving a suspect disk around can lead to unexpected results.
|
cross
|
|
response 248 of 457:
|
May 4 15:32 UTC 2005 |
This response has been erased.
|
tod
|
|
response 249 of 457:
|
May 4 16:02 UTC 2005 |
Let us know how the 3.7 disc works out.
|
twenex
|
|
response 250 of 457:
|
May 4 16:59 UTC 2005 |
If Plan9 has "dd", why not "fsck"? After all, "dd" isn't even (originally)
native to Unix.
|
cross
|
|
response 251 of 457:
|
May 4 17:18 UTC 2005 |
This response has been erased.
|
twenex
|
|
response 252 of 457:
|
May 4 17:23 UTC 2005 |
Yeeees, but you could still call it "fsck"....
|
mcnally
|
|
response 253 of 457:
|
May 4 17:52 UTC 2005 |
They could also call it "scandisk". After all, lots more people are used
to scandisk than fsck, right?
What does it matter to you what they called it?
|
twenex
|
|
response 254 of 457:
|
May 4 18:01 UTC 2005 |
Just seems arbitrary to name Plan9 "dd" after Unix "dd" but not do the same
with fdisk, that's all.
|
gull
|
|
response 255 of 457:
|
May 4 18:06 UTC 2005 |
A lot of such decisions are arbitrary. Heck, on Linux, 'fsck' is really
just a front end that calls any of a number of more specific
filesystem-checking tools, depending on the type of filesystem in question.
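To illustrate the front-end idea, here is a rough Python sketch of the dispatch step. It assumes the usual Linux convention that the per-filesystem checkers are installed as fsck.<type> (fsck.ext3, for example); the function names and the search path are assumptions for illustration, and the sketch only locates the helper, it never runs it.

```python
import os
import shutil
from typing import Optional

# Assumption: helpers usually live in /sbin or /usr/sbin, which may not be
# on an ordinary user's PATH, so those directories are searched first.
SEARCH_PATH = os.pathsep.join(["/sbin", "/usr/sbin", os.environ.get("PATH", "")])


def helper_name(fstype: str) -> str:
    """Build the conventional helper name for a filesystem type."""
    return f"fsck.{fstype}"


def find_helper(fstype: str) -> Optional[str]:
    """Return the full path of the helper, or None if none is installed."""
    return shutil.which(helper_name(fstype), path=SEARCH_PATH)


if __name__ == "__main__":
    for fstype in ("ext2", "ext3", "vfat"):
        print(fstype, "->", find_helper(fstype) or "no helper found")
```

The real front end does more than this (it reads /etc/fstab and can run checks in parallel), so treat the sketch as the naming convention only.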
|
drew
|
|
response 256 of 457:
|
May 4 21:01 UTC 2005 |
FWIW, I've had a disk *image file* (created with 'dd if=/dev/hdc of=filename')
produce read errors when used in the virtual machine it was attached to.
|
keesan
|
|
response 257 of 457:
|
May 4 21:14 UTC 2005 |
Three times now, with two different modems, we have dialed into grex and got
garbage. The second dial logged us in. Another grexer reports that the modem
on 484-0513 works but the first one does not, from his location. Is there
any other reliable modem that could be switched with the 0512?
|
steve
|
|
response 258 of 457:
|
May 4 23:03 UTC 2005 |
I think first we need to verify that the line and connection are OK,
physically. Sindi, do you know when these problems started? That
would be good to know.
|
cross
|
|
response 259 of 457:
|
May 5 00:53 UTC 2005 |
This response has been erased.
|
keesan
|
|
response 260 of 457:
|
May 5 01:02 UTC 2005 |
The garbage on dialin happened this week, probably in the last three days.
Jim mentioned it to me yesterday but I had already noticed. It might just have
started yesterday. It occurred again this afternoon.
Jim tried switching from 38 to 19K which did not help.
|
steve
|
|
response 261 of 457:
|
May 5 01:30 UTC 2005 |
Is it always the same modem that messes up?
|
albaugh
|
|
response 262 of 457:
|
May 5 15:22 UTC 2005 |
Drift: Does anyone else think that the fsck program name was partially chosen
because it looks like a get-past-the-censors-disguise for the f-word? ;-)
|