|
Grex > Coop7 > #69: Minutes of the June 28, 1995 Board Meeting | |
|
| Author |
Message |
popcorn
|
|
response 25 of 43:
|
Jul 7 15:05 UTC 1995 |
TS: You asked for a posting of the discussion of the disk problem, from
the staff conference. Actually, most discussions of the disk problem
have been verbal, at staff meetings and on various phone calls between
staffers, so there *isn't* any text that someone could easily just re-post
here.
I think you have a highly inflated idea of the staff conference.
|
mju
|
|
response 26 of 43:
|
Jul 7 16:58 UTC 1995 |
I believe the problem is due to a bug in the Sun-3 SCSI controller
driver, not a problem with the filesystem. Grex is currently using
two SCSI controllers: a Sun-2 SCSI board (aka Sun SCSI-2), which runs
the two 330MB disks; and a Sun-3 SCSI board (aka Sun SCSI-3), which
runs the 2.1GB disk. The 2.1GB disk can't be connected to the SCSI-2
controller because the SCSI-2 controller won't handle newer SCSI
disks. And Grex won't boot from the SCSI-3 controller, most likely
due to us having an old version of the Sun-3 boot PROM on our CPU
board. So we can't connect the 330MB disks to the SCSI-3 controller
and ignore the SCSI-2 board.
Believe it or not, we *have* spent a lot of time thinking about this
problem. We did try connecting all the disks to the SCSI-3 board
and booting from there; that's how we found out that we couldn't do
it. We did try connecting the 2.1GB disk to the SCSI-2 board;
that's how we found out that we couldn't do it. (Not that we didn't
suspect that anyway, since several staff members have other
experience with putting newer SCSI drives on older Sun systems.)
This leaves us with two options: fix the bug in the SunOS SCSI subsystem
that is causing the data corruption problem; or move to a platform
where the bug has been fixed for us. I don't know about you, but
I find it fairly difficult to debug a problem that is not reliably
reproducible and occurs intermittently, especially when I don't
have the source to the code in question. Who knows, maybe the bug
is timing-related, so it would go away if you added debugging to the
kernel or tried to step through the code. Or maybe it only happens
on every 2^20-1'th disk write. This kind of uncertainty is exactly
why we'd much rather fix the bug by moving to a platform where it's
known to not exist, especially when we were planning to move to that
platform anyway.
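To make the "not reliably reproducible" point concrete: about the only way to catch this kind of corruption from user space is to write known data, read it back, and compare, over and over. Below is a minimal sketch in Python, not anything Grex staff actually ran; the function name `write_and_verify` and its parameters are made up for illustration. Note that the read-back may be served from the operating system's buffer cache rather than from the disk, so a clean run proves very little, which is part of why an intermittent bug like this is so hard to pin down.

```python
import hashlib
import os


def write_and_verify(directory, rounds=100, block_size=64 * 1024):
    """Write random blocks to files in `directory`, read each one back,
    and return the list of round numbers whose contents did not match.

    Hypothetical helper for illustration only.
    """
    mismatches = []
    for i in range(rounds):
        data = os.urandom(block_size)
        expected = hashlib.sha256(data).hexdigest()
        path = os.path.join(directory, f"verify_{i}.bin")

        # Write the block and ask the OS to push it out to the device.
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

        # Read it back and compare checksums. This read may still be
        # satisfied from cache, so a match is weak evidence at best.
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()

        if actual != expected:
            mismatches.append(i)
        os.remove(path)
    return mismatches
```

Calling `write_and_verify("/some/scratch/dir", rounds=1000)` returns the round numbers where the read-back differed; an empty list only means no mismatch was seen in that particular run.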
|
tsty
|
|
response 27 of 43:
|
Jul 7 18:48 UTC 1995 |
re #24,5,6 are excellent compilations ... and condensations ... for
the value to all of us. If I may speak for a "bunch of us," "we" *do*
thank you for the mental and physical investment devoted to solving
a most vexing problem. << I *HATE* intermittents, as does any other
systems Trouble Shooter. >>
Fwiw, one of my first thoughts about the problem, way back when and
stated at that time, was reinforced by mju's reference to a
"timing-related" problem. Certainly i'm not the first or only person
to consider that. I also am fully aware of the drudgery of Trouble
Shooting along those lines. <<Oh, boy, am I!>>
And yet ... working with mju, (and/or others) I +would+ like to
work on that tack.
|
rcurl
|
|
response 28 of 43:
|
Jul 7 19:38 UTC 1995 |
I had an intermittent in a CGA monitor - it would blank out now and
then. Poked around a little, with no good result. I solved the problem
by donating the monitor for the Grex JCC sale.
|
adbarr
|
|
response 29 of 43:
|
Jul 8 00:50 UTC 1995 |
Rane, Stop that. These people were just getting started.
|
ajax
|
|
response 30 of 43:
|
Jul 10 15:25 UTC 1995 |
Re 26, the Sun-3 SCSI can't boot because of "an old version of
the Sun-3 boot PROM"; do you mean it's comparatively older than a
newer version from which it could boot, or that it's just old in
absolute terms (and nothing newer is available)?
|
mju
|
|
response 31 of 43:
|
Jul 10 16:30 UTC 1995 |
I'm not sure. At one point STeve Andre' tried to get a newer version
of the PROM that we thought might fix the problem, but as far as I
know we never got that newer version to work. Whether that's because
we didn't get a "good" copy of the PROM, or because that PROM just
won't work in our 3/260, or some other reason, I don't know.
STeve or Greg Cronau might be able to say more...
|
tsty
|
|
response 32 of 43:
|
Jul 15 18:14 UTC 1995 |
mju, wanna tackle this pursuant to your #26? I tend to suspect the
timing problem myself.
|
tsty
|
|
response 33 of 43:
|
Aug 14 19:37 UTC 1995 |
I overheard that gregc has another idea - mju, wanna work with him on it?
|
tsty
|
|
response 34 of 43:
|
Aug 28 03:18 UTC 1995 |
OK - gauging from the volume of silence, is it now "system policy"
to let the disk-bug eat at its leisure while we all await the
implementation of the Sun 4?
And how about all those zero'd out files that +used+ to contain stuff
and now only have a filename with 0 bytes?
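For anyone who wants a rough count of files that now have a name but no contents, here is a minimal sketch in Python. It is an assumption about how one might look, not a tool Grex staff used, and `find_empty_files` is a made-up name. It cannot tell a file that was zeroed by the disk problem from one that was always empty, so the output is only a list of candidates to check against backups.

```python
import os
import stat


def find_empty_files(root):
    """Return a sorted list of regular files under `root` that are 0 bytes.

    Hypothetical helper for illustration only. Symlinks are not followed,
    and files that cannot be examined are skipped.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # unreadable or vanished; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size == 0:
                empty.append(path)
    return sorted(empty)
```

Pointing it at a home directory, for example `find_empty_files("/home/someuser")` (path made up), gives the list of zero-length files to compare against what the user remembers or what is on backup.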
|
steve
|
|
response 35 of 43:
|
Aug 28 15:33 UTC 1995 |
We've pretty much made the decision to work on the Sun-4 instead
of dealing with the current disk problem, even though it means short
term pain for users who get hit, and especially Valerie who has
pretty much taken the job of informing people about the problem
and helping them out (thanks Valerie!!).
Why? Because this way, getting onto the Sun-4, we're on a faster
platform, on a more recent version of SunOS, which fixes numerous
little things, and is generally better. SunOS 4.1.3 is one of the
better operating systems around, simply because its problems are so
well understood, and a large number of patches exist for it, thus
making it one of the more secure systems out (when the patches are
applied, of course).
We could still do battle with the dragon, and ultimately conquer
the disk problem I'm sure, but at the risk of delaying the move to a much
better platform. Getting the Sun-4 up solves both the disk instability
problem *and* makes Grex faster.
Given that we have limited resources this is the best way to solve
things. It's also easier. Figuring out the problem in 4.1.1 would
take a lot of effort quite possibly, even probably. If the problem
turned out to be purely in the software we'd be in the position of
having to make patches to the kernel without source, which is something
not for the faint-of-heart. It also makes for one hell of an oddball
system, in terms of getting other staff to take over the system,
should / when that happens. Going to 4.1.3 is something that a LOT
of people understand. While not perfect, it's one neat operating system,
and a lot of people have stuck to it.
Speaking of people staying with SunOS 4.1.3, I'll mention here
that Sun recently came out with a new version of SunOS, 4.1.4, which
is something that they said they'd never do. When Solaris was released,
Sun said that SunOS was dead. After staring at the first release
of Solaris, a lot of people said No Way. After the fourth release of
Solaris (2.4) a lot of people have still said "no way". So Sun,
seeing that there really was a "SunOS 4.1.3 or die" camp, came out
with a new version of it, which in theory had all the patches applied
in the right order, and would be in theory one of the most stable
operating systems out. No "new" improvements here, just bug fixes.
As it turns out, there are some problems with 4.1.4, and I have
heard rumor of fierce argument in Sun that says there will be
another release of 4.1.4 called "u2", which means "update two",
and maybe will fix the problems with 4.1.4. Time will tell.
But in the mean time, 4.1.3 will work very well for Grex.
|
dpc
|
|
response 36 of 43:
|
Aug 28 20:54 UTC 1995 |
Is there a firm date by which the Sun-4 will be up with 4.1.3?
|
mdw
|
|
response 37 of 43:
|
Aug 28 23:13 UTC 1995 |
We had hoped to have it up by this past Friday; that's when mju had to
go back to Pittsburgh. Unfortunately, we didn't quite make it. It's
certain to happen *sometime* this fall. It's not a project we want to
rush -- one of the reasons we have the disk instability problem is we
rushed installing a large drive for /home; if we had had a good chance
to test/debug that, we might have been able to avoid today's problem
entirely, and we don't want to run a similar risk with the sun-4. There
are really two things we want to do with the sun-4; one of them is to
build all the software we see on grex today, and the other is to give
the hardware enough run time that we're confident it's stable. That is
certainly not less than one month away, it may be as much as 3.
|
steve
|
|
response 38 of 43:
|
Aug 29 01:25 UTC 1995 |
The Sun-4 is nice and happy with the 1.7G disk it has right
now; I don't believe the 540M disk that we had to buy is installed
yet, but I could be wrong. Greg will pipe up on that one.
The good news is the system is stable, has been up for a while
now, and no obvious hardware problems. If there were something
seriously wrong with a piece of the new system I think the
chances are fairly good that we'd have seen it by now. The
even better news is, after the system being up for a while
the filesystems on the 1.7G disk have passed fsck's without
problems (as they should).
Marcus is right about not wanting to rush things. Of course,
we don't want to delay things either.
|
gregc
|
|
response 39 of 43:
|
Aug 29 03:39 UTC 1995 |
Yes, the new 530 meg disk is on the machine. It was on last Thursday.
So far, the 1.7gb and 530 have been up for about a week together. The Sun-4
and 1.7gb together have been up for over two weeks. I do occasional fsck's.
So far, no problems. No crashes as yet either. I ordered the external drive
enclosure today.
|
lilmo
|
|
response 40 of 43:
|
Aug 29 06:19 UTC 1995 |
Zippity-doo-dah, zippity-hey !!! *smile*
|
tsty
|
|
response 41 of 43:
|
Aug 29 18:12 UTC 1995 |
All good news - gregc, what about crash-testing the Sun4 to see how
fsck reacts to the disks? This is also often called the "smoke test."
The Sun4 is getting +very limited+ use right now (as it should); what
about having some sort of thrashing party on the new b0x just for
kiks and grims...
|
gregc
|
|
response 42 of 43:
|
Aug 29 22:34 UTC 1995 |
Oh, yes, that is definitely planned. But there's a lot of other things that
need to be done first. Yes, I know it's under a light load right now, and
the fact that it's running fine is no clear indication of future performance
under production loads, but it's a *start*. We need to get code built, patches
installed, *backups* made, and more-or-less get the machine configured into
its final production configuration *before* heavy duty testing, otherwise
a lot of the testing becomes meaningless.
|
tsty
|
|
response 43 of 43:
|
Sep 7 09:12 UTC 1995 |
werkz fer me.
|