Author Message
steve
Changes to Grex (hardware and software) Mark Unseen   Oct 1 14:46 UTC 1994

   Recent changes to Grex can be talked about here.
60 responses total.
steve
response 1 of 60: Mark Unseen   Oct 1 14:52 UTC 1994

   Valerie and I went to CE the other day, and installed the 32M
card.  Sort of.  It's in, but not functioning yet.  When the system
wakes up, the diagnostics in ROM look at the system.  It's supposed
to be able to read some memory that states how much memory there is
and use that figure for the memory checker.  It isn't working, so
the system still thinks it has 32M of memory.  We believe that the
ROMs need to be upgraded to a newer version, which I've asked the
person we bought the card from for.  I doubt they'll be more than
$10.  Thanks to Greg for the info on where in memory to change the
values for memory, and remembering that *all* the VME cards have to
be reinstalled in order to make the system boot (otherwise you have
to change jumpers)...
tsty
response 2 of 60: Mark Unseen   Oct 5 07:21 UTC 1994

Ummm, that means that unless the ROM (bios?) is re-burned, anything
not included therein is ignored, and electrically invisible?
  
I would think there would be a software "overwrite" for something
like that ...... but then I didn't design it.
power
response 3 of 60: Mark Unseen   Oct 6 04:30 UTC 1994

        On a completely different thread, whatever happened to the 'frm'
command?  Was that a shell script that just got killed, or what?

scg
response 4 of 60: Mark Unseen   Oct 6 04:50 UTC 1994

What did it do?
kentn
response 5 of 60: Mark Unseen   Oct 7 19:53 UTC 1994

It plowed, then planted, tilled, and harvested a virtual crop...
popcorn
response 6 of 60: Mark Unseen   Oct 7 21:08 UTC 1994

The manual page says that frm and nfrm list "From:" and "Subject:"
from your mailbox or from a folder.
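
(Editor's illustration, not the actual frm source: a minimal Python sketch
of the same idea, assuming the mailbox is a standard mbox file.  The
function name list_headers is made up for this example.)

```python
import mailbox


def list_headers(mbox_path):
    """Return (from, subject) pairs for every message in an mbox file."""
    # create=False: fail instead of silently creating a missing mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [(msg.get("From", ""), msg.get("Subject", "")) for msg in box]
    finally:
        box.close()


if __name__ == "__main__":
    import sys

    # Usage: python list_headers.py /path/to/mailbox
    for sender, subject in list_headers(sys.argv[1]):
        print(f"{sender:<30} {subject}")
```

Run against a mailbox file, it prints one line per message, sender first
and then subject.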
remmers
response 7 of 60: Mark Unseen   Oct 9 11:06 UTC 1994

An alternative to "frm" is "mailpk".
carson
response 8 of 60: Mark Unseen   Oct 9 12:47 UTC 1994

Got it! Thanks!

(oh, you mean you didn't *know* I'm secretly Power?)
tsty
response 9 of 60: Mark Unseen   Oct 10 06:39 UTC 1994

Uhhhh, about the additional 32 meg of RAM ???????????
popcorn
response 10 of 60: Mark Unseen   Oct 10 13:13 UTC 1994

STeve is ordering (ordered?) new ROMs.  We are waiting for those to arrive.
steve
response 11 of 60: Mark Unseen   Oct 11 01:01 UTC 1994

   Well, we're running with 64M right now.  It's just that 32M of
it is "shadow" memory.      ;-) ;-)

   We're in a wait loop for the new ROMs which will hopefully fix
the problem.
tsty
response 12 of 60: Mark Unseen   Oct 12 14:32 UTC 1994

cool - great - and all that ....
steve
response 13 of 60: Mark Unseen   Oct 20 22:17 UTC 1994

   The observant will have already noticed that Grex now sports
yet another SCSI disk.  /dev/sd6a is Greg's 1.3G disk, which has
a 500-some M /usr/local partition on it.

   The idea is to see what happens with another disk that has
less than 1G allocated on it.  If our thoughts on SunOS not being
able to deal with partitions beyond a 1G boundary are right, then the
new /usr/local will be stable, and the next time Grex is rebooted it
(/usr/local) will not have any damaged file entries on it.  If we
get to Sunday with /usr/local intact, then we'll have a pretty
good idea that it's either a) a problem with SunOS, or b) a problem with
the hardware setup for the 2G disk.  While we think that a) is the likely
scenario, we can't rule out b) yet.  So, if /usr/local is stable,
we will create a /home on the new disk such that its end is at
999M on the disk, and move the present /home over.  If that is
stable, then we have a kludge solution for the moment--we will 
then recreate a /home on the 2G disk with the shorter boundary,
and then live once again on our disk.

   If we still have corrupted data on /usr/local, then we have
one of those "interesting" issues where possibly the SMPL SCSI
controller and the Sun SCSI controller and the Xylogics get into
a fight, with Grex the loser.  We can then test this by putting
all the SCSI disks on the Sun SCSI board, and if needed, putting
a smaller SCSI disk on-line for news, such that the Xylogics
controller is obsolete (yeah!).

   So, we'll have a much better idea of where our problem lies
in the next 72 hours.
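
(Editor's illustration: the "keep every partition under the 1G boundary"
check described above can be sketched as simple arithmetic.  The layout
numbers below are hypothetical round figures, not Grex's real partition
table, and treating 1G as 1024M is an assumption.)

```python
def partitions_past_limit(partitions, limit_mb=1024):
    """Return the names of partitions that extend beyond limit_mb.

    partitions: iterable of (name, start_mb, size_mb) tuples.
    limit_mb:   assumed position of the boundary, in megabytes.
    """
    return [
        name
        for name, start_mb, size_mb in partitions
        if start_mb + size_mb > limit_mb
    ]


# Hypothetical layout for the 1.3G disk: /usr/local first, then a /home
# sized so that it ends at 999M, as in the plan above.
planned = [("/usr/local", 0, 500), ("/home", 500, 499)]
print(partitions_past_limit(planned))

# Hypothetical single partition spanning a whole 2G disk.
whole_disk = [("/home", 0, 2048)]
print(partitions_past_limit(whole_disk))
```

The first layout stays entirely below the assumed boundary; the second
is flagged because it runs past it.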
steve
response 14 of 60: Mark Unseen   Oct 21 15:37 UTC 1994

   We've learned something: Grex crashed last night, and during
the reboot process, we had errors on the new /usr/local partition
on /dev/sd6a.  Interestingly, we did not have any errors on /dev/sd4a,
which is the currently unused original /usr/local.  /home had errors,
too.
   This suggests to me that the partition size is not directly
involved, and possibly we have a hardware conflict somewhere, or
that the kernel as configured with two SCSI controllers and a
Xylogics controller is a bad mix.
   We'll be trying some new stuff very soon.  I want to see about
getting rid of the older SCSI controller and just running with one.
steve
response 15 of 60: Mark Unseen   Oct 22 05:00 UTC 1994

   We're on to the next test.  The SMD disks, controlled by the
Xylogics card are now off line and slumbering.  This means that
our "spare" 130M SMD disk and the news disk are gone at the
moment.  This is to test the possibility that the SCSI controllers
and the Xylogics are having some sort of fight, where the SCSI
disks lose.
   If we see a stable system, it will be interesting (and good, too).
It will mean that some sort of thing like a timing conflict exists
between them, and we'll probably want to get rid of the SMD disk
*asap* with a SCSI disk in replacement.  There is a place in Ann Arbor
selling 1G SCSI disks for $490, so I wonder what we might find if we
actually hunt for one.

   If we see problems still, then we're still possibly looking at
a fight between the two SCSI controllers.  We Shall See...
steve
response 16 of 60: Mark Unseen   Oct 22 19:19 UTC 1994

   ...And we've seen.  Taking the Xylogics disks offline and
rebooting doesn't seem to matter.  The Xylogics drivers are
still in the kernel, so there is a chance that they're still
doing something bad, but I doubt it.  It's interesting to see
that the system appears less stable now, with the second SCSI
(external) disk, than before.

   So we know (or rather, suspect) that

 - a disk partition of less than 1G on the SCSI-3 controller can
   still get corrupted

 - taking the Xylogics disks offline doesn't appear to affect
   things much

 - adding a second disk perhaps makes things less stable

   The next test, I think, will be to move the SCSI-3 card in
the backplane.  Although not normally seen, there are occasional
timing problems with cards in certain slots.  We should be able
to test this today.  The other thing that may happen is
to take Greg's disk off.  That's just a matter of taking the
disk off the SCSI chain and telling the system where the original
/usr/local is.
kentn
response 17 of 60: Mark Unseen   Oct 22 20:56 UTC 1994

Wow, lots of progress there, STeve.  Thanks and good luck (to all
involved in the nitty gritty of the disk testing)!
wh
response 18 of 60: Mark Unseen   Oct 22 21:19 UTC 1994

Yeah, thanks for the hard work on this vexing problem.
popcorn
response 19 of 60: Mark Unseen   Oct 22 23:13 UTC 1994

Thanks, also, for the progress reports here, where everybody can see them.
I think it's most important to keep everybody informed about what's going on.
tsty
response 20 of 60: Mark Unseen   Oct 23 05:09 UTC 1994

 re :  #16  - it was "timing problems" specifically that I talked about
when this "happening" first occurred. It was my *first* thought, fwiw,
but what do I know?
rcurl
response 21 of 60: Mark Unseen   Oct 23 06:15 UTC 1994

"Timing is everything."  Was that you, who said that, TS?
steve
response 22 of 60: Mark Unseen   Oct 24 00:51 UTC 1994

   Today's battle included:

   - the Xylogics tape and SMD cards are now out of the system.

   - the SCSI cards are now in different VME slots to test the
     idea of timing problems to the cards.

   - the 2G disk now is 'actively terminating' rather than passive.


   With these changes, we'll either be stable, or we won't. ;-)
If we crash again with corrupted file systems, then the finger of
fate points towards either a) an inherent conflict between the two SCSI
cards, or b) the fact that any partition larger than 1G upsets the
kernel enough that the entire disk chain on that controller is
unhappy.

   So, we'll know either way in about a day or so.  The longer the
system stays up, the better off we are.  If we can get to three
days without a crash, we'll check the filesystems.  If there are no
problems, we'll be happy, but won't really think the problem is solved
'till we've been up for a week or so.

   Most likely, the next step if we have problems will be to switch
over to Greg's 1.3G HP disk, with all the partitions coming in to
under 900M or so.  That will let us go to a completely under 1G disk
subsystem.
tsty
response 23 of 60: Mark Unseen   Oct 24 13:29 UTC 1994

For we hardware geeks ... there now appear to be 4 scsi disks on
line, sd0, sd2, sd4 and sd6. Are these 4 physically different disks
running off 2 scsi cards? 
  
Do you have any propagation delay specs on any of the hardware, cards
or disks? Are the cable(s) different lengths?
  
What value is there in an active termination as opposed to a passive
termination? Is everything active now? 
 
Are +all+ of the crashes reporting the same symptom every time, i.e., the
"inode disappearance" thingie? How often/when are inodes "reported?" Under
what circumstances are inodes required and from where (ram?) do they 
disappear?
 
Are all Grex's devices running on the same power conditioner?
  
When (from #14) /dev/sd6a was reporting errors (it was the disk in
use I think) and /dev/sd4 did not (I think this was a disk that was
online, but idling, with its filesystem "available but not being
called in any way"), were both disks working from the same/different
scsi cards?
  
Onto what physical/logical disk is the swap space located? 
 
Are the files for which inodes disappear "in use" (either read or write or
status or something) in any way? 
  
Have the problems with inodes spanned across all the "in use" disks and all
the disk controller cards?  Is there any connection between the filesystems
which are linked instead of directly called? (awkward question, but I hope
it is clear enough)
 
In reference to an above question, are some conditions reported on the
leading edge of a logic change and others on the trailing edge? 
  
Is there any possibility of installing some  NOP instructions  in carefully
selected locations to chew up a couple of clock cycles, thus allowing a
"settling period" for the results of some carefully selected instructions?
 
Is there anything special about having all the scsi devices labeled with
an "even" number (although an "odd" count) 1st=sd0, 2nd=unused, 3rd=sd2,
4th=unused, 5th=sd4, 6th=unused, 7th=sd6? 
  
Are the filesystem "locations" grouped together on the same scsi card, 
i.e., sd0 and sd2 on one card, and sd4 and sd6 on another card? What
effect would there be in, say, having sd0 and sd6 on one card and sd2 and
sd4 on the other (presuming 2 scsi cards)? How about having the largest
filesystem (maybe all 2 Gig) on an isolated card, and all the others
on the other card?
 
Of course these are a lot of general questions, but these are the sorts
of conditions I have run into at various hardware levels of troubleshooting.
 
Oh, lastly, what confirmation is available for assuring that there are
no ground loops, either hardwire or capacitive/inductive, screwing
up the werks? 
  
'Nuff for now - good luck!
steve
response 24 of 60: Mark Unseen   Oct 24 15:06 UTC 1994

   I don't think we have time to answer all of those, TS.  You've hit
on a lot of the things that we've gone over.  It's most likely not a
physical problem though, in that the errors we've been seeing are
specific to things like inodes.  If we had a physical problem causing
this, we'd be seeing errors all over the disk, and we're not.
 
