In the past, we've always made sure that we had enough extra bits and pieces
of computers so that if we had a hardware failure, we could swap spare parts
in and get Grex running again quickly.
After we shift to the 4/670, we won't have any spare parts, except a few spare
memory chips. If the 4/670 fails, we will have to fall back to the 4/260.
This isn't a totally simple fall back. The 4/260 is a slightly different
architecture than the 4/670. It runs most of the same binaries, but not all.
So if we have to fall back, we need to make software changes as well as
hardware changes. We will keep around the current 4/260 disk partitions, but
unless we continue to maintain those, they will rapidly become obsolete. So
if we have a serious hardware failure, it is likely to take us as long as a
day of downtime to get the 4/260 back up.
Options:
(1) Start shopping for spare 4/670 parts:
motherboard - this contains scsi controller, sockets for the
first 128Meg of memory, the ethernet interface, and several serial
ports. We have only one of these. I haven't found any prices.
I think they may be about $500-$600. (it's called 501-1686 or
501-2055).
cpu board - this is a little piggyback board that contains two CPUs.
The motherboard can take two of these. We have one. We originally
budgeted money for two and bought two, but one of them turned out
to have only one working processor on it. I think Rob never charged
us for this and still has it. The money to buy a spare was in the
4/670 budget. If we buy one, we not only have a spare, but we can
experiment with running Grex on 4 processors instead of 2. (Actually,
using Rob's half-dead board to run on three processors might work
better than either - we should probably try to acquire that card).
(I think the ones we have are SM100's, aka 370-1388 - prices seem to
range from $150-$450).
chassis: The chassis we are running the 4/670 in is a newer model
than the other three we have. We don't know if it can be run in
the older chassis. My guess is that it can be, especially after we
change over to the terminal server. After that the 4/670 will be
essentially a single-board computer, so the only thing it wants from
the bus is power. So I don't think there is a hurry to buy a spare
chassis.
memory: In the old Grex, memory chips weren't socketed, so we had to
have a whole spare memory board. Now we have SIMMs, so we mainly
just need spare SIMMs. We have two or three, I think. That should
be fine.
memory board: The first 128Meg of memory is socketed in the mother-
board. To expand beyond that, we have a memory board. We haven't
got any memory to put in it yet, and haven't budgeted for any yet.
Getting a spare for this is probably not a priority.
So basically, we'd need to budget money for a motherboard spare. We
should buy a cpu spare. We should keep an eye out for good deals on
other stuff.
(2) Continue using the 4/260 as a backup machine.
This means staff needs to improve procedures for maintaining both the
4/260 and 4/670 software suites. In theory this shouldn't be too
difficult, since they differ mainly in the kernels and in a few programs
that interact closely with the kernel (like "ps").
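One way staff might keep track of where the two software suites have drifted apart is to compare the two trees file by file. What follows is only a minimal sketch in Python, assuming both suites are reachable as ordinary directory trees. The function name and the mount points in the usage note are hypothetical, and nothing here describes an existing Grex procedure.

```python
import filecmp
import os


def differing_files(left_root, right_root):
    """List (status, relative path) pairs where the two trees disagree."""
    found = []

    def walk(cmp, prefix):
        # Files present in both trees whose contents differ.
        for name in cmp.diff_files:
            found.append(("differs", os.path.join(prefix, name)))
        # Files or directories present in only one of the trees.
        for name in cmp.left_only:
            found.append(("left only", os.path.join(prefix, name)))
        for name in cmp.right_only:
            found.append(("right only", os.path.join(prefix, name)))
        # Recurse into directories that exist in both trees.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left_root, right_root), "")
    return sorted(found)
```

With the 4/260 partitions mounted somewhere like /mnt/260 (a made-up path), `differing_files("/mnt/260/usr/bin", "/usr/bin")` would give a list of programs to review. If the suites really differ mainly in the kernels and in programs like "ps", the list should stay short, and anything else that shows up is a sign the backup partitions are going stale.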
37 responses total.
You could ask on Usenet News and at a SemiSLUG meeting for donations. I don't know how current this computer is; if it isn't very current, someone might have one lying around that they'd be willing to give to Grex.
I support the notion of parallel maintenance of both machines, especially since it will be much simpler to do the same or similar changes on both machines at one time (at the cost of a little extra time) than to try to play catch-up in a crisis. Some analysis and discussion to determine the availability of the necessary time should take place, and if necessary, we should consider adding more staff. Certainly this scheme seems to make the most sense until either we have the necessary parts or a greater sense of the reliability of the current ones. I'd like to learn more about programming in the unix environment, though my background is fairly limited (old BASIC and HYPERTALK, the scripting language for the Mac HyperCard program, with a very little introduction to HTML). I've also dabbled with a little scripting in my home directory to facilitate some frequently used commands...
The only problem with "adding more staff" is that the staff is always on the lookout for talent and if they knew of someone appropriate and interested, they would probably have recommended them to the board by now. Of course they may have their eye on someone who is still in the consideration period, but if so, I haven't heard about it, so it must be early on. We want to be careful about adding staffers since no one likes having it not work out.
understandably. i just want to avoid taxing existing staff beyond burnout. we place a lot of demands on them, and this is just another one...
Oh, I agree. I think everyone on staff does.
I'd like to see enough spare hardware to keep the 670 running. This means, as Jan said, a spare motherboard, and a spare CPU. We already have a spare ALM, for when we will be using that, and a spare terminal server, for when we will be using that. Among other things, keeping the old computer in synch as a hot backup makes it harder to use it for anything else, and I'd like to see us use it for something.
SM100s should cost no more than $100, and could be $50 if you find a deal. Motherboards are a little hard to come by separately, but you might find a good deal on a cheap system, if you keep up with Usenet ads. If you read Usenet through DejaNews, I'd suggest creating your filter for the newsgroup misc.forsale.computers.workstation, covering the last couple months, then for the subject search after creating the filter, use "'600mp' or '630mp' or '670mp' or '690mp'" as search criteria. An example "package" deal listed from this June:
> Sun Sparcserver board out of a 670MP, includes 2 dual cpu
> boards (ross 40mhz) , GX framebuffer, 64 megs of ram, and
> keyboard & mouse. I would like to get $750 for this.
>
> The case may be available with a 1gig drive and cdrom.
I'm sure we purchased a couple spare SIMMs before. I reimbursed Grex for the bad SM100 board, intending to return it, although since I didn't, I'll give that back as a freebie...it did seem to work in the 670 with only one functioning CPU, so it should work as a hobble-along backup processor.
Stuff I've read says running two SM100s (i.e. four processors, since each SM100 has two CPUs) with SunOS is usually slower than using one SM100, though it depends on the type of load on the system. If you buy another processor, you may want to consider an SM41 or SM51 (I've forgotten precisely what submodels are compatible with SunOS), rather than an SM100...I think SM51s might be around $150-200 each.
My understanding of the bad CPU is that it is the secondary CPU which works and the primary one which does not work. So this board is a valid backup for the second CPU, but we would still need a backup primary CPU.
Right. My memory of the bad SM100 was that the first CPU on it was bad. The 4/600 boots up on the first CPU, and only starts using the second one later. I think that means that we could not boot with the bad SM100 card as the only CPU card. However, the card may be repairable. It could be something as simple as a bad trace on the PC board. Maybe someone who understands such things can look at it. I think Birdsall's Sun Hardware Reference says that SM41 and SM51 modules can only be used with Solaris 2.something. That means that to use those, we'd have to upgrade from SunOS to Solaris. This may be something we want to do someday, but it won't be simple. So I think we should stick with SM100's for now. Steve Weiss recently ordered and installed another 16 4M memory chips for the 4/670. It is now running on 128Meg of memory.
I assume we bought a license to run SunOS4 when we bought the hardware. Technical questions aside, how much would it cost to license a copy of Solaris 2.x if we were thinking about upgrading?
You can run SunOS 4.1.3 on SM41s and SM51s, but you need boot PROMs of a certain version (2.8v2 for SM41s and 2.10 for SM51s), and I'm not positive if you can run multiple processors with those CPUs. But having the same processor module would be best for trouble-shooting, too. I'll keep a lookout for SM100s. By the way, this is a hardly-related tangent other than being hardware Grex might want, but the latest Corporate Systems Center catalog (from which Grex got some 2 gig HP drives) lists an 8.5 gig Micropolis for $300...5.25" FH SCSI2, new w/warranty. Seems like a decent deal.
Is somebody else covering Micropolis's warranties now that Micropolis is going out of business, or is the warranty still as worthless as it was when Micropolis was in business? (of the two Micropolis disks that Grex sent back for warranty replacement, neither was ever seen again)
Didn't know they were going under...that would explain the decent price! The catalog lists a one year warranty, but not by whom. You could ask CSC by phone (408.743.8770) or check their web site (www.corpsys.com) for an e-mail address.
however, it sounds as if grex has already experienced two not-good disk deals from micropolis - beware of the third?
Have we had other disks from them for which we had no problems?
Both of our current 2 gig disks are HPs. We have a 1.5 or so gig disk that I don't know about.
If we get *really* good deals, such that using a backup is still economical, then I don't mind patronizing Micropolis again. Otherwise, let's not bother with them.
I'm not sure if this is necessarily a good place to volunteer a donation of hardware, but I didn't see any better place after some cursory poking around. I have a not quite two year old 3.5" half-height 1.2G Quantum Fireball single-ended narrow SCSI-2 drive which I'm not actively using now that I've upgraded to bigger drives. It's been run only in a well ventilated drive enclosure and is in top shape as far as I can tell. This unit has a three year warranty which expires in June of 1999. Someone involved in Grex admin just drop me a note if interested in this drive.
I've replied to this in E-mail - yes, we'd be very interested. We have been planning to set up a mail machine and need a drive for it.
You're quite welcome. I have used M-Net and Grex off and on since around 1985, though mostly "off" these days. The Altos was the first UNIX machine I ever used. It's what prompted me to buy my first UNIX box, a firesale AT&T 3B1, in 1987. After having made my living doing UNIX and IP network admin stuff for several years now I'm glad to be able to give something back. In 1985 I'm sure I would have seriously doubted that I'd ever own a 1.2G drive, much less be giving one away. It's a testament to how far things have come since then.
I've been away for some time now, but I'll pick up this thread. I have the motherboard and CPU for this system, and would love to put it in a case for grex. I don't have a case for it, or I would have put it together already and given it.
We have a case sitting in the Pumpkin. Let me know when you want to pick it up (late evenings are probably best).
Or give me a call, my schedule is a bit more flexible than Steve's. I also have David's hard disk.
Late evenings are fine. After this weekend I should be free.
Back to parts for the 4/670, since we're up on it, might i suggest maxing out the number of cpu's? i seem to recall steve (login: steve) saying a spare cpu was $30, why not spend $60, and have 4 cpus, and do it that way. That should increase the speed of the non io-bound operations on the system. I'm going to do some stats as i find time to see if the system is slow because of io operations or cpu, etc..
I think the argument was that, since SunOS isn't multithreaded, more CPU's wouldn't help much. We do have another one we could possibly put in and test with.
Well, we should buy another CPU card with two more CPU's as a spare. Once we have it, we should try running on both to see if it helps. Rumor says it doesn't and can even hurt. I've also seen some suggestions that SunOS won't even work with more than two CPUs. But certainly we should give it a try.
The SunOS kernel isn't multi-threaded, so only one process can be executing kernel code at a time. Grex, like most typical Unix systems, spends about 50% of its time in kernel mode. This means Grex can make good use of 2 processors, because (on average) one processor can be in the kernel while the other one is executing user code. In theory, with 3 or more processors, the fact that only one processor can be executing kernel code becomes the limiting factor, and performance should not be significantly better than with 2 processors. It could be worse if there is any significant penalty for extra processors.
Obviously, in SunOS, the MP support is pretty primitive. It is possible to fix this, basically by completely rewriting the kernel. Sun has done this, and the result is called Solaris 2. Solaris 2 can, in fact, make good use of many processors; however, for a uniprocessor system, the resulting MP overhead actually hurts performance.
Rather than trying to deploy a single large MP system, however, there is another very different architectural approach, and that is to implement a real distributed environment. Basically, this means deploying a number of smaller loosely coupled machines, and splitting up the stuff that you see here on grex between these multiple machines. That might mean, for instance, a few file servers, an authentication server, a mail server, several login machines, and several terminal servers. This architecture has some important advantages over a monolithic large MP server; for instance, if a machine breaks, it's easier to take it out of service and replace it. This architecture also scales better (up to thousands of online users at once), and is in some ways much more robust (something that clogs up the mail server with tons of mail may barely affect anything else).
Another important advantage (for grex) is that new MP servers are extremely expensive, and used MP servers are not likely to be at all common; while the "small server" distributed environment can make good use of used workstations, which are extremely common.
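The argument about the single-threaded kernel can be put into numbers with a very rough model. This is only a sketch, assuming that kernel work can run on one processor at a time but overlaps freely with user code on the others, and ignoring any per-processor overhead (which, as noted, could make things worse). The function name is made up for illustration.

```python
def max_speedup(n_cpus, kernel_fraction):
    """Upper bound on speedup over one CPU with a single-threaded kernel."""
    if n_cpus < 1 or not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("need n_cpus >= 1 and 0 <= kernel_fraction <= 1")
    # One unit of single-CPU work can't finish faster than the serialized
    # kernel share, nor faster than a perfect split across all CPUs.
    best_time = max(kernel_fraction, 1.0 / n_cpus)
    return 1.0 / best_time


for n in (1, 2, 3, 4):
    print(n, max_speedup(n, 0.5))
```

With the 50% kernel share quoted above, the bound is 2.0 for 2, 3, and 4 processors alike, which is the "no better than 2 processors" conclusion. A real system would land below the bound.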
If you run a modern version of Solaris (2.5, 2.5.1, 2.6) it's nice and fast. I run those at various locations, including on machines slower than grex, and it works well. An upgrade to a multithreaded OS is something to consider. As for the security part, it's even simpler: most security issues are with X software, or fancy software that does fun stuff. This software is not needed on grex. The kernel hacks for restricting network access are required for Solaris; I have nothing to do with those, and can't recommend a good fix for that. Remove all suid bits, except on well known, trusted software, such as top, etc.. Nether.net is a very secure system that way :) Grex can also fix some of its time spent in kernel mode (it's actually blocking for IO) by purchasing 7200RPM disks instead of the slow disks it has, and using those for swap instead. Using 3600RPM disks for swap is not a great thing for performance.
Faster disks will make very little difference with kernel overhead. The kernel overhead for a scsi disk of any speed will be virtually identical, consisting of the time to set up the control blocks, pass them to the controller, and later, to respond to the interrupt from the controller when the transfer is done. The speed of the disk is completely irrelevant for these 2 activities; they'll take almost exactly the same amount of CPU no matter how fast (or slow) the disk is. 7200RPM drives will have lower rotational latency, which will certainly make things more spiffy, but may also have higher transfer rates, which (if the controller is capable of it), could actually "hurt" CPU performance by eating more memory bandwidth. (The "hurt" disappears when you consider overall system performance; it takes the same # of memory cycles to transfer disk data to memory, regardless of whether it's all squashed together or spread out a bit.) 7200RPM drives would certainly still help performance, but finding a disk with fast seek times is probably more important. This is very different from the PC world, where with IDE, a faster disk might well result in improved CPU utilization. IDE disks don't do DMA, but instead, the CPU transfers the data from the disk. I've seen one report that on some newer machines, an IDE drive can eat about 50% of the CPU, vs. about 10% for a similar SCSI disk. A faster spinning IDE drive might well be able to manage faster transfer rates, and thus take less CPU.
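For a sense of scale on the rotational latency point: on average the drive has to wait half a revolution for the wanted sector to come around. A small sketch of that arithmetic follows, using the 3600 and 7200 RPM figures mentioned in this thread. It says nothing about seek time or transfer rate, and the function name is just for illustration.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in milliseconds: half of one revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60000.0 / rpm
    return ms_per_revolution / 2.0


for rpm in (3600, 7200):
    print("%d RPM: %.2f ms" % (rpm, avg_rotational_latency_ms(rpm)))
```

That works out to about 8.33 ms at 3600 RPM and about 4.17 ms at 7200 RPM, so doubling the spindle speed saves roughly 4 ms per access before seek time is counted.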
I'm talking about 7200rpm scsi disks. I'm not aware of one of those that has a seek time above 9ms. The current disks we have are much higher in their seek times, and it takes longer to get back because of the slow spin. You'd be shocked to notice the speed increase you'll get in having a disk 2x as fast in rpm, and half the seek time, which is what you'd get. I'm not sure why you stuck a useless note in there about IDE, as that is irrelevant to the way grex operates, as it's not on a PC.
You can get a good idea of how much a faster disk will help throughput on Grex by looking at the queue lengths as reported by vmstat. Try vmstat 5 and let it run for a minute or two. Because the "r" queue lengths completely dominate the "b" numbers, I am not convinced that a faster disk will speed things up very much. More CPU might even help more than disk, even under SunOS 4.1.4, though the need for more CPUs is not as great for performance as it is for backup. Grex effectively uses all the time while waiting for disk IO to complete by running other processes. If we did have a disk with higher IO rates, though, swap would be the first choice for where to put it, simply to make sure that there were always enough processes in memory to keep the CPUs busy. It seems there are, though, in our 128M of RAM.
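If someone wants to make the "r" versus "b" comparison less of an eyeball exercise, the saved output of vmstat 5 can be averaged. The sketch below is hedged: it assumes only that the output has a header line containing the column names "r" (runnable processes) and "b" (processes blocked waiting on resources such as disk I/O), followed by whitespace-separated integer rows in the same column order. The function name is hypothetical and this has not been run against Grex's vmstat.

```python
def mean_queue_lengths(vmstat_text):
    """Return (mean r, mean b) from vmstat-style text."""
    r_idx = b_idx = None
    r_vals, b_vals = [], []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "r" in fields and "b" in fields:
            # Column header line; vmstat repeats it periodically.
            r_idx, b_idx = fields.index("r"), fields.index("b")
            continue
        if r_idx is None or len(fields) <= max(r_idx, b_idx):
            continue  # banner line, blank line, or text before the header
        try:
            r, b = int(fields[r_idx]), int(fields[b_idx])
        except ValueError:
            continue  # not a data row
        r_vals.append(r)
        b_vals.append(b)
    if not r_vals:
        raise ValueError("no vmstat data rows found")
    return sum(r_vals) / len(r_vals), sum(b_vals) / len(b_vals)
```

A mean "r" well above the mean "b" would support the reading that processes are mostly waiting for CPU rather than for disk. Note that the first row vmstat prints is usually an average since boot, so it may be worth trimming it from the capture before averaging.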
The main reason I mentioned IDE is that you claimed 7200 RPM disks would decrease kernel mode time. That might be true for IDE; it's not likely to be the case with SCSI. Although this is certainly not an issue with the sun-4, it *is* an issue with the proposed 486 based mail server we've been talking about, or with other similar possible future servers. Unix does not spend kernel CPU while waiting for an I/O completion interrupt - it either schedules another process to run (almost always true on grex), or it idles (likely case on a Unix workstation). The system is currently managing about 11% user, 30% system, & 58% idle. There does not seem to be much actual paging going on, so a faster paging device may not make much difference in system performance. However, the paging system does seem a bit active in terms of page attaches and detaches, and that means we're a bit short on memory - so more memory would certainly be helpful.
You have several choices: