25 new of 547 responses total.
janc
response 347 of 547:
May 22 14:48 UTC 2003
Hmmm...the faq (http://www.openbsd.org/faq/faq11.html) talks about the
BUFCACHEPERCENT kernel value. It says the default is 5%. I haven't touched
it, so if I'm reading this right, there should be 75M or less of disk cache.
Hmmm...Linux uses all free memory as disk cache. A much nicer setup.
Well, if that's the case then I'm not sure what makes those benchmark numbers
so goofy.
gull
response 348 of 547:
May 22 17:12 UTC 2003
Re #344: I think that's starting to make sense, yes. Unless it turns
out the performance hit you're seeing is an artifact of your testing
method, we may be better off going with using the disks "straight".
Getting only 20% of the potential performance of the disk subsystem in
exchange for easier recovery on the rare occasions when we have disks
fail doesn't seem like a good tradeoff. I'm still having trouble
believing RAIDframe is *that* inefficient, though.
cross
response 349 of 547:
May 22 18:20 UTC 2003
So am I; it seems unreasonably slow, and it looks vaguely like the
numbers start to converge when you have many processes working at
once, which is the normal mode of operation. I'd be interested in
seeing what a test simulating a timesharing load would be like.
lk
response 350 of 547:
May 23 03:37 UTC 2003
One simple way to "fool" the kernel into thinking that NextGrex has less
memory is... to remove all but one memory module. Guaranteed to work. (:
You might also want to test mirroring. Might be more efficient (less CPU
utilization for striping and no extra parity data) while offering both
availability and redundancy. The "cost" here is 50% drive overhead.
The boot disk, with the system partitions (and /tmp or was that IDE?)
could be one disk while the other pair could be mirrored.
janc
response 351 of 547:
May 23 13:49 UTC 2003
I'm reluctant to take the machine apart for such purposes. Anyway, I'm a
software guy.
I certainly agree that we need better benchmarks, but I'm not sure how to
obtain them. Anyone with better ideas is welcome to suggest them. Those of
you with accounts on the system can probably run them yourselves, as the
relevant disk partitions are permitted 777. We really want to get some sense
of how RAID would affect a realistic multi-user load.
I tried running a benchmark with a really small file, one where you should be
getting lots of use from cache. Here are the 50 MB and 2000 MB results. Explain
this, if you will:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid 5 50 24882 20.9 22641 3.5 3731 1.2 7555 9.6 64346 14.7 511.7 3.1
raid 5 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
The small run has much faster output, and significantly faster seek times.
The block read is about as fast as the large file (suggesting that it is
mostly reading from buffer). But what's going on with the per char read?
Note that the sequence of the tests is:
Per Char Output
Rewrite Output
Block Output
Per Char Input
Block Input
Seek
So it may be that the Per Char read was from disk, but left the entire file
in cache, so the block read was then very fast. But why wouldn't it already
be in cache after the block output? And why would the same speed be
achieved on the Block Read with the 2000 MB file, which can't have all been
in cache?
I don't think I know enough about how buffering and disk I/O work in OpenBSD
to really interpret this stuff.
aruba
response 352 of 547:
May 23 14:06 UTC 2003
The information on the Bonnie web page (http://www.textuality.com/bonnie/)
makes it sound like the tests are designed to correct for caching. There's
some info there on how to interpret the results.
janc
response 353 of 547:
May 23 15:00 UTC 2003
Maybe to help more people figure out what is being discussed here,
I should give a brief overview of RAID.
RAID stands for "Redundant Array of Inexpensive Disks" (the I-word
varies). Someone wrote a paper once upon a time surveying various options
for putting a lot of small disks together, and named the variations RAID 1,
RAID 2, RAID 3, RAID 4, and RAID 5. The RAID 0 name was coined later and
isn't really RAID. The interesting ones are RAID 0, RAID 1 and RAID 5.
I'll also discuss RAID 4 because understanding it makes RAID 5 easier
to understand.
Suppose you needed a 100 Gig disk, and all you had was ten 10 Gig disks.
Well, you could put them all together in a box, and write a little
controller that would write the first 10 Gig to disk one, the next 10
Gig to disk two and so on. To the computer, your box would look like
a single disk.
The performance of this disk array wouldn't be so hot though. Most
programs access files sequentially, so as the 100 Gig file was read,
we'd first have disk one very busy, while the other nine sit idle, then
disk two would be busy, and so forth. It'd be nice to balance the load
among the disks.
Which brings us to RAID 0 - also known as striping. We slice the disks
into 32K chunks. As you write a big file to the disk, the first 32K
goes to disk one, the second 32K to disk two, on through the tenth
32K chunk going to disk ten. That completes a stripe. The eleventh
32K chunk goes to disk one again. This balances the load over all ten
disks, so you get better performance. You can vary the chunk size for
different applications.
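The round-robin layout just described is plain modular arithmetic. Here's a toy sketch of the address mapping (not RAIDframe's actual code; the chunk size and disk count are taken from the example above):

```python
# Toy sketch of RAID 0 address mapping: which disk, and where on that
# disk, a given byte of the virtual disk lands. Illustration only.
CHUNK = 32 * 1024   # stripe unit size in bytes (the 32K chunks above)
NDISKS = 10         # ten disks, as in the example

def locate(offset):
    """Map a byte offset on the virtual disk to (disk, offset on disk)."""
    chunk = offset // CHUNK        # which 32K chunk of the virtual disk
    disk = chunk % NDISKS          # chunks rotate round-robin over disks
    stripe = chunk // NDISKS       # which full stripe the chunk sits in
    return disk, stripe * CHUNK + offset % CHUNK

# The eleventh 32K chunk (chunk 10, counting from zero) lands back on
# disk one, in that disk's second chunk slot.
print(locate(10 * CHUNK))   # (0, 32768)
```

Varying `CHUNK` is exactly the "you can vary the chunk size" knob mentioned above.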
So RAID 0 gets you a large virtual disk and balances load over your
drives. It doesn't give you any increase in reliability. Quite the
contrary. If a drive dies, then instead of losing a 10Gig hunk of data,
you lose lots of 32K hunks of data scattered through all your data.
This is probably harder to restore.
Load balancing over multiple spindles would be nice for Grex, but not
vital. We don't have just a single process reading the disk sequentially.
Increasing the difficulty of reconstructing the file system after a
disk crash is too high a cost to pay for slightly better load balancing.
I think we can rule RAID 0 out as an option.
There is no Redundancy in RAID 0 (so it should be called "AID 0").
Real RAID starts with RAID 1 - also called "mirroring". We are still
trying to make a virtual disk out of many real disks. This time we'll
group our ten 10Gig disks into five pairs, disk 1A, 1B, 2A, 2B, etc.
Whenever we write data to disk 1A, we also write a copy of the same data
to the corresponding location on disk 1B. The first obvious effect is
that our virtual disk only contains 50 Gig instead of 100 Gig. But now,
if disk 1B dies, we have an up-to-the-nano-second backup copy. We can
replace the disk 1B with a new disk, copy the contents of disk 1A onto
it, and be back up and running with no loss of data.
Ideally, in RAID 1, we'd do the writes to the two disks simultaneously,
so writing is no slower than reading. (In software implementations of
RAID 1, this may not entirely work.) On reads, we don't have to read
from both disks. We just select the one that is less busy at the moment
and read from that. So, we get decent performance and the capability
to survive a single drive failure, but at the cost of half our disk space.
I've heard of RAID 0+1, but not read much about it. I assume it's just
striping over the 5 pairs of mirrored disks in the example above.
RAID 4 is an attempt to get the same benefits as RAID 1, but with less
loss of disk space. This time we call 9 of our disks "data disks" and
the other one a "parity disk". Parity just means "even" or "odd". The
129th bit stored on the parity disk depends on the values of the 129th
bit stored on the other nine drives. If an odd number of those nine bits
are 1's, then a 1 is stored at that location on the parity disk. If an
even number of them are 1's then a 0 is stored at that location on the parity
disk. In geek terms, the content of the parity disk is just a bit-wise
exclusive-OR of the contents of all the other drives.
Suppose a drive dies. If it was the parity drive, we can just recompute its
value from the other drives. But what if a data drive dies? Well, we have
all the other drives and the parity drive. So for each bit we have something
like:
data1 data2 data3 data4 data5 data6 data7 data8 data9 parity
1 0 1 X 0 1 0 0 1 1
The parity bit is 1, so we originally had an odd number of 1's on the
data disk. There are 4 ones on the surviving drives, so the bit on the
dead drive must have been 1. (In fact the dead drive's contents are just
the bit-wise exclusive-OR of all the surviving data and parity drives, so
the reconstruction process for a dead data drive is identical to the
reconstruction process for a dead parity drive).
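The table above can be checked mechanically. A toy sketch of the XOR reconstruction, using the same nine data bits (with the lost "X" bit being the 1 the argument recovers):

```python
# Toy illustration of the parity example above: nine data bits and one
# parity bit, reconstructing a lost data bit by XOR. Not real RAID code.
from functools import reduce
from operator import xor

data = [1, 0, 1, 1, 0, 1, 0, 0, 1]   # the nine data bits; the "X" was 1
parity = reduce(xor, data)           # 1, since an odd number of bits are 1

# Lose data[3]; XOR of the survivors plus the parity bit recovers it.
survivors = data[:3] + data[4:]
recovered = reduce(xor, survivors + [parity])
print(recovered)   # 1, matching the lost bit
```

Note the code makes no distinction between losing a data bit and losing the parity bit, which is exactly the point made in the parenthetical above.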
So, this is cool. We now have a virtual drive holding 90Gig of data, so
we've lost only 10% of our storage, and we can still reconstruct all the
data on any lost drive.
There are some additional performance costs though. The first problem is
the parity drive. Every time you write data to a drive, you have to update
the data on the parity drive. So though data writing is split over nine
drives, parity writing is all on one drive, so that drive is nine times as
busy as the other drives. It becomes a performance bottleneck.
The solution to this problem is RAID 5 - stripe the parity data over all the
drives. For example, the parity data for the first 32K of all the drives
would be on drive 1, the parity for the second 32K of all the drives would
be on drive 2, and so on. So there is no one parity drive and parity is
spread over all disks. (Note that disk reconstruction doesn't change -
you still just exclusive-OR all the other drives to reconstruct the
lost drive.)
There is a second performance hit in RAID 4 and 5 though. Like RAID 1, every
write is to two drives - data to one drive and parity to another. However,
before we can write the parity, we have to compute the parity, and that means
we need to read the corresponding data from the other eight data drives. So
a simple write turns into 8 reads and 2 writes.
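(An aside: in practice RAID implementations usually avoid reading all the other data drives for a small write. Because parity is XOR, reading just the old data block and the old parity block is enough: new parity = old parity XOR old data XOR new data, so a small write costs 2 reads and 2 writes no matter how wide the array is. A toy sketch with made-up 4-bit values:)

```python
# The read-modify-write parity shortcut: because parity is a bit-wise XOR,
# new parity = old parity XOR old data XOR new data. No need to touch the
# other data drives, so a small write is 2 reads plus 2 writes.
# The 4-bit values below are made up, purely for illustration.
old_data   = 0b1011
new_data   = 0b0110
old_parity = 0b1100

new_parity = old_parity ^ old_data ^ new_data

# Equivalent to recomputing from scratch: the XOR of the untouched drives
# is (old_parity ^ old_data), and new parity is that XOR'd with new_data.
others = old_parity ^ old_data
assert new_parity == others ^ new_data
```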
Also, in RAID 1, we were able to improve read performance by always reading
the data from the less busy drive of the two that had the data. In RAID 4
and 5, the data is only on one drive, so we can only read it from that drive.
However, we like to assume the striping in RAID 5 will balance the load among
the drives pretty well anyway.
There are lots of hardware RAID devices that optimize this kind of thing, but
we can't afford them. The option we are considering is software RAID, which
is implemented in the OpenBSD kernel by a program called RAIDframe. It's
pretty solid and rather nice. You can set up a RAID array, possibly with
spare drives. If a drive fails, and there is a spare on-line, it will
automatically bring the spare on line, reconstruct the lost data and proceed
without interruption of service. If there are no spares, it'll run with a
drive short (in RAID 5, any read from the dead drive is simulated by reading
from all the others and exclusive-Oring them). This is all terrific if you
need a server up 24x7, which Grex doesn't really.
Note that the redundancy in RAID gives you some protection against single
disk failures (it's assumed that you do something before the second disk
dies). It does not replace a backup. If you accidentally delete the wrong
file, or a vandal breaks in and alters all your files, the RAID will give
you nice redundant copies of the altered files, not the original ones.
So RAID is not a substitute for backups. It's protection against hardware
failure and that's all.
RAID 0 can give you some performance enhancements by load balancing. The
other versions of RAID are all likely to be slower than a non-RAID setup,
especially if implemented in software. RAID 0 doesn't cost you any disk
space. The other versions are going to eat up some of your disk space.
In our case, since we have 3 drives, RAID 1 doesn't quite work and RAID 5
would eat up 1/3 of our disk space.
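The capacity arithmetic in the last paragraph can be sketched as a small function (a toy model assuming equal-size drives; the 10 Gig figure is illustrative, not Grex's actual drive size):

```python
# Toy capacity model for the RAID levels discussed, assuming n equal
# drives. Sizes are illustrative only.
def usable(n_drives, size_gb, level):
    if level == 0:                        # striping: all space, no redundancy
        return n_drives * size_gb
    if level == 1:                        # mirroring: pairs, half the space
        return n_drives // 2 * size_gb
    if level == 5:                        # one drive's worth goes to parity
        return (n_drives - 1) * size_gb
    raise ValueError(f"unmodeled RAID level: {level}")

# With three drives, RAID 1 strands the odd drive (one usable pair)
# and RAID 5 gives up one drive in three to parity.
print(usable(3, 10, 1), usable(3, 10, 5))   # 10 20
```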
janc
response 354 of 547:
May 23 15:03 UTC 2003
OK, that wasn't so brief. But writing it just made me more sure that RAID
isn't right for Grex. The problem it is primarily designed to solve isn't
an important issue for Grex.
I may do some experimenting with rsync, and see if I can get a sense of how
expensive it would be to regularly rsync to the IDE disk.
gull
response 355 of 547:
May 23 15:43 UTC 2003
Where I work, we use rsync to keep a mirror of about 50 gigs worth of
data. We're doing it across the Internet, via a T1, as well. It does
cause a fair amount of disk thrashing on both ends when it figures out
what files need to be transferred (very much like doing a 'find' across
the filesystem) but overall it seems very efficient. It's worked well
for us. My guess is the "expense" of doing an rsync to another local
disk a couple times a day is going to be pretty low, especially since
you're not transferring over a network and so won't need to involve ssh
or compression.
cross
response 356 of 547:
May 23 15:51 UTC 2003
I ran a benchmark last night; one of my own design. It's nothing really
fancy or scientific; I wrote it a few years ago to try and get a feel for
how various disk subsystems and filesystem types handled a load I thought
was fairly typical of timesharing style machines. Basically, it just
copies a bunch of 32KB files all over the place.
Running on both the IDE and SCSI drives took about 4 seconds. Running on
the RAID took around 80 seconds.
Something is wrong here; there's no reason RAIDframe should be *20 times*
slower than a `normal' filesystem. I just can't believe it's that bad.
Perhaps I'm wrong about the stripe size; maybe 64 is just too small. Jan,
could you up it to 256 and see if that helps any? I see at least one
post from someone who says they used an interleave size of 168 and got
decent performance, but 32 (and probably 64) was too small.
janc
response 357 of 547:
May 23 16:02 UTC 2003
Right. I installed rsync from the ports tree (I like the ports tree).
I then went to /sd0 (the test partition on the first scsi disk) and did
time rsync -ax /usr .
This should copy the whole /usr partition from the IDE disk to the SCSI
disk (which is backwards from the direction we would be going) and give
me some statistics. The /usr partition contains 664,632K of data. The
result from 'time' was:
12.0u 24.5s 4:28.34 13.6% 0+0k 161046+660947io 36pf+0w
So it took 4.5 minutes elapsed time, eating 13.6% of an otherwise idle CPU.
I then reran it. In this case it should be checking the two copies against
each other, and copying over only what changed (little or nothing). The
time result was:
3.8u 3.5s 0:46.13 16.1% 0+0k 47000+1454io 1pf+0w
This took about 46 seconds.
In real life we'd want the --delete option on the command, so files that
don't exist on the source are removed from the copy, but I didn't do it in
the test because I was paranoid about getting the arguments backward. Even
so, we'd want our target partitions rather larger than the source partitions.
Maybe just one big target partition instead of separate ones corresponding
to the different source partitions, the whole thing readable only by root
and possibly unmounted when it isn't being updated.
Doing this a couple times a day seems a much lower-impact way to get data
redundancy than RAID.
It'd be tempting to keep two copies of some partitions, and update them on
alternate days. Dunno if that's necessary.
This is not a substitute for real backups to tape, of course.
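Assembled as it might run in production (with the --delete option discussed above, and with illustrative paths; in real use the direction would be reversed, SCSI to IDE), the command would look like this. A sketch only, not a tested cron job:

```python
# Sketch of the snapshot command discussed above. Paths are illustrative
# (the /sd0 test partition from the experiment); this builds the command
# line but deliberately does not run it.
src = "/usr"     # source partition (IDE disk in the test)
dst = "/sd0/"    # target partition (SCSI disk in the test)

# -a: archive mode, -x: don't cross filesystem boundaries,
# --delete: remove files from the copy that no longer exist in the source.
cmd = ["rsync", "-ax", "--delete", src, dst]
print(" ".join(cmd))   # rsync -ax --delete /usr /sd0/
```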
janc
response 358 of 547:
May 23 16:04 UTC 2003
Dan slipped in. I'll try reconfiguring the RAID.
janc
response 359 of 547:
May 23 16:52 UTC 2003
OK, I reconfigured it with a 256 K stripe size. The current config file
is in /etc/raid1.conf. Running bonnie now.
jep
response 360 of 547:
May 23 16:55 UTC 2003
How much would a RAID controller cost? I'm sure Jan is right that it'd
cost too much, but if there are substantial benefits maybe the users
would spring for some more money.
I'm not sure the benefits would be all that substantial in any case.
We're moving to brand-new spiffy hardware, and I expect that will already
mean a big improvement in reliability. Grex isn't unreliable even
now. But it seems like it'd be easier to discuss it now than after the
new machine is in place and in use.
scg
response 361 of 547:
May 23 17:11 UTC 2003
I want to dispute the claim that since Grex doesn't have to be up all the
time, the high availability provided by RAID isn't important. Grex doesn't
pay anything for its staff time, but it is a scarce resource. The difference
in staff time required to format a new disk and restore data to it, versus
just putting in a new disk and letting it happen automatically, is huge.
I too am curious about the costs of hardware RAID controllers. It's been
years since I looked at such things, but given that they were widely
available three or four years ago, I'm surprised to hear the price hasn't
come down.
janc
response 362 of 547:
May 23 17:16 UTC 2003
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid 64 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
raid 256 2000 9483 7.1 8768 1.9 5443 2.6 56017 67.9 70599 14.3 183.1 1.3
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
OK, the second line is RAID with stripe size of 256 kiB instead of 64 kiB.
Generally things are better, but not dramatically so. (Doing
'raidctl -sv raid1' confirms that it did get reconfigured.)
Generally, if you do a large number of small reads and writes to small files,
then a large stripe size is better, and if you read a smaller number of larger
files, a smaller stripe size is better. Grex probably belongs on the larger
end of the spectrum.
janc
response 363 of 547:
May 23 17:26 UTC 2003
Note that we have a hardware RAID controller on our motherboard, a "Promise"
device whose model number I've forgotten. It works only with IDE drives and
is not supported by OpenBSD (they don't seem to think they are going to
support such things either). So, there is a wide range of hardware RAID
controllers with different capabilities and prices.
Recovering from a disk crash certainly costs less staff time with RAID. But
how often does it happen? If you have a recent snapshot on another disk,
recovering from a disk crash isn't all that hard even without RAID. Amortize
the time difference over the low frequency with which it happens, and I don't
see much weight to that argument.
jep
response 364 of 547:
May 23 18:45 UTC 2003
I did a quick search on RAID controllers, and saw prices in the mid-
several hundreds ($300-700). I don't know anything about what value
would be provided by the different types. I am not in position to
analyze the number and effects of disk hardware failures, either. I'm
only asking a question.
gull
response 365 of 547:
May 23 20:27 UTC 2003
Also, OpenBSD's hardware support is pretty limited even compared to
other open-source operating systems, so you can't buy just any RAID
controller and expect it to work.
janc
response 366 of 547:
May 23 20:59 UTC 2003
http://www.openbsd.com/i386.html#hardware includes a list of hardware RAID
controllers supported by OpenBSD. Not that I think we should get one.
lk
response 367 of 547:
May 24 03:12 UTC 2003
As jep said, you can get a decent RAID controller for about $400.
OpenBSD drivers, though, are another matter.
I think Grex needs to move forward. The second-guessing can continue for
years, but the hardware is already in place (perhaps there should have
been more discussion earlier). Keep in mind that what we're "bickering"
over is what may (or may not) be a little bit better than the alternative.
Having said that, what about my idea?! Have one boot disk with all the
(rarely changing) system directories on it and then configure the other
two "data" disks as RAID 1 (mirroring). It entails 50% disk "waste",
but shouldn't have the performance hit while retaining availability
and redundancy.
After all, we live in compromising times.... (:
jep
response 368 of 547:
May 24 04:09 UTC 2003
I didn't have the impression I was holding anything up, or that anyone
else was, either, with the questions about RAID. Dan has been making
what appear to be useful suggestions -- I can conclude that, if only
because Jan has been accepting some of them.
As for my part, I think it's clear enough to everyone here that I
shouldn't have any input about RAID. I've never set up a RAID system.
If there's a choice for a staffer between doing anything about the new
system, and answering one of my questions or comments, by all means,
work on the system. (As if I even have to say that.)
i
response 369 of 547:
May 24 12:47 UTC 2003
Back in janc's "Intro to RAID":
RAID 5 turns a disk write into 2 reads & 2 writes. Better than the figure
janc suggested grex (with 3 disks, not 10) would face, but still
not good when (i believe) grex is doing plenty of writes. (Is it?)
Good hardware RAID (with dedicated hardware to do parity calculations,
lots of private cache memory to reduce disk activity, etc.) could improve
this. But disk space is cheap enough these days to make RAID 1 the way
to go if one wants redundancy in a "lots of writes" situation. (At least
for our size & budget.) RAID 1 is also considerably easier to do
"acceptablely" in software, and great software RAID is obviously not a
priority for OpenBSD.
If we're eager to avoid downtime, a spare hard drive's great to have.
When a dead drive has you down or limping, there's often a huge downtime
difference between "have an identical, well-tested spare drive on hand"
and "rush to research suitable replacement models, where they might be
bought, costs, and lead times". *Especially* since different generations
of SCSI hard drives sometimes fail to "play well together" in flaky,
intermittent ways.
cross
response 370 of 547:
May 24 16:03 UTC 2003
Hmm. It would appear that RAID5 performance is just unacceptably slow
with RAIDframe in OpenBSD. Weird; I'd have thought it'd be better. Oh
well, it's not the first time I've been wrong.
If a hardware RAID controller is $400, one would have to weigh the cost
of buying one of those versus buying another SCSI disk for $200 and using
RAID 0+1 (mirroring, and striping over the mirrors). That I am reasonably
confident would be fast. Is it worth it for grex? That's another matter.
I agree with scg that it is, but I'm not paying all the bills.
I disagree with Leeron that doing mirroring by itself is the way to go;
I think the price/performance ratio isn't worth it.
lk
response 371 of 547:
May 24 16:18 UTC 2003
Sorry, jep, I didn't mean to imply that you (or others) were holding up
anything. I certainly have no idea what the implementation time frame is.
For all I know, Grex budgeted the next 3 months for such discussion
before finalizing NewGrex and putting it on-line. (:
There's a lot of worthy discussion here and many good suggestions.
But I do know how over-discussion can become negative on a BBS, and I don't
want to see that happen here. Not to sound like the US Patent Office
commissioner of 125 years ago, I think all the constructive comments about
RAID, with all its pluses and minuses, have been made. It's time to make a
decision....
These are the points I'd consider:
(Note that whether RAID is useful for Grex almost becomes a moot point)
1. We have no RAID controller
(and I'm not impressed by the list supported by OpenBSD)
2. The software RAID-5 performance rules that out.
3. Software RAID-1 remains a possibility.
(At least Walter and I think so.)