gull
|
|
response 325 of 547:
|
May 21 17:33 UTC 2003 |
Yeah, from what I've seen a lot of OpenBSDers are a bit elitist and
don't suffer newbies gladly. It's an unfortunate attitude.
|
janc
|
|
response 326 of 547:
|
May 21 17:37 UTC 2003 |
Hmmm...I'm trying to run the bonnie benchmark
(http://www.textuality.com/bonnie) on the raid disk, but I'm not sure it
will work. Bonnie wants me to use a file size several times larger than main
memory. Main memory is 1.5 Gig, so I told it to use 4 times that: 6144 Meg.
But the first thing it said is:
File './Bonnie.28521', size: -2147483648
Uh-oh. Someone may be using signed longs for the file size. If that's the
case, then the biggest file size I can use is probably around 2048 Meg, which
isn't several times the size of our memory. Well, I'll let it run and see
what happens.
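That printed size is exactly consistent with the byte count being kept in a signed 32-bit long: 6144 Meg in bytes wraps around to -2147483648. A quick check of the arithmetic (shell arithmetic is 64-bit, so the 32-bit truncation is simulated by masking):

```shell
# 6144 MB in bytes, truncated to a signed 32-bit value by hand
bytes=$((6144 * 1024 * 1024))        # 6442450944
low32=$((bytes & 0xffffffff))        # keep only the low 32 bits
if [ "$low32" -ge "$((1 << 31))" ]; then
    signed=$((low32 - (1 << 32)))    # reinterpret as signed
else
    signed=$low32
fi
echo "$signed"                       # -2147483648, what Bonnie printed
```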
|
janc
|
|
response 327 of 547:
|
May 21 18:00 UTC 2003 |
So, if we went with RAID, what would we do?
On the disks we'll have partitions
sd0a - pretty tiny. A place to store kernels. We'll boot from here.
sd1a - A copy of /dev/sd0a, so we can boot if sd0 dies
sd0b - swap partitions, one Gig each. You can put swap on raid, but
sd1b it doesn't appear to be a great idea. We'll trust OpenBSD to
sd2b balance swap load over the three spindles.
sd0d - the remainders of the disks, about 16 Gig each.
sd1d
sd2d
Now, sd0d, sd1d and sd2d will be clustered together into a RAID 5 array, called
raid0. To all intents and purposes, this appears as a single big disk. It
should come out at about 29 Gig, a more than adequate amount of space for all
of Grex's needs for a while. Raid0 gets partitioned into all the various
partitions we need, with root on raid0a, usr on raid0d and so on.
The 80 Gig IDE disk doesn't participate in this. We could put the boot
partition on it, but I'd want copies on two disks regardless, so we'll need
at least some non-raid partitions on the SCSI disks either way. Let's leave
everything critical off the IDE.
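For concreteness, the RAIDframe side of that plan might look something like the following; this is only a sketch, and the stripe width (64 sectors) and the serial number are placeholder values:

```shell
# Hypothetical /etc/raid0.conf for the three sd?d partitions
cat > /etc/raid0.conf <<'EOF'
START array
# numRow numCol numSpare
1 3 0
START disks
/dev/sd0d
/dev/sd1d
/dev/sd2d
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
64 1 1 5
START queue
fifo 100
EOF

raidctl -C /etc/raid0.conf raid0   # force-configure the array the first time
raidctl -I 20030521 raid0          # initialize component labels (arbitrary serial)
raidctl -i raid0                   # rewrite parity
disklabel -E raid0                 # then carve out raid0a, raid0d, and so on
```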
|
cross
|
|
response 328 of 547:
|
May 21 18:38 UTC 2003 |
Why not make sd2a a copy of sd0a as well? It wouldn't hurt anything,
and might help, since each disk would be exactly like every other disk
in terms of how the partitions are laid out. That makes partitioning
easy; you can keep a copy of the disklabel for one of the disks around
in a file somewhere, and just write it to a new disk with the disklabel
command if necessary. Then, just plop the new disk in, tell RAIDframe
to rebuild it, and let it go on its merry way.
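With identical labels on every disk, the replacement procedure could be about this short (a sketch; the device names and the saved-label path are assumptions):

```shell
disklabel sd0 > /root/scsi.label     # keep one copy of the common label
# ...say sd1 dies and is swapped for a new drive...
disklabel -R sd1 /root/scsi.label    # give the new disk the same layout
raidctl -R /dev/sd1d raid0           # rebuild the component in place
```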
|
janc
|
|
response 329 of 547:
|
May 21 19:08 UTC 2003 |
Probably would. Actually, you don't even have to keep the layout in a file.
You can just copy it from one disk to another:
disklabel sd0 | disklabel -R sd1 /dev/fd/0
That's how I built the current setup.
|
janc
|
|
response 330 of 547:
|
May 21 19:08 UTC 2003 |
Bonnie croaked while doing some seeks. Try it again with a smaller file to
see if that works better.
|
other
|
|
response 331 of 547:
|
May 21 19:53 UTC 2003 |
I thought the IDE was mainly for a comprehensive backup of the boot
partition plus storage for sources.
|
cross
|
|
response 332 of 547:
|
May 21 19:55 UTC 2003 |
Yeah, you can do that, but if you also keep the disk label around in
another file, you can label the disk on any machine with a SCSI controller.
Does that matter? I don't know. It might be slightly more convenient.
It's just a nit, though; it's trivial to get a copy of the disklabel once
the machine's set up, and I doubt it would matter....
|
cross
|
|
response 333 of 547:
|
May 21 19:58 UTC 2003 |
Oh yeah.... That's an idea. Put /usr/src and /usr/obj on the IDE
drive, and then you don't have to do anything hacky with linking them
to /var as in my latest proposal. /var can be decreased accordingly,
and more space allocated to /u.
|
janc
|
|
response 334 of 547:
|
May 21 20:09 UTC 2003 |
OK, with a file size of 2000M, I get results from Bonnie. The validity of
these results is, however, questionable, since a lot of the file may have been
in memory instead of on disk.
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid 5 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
We have two lines of results. The first was using the raid 5 array of three
SCSI disks. The second was on a single plain ordinary SCSI disk.
For each test we have the speed and the % of CPU used.
There are three output tests:
Per Char - file written sequentially with 2 billion calls to putc()
Block - file written with block writes
Rewrite - each block read, changed and rewritten
There are two input tests
Per Char - 2 billion calls to getc()
Block - block reads
And a seek test
Seeks - four child processes each execute 4000 seeks and reads. After
10% of these they change and rewrite the block.
So, on writing, RAID was 5 to 6 times slower. Notice that the supposedly
optimum block writes were actually slower than the character writes for the
RAID. The SCSI was twice as fast as RAID on the rewrite test.
On read the RAID array was still slower than the plain disks on the Per
Char reads, but a bit faster on the block reads. It was substantially slower
on the seeks.
Admitting that the benchmark is seriously questionable due to the small size
of the file relative to the large size of memory, this is not at all an
impressive result.
I reran the tests and got similar results.
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid 5 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
raid 5 2000 8745 6.4 7654 1.3 5717 2.2 51345 63.5 64022 14.6 150.0 1.1
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
scsi 2000 54058 43.4 54618 14.1 10129 2.8 60552 71.0 60865 11.1 203.4 0.9
I suppose the main advantage in performance is in balancing load among multiple
spindles, but this would really only be noticeable if multiple processes were
reading/writing the disk at once. With a single process, we aren't going to
gain much. Only in the seek test are there multiple processes, and then only
four.
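The ratios above can be recomputed straight from the first table (awk is used just for the floating-point division):

```shell
awk 'BEGIN {
    # plain SCSI speed divided by RAID 5 speed, from the table above
    printf "per-char write: %.1fx\n", 53754 / 9520
    printf "block write:    %.1fx\n", 54106 / 7974
    printf "rewrite:        %.1fx\n", 10090 / 5706
}'
# per-char write: 5.6x
# block write:    6.8x
# rewrite:        1.8x
```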
|
cross
|
|
response 335 of 547:
|
May 21 23:48 UTC 2003 |
Are softupdates turned on on the raid filesystem?
|
janc
|
|
response 336 of 547:
|
May 22 02:31 UTC 2003 |
No. They are not even enabled in the kernel. From what little I understand
of it, it improves performance only with respect to metadata updates -
updating inodes when files are created or destroyed. That wouldn't affect
these benchmarks. I don't get a clear feeling that it is super stable yet
either.
|
cross
|
|
response 337 of 547:
|
May 22 03:29 UTC 2003 |
Every write and every read is also a metadata update (mtime and atime).
Soft updates are definitely stable at this point; they're enabled by
default in FreeBSD. OpenBSD tends to be somewhat more conservative,
though.
Gads; security be damned. Grex would've been better off with FreeBSD.
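If we did want to try them, soft updates on OpenBSD are just a mount option once the kernel supports them; a sketch, with raid0d as /usr assumed from the layout discussed earlier:

```shell
# Requires a kernel built with: option FFS_SOFTUPDATES
mount -u -o softdep /dev/raid0d /usr
# or permanently, via /etc/fstab:
#   /dev/raid0d  /usr  ffs  rw,softdep  1 2
```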
|
janc
|
|
response 338 of 547:
|
May 22 05:55 UTC 2003 |
Well, I argued that. I have the impression softupdates are more mature in
FreeBSD than OpenBSD. It's not really clear though.
|
janc
|
|
response 339 of 547:
|
May 22 06:30 UTC 2003 |
For the heck of it, I ran eight copies of the Bonnie benchmark simultaneously
on the RAID 5 partition. Below, A through H were started simultaneously.
The last line is just one benchmark process running:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
A 2047 8087 5.9 5867 1.2 331 0.1 1647 2.0 1775 0.4 22.5 0.1
B 2047 1889 1.4 7770 1.4 545 0.2 890 0.9 1646 0.3 14.2 0.1
C 2047 1020 0.8 7038 1.2 417 0.1 1929 2.7 1578 0.3 19.9 0.2
D 2047 8647 6.3 7474 1.3 253 0.1 1905 2.4 4597 1.1 89.4 0.8
E 2047 3997 2.9 6946 1.2 215 0.1 23458 27.9 29250 6.4 155.5 1.4
F 2047 8314 6.2 7149 1.3 369 0.1 1333 1.6 1707 0.3 21.2 0.1
G 2047 8926 6.3 7899 1.4 512 0.2 865 1.1 1132 0.3 15.0 0.1
H 2047 4280 3.2 7861 1.3 458 0.1 954 1.2 1649 0.4 19.1 0.1
raid 5 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
They didn't stay well synchronized - you can tell that process E continued
running long after the others had finished (process scheduling doesn't seem
to be very fair). The write speeds didn't suffer too badly from the
competition, but the read times took a terrific beating - they are mostly
around 1/25 of the speed of one process. Note that there were probably some
write processes still running while the read processes were going.
Here's a more sensible test, a comparison against the SCSI and IDE drives,
in non-RAID configuration, with just one process running:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
ide 2000 27188 21.7 27038 6.9 9634 2.6 24889 29.9 25640 5.2 99.0 0.8
Seems the SCSI is about twice as fast on most benchmarks, and about the same
on the Rewrite test.
|
gull
|
|
response 340 of 547:
|
May 22 13:07 UTC 2003 |
RAID 5 is always going to be slower than a single disk, especially using
software RAID. There's more processing overhead, and you're doing a
third more reads/writes because of the parity. Still, I'm surprised to
see it 5 times slower. That doesn't seem very acceptable at all.
|
scott
|
|
response 341 of 547:
|
May 22 13:10 UTC 2003 |
RAID would be nice, and if we're making such a huge jump in processing power
then I don't think the performance penalty (assuming it's only 2-1 or
something less) is an issue.
|
janc
|
|
response 342 of 547:
|
May 22 13:50 UTC 2003 |
I'm beginning to suspect that some of these fast read times are
coming out of buffers. The drastic crash in read speed when I ran 8 bonnies
could be because instead of trying to buffer one 2G file in 1.5G of memory, we
were trying to buffer a total of 16G of files in 1.5G of memory. Some of
these really fast speeds (the ones around 50M/sec) are likely being done
largely out of cache. This makes the results pretty meaningless.
Anyway, I ran three simultaneous bonnies on a plain SCSI. I couldn't run
8 because I didn't have a 16 Gig partition.
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
A 2047 15240 16.1 20747 5.0 2018 0.6 3882 4.5 9506 1.7 164.7 0.6
B 2047 16768 13.7 20491 5.3 3016 0.9 4543 5.3 5598 1.1 31.6 0.2
C 2047 16812 13.6 17945 4.6 2977 0.9 4145 4.9 4513 0.8 46.5 0.2
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
scsi/3 2000 17918 43.4 18035 14.1 3363 2.6 20108 70.9 20355 11.5 67.1 0.8
The last line is just the one-process SCSI values divided by three. Notice
the write statistics for the three processes are all pretty close to one
third of the write statistics for a single process. The reads are way lower.
Is this an artifact of buffering? The seeks are a bit hard to tell, because
by that time the processes were pretty much out of synchronization.
The degradation in read performance is similar in magnitude to what we saw
on the raid (keeping in mind that we only have 3 processes instead of 8).
I think there must be a buffering thing going on here. The write statistics
are much better for the RAID - most of the 8 processes wrote much faster than
1/8 of the single process.
Note that in both cases, the single processes read faster than they write,
while the multiple processes write faster than they read. That's just weird.
|
aruba
|
|
response 343 of 547:
|
May 22 13:59 UTC 2003 |
Jan - can you fool the OS into thinking Grex has less memory than it really
does? Or tell it not to cache disk reads?
|
janc
|
|
response 344 of 547:
|
May 22 14:31 UTC 2003 |
Re #341: We are certainly taking a huge jump in processing power, but the
disk I/O performance improvement, while good, probably isn't as spectacular.
Disk speeds just haven't been growing as fast as processor speeds, and old
Grex's disks aren't nearly as old as its processor. So the performance jump
in disk I/O from old Grex to new Grex might not be that huge. (Maybe I
should run some benchmarks on old Grex to compare with - will everyone please
log off?). I expect the new Grex will have memory to spare, cpu to spare,
disk space to spare, but maybe not disk bandwidth to spare (and certainly not
net bandwidth to spare).
I think the main benefits of RAID are:
- Availability. If a disk dies, the system can keep running. Performance
degrades, but it still works. If you have a hot spare disk, it can
be brought on line, replacing the dead disk, without interruption in
service.
I do not consider this very important to Grex. We can afford short
downtimes in the case of disaster.
- Data Protection. If a disk dies, the data on the drives is not lost.
This is important to Grex. However, it can be achieved other ways.
We could do daily rsyncs from /var, /bbs, /home, and /etc to the IDE
drive (or even another machine). You might copy certain critical files
(/etc/passwd) more frequently. This has a performance penalty, of course.
In the case of a crash, your backup will not be fully up to date, so there
will be some data loss, but it should be tolerable. In the case of
accidental (or deliberate) deletion of data, this gives you a much better
safety net than RAID, so much so that we'll want to do at least some
of this even if we have RAID.
- Performance. RAID can balance the load over the drives nicely.
Yes, but so can ccd (pretty much equivalent to RAID 0).
So this doesn't really make a strong argument for RAID. However, there is
a bit of a flaw in the above break-down. These three aspects are not fully
separable. Suppose we merge our three SCSI drives into one big virtual ccd
drive and partition it up. Load balancing over the drives should be great.
Then one SCSI drive fails. You just lost a third of your data, scattered
randomly all over the system. You still have the other two thirds, but
doing anything with them is going to be a nightmare. Effectively a single
drive failure cooks all your data, instead of 1/3 of your data. I don't think
the performance improvement given by ccd or RAID 0 is worth the increased
risk of losing the whole system.
So I think the real alternative to RAID is what I originally proposed -
simple partitions, scattered across the drives in an ad hoc manner in hopes
of balancing the load across the spindles, with rsyncs to the IDE drive
for data protection.
I'm really starting to feel that might be the best choice. The advantages
of RAID for Grex are faint enough so that they don't quite overwhelm the
KISS factor in my estimation.
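The rsync scheme fits on one screen; a sketch, assuming the IDE drive is mounted on /backup and run nightly from cron:

```shell
#!/bin/sh
# Nightly mirror of the critical filesystems to the IDE disk.
for fs in /etc /var /bbs /home; do
    rsync -a --delete "$fs/" "/backup$fs/"
done
# The most critical files could go over more often, e.g. hourly:
#   cp -p /etc/passwd /etc/master.passwd /backup/etc/
```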
|
janc
|
|
response 345 of 547:
|
May 22 14:33 UTC 2003 |
Re 343: probably - but I'm not sure how. I thought of just creating a
RAMDISK and letting that eat up much of the memory (I could also run the
benchmark on a ramdisk, which might be interesting), but it looks like
you need to do a lot of kernel work to bring up a ramdisk, and I'm
insufficiently motivated.
|
cross
|
|
response 346 of 547:
|
May 22 14:37 UTC 2003 |
One can lower the amount of memory the kernel will use for caching by
mucking with the kernel. It looks like, when caching is taken out of
the picture, performance between RAID and the straight SCSI disks is
more or less on par?
|
janc
|
|
response 347 of 547:
|
May 22 14:48 UTC 2003 |
Hmmm...the faq (http://www.openbsd.org/faq/faq11.html) talks about the
BUFCACHEPERCENT kernel value. It says the default is 5%. I haven't touched
it, so if I'm reading this right, there should be 75M or less of disk cache.
Hmmm...Linux uses all free memory as disk cache. A much nicer setup.
Well, if that's the case then I'm not sure what makes those benchmark numbers
so goofy.
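For reference, 5% of 1.5 Gig is about 76M, which matches that estimate; raising it means changing the kernel config and rebuilding (the 30 below is an arbitrary example value):

```shell
echo $((1536 * 5 / 100))   # MB of cache at the default 5% with 1.5G of RAM
# To raise it, set the option in the kernel config file and rebuild:
#   option  BUFCACHEPERCENT=30
```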
|
gull
|
|
response 348 of 547:
|
May 22 17:12 UTC 2003 |
Re #344: I think that's starting to make sense, yes. Unless it turns
out the performance hit you're seeing is an artifact of your testing
method, we may be better off going with using the disks "straight".
Getting only 20% of the potential performance of the disk subsystem in
exchange for easier recovery on the rare occasions when we have disks
fail doesn't seem like a good tradeoff. I'm still having trouble
believing RAIDframe is *that* inefficient, though.
|
cross
|
|
response 349 of 547:
|
May 22 18:20 UTC 2003 |
So am I; it seems unreasonably slow, and it looks vaguely like the
numbers start to converge when you have many processes working at
once, which is the normal mode of operation. I'd be interested in
seeing what a test simulating a timesharing load would be like.
|