janc
response 327 of 547: May 21 18:00 UTC 2003

So, if we went with RAID, what would we do?

On the disks we'll have partitions

   sd0a   - pretty tiny.  A place to store kernels.  We'll boot from here.
   sd1a   - A copy of /dev/sd0a, so we can boot if sd0 dies
    
   sd0b   - swap partitions, one Gig each.  You can put swap on raid, but
   sd1b     it doesn't appear to be a great idea.  We'll trust OpenBSD to
   sd2b     balance swap load over the three spindles.

   sd0d   - the remainders of the disks, about 16 Gig each.
   sd1d
   sd2d

Now, sd0d, sd1d and sd2d will be clustered together into a RAID 5 array, called
raid0.  To all intents and purposes, this appears as a single big disk.  It
should come out at about 29 Gig, a more than adequate amount of space for all
of Grex's needs for a while.  Raid0 gets partitioned into all the various
partitions we need, with root on raid0a, usr on raid0d and so on.

The 80 Gig IDE disk doesn't participate in this.  We could put the boot
partition on this, but I'd want copies on two disks anyway, so we'll need at
least some non-raid partitions on the SCSI disks anyway, so let's leave
everything critical off the IDE.
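
A RAIDframe configuration for that layout might look roughly like this.  The
stripe-size and queue numbers are placeholder guesses, not tuned values, and
the file name is arbitrary - check raidctl(8) before trusting any of it:

```shell
# hypothetical /etc/raid0.conf for the three-component RAID 5 set above
cat > /etc/raid0.conf <<'EOF'
START array
# numRow numCol numSpare
1 3 0

START disks
/dev/sd0d
/dev/sd1d
/dev/sd2d

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
64 1 1 5

START queue
fifo 100
EOF

raidctl -C /etc/raid0.conf raid0   # force the initial configuration
raidctl -I 20030521 raid0          # write component labels (serial is arbitrary)
raidctl -i raid0                   # initialize parity
```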
cross
response 328 of 547: May 21 18:38 UTC 2003

Why not make sd2a a copy of sd0a as well?  It wouldn't hurt anything,
and might help, since each disk would be exactly like every other disk
in terms of how the partitions are laid out.  That makes partitioning
easy; you can keep a copy of the disklabel for one of the disks around
in a file somewhere, and just write it to a new disk with the disklabel
command if necessary.  Then, just plop the new disk in, tell RAIDframe
to rebuild it, and let it go on its merry way.
janc
response 329 of 547: May 21 19:08 UTC 2003

Probably would.  Actually, you don't even have to keep the layout in a file.
You can just copy it from one disk to another:

  disklabel sd0 | disklabel -R sd1 /dev/fd/0

That's how I built the current setup.
janc
response 330 of 547: May 21 19:08 UTC 2003

Bonnie croaked while doing some seeks.  Try it again with a smaller file to
see if that works better.
other
response 331 of 547: May 21 19:53 UTC 2003

I thought the IDE was mainly for a comprehensive backup of the boot 
partition plus storage for sources.
cross
response 332 of 547: May 21 19:55 UTC 2003

Yeah, you can do that, but if you also keep the disk label around in
another file, you can label the disk on any machine with a SCSI controller.
Does that matter?  I don't know.  It might be slightly more convenient.
It's just a nit, though; it's trivial to get a copy of the disklabel once
the machine's set up, and I doubt it would matter....
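
For concreteness, the keep-a-file variant might go something like this (the
file name here is made up):

```shell
disklabel sd0 > /root/sd0.label    # save the label once the machine is set up
disklabel -R sd2 /root/sd0.label   # later, write it onto a replacement disk
```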
cross
response 333 of 547: May 21 19:58 UTC 2003

Oh yeah....  That's an idea.  Put /usr/src and /usr/obj on the IDE
drive, and then you don't have to do anything hacky with linking them
to /var as in my latest proposal.  /var can be decreased accordingly,
and more space allocated to /u.
janc
response 334 of 547: May 21 20:09 UTC 2003

OK, with a file size of 2000M, I get results from Bonnie.  The validity of
these results is, however, questionable, since a lot of the file may have been
in memory instead of on disk.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU

raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9  1.6
scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2  0.8

We have two lines of results.  The first was using the raid 5 array of three
SCSI disks.  The second was on a single plain ordinary SCSI disk.

For each test we have the speed and the % of CPU used.

There are three output tests:

  Per Char   -  file written sequentially with 2 billion calls to putc()
  Block      -  file written with block writes
  Rewrite    -  each block read, changed and rewritten

There are two input tests

  Per Char   - 2 billion calls to getc()
  Block      - block reads

And a seek test

  Seeks      - four child processes each execute 4000 seeks and reads.  After
               10% of these they change and rewrite the block.

So, on writing, RAID was 5 to 6 times slower.  Notice that the supposedly
optimum block writes were actually slower than the character writes for the
RAID.  The SCSI was twice as fast as RAID on the rewrite test.

On read the RAID array was still slower than the plain disks on the Per
Char reads, but a bit faster on the block reads.  It was substantially slower
on the seeks.

Admitting that the benchmark is seriously questionable due to the small size
of the file relative to the large size of memory, this is not at all an
impressive result.

I reran the tests and got similar results.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU

raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9  1.6
raid 5   2000  8745  6.4  7654  1.3  5717  2.2 51345 63.5 64022 14.6 150.0  1.1

scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2  0.8
scsi     2000 54058 43.4 54618 14.1 10129  2.8 60552 71.0 60865 11.1 203.4  0.9

I suppose the main advantage in performance is in balancing load among multiple
spindles, but this would really only be noticeable if multiple processes were
reading/writing the disk at once.  With a single process, we aren't going to
gain much.  Only in the seek test are there multiple processes, and then only
four.
cross
response 335 of 547: May 21 23:48 UTC 2003

Are softupdates turned on on the raid filesystem?
janc
response 336 of 547: May 22 02:31 UTC 2003

No.  They are not even enabled in the kernel.  From what little I understand
of it, it improves performance only with respect to metadata updates -
updating inodes when files are created or destroyed.  That wouldn't affect
these benchmarks.  I don't get a clear feeling that it is super stable yet
either.
cross
response 337 of 547: May 22 03:29 UTC 2003

Every write and every read is also a metadata update (mtime and atime).
Soft updates are definitely stable at this point; they're enabled by
default in FreeBSD.  OpenBSD tends to be somewhat more conservative,
though.

Gads; security be damned.  Grex would've been better off with FreeBSD.
janc
response 338 of 547: May 22 05:55 UTC 2003

Well, I argued that.  I have the impression softupdates are more mature in
FreeBSD than OpenBSD.  It's not really clear though.
janc
response 339 of 547: May 22 06:30 UTC 2003

For the heck of it, I ran eight copies of the Bonnie benchmark simultaneously
on the RAID 5 partition.  Below, A through H were started simultaneously.
The last line is just one benchmark process running

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
A        2047  8087  5.9  5867  1.2   331  0.1  1647  2.0  1775  0.4  22.5 0.1
B        2047  1889  1.4  7770  1.4   545  0.2   890  0.9  1646  0.3  14.2 0.1
C        2047  1020  0.8  7038  1.2   417  0.1  1929  2.7  1578  0.3  19.9 0.2
D        2047  8647  6.3  7474  1.3   253  0.1  1905  2.4  4597  1.1  89.4 0.8
E        2047  3997  2.9  6946  1.2   215  0.1 23458 27.9 29250  6.4 155.5 1.4
F        2047  8314  6.2  7149  1.3   369  0.1  1333  1.6  1707  0.3  21.2 0.1
G        2047  8926  6.3  7899  1.4   512  0.2   865  1.1  1132  0.3  15.0 0.1
H        2047  4280  3.2  7861  1.3   458  0.1   954  1.2  1649  0.4  19.1 0.1

raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9 1.6

They didn't stay well synchronized - you can tell that process E continued
running long after the others had finished (process scheduling doesn't seem
to be very fair).  The write speeds didn't suffer too badly from the
competition, but the read times took a terrific beating - they are mostly
around 1/25 of the speed of one process.  Note that there were probably some
write processes still running while the read processes were going.
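
For the record, a run like that can be launched with a shell loop.  The -s,
-d and -m flags are standard Bonnie options; the scratch directory is an
assumption about where the raid filesystem is mounted:

```shell
# eight concurrent bonnie runs against the raid filesystem (path assumed)
for m in A B C D E F G H; do
    bonnie -s 2047 -d /raid/tmp -m "$m" > "/tmp/bonnie.$m" 2>&1 &
done
wait    # collect all eight before reading the results
```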

Here's a more sensible test, a comparison against the SCSI and IDE drives,
in non-RAID configuration, with just one process running:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2 0.8
ide      2000 27188 21.7 27038  6.9  9634  2.6 24889 29.9 25640  5.2  99.0 0.8

Seems the SCSI is about twice as fast on most benchmarks, and about the same
on the Rewrite test.
gull
response 340 of 547: May 22 13:07 UTC 2003

RAID 5 is always going to be slower than a single disk, especially using
software RAID.  There's more processing overhead, and you're doing a
third more reads/writes because of the parity.  Still, I'm surprised to
see it 5 times slower.  That doesn't seem very acceptable at all.
scott
response 341 of 547: May 22 13:10 UTC 2003

RAID would be nice, and if we're making such a huge jump in processing power
then I don't think the performance penalty (assuming it's only 2-to-1 or
something less) is an issue.
janc
response 342 of 547: May 22 13:50 UTC 2003

I'm beginning to suspect that some of these fast read times are
coming out of buffers.  The drastic crash in read speed when I ran 8 bonnies
could be because instead of trying to buffer one 2G file in 1.5G of memory, we
were trying to buffer a total of 16G of files in 1.5G of memory.  Some of
these really fast speeds (the ones around 50M/sec) are likely being done
largely out of cache.  This makes the results pretty meaningless.

Anyway, I ran three simultaneous bonnies on a plain SCSI.  I couldn't run
8 because I didn't have a 16 Gig partition.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
A        2047 15240 16.1 20747  5.0  2018  0.6  3882  4.5  9506  1.7 164.7 0.6
B        2047 16768 13.7 20491  5.3  3016  0.9  4543  5.3  5598  1.1  31.6 0.2
C        2047 16812 13.6 17945  4.6  2977  0.9  4145  4.9  4513  0.8  46.5 0.2

scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2 0.8

scsi/3   2000 17918 43.4 18035 14.1  3363  2.6 20108 70.9 20355 11.5  67.1 0.8

The last line is just the one-process SCSI values divided by three.  Notice
the write statistics for the three processes are all pretty close to one
third of the write statistics for a single process.  The reads are way lower.
Is this an artifact of buffering?  The seeks are a bit hard to tell, because
by that time the processes were pretty much out of synchronization.
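
That division is easy to sanity-check for the K/sec columns:

```shell
# one-process scsi K/sec figures from above, each divided by three (truncated)
echo "53754 54106 10090 60326 61067" |
awk '{ for (i = 1; i <= NF; i++) printf "%d%s", int($i / 3), (i < NF ? " " : "\n") }'
```

which prints 17918 18035 3363 20108 20355, matching the scsi/3 row.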

The degradation in read performance is similar in magnitude to what we saw
on the raid (keeping in mind that we only have 3 processes instead of 8). 
I think there must be a buffering thing going on here.  The write statistics
are much better for the RAID - most of the 8 processes wrote much faster than
1/8 of the single process.

Note that in both cases, the single processes read faster than they write,
while the multiple processes write faster than they read.  That's just weird.
aruba
response 343 of 547: May 22 13:59 UTC 2003

Jan - can you fool the OS into thinking Grex has less memory than it really
does?  Or tell it not to cache disk reads?
janc
response 344 of 547: May 22 14:31 UTC 2003

Re #341:  We are certainly taking a huge jump in processing power, but the
disk I/O performance improvement, while good, probably isn't as spectacular.
Disk speeds just haven't been growing as fast as processor speeds, and old
Grex's disks aren't nearly as old as its processor.  So the performance jump
in disk I/O from old Grex to new Grex might not be that huge.   (Maybe I
should run some benchmarks on old Grex to compare with - will everyone please
log off?).  I expect the new Grex will have memory to spare, cpu to spare,
disk space to spare, but maybe not disk bandwidth to spare (and certainly not
net bandwidth to spare).

I think the main benefits of RAID are:

  - Availability.  If a disk dies, the system can keep running.  Performance
    degrades, but it still works.  If you have a hot spare disk, it can
    be brought on line, replacing the dead disk, without interruption in
    service.

    I do not consider this very important to Grex.  We can afford short
    downtimes in the case of disaster.

  - Data Protection.  If a disk dies, the data on the drives is not lost.

    This is important to Grex.  However, it can be achieved other ways.
    We could do daily rsync's from /var, /bbs, /home, and /etc to the IDE
    drive (or even another machine).  You might copy certain critical files
    (/etc/passwd) more frequently.  This has a performance penalty, of course.
    In the case of a crash, your backup will not be fully up to date, so there
will be some data loss, but it should be tolerable.  In the case of
    accidental (or deliberate) deletion of data, this gives you a much better
safety net than RAID, so much so that we'll want to do at least some
    of this even if we have RAID.

  - Performance.  RAID can balance the load over the drives nicely.

    Yes, but so can ccd (pretty much equivalent to RAID 0).

So this doesn't really make a strong argument for RAID.  However, there is
a bit of a flaw in the above break-down.  These three aspects are not fully
separable.  Suppose we merge our three SCSI drives into one big virtual ccd
drive and partition it up.  Load balancing over the drives should be great.
Then one SCSI drive fails.  You just lost a third of your data, scattered
randomly all over the system.  You still have the other two thirds, but
doing anything with it is going to be a nightmare.  Effectively a single
drive failure cooks all your data, instead of 1/3 of your data.  I don't think
the performance improvement given by ccd or RAID 0 is worth the increased
risk of losing the whole system.

So I think the real alternative to RAID is what I originally proposed -
simple partitions, scattered across the drives in an ad hoc manner in hopes
of balancing the load across the spindles, with rsyncs to the IDE drive
for data protection.

I'm really starting to feel that might be the best choice.  The advantages
of RAID for Grex are faint enough so that they don't quite overwhelm the
KISS factor in my estimation.
janc
response 345 of 547: May 22 14:33 UTC 2003

Re 343:  probably - but I'm not sure how.  I thought of just creating a
RAMDISK and letting that eat up much of the memory (I could also run the
benchmark on a ramdisk, which might be interesting), but it looks like
you need to do a lot of kernel work to bring up a ramdisk, and I'm
insufficiently motivated.
cross
response 346 of 547: May 22 14:37 UTC 2003

One can lower the amount of memory the kernel will use for caching by 
mucking with the kernel.  It looks like, when caching is taken out of
the picture, performance between RAID and the straight SCSI disks is
more or less on par?
janc
response 347 of 547: May 22 14:48 UTC 2003

Hmmm...the faq (http://www.openbsd.org/faq/faq11.html) talks about the
BUFCACHEPERCENT kernel value.  It says the default is 5%.  I haven't touched
it, so if I'm reading this right, there should be 75M or less of disk cache.
Hmmm...Linux uses all free memory as disk cache.  A much nicer setup.

Well, if that's the case then I'm not sure what makes those benchmark numbers
so goofy.
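
If the cache theory is worth testing, that knob can be raised in the kernel
configuration before a rebuild (30 is purely an illustrative figure):

```shell
# kernel config fragment: let the buffer cache use 30% of RAM instead of 5%
option  BUFCACHEPERCENT=30
```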
gull
response 348 of 547: May 22 17:12 UTC 2003

Re #344: I think that's starting to make sense, yes.  Unless it turns
out the performance hit you're seeing is an artifact of your testing
method, we may be better off going with using the disks "straight". 
Getting only 20% of the potential performance of the disk subsystem in
exchange for easier recovery on the rare occasions when we have disks
fail doesn't seem like a good tradeoff.  I'm still having trouble
believing RAIDframe is *that* inefficient, though.
cross
response 349 of 547: May 22 18:20 UTC 2003

So am I; it seems unreasonably slow, and it looks vaguely like the
numbers start to converge when you have many processes working at
once, which is the normal mode of operation.  I'd be interested in
seeing what a test simulating a timesharing load would be like.
lk
response 350 of 547: May 23 03:37 UTC 2003

One simple way to "fool" the kernel into thinking that NextGrex has less
memory is... to remove all but one memory module. Guaranteed to work. (:

You might also want to test mirroring. Might be more efficient (less CPU
utilization for striping and no extra parity data) while offering both
availability and redundancy.  The "cost" here is 50% drive overhead.
The boot disk, with the system partitions (and /tmp or was that IDE?)
could be one disk while the other pair could be mirrored.
janc
response 351 of 547: May 23 13:49 UTC 2003

I'm reluctant to take the machine apart for such purposes.  Anyway, I'm a
software guy.

I certainly agree that we need better benchmarks, but I'm not sure how to
obtain them.  Anyone with better ideas is welcome to suggest them.  Those of
you with accounts on the system can probably run them yourselves, as the
relevant disk partitions are permitted 777.  We really want to get some sense
of how RAID would affect a realistic multi-user load.

I tried running a benchmark with a really small file, one where you should be
getting lots of use from cache.  Here's the 50 MB and 2000 MB results.  Explain
this, if you will:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
raid 5     50 24882 20.9 22641  3.5  3731  1.2  7555  9.6 64346 14.7 511.7 3.1
raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9 1.6

The small run has much faster output, and significantly faster seek times.
The block read is about as fast as the large file (suggesting that it is
mostly reading from buffer).  But what's going on with the per char read?

Note that the sequence of the tests is:

    Per Char Output
    Rewrite Output
    Block Output
    Per Char Input
    Block Input
    Seek

So it may be that the Per Char read was from disk, but left the entire file
in cache, so the block read was then very fast.  But why wouldn't it already
be in cache after the block output?  And why would the same speed be
achieved on the Block Read with the 2000M file, which can't have all been
in cache?

I don't think I know enough about how buffering and disk I/O work in OpenBSD
to really interpret this stuff.

- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss