cross
response 322 of 547: Mark Unseen   May 21 13:34 UTC 2003

Regarding #321; I concur.  A barebones recovery plan should be developed
in any event, regardless of whether RAID is used.
janc
response 323 of 547: Mark Unseen   May 21 15:34 UTC 2003

Thanks David.  I definitely value input on this question.

I built two new kernels.  The first is simple GENERIC minus a mess of stuff
we don't need - mostly device drivers for devices we haven't got.  The second
is the same, but turns RAID on (and does various stuff to make sure SCSI
drives don't get renumbered when one fails).

I also pushed the "maxusers" parameter from 32 to 64.  Maxusers isn't really
the maximum number of users.  It's a voodoo number that is used to estimate
sizes for all sorts of system parameters, which can be fine-tuned separately
by editing lower level definitions.  I saw various posts by people who had
set it higher than 64 and got a warning message about that.  One seemed to
have some crashes after that and thought it might be related.  However, one
of these guys got no response that is in the archive, and the other was only
told that he was an idiot.  (These OpenBSD mailing list archives are such a
valuable resource.)  So for the moment I thought I'd set it to 64.  It'll be
easy enough to fine tune it later if we have problems with that setting.
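For reference, the changes amount to just a few lines in the kernel config file (a sketch only - the exact option names should be checked against the stock GENERIC file under /usr/src/sys/arch/&lt;arch&gt;/conf):

```
# Sketch of the config deltas; verify names against GENERIC
maxusers 64                # voodoo sizing parameter, up from GENERIC's 32

# bsd.raid only:
option  RAID_AUTOCONFIG    # let RAIDframe auto-configure arrays at boot
pseudo-device raid 4       # RAIDframe disk driver
```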

The OpenBSD FAQ discourages building new kernels without a danged good reason,
threatening lack of technical support for problems with non-generic kernels.
However, since their technical support is laughable anyway, and Marcus is
guaranteed to have changes to make to the kernel in any case, I decided we
might as well get started, even if we don't end up using RAID.

The stripped down GREX kernel is about half the size of the GENERIC kernel,
which is a plus, if not a great big one:

  -rw-r--r--  1 root  wheel  4579691 May 21 07:01 /bsd.generic
  -rwxr-xr-x  1 root  wheel  2719734 May 21 07:03 /bsd.new
  -rwxr-xr-x  1 root  wheel  3133519 May 21 06:59 /bsd.raid

It is currently running on the bsd.raid kernel, and that is the default.
I haven't, however, set up any RAID array yet.

I've also now got a draft document on kernel building.
janc
response 324 of 547: Mark Unseen   May 21 17:19 UTC 2003

OK, I've created a RAID array on new Grex - just for experimental purposes
at this point.  First, I sliced up the three scsi disks into two partitions
each, each disk identically:

 sd0a:  20479825 blocks  = ~10 Gig
 sd0d:  15361127 blocks  =  ~7 Gig

The sd0a, sd1a, and sd2a partitions are clustered into a RAID5 array, with
just one partition, /dev/raid1a, on it (it can be sliced into smaller
partitions).  This is mounted as /raid.  The sd0d, sd1d, and sd2d partitions
are mounted as /sd0, /sd1 and /sd2 respectively.  My idea was that if we
want to do any benchmarks, this lets us access the same disks, with or without
raid.  All four partitions are rw-all so anyone with an account can create
stuff there and look at the stats.
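A RAIDframe config file for an array like this would look roughly as follows (stripe-unit and queue numbers here are illustrative rather than what's actually on the machine - see raidctl(8) for the format):

```
# Rough sketch of a RAIDframe config (e.g. raid1.conf) for three
# components in RAID 5; numbers are illustrative
START array
# numRow numCol numSpare
1 3 0
START disks
/dev/sd0a
/dev/sd1a
/dev/sd2a
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
64 1 1 5
START queue
fifo 100
```

The array then gets configured with something like "raidctl -C raid1.conf raid1", given a serial number with "raidctl -I", parity initialized with "raidctl -i", and finally newfs'ed.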

df looks like this:

  Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
  /dev/sd0d     7438613        1  7066682     0%    /sd0
  /dev/sd1d     7438613        1  7066682     0%    /sd1
  /dev/sd2d     7438613        1  7066682     0%    /sd2
  /dev/raid1a  19852909        1 18860263     0%    /raid

Note that the available space (18.8 Gigs) is about 61% of the disk we put
into this (30 Gigs), most of the rest being used for parity, some of the
rest being eaten by filesystem overhead of various sorts.
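The arithmetic works out like this (numbers taken from the disklabel and df output above; the split between parity and filesystem overhead is the point being illustrated):

```python
# Back-of-the-envelope capacity math for the 3-disk RAID 5 array.
disks = 3
per_disk_kb = 20479825 // 2            # sd0a is 20479825 512-byte blocks
raw_kb = disks * per_disk_kb           # ~30 Gig of raw disk
parity_free_kb = (disks - 1) * per_disk_kb  # RAID 5 gives one disk to parity
avail_kb = 18860263                    # "Avail" for /dev/raid1a in df above

print(parity_free_kb / raw_kb)   # 2/3 of the raw space survives parity
print(avail_kb / raw_kb)         # ~0.61 after filesystem overhead too
```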
gull
response 325 of 547: Mark Unseen   May 21 17:33 UTC 2003

Yeah, from what I've seen a lot of OpenBSDers are a bit elitist and
don't suffer newbies gladly.  It's an unfortunate attitude.
janc
response 326 of 547: Mark Unseen   May 21 17:37 UTC 2003

Hmmm...I'm trying to run the bonnie benchmark
(http://www.textuality.com/bonnie) on the raid disk, but I'm not sure it
will work.  Bonnie wants me to use a file size several times larger than main
memory.  Main memory is 1.5 Gig, so I told it to use 4 times that: 6144 Meg.
But the first thing it said is: 
   File './Bonnie.28521', size: -2147483648
Uh-oh.  Someone may be using signed longs for the file size.  If that's the
case, then the biggest file size I can use is probably around 2048 Meg, which
isn't several times the size of our memory.  Well, I'll let it run and see
what happens.
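The negative number is consistent with the signed-long theory - 6144 Meg wraps around in a 32-bit signed integer to exactly the value Bonnie printed:

```python
# Why a "6144 Meg" file printed as a negative size: 6144 MB overflows
# a signed 32-bit long and wraps around.
size = 6144 * 1024 * 1024              # 6442450944 bytes requested
low32 = size & 0xFFFFFFFF              # what fits in 32 bits
signed = low32 - 2**32 if low32 >= 2**31 else low32
print(signed)                          # -2147483648, matching Bonnie's output

# Largest safe file size if sizes really are signed 32-bit longs:
print((2**31 - 1) // (1024 * 1024))    # 2047 Meg
```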
janc
response 327 of 547: Mark Unseen   May 21 18:00 UTC 2003

So, if we went with RAID, what would we do?

On the disks we'll have partitions

   sd0a   - pretty tiny.  A place to store kernels.  We'll boot from here.
   sd1a   - A copy of /dev/sd0a, so we can boot if sd0 dies
    
   sd0b   - swap partitions, one Gig each.  You can put swap on raid, but
   sd1b     it doesn't appear to be a great idea.  We'll trust OpenBSD to
   sd2b     balance swap load over the three spindles.

   sd0d   - the remainders of the disks, about 16 Gig each.
   sd1d
   sd2d

Now, sd0d, sd1d and sd2d will be clustered together into a RAID 5 array, called
raid0.  To all intents and purposes, this appears as a single big disk.  It
should come out at about 29 Gig, a more than adequate amount of space for all
of Grex's needs for a while.  Raid0 gets partitioned into all the various
partitions we need, with root on raid0a, usr on raid0d and so on.

The 80 Gig IDE disk doesn't participate in this.  We could put the boot
partition on it, but I'd want copies on two disks anyway, so we'll need at
least some non-raid partitions on the SCSI disks regardless.  Let's leave
everything critical off the IDE.
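Under that plan the mount table would end up looking something like this (purely illustrative - partition letters and which filesystems get their own partitions are still to be worked out):

```
# Hypothetical /etc/fstab under the RAID plan; device names illustrative
/dev/raid0a  /     ffs   rw              1 1
/dev/raid0d  /usr  ffs   rw,nodev        1 2
/dev/raid0e  /var  ffs   rw,nodev        1 2
/dev/raid0f  /u    ffs   rw,nodev,nosuid 1 2
/dev/sd0b    none  swap  sw              0 0
/dev/sd1b    none  swap  sw              0 0
/dev/sd2b    none  swap  sw              0 0
```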
cross
response 328 of 547: Mark Unseen   May 21 18:38 UTC 2003

Why not make sd2a a copy of sd0a as well?  It wouldn't hurt anything,
and might help, since each disk would be exactly like every other disk
in terms of how the partitions are laid out.  That makes partitioning
easy; you can keep a copy of the disklabel for one of the disks around
in a file somewhere, and just write it to a new disk with the disklabel
command if necessary.  Then, just plop the new disk in, tell RAIDframe
to rebuild it, and let it go on its merry way.
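The replacement procedure would then be something like the following (a sketch only - device names are illustrative, and the raidctl flags should be checked against the manpage):

```
# Replacing a failed sd2, assuming a saved label in /root/grex.disklabel
disklabel -R sd2 /root/grex.disklabel   # restore the common partition layout
raidctl -R /dev/sd2d raid0              # rebuild onto the new component
```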
janc
response 329 of 547: Mark Unseen   May 21 19:08 UTC 2003

Probably would.  Actually, you don't even have to keep the layout in a file.
You can just copy it from one disk to another:

  disklabel sd0 | disklabel -R sd1 /dev/fd/0

That's how I built the current setup.
janc
response 330 of 547: Mark Unseen   May 21 19:08 UTC 2003

Bonnie croaked while doing some seeks.  I'll try it again with a smaller
file to see if that works better.
other
response 331 of 547: Mark Unseen   May 21 19:53 UTC 2003

I thought the IDE was mainly for a comprehensive backup of the boot 
partition plus storage for sources.
cross
response 332 of 547: Mark Unseen   May 21 19:55 UTC 2003

Yeah, you can do that, but if you also keep the disk label around in
another file, you can label the disk on any machine with a SCSI controller.
Does that matter?  I don't know.  It might be slightly more convenient.
It's just a nit, though; it's trivial to get a copy of the disklabel once
the machine's set up, and I doubt it would matter....
cross
response 333 of 547: Mark Unseen   May 21 19:58 UTC 2003

Oh yeah....  That's an idea.  Put /usr/src and /usr/obj on the IDE
drive, and then you don't have to do anything hacky with linking them
to /var as in my latest proposal.  /var can be decreased accordingly,
and more space allocated to /u.
janc
response 334 of 547: Mark Unseen   May 21 20:09 UTC 2003

OK, with a file size of 2000M, I get results from Bonnie.  The validity of
these results is, however, questionable, since a lot of the file may have been
in memory instead of on disk.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU

raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9  1.6
scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2  0.8

We have two lines of results.  The first was using the raid 5 array of three
SCSI disks.  The second was on a single plain ordinary SCSI disk.

For each test we have the speed and the % of CPU used.

There are three output tests:

  Per Char   -  file written sequentially with 2 billion calls to putc()
  Block      -  file written with block writes
  Rewrite    -  each block read, changed and rewritten

There are two input tests

  Per Char   - 2 billion calls to getc()
  Block      - block reads

And a seek test

  Seeks      - four child processes each execute 4000 seeks and reads.  After
               10% of these they change and rewrite the block.

So, on writing, RAID was 5 to 6 times slower.  Notice that the supposedly
optimum block writes were actually slower than the character writes for the
RAID.  The SCSI was twice as fast as RAID on the rewrite test.

On read the RAID array was still slower than the plain disks on the Per
Char reads, but a bit faster on the block reads.  It was substantially slower
on the seeks.

Admitting that the benchmark is seriously questionable due to the small size
of the file relative to the large size of memory, this is not at all an
impressive result.

I reran the tests and got similar results.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU

raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9  1.6
raid 5   2000  8745  6.4  7654  1.3  5717  2.2 51345 63.5 64022 14.6 150.0  1.1

scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2  0.8
scsi     2000 54058 43.4 54618 14.1 10129  2.8 60552 71.0 60865 11.1 203.4  0.9

I suppose the main advantage in performance is in balancing load among multiple
spindles, but this would really only be noticeable if multiple processes were
reading/writing the disk at once.  With a single process, we aren't going to
gain much.  Only in the seek test are there multiple processes, and then only
four.
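Just to double check myself, here are the SCSI-to-RAID ratios computed straight from the table above (numbers copied by hand, so this is a sanity check on the prose, not new data):

```python
# SCSI vs. RAID 5 throughput ratios, from the single-run Bonnie table.
scsi = {"putc": 53754, "block write": 54106, "rewrite": 10090,
        "getc": 60326, "block read": 61067, "seeks": 201.2}
raid = {"putc": 9520, "block write": 7974, "rewrite": 5706,
        "getc": 50932, "block read": 63815, "seeks": 147.9}
for test in scsi:
    print(test, round(scsi[test] / raid[test], 2))
# putc and block write come out 5-6x in SCSI's favor; block read is the
# one test where the RAID array actually wins (ratio just under 1).
```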
cross
response 335 of 547: Mark Unseen   May 21 23:48 UTC 2003

Are softupdates turned on on the raid filesystem?
janc
response 336 of 547: Mark Unseen   May 22 02:31 UTC 2003

No.  They are not even enabled in the kernel.  From what little I understand
of it, it improves performance only with respect to metadata updates -
updating inodes when files are created or destroyed.  That wouldn't affect
these benchmarks.  I don't get a clear feeling that it is super stable yet
either.
cross
response 337 of 547: Mark Unseen   May 22 03:29 UTC 2003

Every write and every read is also a metadata update (mtime and atime).
Soft updates are definitely stable at this point; they're enabled by
default in FreeBSD.  OpenBSD tends to be somewhat more conservative,
though.

Gads; security be damned.  Grex would've been better off with FreeBSD.
janc
response 338 of 547: Mark Unseen   May 22 05:55 UTC 2003

Well, I argued that.  I have the impression softupdates are more mature in
FreeBSD than OpenBSD.  It's not really clear though.
janc
response 339 of 547: Mark Unseen   May 22 06:30 UTC 2003

For the heck of it, I ran eight copies of the Bonnie benchmark simultaneously
on the RAID 5 partition.  Below, A through H were all started at the same
time.  The last line is just one benchmark process running alone.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
A        2047  8087  5.9  5867  1.2   331  0.1  1647  2.0  1775  0.4  22.5 0.1
B        2047  1889  1.4  7770  1.4   545  0.2   890  0.9  1646  0.3  14.2 0.1
C        2047  1020  0.8  7038  1.2   417  0.1  1929  2.7  1578  0.3  19.9 0.2
D        2047  8647  6.3  7474  1.3   253  0.1  1905  2.4  4597  1.1  89.4 0.8
E        2047  3997  2.9  6946  1.2   215  0.1 23458 27.9 29250  6.4 155.5 1.4
F        2047  8314  6.2  7149  1.3   369  0.1  1333  1.6  1707  0.3  21.2 0.1
G        2047  8926  6.3  7899  1.4   512  0.2   865  1.1  1132  0.3  15.0 0.1
H        2047  4280  3.2  7861  1.3   458  0.1   954  1.2  1649  0.4  19.1 0.1

raid 5   2000  9520  6.8  7974  1.3  5706  2.0 50932 62.5 63815 13.0 147.9 1.6

They didn't stay well synchronized - you can tell that process E continued
running long after the others had finished (process scheduling doesn't seem
to be very fair).  The write speeds didn't suffer too badly from the
competition, but the read times took a terrific beating - they are mostly
around 1/25 of the speed of one process.  Note that there were probably some
write processes still running while the read processes were going.

Here's a more sensible test, a comparison against the SCSI and IDE drives,
in non-RAID configuration, with just one process running:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2 0.8
ide      2000 27188 21.7 27038  6.9  9634  2.6 24889 29.9 25640  5.2  99.0 0.8

Seems the SCSI is about twice as fast on most benchmarks, and about the same
on the Rewrite test.
gull
response 340 of 547: Mark Unseen   May 22 13:07 UTC 2003

RAID 5 is always going to be slower than a single disk, especially using
software RAID.  There's more processing overhead, and you're doing a
third more reads/writes because of the parity.  Still, I'm surprised to
see it 5 times slower.  That doesn't seem very acceptable at all.
scott
response 341 of 547: Mark Unseen   May 22 13:10 UTC 2003

RAID would be nice, and if we're making such a huge jump in processing power
then I don't think the performance penalty (assuming it's only 2-1 or
something less) is an issue.
janc
response 342 of 547: Mark Unseen   May 22 13:50 UTC 2003

I'm beginning to suspect that some of these fast read times are coming out
of buffers.  The drastic crash in read speed when I ran 8 bonnies could be
because instead of trying to buffer one 2G file in 1.5G of memory, we were
trying to buffer a total of 16G of files in 1.5G of memory.  Some of these
really fast speeds (the ones around 50M/sec) are likely being done
largely out of cache.  This makes the results pretty meaningless.

Anyway, I ran three simultaneous bonnies on a plain SCSI disk.  I couldn't
run 8 because I didn't have a 16 Gig partition.

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
A        2047 15240 16.1 20747  5.0  2018  0.6  3882  4.5  9506  1.7 164.7 0.6
B        2047 16768 13.7 20491  5.3  3016  0.9  4543  5.3  5598  1.1  31.6 0.2
C        2047 16812 13.6 17945  4.6  2977  0.9  4145  4.9  4513  0.8  46.5 0.2

scsi     2000 53754 43.4 54106 14.1 10090  2.6 60326 70.9 61067 11.5 201.2 0.8

scsi/3   2000 17918 43.4 18035 14.1  3363  2.6 20108 70.9 20355 11.5  67.1 0.8

The last line is just the one-process SCSI values divided by three.  Notice
the write statistics for the three processes are all pretty close to one
third of the write statistics for a single process.  The reads are way lower.
Is this an artifact of buffering?  The seeks are a bit hard to tell, because
by that time the processes were pretty much out of synchronization.

The degradation in read performance is similar in magnitude to what we saw
on the raid (keeping in mind that we only have 3 processes instead of 8). 
I think there must be a buffering thing going on here.  The write statistics
are much better for the RAID - most of the 8 processes wrote much faster
than 1/8 of the single-process speed.

Note that in both cases, the single processes read faster than they write,
while the multiple processes write faster than they read.  That's just weird.
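The scsi/3 comparison above can be computed mechanically (again, numbers copied from the tables, so this just sanity-checks my arithmetic):

```python
# One-process SCSI throughput divided by three, compared with what the
# three simultaneous bonnies actually got.
single = {"putc": 53754, "block write": 54106, "block read": 61067}
measured = {"putc": [15240, 16768, 16812],
            "block write": [20747, 20491, 17945],
            "block read": [9506, 5598, 4513]}
for test, fair in ((t, v / 3) for t, v in single.items()):
    ratios = [round(m / fair, 2) for m in measured[test]]
    print(test, round(fair), ratios)  # writes near 1.0; reads far below
```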
aruba
response 343 of 547: Mark Unseen   May 22 13:59 UTC 2003

Jan - can you fool the OS into thinking Grex has less memory than it really
does?  Or tell it not to cache disk reads?
janc
response 344 of 547: Mark Unseen   May 22 14:31 UTC 2003

Re #341:  We are certainly taking a huge jump in processing power, but the
disk I/O performance improvement, while good, probably isn't as spectacular.
Disk speeds just haven't been growing as fast as processor speeds, and old
Grex's disks aren't nearly as old as its processor.  So the performance jump
in disk I/O from old Grex to new Grex might not be that huge.   (Maybe I
should run some benchmarks on old Grex to compare with - will everyone please
log off?).  I expect the new Grex will have memory to spare, cpu to spare,
disk space to spare, but maybe not disk bandwidth to spare (and certainly not
net bandwidth to spare).

I think the main benefits of RAID are:

  - Availability.  If a disk dies, the system can keep running.  Performance
    degrades, but it still works.  If you have a hot spare disk, it can
    be brought on line, replacing the dead disk, without interruption in
    service.

    I do not consider this very important to Grex.  We can afford short
    downtimes in the case of disaster.

  - Data Protection.  If a disk dies, the data on the drives is not lost.

    This is important to Grex.  However, it can be achieved other ways.
    We could do daily rsync's from /var, /bbs, /home, and /etc to the IDE
    drive (or even another machine).  You might copy certain critical files
    (/etc/passwd) more frequently.  This has a performance penalty, of course.
    In the case of a crash, your backup will not be fully up to date, so there
    will be some data loss, but it should be tolerable.  In the case of
    accidental (or deliberate) deletion of data, this gives you a much better
    safety net than RAID, so much so that we'll want to do at least some
    of this even if we have RAID.

  - Performance.  RAID can balance the load over the drives nicely.

    Yes, but so can ccd (pretty much equivalent to RAID 0).

So this doesn't really make a strong argument for RAID.  However, there is
a bit of a flaw in the above break-down.  These three aspects are not fully
separable.  Suppose we merge our three SCSI drives into one big virtual ccd
drive and partition it up.  Load balancing over the drives should be great.
Then one SCSI drive fails.  You just lost a third of your data, scattered
randomly all over the system.  You still have the other two thirds, but
doing anything with it is going to be a nightmare.  Effectively a single
drive failure cooks all your data, instead of 1/3 of your data.  I don't think
the performance improvement given by ccd or RAID 0 is worth the increased
risk of losing the whole system.

So I think the real alternative to RAID is what I originally proposed -
simple partitions, scattered across the drives in an ad hoc manner in hopes
of balancing the load across the spindles, with rsyncs to the IDE drive
for data protection.

I'm really starting to feel that might be the best choice.  The advantages
of RAID for Grex are faint enough so that they don't quite overwhelm the
KISS factor in my estimation.
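The rsync fallback might amount to nothing more than a few cron entries - something like the following, where the times, paths, and the /ide mount point are all purely illustrative:

```
# Hypothetical root crontab for the non-RAID plan; /ide is an assumed
# mount point for the IDE drive
30 4 * * *  rsync -a --delete /var/  /ide/backup/var/
35 4 * * *  rsync -a --delete /bbs/  /ide/backup/bbs/
40 4 * * *  rsync -a --delete /home/ /ide/backup/home/
45 4 * * *  rsync -a --delete /etc/  /ide/backup/etc/
0  * * * *  cp -p /etc/passwd /ide/backup/etc/passwd
```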
janc
response 345 of 547: Mark Unseen   May 22 14:33 UTC 2003

Re 343:  probably - but I'm not sure how.  I thought of just creating a
RAMDISK and letting that eat up much of the memory (I could also run the
benchmark on a ramdisk, which might be interesting), but it looks like
you need to do a lot of kernel work to bring up a ramdisk, and I'm
insufficiently motivated.
cross
response 346 of 547: Mark Unseen   May 22 14:37 UTC 2003

One can lower the amount of memory the kernel will use for caching by 
mucking with the kernel.  It looks like, when caching is taken out of
the picture, performance between RAID and the straight SCSI disks is
more or less on par?
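If memory serves, the knob is a kernel config option along these lines, though the name and its default should be verified against the kernel sources before relying on it:

```
# Hypothetical kernel config line to shrink the buffer cache so reads
# can't be satisfied from memory (verify against the sources)
option BUFCACHEPERCENT=5
```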
 

- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss