|
|
response 310 of 547:
|
May 20 22:21 UTC 2003 |
So that you don't have to remember to turn it off the next time you
upgrade the system. :-) In general, it's more of a hassle to turn
off the setuid bit if it doesn't do anything than to ignore it.
|
lk
|
|
response 311 of 547:
|
May 20 22:25 UTC 2003 |
Dan, Re#298, 293, and others: Dreams can be like that. You never really
know who said what, if it was real... but hey, it worked.
Nonetheless, since you've provided so much other helpful information and
I have not, I'm going to claim credit for this "fix". After all, you
phrased it as a question. I said do it! (:
|
cross
|
|
response 312 of 547:
|
May 20 22:37 UTC 2003 |
Another thing.... Grex might have turned off ping to avoid the problem of
a malicious user using the `flood ping' -f option against another host.
This mode sends packets to a remote host as fast as it can; effectively
clogging the network link between the two. On grex's slow connection,
this could clearly be a problem.
However, OpenBSD's version of ping checks that the real user ID is 0 (i.e.,
you're root) before allowing you to use the -f option for flood pinging.
Given that any program that wants to create an ICMP socket must be running
as root, and that the standard ping doesn't let joe user flood ping anymore,
perhaps it'd be acceptable to stop restricting access to ping. Still,
someone might be able to DoS grex by sending a ping request to some big
broadcast address, so maybe it's a good idea to keep restricting it.
|
cross
|
|
response 313 of 547:
|
May 20 22:38 UTC 2003 |
Hrmph, Leeron! :-)
|
janc
|
|
response 314 of 547:
|
May 21 00:19 UTC 2003 |
I have no intention of "remembering" to turn off suid bits. I'm for
documenting it, in this case in the form of a script that does it.
I'd turn off all the suid-root bits that don't need to be on (or leave them
on a nosuid partition where the suid-bit doesn't matter). It's hard to
imagine a security hole turning up in 'ping', but anything is possible.
I'm much less inclined to be aggressive about the sgid scripts.
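A minimal sketch of what such a documented script could look like; the
keep-list paths in the example call are assumptions, not Grex's actual
policy:

```shell
# Sketch: strip the setuid bit from everything under a tree except
# an explicit keep-list.  Run as root on the real system.
strip_suid() {   # $1 = root of tree, $2 = space-separated keep-list
    find "$1" -xdev -type f -perm -4000 | while read -r f; do
        case " $2 " in
            *" $f "*) ;;            # on the keep-list: leave the bit alone
            *) chmod u-s "$f" ;;    # otherwise drop the setuid bit
        esac
    done
}
# hypothetical usage: strip_suid / "/sbin/ping /sbin/ping6"
```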
|
janc
|
|
response 315 of 547:
|
May 21 00:29 UTC 2003 |
I'm not sure you can effectively use different RAID strategies on
different partitions without having different disk sets for them, but I'm
still thinking different RAID strategies make sense for different partitions.
I think /usr is another example where data redundancy seems of less value.
If you lose /usr, and you restore it from a month-old backup, you are probably
fine. Plain striping seems perfectly adequate for partitions like that.
Where RAID 5 pays off mostly is in places like /var, /bbs, /home.
|
cross
|
|
response 316 of 547:
|
May 21 01:22 UTC 2003 |
Regarding #314; Well, if you keep / as the only partition that honors
the suid bit, then you only have to change permissions on two binaries:
ping and ping6 (I still say ignore shutdown, since only users in group
operator can run it, anyway).
Regarding #315; The thing is that if you lose /usr, the system is
unusable; similarly with /usr/local, /, etc. RAID isn't just about
data security, it's also about availability.
|
janc
|
|
response 317 of 547:
|
May 21 03:05 UTC 2003 |
Right, but that isn't Grex's highest priority. We aren't amazon.com, which
can't be offline for a few days without making headlines. Heck, we currently
shut down for backups. If a disk melts down, taking a few days to come back
up is no disaster, as long as we can do it without loss of data.
If I had been at the last board meeting, I'd have argued against the third
SCSI disk. Grex doesn't need that much disk in the near future. But we've
got the disk, so we might as well use it. I think the best use may be to do
a RAID setup and win a bit better performance and a bit more data security.
Right now I think the strongest argument against using RAID on Grex is the
KISS argument. RAID certainly has benefits, but it adds complexity, and extra
complexity is always a minus. Using RAID means one more potentially buggy
piece of software in a critical function. It means one more complex subsystem
staff members need to understand, administer, and reinstall on every upgrade.
I think a sound argument could be made that the benefits aren't worth the
complexity. Skip RAID. Divide the partitions among the disks and hope the
loads balance out approximately. rsync critical partitions to the IDE disk
frequently. Remember to do backups. We don't lose much by taking that
easier path, and it is significantly simpler to install and administer.
You could do ccd on some partitions, if you want the same performance benefits
(slightly more even) at a lower complexity level.
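The rsync-to-IDE idea could be as simple as a cron job along these lines;
the /ide mount point and the partition list are assumptions for
illustration:

```shell
# Hypothetical nightly job: mirror the critical partitions onto the
# 80 Gig IDE disk, assumed mounted at /ide.
for part in /var /bbs /home; do
    rsync -aHx --delete "$part/" "/ide$part/"
done
```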
|
janc
|
|
response 318 of 547:
|
May 21 03:07 UTC 2003 |
All reports about problems with multiple drives on our SCSI controller seemed
to be about really frequent ones. I've had all three drives busy reading and
writing for all they are worth for a day now and have seen no problems at all.
So I think we can probably consider that problem solved. I'll let them grind
for a bit longer though.
|
cross
|
|
response 319 of 547:
|
May 21 05:12 UTC 2003 |
Regarding #317; Well, to me, splitting up partitions is more complex.
Maybe I'm smoking my hair, but it seems a lot simpler conceptually to
think of a RAID as one giant partition that you can chunk up as you
like, and the performance issues and load balancing are yours free.
You get some modicum of resistance to failures as a side benefit.
As for reliability.... RAIDframe has been in OpenBSD for several
years now. It seems just as solid as FFS itself or even soft updates.
Could it go wrong? Yeah, but there could also be bugs lurking in FFS.
Complexity of configuration is pretty simple. One or two configuration
files, and you're basically good to go. The only really annoying thing
is that you can't directly boot from it. But, at the end of the day,
I'm not on grex staff and am not charged with keeping it running.
It seems simpler to me, and RAID-5 everywhere seems to fit grex like
a glove (especially if it's planned on a few partitions already), but
that's just me.
|
janc
|
|
response 320 of 547:
|
May 21 12:57 UTC 2003 |
To some degree I'm arguing all sides of the question to make up for the lack
of people arguing. But the large number of Grex staff who have no opinion
on RAID is a bit worrisome.
Administrative complexity hits at three points - first, right now - decide
which RAID setup to use, and implementing it.
Second, on each system upgrade, when we need to reinstall the kernel
customizations and config files. Mostly we can document this and make
a step-by-step procedure that most anyone can follow.
Third, when a disk has a problem, or when we want to change the disk
configuration. RAID can help with problems like this, but only if you know
what you are doing. Doing the wrong thing can hose your data.
In a volunteer run system, the level of knowledge that may be on hand on
the day when a disk dies is unpredictable. It may make sense to keep things
simple so lots of people feel like they can help. This argument applies
equally to Kerberos. Both confer modest benefits that I'm not sure we need,
at the cost of complexity that makes the size of the hump you have to get
over to become an effective system administrator substantially larger.
I fear they will reduce the size of the pool of potential system
administrators.
Or maybe they make the system cooler and more interesting, thus attracting
more potential system administrators.
I think I may experiment with setting up RAID on Grex2003, just to get a better
feeling for the complexity.
|
gull
|
|
response 321 of 547:
|
May 21 13:09 UTC 2003 |
I'm in favor of RAID because I think it has the potential to *reduce*
the amount of staff time needed for recovery if a disk fails. Assuming
you don't have a multiple failure, recovery is reduced to a few steps,
presumably simple ones, although I'm not familiar with RAIDframe
specifically. Generally there's some way to tell the RAID subsystem
you're going to offline a disk (this may have been done automatically if
the disk failed), then you'd shut down the system, swap the disks, then
boot and tell the RAID subsystem to rebuild the failed disk. During all
of this except shutdown the RAID array is generally still usable, just
in a "degraded" mode. (It will be slower.) There are some RAID systems
where that isn't true, but I'd hope RAIDframe would have implemented
online recovery.
I think, if we have time, the concerns about complexity could be
addressed by developing a step-by-step disk failure recovery procedure
that any staff member could follow. It shouldn't really be any more
complex than restoring from a backup, just different.
If you're going to use RAID, I think it's best to look at the RAID array
as one big disk, and not try to spread things out with different RAID
strategies for different partitions. That seems unnecessarily complex
to me.
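With RAIDframe's raidctl, the swap-and-rebuild sequence sketched above
would look roughly like this; the device names are assumptions, and
raidctl(8) should be checked before trusting any of it:

```shell
raidctl -s raid0              # check status; the dead component shows as failed
# ...shut down, swap in the new disk, label it like the others, boot...
raidctl -R /dev/sd1d raid0    # reconstruct in place onto the replacement
raidctl -s raid0              # watch reconstruction progress
```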
|
cross
|
|
response 322 of 547:
|
May 21 13:34 UTC 2003 |
Regarding #321; I concur. A barebones recovery plan should be developed
in any event, regardless of whether RAID is used.
|
janc
|
|
response 323 of 547:
|
May 21 15:34 UTC 2003 |
Thanks David. I definitely value input on this question.
I built two new kernels. The first is simply GENERIC minus a mess of stuff
we don't need - mostly device drivers for devices we haven't got. The second
is the same, but turns RAID on (and does various stuff to make sure SCSI
drives don't get renumbered when one fails).
I also pushed the "maxusers" parameter from 32 to 64. Maxusers isn't really
the maximum number of users. It's a voodoo number that is used to estimate
sizes for all sorts of system parameters, which can be fine-tuned separately
by editing lower level definitions. I saw various posts by people who had
set it higher than 64 and got a warning message about that. One seemed to
have some crashes after that and thought it might be related. However, one
of these guys got no response that is in the archive, and the other was only
told that he was an idiot. (These OpenBSD mailing list archives are such a
valuable resource.) So for the moment I thought I'd set it to 64. It'll be
easy enough to fine tune it later if we have problems with that setting.
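In the copied kernel config file the change is just one line; the path
below is the usual place for i386 configs, but treat it as an example:

```
# in e.g. /usr/src/sys/arch/i386/conf/GREX:
maxusers 64        # GENERIC ships with "maxusers 32"
```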
The OpenBSD FAQ discourages building new kernels without a danged good reason,
threatening lack of technical support for problems with non-generic kernels.
However, since their technical support is laughable anyway and Marcus is
guaranteed to have changes to make to the kernel, I decided we might
as well get started, even if we don't end up using RAID.
The stripped down GREX kernel is about half the size of the GENERIC kernel,
which is a plus, if not a great big one:
-rw-r--r-- 1 root wheel 4579691 May 21 07:01 /bsd.generic
-rwxr-xr-x 1 root wheel 2719734 May 21 07:03 /bsd.new
-rwxr-xr-x 1 root wheel 3133519 May 21 06:59 /bsd.raid
It is currently running on the bsd.raid kernel, and that is the default.
I haven't, however, set up any RAID array yet.
I've also now got a draft document on kernel building.
|
janc
|
|
response 324 of 547:
|
May 21 17:19 UTC 2003 |
OK, I've created a RAID array on new Grex - just for experimental purposes
at this point. First, I sliced up the three scsi disks into two partitions
each, each disk identically:
sd0a: 20479825 blocks = ~10 Gig
sd0d: 15361127 blocks = ~7 Gig
The sd0a, sd1a, and sd2a partitions are clustered into a RAID5 array, with
just one partition, /dev/raid1a, on it (it can be sliced into smaller
partitions). This is mounted as /raid. The sd0d, sd1d, and sd2d partitions
are mounted as /sd0, /sd1 and /sd2 respectively. My idea was that if we
want to do any benchmarks, this lets us access the same disks, with or without
raid. All four partitions are rw-all so anyone with an account can create
stuff there and look at the stats.
df looks like this:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/sd0d 7438613 1 7066682 0% /sd0
/dev/sd1d 7438613 1 7066682 0% /sd1
/dev/sd2d 7438613 1 7066682 0% /sd2
/dev/raid1a 19852909 1 18860263 0% /raid
Note that the available space (18.8 Gigs) is about 61% of the disk we put
into this (30 Gigs), most of the rest being used for parity, some of the
rest being eaten by filesystem overhead of various sorts.
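The arithmetic checks out from the component size above: a RAID 5 array
over N disks keeps N-1 disks' worth of data, with one disk's worth going
to parity. A quick back-of-the-envelope:

```shell
# sd0a/sd1a/sd2a component size, in 512-byte blocks (from the post above)
blocks=20479825
usable=$(( blocks * (3 - 1) ))    # RAID 5: 2 of the 3 disks hold data
echo "$(( usable / 2 / 1024 / 1024 )) GiB usable"   # blocks -> GiB
```

That gives about 19 GiB before filesystem overhead, which matches the df
output.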
|
gull
|
|
response 325 of 547:
|
May 21 17:33 UTC 2003 |
Yeah, from what I've seen a lot of OpenBSDers are a bit elitist and
don't suffer newbies gladly. It's an unfortunate attitude.
|
janc
|
|
response 326 of 547:
|
May 21 17:37 UTC 2003 |
Hmmm...I'm trying to run the bonnie benchmark
(http://www.textuality.com/bonnie) on the raid disk, but I'm not sure it
will work. Bonnie wants me to use a file size several times larger than main
memory. Main memory is 1.5 Gig, so I told it to use 4 times that: 6144 Meg.
But the first thing it said is:
File './Bonnie.28521', size: -2147483648
Uh-oh. Someone may be using signed longs for the file size. If that's the
case, then the biggest file size I can use is probably around 2048 Meg, which
isn't several times the size of our memory. Well, I'll let it run and see
what happens.
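That negative number is the classic signed 32-bit overflow; assuming
Bonnie stores the size in a 32-bit long, the wrap-around reproduces its
output exactly:

```shell
bytes=$(( 6144 * 1024 * 1024 ))      # requested file size: 6144 Meg in bytes
wrapped=$(( bytes % 4294967296 ))    # keep only the low 32 bits
if [ "$wrapped" -ge 2147483648 ]; then
    wrapped=$(( wrapped - 4294967296 ))   # reinterpret as signed
fi
echo "$wrapped"                      # -> -2147483648, as Bonnie printed
```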
|
janc
|
|
response 327 of 547:
|
May 21 18:00 UTC 2003 |
So, if we went with RAID, what would we do?
On the disks we'll have partitions
sd0a - pretty tiny. A place to store kernels. We'll boot from here.
sd1a - A copy of /dev/sd0a, so we can boot if sd0 dies
sd0b - swap partitions, one Gig each. You can put swap on raid, but
sd1b it doesn't appear to be a great idea. We'll trust OpenBSD to
sd2b balance swap load over the three spindles.
sd0d - the remainders of the disks, about 16 Gig each.
sd1d
sd2d
Now, sd0d, sd1d and sd2d will be clustered together into a RAID 5 array, called
raid0. To all intents and purposes, this appears as a single big disk. It
should come out at about 29 Gig, a more than adequate amount of space for all
of Grex's needs for a while. Raid0 gets partitioned into all the various
partitions we need, with root on raid0a, usr on raid0d and so on.
The 80 Gig IDE disk doesn't participate in this. We could put the boot
partition on this, but I'd want copies on two disks anyway, so we'll need at
least some non-raid partitions on the SCSI disks anyway, so let's leave
everything critical off the IDE.
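For reference, the RAIDframe configuration file for such an array is
short. A sketch of what it might look like; the layout numbers are
typical defaults from raidctl(8), not tuned values:

```
START array
# numRow numCol numSpare
1 3 0

START disks
/dev/sd0d
/dev/sd1d
/dev/sd2d

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
64 1 1 5

START queue
fifo 100
```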
|
cross
|
|
response 328 of 547:
|
May 21 18:38 UTC 2003 |
Why not make sd2a a copy of sd0a as well? It wouldn't hurt anything,
and might help, since each disk would be exactly like every other disk
in terms of how the partitions are laid out. That makes partitioning
easy; you can keep a copy of the disklabel for one of the disks around
in a file somewhere, and just write it to a new disk with the disklabel
command if necessary. Then, just plop the new disk in, tell RAIDframe
to rebuild it, and let it go on its merry way.
|
janc
|
|
response 329 of 547:
|
May 21 19:08 UTC 2003 |
Probably would. Actually, you don't even have to keep the layout in a file.
You can just copy it from one disk to another:
disklabel sd0 | disklabel -R sd1 /dev/fd/0
That's how I built the current setup.
|
janc
|
|
response 330 of 547:
|
May 21 19:08 UTC 2003 |
Bonnie croaked while doing some seeks. I'll try it again with a smaller file
to see if that works better.
|
other
|
|
response 331 of 547:
|
May 21 19:53 UTC 2003 |
I thought the IDE was mainly for a comprehensive backup of the boot
partition plus storage for sources.
|
cross
|
|
response 332 of 547:
|
May 21 19:55 UTC 2003 |
Yeah, you can do that, but if you also keep the disk label around in
another file, you can label the disk on any machine with a SCSI controller.
Does that matter? I don't know. It might be slightly more convenient.
It's just a nit, though; it's trivial to get a copy of the disklabel once
the machine's set up, and I doubt it would matter....
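That file-based variant of the disklabel trick could look like this; the
filename is just an example:

```shell
disklabel sd0 > /root/scsi.label     # save the label to a file
disklabel -R sd1 /root/scsi.label    # later: stamp it onto a replacement disk
```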
|
cross
|
|
response 333 of 547:
|
May 21 19:58 UTC 2003 |
Oh yeah.... That's an idea. Put /usr/src and /usr/obj on the IDE
drive, and then you don't have to do anything hacky with linking them
to /var as in my latest proposal. /var can be decreased accordingly,
and more space allocated to /u.
|
janc
|
|
response 334 of 547:
|
May 21 20:09 UTC 2003 |
OK, with a file size of 2000M, I get results from Bonnie. The validity of
these results is, however, questionable, since a lot of the file may have been
in memory instead of on disk.
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid 5 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
We have two lines of results. The first was using the raid 5 array of three
SCSI disk. The second was on a single plain ordinary SCSI disk.
For each test we have the speed and the % of CPU used.
There are three output tests:
Per Char - file written sequentially with 2 billion calls to putc()
Block - file written with block writes
Rewrite - each block read, changed and rewritten
There are two input tests
Per Char - 2 billion calls to getc()
Block - block reads
And a seek test
Seeks - four child processes each execute 4000 seeks and reads. After
10% of these they change and rewrite the block.
So, on writing, RAID was 5 to 6 times slower. Notice that the supposedly
optimum block writes were actually slower than the character writes for the
RAID. The SCSI was twice as fast as RAID on the rewrite test.
On read the RAID array was still slower than the plain disks on the Per
Char reads, but a bit faster on the block reads. It was substantially slower
on the seeks.
Admitting that the benchmark is seriously questionable due to the small size
of the file relative to the large size of memory, this is not at all an
impressive result.
I reran the tests and got similar results.
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid 5 2000 9520 6.8 7974 1.3 5706 2.0 50932 62.5 63815 13.0 147.9 1.6
raid 5 2000 8745 6.4 7654 1.3 5717 2.2 51345 63.5 64022 14.6 150.0 1.1
scsi 2000 53754 43.4 54106 14.1 10090 2.6 60326 70.9 61067 11.5 201.2 0.8
scsi 2000 54058 43.4 54618 14.1 10129 2.8 60552 71.0 60865 11.1 203.4 0.9
I suppose the main advantage in performance is in balancing load among multiple
spindles, but this would really only be noticeable if multiple processes were
reading/writing the disk at once. With a single process, we aren't going to
gain much. Only in the seek test are there multiple processes, and then only
four.
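A quick check of those write-speed ratios, using the K/sec numbers from
the table above:

```shell
# SCSI vs RAID 5 sequential write throughput, K/sec from the Bonnie table
awk 'BEGIN { printf "block write: %.1fx slower\n", 54106 / 7974 }'
awk 'BEGIN { printf "char write:  %.1fx slower\n", 53754 / 9520 }'
```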
|