maus
Scaling databases
Dec 1 06:56 UTC 2006
I am curious to know whether people here have worked with scaling a
database on UNIX/Linux. For the time being, this is a gedanken
experiment, as I am not sure it would fly with the powers-that-ain't,
and there is the possibility that I may get myself purposely fired
because of a contract change that will probably hurt all of the employees
(I can't give details, per my lawyer's instructions).
Presuming I still exist at the company, I am toying with the idea of
setting up a large IDS system for a server farm that has a huge
exposure footprint. The design is as follows: each access switch will
have a SPAN port with a SNORT sensor hanging off of it. Each of these
sensors will feed a shared pool of aggregators, which will in turn feed
the data to a master database. The data will be made available through
a couple of slave databases to an application server running BASE or a
similar analysis tool, as well as to a slave replica reserved only for
trend analysis and data mining.
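To make the aggregator stage concrete, here is a rough sketch of what I
picture each aggregator node doing: draining alerts coming off the
sensors and pushing them to the master in batches, so that one commit
(and therefore one fsync) covers many rows instead of one. PostgreSQL
with psycopg2 is assumed; the table, columns, and connection string are
all made up.

import queue
import psycopg2

BATCH_SIZE = 500
FLUSH_SECS = 2.0
alerts = queue.Queue()   # filled by whatever receives the sensor feeds

def run():
    conn = psycopg2.connect("dbname=ids user=aggregator")
    cur = conn.cursor()
    batch = []
    while True:
        try:
            batch.append(alerts.get(timeout=FLUSH_SECS))
        except queue.Empty:
            pass
        if batch and (len(batch) >= BATCH_SIZE or alerts.empty()):
            cur.executemany(
                "INSERT INTO snort_event"
                " (sensor_id, sig_id, src_ip, dst_ip, ts)"
                " VALUES (%s, %s, %s, %s, %s)",
                batch)
            conn.commit()        # one fsync covers the whole batch
            batch = []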
Since all of the data coming from the probes would be of the same type,
in the same format, concerning the same issue, it does not look like
the database is divisible, or at least not easily divisible. I know
that on a database server with sufficient RAM, performance is
I/O-bound, and I am wondering how to work this to make it scale. I
would presume that having the NIC (or possibly NICs) on a separate bus
from the discs would be a start (either by using a lot of PCI-e or by
getting a system with multiple buses), as would having the disc array
use a large, fast memory buffer, so that the host can write to that and
not have to wait for the order-of-magnitude slower write to the actual
drives.
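As a sanity check on that order-of-magnitude claim, a small probe like
this, run on the volume the database would live on, should show the gap
between a write that only reaches the OS/controller cache and one that
is forced through to stable storage:

import os, time

N = 200
fd = os.open("probe.dat", os.O_WRONLY | os.O_CREAT, 0o600)
buf = b"x" * 512

t0 = time.perf_counter()
for _ in range(N):
    os.write(fd, buf)                  # lands in cache, returns fast
t_cached = max(time.perf_counter() - t0, 1e-9)

t0 = time.perf_counter()
for _ in range(N):
    os.write(fd, buf)
    os.fsync(fd)                       # force it out to the platters
t_synced = time.perf_counter() - t0

os.close(fd)
os.unlink("probe.dat")
print("buffered: %10.0f writes/s" % (N / t_cached))
print("fsynced:  %10.0f writes/s" % (N / t_synced))

If the array has a battery-backed cache, the fsynced number should jump
way up, which is exactly the buffer effect I am hoping to exploit.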
On a database server that is taking a *LOT* of incoming data (lots of
insert queries -- the server farm is pretty fucking huge, and virtually
every server in it is fully internet-facing) and writing it all to a
table, is the primary slow-down going to be getting the data in (from
the network) or getting the data out (to the tables/drives)? Can the
slower of the two be given a higher preemption priority so that the two
can be made to run at roughly the same speed?
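The only way I can think of to answer that empirically is to time the
two halves of the ingest path separately; read_alert() and
insert_alert() below are hypothetical stand-ins for the real socket
read and the real INSERT:

import time

def read_alert():
    # stand-in for the real socket read from a sensor (hypothetical)
    time.sleep(0.0001)
    return b"alert"

def insert_alert(alert):
    # stand-in for the real INSERT against the master (hypothetical)
    time.sleep(0.0005)

net_time = db_time = 0.0
for _ in range(1000):
    t0 = time.perf_counter()
    alert = read_alert()          # time spent waiting on the network
    t1 = time.perf_counter()
    insert_alert(alert)           # time spent waiting on the database
    t2 = time.perf_counter()
    net_time += t1 - t0
    db_time += t2 - t1
print("network wait %.2fs, database wait %.2fs" % (net_time, db_time))

Whichever total dominates would be the side worth giving the higher
priority.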
Is there a formula or methodology, or even a rule of thumb, for
approximating the number of insert queries a database can perform per
second, given the number of processors, the amount of RAM, and other
factors?
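For what it's worth, my own back-of-the-envelope reasoning so far is
that durable inserts are bounded less by processors and RAM than by how
many commits (fsyncs) per second the log device can absorb -- without a
write cache or batching, roughly one commit per disk revolution. Every
number below is an assumption, not a measurement:

rpm = 15000                    # one 15k-RPM spindle for the WAL/log
commits_per_sec = rpm / 60.0   # ~1 durable commit per revolution: ~250/s
rows_per_commit = 500          # what the aggregators batch together
print("1 row per commit: ~%d inserts/s" % commits_per_sec)
print("batched commits:  ~%d inserts/s"
      % (commits_per_sec * rows_per_commit))

A battery-backed cache on the array would raise that fsync ceiling by a
couple of orders of magnitude, which is the fast-memory-buffer idea
from above.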
Do the major databases differ greatly in the number of queries per
second that they can process? For example, would there be a major
difference between using an untuned PGSQL vs. MySQL vs. Oracle (we
don't have a mainframe, so using DB2 would probably not be an option;
besides, we have a lot of UNIX, BSD, and Linux knowledge, and only two
people who know mainframes)? Same question for the various databases
with each one tuned by an expert.
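Rather than trust anyone's published numbers, I suppose I could point
the same small DB-API script at each candidate and time it. psycopg2
and its connection string are assumed below; the placeholder style (%s)
works for psycopg2 and MySQLdb but would need changing for a driver
like cx_Oracle:

import time
import psycopg2 as driver        # swap in MySQLdb, cx_Oracle, etc.

ROWS = 50000
conn = driver.connect("dbname=idstest")      # DSN is made up
cur = conn.cursor()
cur.execute("CREATE TABLE bench (id integer, payload text)")
conn.commit()
t0 = time.perf_counter()
for i in range(ROWS):
    # one big transaction; commit per row instead to measure
    # fully durable inserts
    cur.execute("INSERT INTO bench VALUES (%s, %s)", (i, "x" * 100))
conn.commit()
print("%.0f inserts/s" % (ROWS / (time.perf_counter() - t0)))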
Can a database have multiple listeners (perhaps one per node in the
aggregator pool)? Would the listeners wind up preempting one another?
Could each listener be tied to a separate SMT/SMP core (so 2 dual-core,
hyperthreading processors could simultaneously service 8 threads, for
example)? Is there a risk of simultaneous insert queries failing or
stalling because one of them holds a lock on the table? Or can a
database lock something smaller than a table (such as just a row, or
even just a cell)?
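To test the multiple-listener question directly, something like this
(reusing the bench table from the previous sketch, psycopg2 assumed)
opens one connection per listener thread and hammers the same table
from all of them. As I understand it, MVCC engines such as PostgreSQL
and InnoDB take row-level locks on insert, so the threads should not
stall each other, while a table-locking engine like MyISAM would
serialize them:

import threading
import time
import psycopg2

THREADS = 8                      # one "listener" per hardware thread
ROWS = 10000

def listener(n):
    conn = psycopg2.connect("dbname=idstest")   # one connection each
    cur = conn.cursor()
    for i in range(ROWS):
        cur.execute("INSERT INTO bench VALUES (%s, %s)", (n, "x" * 100))
    conn.commit()

threads = [threading.Thread(target=listener, args=(n,))
           for n in range(THREADS)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
rate = THREADS * ROWS / (time.perf_counter() - t0)
print("%.0f inserts/s across %d listeners" % (rate, THREADS))

Pinning each listener to a particular core would be an OS-level
affinity setting layered on top of this, not something the script
itself controls.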
Would it substantially matter whether this was done on UNIX or BSD or Linux?
Am I over-thinking this?
Thanks.