Data structures are methods of organizing data to facilitate quick searches, quick modifications, compression, and so on. They are manipulated by various algorithms, and like algorithms they are fundamental to computing and a cornerstone of any programmer's education. This is the place to ask about and discuss them.
3 responses total.
This response has been erased.
[Note: Response #1 was erased because I attempted to post it using
fronttalk, which has a bug with string quoting that makes C code
problematic. This response is really #1, just entered using picospan.]
Recently, a question came up on M-Net regarding hash tables. Consider the
following simple string hashing function, suitable for use with open hashing
structures that use some other method for collision resolution, e.g.,
chaining. In C:
enum
{
    MULTIPLIER = 31     /* Note: 31 is prime. */
};

unsigned int
hash(char *str, int tsize)
{
    unsigned int h;
    char *p;

    p = str;
    if (p == NULL)
        p = "";
    h = 0;
    while (*p != '\0')
        h = h * MULTIPLIER + *p++;
    return (h % tsize);
}
A question was posed by cdalton about why it is advantageous to require
that MULTIPLIER and the hash table size be prime (or at the very least,
relatively prime to one another). My answer is as follows.
It's all about the distribution of entries in the table. The idea is that
you want to spread them out throughout the table as best as you can, so that
you minimize collisions. If the multiplier and the table size are relatively
prime, then you end up distributing entries across every "slot" in the table,
assuming "sufficiently random" data.
The easiest way to make the multiplier and the hash table size relatively
prime is to make one or both of them prime numbers. (Relatively prime means
they share no common factors: if the table size n and the multiplier m are
both positive integers greater than 1, then gcd(n, m) = 1, where gcd(n, m)
is the positive greatest common divisor of n and m.) Recall that
a mod n = r (a, n positive integers) means that a = ns + r for some integer
s. We assume further that n > m.
To show why this relatively prime business is important, suppose we are
doing open hashing as in my earlier example with some other method for
collision resolution (e.g., chaining). Now, let d = gcd(n, m) >= 1.
Given an input datum consisting of the string a_1 a_2 a_3 ... a_k, where
0 <= a_i <= B for some constant positive integer B, the hash becomes the
nonnegative integer

    r = ( sum from i = 1 to k of m * a_i ) mod n
Let u be the sum over i = 1 to k of a_i; then the above becomes:
r = (mu) mod n,
since m is constant with respect to the sum, and thus can be factored out.
But, since m has d as a factor, we know there exists some positive integer q
such that m = dq, and thus the hash is equivalent to r = (dqu) mod n, which
implies that dqu = sn + r => dqu - r = sn => n | (dqu - r), where ``a | b''
stands for "a divides b" (or b has a as a factor), and s is as before. But,
recalling that n also has d as a factor, we see that there exists some
positive integer t such that n = dt. Then, n | (dqu - r) => d | (dqu - r),
and sn / d = sdt / d = st = (dqu - r) / d = dqu/d - r/d is an integer by
closure of multiplication (since s and t are integers), cancellation,
substitution and distribution in the integers. But, clearly dqu/d = qu is
an integer by cancellation, and since st is an integer, then by closure
under subtraction we see that r/d = qu - st is also an integer and hence
d | r. Thus, every hash value must be a multiple of d.
It follows that every slot in the hash table that is not an integer multiple
of d is unused. If d > 1 then n and m are not relatively prime and only
every d'th slot in the table will be used. For example, consider the case
where n = 8 and m = 4. Then gcd(n, m) = 4 implies that only the 0'th and
4'th slots in the table will be used, and the hash table is only as
effective as one with two slots.
On the other hand, if n and m are relatively prime, then by definition,
gcd(n, m) = 1 and every slot in the table will be used.
Of course, this still all assumes that the data we're hashing cooperates and
doesn't give us some weird values. For instance, if the data were always
some multiple of the modulus when shifted and summed, you'd end up in the
same boat. You can try to ameliorate this case by careful choice of
collision resolution techniques, using double hashing with hash algorithms,
etc. It gets into some real black magic.
In practice, however, for most string data the function I posted is quite
good.
(Note: in that second-to-last paragraph, that should be double hashing with *different* algorithms, etc.)
- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss