Application of Functions: Hashing

You are given a set of student records, each of which includes the student's ID-number, and these records have to be stored in a table. For any given ID-number, one has to be able to determine whether there is a record with that number and, if so, to retrieve it. It may also be necessary to add new records and delete existing ones.

It is fairly easy to come up with some method for solving this problem:

Store the first record in the first table slot, the second in the second slot, and so on. To search for a record, simply scan all entries from the beginning (to the end if necessary).
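The sequential method just described can be sketched as follows. This is only an illustration: the record layout (an ID-number paired with a name) and the function names `add` and `search` are assumptions, not part of the original notes.

```python
from typing import Optional

# Hypothetical record layout: (id_number, name) pairs kept in arrival order.
table: list[tuple[int, str]] = []

def add(id_number: int, name: str) -> None:
    """Store the record in the next free slot at the end of the table."""
    table.append((id_number, name))

def search(id_number: int) -> Optional[tuple[int, str]]:
    """Scan the entries from the beginning (to the end if necessary)."""
    for record in table:
        if record[0] == id_number:
            return record
    return None
```

Adding is cheap, but a search may have to inspect every entry in the table.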

How are records added or deleted?

It is much harder to design a method for which the search, add, and delete operations are efficient and at the same time storage space is used economically (i.e., no huge table for just a few records).

One solution is to use a hash table based on a well-chosen hash function.

A Hash Table

For simplicity let us first assume that the number of records is small, say no more than 7.

We will use a table with seven entries, numbered 0, 1, 2, ..., 6.

Given a (9-digit) ID-number x, compute its hash value h(x) using the hash function h, where

$h(x) = x \bmod 7$.

Store the student's record in table entry h(x).

Example.

[Table: sample ID-numbers x with their hash values h(x).]

Note. The elements x for which one computes hash values are also called keys. In this example the keys are ID-numbers.
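A minimal sketch of such a table follows, taking h(x) = x mod 7 as the hash function. That choice is an assumption suggested by the seven-entry table, and the ID-numbers used with it are made up for illustration. Collisions are ignored here: a later record simply overwrites an earlier one in the same entry.

```python
TABLE_SIZE = 7  # entries numbered 0..6, as in the text

def h(x: int) -> int:
    """Assumed hash function: remainder of the ID-number modulo the table size."""
    return x % TABLE_SIZE

table = [None] * TABLE_SIZE

def store(id_number: int, record: str) -> None:
    """Store the record in table entry h(x), overwriting whatever was there."""
    table[h(id_number)] = (id_number, record)

def lookup(id_number: int):
    """Return the record stored under this key, or None if there is none."""
    entry = table[h(id_number)]
    if entry is not None and entry[0] == id_number:
        return entry[1]
    return None
```

Both `store` and `lookup` touch exactly one table entry, no matter how many records are stored.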

Collisions

Typically, the domain X of a hash function is much larger than its co-domain Y, though the subset X' of those elements of X for which hash values need to be computed is usually about the same size as Y.

If the hash function h, when restricted to X' as a domain, is one-to-one, then hashing works fine. If it is not one-to-one, there may be collisions, that is, two different keys with the same hash value.

Example. Where do we store the record of a student with ID-number 223-79-9068?

Collisions can be resolved in two ways:

store the record in the ``next available'' slot, or
store (a pointer to) a list of records in each slot.
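Both strategies can be sketched in a few lines. As before, h(x) = x mod 7 is an assumed hash function, and the function names are hypothetical. The first pair of functions uses the "next available" slot (often called linear probing); the second pair keeps a list of records in each slot (often called chaining).

```python
TABLE_SIZE = 7

def h(x: int) -> int:
    return x % TABLE_SIZE  # assumed hash function

# Strategy 1: store the record in the "next available" slot.
def probe_insert(table, key, record):
    for step in range(TABLE_SIZE):
        slot = (h(key) + step) % TABLE_SIZE  # wrap around at the end
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("table is full")

def probe_search(table, key):
    for step in range(TABLE_SIZE):
        slot = (h(key) + step) % TABLE_SIZE
        if table[slot] is None:      # an empty slot ends the search
            return None
        if table[slot][0] == key:
            return table[slot][1]
    return None

# Strategy 2: store a list of records in each slot.
def chain_insert(table, key, record):
    bucket = table[h(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:                 # key already present: replace its record
            bucket[i] = (key, record)
            return
    bucket.append((key, record))

def chain_search(table, key):
    for k, record in table[h(key)]:
        if k == key:
            return record
    return None
```

For instance, the keys 223799068 and 223799075 both hash to 3 under this h. With probing the second one lands in slot 4; with chaining both end up in the list at slot 3. Deleting from a probed table needs extra care, since emptying a slot can cut a search off from records stored beyond it.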

Hash functions are typically onto - why is this good?

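In the example, if the number of student records is reasonably large, say around 8,000, the function h above, with only seven possible hash values, is not suitable. A more reasonable function would map keys into a table with roughly as many entries as there are records.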

The Pigeonhole Principle

If A and B are finite sets and B has fewer elements than A, then there is no one-to-one function from A to B.

This observation is also known as the Pigeonhole Principle.

Example. Let A be the set $\{1, 2, \ldots, 8\}$. How many of the integers from A need to be selected so that, regardless of the choice of selection, there is at least one pair with a sum of 9?

Four is not enough, as we may select 1,2,3,4 where no pair yields a sum larger than 7.

But any selection of five integers from A must contain a pair whose sum is 9. To see why, observe that A can be partitioned into four different subsets $\{1,8\}$, $\{2,7\}$, $\{3,6\}$, and $\{4,5\}$, where the sum of each of the four corresponding pairs is 9.

Now if $a_1, a_2, a_3, a_4$, and $a_5$ are the selected integers from A, we define a function f by setting $f(a_i)$ to be the subset that contains $a_i$.

By the pigeonhole principle, the function f is not one-to-one, so there exist two integers $a_i$ and $a_j$, with $i \ne j$, such that $f(a_i) = f(a_j)$. In other words, there must be one subset both of whose elements are selected. The corresponding sum is 9.
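Since A is tiny, the claim can also be checked by brute force. The sketch below takes A = {1, ..., 8}, the set implied by the partition into four pairs that each sum to 9, and tries every possible selection. The helper name `has_pair_summing_to` is hypothetical.

```python
from itertools import combinations

A = range(1, 9)  # the set {1, ..., 8}

def has_pair_summing_to(selection, target=9):
    """True if some two selected integers add up to the target."""
    return any(x + y == target for x, y in combinations(selection, 2))

# Every selection of five integers contains a pair with sum 9 ...
all_fives_ok = all(has_pair_summing_to(s) for s in combinations(A, 5))

# ... but at least one selection of four integers does not.
some_four_fails = any(not has_pair_summing_to(s) for s in combinations(A, 4))
```

The exhaustive check confirms the result but, unlike the pigeonhole argument, gives no explanation of why it holds.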

A Bald Statement

Despite its simplicity, the pigeonhole principle can be used to solve an amazing variety of problems.

Claim: There must be at least two non-bald New Yorkers who have exactly the same number of hairs on their heads!

Proof: A human head has at most 1,000,000 hairs, and there are more than 1,000,000 non-bald New Yorkers. By the pigeonhole principle, the function mapping each of them to his or her hair count cannot be one-to-one.

Note that this proof, although completely rigorous, is not constructive. We don't figure out which two people share the same hair count, or what that hair count is - only that such a pair must exist.

Other Applications of the Pigeonhole Principle

I own n distinct pairs of socks, which I keep in an unmatched pile in my drawer. How many individual socks must I pull out of the drawer to ensure that I get two that match?

Think of this as having n pigeonholes, one for each type of sock. How many pigeons do I need to ensure that some hole contains 2 of them?

How long a document must you write in order to ensure that at least some word is used more than once?

If there are only 100,000 words in the dictionary, a book with 100,001 words will use at least one of them twice.

A Subset of Divisors

Suppose you are given an arbitrary subset S of 101 distinct integers from the set $\{1, 2, \ldots, 200\}$. There must be two distinct integers x, y in S such that x divides y.

Proof: Every positive integer n can be written as $n = 2^k m$, for $k \ge 0$ and m odd. (Why? Factoring all the twos from n leaves an odd number.)

Thus every number in S can be mapped to an odd number from 1 to 199. There are exactly 100 such numbers. (Why? These are the integers 2i - 1 for $1 \le i \le 100$.)

Thus at least two of the 101 distinct integers must be mapped to the same odd number m, say $x = 2^j m$ and $y = 2^k m$ with $j < k$. Then x must divide y, since $y = 2^{k-j} x$.

This result can be generalized to state that any subset S of n+1 integers from 1 to 2n must contain a pair x, y in S such that x divides y.
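The proof suggests a way to actually find such a pair: map each integer to its odd part and look for two integers that land on the same one. The following is a sketch of that idea; the function names `odd_part` and `divisible_pair` are hypothetical.

```python
from itertools import combinations

def odd_part(n: int) -> int:
    """Return the odd m such that n = 2**k * m for some k >= 0."""
    while n % 2 == 0:
        n //= 2
    return n

def divisible_pair(subset):
    """Return (x, y) with x < y sharing the same odd part, so that x divides y.

    Returns None if no two elements share an odd part.
    """
    seen = {}  # odd part -> smallest element seen so far with that odd part
    for y in sorted(subset):
        m = odd_part(y)
        if m in seen:
            return seen[m], y
        seen[m] = y
    return None

# Exhaustive check of the generalized claim for n = 5:
# every subset of 6 integers from 1..10 yields a pair.
claim_holds_for_n5 = all(
    divisible_pair(s) is not None for s in combinations(range(1, 11), 6)
)
```

For example, for the 101 integers from 100 to 200 the function returns the pair (100, 200). The bound n+1 cannot be lowered: the n integers from n+1 to 2n contain no pair in which one divides the other.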

Steve Skiena
Tue Aug 24 20:25:28 EDT 1999