Exercise 11.2-1

Suppose we use a hash function \(h\) to hash \(n\) distinct keys into an array \(T\) of length \(m\). Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of \(\{ \{k, l \} : k \neq l\) and \(h(k) = h(l)\) ?

There is a peculiar notion of ordering here that must be observed: Two keys, \(k\) and \(l\) are a ‘collision’ if they both hash to the same value (that is, \(h(k) = h(l)\)) but this is only true if both keys have been hashed. Hashing one or the other is not a collision until the other is hashed and produces the same value. I’m pointing this out because neglecting to consider this detail leads to an unnecessary simple calculation.

Instead, let’s approach this question with a mind towards this ordering aspect. We shall assume our set of \(n\) keys are totally ordered (there exists some relation between elements that can reliably reproduce the order via calculation) as \(\{k_1, ..., k_n \}\). Let \(X_i\) be the expected number of times that key \(k_i\) is collided with by hashing keys that come after it.

\[\begin{align*} E[X_i] &= \sum_{j > i}^{n} Pr(h(k_j) = h(k_i)) \\ &= \sum_{j > i}^{n} \frac{1}{m} \\ &= \frac{(n-i)}{m} \end{align*}\]

Next, by linearity of expectation the total number of expected collisions is the sum of this figure for all \(n\) keys:

\[\begin{align*} E[X] &= \sum_{i = 1}^{n} \frac{n - i}{m} \\ &= \frac{n^2 - \frac{n(n+1)}{2}}{m} \\ &= \frac{n^2 - n}{2m} \end{align*}\]