To insert a piece of data into a hash table, we need to calculate an index in the array where that data should be stored. To do that, we need to perform two tasks:
We need to transform the data into an integer. This is the “hashing” part of the task.
We then need to take the integer and map it to a valid index in the array we are using to store our data. This is the “indexing”, or “mapping”, part of the task.
The second task is relatively straightforward. If we have an array of size \(n\text{,}\) we can take the integer \(i\) we got from hashing and compute i % n to get a valid index in the array. (Dividing by \(n\) and taking the remainder ensures the result is between 0 and \(n - 1\text{.}\)) The value of \(n\) will depend on the size of the array, not the input. So we will leave that task for the data structure to worry about. The first task, hashing, is independent of the data structure, with logic that only depends on the type of data we are trying to hash. This is the only part of the process that our hash functions will worry about.
The simplest type of data to hash is an integer. We can just use the integer itself as the hash value. For example, if we want to store the number 42 in our hash table, we can just use 42 as the hash value. If we want to store the number 12345, we can just use 12345 as the hash value. The values are already unique and can be easily mapped to an index in the array using the modulo operator as described above. (We don’t care that the original value is obviously recoverable from the hash value, because we are not using hashing for security purposes.)
int myHash(char value) {
return static_cast<int>(value); // The hash of a char is its ASCII value
}
While turning a floating point value into an integer, we don’t want 3.1 and 3.14 to have the same hash value. But we might be worried about the fact that a value we expect to be 3.1 might actually be stored as 3.0999999999999996 due to the way floating point numbers are represented in memory. So, like when deciding if floating point values are “equal”, we will want to only consider a certain number of decimal places when hashing. For example, we might decide to multiply the value by 10000 and then round it to the nearest integer before using that as the hash value. That way, 3.1 and 3.14 would have different hash values, but 3.1 and 3.0999999999999996 would have the same hash value.
Note that although 3.01 and 3.0999999999999996 hash to the same value, it is not 30100! 3.01 has a binary representation that is slightly less than 3.01. But, our function successfully treats those values as the same for hashing purposes, which is what we wanted.
Other approaches might directly convert the bits of the floating point value into an integer, but that would not give us the ability to treat values that are close enough as the same for hashing purposes. Regardless of the approach taken, we will need to be careful to ensure that the hash function produces the same hash value for values that we want to treat as the same, and different hash values for values that we want to treat as different.
Floating point values like doubles are not great candidates for hashing. Given a choice of data types to hash, you should generally avoid hashing floating point values. If you do need to hash them, be sure to carefully consider how you will handle the issues described above.