To hash a string, or other complex data type, we need to find a way to combine the hash values of the individual characters (or other components) into a single hash value. One simple approach would be to combine all of the character codes together to produce a single integer. While this is a valid hash function, it is not a very good one. It will produce the same hash value for any two strings that have the same characters in any order. Two string like “sheet” and “these” would produce the same hash value, even though they are different strings.
Before looking at why this happens and what we can do about it, let’s consider how we should “combine” the character codes together. We could just add them all together. Another common method is to use exclusive or, or xor, to combine the bits. xor is an operation that takes two bits and produces a 1 if exactly one of the bits is 1, and produces a 0 otherwise. (1 if they are different, 0 if they are the same.) In C++, the ^ operator is the “bitwise exclusive or” operator. It combines two values by performing the “xor” operation on each pair of corresponding bits.
XOR is often chosen because it is a simple operation that can be performed efficiently by just about any processor. Other logical operations like AND and OR tend to force values towards either 1 (for OR) or 0 (for AND), which can lead to more collisions. For any random pair of bits, XOR will produce a 1 half of the time and a 0 half of the time, which can help to spread out hash values more evenly.
XOR has its own issues as a hash operation. If you hash the same value with itself, you will get a hash value of 0. Both 'e' ^ 'e' and 's' ^ 's' would produce 0. Ideally, the strings ee and ss would produce different hash values.
Whether we use addition or xor, we still have the problem that “sheet” and “these” produce the same hash value. In general, we can’t guarantee that every object will produce a unique hash value. But we can try to design our hash function to minimize the number of collisions (cases where different objects produce the same hash value). To do this in a string, we need to make sure that the order of the characters matters. One way to do that is to shift the bits of different characters by different amounts before combining them.
Anywhere there isn’t a visible bit shown, it should be considered a 0. But, 0 XOR’d with anything produces that thing (like adding 0), so if there is only one visible bit in a column, you can use that bit as the result.
When we shift the bits of the first character, the order of the characters affects the hash value, which will make sure se and es produce different hash values.
To shift bits of an integer in C++, we use the << operator. If x has the value 5, or 00000101 in binary, then x << 2 will shift the bits of x to the left by 2 positions, resulting in 00010100 (which would be 20). x << 4 would shift the pattern over four positions to make 01010000 in binary.
Confusingly, this use of << is a completely different operation from when we use << with an output stream despiteusing the same symbols. intVar << intValue says to shift the bits of intVar to the left by intValue positions, while cout << intValue says to output intValue to the console.
As with doing arithmetic operations, the shift operator produces a new value, so if we wanted to change the value of x, we would need to assign the result of the shift operation back to x:
x = x << 4; // Shift x left by 4 bits and assign the result back to x
By repeatedly shifting the bits of the hash value, as we add in characters, we create what is known as a rolling hash. The animation below demonstrates the process.
You can click and drag on the animation area to pan around. Use the Ctrl + mouse wheel (or touchpad scroll gesture) while over the animation area to zoom in and out.
A C++ implementation of this hash function for strings is shown below. We start with a hash value of 0, and for each character in the string, we shift the current hash value left by 4 bits and then xor it with the character code of the current character.
int hashString(const std::string& str) {
int hash = 0;
for (char c : str) {
hash = (hash << 4) ^ c; // Shift hash left by 4 bits and xor with character code
}
return hash;
}
This is a very simplistic hash function for strings, but it at least takes into account the order of the characters.