Section 33.4 Hash Functions for Strings

To hash a string, or another complex data type, we need a way to combine the hash values of the individual characters (or other components) into a single hash value. One simple approach would be to combine all of the character codes together to produce a single integer. While this is a valid hash function, it is not a very good one: it produces the same hash value for any two strings that contain the same characters, regardless of their order. Two strings like “sheet” and “these” would produce the same hash value, even though they are different strings.

Before looking at why this happens and what we can do about it, let’s consider how we should “combine” the character codes together. We could just add them all together. Another common method is to use exclusive or, or xor, to combine the bits. xor is an operation that takes two bits and produces a 1 if exactly one of the bits is 1, and produces a 0 otherwise. (1 if they are different, 0 if they are the same.) In C++, the ^ operator is the “bitwise exclusive or” operator. It combines two values by performing the “xor” operation on each pair of corresponding bits.

01100101  (bits of 'e')
01110011  (bits of 's')
--------
00010110  (xor of 'e' and 's')

Figure 33.4.1. 'e' ^ 's'. Every column where the bits are different has a 1 in the result. Every column where the bits are the same has a 0.
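
The figure can be reproduced with a few lines of C++. This is only a sketch: the helper names bitsOf and xorChars are made up for illustration and are not part of any library.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: returns the low 8 bits of a value as a string of 0s and 1s.
std::string bitsOf(unsigned int value) {
    return std::bitset<8>(value).to_string();
}

// Hypothetical helper: xors the codes of two characters.
// The casts avoid surprises on platforms where char is signed.
unsigned int xorChars(char a, char b) {
    return static_cast<unsigned char>(a) ^ static_cast<unsigned char>(b);
}
```

Assuming an ASCII-compatible character set, bitsOf(xorChars('e', 's')) gives "00010110", matching the bottom row of the figure.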

Note 33.4.1.

XOR is often chosen because it is a simple operation that can be performed efficiently by just about any processor. Other logical operations like AND and OR tend to force values towards either 1 (for OR) or 0 (for AND), which can lead to more collisions. For any random pair of bits, XOR will produce a 1 half of the time and a 0 half of the time, which can help to spread out hash values more evenly.

XOR has its own issues as a hash operation. If you XOR a value with itself, the result is 0: both 'e' ^ 'e' and 's' ^ 's' produce 0. Ideally, the strings “ee” and “ss” would produce different hash values.
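
To see the cancellation in action, here is a minimal sketch of a hash that does nothing but xor the character codes together. The name xorHash is an assumption made for this example.

```cpp
#include <string>

// Hypothetical helper: xors all of the character codes in a string together.
unsigned int xorHash(const std::string& str) {
    unsigned int hash = 0;
    for (unsigned char c : str) {
        hash ^= c;  // a repeated character cancels its earlier copy
    }
    return hash;
}
```

With this function, xorHash("ee") and xorHash("ss") both return 0, so the two strings collide.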

Robust hash functions often combine XOR with other operations like addition to avoid this problem.

Whether we use addition or xor, we still have the problem that “sheet” and “these” produce the same hash value because they combine the same characters. In general, we can’t guarantee that every object will produce a unique hash value. But we should try to minimize the number of collisions (cases where different objects produce the same hash value).
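
As a rough illustration of the collision, the sketch below adds up the character codes. The name sumHash is hypothetical; the point is only that addition does not care about order.

```cpp
#include <string>

// Hypothetical helper: adds up the character codes of a string.
unsigned int sumHash(const std::string& str) {
    unsigned int hash = 0;
    for (unsigned char c : str) {
        hash += c;  // addition ignores where in the string the character appears
    }
    return hash;
}
```

Assuming ASCII character codes, sumHash("sheet") and sumHash("these") both return 537. Replacing += with ^= would collide in the same way, because xor is also order-independent.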

To do this for a string, we need to make sure that the order of the characters affects the hash value. One way to do that is to shift the bits of different characters by different amounts before combining them.

First, look at the results of 'e' ^ 's' and 's' ^ 'e'.

01100101  (bits of 'e')
01110011  (bits of 's')
--------
00010110  (xor)

Figure 33.4.2. 'e' ^ 's'

01110011  (bits of 's')
01100101  (bits of 'e')
--------
00010110  (xor)

Figure 33.4.3. 's' ^ 'e'

Now, look at the results if we take the first character and shift its bits left by 4 before combining with the second character.

01100101      (bits of 'e' << 4)
    01110011  (bits of 's')
------------
011000100011  (xor)

Figure 33.4.4. 'e' ^ 's' with bits of 'e' shifted left by 4

01110011      (bits of 's' << 4)
    01100101  (bits of 'e')
------------
011101010101  (xor)

Figure 33.4.5. 's' ^ 'e' with bits of 's' shifted left by 4

Note 33.4.2.

Anywhere there isn’t a visible bit shown, it should be considered a 0. But, 0 XOR’d with anything produces that thing (like adding 0), so if there is only one visible bit in a column, you can use that bit as the result.

When we shift the bits of the first character, the order of the characters affects the hash value, so “se” and “es” produce different hash values.
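
The two-character case can be checked directly. The helper below is a sketch with a made-up name, combineShifted, that shifts the first character left by 4 bits before xoring in the second.

```cpp
// Hypothetical helper: shifts the first character's code left by 4 bits,
// then xors it with the second character's code.
unsigned int combineShifted(char first, char second) {
    unsigned int shifted =
        static_cast<unsigned int>(static_cast<unsigned char>(first)) << 4;
    return shifted ^ static_cast<unsigned char>(second);
}
```

Assuming ASCII character codes, combineShifted('e', 's') is 0x623 (011000100011 in binary) while combineShifted('s', 'e') is 0x755 (011101010101), so swapping the characters changes the result.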

To shift the bits of an integer in C++, we use the << operator. If x has the value 5, or 00000101 in binary, then x << 2 will shift the bits of x to the left by 2 positions, resulting in 00010100 (which would be 20). x << 4 would shift the pattern over four positions to make 01010000 in binary (which would be 80).

Warning 33.4.3.

Confusingly, this use of << is a completely different operation from when we use << with an output stream despite using the same symbols. intVar << intValue says to shift the bits of intVar to the left by intValue positions, while cout << intValue says to output intValue to the console.
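
One place where the two meanings tend to collide is when printing the result of a shift. The sketch below uses two hypothetical helpers that write into a std::ostringstream (which behaves like cout for this purpose) so the difference is easy to inspect.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: the parentheses make the inner << a bit shift,
// and the outer << then streams the shifted value.
std::string shiftThenPrint(int x, int amount) {
    std::ostringstream out;
    out << (x << amount);
    return out.str();
}

// Hypothetical helper: without parentheses, both << are stream output,
// so x is written first and then amount is written after it.
std::string printBoth(int x, int amount) {
    std::ostringstream out;
    out << x << amount;
    return out.str();
}
```

Here shiftThenPrint(5, 2) produces "20", while printBoth(5, 2) produces "52". When a shift appears inside an output statement, wrapping it in parentheses is the safer habit.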

As with arithmetic operations, the shift operator produces a new value and leaves its operand unchanged, so if we want to change the value of x, we need to assign the result of the shift back to x:

x = x << 4;  // Shift x left by 4 bits and assign the result back to x

By repeatedly shifting the bits of the hash value as we add in characters, we create what is known as a rolling hash. The animation below demonstrates the process.

Activity 33.4.1. String Hashing.

The animation is set to hash “these”. Click the > button to step through the process.

A C++ implementation of this hash function for strings is shown below. We start with a hash value of 0, and for each character in the string, we shift the current hash value left by 4 bits and then xor it with the character code of the current character.

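#include <string>

// Unsigned arithmetic is used so that bits shifted past the top are simply
// discarded; overflowing a signed int with << is not reliably defined.
unsigned int hashString(const std::string& str) {
    unsigned int hash = 0;
    for (unsigned char c : str) {
        hash = (hash << 4) ^ c;  // Shift hash left by 4 bits and xor with character code
    }
    return hash;
}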

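This is a very simplistic hash function for strings, but it at least takes into account the order of the characters. One limitation is that every step pushes the earlier characters 4 bits further left, so with a typical 32-bit hash value, only the last 8 characters of a long string still influence the result.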
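
The sketch below restates the same shift-and-xor loop with a fixed-width 32-bit unsigned type so that its behavior is easy to check. The name hashString32 and the choice of std::uint32_t are assumptions made for this example, not part of the function shown above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical variant of the shift-and-xor hash using a fixed 32-bit value.
std::uint32_t hashString32(const std::string& str) {
    std::uint32_t hash = 0;
    for (unsigned char c : str) {
        // Bits shifted past bit 31 are discarded.
        hash = (hash << 4) ^ c;
    }
    return hash;
}
```

Assuming ASCII character codes, hashString32("se") is 0x755 and hashString32("es") is 0x623, and “sheet” and “these” no longer collide. On the other hand, hashString32("Xabcdefgh") equals hashString32("Yabcdefgh"): the first character has been shifted left 32 bits by the time the loop ends, so it no longer contributes anything.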
