Standard Library Hash Functions and Custom Types

Section 33.5 Standard Library Hash Functions and Custom Types

Rather than writing our own hash functions for built in types like doubles and strings, we can use the hash functions provided by the C++ standard library. The std::hash class is a templated functor that can be used to compute a hash value (as a size_t) for a given type. For example, to compute the hash value of a string, we can use the following code:

🔗

string name = "John Smith";
string name2 = "Sandra Jackson";
// Create a hash functor for strings
std::hash<string> stringHasher;
// Compute the hash values for the strings
size_t strHash = stringHasher(name);
size_t strHash2 = stringHasher(name2);

For custom data types, we need to write our own hash functions that use std::hash to compute hash values for basic members and then combine them using bit shifting and xor to produce a single hash value for the entire object.

🔗

For example, if we have a Point struct with x and y integer coordinates, we can write a hash function for it like the one below. It shifts the bits of the x coordinate’s hash value to the left by 8 bits, we ensure that (1, 2) will have a different hash value than a point with coordinates (2, 1).

🔗

size_t myHash(const Point& p) {
  std::hash<int> intHasher;
  size_t pointHash = (intHasher(p.x) << 8) ^ intHasher(p.y);
  return pointHash;
}

For more complex objects, we need to carefully consider which members to include in the hash. There are a couple of general guidelines to follow when deciding which members to include in the hash:

We should include enough members to ensure that different objects that we want to treat as different will have different hash values. For example, we would not want to compute the hash value for a Person using just the age as then everyone with the age 23 would hash to the exact same value.

🔗
We should try to avoid including data that is likely to change frequently. The hash value will be used to “find” objects in the hash table. We want an object’s hash value to remain consistent whenever possible.

🔗

If there is a single member that is unique for each object (like a social security number for a person), the best strategy is often to just use the hash value of that member as the hash value for the entire object. If there is not a single unique member, we need to decide on a set of members to include in the hash and then combine their hash values in a way that produces a good distribution of hash values.

🔗

We also need to make sure that the logic for the hash function matches the logic of equality. If two objects are considered equal by the equality operator, they should have the same hash value. If two objects are not considered equal, they should ideally have different hash values (although we do not have to guarantee this). Any object we are going to hash should also have an equality operator defined that relies on the same attributes as the hash function.

🔗

Insight 33.5.1.

Equality implies the same hash value. The same hash value does not necessarily mean two objects are equal.

🔗

Let’s consider two different versions of a Student class. The first will have a studentID member that is unique for each student:

🔗

struct Student {
    string firstName;
    string lastName;
    string dateOfBirth;
    int earnedCredits;
    double GPA;
    int studentID;  // Unique identifier for each student
};

bool operator==(const Student& s1, const Student& s2) {
    return s1.studentID == s2.studentID;  // Equality based on studentID
}

size_t myHash(const Student& s) {
    std::hash<int> intHasher;
    return intHasher(s.studentID);  // Hash based on studentID
}

Here, the existence of a unique field makes things simple. We will just base the hash value on that unique field. Then, we will make sure that any two items that are == always will produce the same hash value by defining the equality operator to also only rely on the unique field.

🔗

Now, let’s consider a second version of the Student class where we just have firstName, lastName, and dateOfBirth members, none of which are unique by themselves:

🔗

struct Student {
    string firstName;
    string lastName;
    string dateOfBirth;
    int earnedCredits;
    double GPA;

bool operator==(const Student& s1, const Student& s2) {
    // Equality based on first name, last name, and date of birth
    return s1.firstName == s2.firstName &&
           s1.lastName == s2.lastName &&
           s1.dateOfBirth == s2.dateOfBirth;  
}

size_t myHash(const Student& s) {
    std::hash<string> stringHasher;
    size_t h1 = stringHasher(s.firstName);
    size_t h2 = stringHasher(s.lastName);
    size_t h3 = stringHasher(s.dateOfBirth);
    // Combine the hash values and shift each field by one additional bit
    return h1 ^ (h2 << 1) ^ (h3 << 2);
}

In this second version, we do not have a single unique field to base the hash on, so we need to combine the hash values of multiple fields. We use bit shifting and xor to combine the hash values of the first name, last name, and date of birth. We will shift each value by a different number of bits (0, 1, and 2) to ensure that the order of the fields affects the final hash value. This way, we will produce different hash values for a student with the first name "John", last name "Smith", and date of birth "01/01/2000" than for a student with the first name "Smith", last name "John", and date of birth "01/01/2000". However, if we have two students with the same first name, last name, and date of birth, they will produce the same hash value, which is consistent with our equality operator. (It should be clear why having a unique student ID is a nice way to identify them.)

🔗

As before, we need to make sure that the logic of the hash function is consistent with the logic of the equality operator. By defining == to rely on the same fields as the hash function, we ensure that any two students that are considered equal will produce the same hash value.

🔗

Insight 33.5.2.

Equality and hash functions for data types should be based on the core attributes that uniquely identify an object. Sometimes the “uniquely identifying attributes” is just a single member and sometimes it requires a combination of members.

🔗

Note that neither version includes things like earned credits or GPA in the hash. If we included those values in the hash and then tried to update a student’s GPA, the hash value for that student would change, which means its location in the table would need to change as well.

🔗

Warning 33.5.3.

When using any hash table, if you want to modify a key in a way that will change it’s hash value, you must first remove the item from the table, modify the key, and then reinsert the item into the table. If you modify a key without removing and reinserting, you will break the integrity of the hash table and it may not be able to find that item anymore.

🔗

Checkpoint 33.5.1.

We are storing structs that contain information about stocks:

🔗

struct StockInfo {
    string tickerSymbol;
    string companyName;
    int sharesOwned;
    double currentPrice;
};

The ticker symbol is the short code used to identify stocks in listings, while the company name is the full legal name of the company. Thus, we might have a value like this: {tickerSymbol: "GOOG", companyName: "Alphabet Inc.", sharesOwned: 100, currentPrice: 316.45}.

🔗

What members should we include in the hash function for this struct?

🔗

tickerSymbol
companyName
This is a tough call. If we believe the ticker symbol is unique (which it is), then using just it would be sufficient. And not including the companyName means that we can let that name change without having to worry about rehashing.

If tickerSymbols were not unique, we might want to include the companyName to ensure a unique identifier.
sharesOwned
Shares owned is likely to change. And it does not contribute to uniquely identifying the stock.
currentPrice
Current price is likely to change and does not contribute to uniquely identifying the stock.

It is also a double which can introduce floating-point precision issues.

🔗

Checkpoint 33.5.2.

Which statements are correct about hash functions and equality?

🔗

If two objects are equal, their hash values must be the same.
If two objects have the same hash value, they must be equal.
This is not necessarily true. Two objects can have the same hash value (a collision) without being equal. Though it is desirable for hash functions to minimize collisions, they cannot be completely eliminated.
If two objects are equal, they should have different hash values.
This is incorrect. Equal objects must have the same hash value.
If two objects have different hash values, they must not be equal.
Equal objects must have the same hash value, so if two objects have different hash values, they cannot be equal.

🔗

You have attempted of activities on this page.

🔗

Prev Top Next