Skip to main content

Section 33.5 Standard Library Hash Functions and Custom Types

Rather than writing our own hash functions for built in types like doubles and strings, we can use the hash functions provided by the C++ standard library. The std::hash class is a templated functor that can be used to compute a hash value (as a size_t) for a given type. For example, to compute the hash value of a string, we can use the following code:
string name = "John Smith";
string name2 = "Sandra Jackson";
// Create a hash functor for strings
std::hash<string> stringHasher;
// Compute the hash values for the strings
size_t strHash = stringHasher(name);
size_t strHash2 = stringHasher(name2);
For custom data types, we need to write our own hash functions that use std::hash to compute hash values for basic members and then combine them using bit shifting and xor to produce a single hash value for the entire object.
For example, if we have a Point struct with x and y coordinates, we can write a hash function for it like the one below. It shifts the bits of the x coordinateโ€™s hash value to the left by 1, we ensure that (1.0, 2.0) will have a different hash value than a point with coordinates (2.0, 1.0).
size_t myHash(const Point& p) {
  std::hash<double> doubleHasher;
  size_t pointHash = (doubleHasher(p.x) << 1) ^ doubleHasher(p.y);
  return pointHash;
}
(Of course we could have issues related to floating point precision, but we will assume we arenโ€™t worried about 1.0 vs. 1.0000001).
For more complex objects, we need to carefully consider which members to include in the hash. There are a couple of general guidelines to follow when deciding which members to include in the hash:
  • We should include enough members to ensure that different objects that we want to treat as different will have different hash values. For example, we would not want to compute the hash value for a Person using just the age as then everyone with the age 23 would hash to the exact same value.
  • We should try to avoid including data that is likely to change frequently. As we will see later, changing data that is included in the hash can cause problems when using hash tables.
If there is a single member that is unique for each object (like a social security number for a person), the best strategy is often to just use the hash value of that member as the hash value for the entire object. If there is not a single unique member, we need to decide on a set of members to include in the hash and then combine their hash values in a way that produces a good distribution of hash values.
There also should be an association between the hash function and the equality operator for a type. If two objects are considered equal by the equality operator, they should have the same hash value. If two objects are not considered equal, they should ideally have different hash values (although it is not strictly required that they do). So any object we are going to hash should also have an equality operator defined, and the logic of the hash function should be consistent with the logic of the equality operator.
Letโ€™s consider two different versions of a Student class. The first will have a studentID member that is unique for each student, and the second will have firstName, lastName, and dateOfBirth members, none of which are unique by themselves.
struct Student {
    string firstName;
    string lastName;
    string dateOfBirth;
    int earnedCredits;
    double GPA;
    int studentID;  // Unique identifier for each student
};

bool operator==(const Student& s1, const Student& s2) {
    return s1.studentID == s2.studentID;  // Equality based on studentID
}

size_t myHash(const Student& s) {
    std::hash<int> intHasher;
    return intHasher(s.studentID);  // Hash based on studentID
}
In that first case, the existence of a unique field makes things simple. We just need to make sure that any two items that are == always will produce the same hash value.
In the second case, we need to combine the hash values of multiple members. We will use bit shifting and xor to combine the hash values of the first name, last name, and date of birth. We will also need to make sure that the logic of the hash function is consistent with the logic of the equality operator. If we consider two students to be equal if they have the same first name, last name, and date of birth, then our hash function should produce the same hash value for any two students with the same first name, last name, and date of birth.
struct Student {
    string firstName;
    string lastName;
    string dateOfBirth;
    int earnedCredits;
    double GPA;

bool operator==(const Student& s1, const Student& s2) {
    // Equality based on first name, last name, and date of birth
    return s1.firstName == s2.firstName &&
           s1.lastName == s2.lastName &&
           s1.dateOfBirth == s2.dateOfBirth;  
}

size_t myHash(const Student& s) {
    std::hash<string> stringHasher;
    size_t h1 = stringHasher(s.firstName);
    size_t h2 = stringHasher(s.lastName);
    size_t h3 = stringHasher(s.dateOfBirth);
    // Combine the hash values and shift each field by one additional bit
    return h1 ^ (h2 << 1) ^ (h3 << 2);
}
Because of the bit shifting, we will produce a different hash for a student with the first name "John", last name "Smith", and date of birth "01/01/2000" than for a student with the first name "Smith", last name "John", and date of birth "01/01/2000". But, if we have two students with the same first name, last name, and date of birth, they will produce the same hash value, which is consistent with our equality operator. (It should be clear why having a unique student ID is a nice way to identify them.)

Insight 33.5.1.

Equality and hash functions for data types should be based on the core attributes that uniquely identify an object. Sometimes the โ€œuniquely identifying attributesโ€ is just a single member and sometimes it requires a combination of members.
Note that neither version includes things like earned credits or GPA in the hash. If we included those values in the hash and then tried to update a studentโ€™s GPA, the hash value for that student would change, which means its location in the table would need to change as well.

Warning 33.5.2.

When using any hash table, if you want to modify a key in a way that will change itโ€™s hash value, you must first remove the item from the table, modify the key, and then reinsert the item into the table. If you modify a key without removing and reinserting, you will break the integrity of the hash table and it may not be able to find that item anymore.

Checkpoint 33.5.1.

We are storing structs that contain information about stocks:
struct StockInfo {
    string tickerSymbol;
    string companyName;
    int sharesOwned;
    double currentPrice;
};
The ticker symbol is the short code used to identify stocks in listings, while the company name is the full legal name of the company. Thus, we might have a value like this: {tickerSymbol: "GOOG", companyName: "Alphabet Inc.", sharesOwned: 100, currentPrice: 316.45}.
What members should we include in the hash function for this struct?
  • tickerSymbol
  • companyName
  • This is a tough call. If we believe the ticker symbol is unique (which it is), then using just it would be sufficient. And not including the companyName means that we can let that name change without having to worry about rehashing.
    If tickerSymbols were not unique, we might want to include the companyName to ensure a unique identifier.
  • sharesOwned
  • Shares owned is likely to change. And it does not contribute to uniquely identifying the stock.
  • currentPrice
  • Current price is likely to change and does not contribute to uniquely identifying the stock.
    It is also a double which can introduce floating-point precision issues.

Checkpoint 33.5.2.

Which statements are correct about hash functions and equality?
  • If two objects are equal, their hash values must be the same.
  • If two objects have the same hash value, they must be equal.
  • This is not necessarily true. Two objects can have the same hash value (a collision) without being equal. Though it is desirable for hash functions to minimize collisions, they cannot be completely eliminated.
  • If two objects are equal, they should have different hash values.
  • This is incorrect. Equal objects must have the same hash value.
  • If two objects have different hash values, they must not be equal.
  • Equal objects must have the same hash value, so if two objects have different hash values, they cannot be equal.
You have attempted of activities on this page.