
GetHashCode Hands-On Session

The following is a hands-on post demonstrating how the GetHashCode() and Equals() methods are used by the .NET Framework under the hood. For the sake of simplicity I will refer to the popular hash-based Dictionary type, although any other hash-based structure behaves in a similar, if not identical, way.

After reading this post you should be able to spot and resolve potentially problematic behaviors, prevent the creation of unreachable items in dictionaries, and improve the performance of CRUD operations on hash-based structures.

The Theory

GetHashCode() is used to create an integer identifier for objects/structs (not necessarily a unique one, as we will see). The hashcode can be used for two purposes:

  • Programmatically, by developers, to distinguish objects/structs from each other (NOTE: not recommended with the default .NET implementation, as it is not guaranteed to produce the same hash across .NET versions and platforms)
  • Internally, by the .NET Framework, when the object/struct is used as a key in a hash-based list, i.e. a Dictionary. This scenario is the main subject of this article.

Microsoft has supplied both reference and value types with default GetHashCode() implementations. These are designed to provide a generic, "good-enough" solution with almost no knowledge of our particular types at runtime. As a result, the default implementation of GetHashCode() works, but it is not optimized. (The default ValueType.GetHashCode() has an additional issue, discussed later in this post.)

Add and Search Mechanism

When an object/struct is used as a key in a Dictionary, .NET internally calls its GetHashCode() method and calculates an index from the result, based on the length of the internal bucket list.

Although the implementation can vary between .NET versions, the common approach is similar to the following:

int index = key.GetHashCode() % buckets.Length; //* buckets = the internal bucket array
Don't worry, we will review the default implementation later on.

Hash-based structures are composed of buckets that 'hold' the supplied objects. When inserting a new item into a Dictionary, .NET uses the calculated index and tries to place the item at that specific position.

If the calculated index is not yet in use (the bucket is empty), the item is placed smoothly into the bucket (an O(1) operation).

If the calculated index already 'holds' one or more objects, meaning the bucket is already in use, .NET will still add the object to the bucket, but only if an identical key is not already present in that specific bucket. This scenario is called a "collision" (an O(n) operation).

(Figure: hash table collision)

In order to check whether an identical key already exists in a bucket, .NET compares each existing key in the bucket with the new candidate by calling the Equals() method on each of the existing items:

bool exist = alreadyInUseKey1.Equals(newKey1);//* Repeated for every existing key in the bucket

If .NET encounters an equivalence between two keys (exist == true), an ArgumentException ('An item with the same key has already been added') is thrown. Otherwise the item is added under the provided key.
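
Putting these pieces together, the insert path can be sketched as follows (an illustrative, hypothetical SimpleHashMap, not the actual BCL source; the buckets array and Entry type are invented names):

    using System;
    using System.Collections.Generic;

    class SimpleHashMap<TKey, TValue>
    {
        class Entry { public TKey Key; public TValue Value; public int Hash; }

        private List<Entry>[] buckets = new List<Entry>[7];

        public void Add(TKey key, TValue value)
        {
            int hash = key.GetHashCode() & 0x7FFFFFFF;  //* force a non-negative hash
            int index = hash % buckets.Length;          //* reduce it to a bucket index

            if (buckets[index] == null)
                buckets[index] = new List<Entry>();     //* empty bucket: the O(1) path

            foreach (Entry entry in buckets[index])     //* collision: walk the bucket, O(n)
                if (entry.Key.Equals(key))
                    throw new ArgumentException("An item with the same key has already been added.");

            buckets[index].Add(new Entry { Key = key, Value = value, Hash = hash });
        }
    }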

NOTE: Typically, the domain of a hash function (the set of possible keys) is larger than its range (the number of different table indexes), so it will map several different keys to the same index (called a collision).

When searching for an item by its key object, .NET internally retrieves the key's hashcode, recalculates the index, and fetches the object from the bucket at that index position.
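
In the same hypothetical SimpleHashMap sketch, the lookup path recomputes the index and probes only the one relevant bucket:

    //* Inside the same hypothetical SimpleHashMap<TKey, TValue> class
    public TValue Find(TKey key)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF;
        int index = hash % buckets.Length;          //* same reduction as on insert

        if (buckets[index] != null)
            foreach (Entry entry in buckets[index]) //* only this bucket is scanned
                if (entry.Key.Equals(key))
                    return entry.Value;

        throw new KeyNotFoundException("The given key was not present in the dictionary.");
    }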

Optimal Implementation

In order to distinguish objects from each other, we need to make use of all, or part of, their internal values (properties or fields). There are three golden rules we must follow in order to get the most out of hashed lists:

  1. If two objects are equal (as defined by the Equals() method), they must generate the same hash value; otherwise, their hashcodes can't be used to find objects in their corresponding buckets.
    The converse is not required: multiple non-equal objects may share the same hashcode (collisions). This is why we need to override Equals() whenever GetHashCode() is overridden, and vice versa; the two must stay synchronized (see the sketch after this list).
  2. For any object A, A.GetHashCode() must be an instance invariant. No matter what methods are called upon A, A.GetHashCode() must always return the same value. That ensures that an object placed in a bucket will always remain reachable.
  3. To reduce the frequency of hashcode collisions, the hash function should generate a uniform distribution among all integers (32/64 bit) based on all, or part of, the object/struct fields. That’s how you achieve efficiency in a hash-based container.

    Wikipedia: In probability theory and statistics, the discrete uniform distribution is a symmetric probability distribution whereby a finite number of values are equally likely to be observed; every one of n values has equal probability 1/n. Another way of saying "discrete uniform distribution" would be "a known, finite number of outcomes equally likely to happen". A simple example of the discrete uniform distribution is throwing a fair die. The possible values are 1, 2, 3, 4, 5, 6, and each time the die is thrown the probability of a given score is 1/6. If two dice are thrown and their values added, the resulting distribution is no longer uniform since not all sums have equal probability.
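
Here is the sketch promised in rule 1: a hypothetical Person key that overrides Equals() but keeps the default GetHashCode(), so two "equal" keys hash into different buckets and the lookup misses:

    class Person
    {
        public string Name;

        public Person(string name)
        {
            Name = name;
        }

        //* Rule 1 violated: Equals is overridden, GetHashCode is not
        public override bool Equals(object obj)
        {
            return obj is Person other && other.Name == Name;
        }
    }

    private static void RuleOneViolation()
    {
        Dictionary<Person, int> dic = new Dictionary<Person, int>();
        dic.Add(new Person("Ada"), 1);

        //* An "equal" second instance gets a different default hashcode,
        //* so the lookup probes the wrong bucket
        bool found = dic.ContainsKey(new Person("Ada"));//* almost certainly false
    }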

In order to use the large number produced by GetHashCode() as an index, we need to reduce it to the length of the dictionary while preserving the good distribution our GetHashCode() achieved.

For that reason, as explained previously, .NET internally calculates the index by taking the hashcode modulo the length of the list. Incidentally, .NET preserves the original hashcode so it can be reused if the Dictionary's capacity is increased at runtime; in that case .NET recalculates every item's index against the new length and repositions the items accordingly.
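
Sticking with the hypothetical SimpleHashMap sketch, a resize reuses each entry's stored hash and only recomputes the bucket index against the new length:

    //* Inside the same hypothetical SimpleHashMap<TKey, TValue> class
    private void Resize(int newSize)
    {
        List<Entry>[] newBuckets = new List<Entry>[newSize];
        foreach (List<Entry> bucket in buckets)
        {
            if (bucket == null) continue;
            foreach (Entry entry in bucket)
            {
                int index = entry.Hash % newSize;   //* stored hash reused, index recomputed
                if (newBuckets[index] == null)
                    newBuckets[index] = new List<Entry>();
                newBuckets[index].Add(entry);
            }
        }
        buckets = newBuckets;
    }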

A common and successful algorithm is to XOR the return values of GetHashCode() for all fields in a type, as sketched below. If your type contains mutable fields, exclude them from the calculation. At this point it should be clear that a poor hashcode algorithm can cause performance issues and even unreachable items (see the samples in the "Hands On" section below).
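
For example, a hypothetical OrderKey struct applying the XOR approach over its immutable fields might look like this:

    struct OrderKey
    {
        private readonly int id;        //* Immutable fields only
        private readonly string region;

        public OrderKey(int id, string region)
        {
            this.id = id;
            this.region = region;
        }

        public override bool Equals(object obj)
        {
            return obj is OrderKey other && other.id == id && other.region == region;
        }

        //* XOR the hashcodes of all (immutable) fields
        public override int GetHashCode()
        {
            return id.GetHashCode() ^ (region == null ? 0 : region.GetHashCode());
        }
    }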

Default .NET Implementation

As mentioned before, the default GetHashCode() and Equals() implementations supplied by Microsoft are mediocre:

System.Object does not follow rule 3, because it uses a simple sequential integer. So when you create reference types that are meant to be used as hash keys, you should override GetHashCode() to get a better distribution of hash values across all integers for your specific type.
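
For example, a hypothetical reference-type key (CustomerKey) could combine its fields instead (HashCode.Combine is available from .NET Core 2.1 onward; on .NET Framework you would XOR the field hashcodes manually, as shown earlier):

    class CustomerKey
    {
        private readonly int region;
        private readonly string code;

        public CustomerKey(int region, string code)
        {
            this.region = region;
            this.code = code;
        }

        public override bool Equals(object obj)
        {
            return obj is CustomerKey other && other.region == region && other.code == code;
        }

        //* Distributes hash values based on the fields instead of Object's sequential integer
        public override int GetHashCode()
        {
            return HashCode.Combine(region, code);
        }
    }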

System.ValueType overrides Object's implementation and calculates the hash based on the type's first field. Rules 2 and 3 therefore hinge on that first field: it must be well distributed and immutable.

NOTE: I took the description above of the System.ValueType implementation from the book "Effective C#". Several tests later, I now understand that this is not always the case; .NET apparently uses a more complex algorithm. In this article we will follow the book's description.

Hands On

The following examples demonstrate how small changes can lead to big differences where hash-based structures are concerned:

Default Value Implementation

Consider a custom value-type (struct) key whose first field (id) is always different:

    
    struct MyKey
    {
        private int id;
        private string msg;
        public MyKey(int id)
        {
            this.id = id;
            this.msg = "Hello";     
        }
    }

    private static void AddDic()
    {
        Dictionary<MyKey, int> list = new Dictionary<MyKey, int>();
        for (int i = 0; i < 100; i++)
            list.Add( new MyKey(i), i);

    } 

The code above takes ~8 seconds on my 4-core, 8 GB RAM machine.

Let's say that, in time, a colleague makes a small refactor:

    
    struct MyKey
    {
        private string msg;//* Now first field in the struct
        private int id;
        public MyKey(int id)
        {
            this.id = id;
            this.msg = "Hello";     
        }
    }

Even though she only changed the order of the fields, the code now takes ~30 seconds, because the first field of the struct now has a fixed value ("Hello"), so every key produces the same hash and all the items are inserted into the same bucket. One way to make the key immune to this kind of reordering is to override GetHashCode() yourself, as shown in the next section.

Custom Equals & GetHashCode

The same applies to custom overridden methods. We need to make sure the hashcode is well distributed, otherwise we will suffer from the same behavior.

Let's consider a similar struct key implementation:


    
    struct MyKey
    {
        private int id;
        public MyKey(int id)
        {
            this.id = id;
        }

        public override bool Equals(object obj)
        {
            int external = ((MyKey)obj).id;
            Console.WriteLine("Equals - Internal Object ID:" + id + " External Object ID: " + external);
            return external == id;
        }

        public override int GetHashCode()
        {
            int hash = 1;//* Always the same value
            Console.WriteLine("GetHashCode - Internal Object ID:" + id + " HashCode :" + hash);
            return hash;
        }
    }

The first sample makes use of the above key (instrumented with Console logging so you can follow the calls in the output below), which generates an identical hashcode for every instance.

    
        private static void AddDic()
        {
            Dictionary<MyKey, int> list = new Dictionary<MyKey, int>();

            double beforems = DateTime.Now.TimeOfDay.TotalMilliseconds;
            Console.WriteLine("Before Insert: " + beforems);
            for (int i = 0; i < 100; i++)
            {
                list.Add(new MyKey(i), i);
                Console.WriteLine("----");
            }
            double afterms = DateTime.Now.TimeOfDay.TotalMilliseconds;
            Console.WriteLine("After Insert: " + afterms);

            Console.WriteLine("Total: " + (afterms - beforems));

            Console.ReadKey();
        }
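
As a side note, System.Diagnostics.Stopwatch would be a more precise way to take this kind of measurement than DateTime.Now:

        Dictionary<MyKey, int> list = new Dictionary<MyKey, int>();

        Stopwatch watch = Stopwatch.StartNew();//* using System.Diagnostics;
        for (int i = 0; i < 100; i++)
            list.Add(new MyKey(i), i);
        watch.Stop();

        Console.WriteLine("Total: " + watch.ElapsedMilliseconds + " ms");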

Adding new items to the Dictionary is very expensive, as mentioned before. When an identical hashcode is encountered, .NET has to determine equality against all the existing keys by looping through all the items in the specific bucket. This innocent loop of 100 items suffers from extremely low performance just because the hashcode of the custom key is always the same.

Output (excerpt):

Equals - Internal Object ID:15 External Object ID: 16
Equals - Internal Object ID:14 External Object ID: 16
Equals - Internal Object ID:13 External Object ID: 16
Equals - Internal Object ID:12 External Object ID: 16
Equals - Internal Object ID:11 External Object ID: 16
Equals - Internal Object ID:10 External Object ID: 16
Equals - Internal Object ID:9 External Object ID: 16
Equals - Internal Object ID:8 External Object ID: 16
Equals - Internal Object ID:7 External Object ID: 16
Equals - Internal Object ID:6 External Object ID: 16
Equals - Internal Object ID:5 External Object ID: 16
Equals - Internal Object ID:4 External Object ID: 16
Equals - Internal Object ID:3 External Object ID: 16
Equals - Internal Object ID:2 External Object ID: 16
Equals - Internal Object ID:1 External Object ID: 16
Equals - Internal Object ID:0 External Object ID: 16
----
GetHashCode - Internal Object ID:17 HashCode :1
Equals - Internal Object ID:16 External Object ID: 17
Equals - Internal Object ID:15 External Object ID: 17
Equals - Internal Object ID:14 External Object ID: 17
Equals - Internal Object ID:13 External Object ID: 17
Equals - Internal Object ID:12 External Object ID: 17
Equals - Internal Object ID:11 External Object ID: 17
Equals - Internal Object ID:10 External Object ID: 17
Equals - Internal Object ID:9 External Object ID: 17
Equals - Internal Object ID:8 External Object ID: 17
Equals - Internal Object ID:7 External Object ID: 17
Equals - Internal Object ID:6 External Object ID: 17
Equals - Internal Object ID:5 External Object ID: 17
Equals - Internal Object ID:4 External Object ID: 17
Equals - Internal Object ID:3 External Object ID: 17
Equals - Internal Object ID:2 External Object ID: 17
Equals - Internal Object ID:1 External Object ID: 17
Equals - Internal Object ID:0 External Object ID: 17
----
GetHashCode - Internal Object ID:18 HashCode :1
Equals - Internal Object ID:17 External Object ID: 18
Equals - Internal Object ID:16 External Object ID: 18
Equals - Internal Object ID:15 External Object ID: 18
Equals - Internal Object ID:14 External Object ID: 18
Equals - Internal Object ID:13 External Object ID: 18
Equals - Internal Object ID:12 External Object ID: 18
Equals - Internal Object ID:11 External Object ID: 18
Equals - Internal Object ID:10 External Object ID: 18
Equals - Internal Object ID:9 External Object ID: 18
Equals - Internal Object ID:8 External Object ID: 18
Equals - Internal Object ID:7 External Object ID: 18
Equals - Internal Object ID:6 External Object ID: 18
Equals - Internal Object ID:5 External Object ID: 18
Equals - Internal Object ID:4 External Object ID: 18
Equals - Internal Object ID:3 External Object ID: 18
Equals - Internal Object ID:2 External Object ID: 18
Equals - Internal Object ID:1 External Object ID: 18
Equals - Internal Object ID:0 External Object ID: 18
----

We can see that every newly added object (17 and 18) is tested against each existing item in the dictionary.

By changing our implementation (though still not an optimal one) to return a distinct value for each item, we dramatically increase the overall performance.

  
    struct MyKey
    {
        private int id;
        public MyKey(int id)
        {
            this.id = id;
        }
              
        public override bool Equals(object obj)
        {
            int external = ((MyKey)obj).id;
            return external == id;
        }

        public override int GetHashCode()
        {
            int hash = id;//* A distinct value per key
            return hash;
        }
    }

An additional common mistake occurs when our GetHashCode() implementation is based on a mutable field, for example a msg field. In that case items can become unreachable if the field is modified at runtime.
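
The exact MyKeyStruct definition is assumed here; a minimal sketch consistent with the sample, with GetHashCode() based on the mutable msg field, would be:

    struct MyKeyStruct
    {
        private int id;
        public string msg;//* Public and mutable

        public MyKeyStruct(int id)
        {
            this.id = id;
            this.msg = "Hello";
        }

        public override bool Equals(object obj)
        {
            MyKeyStruct other = (MyKeyStruct)obj;
            return other.id == id && other.msg == msg;
        }

        public override int GetHashCode()
        {
            return msg.GetHashCode();//* Based on a mutable field - dangerous!
        }
    }

With that key in place, the following code loses item '1':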

  
         private static void ChangeDic()
        {
            Dictionary<MyKeyStruct, int> list = new Dictionary<MyKeyStruct, int>();

            MyKeyStruct one = new MyKeyStruct(1);
            MyKeyStruct oneTwo = new MyKeyStruct(2);
            MyKeyStruct oneThree = new MyKeyStruct(3);
            MyKeyStruct oneFour = new MyKeyStruct(4);
            MyKeyStruct oneFive = new MyKeyStruct(5);
            MyKeyStruct oneSix = new MyKeyStruct(6);

            list.Add(one, 1);
            Console.WriteLine( "----" );
            list.Add(oneTwo, 2);
            Console.WriteLine( "----" );
            list.Add(oneThree, 3);
            Console.WriteLine( "----" );
            list.Add(oneFour, 4);
            Console.WriteLine( "----" );
            list.Add(oneFive, 5);
            Console.WriteLine( "----" );
            list.Add(oneSix, 6);

            one.msg = "x";//* Change the msg value

            var item = list[one];//* Exception: The given key was not present in the dictionary.

            Console.ReadKey();
        } 

The item '1' is lost somewhere in the hash map: after it was stored in a specific bucket, we changed the msg field, causing GetHashCode() to produce a new hash that points to a different bucket. The next time the 'one' key is used, the new hashcode leads to a different bucket, and item '1' is never found.

Summary

The default Object.GetHashCode() works correctly for reference types.
ValueType.GetHashCode() works only if the first field in your struct is immutable.

Neither of them necessarily generates a well-distributed hashcode, and any implementation you supply must be based on immutable values.

ValueType.GetHashCode() generates an efficient hash code only when the first field in your struct holds values drawn from a meaningful subset of its inputs.
