Understanding Consistent Hashing

13 min read
go · distributed-systems · load-balancing · algorithms · systems-programming

When I first learned about consistent hashing, I thought I understood it. Hash keys to nodes, use a ring structure, minimize redistribution when nodes change—simple enough. Then I tried to use it in a real system and watched in horror as one server got hammered with requests while others sat idle.

That's when I discovered the missing piece: bounded loads. Consistent hashing solves the redistribution problem beautifully, but it doesn't guarantee balanced load distribution. This is why I built smol-hash, a minimal implementation that demonstrates how to add load bounds to consistent hashing.

The Problem with Regular Hashing

Before we talk about consistent hashing, let's understand why we need it in the first place.

Imagine you have three cache servers and want to distribute keys across them. The naive approach is simple: server_index = hash(key) % num_servers. If you hash "user:123", you get some number, mod it by 3, and that's your server.

This works great until you add or remove a server. Suddenly, the modulo changes and most keys map to a different server. If you had a million cached items and dropped from three servers to two, roughly 667,000 of them (about two-thirds) would need to move. Your cache hit rate crashes, and you're effectively starting from scratch.

For a distributed cache serving production traffic, this is catastrophic.
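If you want to see this for yourself, here's a small self-contained sketch (not part of smol-hash) that counts how many of a million keys change servers when you drop from three servers to two under plain modulo hashing:

package main

import (
    "fmt"
    "hash/fnv"
)

func hashKey(key string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(key))
    return h.Sum32()
}

func main() {
    const numKeys = 1000000
    moved := 0
    for i := 0; i < numKeys; i++ {
        h := hashKey(fmt.Sprintf("user:%d", i))
        // The key moves if its server index changes when we go from 3 servers to 2.
        if h%3 != h%2 {
            moved++
        }
    }
    fmt.Printf("%d of %d keys moved (%.1f%%)\n", moved, numKeys, 100*float64(moved)/numKeys)
}

For a roughly uniform hash this lands right around two-thirds, which is exactly the problem consistent hashing exists to fix.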

Enter Consistent Hashing

Consistent hashing solves this problem elegantly. Instead of mapping keys directly to servers, we map both keys and servers onto a circular hash space—a "ring."

Here's how it works:

  1. Hash each server name to get positions on the ring
  2. Hash each key to get a position on the ring
  3. Walk clockwise from the key's position until you hit a server
  4. That server owns the key

The brilliance: when you add or remove a server, only the keys between that server and the previous one need to move. With three servers and a million keys, removing one server only affects about 333,333 keys (one-third), not 666,667.
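In code, "walk clockwise" usually means a binary search over a sorted slice of virtual-node hashes. Here's a minimal sketch of that lookup (reusing the hashKey helper from the sketch above plus the standard sort package; it leaves out the load-bound logic smol-hash adds later):

// ring is a sorted slice of virtual-node hashes;
// ringMap maps each hash back to its server name.
func lookup(ring []uint32, ringMap map[uint32]string, key string) string {
    h := hashKey(key)
    // Find the first virtual node at or clockwise from the key's position.
    idx := sort.Search(len(ring), func(i int) bool { return ring[i] >= h })
    if idx == len(ring) {
        idx = 0 // walked past the end, wrap around the ring
    }
    return ringMap[ring[idx]]
}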

But there's a catch.

The Hotspot Problem

When I first implemented basic consistent hashing, I ran into an issue immediately. With three servers on the ring, the keys weren't distributed evenly. One server got 50% of the load, another got 30%, and the third got 20%. Random hash distribution is... random.

The standard solution is virtual nodes (replicas). Instead of placing each server once on the ring, you place it many times with different hash values. Server1 becomes Server1#0, Server1#1, Server1#2, and so on. More replicas = smoother distribution.

I implemented this in smol-hash:

// Add virtual nodes (replicas) to the ring
for i := 0; i < ch.replicas; i++ {
    virtualKey := fmt.Sprintf("%s#%d", nodeName, i)
    hashVal := ch.hash(virtualKey)
    ch.ring = append(ch.ring, hashVal)
    ch.ringMap[hashVal] = nodeName
}
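One detail that snippet leaves out: after appending the replicas, the ring has to be re-sorted so lookups can binary-search it, and the physical node needs a load counter. A rough sketch of how the rest of AddNode might finish (my assumption, not a verbatim excerpt from smol-hash):

// Register the physical node with a zero load counter,
// then keep the ring sorted so GetNode can binary-search it.
ch.nodes[nodeName] = &NodeInfo{load: 0}
sort.Slice(ch.ring, func(i, j int) bool { return ch.ring[i] < ch.ring[j] })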

With 150 replicas per server, distribution improved dramatically. But I still had a problem.

The Real-World Scenario

Here's what happened when I tested it with realistic data: some keys are hot. In a cache serving web traffic, certain users or endpoints get hit way more than others. Maybe a celebrity's profile page gets 10,000 requests per second while a random user gets 10.

With consistent hashing, all those celebrity profile requests hit the same server—the one that owns that key's position on the ring. That server melts down while others idle.

Virtual nodes help with overall distribution, but they don't help with hot keys. If "celebrity:profile" hashes to Server1, all 10,000 requests per second go to Server1, no matter how many replicas you have.

This is the bounded load problem.

Bounded Loads: The Missing Piece

The solution comes from a 2017 Google paper: "Consistent Hashing with Bounded Loads." The idea is elegant: track how many keys each server currently handles, and refuse to assign more keys to a server that's above a threshold.

The threshold is based on average load:

max_load = average_load × load_factor

If you have 3 servers, 15 keys, and a load factor of 1.25:

  • Average load: 15 / 3 = 5 keys per server
  • Max load: 5 × 1.25 = 6.25, truncated to 6 keys

No server can have more than 6 keys. When a key would normally go to an overloaded server, you walk clockwise to find the next available one.

Implementing Bounded Loads

The core logic happens in GetNode(). Here's the key part from smol-hash:

func (ch *ConsistentHash) GetNode(key string) (string, error) {
    if len(ch.nodes) == 0 {
        return "", fmt.Errorf("no nodes in the ring")
    }

    // Calculate average load and max allowed load
    totalNodes := len(ch.nodes)
    totalLoad := 0
    for _, node := range ch.nodes {
        totalLoad += node.load
    }
    
    avgLoad := float64(totalLoad+1) / float64(totalNodes) // +1 for the new key
    maxLoad := int(avgLoad * ch.loadFactor)

    // Hash the key
    hashVal := ch.hash(key)
    idx := ch.search(hashVal)

    // Search for a node with available capacity
    for i := 0; i < len(ch.ring); i++ {
        currIdx := (idx + i) % len(ch.ring)
        nodeName := ch.ringMap[ch.ring[currIdx]]
        node := ch.nodes[nodeName]

        // Check if this node is under the load limit. (maxLoad can truncate
        // to 0 when the ring is nearly empty; accept the node in that case.)
        if node.load < maxLoad || maxLoad == 0 {
            node.load++
            return nodeName, nil
        }
    }

    // If all nodes are at capacity, return the originally hashed node
    nodeName := ch.ringMap[ch.ring[idx%len(ch.ring)]]
    ch.nodes[nodeName].load++
    return nodeName, nil
}

The critical insight: we don't just hash and assign. We hash to find a starting point, then search clockwise until we find a server with capacity. This preserves consistent hashing's redistribution benefits while preventing overload.

The Load Factor Trade-off

Choosing the load factor is an interesting balance. Here's what I learned:

Load Factor 1.0: Perfect balance. Each server has exactly the average number of keys (or as close as possible). Sounds ideal, but it's fragile. In real systems, perfect balance is hard to maintain, and you end up rejecting keys or forcing them onto already-full servers.

Load Factor 1.25 (default): Each server can have up to 125% of average load. This gives enough flexibility for uneven distribution while still preventing major imbalances. Most production systems use something in this range.

Load Factor 1.5+: More relaxed. Servers can vary significantly from the average. Good for high-throughput scenarios where you care more about accepting all keys than perfect balance.

I chose 1.25 as the default because it hits the sweet spot for most use cases.

Virtual Nodes: How Many?

The number of replicas (virtual nodes) also matters. Too few and you get uneven distribution. Too many and you waste memory and CPU on the ring structure.

I tested different values:

50 replicas: Noticeable imbalance. Some servers consistently got 20% more load than others.

150 replicas (my default): Good balance. Load typically within 5-10% of perfect distribution.

500 replicas: Excellent balance, but the ring structure grows large. For most applications, the improvement over 150 isn't worth the memory cost.

For smol-hash, 150 replicas provides a good balance. Production systems might tune this based on their specific needs.
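If you want to reproduce that comparison, a rough sketch using the same API shown elsewhere in this post (NewConsistentHash, AddNode, GetNode, and the GetStats helper from the usage example) looks like this:

// Compare how evenly 30,000 keys spread across three servers
// for different virtual-node counts.
for _, replicas := range []int{50, 150, 500} {
    ch := NewConsistentHash(replicas, 1.25)
    ch.AddNode("server1")
    ch.AddNode("server2")
    ch.AddNode("server3")

    for i := 0; i < 30000; i++ {
        ch.GetNode(fmt.Sprintf("key:%d", i))
    }

    fmt.Printf("replicas=%d -> %v\n", replicas, ch.GetStats())
}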

Thread Safety: The Tricky Part

One challenge I didn't anticipate: concurrent access. Multiple goroutines might call GetNode() simultaneously, and we're modifying load counters.

I used a sync.RWMutex to handle this:

type ConsistentHash struct {
    mu           sync.RWMutex
    ring         []uint32
    ringMap      map[uint32]string
    nodes        map[string]*NodeInfo
    replicas     int
    loadFactor   float64
}

func (ch *ConsistentHash) GetNode(key string) (string, error) {
    ch.mu.Lock()
    defer ch.mu.Unlock()
    // ... rest of implementation
}

The mutex protects all shared state. For read-heavy workloads, you could optimize this with more granular locking or lock-free data structures, but for learning purposes, a single mutex keeps the code simple and correct.
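Read-only helpers can get away with the read lock. GetStats isn't shown in the snippets above, but a minimal version (my sketch, assuming NodeInfo just carries a load counter) might look like:

// GetStats returns a snapshot of how many keys each node currently holds.
func (ch *ConsistentHash) GetStats() map[string]int {
    ch.mu.RLock()
    defer ch.mu.RUnlock()

    stats := make(map[string]int, len(ch.nodes))
    for name, node := range ch.nodes {
        stats[name] = node.load
    }
    return stats
}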

Using smol-hash

Here's how you'd use it in practice:

package main

import (
    "fmt"
)

func main() {
    // Create hash ring: 150 virtual nodes, 1.25x load factor
    ch := NewConsistentHash(150, 1.25)

    // Add cache servers
    ch.AddNode("cache-1.example.com")
    ch.AddNode("cache-2.example.com")
    ch.AddNode("cache-3.example.com")

    // Assign keys
    keys := []string{
        "user:1001", "user:1002", "session:abc",
        "cart:xyz", "profile:celebrity",
    }

    for _, key := range keys {
        server, _ := ch.GetNode(key)
        fmt.Printf("Key %s -> %s\n", key, server)
    }

    // Check distribution
    stats := ch.GetStats()
    for server, load := range stats {
        fmt.Printf("%s: %d keys\n", server, load)
    }
}

The API is deliberately minimal. Create a ring, add nodes, get assignments. That's it.
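To see the "consistent" part in action, compare assignments before and after adding a node. Because GetNode mutates load counters, the simplest sketch builds two separate rings and diffs them (same API as above, nothing new assumed):

// Count how many of 10,000 keys land on a different server
// once a fourth cache node joins.
before := NewConsistentHash(150, 1.25)
after := NewConsistentHash(150, 1.25)
for _, n := range []string{"cache-1", "cache-2", "cache-3"} {
    before.AddNode(n)
    after.AddNode(n)
}
after.AddNode("cache-4")

moved := 0
for i := 0; i < 10000; i++ {
    key := fmt.Sprintf("user:%d", i)
    a, _ := before.GetNode(key)
    b, _ := after.GetNode(key)
    if a != b {
        moved++
    }
}
// Expect roughly a quarter of the keys to move, plus a little extra
// churn from the load bounds. Far better than the modulo approach.
fmt.Printf("%d of 10000 keys moved\n", moved)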

When to Use This Pattern

Consistent hashing with bounded loads shines in specific scenarios:

Distributed caching: Route cache keys to Redis/Memcached instances. When an instance fails, only its keys need redistribution, not everything.

Database sharding: Distribute data across database shards. Add new shards without reshuffling all your data.

Content delivery: Route requests to CDN edge servers based on content hash, with load balancing.

Service mesh: Distribute microservice instances, handling failover gracefully.

The common thread: you need both consistent distribution (minimize moves when topology changes) and balanced load (prevent hotspots).

Testing Distribution Quality

I wrote a simple test to verify load distribution:

func TestLoadDistribution(t *testing.T) {
    ch := NewConsistentHash(150, 1.25)
    
    ch.AddNode("server1")
    ch.AddNode("server2")
    ch.AddNode("server3")

    // Assign 300 keys
    for i := 0; i < 300; i++ {
        key := fmt.Sprintf("key:%d", i)
        ch.GetNode(key)
    }

    stats := ch.GetStats()
    
    // Each server should have ~100 keys (± 25 due to load factor 1.25)
    for server, load := range stats {
        if load < 75 || load > 125 {
            t.Errorf("Server %s has poor distribution: %d keys", server, load)
        }
    }
}

With 150 replicas and a 1.25 load factor, distribution consistently stayed within 5-10% of perfect balance across test runs.

Performance Characteristics

The algorithm's complexity:

Adding a node: O(R × N log(R × N)) with the sorted-slice ring used here, since R virtual nodes are appended and the ring of R × N entries is re-sorted (a tree-based ring would bring this down to about O(R log(R × N))). Either way, it only happens when topology changes.

Getting a node: a binary search finds the starting position on the ring in O(log(R × N)); in the worst case, when every node is at capacity, the loop walks all R × N virtual nodes. In practice, with a sensible load factor, it's essentially just the binary search.

Space complexity: O(R × N) for storing virtual nodes.

For typical use cases (dozens of nodes, hundreds of replicas), these characteristics are perfectly acceptable. smol-hash isn't optimized for thousands of nodes, but neither are most consistent hashing implementations.
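If you want numbers rather than big-O, a standard Go benchmark is enough to sanity-check lookup cost. Here's a quick sketch you could drop next to the tests:

func BenchmarkGetNode(b *testing.B) {
    ch := NewConsistentHash(150, 1.25)
    for i := 0; i < 10; i++ {
        ch.AddNode(fmt.Sprintf("server-%d", i))
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ch.GetNode(fmt.Sprintf("key:%d", i))
    }
}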

Lessons Learned

Building smol-hash clarified several things for me:

Consistent hashing alone isn't enough: The algorithm is elegant, but real systems need load balancing too. The "consistent" part solves redistribution, not balance.

Virtual nodes are crucial: Without replicas, distribution is terrible. With them, it's quite good. The difference is dramatic.

Load bounds prevent hotspots: Tracking load and refusing to overload servers is simple but powerful. It's the difference between theory and practice.

Trade-offs everywhere: Number of replicas, load factor, locking granularity—every choice is a trade-off between competing concerns.

Simple implementations teach best: By keeping the code straightforward, the algorithm's core ideas stayed visible. Optimization can come later.

Try It Yourself

Want to experiment with smol-hash? Here's how:

git clone https://github.com/smol-go/smol-hash.git
cd smol-hash
go run main.go

The example in main.go demonstrates adding nodes, assigning keys, checking load distribution, and dynamic scaling. Fork it, modify it, break it, fix it. That's how you really learn.

Final Thoughts

Is smol-hash production-ready? No. Should you use it instead of a mature library? Definitely not.

But building it taught me more about distributed systems in a weekend than reading papers ever could. The moment when I added bounded loads and watched the load distribution smooth out was genuinely satisfying.

If you've ever wondered how distributed caches work or how database sharding handles node failures, I encourage you to build something like this. The concepts that seem abstract in distributed systems textbooks become concrete when you're debugging why your hash ring isn't balancing load properly.


Check out the full source code on GitHub, including tests and examples demonstrating load distribution.