Demystifying Parsers: Writing a JSON Parser
When I decided to build a JSON parser from scratch, I wasn't trying to replace Go's excellent encoding/json package. Instead, I wanted to understand how parsers actually work—the kind of understanding you only get by implementing one yourself.
Parsing is one of those computer science fundamentals that feels almost magical until you build your own. How does a parser take a string of characters and turn it into structured data? Let's find out.
Why Build a JSON Parser?
JSON parsing is everywhere in modern software development. Every time you make an API call or read a config file, there's probably a JSON parser working behind the scenes. But most developers (myself included, before this project) treat parsers as black boxes.
I chose JSON specifically because:
- The spec is well-defined and surprisingly simple
- It's practical—we use JSON daily
- It's complex enough to demonstrate real parsing techniques
- It's simple enough to implement in a weekend
The result is smol-parser: a complete JSON parser that handles strings, numbers, booleans, nulls, arrays, and objects. It even supports escape sequences and Unicode. And it all fits in about 400 lines of Go.
The Two-Stage Dance: Lexing and Parsing
Most parsers work in two distinct phases, and understanding this separation was my first key insight:
- Lexing (lexical analysis): Breaking raw text into meaningful tokens
- Parsing (syntactic analysis): Organizing those tokens into data structures
Think of it like reading a sentence. First, you identify individual words (lexing). Then you understand how those words relate to each other grammatically (parsing).
Here's what that looks like for {"name": "Alice"}:
Input characters: { " n a m e " : " A l i c e " }
↓ Lexing
Tokens: LEFT_BRACE, STRING("name"), COLON, STRING("Alice"), RIGHT_BRACE
↓ Parsing
Structure: map[string]interface{}{"name": "Alice"}
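Token names like LEFT_BRACE and STRING map naturally onto a small token type. The snippets later in this post use names like TokenLeftBrace and TokenString, so here's a minimal sketch of definitions along those lines (the exact set and spelling in smol-parser may differ):

type TokenType int

const (
    TokenLeftBrace TokenType = iota // {
    TokenRightBrace                 // }
    TokenLeftBracket                // [
    TokenRightBracket               // ]
    TokenColon                      // :
    TokenComma                      // ,
    TokenString                     // "..."
    TokenNumber                     // 123, 1.5e-10, ...
    TokenTrue                       // true
    TokenFalse                      // false
    TokenNull                       // null
    TokenEOF                        // end of input
)

type Token struct {
    Type  TokenType
    Value string // literal text or decoded string contents
}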
Building the Lexer: Character by Character
The lexer is where I started. Its job is conceptually simple: read characters one at a time and group them into tokens. The tricky part is in the details.
I structured my lexer as a state machine that tracks its position:
type Lexer struct {
    input string // raw JSON text
    pos   int    // index of the next byte to read
    ch    byte   // byte currently under examination (0 once we hit the end)
}
The readChar() method advances through the input, while peekChar() lets us look ahead without committing. That lookahead is crucial for decisions like whether the characters after 123 continue the number (as in 123.456) or end it.
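Neither method is shown in this post, but given how readNumber() uses pos later on, they probably look something like this (a sketch, assuming pos always points at the next unread byte):

// readChar loads the byte at pos into ch and advances pos.
// ch becomes 0 once we've run past the end of the input.
func (l *Lexer) readChar() {
    if l.pos >= len(l.input) {
        l.ch = 0
    } else {
        l.ch = l.input[l.pos]
    }
    l.pos++
}

// peekChar reports the next byte without consuming it.
func (l *Lexer) peekChar() byte {
    if l.pos >= len(l.input) {
        return 0
    }
    return l.input[l.pos]
}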
The String Challenge
Parsing strings turned out to be trickier than I expected. JSON strings support escape sequences like \n and \t, but also Unicode escapes like \u0048 (which represents 'H').
My readString() method handles this with a state machine inside the string reader:
func (l *Lexer) readString() (string, error) {
    var result []rune
    l.readChar() // skip opening "
    for l.ch != '"' && l.ch != 0 {
        if l.ch == '\\' {
            l.readChar()
            switch l.ch {
            case '"', '\\', '/':
                result = append(result, rune(l.ch))
            case 'n':
                result = append(result, '\n')
            // ... more escape sequences
            case 'u':
                // Handle \uXXXX Unicode escapes
                hex := ""
                for i := 0; i < 4; i++ {
                    l.readChar()
                    hex += string(l.ch)
                }
                val, err := strconv.ParseInt(hex, 16, 32)
                if err != nil {
                    return "", fmt.Errorf("invalid unicode escape \\u%s", hex)
                }
                result = append(result, rune(val))
                // the readChar at the bottom of the loop consumes the last hex digit
            }
        } else {
            result = append(result, rune(l.ch))
        }
        l.readChar()
    }
    return string(result), nil
}
The Unicode handling was particularly satisfying to get right. Converting four hex digits into an actual character felt like cracking a small puzzle.
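The conversion itself is only a couple of standard-library calls. Here's the idea in isolation (a standalone demo, not code from smol-parser):

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // "\u0048" arrives at the lexer as the four hex digits "0048".
    val, err := strconv.ParseInt("0048", 16, 32)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%c\n", rune(val)) // prints H
}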
Numbers: More Complex Than They Look
JSON numbers follow a specific format: optional minus, digits, optional decimal point and more digits, optional exponent. My initial attempt just grabbed "digit-looking characters," which broke on scientific notation like 1.5e-10.
The solution was methodical: handle each part of the number format in sequence. First the optional minus, then the integer part (with special handling for leading zeros), then the optional fractional part, then the optional exponent:
func (l *Lexer) readNumber() string {
    start := l.pos - 1
    if l.ch == '-' {
        l.readChar()
    }
    // Integer part
    if l.ch == '0' {
        l.readChar()
    } else {
        for unicode.IsDigit(rune(l.ch)) {
            l.readChar()
        }
    }
    // Fractional part
    if l.ch == '.' {
        l.readChar()
        for unicode.IsDigit(rune(l.ch)) {
            l.readChar()
        }
    }
    // Exponent
    if l.ch == 'e' || l.ch == 'E' {
        l.readChar()
        if l.ch == '+' || l.ch == '-' {
            l.readChar()
        }
        for unicode.IsDigit(rune(l.ch)) {
            l.readChar()
        }
    }
    return l.input[start : l.pos-1]
}
This approach mirrors the number grammar in the JSON specification, which means it correctly handles edge cases like -0 and 1e+10.
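Note that the lexer only has to slice out the right substring; converting it to an actual number is delegated to strconv.ParseFloat later in the parser (see parseValue below), which accepts all of these forms. A quick standalone check:

package main

import (
    "fmt"
    "strconv"
)

func main() {
    for _, s := range []string{"-0", "1e+10", "1.5e-10", "123.456"} {
        f, err := strconv.ParseFloat(s, 64)
        fmt.Println(s, "->", f, err) // every one parses without error
    }
}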
The Parser: Making Sense of Tokens
Once the lexer produces tokens, the parser's job is to build actual Go data structures. I used a technique called recursive descent parsing, which is both elegant and intuitive once you understand it.
The core idea: each JSON grammar rule becomes a function, and these functions call each other recursively. Here's my parser structure:
type Parser struct {
    lexer    *Lexer
    curToken Token
}
The parser always maintains a "current token" and an advance() method to move forward. Every parsing function follows the same pattern: look at the current token, decide what to do, consume tokens, and return a value.
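advance() itself isn't shown in this post; assuming the lexer exposes a NextToken() method (a name I'm guessing at), it's essentially a one-liner:

// advance pulls the next token from the lexer into curToken.
func (p *Parser) advance() {
    p.curToken = p.lexer.NextToken()
}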
Recursive Descent in Action
The parseValue() method is the heart of the parser. It examines the current token and delegates to specialized functions:
func (p *Parser) parseValue() (interface{}, error) {
    switch p.curToken.Type {
    case TokenLeftBrace:
        return p.parseObject()
    case TokenLeftBracket:
        return p.parseArray()
    case TokenString:
        val := p.curToken.Value
        p.advance()
        return val, nil
    case TokenNumber:
        val, _ := strconv.ParseFloat(p.curToken.Value, 64)
        p.advance()
        return val, nil
    // ... more cases
    }
}
What makes this "recursive descent" is what happens in parseObject() and parseArray(). These functions call back to parseValue(), creating a recursive structure that naturally handles nesting.
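parseArray() isn't reproduced in this post, but it follows the same shape as the parseObject() shown in the next section. A sketch, assuming a TokenRightBracket type to match TokenLeftBracket:

func (p *Parser) parseArray() ([]interface{}, error) {
    arr := []interface{}{}
    p.advance() // skip [
    // Handle empty array
    if p.curToken.Type == TokenRightBracket {
        p.advance()
        return arr, nil
    }
    for {
        // Each element is a full JSON value, so recurse
        val, err := p.parseValue()
        if err != nil {
            return nil, err
        }
        arr = append(arr, val)
        // Check for end or continuation
        if p.curToken.Type == TokenRightBracket {
            p.advance()
            return arr, nil
        }
        if p.curToken.Type != TokenComma {
            return nil, fmt.Errorf("expected comma or closing bracket")
        }
        p.advance()
    }
}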
Objects: The Tricky Part
Parsing objects taught me about the complexity hiding in simple syntax. An object is just {key: value, key: value}, right? But there are edge cases: empty objects, trailing commas (not allowed in JSON!), ensuring keys are strings, handling deeply nested structures.
My parseObject() implementation handles all this:
func (p *Parser) parseObject() (map[string]interface{}, error) {
    obj := make(map[string]interface{})
    p.advance() // skip {
    // Handle empty object
    if p.curToken.Type == TokenRightBrace {
        p.advance()
        return obj, nil
    }
    for {
        // Key must be a string
        if p.curToken.Type != TokenString {
            return nil, fmt.Errorf("expected string key")
        }
        key := p.curToken.Value
        p.advance()
        // Expect colon
        if p.curToken.Type != TokenColon {
            return nil, fmt.Errorf("expected colon after key")
        }
        p.advance()
        // Parse value (this is where recursion happens!)
        val, err := p.parseValue()
        if err != nil {
            return nil, err
        }
        obj[key] = val
        // Check for end or continuation
        if p.curToken.Type == TokenRightBrace {
            p.advance()
            return obj, nil
        }
        if p.curToken.Type != TokenComma {
            return nil, fmt.Errorf("expected comma or closing brace")
        }
        p.advance()
    }
}
The beauty of recursive descent really shines here. When we encounter a value that's itself an object or array, we just call parseValue() again, and it handles the complexity. The call stack naturally mirrors the nesting structure of the JSON.
Type Mapping: JSON to Go
One design decision I had to make early: how should JSON types map to Go types? JSON has six types (object, array, string, number, boolean, null), but Go's type system is more rigid.
I settled on this mapping:
- JSON objects → map[string]interface{}
- JSON arrays → []interface{}
- JSON strings → string
- JSON numbers → float64 (JSON doesn't distinguish int/float)
- JSON booleans → bool
- JSON null → nil
The interface{} type is crucial here. It lets us represent dynamically typed JSON data in statically typed Go. Users need to type-assert the results, but that's the trade-off for supporting arbitrary JSON.
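In practice, a type switch is often the cleanest way to consume the parsed values. The helper below is a hypothetical example of my own, not part of smol-parser, but it shows how each JSON type surfaces on the Go side:

package main

import "fmt"

// describe reports how a parsed JSON value shows up as a Go type.
func describe(v interface{}) string {
    switch val := v.(type) {
    case map[string]interface{}:
        return fmt.Sprintf("object with %d keys", len(val))
    case []interface{}:
        return fmt.Sprintf("array of %d elements", len(val))
    case string:
        return "string: " + val
    case float64:
        return fmt.Sprintf("number: %g", val)
    case bool:
        return fmt.Sprintf("boolean: %t", val)
    case nil:
        return "null"
    default:
        return fmt.Sprintf("unexpected type %T", val)
    }
}

func main() {
    fmt.Println(describe(map[string]interface{}{"name": "Alice"})) // object with 1 keys
    fmt.Println(describe(float64(30)))                             // number: 30
    fmt.Println(describe(nil))                                     // null
}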
Testing: The Boring but Essential Part
I'll be honest—writing tests isn't the most exciting part of building a parser, but it's where I caught most of my bugs. I used table-driven tests extensively:
func TestParseString(t *testing.T) {
    tests := []struct {
        input    string
        expected interface{}
    }{
        {`"hello"`, "hello"},
        {`"hello\nworld"`, "hello\nworld"},
        {`"unicode: \u0048\u0065\u006C\u006C\u006F"`, "unicode: Hello"},
    }
    for _, tt := range tests {
        result, err := Parse(tt.input)
        if err != nil {
            t.Errorf("Parse(%q) error: %v", tt.input, err)
            continue
        }
        if !reflect.DeepEqual(result, tt.expected) {
            t.Errorf("Parse(%q) = %v, want %v", tt.input, result, tt.expected)
        }
    }
}
The error test cases were particularly important. A good parser should fail gracefully with helpful error messages, not panic or return cryptic errors.
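Those cases look much like the table above, except the expectation is that an error comes back. A minimal sketch of what such a table might look like (the inputs here are illustrative, not copied from smol-parser's actual tests):

func TestParseErrors(t *testing.T) {
    tests := []string{
        `{"name": "Alice",}`, // trailing comma
        `{1: "Alice"}`,       // non-string key
        `{"name" "Alice"}`,   // missing colon
        `[1, 2`,              // unterminated array
    }
    for _, input := range tests {
        if _, err := Parse(input); err == nil {
            t.Errorf("Parse(%q) expected an error, got nil", input)
        }
    }
}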
Performance: Good Enough?
I also added benchmarks to see how smol-parser performs:
func BenchmarkParseSimpleObject(b *testing.B) {
    input := `{"name": "John", "age": 30, "active": true}`
    for i := 0; i < b.N; i++ {
        Parse(input)
    }
}
The results? smol-parser is significantly slower than Go's standard library (no surprise there), but it's fast enough for learning purposes and small JSON files. The standard library uses optimizations like buffer pooling and cached reflection metadata that I deliberately avoided to keep the code readable.
What I Left Out
Building a parser requires making trade-offs. Here's what I consciously excluded:
Streaming: My parser loads the entire JSON string into memory. Production parsers often support streaming for huge files.
Custom unmarshaling: Go's encoding/json lets you define how structs are unmarshaled. I kept it simple with generic interface{} values.
Pretty printing: I parse JSON but don't format it back. Adding a serializer would double the code size.
Detailed error positions: My errors mention position, but don't show the offending line with a helpful pointer. That's more UX than fundamental parsing.
Each of these would be a good exercise if you fork the project.
Lessons Learned
Building smol-parser changed how I think about parsing:
Parsing isn't magic: It's systematic character processing and recursive function calls. The elegance is in the structure, not complexity.
Recursive descent is intuitive: Once you see that grammar rules map directly to functions, it clicks. Each function handles one syntactic construct and delegates to others.
Edge cases matter: Empty objects, escape sequences, scientific notation—the spec is full of details that trip you up if you're not careful.
Good errors are hard: Catching invalid JSON is easy. Telling users what's wrong and where is the challenging part.
Testing drives design: Writing tests revealed bugs and edge cases I hadn't considered. The parser improved dramatically once I had comprehensive test coverage.
Try It Yourself
Want to play with smol-parser? Here's how:
git clone https://github.com/smol-go/smol-parser.git
cd smol-parser
go run main.go
Or call Parse from your own code (the example below assumes Parse is in scope, e.g. in the same package):
package main

import (
    "fmt"
    "log"
)

func main() {
    json := `{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}`
    result, err := Parse(json)
    if err != nil {
        log.Fatal(err)
    }
    // Type assert to access the data
    data := result.(map[string]interface{})
    users := data["users"].([]interface{})
    for _, u := range users {
        user := u.(map[string]interface{})
        fmt.Printf("%s is %v years old\n", user["name"], user["age"])
    }
}
Final Thoughts
Is smol-parser production-ready? Absolutely not. Should you use it instead of encoding/json? Definitely not.
But building it taught me more about parsing in a weekend than reading about parsers ever could. The satisfaction of typing malformed JSON and seeing your error message explain exactly what's wrong is surprisingly rewarding.
If you've ever wondered how parsers work or felt intimidated by compiler theory, I encourage you to build something like this. Pick a simple format, implement it step by step, and watch the pieces fall into place. The concepts that seemed abstract in textbooks become concrete when you're debugging why your parser chokes on Unicode escape sequences.
And who knows? Maybe you'll find yourself looking at other parsers—SQL parsers, markdown parsers, programming language parsers—and thinking, "I could build that."
Check out the full source code on GitHub, including comprehensive tests and examples.