Demystifying Parsers: Writing a JSON Parser
When I decided to build a JSON parser from scratch, I wasn't trying to replace Go's excellent encoding/json package. Instead, I wanted to understand how parsers actually work—the kind of understanding you only get by implementing one yourself.
Parsing is one of those computer science fundamentals that feels almost magical until you build your own. How does a parser take a string of characters and turn it into structured data? Let's find out.
Why Build a JSON Parser?
JSON parsing is everywhere in modern software development. Every time you make an API call or read a config file, there's probably a JSON parser working behind the scenes. But most developers (myself included, before this project) treat parsers as black boxes.
I chose JSON specifically because:
- The spec is well-defined and surprisingly simple
- It's practical—we use JSON daily
- It's complex enough to demonstrate real parsing techniques
- It's simple enough to implement in a weekend
The result is smol-parser: a complete JSON parser that handles strings, numbers, booleans, nulls, arrays, and objects. It even supports escape sequences and Unicode. And it all fits in about 400 lines of Go.
The Two-Stage Dance: Lexing and Parsing
Most parsers work in two distinct phases, and understanding this separation was my first key insight:
- Lexing (lexical analysis): Breaking raw text into meaningful tokens
- Parsing (syntactic analysis): Organizing those tokens into data structures
Think of it like reading a sentence. First, you identify individual words (lexing). Then you understand how those words relate to each other grammatically (parsing).
Here's what that looks like for {"name": "Alice"}:
Input characters: { " n a m e " : " A l i c e " }
↓ Lexing
Tokens: LEFT_BRACE, STRING("name"), COLON, STRING("Alice"), RIGHT_BRACE
↓ Parsing
Structure: map[string]interface{}{"name": "Alice"}
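Token names like LEFT_BRACE and STRING map naturally onto a small token type. The snippets later in this post use names like TokenLeftBrace and TokenString, so here's a minimal sketch of definitions along those lines (the exact set and spelling in smol-parser may differ):

type TokenType int

const (
    TokenLeftBrace TokenType = iota // {
    TokenRightBrace                 // }
    TokenLeftBracket                // [
    TokenRightBracket               // ]
    TokenColon                      // :
    TokenComma                      // ,
    TokenString                     // "..."
    TokenNumber                     // 123, 1.5e-10, ...
    TokenTrue                       // true
    TokenFalse                      // false
    TokenNull                       // null
    TokenEOF                        // end of input
)

type Token struct {
    Type  TokenType
    Value string // literal text or decoded string contents
}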
Building the Lexer: Character by Character
The lexer is where I started. Its job is conceptually simple: read characters one at a time and group them into tokens. The tricky part is in the details.
I structured my lexer as a state machine that tracks its position:
type Lexer struct {
    input string // raw JSON text
    pos   int    // index of the next byte to read
    ch    byte   // byte currently under examination (0 once we hit the end)
}
The readChar() method advances through the input, while peekChar() lets us look ahead without committing. That lookahead is crucial for decisions like whether the characters after 123 continue the number (as in 123.456) or end it.
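Neither method is shown in this post, but given how readNumber() uses pos later on, they probably look something like this (a sketch, assuming pos always points at the next unread byte):

// readChar loads the byte at pos into ch and advances pos.
// ch becomes 0 once we've run past the end of the input.
func (l *Lexer) readChar() {
    if l.pos >= len(l.input) {
        l.ch = 0
    } else {
        l.ch = l.input[l.pos]
    }
    l.pos++
}

// peekChar reports the next byte without consuming it.
func (l *Lexer) peekChar() byte {
    if l.pos >= len(l.input) {
        return 0
    }
    return l.input[l.pos]
}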
The String Challenge
Parsing strings turned out to be trickier than I expected. JSON strings support escape sequences like \n and \t, but also Unicode escapes like \u0048 (which represents 'H').
My readString() method handles this with a state machine inside the string reader:
func (l *Lexer) readString() (string, error) {
    var result []rune
    l.readChar() // skip opening "
    for l.ch != '"' && l.ch != 0 {
        if l.ch == '\\' {
            l.readChar()
            switch l.ch {
            case '"', '\\', '/':
                result = append(result, rune(l.ch))
            case 'n':
                result = append(result, '\n')
            // ... more escape sequences
            case 'u':
                // Handle \uXXXX Unicode escapes
                hex := ""
                for i := 0; i < 4; i++ {
                    l.readChar()
                    hex += string(l.ch)
                }
                val, err := strconv.ParseInt(hex, 16, 32)
                if err != nil {
                    return "", fmt.Errorf("invalid unicode escape \\u%s", hex)
                }
                result = append(result, rune(val))
                // the readChar at the bottom of the loop consumes the last hex digit
            }
        } else {
            result = append(result, rune(l.ch))
        }
        l.readChar()
    }
    return string(result), nil
}
The Unicode handling was particularly satisfying to get right. Converting four hex digits into an actual character felt like cracking a small puzzle.
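The conversion itself is only a couple of standard-library calls. Here's the idea in isolation (a standalone demo, not code from smol-parser):

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // "\u0048" arrives at the lexer as the four hex digits "0048".
    val, err := strconv.ParseInt("0048", 16, 32)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%c\n", rune(val)) // prints H
}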
Numbers: More Complex Than They Look
JSON numbers follow a specific format: optional minus, digits, optional decimal point and more digits, optional exponent. My initial attempt just grabbed "digit-looking characters," which broke on scientific notation like 1.5e-10.
The solution was methodical: handle each part of the number format in sequence. First the optional minus, then the integer part (with special handling for leading zeros), then the optional fractional part, then the optional exponent:
func (l *Lexer) readNumber() string {
    start := l.pos - 1
    if l.ch == '-' {
        l.readChar()
    }
    // Integer part
    if l.ch == '0' {
        l.readChar()
    } else {
        for unicode.IsDigit(rune(l.ch)) {
            l.readChar()
        }
    }
    // Fractional part
    if l.ch == '.' {
        l.readChar()
        for unicode.IsDigit(rune(l.ch)) {
            l.readChar()
        }
    }
    // Exponent
    if l.ch == 'e' || l.ch == 'E' {
        l.readChar()
        if l.ch == '+' || l.ch == '-' {
            l.readChar()
        }
        for unicode.IsDigit(rune(l.ch)) {
            l.readChar()
        }
    }
    return l.input[start : l.pos-1]
}
This approach mirrors the number grammar in the JSON specification, which means it correctly handles edge cases like -0 and 1e+10.
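Note that the lexer only has to slice out the right substring; converting it to an actual number is delegated to strconv.ParseFloat later in the parser (see parseValue below), which accepts all of these forms. A quick standalone check:

package main

import (
    "fmt"
    "strconv"
)

func main() {
    for _, s := range []string{"-0", "1e+10", "1.5e-10", "123.456"} {
        f, err := strconv.ParseFloat(s, 64)
        fmt.Println(s, "->", f, err) // every one parses without error
    }
}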
The Parser: Making Sense of Tokens
Once the lexer produces tokens, the parser's job is to build actual Go data structures. I used a technique called recursive descent parsing, which is both elegant and intuitive once you understand it.
The core idea: each JSON grammar rule becomes a function, and these functions call each other recursively. Here's my parser structure:
type Parser struct {
    lexer    *Lexer
    curToken Token
}
The parser always maintains a "current token" and an advance() method to move forward. Every parsing function follows the same pattern: look at the current token, decide what to do, consume tokens, and return a value.
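advance() itself isn't shown in this post; assuming the lexer exposes a NextToken() method (a name I'm guessing at), it's essentially a one-liner:

// advance pulls the next token from the lexer into curToken.
func (p *Parser) advance() {
    p.curToken = p.lexer.NextToken()
}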
Recursive Descent in Action
The parseValue() method is the heart of the parser. It examines the current token and delegates to specialized functions:
func (p *Parser) parseValue() (interface{}, error) {
    switch p.curToken.Type {
    case TokenLeftBrace:
        return p.parseObject()
    case TokenLeftBracket:
        return p.parseArray()
    case TokenString:
        val := p.curToken.Value
        p.advance()
        return val, nil
    case TokenNumber:
        val, _ := strconv.ParseFloat(p.curToken.Value, 64)
        p.advance()
        return val, nil
    // ... more cases
    }
}
What makes this "recursive descent" is what happens in parseObject() and parseArray(). These functions call back to parseValue(), creating a recursive structure that naturally handles nesting.
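parseArray() isn't reproduced in this post, but it follows the same shape as the parseObject() shown in the next section. A sketch, assuming a TokenRightBracket type to match TokenLeftBracket:

func (p *Parser) parseArray() ([]interface{}, error) {
    arr := []interface{}{}
    p.advance() // skip [
    // Handle empty array
    if p.curToken.Type == TokenRightBracket {
        p.advance()
        return arr, nil
    }
    for {
        // Each element is a full JSON value, so recurse
        val, err := p.parseValue()
        if err != nil {
            return nil, err
        }
        arr = append(arr, val)
        // Check for end or continuation
        if p.curToken.Type == TokenRightBracket {
            p.advance()
            return arr, nil
        }
        if p.curToken.Type != TokenComma {
            return nil, fmt.Errorf("expected comma or closing bracket")
        }
        p.advance()
    }
}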
Objects: The Tricky Part
Parsing objects taught me about the complexity hiding in simple syntax. An object is just {key: value, key: value}, right? But there are edge cases: empty objects, trailing commas (not allowed in JSON!), ensuring keys are strings, handling deeply nested structures.
My parseObject() implementation handles all this:
func (p *Parser) parseObject() (map[string]interface{}, error) {
    obj := make(map[string]interface{})
    p.advance() // skip {
    // Handle empty object
    if p.curToken.Type == TokenRightBrace {
        p.advance()
        return obj, nil
    }
    for {
        // Key must be a string
        if p.curToken.Type != TokenString {
            return nil, fmt.Errorf("expected string key")
        }
        key := p.curToken.Value
        p.advance()
        // Expect colon
        if p.curToken.Type != TokenColon {
            return nil, fmt.Errorf("expected colon after key")
        }
        p.advance()
        // Parse value (this is where recursion happens!)
        val, err := p.parseValue()
        if err != nil {
            return nil, err
        }
        obj[key] = val
        // Check for end or continuation
        if p.curToken.Type == TokenRightBrace {
            p.advance()
            return obj, nil
        }
        if p.curToken.Type != TokenComma {
            return nil, fmt.Errorf("expected comma or closing brace")
        }
        p.advance()
    }
}
The beauty of recursive descent really shines here. When we encounter a value that's itself an object or array, we just call parseValue() again, and it handles the complexity. The call stack naturally mirrors the nesting structure of the JSON.
Type Mapping: JSON to Go
One design decision I had to make early: how should JSON types map to Go types? JSON has six types (object, array, string, number, boolean, null), but Go's type system is more rigid.
I settled on this mapping:
- JSON objects → map[string]interface{}
- JSON arrays → []interface{}
- JSON strings → string
- JSON numbers → float64 (JSON doesn't distinguish int/float)
- JSON booleans → bool
- JSON null → nil
The interface{} type is crucial here. It lets us represent dynamically typed JSON data in statically typed Go. Users need to type-assert the results, but that's the trade-off for supporting arbitrary JSON.
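In practice, a type switch is often the cleanest way to consume the parsed values. The helper below is a hypothetical example of my own, not part of smol-parser, but it shows how each JSON type surfaces on the Go side:

package main

import "fmt"

// describe reports how a parsed JSON value shows up as a Go type.
func describe(v interface{}) string {
    switch val := v.(type) {
    case map[string]interface{}:
        return fmt.Sprintf("object with %d keys", len(val))
    case []interface{}:
        return fmt.Sprintf("array of %d elements", len(val))
    case string:
        return "string: " + val
    case float64:
        return fmt.Sprintf("number: %g", val)
    case bool:
        return fmt.Sprintf("boolean: %t", val)
    case nil:
        return "null"
    default:
        return fmt.Sprintf("unexpected type %T", val)
    }
}

func main() {
    fmt.Println(describe(map[string]interface{}{"name": "Alice"})) // object with 1 keys
    fmt.Println(describe(float64(30)))                             // number: 30
    fmt.Println(describe(nil))                                     // null
}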
Testing: The Boring but Essential Part
I'll be honest—writing tests isn't the most exciting part of building a parser, but it's where I caught most of my bugs. I used table-driven tests extensively:
func TestParseString(t *testing.T) {
    tests := []struct {
        input    string
        expected interface{}
    }{
        {`"hello"`, "hello"},
        {`"hello\nworld"`, "hello\nworld"},
        {`"unicode: \u0048\u0065\u006C\u006C\u006F"`, "unicode: Hello"},
    }
    for _, tt := range tests {
        result, err := Parse(tt.input)
        if err != nil {
            t.Errorf("Parse(%q) error: %v", tt.input, err)
            continue
        }
        if !reflect.DeepEqual(result, tt.expected) {
            t.Errorf("Parse(%q) = %v, want %v", tt.input, result, tt.expected)
        }
    }
}
The error test cases were particularly important. A good parser should fail gracefully with helpful error messages, not panic or return cryptic errors.
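Those cases look much like the table above, except the expectation is that an error comes back. A minimal sketch of what such a table might look like (the inputs here are illustrative, not copied from smol-parser's actual tests):

func TestParseErrors(t *testing.T) {
    tests := []string{
        `{"name": "Alice",}`, // trailing comma
        `{1: "Alice"}`,       // non-string key
        `{"name" "Alice"}`,   // missing colon
        `[1, 2`,              // unterminated array
    }
    for _, input := range tests {
        if _, err := Parse(input); err == nil {
            t.Errorf("Parse(%q) expected an error, got nil", input)
        }
    }
}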
Performance: Good Enough?
I also added benchmarks to see how smol-parser performs:
func BenchmarkParseSimpleObject(b *testing.B) {
    input := `{"name": "John", "age": 30, "active": true}`
    for i := 0; i < b.N; i++ {
        Parse(input)
    }
}
The results? smol-parser is significantly slower than Go's standard library (no surprise there), but it's fast enough for learning purposes and small JSON files. The standard library uses optimizations like buffer pooling and cached reflection metadata that I deliberately avoided to keep the code readable.
What I Left Out
Building a parser requires making trade-offs. Here's what I consciously excluded:
Streaming: My parser loads the entire JSON string into memory. Production parsers often support streaming for huge files.
Custom unmarshaling: Go's encoding/json lets you define how structs are unmarshaled. I kept it simple with generic interface{} values.
Pretty printing: I parse JSON but don't format it back. Adding a serializer would double the code size.
Detailed error positions: My errors mention position, but don't show the offending line with a helpful pointer. That's more UX than fundamental parsing.
Each of these would be a good exercise if you fork the project.
Lessons Learned
Building smol-parser changed how I think about parsing:
Parsing isn't magic: It's systematic character processing and recursive function calls. The elegance is in the structure, not complexity.
Recursive descent is intuitive: Once you see that grammar rules map directly to functions, it clicks. Each function handles one syntactic construct and delegates to others.
Edge cases matter: Empty objects, escape sequences, scientific notation—the spec is full of details that trip you up if you're not careful.
Good errors are hard: Catching invalid JSON is easy. Telling users what's wrong and where is the challenging part.
Testing drives design: Writing tests revealed bugs and edge cases I hadn't considered. The parser improved dramatically once I had comprehensive test coverage.
Try It Yourself
Want to play with smol-parser? Here's how:
git clone https://github.com/smol-go/smol-parser.git
cd smol-parser
go run main.go
Or call Parse from your own code (the example below assumes Parse is in scope, e.g. in the same package):
package main

import (
    "fmt"
    "log"
)

func main() {
    json := `{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}`
    result, err := Parse(json)
    if err != nil {
        log.Fatal(err)
    }
    // Type assert to access the data
    data := result.(map[string]interface{})
    users := data["users"].([]interface{})
    for _, u := range users {
        user := u.(map[string]interface{})
        fmt.Printf("%s is %v years old\n", user["name"], user["age"])
    }
}
Final Thoughts
Is smol-parser production-ready? Absolutely not. Should you use it instead of encoding/json? Definitely not.
But building it taught me more about parsing in a weekend than reading about parsers ever could. The satisfaction of typing malformed JSON and seeing your error message explain exactly what's wrong is surprisingly rewarding.
If you've ever wondered how parsers work or felt intimidated by compiler theory, I encourage you to build something like this. Pick a simple format, implement it step by step, and watch the pieces fall into place. The concepts that seemed abstract in textbooks become concrete when you're debugging why your parser chokes on Unicode escape sequences.
And who knows? Maybe you'll find yourself looking at other parsers—SQL parsers, markdown parsers, programming language parsers—and thinking, "I could build that."
Check out the full source code on GitHub, including comprehensive tests and examples.