Building a clone of Cursor's Indexing Feature

Jun 30, 202613 min read

I use Cursor every day and never once wondered how @codebase actually finds the right three files out of ten thousand in under a second. It just works. You ask "where do we validate auth tokens" and it points at the right function, in the right file, with no exact keyword match anywhere in the query.

So I built tiny-cursor, a from-scratch clone of that indexing pipeline in Python. No frameworks doing the hard parts for me — I wrote the AST chunker, the embedding abstraction, the Merkle-tree sync, and the search layer myself, then wrapped it in a terminal app with click and rich. It's not trying to compete with Cursor. It's trying to make the pipeline legible enough that I stop treating it as magic.

How Cursor Actually Does This

Before writing any code I wanted to understand the real thing, not just guess at it. Cursor has actually published how this works, and a few independent write-ups fill in the rest. The short version:

When you open a workspace, Cursor scans every file and builds a Merkle tree — leaf nodes are file hashes, parent nodes are hashes of their children's hashes, all the way up to a single root hash for the entire codebase. That tree gets compared against the server's copy to find exactly where the two diverge — entries whose hashes differ get synced, entries that match get skipped, and anything missing on one side gets added or deleted on the other. A small edit only changes the hash of that one file and the hashes of its parent directories up to the root, so the diff stays cheap even on huge repos — for a workspace with fifty thousand files, the filenames and hashes alone add up to a few megabytes, and that's all that needs to move on a sync.

Chunking isn't naive either. Rather than splitting by raw character or token count — which can slice a function in half mid-body — Cursor uses an AST-aware splitter that breaks along class and function boundaries, the same idea LangChain's recursive splitters use. Once a file changes, its new chunks get embedded and the embeddings land in Turbopuffer, Cursor's remote vector database, tagged with metadata like line numbers and obfuscated file paths rather than the raw source. Cursor also caches embeddings by chunk hash, so re-indexing a codebase you've already indexed once is much cheaper the second time. On the query side, your prompt gets embedded, a nearest-neighbor search runs against the vector store, and that returns file paths and line numbers — the actual code gets read back off your local disk, not reconstructed from the embedding.

That's the whole shape of it: hash for cheap change detection, AST-split for chunk quality, embed only what changed, search by vector similarity, resolve back to local files. tiny-cursor is that shape, minus the server, minus Turbopuffer, minus a few hundred engineers.

What I Built

Four phases, each one a layer on top of the last:

Phase	What it does
1 — Chunking + embeddings	tree-sitter AST chunking, pluggable embedder, first index into ChromaDB
2 — Merkle sync	SHA-256 tree, diff against the last run, delete-before-insert on changed/deleted files
3 — Search layer	`semantic_search()` and `grep_search()` as two separate, non-overlapping functions
4 — Terminal app	`click` + `rich` CLI: `index`, `search`, `grep`, `status`

Phase 1 — Chunking and the embedder abstraction

The chunking strategy is the one decision everything downstream depends on. Split on raw character count and you cut functions in half, which wrecks the semantic quality of the embedding. So I parse every file with tree-sitter and only look at top-level nodes: functions, classes, and decorated definitions each become their own chunk no matter how large they are, while everything smaller — imports, constants, bare statements — gets buffered and flushed as one merged chunk once it'd cross ~1500 characters:

BOUNDARY_NODE_TYPES = {
    "function_definition",
    "class_definition",
    "decorated_definition",
}

for node in root.children:
    node_len = node.end_byte - node.start_byte

    if node.type in BOUNDARY_NODE_TYPES:
        flush_buffer()
        # ...emit this function/class as its own standalone chunk
        continue

    if buffer_nodes and buffer_len + node_len > max_chars:
        flush_buffer()

    buffer_nodes.append(node)
    buffer_len += node_len

A 200-line function stays one chunk. A file full of one-line imports becomes one merged chunk instead of forty tiny useless ones.

The other Phase 1 decision was making the embedding backend a non-issue for the rest of the codebase. I wanted this runnable with or without an OpenAI key, so EmbedderProvider checks the environment once at startup and picks OpenAI's text-embedding-3-small or a local nomic-embed-text via Ollama — and nothing outside this one file knows which:

if self._openai_key:
    from openai import OpenAI
    self.backend = "openai"
    self.model = "text-embedding-3-small"
    self._client = OpenAI(api_key=self._openai_key)
else:
    import ollama
    self.backend = "ollama"
    self.model = "nomic-embed-text"
    self._client = ollama
    self._ensure_ollama_model()

Everything else just calls embedder.embed(texts) and gets back vectors. Swapping the backend later — Voyage, Cohere, whatever — means touching one file.

Phase 2 — Merkle sync, and the bug that doesn't throw an error

This was the phase I was most nervous about, because the dev plan I wrote for myself flagged exactly why: if you re-embed a changed file without deleting its old chunks first, nothing crashes. There's no exception. Search results just get progressively worse as stale and fresh chunks for the same lines both sit in the collection, and you won't notice until you're three weeks in wondering why search feels off.

The Merkle tree itself is built bottom-up — every file's SHA-256, then every directory's hash as the SHA-256 of its sorted children's (name, hash) pairs, all the way to one root hash:

def _hash_node(tree: dict) -> str:
    entries = []
    for name, file_hash in sorted(tree.get("files", {}).items()):
        entries.append(f"f:{name}:{file_hash}")
    for name, subtree in sorted(tree.get("dirs", {}).items()):
        entries.append(f"d:{name}:{_hash_node(subtree)}")
    return hashlib.sha256("\n".join(entries).encode("utf-8")).hexdigest()

That gives a genuinely useful shortcut: if the new root hash matches the one from last run, nothing anywhere in the tree changed, so index can return immediately — zero diffing, zero Chroma calls, zero embedding API calls. Only when the roots diverge do I diff the flat {path: hash} maps to get the exact changed and deleted sets, and the rule from the plan becomes code, not a comment: every changed-or-deleted path gets delete_by_file_path() called on it before any re-embedding happens, full stop, even for a file that shrank to zero chunks. The new tree gets written to disk last, after the Chroma writes succeed — so a crash mid-sync leaves the old, honest tree in place instead of one lying about what got done.

I tested this properly before trusting it: first index, a no-op rerun, editing one file, deleting one file. The no-op run printed "No changes detected" and made zero embedding calls; the single-edit run touched exactly one file; the deletion left zero orphaned chunks behind. Cheap insurance for the part of the system with no error message when it's wrong.

Phase 3 — Two search functions, on purpose

semantic_search() embeds your query and does an ANN lookup. grep_search() shells out to ripgrep. There's no dispatcher trying to guess which one you meant from the query text, and I went back and forth on whether that was lazy before deciding it was correct: a query like "what does *args do?" or "where is the ^ operator used?" is completely reasonable natural language that's also full of regex metacharacters. Any heuristic confident enough to auto-route would be wrong often enough to be worse than just asking.

@dataclass
class SearchResult:
    file: str
    lines: tuple[int, int]
    snippet: str
    score: Optional[float] = None  # similarity score for semantic, None for grep

Both functions return the same shape, so the Phase 4 CLI renders either with identical code and just hides score when it's None. One implementation detail bit me here: ChromaDB defaults to L2 distance, not cosine similarity, so a "score 0.91" would've actually been an L2 distance number, not a similarity. I set hnsw:space="cosine" on collection creation and compute score = 1 - distance — which only takes effect on a freshly created collection, so it's the kind of thing that's easy to get wrong quietly if you add it after you've already been indexing for a while.

Phase 4 — Wrapping it in something you'd actually use

The CLI is click for commands and rich for output — progress bars during indexing, syntax-highlighted snippets, muted file paths. The progress bar was worth doing properly rather than faking: build_merkle_tree takes an on_file callback that fires once per file during the hash pass (which always covers every file, changed or not), and a second progress task appears only if there's actual embedding work, advancing per file as it's re-embedded. index, search, grep, and status came together into something that actually feels like a tool instead of a script:

$ tiny-cursor index ./my-project
Scanning 312 file(s)    ████████████████████████████ 100% 0:00:01
Embedding 312 file(s)   ████████████████████████████ 100% 0:00:08
✓ 1,847 chunks embedded and stored. [9.4s]

$ tiny-cursor search "where is the database connection set up"
src/db/connection.py  (line 12–34, score 0.91)
  def connect(host, port, db_name):
      engine = create_engine(...)

status reads from merkle.json, config.json, and the Chroma collection rather than re-walking the project, so checking index health doesn't cost anything close to what building it costs.

What I Learned

Delete-before-insert is the easiest thing in the whole project to get subtly wrong, and the scariest, because getting it wrong doesn't throw. It degrades. I wrote a real test pass for exactly this before moving on, which I wouldn't have bothered with for almost anything else in this project.

A root hash is a genuinely free optimization, not just a tidy abstraction. Comparing one string against one string and bailing out covers the overwhelmingly common case — you ran index again and nothing changed — without touching the vector store at all.

An abstraction is only doing its job if you can forget it exists. EmbedderProvider succeeded the day I stopped thinking about which backend was active while writing indexer.py, search.py, and cli.py. None of them know or care.

Two boring, separate functions beat one clever, ambiguous one. I almost wrote a heuristic to route between semantic and literal search before realizing every example I could think of to justify it was also a counterexample against it.

Real-world docs caught a bug a clean-room implementation would've shipped with. I didn't know Chroma defaulted to L2 distance until I went and actually checked how Cursor scores results, which is the kind of thing you only catch by comparing against the real system instead of just your own mental model of it.

What's Not There Yet

A few things came directly from comparing tiny-cursor against how Cursor itself actually works, beyond what was already on my own roadmap:

Periodic background re-sync. Cursor automatically re-checks the Merkle tree every few minutes so the index stays fresh without you remembering to run anything. tiny-cursor only syncs when you explicitly run index — a --watch flag using watchdog was already on the list, but a scheduled background sync is a different, probably more useful thing for a long-running session.
Chunk-level embedding cache. Cursor caches embeddings by chunk hash so re-indexing a codebase you've indexed before is cheap even across machines. Mine only short-circuits at the whole-tree and whole-file level — a content-addressed cache keyed by chunk hash would make "I cloned this repo somewhere new" cheap too.
Splitting oversized functions. A function bigger than the chunk limit still becomes one giant chunk today. Real implementations recursively split large nodes by their children rather than accepting an oversized leaf.
Multi-language support. Python only, via tree-sitter-python. Adding tree-sitter-javascript or tree-sitter-go is mostly a matter of swapping the grammar and the boundary node type names.
--json output and a REPL mode. Both small, both genuinely useful, both just not built yet.

Try It Yourself

git clone https://github.com/themillenniumfalcon/tiny-cursor.git
cd tiny-cursor
uv sync
uv pip install -e .

tiny-cursor index ./some-other-project
tiny-cursor search "where do we handle retries"
tiny-cursor grep 'TODO|FIXME'
tiny-cursor status

Run index twice in a row and watch the second run finish instantly. Edit one file and run it a third time — watch only that file get touched.

Final Thoughts

The part that actually surprised me wasn't the embeddings or the vector search — that's the part everyone already has a mental model for. It was the Merkle tree. A root hash comparison is such a small piece of code to be doing so much work: it's the difference between "re-index everything on every save" and a tool you'd actually leave running. Cursor clearly bet on that being the load-bearing piece of the whole pipeline, and after building a tiny version of it myself, I get why.

If you've ever wondered what @codebase is actually doing under the hood, I'd genuinely recommend building a small version yourself. The Merkle tree sounds like the kind of thing you'd only need at Google's scale until you implement it and realize it's twenty lines of post-order hashing doing a disproportionate amount of the work.