Copenhagen: A Multimodal Product Search Engine
I've always been curious about how search engines actually work under the hood. Not the "type keywords, get results" surface-level understanding, but what happens when the query doesn't match any words in the document — when you search for "cosy reading chair" and the catalog has "comfortable armchair", or when you upload a photo of a lamp and want to find something visually similar.
That gap between what a user means and what the documents literally say felt like an interesting problem, so I built Copenhagen — a multimodal product search engine — mostly to figure out how you'd bridge it. It accepts text queries, image uploads, or both at once, and returns ranked results using a two-stage pipeline: hybrid vector and keyword retrieval fused with Reciprocal Rank Fusion, followed by cross-encoder reranking.
The numbers: NDCG@10 of 0.76, MRR of 0.71, P99 end-to-end latency of ~147ms on CPU. The two-stage pipeline improves ranking quality by 24.6% over single-stage retrieval alone.
The Full Pipeline
Every query goes through six stages:
Query (text / image / both)
            │
            ▼
Redis cache check ── HIT ──► return (< 1ms)
            │ MISS
            ▼
    CLIP embed query
            │
      ┌─────┴─────┐
      ▼           ▼
     ANN         BM25
    search      search
      └─────┬─────┘
            ▼
  RRF fusion (top 50)
            │
            ▼
 Cross-encoder rerank
            │
            ▼
Return top-K + cache write
First, a Redis cache check — if this exact query has been seen recently, return immediately. If not, encode the query using CLIP into a 768-dimensional vector. Run two searches concurrently: approximate nearest-neighbour search against product embeddings in Postgres, and BM25 full-text search against a weighted tsvector column. Merge those two result lists using Reciprocal Rank Fusion to get a top-50 candidate set. Pass those 50 through a cross-encoder reranker. Return the top-K results with a full latency breakdown per stage.
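The two retrievals in the middle of that pipeline are independent, so they fan out concurrently. A minimal sketch of that step, with hypothetical async helpers ann_search() and bm25_search() standing in for the two Postgres queries (the names are mine, not the repo's API):

```python
# Concurrent fan-out sketch. ann_search() and bm25_search() are placeholder
# names for the two Postgres queries described above.
import asyncio

async def ann_search(query_vec, limit: int = 100):
    # Placeholder: ORDER BY fused_embedding <=> $1 LIMIT $2 against pgvector.
    ...

async def bm25_search(query_text: str, limit: int = 100):
    # Placeholder: search_tsv @@ plainto_tsquery($1), ranked by ts_rank_cd.
    ...

async def hybrid_retrieve(query_vec, query_text: str):
    # The two searches share nothing, so neither blocks the other.
    return await asyncio.gather(
        ann_search(query_vec),
        bm25_search(query_text),
    )
```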
Why Two Retrieval Methods?
Before getting into implementation, it's worth understanding why you'd run two completely different search methods and merge them.
Vector search encodes the semantic meaning of text and images into points in high-dimensional space, then finds nearby points. "Cosy reading chair" and "comfortable armchair" end up close together even though they share no words. The weakness: it can miss exact keyword matches — searching for a specific product code or brand name often fails because the semantic signal is too weak.
BM25 is the ranking function Elasticsearch and most keyword search systems use: it scores documents by term frequency, inverse document frequency, and document-length normalisation. It's precise on exact matches, handles rare terms well, and degrades gracefully when the query is specific. The weakness: "cosy chair" returns nothing if no product listing uses those exact words.
The insight is that these two methods fail in different ways. A product relevant to "Sony noise cancelling headphones" might appear in BM25's top-3 (exact brand match) but rank 40th in the ANN results. Fusing the two lists gets you the best of both.
CLIP: Shared Embedding Space
CLIP (Contrastive Language-Image Pre-training) is the foundation of the whole system. Its key property: it encodes both images and text into the same 768-dimensional vector space, trained jointly on 400 million image-text pairs.
image ──► image encoder ──┐
                          ├──► 768-dim shared space ──► cosine similarity
text  ──► text encoder ───┘
"Jointly" is the important word. An image of red sneakers and the text "red leather sneakers" end up close together because CLIP was trained to maximise similarity between matching pairs and minimise it between non-matching pairs. This shared space is what makes multimodal queries possible — you can fuse an image embedding with a text embedding because they're in the same coordinate system.
For fused queries I take a weighted linear combination: fused = α × img_vec + (1−α) × txt_vec, then re-normalise to a unit vector. That final normalisation is critical — adding two unit vectors doesn't give you a unit vector, and the stored product embeddings and the query embedding need to use the same distance metric for cosine similarity to be consistent. Alpha defaults to 0.7 (image-dominant), which works well for visual categories. In production you'd tune this per-category or learn it from click data.
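The encode-and-fuse step is short in practice. A sketch using Hugging Face's CLIP API; the checkpoint name is my assumption (it is a 768-dimensional CLIP variant, matching the dimensions in the post):

```python
# Sketch of query fusion, assuming Hugging Face transformers.
# "openai/clip-vit-large-patch14" is an assumed checkpoint (768-dim).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_fused(image_path: str, text: str, alpha: float = 0.7) -> torch.Tensor:
    image = Image.open(image_path)
    with torch.no_grad():
        img_vec = model.get_image_features(
            **processor(images=image, return_tensors="pt"))
        txt_vec = model.get_text_features(
            **processor(text=[text], return_tensors="pt", padding=True))
    # Normalise each component to unit length first...
    img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
    txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)
    # ...then fuse and re-normalise: the sum of two unit vectors is not a
    # unit vector, and cosine comparisons assume one.
    fused = alpha * img_vec + (1 - alpha) * txt_vec
    return fused / fused.norm(dim=-1, keepdim=True)
```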
The BLIP Problem
CLIP embeddings capture broad semantic meaning, but product descriptions are often sparse or brand-centric. "Sony WH-1000XM5" tells the embedding model almost nothing beyond "Sony" and "headphones". The product image, though, shows large over-ear cups in matte black on a white studio background.
The fix: BLIP (Bootstrapping Language-Image Pre-training), a separate image captioning model, runs at ingest time and generates natural-language descriptions of product images. A caption like "black over-ear wireless headphones on white background" gets stored in a blip_caption column and fed into two places:
- Appended to the text CLIP encodes, so the text embedding gains visual vocabulary
- Added to the tsvector column at weight B alongside description, so BM25 can match on visual attributes
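The captioning step itself is only a few lines. A sketch assuming Hugging Face's BLIP captioning checkpoint (the exact model the repo loads may differ):

```python
# Ingest-time captioning sketch, assuming the Hugging Face checkpoint
# "Salesforce/blip-image-captioning-base".
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    # e.g. "black over-ear wireless headphones on white background"
    return processor.decode(out[0], skip_special_tokens=True)
```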
A Postgres trigger fires automatically on every INSERT or UPDATE and rebuilds the weighted tsvector from all five text sources:
title → weight A (strongest)
description → weight B
blip_caption → weight B
brand → weight C
category → weight D (weakest)
A match in the title is stronger evidence than the same term buried in a caption. BLIP is ~900 MB and only needed at ingest time, so it lives in the worker process — the API server never loads it.
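For concreteness, here's roughly what that trigger looks like as DDL. The function and trigger names are illustrative (the repo's migration may differ); the weights match the table above:

```python
# Trigger DDL sketched as it might appear in a migration script.
import psycopg

TSV_FUNCTION = """
CREATE OR REPLACE FUNCTION products_tsv_refresh() RETURNS trigger AS $$
BEGIN
  NEW.search_tsv :=
      setweight(to_tsvector('english', coalesce(NEW.title, '')),        'A')
    || setweight(to_tsvector('english', coalesce(NEW.description, '')), 'B')
    || setweight(to_tsvector('english', coalesce(NEW.blip_caption, '')),'B')
    || setweight(to_tsvector('english', coalesce(NEW.brand, '')),       'C')
    || setweight(to_tsvector('english', coalesce(NEW.category, '')),    'D');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""

TSV_TRIGGER = """
CREATE TRIGGER products_tsv_trigger
  BEFORE INSERT OR UPDATE ON products
  FOR EACH ROW EXECUTE FUNCTION products_tsv_refresh();
"""

# Connection string is a placeholder.
with psycopg.connect("postgresql://localhost/copenhagen") as conn:
    conn.execute(TSV_FUNCTION)
    conn.execute(TSV_TRIGGER)
```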
Storing Vectors in Postgres
Three vector columns per product row — image, text, and fused — each typed as VECTOR(768) from pgvector. I store all three rather than just the fused result because fused is parameterised by alpha. Changing the image/text weighting later means a SQL UPDATE using the stored components, rather than rerunning CLIP on every product.
products
├── external_id       UNIQUE
├── title, description, brand, category
├── blip_caption      TEXT
├── image_embedding   VECTOR(768)
├── text_embedding    VECTOR(768)
├── fused_embedding   VECTOR(768)  ◄── HNSW index (cosine)
└── search_tsv        TSVECTOR     ◄── GIN index,
                                       auto-rebuilt by trigger
I chose HNSW over pgvector's IVFFlat for one practical reason: IVFFlat needs a minimum number of rows to build meaningful cluster centroids and requires periodic REINDEX operations as data grows. HNSW inserts incrementally, works correctly at any dataset size, and has better recall at equivalent latency.
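In pgvector terms, the index and the two query legs look roughly like the SQL below. The `<=>` operator is pgvector's cosine distance (given a `vector_cosine_ops` index); ts_rank_cd is Postgres's full-text score, which the pipeline uses as its BM25-style keyword signal. Index and parameter names are my own:

```python
# The Postgres side, sketched as SQL strings (e.g. for psycopg).
CREATE_HNSW = """
CREATE INDEX IF NOT EXISTS products_fused_hnsw
  ON products USING hnsw (fused_embedding vector_cosine_ops);
"""

ANN_SEARCH = """
SELECT external_id, title,
       fused_embedding <=> %(qvec)s::vector AS distance
FROM products
ORDER BY fused_embedding <=> %(qvec)s::vector
LIMIT 100;
"""

BM25_SEARCH = """
SELECT external_id, title,
       ts_rank_cd(search_tsv, q) AS score
FROM products, plainto_tsquery('english', %(qtext)s) AS q
WHERE search_tsv @@ q
ORDER BY score DESC
LIMIT 100;
"""
```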
Reciprocal Rank Fusion
With two result lists — up to 100 rows each — I need to merge them into a single ranked list. The problem is that ANN returns cosine distances and BM25 returns ts_rank_cd scores. These scales are incompatible. You can't meaningfully add them without normalisation.
RRF sidesteps this entirely by ignoring the actual scores and working only on rank positions:
ANN results              BM25 results
───────────              ────────────
#1 Leather boots         #1 Ankle boots
#2 Ankle boots           #2 Leather boots
#3 Chelsea boots         #3 Suede loafers
        │                        │
        └───────────┬────────────┘
                    ▼
      score(d) = Σ 1 / (60 + rank_i)

Merged:
#1 Leather boots   →  1/61 + 1/62 = 0.0325
#2 Ankle boots     →  1/62 + 1/61 = 0.0325
#3 Chelsea boots   →  1/63 + 0    = 0.0159
#4 Suede loafers   →  0    + 1/63 = 0.0159
The constant k=60 prevents rank-1 documents from dominating; it's the value chosen in the original RRF paper and it has held up across many benchmarks since. Products can appear in the merged list even if they only showed up in one retrieval method, which is where hybrid search earns its keep. I over-fetch at this stage, passing 50 results to the reranker rather than just the final top-K: ANN is approximate, and a larger candidate pool gives the reranker more to work with.
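In code, RRF is only a few lines. A minimal sketch matching the formula above (function and argument names are mine):

```python
# Minimal RRF over two rank-ordered id lists, with k = 60 as in the post.
from collections import defaultdict

def rrf_fuse(ann_ids: list, bm25_ids: list, k: int = 60, top_n: int = 50) -> list:
    scores = defaultdict(float)
    for ranked in (ann_ids, bm25_ids):
        for rank, pid in enumerate(ranked, start=1):
            # Raw ANN/BM25 scores are ignored; only rank position matters.
            scores[pid] += 1.0 / (k + rank)
    # Ids appearing in both lists accumulate two terms and rise naturally.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```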
Cross-Encoder Reranking
The reranker is where single-stage and two-stage search diverge.
CLIP encodes query and product independently, then measures similarity by comparing their positions in vector space. This is fast — pre-compute all product embeddings once, search at query time — but limited. The encoder can't see the query when embedding the document, so it can only capture coarse semantic matching.
A cross-encoder takes the full (query, passage) pair as a single input. Both sides pass through the transformer together with full cross-attention between query tokens and product tokens. This catches nuances that vector search misses — the query asks for "waterproof" but the product says "water-resistant", or the query mentions a specific size that only appears deep in the product description.
The model I used, ms-marco-MiniLM-L-6-v2, was trained on MS MARCO passage ranking tasks — which is exactly the reordering problem. At ~22 MB it's small enough to run on CPU in ~85–100ms for 50 candidates.
stage 1 (fast, approximate): 100K products → top 50 via ANN + BM25 + RRF
stage 2 (slow, precise): top 50 → top K via cross-encoder
If the reranker fails for any reason — OOM, model error — the endpoint falls back to RRF results with "reranker": "unavailable" in the response. The request never 500s.
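A sketch of the rerank step, including that fallback, using sentence-transformers' CrossEncoder. The model name is from the post; the surrounding function shape and the candidate-dict fields are assumptions:

```python
# Rerank sketch with the RRF fallback described above.
from sentence_transformers import CrossEncoder

_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 10):
    try:
        # Each (query, passage) pair goes through the transformer together,
        # with full cross-attention between the two sides.
        pairs = [(query, f"{c['title']} {c['description']}") for c in candidates]
        scores = _reranker.predict(pairs)
        order = sorted(range(len(candidates)),
                       key=lambda i: scores[i], reverse=True)
        return [candidates[i] for i in order[:top_k]], "ok"
    except Exception:
        # Fall back to the RRF ordering rather than failing the request.
        return candidates[:top_k], "unavailable"
```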
Async Ingestion Worker
The ingest API returns in ~2ms. All the heavy work happens in a separate worker:
POST /ingest
      │ LPUSH
      ▼
Redis queue (products:ingest)
      │ BRPOP
      ▼
fetch image URL
      │
      ▼
BLIP caption
      │
      ▼
CLIP image + text embed → fuse
      │
      ▼
Postgres upsert ──► success
      │ failure
      ▼
retry (max 3×) ──► exhausted ──► dead-letter queue
                                 + ingestion_errors table
The upsert uses INSERT ... ON CONFLICT (external_id) DO UPDATE SET ..., making it idempotent — submitting the same product twice produces the same final state as submitting it once.
The retry count lives in the job payload itself, so the worker is stateless. After three failures the job lands in products:ingest:dead and the error row in Postgres has the full payload for resubmission.
I chose Redis BRPOP over Celery deliberately. The workload is I/O-bound with CPU bursts — image downloading is async-friendly, and PyTorch releases the GIL during CLIP inference. A single async process handles this well. Celery would add a separate broker, a result backend, and significant operational overhead for no meaningful throughput gain.
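The whole loop fits in a screenful. A sketch condensed to a synchronous redis-py loop for clarity (the real worker is a single async process); queue names come from the post, everything else is illustrative:

```python
# Worker-loop sketch: BRPOP, process, retry with the attempt count carried
# in the payload itself, dead-letter after three failures.
import json
import redis

r = redis.Redis()
MAX_ATTEMPTS = 3

def process(job: dict) -> None:
    """Fetch the image, run BLIP + CLIP, upsert into Postgres (elided here)."""
    ...

while True:
    _queue, raw = r.brpop("products:ingest")   # blocks until a job arrives
    job = json.loads(raw)
    try:
        process(job)
    except Exception:
        # Retry state lives in the payload, so the worker stays stateless.
        job["attempts"] = job.get("attempts", 0) + 1
        if job["attempts"] >= MAX_ATTEMPTS:
            r.lpush("products:ingest:dead", json.dumps(job))  # + error row in Postgres
        else:
            r.lpush("products:ingest", json.dumps(job))       # requeue for another try
```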
The Caching Layer
Two cache namespaces in Redis:
search:results:<sha256>   TTL 5 min   full serialised response
search:embed:<sha256>     TTL 1 hr    raw float32 bytes (3 KB)
The result cache is keyed by SHA-256 of the normalised query payload, so " Ankle Boots " and "ankle boots" hit the same entry. The embedding cache is keyed by SHA-256 of the lowercased query string — text embeddings can be cached far longer than results because they're purely a function of the query string and the model, never going stale.
Embeddings are stored as raw float32.tobytes() rather than JSON. A 768-dim float32 vector as JSON is ~12 KB; as raw bytes it's 3,072 bytes — 4× smaller with no parse overhead on read.
Both caches are fire-and-forget: every get/set is wrapped in try/except so a Redis outage silently degrades to a cache miss without breaking search.
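The key derivation and the raw-bytes round-trip are both small. A sketch with the key prefixes from the post; the exact normalisation of the payload is my assumption:

```python
# Cache keys and the float32 round-trip described above.
import hashlib
import json
import numpy as np

def result_key(payload: dict) -> str:
    # Canonical JSON so trivially different payloads hash identically.
    norm = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "search:results:" + hashlib.sha256(norm.encode()).hexdigest()

def embed_key(query: str) -> str:
    # "  Ankle Boots " and "ankle boots" collapse to the same key.
    return "search:embed:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()

def pack(vec: np.ndarray) -> bytes:
    return vec.astype(np.float32).tobytes()      # 768 floats -> 3,072 bytes

def unpack(raw: bytes) -> np.ndarray:
    return np.frombuffer(raw, dtype=np.float32)  # zero-copy read, no JSON parse
```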
Benchmarks
Latency on CPU, from 200 requests:
| Stage | P50 | P99 |
|---|---|---|
| CLIP encoding | 18ms | 25ms |
| ANN retrieval | 8ms | 14ms |
| BM25 retrieval | 5ms | 10ms |
| RRF fusion | <1ms | <1ms |
| Cross-encoder | 85ms | 102ms |
| End-to-end | 116ms | 147ms |
The cross-encoder is 70% of total latency and constant regardless of catalog size. On a GPU it drops to 7–10ms. CLIP encoding is also constant. ANN scales as O(log N) for HNSW.
Quality against 50 hand-labelled queries:
| Metric | Single-stage (RRF) | Two-stage (+reranker) | Δ |
|---|---|---|---|
| NDCG@10 | 0.61 | 0.76 | +24.6% |
| MRR | 0.54 | 0.71 | +31.5% |
| Precision@10 | 0.43 | 0.67 | +55.8% |
| Recall@50 | 0.82 | 0.84 | +2.4% |
Recall barely moves — the RRF pool at K=50 already contains almost all the relevant products. What the reranker does is dramatically improve where in that pool they rank.
Final Thoughts
The thing that surprised me most was how much the BLIP captions matter. Without them, the first version struggled on visual queries — "brown leather wallet", "white ceramic mug" — because product descriptions rarely used those words. With captions, BM25 suddenly had signal it didn't before, and the vector embeddings became richer. A model trained to describe images turned out to be a surprisingly effective bridge between the image domain and the text search domain.
The two-stage pipeline pattern also feels natural once you've built it. Stage one is about recall — get the right products into the candidate set, even if the ordering isn't perfect. Stage two is about precision — given those candidates, put them in the right order. Separating those concerns makes the system easier to reason about and tune independently.
The full code, Dockerfiles, benchmark runner, and 50-query evaluation set are all on GitHub.