Engineering

Inside retailerapi: cross-retailer enrichment, time-series storage, and stale-while-revalidate at scale

Engineering deep-dive on how retailerapi serves cross-retailer product data at $7/100k page views. Tier 0 SERP scrape + per-retailer fallback, Postgres advisory locks for single-flight, time-series partitioned monthly, ISR + Next.js after().

By Matt Hall · 8 min read

retailerapi went from "let's compete with Keepa for non-Amazon retailers" to "live in production with cross-retailer enrichment for 7 retailers" in about 4 days. The architecture is the result of three deliberate decisions: don't synchronously scrape on page render, use one cheap query to seed many retailers at once, and split current-state from history at the schema level. This post walks through what's running in production today, why we chose each piece, and the cost picture.

The architecture in one diagram

                    /p/[id] page request
                            │
                            ▼
              ┌─────────────────────────────┐
              │ Read product_enrichments    │  <- always serves from cache
              │ + readPopularityTier(id)    │     even if stale
              └─────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────────┐
              │ Render page (Walmart from   │
              │ primary catalog,            │  ISR 24h revalidate
              │ cross-retailer from cache,  │
              │ gaps as "Indexing…" pills)  │
              └─────────────────────────────┘
                            │
                  Next.js after() fires
                            │
                            ▼
              ┌─────────────────────────────┐
              │ Tier 0: Serper Shopping     │  one ~$0.001 call, returns
              │ search by product title     │  offers from many retailers
              └─────────────────────────────┘
                            │
              for each matched retailer:
                            ▼
              ┌─────────────────────────────┐
              │ Acquire pg_try_advisory_lock│
              │ if lock held: skip (another │
              │   visitor is already        │
              │   refreshing this row)      │
              │ if acquired: write to       │
              │   product_enrichments       │
              │   + price_observations      │
              └─────────────────────────────┘
                            │
              for retailers Tier 0 missed:
                            ▼
              ┌─────────────────────────────┐
              │ Tier 1: per-retailer        │  free APIs first (Best Buy
              │ scrapers (Best Buy API,     │  Open API, eBay Browse) then
              │ scrape.do, Firecrawl, etc.) │  paid scrapers per cheatsheet
              └─────────────────────────────┘

Each box is independently observable, replaceable, and cost-bounded.

Why we never fetch synchronously

The tempting design: page request comes in, fetch 6 retailers in parallel, render the result. This loses on three dimensions:

  1. SEO. Google's crawler doesn't wait. If your page LCP is 8 seconds because you're scraping 6 retailers in real time, your Core Web Vitals tank. AI crawlers (GPTBot, ClaudeBot) don't render JS at all; they get whatever HTML you serve in the first response.
  2. Cost. Synchronous fetch on every page render means cache hit rate near zero. At 100k page views, that's 600k upstream scrape calls. Fast track to $200 to $500 per month in scraping bills.
  3. Reliability. Any retailer's 5-second hang becomes your 5-second page render hang. A Lowe's bot block takes down your visible product cell for the duration.

The fix is the pattern Keepa, CamelCamelCamel, idealo, and Capital One Shopping all use: always read from cache, refresh in the background. We implement it in Next.js 15 via the after() API. The page renders from product_enrichments (or shows "Indexing…" placeholder cells if the cache is cold), then after() fires asynchronously once the response is sent. The visitor sees a 1.5 to 2-second render. The next visitor sees a populated page.
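Here's a minimal sketch of that read path, assuming a Supabase client and the product_enrichments shape from the schema section below; getEnrichments and the env var names are illustrative, not the production helpers:

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Hot-path read: return whatever is cached, plus the list of stale retailers.
// The caller renders immediately; after() handles any refresh off the response path.
export async function getEnrichments(identifier: string, identifierType: string) {
  const { data, error } = await supabase
    .from("product_enrichments")
    .select("retailer, data, expires_at")
    .eq("identifier", identifier)
    .eq("identifier_type", identifierType);
  if (error || !data) return { rows: [], staleRetailers: [] };

  const now = Date.now();
  // Expired rows are still returned and rendered: serve stale, revalidate in background.
  const staleRetailers = data
    .filter((row) => new Date(row.expires_at).getTime() < now)
    .map((row) => row.retailer);
  return { rows: data, staleRetailers };
}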

Tier 0: one Serper call, many retailers

The biggest single win in our architecture. Pre-2024, Google Shopping accepted GTIN/UPC as a search filter, which made multi-retailer scraping cheap. Google killed that filter (we wrote about it). What still works: title-based search.

A single call to Serper's /shopping endpoint with a product title returns 20 to 40 offers from various merchants. We classify each offer by source (regex match against canonical retailer names: "Amazon", "Best Buy", "Lowe's", etc.) and write each match to product_enrichments. For ~70% of products, this single ~$0.001 call covers most of our 6 target retailers. For the remaining 30%, Tier 1 fills the gaps.
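A condensed sketch of that Tier 0 pass, assuming Serper's /shopping response shape (a shopping array of offers with source, title, price, and link fields); the retailer pattern list is abbreviated here:

// Tier 0: one Serper Shopping search by title, bucketed by retailer.
// We keep the first (highest-ranked) offer per retailer pattern.
const RETAILER_PATTERNS: Record<string, RegExp> = {
  amazon: /\bamazon\b/i,
  bestbuy: /\bbest\s*buy\b/i,
  lowes: /\blowe'?s\b/i,
  // ...one entry per target retailer
};

export async function tier0Search(title: string) {
  const res = await fetch("https://google.serper.dev/shopping", {
    method: "POST",
    headers: { "X-API-KEY": process.env.SERPER_API_KEY!, "Content-Type": "application/json" },
    body: JSON.stringify({ q: title }),
  });
  const { shopping = [] } = await res.json();

  const matches: Record<string, { title: string; price: string; link: string }> = {};
  for (const offer of shopping) {
    for (const [retailer, pattern] of Object.entries(RETAILER_PATTERNS)) {
      if (!matches[retailer] && pattern.test(offer.source ?? "")) {
        matches[retailer] = { title: offer.title, price: offer.price, link: offer.link };
      }
    }
  }
  return matches; // each hit becomes a product_enrichments upsert
}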

For one validation product (electric pressure cooker UPC 19667262713), Serper returned a single response that populated 5 retailers (Best Buy at $69.99, eBay at $100.71, Target at $79.99, Home Depot at $181.64, Lowe's at $59.99). One call, five retailer cells filled. Cost: ~$0.001.

Comparison to per-retailer scraping: 6 separate scrapes at $0.001 to $0.005 each = $0.006 to $0.030 per uncached page. Tier 0 is 6x to 30x cheaper.

Tier 1: per-retailer fallback

For retailers Tier 0 didn't return, we fall back to retailer-specific fetchers driven by the org-wide scraper cheatsheet. The routing per retailer:

Retailer      Method                        Cost (cr ≈ $0.000116)
Walmart       primary catalog (owned)       0
Amazon        scrape.do super tier          10
Best Buy      Best Buy Open API (free)      0
eBay          ScraperAPI cheap tier         1
Lowe's        scrape.do default             1
Target        Firecrawl                     ~5
Home Depot    scrape.do default + render    5

Total Tier 1 cost when all 6 fall through (rare): about $0.022 per uncached page.
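For reference, the same routing expressed as the kind of config object the fallback layer can dispatch on; the shape is illustrative, but providers and credit costs mirror the table above:

type FetchStrategy = { provider: string; credits: number; render?: boolean };

// Tier 1 routing map. Credits are the cheatsheet's unit (cr ≈ $0.000116).
const TIER1_ROUTES: Record<string, FetchStrategy> = {
  walmart:   { provider: "primary_catalog",   credits: 0 },
  amazon:    { provider: "scrape.do:super",   credits: 10 },
  bestbuy:   { provider: "bestbuy_open_api",  credits: 0 },
  ebay:      { provider: "scraperapi:cheap",  credits: 1 },
  lowes:     { provider: "scrape.do:default", credits: 1 },
  target:    { provider: "firecrawl",         credits: 5 },
  homedepot: { provider: "scrape.do:default", credits: 5, render: true },
};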

Single-flight via Postgres advisory locks

When a /p/[id] page is uncached and 100 visitors hit it concurrently, naive code fans out 100 scrape jobs. That bills 100x and gets us IP-blocked by retailers within minutes. The fix is the same pattern Cloudflare and every CDN uses: single-flight coalescing.

We use Postgres pg_try_advisory_lock(int, int), keyed on two int32s derived from a SHA-256 hash of (retailer, identifier). The first request acquires the lock and runs the scrape. The second through 100th requests see the lock held and serve stale data (or "Indexing…") instead of scraping. The lock releases when the scraper completes (or when the PgBouncer session closes, which is the failsafe).

Implementation: ~30 lines of TypeScript wrapping a 6-line try_advisory_lock_pair(p_a int, p_b int) SQL function. Migration deployed in 20260509000200_advisory_lock_helpers.sql.
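A condensed sketch of the TypeScript wrapper, assuming a Supabase RPC client. try_advisory_lock_pair matches the migration's function; the key derivation and the advisory_unlock_pair counterpart shown here are illustrative:

import { createHash } from "node:crypto";
// supabase: shared client from the read-path sketch above.

// Derive a stable (int32, int32) advisory-lock key from (retailer, identifier).
// readInt32BE keeps both values inside Postgres's int4 range.
function lockKeyPair(retailer: string, identifier: string): [number, number] {
  const digest = createHash("sha256").update(`${retailer}:${identifier}`).digest();
  return [digest.readInt32BE(0), digest.readInt32BE(4)];
}

export async function refreshSingleFlight(
  retailer: string,
  identifier: string,
  refresh: () => Promise<void>,
) {
  const [pA, pB] = lockKeyPair(retailer, identifier);
  const { data: acquired } = await supabase.rpc("try_advisory_lock_pair", { p_a: pA, p_b: pB });
  if (!acquired) return; // another visitor holds the lock; serve stale instead
  try {
    await refresh(); // scrape + write product_enrichments / price_observations
  } finally {
    // Explicit unlock; the session close is the failsafe if this never runs.
    await supabase.rpc("advisory_unlock_pair", { p_a: pA, p_b: pB });
  }
}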

Time-series + current-state schema split

Two tables, both keyed by (identifier, identifier_type, retailer):

-- Hot path. One row per (identifier, retailer). Queried on every page render.
CREATE TABLE product_enrichments (
  identifier      text NOT NULL,
  identifier_type text NOT NULL,
  retailer        text NOT NULL,
  data            jsonb NOT NULL,        -- title, price, url, image, in_stock
  fetched_at      timestamptz NOT NULL,
  expires_at      timestamptz NOT NULL,  -- 30-day TTL by default
  fetch_status    text NOT NULL,         -- 'ok' | 'not_found' | 'blocked' | 'error'
  fetch_provider  text,                  -- 'serper' | 'scrape.do' | 'native_api' | ...
  UNIQUE (identifier, identifier_type, retailer)
);

-- Cold path. Append-only. Partitioned monthly. Queried only for charts.
CREATE TABLE price_observations (
  id              bigserial,
  identifier      text NOT NULL,
  identifier_type text NOT NULL,
  retailer        text NOT NULL,
  price           numeric(10,2),
  in_stock        boolean,
  observed_at     timestamptz NOT NULL,
  source          text,
  PRIMARY KEY (observed_at, id)
) PARTITION BY RANGE (observed_at);

This is the same pattern Keepa, CamelCamelCamel, and idealo all converged on. The hot read path (your /p/[id] page) never touches history. The history table grows append-only and gets partitioned monthly, so old data drops off cleanly when retention expires.

At 100,000 products tracked across 7 retailers with 1 observation per product per day, that's 700,000 rows per day, or about 21M rows per month. At roughly 50 bytes per row, each monthly partition is about 1GB. Postgres handles that without breaking a sweat.
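The payoff is that chart reads stay fully off the hot path; something like this illustrative supabase-js query is the only consumer of price_observations:

// supabase: shared client from the read-path sketch above.
// Cold path: price-history chart for one (identifier, retailer) pair.
// Range-filtering on observed_at lets partition pruning skip old months.
export async function priceHistory(identifier: string, retailer: string, days = 90) {
  const since = new Date(Date.now() - days * 24 * 60 * 60 * 1000).toISOString();
  const { data } = await supabase
    .from("price_observations")
    .select("price, in_stock, observed_at")
    .eq("identifier", identifier)
    .eq("retailer", retailer)
    .gte("observed_at", since)
    .order("observed_at", { ascending: true });
  return data ?? [];
}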

Tiered TTL by retailer × popularity

Not every retailer's data needs the same refresh cadence. Walmart Marketplace listings change hourly. Lowe's listings change weekly. Refreshing both at the same cadence wastes credits on Lowe's and shows stale data on Walmart.

Our base TTL × popularity-tier matrix:

Retailer      Hot (top 1k by traffic)   Warm (top 10k)   Cold (long tail)
Walmart       1h                        12h              7d
Amazon        6h                        24h              14d
Best Buy      6h                        24h              14d
eBay          1h                        12h              7d
Lowe's        24h                       7d               30d
Target        12h                       7d               30d
Home Depot    24h                       7d               30d

Popularity tier is computed nightly from Google Search Console traffic data and stored on pseo_index.popularity_tier. The TTL resolver takes (retailer, tier) and returns the effective TTL in seconds. Stale rows trigger a background refresh; missing rows show "Indexing…" pills until the refresh completes.
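A sketch of that resolver, with the matrix above encoded directly (the tier names and the conservative fallback are illustrative):

// TTL matrix from the table above, in seconds. Rows: retailer; columns: tier.
const H = 3600, D = 86400;
const TTL: Record<string, { hot: number; warm: number; cold: number }> = {
  walmart:   { hot: 1 * H,  warm: 12 * H, cold: 7 * D },
  amazon:    { hot: 6 * H,  warm: 24 * H, cold: 14 * D },
  bestbuy:   { hot: 6 * H,  warm: 24 * H, cold: 14 * D },
  ebay:      { hot: 1 * H,  warm: 12 * H, cold: 7 * D },
  lowes:     { hot: 24 * H, warm: 7 * D,  cold: 30 * D },
  target:    { hot: 12 * H, warm: 7 * D,  cold: 30 * D },
  homedepot: { hot: 24 * H, warm: 7 * D,  cold: 30 * D },
};

export function effectiveTtl(retailer: string, tier: "hot" | "warm" | "cold"): number {
  // Unknown retailers fall back to the most conservative cadence.
  return TTL[retailer]?.[tier] ?? 30 * D;
}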

ISR + Next.js after()

Public pages use Next.js Incremental Static Regeneration with revalidate: 86400 (24 hours). Each page is statically rendered at build time, regenerated on demand at most once per 24 hours, and served from the edge cache.

Cross-retailer enrichment fires in after(), the Next.js 15 API for "do this after the response is sent." The handler runs on the same edge function instance, has full access to env vars and database, and doesn't block the response. Total page-render budget: 1.5 to 2 seconds for a warm cache hit, 2.5 to 3 seconds for a cold-cache hit (which still renders Walmart-only data immediately).
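Wired together, the page looks roughly like this; the helpers are the illustrative ones sketched earlier (runTier0AndTier1 and renderProduct are hypothetical names), and the component body is abbreviated:

// app/p/[id]/page.tsx -- ISR + post-response enrichment.
import { after } from "next/server";

export const revalidate = 86400; // regenerate at most once per 24h

export default async function ProductPage({ params }: { params: Promise<{ id: string }> }) {
  const { id } = await params;
  // Serve from cache, even if stale; never scrape on the render path.
  const { rows, staleRetailers } = await getEnrichments(id, "upc");

  // Fires after the response is sent; Tier 0 + Tier 1 run off the render path.
  after(async () => {
    await runTier0AndTier1(id, staleRetailers);
  });

  return renderProduct(id, rows); // gaps render as "Indexing…" pills
}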

Bot defense + observability

Cloudflare sits in front of the apex domain. Verified bots (Googlebot, Bingbot) get a free pass. AI bots (GPTBot, ClaudeBot, PerplexityBot) are explicitly allowed in robots.txt. Anonymous human visitors get an invisible Turnstile challenge after 4 page views in a session.

Outbound: scrape.do, Firecrawl, and ScraperAPI all route through rotating residential proxies, so individual retailers never see traffic concentrated on a single IP. The Best Buy and eBay native APIs sidestep proxies entirely.

Observability: standard Vercel + Supabase logs. Every successful scrape writes a row to price_observations with the source provider, so there's a full audit trail for where any number came from. Sentry catches errors that escape the fetcher's try/catch.

Cost picture

For 100,000 page views per month with 95% cache hit:

  • Serper Tier 0: 5,000 calls × $0.001 = $5/mo
  • Tier 1 fallback (gap-filling on roughly 30% of cold misses: ~1,500 per-retailer calls averaging a fraction of a cent each): ~$2/mo
  • Vercel hosting: free tier covers this volume
  • Supabase: free tier covers this volume (under 500MB, under 50k MAU)
  • Walmart primary catalog: amortized through the parent organization, marginal cost zero

Total: ~$7/mo at 100k page views.

At 1M page views (95% cache hit): ~$70/mo. At 10M page views: ~$700/mo. Linear scaling, no ugly cliffs.

What ships next

  • Browser extension as distributed scraping fleet (Q3 2026). Real users browsing Amazon/Walmart silently feed data back via WebSocket. Same pattern as Keepa and Honey. Expected to materially reduce Tier 1 spend at scale.
  • compare_retailers MCP tool (June 2026) for single-call cross-retailer comparison without chaining lookup_product.
  • Watch endpoints + webhook delivery (July 2026) for server-side price-drop alerts.
  • Bulk lookup endpoint (August 2026) at discounted token rate for batch agents.

Source

The cross-retailer enrichment lib is at github.com/retailerapi/web/tree/main/apps/marketing/src/lib/cross-retailer. The MCP server is at github.com/retailerapi/mcp. Both are MIT-licensed.

Sign up free and inspect the architecture in production by hitting any public product page. The cross-retailer cells fill in real-time as the background fetchers complete.


Build with retailerapi

1,000 free lookups per month, no credit card. Cross-retailer price and history across every major US retailer that carries the product.