The pSEO playbook for product-data sites in 2026
How to build a programmatic-SEO product-data site that ranks in Google AND gets cited by ChatGPT, Claude, and Perplexity. Real architecture, real schema, real costs.
Programmatic SEO (pSEO) for product-data sites in 2026 is in a weird place. Google's "Discovered, currently not indexed" rate is at an all-time high for thin programmatic pages. AI crawlers (GPTBot, ClaudeBot, PerplexityBot) have become a meaningful traffic source, but they reward stat-rich content over keyword-stuffed templates. Sites that figured out the AEO (Answer Engine Optimization) angle in 2025 are picking up citations in ChatGPT and Perplexity at 2x to 4x the rate of competitors that didn't.
This post is the playbook we ran when we launched retailerapi.com in May 2026. Real architecture, real schema, real costs. Adapt it to your stack.
The premise
You have a database of N products (hundreds, thousands, millions). You want each product to live at a unique URL (/p/&lt;identifier&gt;) that:
- Ranks in Google for long-tail product queries (UPC, EAN, "&lt;product&gt; price history", "&lt;product&gt; walmart")
- Gets cited by ChatGPT, Claude, Perplexity, and Google AI Overviews when users ask product-data questions
- Doesn't trigger thin-content penalties when you publish 100k+ pages
- Costs less than $20/100k page views to operate
These goals pull against each other. Stat-rich pages that satisfy AEO can be slow to render. Sitemap chunks for 1M+ URLs need careful traffic-priority ordering. Schema.org markup that satisfies Google's product-carousel rules can fail Bradley placement audits if you copy-paste templates.
Step 1: Pick the URL pattern
The /p/&lt;identifier&gt; pattern is correct. Reasons:
- Stable across retailers (UPC stays the same; product titles drift)
- Cleanly canonical (/p/194629116676 is one URL; no trailing-slash ambiguity)
- Works with browser autocomplete and link-shareability
- LLM crawlers can guess them ("the URL for UPC X is /p/X"), which makes citations more shareable
Avoid: /products/&lt;category&gt;/&lt;slug&gt; — too many breakable layers. Avoid: /&lt;title-slug&gt; — title drift breaks links forever.
Step 2: Build the sitemap correctly
Three sitemap files, not one:
```
/sitemap.xml           → sitemap index. Lists the chunks below.
/sitemaps/static.xml   → apex routes (homepage, pricing, blog posts, legal)
/sitemaps/products/0   → first 50,000 products
/sitemaps/products/1   → next 50,000
...
```
Sort the products inside each chunk by traffic_score DESC. Highest-traffic pages get reindexed first, which lets Google reallocate crawl budget to your hot pages without sacrificing the long tail.
Don't: dump all 1M URLs into a single 1M-URL sitemap. Google's max-50k-URLs-per-sitemap rule is real, and even within that you want priority ordering. Don't: rely on Next.js's default sitemap.ts metadata convention if you have more than ~10k URLs — you need the route-handler approach for chunked output.
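A minimal sketch of that route-handler approach, assuming a hypothetical getProductChunk helper that pages through your product table ordered by traffic_score DESC:

```ts
// app/sitemaps/products/[chunk]/route.ts — minimal sketch of the chunked
// route-handler approach. getProductChunk is a hypothetical DB helper.
declare function getProductChunk(
  offset: number,
  limit: number
): Promise<{ identifier: string; updatedAt: Date }[]>

const CHUNK_SIZE = 50_000 // Google's per-sitemap URL cap

export async function GET(
  _req: Request,
  { params }: { params: Promise<{ chunk: string }> }
) {
  const { chunk } = await params
  const products = await getProductChunk(Number(chunk) * CHUNK_SIZE, CHUNK_SIZE)

  const urls = products
    .map(
      (p) =>
        `<url><loc>https://retailerapi.com/p/${p.identifier}</loc>` +
        `<lastmod>${p.updatedAt.toISOString()}</lastmod></url>`
    )
    .join('')

  return new Response(
    `<?xml version="1.0" encoding="UTF-8"?>` +
      `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urls}</urlset>`,
    { headers: { 'Content-Type': 'application/xml' } }
  )
}
```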
Step 3: Robots.txt with explicit AI-bot allowlist
The default User-Agent: * Allow-all is not enough. Some AI bots only honor explicit lines. Allow these by name:
```
User-Agent: GPTBot
Allow: /

User-Agent: ChatGPT-User
Allow: /

User-Agent: OAI-SearchBot
Allow: /

User-Agent: ClaudeBot
Allow: /

User-Agent: anthropic-ai
Allow: /

User-Agent: Claude-Web
Allow: /

User-Agent: PerplexityBot
Allow: /

User-Agent: Google-Extended
Allow: /

User-Agent: Bingbot
Allow: /

User-Agent: Applebot-Extended
Allow: /
```
(Plus a User-Agent: * allow-all block for everyone else, and a Sitemap: pointer.)
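For completeness, the tail of the file looks like this (the sitemap URL is ours; use your own):

```
User-Agent: *
Allow: /

Sitemap: https://retailerapi.com/sitemap.xml
```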
Counterintuitive insight from 2025-2026: allowing AI bots is the right call even if you sell data. The AEO research consensus is that LLM citations ("according to retailerapi") drive net-positive signups. Blocking only loses you the traffic.
Step 4: llms.txt and llms-full.txt curated routes
Both files. Both curated, never autogenerated.
llms.txt is the short index — H1, blockquote summary, curated section list with one-line descriptions per entry. Cap at 50 to 200 entries. Include trust pages (privacy, terms), pricing, and your top-traffic products.
llms-full.txt is the long-form companion — same structure, but with multi-line descriptions per entry. Cap at 1,000 entries. Include a freshness line ("Last updated: YYYY-MM-DD") at the top.
Regenerate both nightly via cron. Both must be fetchable at the apex (yourdomain.com/llms.txt, not /wp-content/...).
The audit rubric to score against: seo-geo-optimization/references/llms-audit-rubric.md (10 categories, 0-2 each, target overall ≥ 16/20).
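A minimal llms.txt sketch in that shape (the product entry reuses the schema example below; the other section names and URLs are illustrative):

```
# retailerapi

> Cross-retailer product data: price, history, and availability for every
> major US retailer. Last updated: 2026-05-01.

## Products

- [12-in-1 Electric Pressure Cooker 6 QT](https://retailerapi.com/p/19667262713): 6 offers, $59.99–$181.64, price history
...

## Trust

- [Pricing](https://retailerapi.com/pricing): 1,000 free lookups/mo
- [Privacy](https://retailerapi.com/privacy)
- [Terms](https://retailerapi.com/terms)
```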
Step 5: Schema.org with Bradley placement
Bradley placement is the rule that says: emit your Organization schema once, on the homepage, with a stable @id. Every other page references that @id instead of duplicating the Organization block.
```html
<!-- Homepage -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://retailerapi.com/#organization",
      "name": "retailerapi",
      "url": "https://retailerapi.com",
      "logo": "https://retailerapi.com/logo.png",
      "founder": { "@type": "Person", "name": "Matt Hall" },
      "sameAs": ["https://github.com/retailerapi", "https://x.com/retailerapi"]
    },
    {
      "@type": "WebSite",
      "@id": "https://retailerapi.com/#website",
      "url": "https://retailerapi.com",
      "publisher": { "@id": "https://retailerapi.com/#organization" }
    }
  ]
}
</script>
```

```html
<!-- /p/<identifier> page -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "@id": "https://retailerapi.com/p/19667262713#product",
  "name": "12-in-1 Electric Pressure Cooker 6 QT",
  "image": "https://i5.walmartimages.com/...",
  "brand": { "@type": "Brand", "name": "Cooks Essentials" },
  "gtin12": "...",
  "isPartOf": { "@id": "https://retailerapi.com/#organization" },
  "offers": {
    "@type": "AggregateOffer",
    "lowPrice": "59.99",
    "highPrice": "181.64",
    "offerCount": 6,
    "priceCurrency": "USD",
    "offers": [...]
  }
}
</script>
```
Each product page emits only Product + AggregateOffer + BreadcrumbList. It never duplicates the Organization block, and never emits a second top-level Organization or WebSite. The isPartOf reference is enough for Google to associate the product with our brand.
Step 6: Princeton statistics + fluency content optimization
The 2024 Princeton paper on AEO (arXiv:2311.09735) tested 9 content-optimization methods and found that statistics density + fluency improvements drove +37% citation lift across leading LLMs.
Concretely, on every cornerstone page:
- Statistics density target: ≥ 0.5% statistic-tokens / total words. Numbers, units, percentages, dates.
- Avoid keyword stuffing: any term over 2.5% of total page words triggers a penalty.
- Flesch-Kincaid grade-level around 8 to 11 (plain English).
- Lead with concrete numbers in the first 100 words.
Bad opener (low stat-density, high keyword-stuffing):
"Walmart price tracking is essential for Walmart sellers. Our Walmart price tracker tracks Walmart prices on Walmart products. With our Walmart price tracking tool, you can track Walmart price drops on Walmart..."
Better opener (stat-rich, varied phrasing):
"Walmart Marketplace runs over 100M SKUs across 240,000 third-party sellers. None of those listings publish a price-history chart. This guide compares 3 working methods in 2026 to track Walmart price history, with cost ranging from $0/mo to $537/mo and product-coverage ceilings from 50 to 5M+ items."
The second opener packs 8 numeric tokens into 46 words, a density of roughly 17%. Lead numbers: yes. Specific brackets on cost and coverage: yes.
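If you want to enforce the density target in CI, here is a rough sketch; the STAT_TOKEN regex is our assumption about what counts as a statistic (numbers, currency, percentages, M/B/k suffixes) and should be tuned to your corpus:

```ts
// stat-density.ts — rough sketch of the Step 6 density check.
const STAT_TOKEN = /^\$?\d[\d,.]*([MBk%+]|\/mo)*$/

export function statDensity(text: string): number {
  const words = text.split(/\s+/).filter(Boolean)
  const stats = words.filter((w) =>
    // Strip trailing punctuation before testing ("240,000." → "240,000").
    STAT_TOKEN.test(w.replace(/[.,;:!?)("']+$/, ''))
  )
  return words.length ? stats.length / words.length : 0
}

// Usage: fail the build when a cornerstone page drops below the 0.5% target.
// if (statDensity(pageText) < 0.005) process.exit(1)
```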
Step 7: Per-page Open Graph image
Don't reuse the apex OG image for every page. Generate a per-page OG dynamically (Next.js opengraph-image.tsx) with the product title, brand, current price, and key stats. Rendered as a 1200×630 PNG, served at /p/&lt;identifier&gt;/opengraph-image.
This drives 2x to 3x click-through on Slack/X/LinkedIn shares. Costs about 0.5 seconds of compute per cold-cache OG render; effectively free if cached.
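A minimal sketch using Next.js's ImageResponse; getProduct is a hypothetical read from your own product table (in recent Next.js versions, params arrives as a Promise):

```tsx
// app/p/[id]/opengraph-image.tsx — minimal per-page OG sketch.
import { ImageResponse } from 'next/og'

// Hypothetical DB helper — replace with your own query.
declare function getProduct(
  id: string
): Promise<{ title: string; brand: string; lowPrice: string }>

export const size = { width: 1200, height: 630 }
export const contentType = 'image/png'

export default async function Image({
  params,
}: {
  params: Promise<{ id: string }>
}) {
  const { id } = await params
  const product = await getProduct(id)
  return new ImageResponse(
    (
      <div
        style={{
          width: '100%',
          height: '100%',
          display: 'flex',
          flexDirection: 'column',
          justifyContent: 'center',
          padding: 64,
          background: '#fff',
        }}
      >
        <div style={{ fontSize: 56, fontWeight: 700 }}>{product.title}</div>
        <div style={{ fontSize: 36, color: '#555' }}>
          {product.brand} · from ${product.lowPrice}
        </div>
      </div>
    ),
    size
  )
}
```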
Step 8: Time-to-first-byte budget
Public pages need a time-to-first-byte under 2 seconds. ISR + the Next.js after() pattern (covered in our architecture deep dive) gets us 1.5 to 2 seconds on a warm cache. Background scraping fills empty cells out-of-band.
Don't synchronously scrape on render. Don't lazy-load product images via JavaScript (LLM crawlers don't see them). Don't gate content we already hold in our DB behind a "loading…" spinner.
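A sketch of the render-now, scrape-later pattern; getCachedProduct and backfillProduct are hypothetical helpers over your own DB and scrape queue:

```tsx
// app/p/[id]/page.tsx — render from the DB, backfill after the response.
import { after } from 'next/server'

declare function getCachedProduct(id: string): Promise<{ name: string } | null>
declare function backfillProduct(id: string): Promise<void>

export const revalidate = 3600 // ISR: serve cached HTML, regenerate hourly

export default async function ProductPage({
  params,
}: {
  params: Promise<{ id: string }>
}) {
  const { id } = await params
  const product = await getCachedProduct(id) // DB read only; never scrape here

  // Runs after the response has been sent; fills missing cells out-of-band.
  after(() => backfillProduct(id))

  return (
    <main>
      <h1>{product?.name}</h1>
    </main>
  )
}
```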
Step 9: Pre-deploy AEO validator
Run a script against your live URL on every deploy that checks:
- All JSON-LD blocks parse cleanly (no control characters)
- Homepage has Organization + WebSite
- Product pages have exactly 1 top-level Product (no duplicate Organization)
- robots.txt has explicit Allow lines for the canonical AI bot list
- /llms.txt and /llms-full.txt return 200 with > 200 bytes and have freshness lines
We use a 200-line validate-schema.mjs script. Source on GitHub. Failure exits non-zero and our deploy pipeline fails closed.
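A condensed sketch of the same checks (not the real validate-schema.mjs; the freshness regex and bot list are assumptions drawn from the steps above):

```ts
// validate-aeo.ts — condensed sketch of the pre-deploy checks.
const BASE = process.env.BASE_URL ?? 'https://retailerapi.com'
let failed = false

function check(name: string, ok: boolean) {
  if (!ok) {
    console.error(`FAIL: ${name}`)
    failed = true
  }
}

const home = await (await fetch(BASE)).text()
const blocks = [
  ...home.matchAll(/<script type="application\/ld\+json">([\s\S]*?)<\/script>/g),
].map((m) => JSON.parse(m[1])) // throws on control characters / invalid JSON
const graph = blocks.flatMap((b) => b['@graph'] ?? [b])
check('homepage has Organization', graph.some((n) => n['@type'] === 'Organization'))
check('homepage has WebSite', graph.some((n) => n['@type'] === 'WebSite'))

const robots = await (await fetch(`${BASE}/robots.txt`)).text()
for (const bot of ['GPTBot', 'ClaudeBot', 'PerplexityBot']) {
  check(
    `robots.txt allows ${bot}`,
    new RegExp(`User-Agent: ${bot}\\s+Allow: /`, 'i').test(robots)
  )
}

for (const path of ['/llms.txt', '/llms-full.txt']) {
  const res = await fetch(`${BASE}${path}`)
  const body = await res.text()
  check(
    `${path} fresh + non-empty`,
    res.ok && body.length > 200 && /Last updated:/.test(body)
  )
}

process.exit(failed ? 1 : 0) // fail closed
```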
Step 10: Indexing strategy
Don't blast 1M URLs at Google on day 1. Their crawl-budget allocation will dump 99% as "Discovered, currently not indexed." Instead:
- Day 1-7: Submit a 100-URL sitemap of hand-picked, content-complete products.
- Week 2: If 50%+ get indexed, expand to 1,000 URLs.
- Week 4: Expand to 10,000 if "Discovered, not indexed" rate is below 30%.
- Month 3: Expand to 100k if signals are good.
- Month 6: Cap at whatever your traffic-score tail justifies.
The expansion gate is the "Discovered, not indexed" rate from Google Search Console. If it spikes above 30%, freeze expansion and improve content depth on the existing pages before adding more.
Step 11: IndexNow for Bing + Yandex
Bing's IndexNow protocol is free. Submitting URLs at publish time triggers crawl within minutes. For pSEO sites with frequent product additions/changes, IndexNow alone can drive 30 to 50% more Bing traffic.
Implementation: ~50 lines of TypeScript. Generate a key, host it at /&lt;key&gt;.txt, and POST URL changes to https://api.indexnow.org/indexnow on every product-page revalidation. Each call is capped at 10,000 URLs, but you can chain calls.
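A sketch of the submission helper; HOST and KEY are your own values, and the payload shape follows the public IndexNow spec:

```ts
// indexnow.ts — sketch of the IndexNow submission helper.
const HOST = 'retailerapi.com'
const KEY = process.env.INDEXNOW_KEY! // same key hosted at /<KEY>.txt

export async function submitToIndexNow(urls: string[]) {
  // Each call is capped at 10,000 URLs, so chunk and chain.
  for (let i = 0; i < urls.length; i += 10_000) {
    await fetch('https://api.indexnow.org/indexnow', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json; charset=utf-8' },
      body: JSON.stringify({
        host: HOST,
        key: KEY,
        keyLocation: `https://${HOST}/${KEY}.txt`,
        urlList: urls.slice(i, i + 10_000),
      }),
    })
  }
}
```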
Cost picture
For 100,000 product pages at 95% cache hit rate:
- Serper Tier 0 (cross-retailer enrichment): ~$5/mo
- Per-retailer Tier 1 fallback: ~$2/mo
- Hosting (Vercel, Supabase free tiers): $0
- Bot-defense (Cloudflare free tier + Turnstile): $0
- IndexNow: $0
- GSC API: $0
Total: ~$7/mo at 100k page views. Linear scaling.
For 1M page views: ~$70/mo. For 10M: ~$700/mo.
Compare to a synchronous-scrape architecture: $200 to $500/mo at 100k page views (95% of which is wasted on duplicate scrapes).
What this misses
- Internal linking strategy. Short version: every /p/&lt;identifier&gt; page should link to 3 to 5 related products plus 1 to 2 cornerstone blog posts. A simple "related products" component is enough.
- Author and E-E-A-T signals. For non-product pages (blog posts), add an Author entity and Person schema with credentials. We're punting until we have proven traffic.
- Internationalization. Not covered. If your audience is global, hreflang + per-locale sitemaps are mandatory.
The full retailerapi.com source
We're keeping retailerapi's marketing app, API, and MCP server open source on GitHub (MIT license). The cross-retailer enrichment lib, JSON-LD modules, llms.txt curated routes, and validator scripts are all there for inspection or reuse. Repo links: marketing, api, mcp.
Try the result
retailerapi.com is the product-data site we built using exactly this playbook. View source on any /p/[id] page to see the schema, hit /llms.txt for the curated index, and /sitemap.xml for the sitemap index. The pre-deploy AEO validator output is in our deploy logs.
Sign up free if you want to use the API in your own pSEO project: 1,000 free lookups per month, no credit card required. Cross-retailer price and history across every major US retailer that carries the product.