Scaling to 20k Pages: Programmatic SEO Lessons from My 2026 Experiment

In January 2025, AutoDetective.ai had about 800 indexed pages. By March 2026, it has over 20,000. The traffic graph looks like a hockey stick that somebody bent in the middle, because growth at this scale is never smooth.

I've learned more about how Google actually works in the past fourteen months than in the previous twenty years of building websites. Most of what I thought I knew was wrong, or at least incomplete. The conventional SEO wisdom that works at 50 pages breaks down entirely at 5,000 pages, and what works at 5,000 pages has a different set of failure modes at 20,000.

This is a technical post about what actually happened — the infrastructure, the mistakes, the metrics, and the revenue. No theory. Just what I observed when I pushed a programmatic SEO experiment well past the point where Google's crawler started treating my site differently.


Where This Started

AutoDetective.ai is a vehicle research platform. Users can look up vehicles by VIN, browse by make and model, and access vehicle history data. The core value proposition is straightforward, but the SEO opportunity comes from the long tail: there are thousands of specific make-model-year combinations that people search for, and each one represents a potential landing page.

By early 2025, I had built out pages for the most popular vehicles — roughly 800 pages covering major makes and recent model years. These were performing well. Decent rankings, steady organic traffic, reasonable ad revenue. The question was whether the same approach could scale by an order of magnitude.

The answer turned out to be yes, but with a lot of caveats.


The Technical Architecture for 20k Pages

You can't manually create 20,000 pages. That's obvious. But the architecture for programmatically generating pages at this scale requires more thought than most people realize.

Here's the basic structure I landed on:

// Simplified page generation logic
function generatePage(make, model, year) {
    const slug = slugify(`${make}-${model}-${year}`);
    const data = fetchVehicleData(make, model, year);

    if (!data || data.recordCount < 3) {
        // Don't create thin pages — Google will punish you
        return null;
    }

    return {
        title: `${year} ${make} ${model} - Vehicle Research & History`,
        description: buildDescription(data),
        specifications: data.specs,
        commonIssues: data.knownIssues,
        recallData: data.recalls,
        marketData: data.pricing
    };
}

The critical decision: not every possible page should exist. Early on, I generated pages for every make-model-year combination I had any data for. That gave me about 35,000 potential pages. But many of them were thin — a page for a 1987 Isuzu Impulse with two data points and no real content isn't useful to anyone.

I implemented a minimum data threshold. If a vehicle didn't have at least meaningful specifications, recall data, or market pricing, the page didn't get created. That cut the total from 35,000 to about 22,000 pages that had genuine content depth.
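As a sketch, that threshold can be expressed as a predicate run before page generation. The field names and cutoffs here are illustrative, not the production values:

```javascript
// Illustrative minimum-data predicate. `hasMinimumData` and its field
// names (specs, recalls, pricing.sampleSize) are assumptions for this sketch.
function hasMinimumData(data) {
    if (!data) return false;

    // Count how many substantive data categories the vehicle actually has.
    const signals = [
        data.specs && Object.keys(data.specs).length >= 5,   // meaningful specifications
        data.recalls && data.recalls.length > 0,             // recall data
        data.pricing && data.pricing.sampleSize >= 10        // market pricing
    ];

    // Require at least one substantive category before a page is created.
    return signals.filter(Boolean).length >= 1;
}
```

A vehicle with only a couple of spec fields fails the check and never becomes a page; one with real recall or pricing data passes.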

This single decision — being willing to not create pages — was probably the most important technical choice in the entire project.


Sitemap Management at Scale

This is where things got interesting from an infrastructure perspective.

Google's sitemap protocol supports up to 50,000 URLs per sitemap file, with a maximum file size of 50MB. In theory, my 20,000 pages fit comfortably in a single sitemap. In practice, a single sitemap at this scale is a terrible idea.

The problems:

Crawl prioritization. Google doesn't crawl all URLs in a sitemap equally. With a single file containing 20,000 URLs, Google's crawler was visiting maybe 2,000-3,000 pages per day. New pages took weeks to get indexed. Updated pages took even longer to get re-crawled.

Change detection. When everything is in one sitemap, the lastmod timestamps become noise. Google can't efficiently determine which pages have actually changed, so it either re-crawls everything (wasteful) or ignores the timestamps entirely (slow).

Debugging. When you have indexing problems — and you will have indexing problems — finding the affected pages in a 20,000-URL sitemap is a nightmare.

The solution was a sitemap index with multiple segmented sitemaps:

// Sitemap index generation
function generateSitemapIndex() {
    const makes = getAllMakes();

    // One sitemap per make
    const sitemaps = makes.map(make => ({
        loc: `https://autodetective.ai/sitemaps/sitemap-${slugify(make)}.xml`,
        lastmod: getLatestModDate(make)
    }));

    // Separate sitemap for core pages
    sitemaps.push({
        loc: "https://autodetective.ai/sitemaps/sitemap-core.xml",
        lastmod: new Date().toISOString()
    });

    return buildSitemapIndex(sitemaps);
}

I segmented sitemaps by vehicle make. Toyota gets its own sitemap. Ford gets its own. This means each individual sitemap has a manageable number of URLs (200-1,500 depending on the make), and when I update data for a specific make, only that sitemap's lastmod changes.
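For completeness, here is a sketch of what one per-make sitemap file could look like when built. The `pages` shape and the URL path scheme are assumptions for illustration; the XML format follows the standard sitemap protocol:

```javascript
// Build the XML body of a single make's sitemap. The /vehicles/ URL
// scheme and the `pages` entry shape are illustrative assumptions.
function buildMakeSitemap(pages) {
    const entries = pages.map(p =>
        "  <url>\n" +
        "    <loc>https://autodetective.ai/vehicles/" + p.slug + "</loc>\n" +
        "    <lastmod>" + p.lastmod + "</lastmod>\n" +
        "  </url>"
    ).join("\n");

    return '<?xml version="1.0" encoding="UTF-8"?>\n' +
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
        entries +
        "\n</urlset>";
}
```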

The result: new pages went from taking 2-3 weeks to index to typically 3-5 days. That improvement alone was worth the refactoring effort.


Crawl Budget: The Invisible Constraint

If you've only ever worked on small websites, you've probably never thought about crawl budget. It's one of those things that doesn't matter until it really matters.

Crawl budget is essentially how many pages Google is willing to crawl on your site within a given time period. For small sites, it's effectively unlimited — Google will crawl everything. For sites with thousands of pages, Google makes deliberate choices about which pages to crawl and how often.

I hit the crawl budget wall at around 8,000 indexed pages. The symptoms were subtle at first:

  • New pages were taking longer to get indexed
  • Some pages that were previously indexed started getting marked as "Crawled - currently not indexed" in Search Console
  • Pages deep in the site structure were getting crawled less frequently

The fixes involved several changes:

Improving server response time. Google explicitly uses server speed as a factor in crawl budget allocation. If your server responds slowly, Google crawls fewer pages per session. I optimized database queries and added caching for the generated pages. Average response time dropped from ~400ms to ~120ms.
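The caching layer can be as simple as a TTL'd in-memory map in front of the page renderer. This is a minimal sketch, not the actual implementation; the one-hour TTL and the `render` callback are assumptions:

```javascript
// Serve a rendered page from memory if it was built within the last hour.
const pageCache = new Map();
const TTL_MS = 60 * 60 * 1000; // one hour, illustrative

function getCachedPage(slug, render) {
    const hit = pageCache.get(slug);
    if (hit && Date.now() - hit.builtAt < TTL_MS) {
        return hit.html; // fast path: no database work, quick response for Googlebot
    }
    const html = render(slug); // slow path: build the page and cache it
    pageCache.set(slug, { html, builtAt: Date.now() });
    return html;
}
```

Because Googlebot tends to revisit popular pages, even a naive cache like this absorbs most crawler traffic without touching the database.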

Eliminating duplicate and near-duplicate content. I had some pages that were too similar — different model years of the same vehicle with nearly identical specifications. Google was spending crawl budget visiting these pages and then deciding they were duplicates. I consolidated where appropriate and added stronger differentiation where pages needed to remain separate.

Strategic use of robots.txt. I blocked crawling of utility pages, admin routes, and API endpoints. Every page Google doesn't need to crawl is budget freed up for pages you actually want indexed.

# robots.txt optimization
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /search?
Disallow: /compare?
Allow: /api/vehicles/*/overview
Sitemap: https://autodetective.ai/sitemaps/sitemap-index.xml

Internal linking architecture. This deserves its own section.


Internal Linking: The Multiplier Nobody Talks About

Internal linking is the most underrated factor in programmatic SEO at scale. I'm convinced of this now, after watching the data for over a year.

The initial version of AutoDetective.ai had a flat structure. Homepage linked to make pages. Make pages linked to model pages. Model pages linked to year pages. Clean, logical, insufficient.

The problem: pages more than three clicks from the homepage get crawled less frequently and rank lower. With a flat hierarchy, my year-specific pages were three to four levels deep. Google treated them as low-priority.

The fix was building a dense internal linking network:

// Cross-linking related vehicles
function getRelatedLinks(make, model, year) {
    return [
        // Adjacent years of the same model
        ...getAdjacentYears(make, model, year),
        // Competing models in the same class
        ...getCompetitors(make, model, year),
        // Same make, different popular models
        ...getPopularModels(make, 5),
        // Most popular overall (high PageRank pages)
        ...getMostViewedVehicles(3)
    ];
}

Every vehicle page now links to:

  • The previous and next model years
  • Three to five competing vehicles in the same class
  • The five most popular models from the same make
  • The three most-visited vehicles site-wide

This creates a web of internal links where no page is more than two clicks from a high-authority page. The effect on crawl frequency and indexing speed was dramatic. Pages that were being crawled once every two weeks started getting crawled every three to four days.

More importantly, the ranking improvements were significant. Pages with strong internal linking consistently ranked one to three positions higher than equivalent pages without it.


What Google Rewards at Scale

After fourteen months of observation, here's what I can say with confidence about what Google rewards for large-scale programmatic content:

Genuine data depth. Pages that have real, unique data rank better than pages that reformat the same information available elsewhere. My best-performing pages are ones with recall data, market pricing, and common issues — information that requires aggregation from multiple sources. Pages with only basic specifications perform poorly because that data is available on fifty other sites.

Page-level uniqueness. Google got dramatically better at detecting programmatic content in 2025. Pages generated from a template with only minor data substitutions perform poorly. Pages where the content structure adapts to the data — showing different sections based on what information is available, varying the content depth based on data richness — perform well.

Freshness signals. Pages that get updated when new data is available rank better than static pages. I implemented a system that automatically updates pages when new recall data is issued, when market prices change significantly, or when new model year data becomes available. Google notices and rewards this.
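A sketch of that update trigger: when fresh data arrives, decide whether a page needs a rebuild. The 5% price threshold and field names are illustrative assumptions, not the production logic:

```javascript
// Decide whether a page should be regenerated given newly fetched data.
// Dates are ISO-8601 strings, so lexicographic comparison works.
function needsRefresh(page, latest) {
    // A new recall was issued after the page was last built
    if (latest.latestRecallDate > page.builtAt) return true;

    // Market price moved more than 5% since the last build (illustrative cutoff)
    const priceDelta = Math.abs(latest.avgPrice - page.avgPrice) / page.avgPrice;
    if (priceDelta > 0.05) return true;

    return false;
}
```

Pages that pass this check get rebuilt, their sitemap's lastmod updates, and Google re-crawls them on the next visit.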

User engagement. Pages where users spend time, click through to related pages, and don't immediately bounce rank better. This sounds obvious, but it has a specific implication for programmatic SEO: the pages need to be genuinely useful, not just indexed. I added interactive elements — comparison tools, price trend charts, recall lookup — that increased average time on page from 45 seconds to about 2 minutes.


What Google Penalizes at Scale

Thin content multiplication. Creating thousands of pages with minimal content will not only fail to rank — it will actively hurt your site. Google interprets a high percentage of thin pages as a quality signal for the entire domain. When I had 35,000 pages including the thin ones, my overall indexing rate dropped. When I pruned to 22,000 quality pages, the remaining pages ranked better than the larger set had.

Template sameness. If Google determines that your pages are formulaic — same structure, same sentence patterns, just different data inserted — it devalues the entire pattern. I had to invest significant effort in template variation: different layouts for different vehicle classes, conditional sections based on data availability, variable content ordering.
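One way to express that conditional structure, sketched with hypothetical section names and a hypothetical `class` field:

```javascript
// Pick page sections based on what data a vehicle actually has, so
// generated pages don't all share one rigid template structure.
function selectSections(data) {
    const sections = ["overview"];
    if (data.recalls && data.recalls.length) sections.push("recalls");
    if (data.pricing) sections.push("priceTrends");
    if (data.knownIssues && data.knownIssues.length) sections.push("commonIssues");
    // Class-specific sections vary the layout between vehicle types
    if (data.class === "truck") sections.push("towingSpecs");
    return sections;
}
```

Two vehicles with different data depth end up with visibly different pages, which is exactly the signal a rigid template fails to produce.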

Aggressive interlinking without relevance. Internal links need to make sense. Early on, I linked every page to every other page within the same make. Google apparently detected this as manipulative. When I switched to relevance-based linking (same class, same year range, known competitors), the results improved.

Rapid page creation. Adding 5,000 pages overnight looks like spam to Google. When I scaled from 8,000 to 20,000 pages, I did it in batches of 500-1,000 pages per week over about three months. Each batch was fully indexed and stabilized before the next batch went live. Patience at this stage is non-negotiable.
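The batching itself is trivial; the discipline is in the release gating. A minimal sketch of splitting a backlog into weekly batches (the gating check before publishing each batch is left out here):

```javascript
// Split a backlog of new page slugs into fixed-size release batches.
// Each batch goes live only after the previous one is indexed and stable.
function planBatches(slugs, batchSize) {
    const batches = [];
    for (let i = 0; i < slugs.length; i += batchSize) {
        batches.push(slugs.slice(i, i + batchSize));
    }
    return batches;
}
```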


The Revenue and Traffic Numbers

Transparency matters here, so let me share real numbers.

Traffic growth (organic sessions per month):

  • January 2025: ~3,200
  • June 2025: ~8,400
  • December 2025: ~14,100
  • March 2026: ~19,500

Indexed pages:

  • January 2025: ~800
  • June 2025: ~5,200
  • December 2025: ~15,800
  • March 2026: ~20,400

Monthly ad revenue:

  • January 2025: ~$85
  • June 2025: ~$310
  • December 2025: ~$620
  • March 2026: ~$780

The revenue isn't impressive by any standard that matters. $780 a month from a site with 20,000 indexed pages and nearly 20,000 monthly sessions tells you something important: most programmatic SEO pages have low commercial intent. People searching for "2019 Honda Civic specifications" aren't in a buying mood. They're researching.

The affiliate revenue — from linking to vehicle listings and related services — has been more promising but more volatile. Some months it's higher than the ad revenue. Some months it's nearly zero. I'm still optimizing this.


Infrastructure Costs

Running a site at this scale isn't free, and it's worth being honest about the costs:

  • Hosting: $24/month (DigitalOcean droplet)
  • Database: $15/month (managed PostgreSQL)
  • Data sources: ~$50/month (various vehicle data APIs)
  • Domain and SSL: ~$15/year
  • Total monthly cost: ~$90

At $780/month in ad revenue plus some affiliate income, the site is profitable. But the profit margin is thinner than most people would guess. Programmatic SEO at scale has real infrastructure costs, and those costs don't scale linearly — the database gets more expensive as the data grows, and the API costs for data sourcing increase with page count.


What I'd Do Differently

Start with fewer, better pages. I wasted months creating pages that were thin and should never have existed. Starting with 5,000 excellent pages would have produced better results faster than starting with 8,000 mediocre ones.

Invest in original data earlier. The pages that rank best have data you can't find elsewhere. I should have invested in building unique datasets — user reviews, owner-reported issues, price prediction models — from the beginning instead of relying primarily on aggregated public data.

Build the internal linking architecture first. I retrofitted the cross-linking system onto an existing site. It would have been cleaner and more effective to design the linking structure before generating the pages.

Monitor crawl budget from day one. I didn't start paying attention to Google Search Console's crawl stats until I hit problems. By then, I'd already wasted months of crawl budget on thin pages and duplicate content.


Is Programmatic SEO Worth It in 2026?

Yes, but not the way most people think about it.

The era of generating thousands of low-quality pages from a template and watching them rank is over. Google's algorithms in 2025 and 2026 have gotten sophisticated enough to detect and devalue that approach. The sites I see succeeding with programmatic SEO share specific characteristics: they have genuine data depth, they provide real utility to users, and they invest in quality at the page level even at scale.

The AI tools make this more feasible than ever. I use agents extensively for data processing, content generation, and quality checking. But the strategy — which pages to create, what data to prioritize, how to structure the user experience — still requires human judgment and patience.

Twenty thousand pages sounds like a lot. But in the context of vehicle research, it's actually a small fraction of the total addressable search space. The growth potential is real, if I continue to prioritize quality over quantity.

That's the lesson I keep relearning: the constraint is never "can we create more pages?" The constraint is always "can we create more pages that are genuinely worth indexing?"


Shane Larson is a software engineer and technical author based in Alaska. He writes about software development, AI tools, and the business of building things at grizzlypeaksoftware.com.
