What Is Crawl Budget? A Complete Definition
Crawl budget is the total number of URLs that Googlebot (or any other search engine crawler) will fetch and process on your website within a given time window — typically measured per day. It represents Google's finite allocation of crawling resources across every website on the internet. Because those resources are not unlimited, Google makes decisions about how many pages to crawl on each site, how often, and in what order.
Understanding crawl budget is fundamental to technical SEO because a page that is never crawled cannot be indexed, and a page that is not indexed cannot rank. For small brochure sites with fewer than a few hundred pages this rarely causes problems — Google can sweep through the whole site in minutes. But for large e-commerce stores, news publishers, SaaS platforms, and any site that generates URLs dynamically, crawl budget becomes one of the most important technical levers available.
Google's own documentation defines crawl budget as the combination of two interacting factors: crawl rate limit and crawl demand. The actual number of pages Google crawls on any given day is governed by whichever of these two factors is the binding constraint at that moment. To fully understand crawl budget SEO you need to understand both.
Crawl Rate Limit vs Crawl Demand: The Two Pillars of Crawl Budget
Crawl Rate Limit
The crawl rate limit is the maximum speed at which Googlebot will fetch pages from your server. Its primary purpose is to prevent Googlebot from overwhelming your infrastructure. Google dynamically adjusts this limit based on two signals:
- Server health: If your server responds quickly and with consistently low error rates, Google interprets this as a signal that it can fetch more pages without causing harm. A server that returns frequent 5xx errors or takes several seconds per response will see its crawl rate reduced automatically.
- Search Console crawl rate setting: Site owners could previously lower (but never raise above Google's own ceiling) the crawl rate via a legacy Search Console settings panel; Google retired that setting in early 2024. If Googlebot is putting your server under strain today, the supported approaches are to let Google's automatic adjustment react to slower responses, or to return 503/429 status codes temporarily. Either way, slowing the crawl also reduces the number of pages Google can discover.
Think of the crawl rate limit as the throughput ceiling — the fastest Googlebot is willing to work on your site at any given moment.
Crawl Demand
Crawl demand is how much Google actually wants to crawl your site, independent of server constraints. It is driven by two sub-factors:
- Popularity: URLs that attract more links, more traffic, and more engagement signals are considered more valuable by Google and are crawled more frequently. A homepage linked to from thousands of external domains will be re-crawled far more often than an obscure product page with no inbound links.
- Staleness: Google tries to keep its index up to date. If your site publishes content frequently — daily news articles, live stock prices, real-time product availability — Google will increase crawl demand to avoid serving stale results to users.
Crawl demand is the appetite side of the equation. The actual crawl budget in practice is roughly: min(crawl rate limit, crawl demand). If Google wants to crawl 50,000 pages a day but your server can only handle 10,000 requests without degrading, the effective budget is 10,000. Conversely, if your server is powerful but Google only considers 5,000 of your URLs worth crawling, budget is capped by demand regardless of server capacity.
When Does Crawl Budget Matter? A Site Size Guide
One of the most common misconceptions in crawl budget SEO is that it matters for every website equally. In practice, the impact of crawl budget scales dramatically with site size and URL structure complexity. The table below gives a practical guide.
| Site Type / Size | Pages in Index | Crawl Budget Risk | Priority Action |
|---|---|---|---|
| Small brochure site | Under 500 | Very low — Google crawls the full site within hours | Focus on content quality, not crawl management |
| Medium business site | 500–5,000 | Low to moderate — issues arise mainly from duplicate content or excessive parameter URLs | Audit for duplicate URLs and thin pages |
| Large content site / blog | 5,000–50,000 | Moderate — crawl budget starts to constrain how quickly new content is discovered | Optimise internal linking; submit XML sitemaps |
| E-commerce store | 10,000–500,000 | High — faceted navigation, sorting and filtering can generate millions of URLs | Block or canonicalise faceted URLs; fix redirect chains |
| News / media publisher | 100,000+ | High — freshness demands compete with large archive; stale pages may lose crawl slots | Use News Sitemaps; remove or noindex old low-value content |
| Enterprise / aggregator | 1,000,000+ | Critical — even a small percentage of wasted budget translates to tens of thousands of uncrawled pages | Full crawl budget audit; server log analysis mandatory |
As a rule of thumb: if your site has fewer than 1,000 indexable pages and they are all genuinely useful, you can largely ignore crawl budget as a priority. Once you pass 10,000 pages — or once you start generating URLs via parameters, filters, or dynamic routing — crawl budget optimisation becomes a worthwhile investment.
How Google Allocates Crawl Resources
Google does not have a single crawl queue. It operates multiple crawling pipelines in parallel, each with different priorities and objectives. Understanding this architecture helps explain why some pages get crawled daily while others are ignored for months.
Googlebot Types
There are several distinct Googlebot user agents. The main ones relevant to crawl budget are:
- Googlebot (desktop): The primary crawler for desktop rendering. Used to discover and index the majority of web content.
- Googlebot (smartphone): Google's mobile-first crawler. Since Google switched to mobile-first indexing, this is the dominant crawler for most sites. If your site has a separate mobile version, both user agents may visit.
- Googlebot-Image: Crawls image content independently. Does not consume the same budget as the main Googlebot in most cases.
- Googlebot-Video, AdsBot-Google, APIs-Google: Specialised crawlers with separate budgets.
When SEOs refer to crawl budget, they almost always mean the Googlebot smartphone (or desktop) budget — the one that determines whether your pages get indexed in web search.
The Crawl Queue and Scheduling
Google maintains a prioritised crawl queue. URLs enter the queue from multiple sources: existing index entries scheduled for recrawl, links discovered on pages Google has already crawled, XML sitemaps, manual URL Inspection requests, and signals from other Google properties (Search Console, Google Analytics, etc.).
Google uses machine learning to schedule recrawl intervals. Pages that change frequently, attract new links, or generate traffic are scheduled for more frequent recrawls. Pages that have not changed since the last crawl and attract no new signals are pushed further back in the queue, sometimes not revisited for weeks or months.
This scheduling intelligence is why simply "submitting a sitemap" does not guarantee crawling — Google weights its own signals more heavily than sitemap declarations. A sitemap is a hint, not a command.
Crawl Budget Optimisation: 12 Strategies
The goal of crawl budget optimisation is straightforward: ensure Googlebot spends its allocated budget on pages that matter, rather than wasting it on low-value, duplicate, or broken URLs. The following strategies are ordered roughly from highest to lowest impact for most sites.
1. Eliminate or Consolidate Duplicate URLs
Duplicate content is the single biggest crawl budget drain on most sites. Common culprits include:
- HTTP and HTTPS versions of the same page both accessible
- Trailing slash and non-trailing slash variants (`/page` vs `/page/`)
- WWW and non-WWW versions both resolving without a redirect
- URL parameters that do not change content (`?ref=email`, `?utm_source=newsletter`)
- Session IDs appended to URLs
- Printer-friendly or AMP versions accessible without canonicalisation
For each of these, use a canonical tag to point to the preferred URL, and where possible set up a permanent redirect so Googlebot only ever follows one path. You can audit canonical issues across your site with the RankNibbler site audit.
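To make the consolidation concrete, here is a minimal Python sketch that collapses the variants listed above onto one preferred form. The preferred host, the HTTPS and trailing-slash policy, and the decision to drop query strings entirely are assumptions for illustration; match them to your own canonical rules before using anything like this in an audit.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_www(host):
    return host[4:] if host.startswith("www.") else host

def canonicalise(url, preferred_host="www.example.com"):
    """Collapse common duplicate-URL variants onto one preferred form.

    Assumes HTTPS, the preferred_host above, and a trailing slash. Query
    strings are dropped wholesale here, which is only safe for parameters
    that do not change the content.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if strip_www(host) == strip_www(preferred_host):
        host = preferred_host
    path = parts.path or "/"
    if not path.endswith("/"):
        path += "/"
    return urlunsplit(("https", host, path, "", ""))

variants = [
    "http://example.com/page",
    "https://www.example.com/page/",
    "https://www.example.com/page?utm_source=newsletter",
]
print({canonicalise(v) for v in variants})  # one URL: https://www.example.com/page/
```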
2. Fix Broken Links and 404 Errors
Every time Googlebot follows a link to a 404 page, it wastes a crawl slot on a URL that returns no value. On large sites, thousands of internal 404s can silently drain crawl budget over time. Use the broken link checker to identify all internal links pointing to 404 responses, then either update the link to point to the correct destination or remove it entirely.
Note: external 404s (broken outbound links) are less of a crawl budget concern for your site, but they signal poor quality to Google, so fixing them has indirect ranking benefits as well.
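For a quick do-it-yourself pass on internal targets, the sketch below takes a list of internal link destinations (the URLs shown are placeholders) and reports which ones return 404. It uses the third-party requests library.

```python
import requests

# Placeholder inventory of internal link targets, e.g. exported from a
# crawl of your own site.
targets = [
    "https://www.example.com/",
    "https://www.example.com/old-product/",
    "https://www.example.com/blog/some-post/",
]

broken = []
for url in targets:
    # HEAD keeps the check lightweight; a few servers mishandle HEAD,
    # in which case fall back to GET.
    resp = requests.head(url, allow_redirects=True, timeout=10)
    if resp.status_code == 404:
        broken.append(url)

print(f"{len(broken)} of {len(targets)} targets return 404")
for url in broken:
    print(" ", url)
```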
3. Block Low-Value URLs via Robots.txt
Not every URL needs to be crawled. Use your robots.txt file to prevent Googlebot from accessing pages that have no business being indexed. Common candidates for blocking include:
- Internal search results pages (`/search?q=...`)
- Faceted navigation and filter combinations in e-commerce
- Admin, login, and account pages
- Shopping cart and checkout flows
- Staging or development subdirectories accidentally accessible in production
- Utility scripts, API endpoints, and JSON feeds
Use the robots.txt generator to build a clean, well-structured robots.txt file. Remember: disallowing a URL in robots.txt prevents crawling but does not prevent indexing if Google discovers the URL from an external link. For pages you want actively excluded from the index, use a noindex meta robots tag instead.
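As a sanity check before deploying new rules, you can test representative URLs against a proposed robots.txt. The sketch below uses Python's built-in urllib.robotparser with example rules of the kind listed above; note that this parser implements only the basic standard and does not understand Google's wildcard extensions, so complex patterns should be verified in Search Console's robots.txt report instead.

```python
from urllib.robotparser import RobotFileParser

# Example rules like those described above; adjust to your own site.
RULES = """\
User-agent: *
Disallow: /search
Disallow: /cart/
Disallow: /checkout/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for path in ["/search?q=shoes", "/cart/", "/products/blue-shoes/"]:
    allowed = parser.can_fetch("Googlebot", "https://www.example.com" + path)
    print(f"{path}: {'crawlable' if allowed else 'blocked'}")
```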
4. Resolve Redirect Chains and Loops
A redirect chain occurs when URL A redirects to URL B, which redirects to URL C, and so on. Each hop in the chain costs Googlebot a crawl request and introduces latency. Chains of three or more redirects are particularly harmful: Google may stop following the chain before reaching the final destination, meaning the target page goes uncrawled. Use the redirect checker to identify chains and loops, then collapse them to a single direct 301 redirect wherever possible.
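Alongside the redirect checker, a few lines of Python (using the requests library) will show every hop for a given URL; more than one intermediate hop means there is a chain worth collapsing. The example URL is a placeholder.

```python
import requests

def trace_redirects(url):
    """Return every hop (status code, URL) from the start URL to the final response."""
    # requests raises TooManyRedirects if the chain turns out to be a loop.
    resp = requests.get(url, allow_redirects=True, timeout=10)
    hops = [(r.status_code, r.url) for r in resp.history]
    hops.append((resp.status_code, resp.url))
    return hops

for status, url in trace_redirects("https://www.example.com/old-page"):
    print(status, url)
```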
5. Improve Server Response Time
Googlebot is time-constrained. A server that responds in 200ms can serve five times as many pages per crawl session as a server that takes 1,000ms per response. Faster server responses directly translate into a higher effective crawl rate limit. Optimise for Time to First Byte (TTFB) through caching, CDN use, database query optimisation, and server hardware upgrades. Aim for a TTFB under 200ms for crawled URLs.
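For a rough measurement from your own machine, the sketch below times how long a URL takes to return its response headers, which is close to TTFB. Network-level tools (curl's -w timings, WebPageTest, your CDN's analytics) are more precise; treat this as a quick sanity check only.

```python
import time
import requests

def approximate_ttfb(url):
    """Approximate time-to-first-byte: time until response headers arrive."""
    start = time.perf_counter()
    # stream=True makes requests return as soon as headers are received,
    # before the body has been downloaded.
    resp = requests.get(url, stream=True, timeout=10)
    elapsed_ms = (time.perf_counter() - start) * 1000
    resp.close()
    return resp.status_code, elapsed_ms

status, ttfb = approximate_ttfb("https://www.example.com/")
print(f"status {status}, approx TTFB {ttfb:.0f} ms")
```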
6. Submit an XML Sitemap
An XML sitemap is a structured list of your most important URLs that you submit to Google via Search Console. It does not override Google's crawl scheduling, but it acts as a strong signal about which URLs you consider canonical and worth crawling. For large sites, use sitemap index files to organise sitemaps by section (products, blog posts, categories). Keep sitemaps clean — only include 200-response, canonical, indexable URLs. A sitemap that lists 404s or noindex pages undermines your credibility with Google.
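If you generate sitemaps yourself rather than relying on a CMS plugin, the structure is simple. The sketch below builds a minimal sitemap from a list of (URL, last-modified) pairs using Python's standard library; the URLs shown are placeholders, and only canonical, indexable, 200-response pages belong in the real file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a minimal XML sitemap from (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = loc
        ET.SubElement(url_el, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

pages = [
    ("https://www.example.com/", "2026-04-01"),
    ("https://www.example.com/products/blue-shoes/", "2026-03-28"),
]
print(build_sitemap(pages))
```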
7. Strengthen Internal Linking to Priority Pages
Googlebot discovers pages primarily by following links. A page that is three or more clicks away from your homepage — sometimes called a "deep" page — may be visited infrequently. Bring important pages closer to the surface by linking to them from high-traffic, well-linked pages. Review your site's internal link structure with the site audit tool and look for orphan pages (pages with no internal links pointing to them) — these are invisible to Googlebot unless submitted in a sitemap.
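Orphan detection is ultimately a set comparison: URLs you know exist (from sitemaps or your CMS) minus URLs that a crawl reached by following internal links. A minimal sketch, with placeholder data standing in for real exports:

```python
# Placeholder inputs: swap in your sitemap export and the URL list from a
# crawl that followed internal links only.
sitemap_urls = {
    "https://www.example.com/",
    "https://www.example.com/products/blue-shoes/",
    "https://www.example.com/guides/crawl-budget/",
}
internally_linked_urls = {
    "https://www.example.com/",
    "https://www.example.com/products/blue-shoes/",
}

orphans = sitemap_urls - internally_linked_urls
for url in sorted(orphans):
    print("orphan (in sitemap, no internal links):", url)
```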
8. Apply Noindex to Low-Value Thin Pages
Noindex instructs Google not to include a page in its index, and over time Google will stop crawling noindexed pages as frequently. This is a useful crawl budget strategy for pages you cannot delete or block via robots.txt (because they may still serve a user purpose) but that you do not want consuming indexing resources. Common candidates include:
- Tag and author archive pages with little unique content
- Paginated pages beyond page 2 (use with care and test impact)
- Thin product variant pages that differ only in colour or size without unique descriptions
- Date-based archive pages on blogs
9. Reduce URL Parameter Sprawl
URL parameters are one of the most common sources of URL bloat. Tracking parameters, sorting options, filtering combinations, session tokens, and A/B testing flags can multiply a 10,000-page catalogue into millions of unique URLs overnight. Google's legacy Search Console URL Parameters tool, which once let you tell Google how to handle specific parameters, was retired in 2022, so parameter sprawl now has to be controlled on-site. Options include:
- Implement canonical tags on parameterised URLs pointing to the clean base URL
- Rewrite parameter URLs to use path-based slugs (`/shoes/blue/` instead of `/shoes?colour=blue`)
- Block pure tracking parameters in robots.txt using wildcard rules
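As an illustration of what "the clean base URL" means in practice, the sketch below strips every query parameter except an assumed whitelist of content-changing ones. The CONTENT_PARAMS set is entirely a placeholder; the point is that each site needs an explicit decision about which parameters genuinely change the page.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Placeholder whitelist of parameters that genuinely change the content.
CONTENT_PARAMS = {"page", "colour"}

def strip_noise_params(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in CONTENT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

url = "https://www.example.com/shoes?utm_source=news&sessionid=abc&colour=blue"
print(strip_noise_params(url))  # https://www.example.com/shoes?colour=blue
```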
10. Remove or Update Stale and Low-Quality Content
Google's quality assessments influence crawl prioritisation. Sites with a high proportion of thin, outdated, or unhelpful pages relative to high-quality content may find their overall crawl rate reduced. Regularly auditing content quality and either improving, consolidating (301 redirect to a better page), or removing low-value pages improves the ratio of useful to wasteful URLs and signals quality to Google. This is sometimes called a "content audit" or "content pruning."
11. Use Hreflang Carefully on Multilingual Sites
Multilingual sites face a unique crawl budget challenge. Each language variant of a page is a separate URL that needs to be crawled. Hreflang annotations must be consistent across all variants — errors in hreflang (such as referencing non-existent URLs or creating loops) cause Google to crawl additional URLs to resolve inconsistencies, wasting budget. Audit hreflang implementation as part of any international SEO crawl budget review.
12. Leverage HTTP/2
HTTP/2 allows multiple requests to be multiplexed over a single connection, which is more efficient than the one-request-per-connection model of HTTP/1.1. Googlebot has supported crawling over HTTP/2 since late 2020. Upgrading does not by itself make Google want to crawl more, but it reduces connection overhead for both Googlebot and your server, which frees up capacity within the existing crawl rate limit and lowers the cost of each crawl session.
Monitoring Crawl Stats in Google Search Console
Google Search Console provides a dedicated Crawl Stats report (found under Settings > Crawl Stats) that shows Googlebot's activity on your site over the last 90 days. Understanding how to read this report is central to any crawl budget SEO workflow.
Key Metrics in the Crawl Stats Report
| Metric | What It Means | What to Look For |
|---|---|---|
| Total crawl requests | Number of fetch requests Googlebot made to your server in the period | Sudden drops may indicate server issues; spikes may indicate a crawl of new content |
| Total download size | Total bytes transferred to Googlebot | Unusually high values may indicate large unoptimised pages being crawled |
| Average response time | Mean time for your server to respond to Googlebot's requests | Keep below 500ms; values above 1,000ms suggest crawl rate limit is being suppressed |
| Crawl requests by response | Breakdown of responses: 200, 301, 302, 404, 429, 5xx etc. | High 404 or 5xx rates indicate wasted budget and potential crawl suppression |
| Crawl requests by file type | HTML, CSS, JS, image, font, etc. | If JS/CSS is taking a disproportionate share, consider whether all assets need to be accessible |
| Crawl requests by purpose | Discovery, refresh, sitemap-triggered | A high proportion of "discovery" crawls suggests Google is still exploring new URLs; "refresh" heavy means it is revisiting known pages |
The Crawl Stats report also breaks down crawl activity by host (useful for sites with subdomains) and by Googlebot type. Check it monthly on medium-sized sites and weekly on large or rapidly-changing sites.
Using URL Inspection for Specific Pages
The URL Inspection tool in Search Console lets you check when a specific URL was last crawled, what HTTP response code was returned, whether it is indexed, and what the rendered HTML looked like. This is invaluable for diagnosing why a specific important page is not appearing in search results despite being linked internally and listed in a sitemap.
Server Log Analysis for Crawl Budget
Search Console's Crawl Stats report is useful but incomplete. It shows you what Google is crawling at a high level, but it does not let you cross-reference crawl activity against your URL inventory at scale. For a complete picture, you need server log analysis.
What Are Server Logs?
Every request to your web server is recorded in an access log. Each entry includes the date and time, the URL requested, the HTTP method (GET, HEAD, POST), the response code, the bytes sent, the referrer, and the user agent. By filtering log entries where the user agent contains "Googlebot", you get a precise record of every URL Google has crawled, when, and what response it received.
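As a starting point, the sketch below filters Googlebot entries out of a standard combined-format access log (the filename and the exact log format are assumptions; adjust the regular expression to your server's configuration). Remember that the user agent string can be spoofed, so a rigorous analysis should also verify hits by reverse DNS or against Google's published crawler IP ranges.

```python
import re
from collections import Counter

# Matches a typical Apache/Nginx "combined" log line; real formats vary.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

googlebot_hits = []
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("agent"):
            googlebot_hits.append((m.group("path"), int(m.group("status"))))

print(len(googlebot_hits), "Googlebot requests")
print(Counter(status for _, status in googlebot_hits).most_common())
```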
What to Look for in Crawl Log Analysis
- URL distribution: What proportion of crawl budget is being spent on which sections of the site? Are product pages getting crawled or is Googlebot stuck in a faceted navigation loop?
- Response code distribution: What percentage of Googlebot requests result in 200, 301, 404, or 5xx? A high error rate is a direct crawl budget drain.
- Crawl frequency by URL: Are your most important pages being crawled frequently and your low-value pages rarely? If it is the other way around, something is wrong with your site architecture or internal linking.
- Bot vs user traffic ratio: On some sites, Googlebot generates more requests than human users. This is not inherently bad, but it is worth understanding whether bot traffic is causing server load issues.
- Crawl gaps: Are there important URLs that have not been crawled in weeks or months despite being live and linked?
Tools like Screaming Frog Log File Analyser, Splunk, ELK Stack, or even a well-structured spreadsheet pivot table can help process log data at scale. For a quick cross-check, run a site audit alongside your log analysis to see which URLs exist but have never been crawled.
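If you have already isolated Googlebot entries (for example with the filtering sketch earlier in this section), the first two questions above reduce to simple counting. A minimal aggregation by site section and by response code, using placeholder data:

```python
from collections import Counter
from urllib.parse import urlsplit

# Placeholder (path, status) pairs; in practice this is the output of the
# Googlebot log filter shown earlier.
googlebot_hits = [
    ("/products/blue-shoes/", 200),
    ("/products/red-shoes/?sort=price", 200),
    ("/search?q=shoes", 200),
    ("/old-category/", 404),
]

def section(path):
    # First path segment: "/products/blue-shoes/" -> "products"
    segments = [s for s in urlsplit(path).path.split("/") if s]
    return segments[0] if segments else "(root)"

print("by section:", Counter(section(p) for p, _ in googlebot_hits).most_common())
print("by status:", Counter(s for _, s in googlebot_hits).most_common())
```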
Crawl Budget and JavaScript Rendering
JavaScript-heavy sites introduce a second layer of complexity to crawl budget that many SEOs overlook. Google processes URLs in two phases: first a fast HTML fetch, then a separate rendering step in which the JavaScript is executed. These two phases do not always happen together, and the rendering queue can add a further delay before JavaScript-dependent content and links are processed, particularly on large sites.
How JavaScript Affects Crawl Budget
- Rendering is expensive: Executing JavaScript requires significantly more computational resources than parsing static HTML. Google throttles rendering to manage costs, which means a JavaScript-rendered site effectively has a lower crawl and indexing throughput than an equivalent server-side-rendered site.
- Links in JS may not be discovered immediately: If your navigation or internal links are injected by JavaScript, they may not be discovered on the initial HTML fetch. This means new pages linked only through JavaScript can sit in a crawl gap until the rendering queue catches up.
- Google may not execute all JavaScript: If your JavaScript relies on APIs, third-party scripts, or complex interactions to render content, there is a risk that Google's rendering environment does not execute it correctly, resulting in incomplete page rendering and missed content.
Recommendations for JavaScript Sites
- Use server-side rendering (SSR) or static site generation (SSG) for content that needs to be crawled quickly and reliably
- Ensure critical navigation links are in the initial HTML payload, not injected by JavaScript (a quick check is sketched after this list)
- Use dynamic rendering as a stopgap: serve pre-rendered HTML to crawlers and full JavaScript to users
- Test how Google renders your pages using the URL Inspection tool's "Test Live URL" feature
- Monitor the Coverage report in Search Console for "Crawled — currently not indexed" status, which can indicate rendering failures
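For the navigation-links point above, a crude but useful check is whether your critical internal links appear in the raw HTML before any JavaScript runs. The sketch below fetches a page with the requests library and looks for assumed-critical paths with a simple regular expression; a proper audit would compare a rendered crawl against a raw-HTML crawl instead.

```python
import re
import requests

# Placeholder list of links that should be reachable without executing
# JavaScript, e.g. main navigation and key category pages.
CRITICAL_LINKS = ["/products/", "/guides/crawl-budget/"]

html = requests.get("https://www.example.com/", timeout=10).text
hrefs = set(re.findall(r'href="([^"]+)"', html))

for link in CRITICAL_LINKS:
    present = link in hrefs
    print(f"{link}: {'present in raw HTML' if present else 'MISSING before rendering'}")
```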
Crawl Budget and Redirects
Redirects have a direct and often underestimated impact on crawl budget. Every redirect requires at least two HTTP requests: one to the original URL and one to the destination. Chains require even more. On a site with hundreds of thousands of redirected URLs, this overhead can consume a substantial portion of your daily crawl allowance.
Types of Redirects and Their Impact
| Redirect Type | Crawl Budget Impact | Recommendation |
|---|---|---|
| 301 Permanent (single hop) | Low — Google follows the redirect and updates its index, eventually reducing future crawl of the old URL | Acceptable; collapse chains to single 301s |
| 302 Temporary | Medium — Google may continue crawling both the source and destination indefinitely since it treats the original as still valid | Only use when the redirect is genuinely temporary; switch to 301 for permanent moves |
| Redirect chain (3+ hops) | High — each hop costs a crawl request; Google may abandon the chain before reaching the destination | Collapse to a direct redirect; use the redirect checker to audit |
| Redirect loop | Critical — Googlebot follows the loop a few times, wastes budget, and eventually gives up on the URL | Fix immediately; loops show up in the Crawl Stats report as high crawl volume with no indexing |
| Meta refresh redirect | High — meta refresh is not treated as efficiently as server-side redirects; introduces additional rendering overhead | Replace with server-side 301 wherever possible |
After any large site migration or URL restructuring, audit your redirect map thoroughly. Check that all old URLs redirect directly to their new equivalents in a single hop, and that no new chains have been created by redirecting to pages that themselves redirect.
Common Crawl Budget Mistakes
Even experienced SEOs make crawl budget mistakes. The following are the most common errors seen across large sites:
1. Listing Non-Canonical URLs in Sitemaps
Including URLs in a sitemap that carry a canonical tag pointing to a different URL sends a conflicting signal. Google sees the sitemap listing as "I want this crawled and indexed" and the canonical as "the canonical version is elsewhere." In practice Google usually honours the canonical, but the sitemap listing wastes a crawl request each time Google checks it. Keep sitemaps clean: only list canonical, indexable, 200-response URLs.
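A lightweight hygiene check along these lines can be scripted. The sketch below fetches a sitemap (a regular URL sitemap, not a sitemap index), then flags listed URLs that redirect, return a non-200 status, or appear to declare a canonical pointing elsewhere. The sitemap URL is a placeholder and the canonical check is a deliberately crude string match; a production audit should parse the HTML properly and respect crawl politeness on large sitemaps.

```python
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_xml = requests.get("https://www.example.com/sitemap.xml", timeout=10).text
locs = [el.text.strip() for el in ET.fromstring(sitemap_xml).findall("sm:url/sm:loc", NS)]

for loc in locs:
    resp = requests.get(loc, timeout=10)
    problems = []
    if resp.history:
        problems.append(f"redirects to {resp.url}")
    if resp.status_code != 200:
        problems.append(f"returns {resp.status_code}")
    # Crude string check for a canonical that points somewhere else.
    if 'rel="canonical"' in resp.text and f'href="{loc}"' not in resp.text:
        problems.append("canonical may point to a different URL")
    print(loc, "OK" if not problems else "; ".join(problems))
```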
2. Blocking Pages in Robots.txt That Are Linked Internally
If a page is blocked by robots.txt but still linked from crawlable pages, Googlebot sees the URL in those links but never fetches it. Google has stated that robots.txt-disallowed URLs do not consume crawl budget, so the real cost here is different: internal links keep pointing at a page Google can never read, the signals they pass are wasted, and the blocked URL can still end up indexed (without its content) if enough links reference it. Remove internal links to robots.txt-blocked pages, or use a noindex tag instead of robots.txt for pages that should not be indexed but still need to be accessible to users.
3. Ignoring Soft 404s
A soft 404 is a page that returns a 200 HTTP response code but displays "not found" or similar empty content. Because the server says 200, Google crawls the page, renders it, attempts to index it, and then (if its algorithms identify the content as empty or near-duplicate) marks it as a soft 404 in Search Console. This wastes crawl and rendering budget. Fix soft 404s by returning a genuine 404 or 410 response code, or by populating the page with useful content.
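Search Console's indexing (Coverage) report is the authoritative place to find soft 404s, but a rough first pass can be scripted. The heuristic below flags 200 responses whose body looks like an error page; the phrase list and the size threshold are pure assumptions and should be tuned to your own templates.

```python
import requests

# Assumed phrases and size threshold; tune both for your own templates.
NOT_FOUND_PHRASES = ("page not found", "no longer available", "0 results")

def looks_like_soft_404(url):
    """Heuristic only: a 200 response whose body resembles an error page."""
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return False  # a real 404 or 410 is the correct behaviour
    text = resp.text.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES) or len(text) < 1024

print(looks_like_soft_404("https://www.example.com/discontinued-product/"))
```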
4. Allowing Infinite Scroll Without Pagination Fallback
Infinite scroll implemented purely in JavaScript without a corresponding paginated HTML fallback means Googlebot cannot discover content below the initial viewport. The content exists but is invisible to the crawler. Implement a paginated fallback, or a "load more" pattern in which each additional batch of content is also reachable at its own crawlable URL.
5. Ignoring Crawl Budget After a Site Migration
Site migrations (URL changes, CMS migrations, domain moves) typically generate thousands of redirects and temporarily increase the number of URLs Google needs to process. Failing to monitor crawl stats in the weeks after a migration often leads to situations where Google is burning its entire daily budget following redirect chains from old URLs, leaving new canonical URLs partially uncrawled. Plan a post-migration crawl audit as part of every migration project.
6. Over-Relying on Sitemap Submissions
Submitting a sitemap does not guarantee crawling. On large sites, Google may crawl only a fraction of sitemap-listed URLs each day. If your key pages are buried in a 500,000-URL sitemap alongside thousands of low-value URLs, the signal-to-noise ratio is poor. Use multiple focused sitemaps segmented by content type, and ensure the most important pages are also well-linked internally.
7. Leaving Parameter-Based Duplicates Unaddressed
UTM parameters, sorting parameters, and A/B testing parameters that append to URLs and create duplicate content are extremely common. Left unaddressed, a site with 10,000 real pages can present Googlebot with 500,000+ parameter variants. Use canonical tags on parameterised pages, keep internal links pointing at clean parameter-free URLs, and block pure tracking parameters in robots.txt (Google's legacy URL Parameters tool is no longer available).
Crawl Budget, Indexing, and Ranking: The Full Chain
It is worth stepping back to see crawl budget in the context of the full search pipeline:
- Crawling: Googlebot fetches your URL and downloads the HTML (and later renders JavaScript)
- Processing: Google parses the page, extracts links, content, signals, and metadata
- Indexing: Google decides whether to add the page to its index, and if so, what signals to associate with it
- Ranking: When a user runs a query, Google retrieves candidate pages from the index and ranks them by relevance, quality, and other signals
Crawl budget optimisation affects Step 1. It cannot directly improve rankings, but it removes the bottleneck that prevents important pages from entering the pipeline at all. A page that is never crawled cannot rank, regardless of its quality. This is why crawl budget SEO is rightly considered a foundational technical discipline, particularly for large sites.
For a broader view of how crawl budget fits into technical SEO, see the SEO Glossary or run a full site audit to identify all technical issues on your site simultaneously.
Frequently Asked Questions About Crawl Budget
Does crawl budget affect rankings directly?
Not directly. Crawl budget affects whether your pages are crawled and indexed, which is a prerequisite for ranking. If your important pages are crawled and indexed, fixing crawl budget issues will not change their ranking position. But if key pages are being missed because budget is wasted on low-value URLs, resolving crawl budget issues will indirectly improve rankings by getting those pages into the index.
How do I find my site's crawl budget?
Google does not publish a specific crawl budget number for your site. The best proxy is the Crawl Stats report in Google Search Console (Settings > Crawl Stats), which shows crawl activity over 90 days. Divide the total crawl requests by the number of days to get an average daily crawl rate. Compare this to your total indexable URL count to understand whether your budget covers the full site.
Can I increase my crawl budget?
You cannot directly request more crawl budget from Google. The most effective ways to increase effective crawl budget are: improve server response times (raises the crawl rate limit), improve content quality and earn more links (raises crawl demand), and reduce the number of low-value URLs competing for the budget (redirects more of the existing budget to important pages).
Does using a CDN help crawl budget?
Yes, indirectly. A CDN reduces server response times and increases availability, which raises the crawl rate limit Google is willing to apply. It also reduces the risk of 5xx errors during traffic spikes. These improvements allow Googlebot to crawl more pages per session without harming server performance.
Should I disallow CSS and JavaScript in robots.txt?
No. This was a common practice years ago to reduce crawl load, but Google now needs to access CSS and JavaScript files to render pages accurately. Blocking them causes Googlebot to see an incomplete version of your page, which can lead to indexing errors and ranking penalties. Allow Googlebot to access all CSS and JavaScript files unless there is a specific reason not to.
How does crawl budget relate to robots.txt?
Your robots.txt file controls which URLs Googlebot is allowed to crawl. Disallowing low-value URLs saves crawl budget by preventing Googlebot from even attempting those fetches. However, disallowed URLs can still appear in the index if they are linked from external sites; Google simply cannot crawl their content. For pages you want out of the index, use a noindex meta robots tag instead, and remember that the page must remain crawlable for Google to see the tag.
What is the difference between crawl budget and index budget?
Crawl budget is the number of pages Google fetches from your server. Index budget is a less formally defined concept referring to Google's willingness to index pages from your site. A page can be crawled but not indexed (for example, if it is thin content, a duplicate, or explicitly marked noindex). Crawl budget is a prerequisite for indexing, but they are not the same thing. Optimise crawl budget to ensure pages are fetched; optimise content quality to ensure fetched pages are indexed.
Do 301 redirects waste crawl budget?
Single-hop 301 redirects have a small crawl budget cost (one request to the source URL, one to the destination) but over time Google updates its index to point directly to the destination and stops crawling the redirect source as frequently. The bigger crawl budget problem is redirect chains (multiple hops) and keeping internal links pointing to redirected URLs rather than updating them to point directly to the final destination. Use the redirect checker to audit all redirect chains on your site.
How does crawl budget work with e-commerce faceted navigation?
Faceted navigation (filter combinations like colour + size + brand on a category page) is the most common cause of URL bloat on e-commerce sites. A category with 5 colours, 5 sizes, and 5 brands can generate 125 filter combinations, each as a unique URL. Multiply this across hundreds of categories and you can have millions of URLs. The standard approach is to use canonical tags on all filtered URLs pointing back to the unfiltered category page, and to block filter URL patterns in robots.txt. For large stores, this is typically the highest-impact single crawl budget intervention available.
What HTTP response code is best for pages I want to remove from Google's index?
If a page is permanently gone, use a 410 (Gone) response code rather than a 404 (Not Found). Google treats 410 as a stronger signal that the content will not return, and tends to remove 410 pages from its index faster than 404 pages. If you want to redirect users and crawlers to related content, use a 301 redirect to the most relevant live page instead.
Does page speed affect crawl budget?
Yes, significantly. Googlebot has a time budget as well as a URL budget. A server that responds in 200ms can deliver five pages in the time a 1,000ms server delivers one. Improving page speed — particularly Time to First Byte (TTFB) for crawled HTML pages — directly increases the effective number of pages Google can fetch per crawl session. Core Web Vitals improvements that reduce rendering complexity also help the rendering queue process your pages faster after the initial fetch.
How often does Googlebot crawl a site?
Crawl frequency varies enormously by site and by individual page. A major news homepage may be crawled hundreds of times a day. A product page on a mid-size e-commerce site may be crawled once a week. An orphan page on a large site may not be crawled for months. Crawl frequency is driven by popularity, freshness signals, and the overall crawl budget allocation for the site. Monitor the Crawl Stats report over time to understand your site's crawl cadence.
Can structured data help with crawl budget?
Structured data does not directly affect crawl budget allocation, but it can improve how Google understands and values your pages, which may influence crawl prioritisation over time. More directly, correct structured data reduces the chance that pages are classified as thin or low-value, which helps maintain crawl frequency for those pages. Use the schema generator to create valid structured data markup.
Last updated: April 2026