What Is Robots.txt?
A robots.txt file is a plain-text file placed at the root of a website — for example, https://www.yoursite.com/robots.txt — that tells automated web crawlers which parts of the site they are and are not permitted to access. The file is part of the Robots Exclusion Protocol (REP), a voluntary standard that has been in use since 1994 and remains one of the most fundamental technical SEO instruments available to webmasters.
Every major search engine crawler — Googlebot, Bingbot, DuckDuckBot, YandexBot, Baiduspider, and hundreds of others — checks a site's robots.txt file before crawling any page. The file acts like a doorman for your entire website, greeting bots at the entrance and directing them toward or away from specific sections. Understanding what robots.txt is, what it can do and, critically, what it cannot do is essential knowledge for anyone working in SEO.
The robots.txt file is deceptively simple in its syntax, yet it influences two deeply important technical SEO concerns: crawl efficiency and crawl budget. Large websites with hundreds of thousands of URLs often rely on carefully constructed robots.txt rules to ensure search engines spend their crawl budget on the URLs that matter, rather than wasting requests on internal search results, session-based parameters, or duplicate content. You can learn more about this concept in our guide on what is crawl budget.
An important caveat up front: robots.txt controls crawling, not indexing. To keep a page out of search results entirely, use a noindex meta tag or the X-Robots-Tag HTTP header instead.
The History of Robots.txt and the Robots Exclusion Protocol
The Robots Exclusion Protocol (also called the Robots Exclusion Standard) was proposed in June 1994 by Martijn Koster as an informal agreement between webmasters and the operators of early web crawlers. The original motivation was simple: automated crawlers were hitting web servers hard, and site owners needed a way to communicate which pages should be left alone.
The protocol remained an informal standard for over two decades. In 2019, Google, along with other search engine companies, submitted a proposal to the Internet Engineering Task Force (IETF) to formalise the standard. In September 2022, RFC 9309 was published, making the Robots Exclusion Protocol an official internet standard. This formalisation clarified previously ambiguous behaviour around directive matching, wildcard support, and precedence rules.
Despite its age, robots.txt remains the primary mechanism for crawl control. It predates sitemaps, the meta robots tag, and the X-Robots-Tag header — all of which serve related but distinct purposes, as we will discuss later in this guide.
How Robots.txt Works
When a search engine crawler visits your domain for the first time, or when its crawl schedule prompts a revisit, the very first request it makes is to /robots.txt. This happens before any individual page is crawled. The crawler downloads the file, parses the rules, and uses those rules to decide which URLs to queue for crawling and which to ignore.
The processing sequence works as follows:
- The crawler requests https://www.yoursite.com/robots.txt.
- If the server returns a 200 OK response, the crawler reads the file and follows the rules that apply to its specific user-agent.
- If the server returns a 404 Not Found, the crawler assumes there are no restrictions and proceeds to crawl freely.
- If the server returns a 5xx server error, the crawler treats this as a temporary block and will not crawl the site until the file becomes accessible again. This is an important reason to ensure your robots.txt never throws a 500 error.
- If the server returns a 301 or 302 redirect, Google will follow the redirect up to five hops. A redirect chain to robots.txt is bad practice — always serve it directly.
Google caches the robots.txt file and re-fetches it periodically (typically every 24 hours, though this varies). Changes to your robots.txt may not be reflected immediately in crawler behaviour. If you need Googlebot to re-check your file urgently, you can request a re-crawl in Google Search Console.
One critical point deserves repeating: robots.txt controls crawling, not indexing. If Googlebot has previously crawled a page and it is in the index, adding a Disallow rule for that URL will stop future crawls but will not remove it from the index. Google may still show the page in search results based on signals it has already gathered — including anchor text from external links. For full indexing control, leave the page crawlable and apply a noindex directive so Google can actually read the instruction.
Robots.txt Syntax Rules
The robots.txt file must follow a specific syntax. Errors in formatting are common and can cause entire rule blocks to be silently ignored by crawlers. Here are the core rules:
- The file must be served as a plain text file with the MIME type text/plain.
- It must be located at the root of the domain: /robots.txt. You cannot place it in a subdirectory.
- Each directive is on its own line. A directive consists of a field name, a colon, a space, and a value.
- Lines beginning with # are comments and are ignored by crawlers. Use them to annotate your file.
- Directive names are case-insensitive (disallow and Disallow are equivalent). Path values are case-sensitive on case-sensitive servers (most Linux-based web servers).
- A blank line separates groups of rules. Each group starts with one or more User-agent lines and is followed by Allow and Disallow directives.
- There is no official maximum file size, but Google will only process the first 500 kibibytes (512,000 bytes) of a robots.txt file. Keep your file well under this limit.
- UTF-8 encoding is recommended. Non-ASCII characters in path values may behave unexpectedly across different crawlers.
Here is a minimal but valid robots.txt file:
```
# Allow all crawlers to access everything
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
And here is an example with comments and multiple groups:
```
# Block all bots from admin and staging areas
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /search?
Disallow: /checkout/
Allow: /

# Block only Bingbot from a specific section
User-agent: Bingbot
Disallow: /members/

Sitemap: https://www.example.com/sitemap.xml
```
All Robots.txt Directives Explained
User-agent
The User-agent directive specifies which crawler or group of crawlers the following rules apply to. It is always the first directive in a rule group. A single group can target one specific bot or multiple bots (by repeating the User-agent line), and a wildcard (*) targets all crawlers that are not otherwise named.
Common user-agent names include:
| User-agent Name | Crawler | Search Engine |
|---|---|---|
| Googlebot | Main web crawler | Google |
| Googlebot-Image | Image crawler | Google |
| Googlebot-Video | Video crawler | Google |
| Bingbot | Main web crawler | Microsoft Bing |
| DuckDuckBot | Main web crawler | DuckDuckGo |
| Slurp | Main web crawler | Yahoo |
| YandexBot | Main web crawler | Yandex |
| Baiduspider | Main web crawler | Baidu |
| * | All crawlers (wildcard) | Any |
When a crawler evaluates your robots.txt file, it first looks for a group that explicitly names its user-agent. If a specific group exists, the crawler uses those rules and ignores all other groups, including the wildcard group. The wildcard group only applies to crawlers that have no explicitly named group.
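For instance, in the sketch below (the paths are hypothetical), Googlebot matches its own named group and ignores the wildcard group entirely, so it may still crawl /beta/ even though the wildcard group blocks it:

```
# Googlebot reads only this group
User-agent: Googlebot
Disallow: /drafts/

# All other crawlers read this group
User-agent: *
Disallow: /drafts/
Disallow: /beta/
```

If you want a named crawler to obey the general restrictions as well, repeat those rules inside its own group; groups are not inherited or merged across different user-agent names.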
Crawlers that are not named by any group fall back to the User-agent: * group.
Disallow
The Disallow directive instructs the crawler not to access any URL that begins with the specified path. It is the most commonly used directive in the robots.txt file.
- Disallow: /admin/ — blocks all URLs under /admin/, such as /admin/dashboard and /admin/users/edit.
- Disallow: / — blocks the entire website. Use with extreme caution — only appropriate for staging environments or sites under maintenance.
- Disallow: (empty value) — allows full access; equivalent to having no Disallow rule at all.
- Disallow: /page.html — blocks a single specific page.
Matching is done by prefix. Disallow: /news will block /news, /newsletter, /news/article-1, and /news-archive — anything that begins with /news. If you only want to block the /news/ directory, use the trailing slash: Disallow: /news/.
Allow
The Allow directive is used to override a broader Disallow rule for a specific path. It is most useful when you want to block an entire directory but make an exception for one or more paths within it.
```
User-agent: *
Disallow: /private/
Allow: /private/press-releases/
```
In this example, everything under /private/ is blocked except /private/press-releases/, which is explicitly allowed. When a URL matches both an Allow and a Disallow rule, the more specific (longer) path takes precedence. If both paths are the same length, the Allow rule wins.
A related technique is Allow: /$, which explicitly permits crawling of the homepage itself (and nothing else) even when broad Disallow rules exist; a bare Allow: / simply restates the default of allowing everything not otherwise blocked.
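To make the longest-match rule concrete, the end-of-line comments below (comments are valid anywhere in robots.txt) count the pattern lengths for the example above:

```
User-agent: *
Disallow: /private/                # 9 characters
Allow: /private/press-releases/    # 24 characters, longer and more specific, so it wins
```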
Sitemap
The Sitemap directive tells crawlers where to find your XML sitemap. It is not strictly a rule — it is informational — but it is one of the most valuable lines you can include in a robots.txt file. You can specify multiple sitemap lines if you use a sitemap index or have separate sitemaps for different content types.
```
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-images.xml
Sitemap: https://www.example.com/sitemap-news.xml
```
The Sitemap directive is supported by Google, Bing, Yahoo, and Ask. It must contain a fully qualified absolute URL (including the protocol). Placing your sitemap URL in robots.txt supplements — but does not replace — submitting it in Google Search Console and Bing Webmaster Tools. Learn more about how sitemaps work in our guide on what is a sitemap.
Crawl-delay
The Crawl-delay directive requests that the crawler wait a specified number of seconds between consecutive requests to the server. This is intended to prevent aggressive crawlers from overloading web servers.
```
User-agent: *
Crawl-delay: 10
```
There are important limitations to be aware of. Google does not honour the Crawl-delay directive at all, and the crawl rate limiter that used to live in Google Search Console has been retired; Googlebot adjusts its crawl rate automatically based on how your server responds. Bing, Yandex, and some other crawlers do respect Crawl-delay. For most websites, Crawl-delay is not necessary — only add it if your server is genuinely struggling under crawler load.
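Because Google ignores the directive, it is usually better to scope any Crawl-delay to the specific crawlers that honour it rather than the wildcard group; a minimal sketch, with an arbitrary delay value:

```
# Googlebot ignores Crawl-delay, so target crawlers that respect it
User-agent: Bingbot
Crawl-delay: 5
```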
Host
The Host directive was introduced by Yandex to specify the preferred domain of a website when multiple domains resolve to the same content (for example, both www.example.com and example.com). It is a Yandex-specific directive and is not recognised by Google or Bing, which rely on redirects and canonical tags instead.
```
Host: www.example.com
```
Yandex has since deprecated the Host directive in favour of 301 redirects and its own webmaster tools settings, so for most websites you can safely omit it.
Wildcard Patterns in Robots.txt
Google and Bing both support two wildcard characters in robots.txt path values. Not all crawlers support wildcards, but for SEO purposes the two that matter most do.
Asterisk Wildcard (*)
The asterisk (*) matches any sequence of zero or more characters. It can appear anywhere in the path value.
| Pattern | What It Blocks |
|---|---|
| Disallow: /*? | All URLs containing a query string parameter (the ? character) |
| Disallow: /search/* | All URLs beginning with /search/ followed by anything |
| Disallow: /*.pdf$ | All PDF files (using both wildcards) |
| Disallow: /category/*/page/ | Paginated category pages across all category names |
End-of-String Anchor ($)
The dollar sign ($) at the end of a path value anchors the match to the end of the URL. Without it, path matching is prefix-based. With it, the rule only applies if the URL ends exactly with the specified string.
```
# Block only URLs ending in .pdf (not /pdf/ as a directory)
Disallow: /*.pdf$

# Block the /search page itself but not /search/ subdirectories
Disallow: /search$
```
Combining both wildcards gives you granular control over URL pattern matching without having to list every individual URL. This is particularly useful for blocking faceted navigation, session IDs, printer-friendly pages, and other dynamically generated URL patterns that can cause crawl budget waste.
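As a sketch of how these patterns combine in practice, the block below targets some typical faceted-navigation and session parameters; the parameter names (sort, filter, sessionid) are illustrative and should be replaced with whatever your platform actually generates:

```
User-agent: *
# Faceted navigation and sorting parameters
Disallow: /*?*sort=
Disallow: /*?*filter=
# Session IDs wherever they appear in the URL
Disallow: /*sessionid=
# Printer-friendly versions of any page
Disallow: /*/print$
```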
Common Robots.txt Examples
Allow Everything (Default Open Site)
```
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
Standard E-commerce Site
```
User-agent: *
Allow: /

# Block internal search, filters, and cart
Disallow: /search
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Block admin and system files
Disallow: /admin/
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-products.xml
```
Block a Specific Bot Entirely
```
# Block a scraping bot
User-agent: BadBot
Disallow: /

# Allow all other bots
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
Staging or Development Site (Block All)
```
# Staging environment - do not index
User-agent: *
Disallow: /
```
News Site with Images
```
User-agent: *
Allow: /
Disallow: /archive/
Disallow: /tag/
Disallow: /author/

# Allow Google Image bot to access everything
User-agent: Googlebot-Image
Allow: /

Sitemap: https://www.example.com/news-sitemap.xml
Sitemap: https://www.example.com/sitemap.xml
```
Robots.txt for WordPress
WordPress has special considerations when it comes to the robots.txt file for SEO. WordPress does not create a physical robots.txt file by default — instead it generates a virtual one dynamically. This virtual file has some sensible defaults, but many SEO plugins extend or override it.
The default WordPress-generated robots.txt looks something like this:
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/wp-sitemap.xml
```
The Allow: /wp-admin/admin-ajax.php line is important — this endpoint powers many front-end AJAX requests and blocking it can cause rendering issues. WordPress plugins such as Yoast SEO, Rank Math, and All in One SEO all provide interfaces to edit your robots.txt without needing to create the physical file manually. Key WordPress-specific paths you often want to block include:
- /wp-admin/ — the admin panel (block this, but keep admin-ajax.php allowed)
- /wp-login.php — the login page (blocking it from crawlers is fine; it cannot be indexed anyway)
- /?s= or /search/ — WordPress search result pages, which are typically low-quality duplicate content
- /feed/ — RSS feeds, though some sites choose to leave these crawlable
- /?author= — author archive pages if you do not want them indexed
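Pulling those paths together, a minimal sketch of an extended WordPress robots.txt might look like the following; treat it as a starting point to adapt for your theme and plugins rather than a drop-in file (the sitemap URL is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/
Disallow: /?author=
# Optional, since some sites prefer to keep feeds crawlable:
# Disallow: /feed/

Sitemap: https://www.example.com/wp-sitemap.xml
```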
If you want to create a physical robots.txt file in WordPress (which takes precedence over the dynamic one), simply upload a robots.txt file to the root of your server — the same directory that contains wp-config.php. Use our robots.txt generator to create a correctly formatted file.
Robots.txt for Shopify
Shopify has its own approach to robots.txt that differs significantly from WordPress. Until 2021, Shopify merchants had no way to customise their robots.txt file — it was entirely controlled by the platform. Shopify now allows merchants to edit robots.txt through a robots.txt.liquid template in their theme.
The default Shopify robots.txt blocks a sensible set of paths out of the box, including:
- /admin — the Shopify admin panel
- /cart — the shopping cart page
- /checkout — the checkout flow
- /orders — customer order pages
- /cgi-bin — legacy system paths
- *login*, *password* — login and password-protected pages
To customise your Shopify robots.txt, navigate to Online Store > Themes > Edit Code, and add or edit the robots.txt.liquid template. Common Shopify-specific customisations include blocking faceted navigation parameters (e.g. ?sort_by=, ?filter.p.m.*) and collection+tag combinations that generate large numbers of duplicate or thin pages.
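A minimal sketch of a robots.txt.liquid customisation, based on the Liquid objects Shopify documents for this template (robots.default_groups, group.user_agent, group.rules, group.sitemap); verify the exact syntax against Shopify's current documentation before deploying:

```liquid
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules %}
    {{ rule }}
  {%- endfor %}
  {%- comment -%} Append a custom rule to the wildcard group only {%- endcomment -%}
  {%- if group.user_agent.value == '*' %}
    {{ 'Disallow: /*?sort_by=*' }}
  {%- endif %}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif %}
{% endfor %}
```

Looping over the default groups preserves Shopify's built-in rules and only appends the extra Disallow line, which is generally safer than replacing the template output wholesale.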
Shopify generates your XML sitemap automatically, and the default robots.txt already references it at /sitemap.xml. You do not need to add it manually unless you have additional sitemaps.
Robots.txt vs Meta Robots vs X-Robots-Tag
One of the most common points of confusion in technical SEO is understanding the difference between the three main mechanisms for controlling how search engines crawl and index your site. Each operates at a different level and controls different things.
| Mechanism | Location | Controls Crawling | Controls Indexing | Page-Level |
|---|---|---|---|---|
| robots.txt | Root of domain | Yes | No (indirectly) | No (path-based) |
| Meta robots tag | HTML <head> | No | Yes | Yes |
| X-Robots-Tag | HTTP header | No | Yes | Yes (incl. non-HTML) |
Meta Robots Tag
The meta robots tag is an HTML element placed in the <head> section of a page. Common values include noindex (do not add this page to the search index), nofollow (do not follow links on this page), noarchive (do not show a cached version), and nosnippet (do not show a text snippet in search results). Because the tag lives on the page itself, a crawler must be able to access and render the page to read it — which is why blocking a page in robots.txt while relying on a meta noindex tag is a self-defeating strategy. You can check your pages' robots directives with our robots directives checker.
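For reference, a typical meta robots tag; it sits in the page's <head> and combines directive values as needed:

```html
<!-- Keep this page out of the index but still follow its links -->
<meta name="robots" content="noindex, follow">
```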
X-Robots-Tag
The X-Robots-Tag is an HTTP response header that serves the same purpose as the meta robots tag but is set at the server level rather than in the HTML. Its key advantage is that it works for non-HTML files — PDFs, images, and other file types that cannot contain an HTML <head> element. It supports the same directives as the meta robots tag.
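As an illustration, the header appears in the HTTP response like this:

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive
```

One common way to set it for every PDF on an Apache server is a FilesMatch block (this assumes mod_headers is enabled; nginx offers an equivalent add_header directive):

```
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
```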
When to Use Which
- Use robots.txt to stop crawlers from accessing pages you do not want crawled at all — admin panels, internal search results, duplicate parameter URLs, and staging areas.
- Use the meta robots noindex tag on pages you want crawlers to access but not include in search results — thank-you pages, login pages, paginated pages beyond page one, and low-value tag archives.
- Use X-Robots-Tag when you need indexing control on non-HTML resources, or when you want to apply directives site-wide without modifying individual page templates.
- Never rely on robots.txt alone to keep a page out of the index. A crawlable page carrying noindex will be removed from Google; a page that is only blocked in robots.txt may still appear in search results with limited information, and Google cannot read a noindex tag on a page it is not allowed to crawl.
Understanding the relationship between crawling and indexing is covered in depth in our guide on what is indexing.
How to Test Your Robots.txt File
Before deploying any changes to your robots.txt file, testing is essential. A single misplaced Disallow can silently block entire sections of your site from being crawled, with ranking consequences that may not surface for weeks.
Google Search Console Robots.txt Tester
Google Search Console provides a robots.txt report under Settings > Crawling. It shows the robots.txt files Google has fetched for your property, when each was last crawled, its fetch status, and any parsing warnings or errors, and it lets you request a re-crawl of the file. The older standalone robots.txt Tester, which let you test individual URLs against your rules, has been retired; use the URL Inspection tool for URL-level checks instead.
Manual Testing
You can view your live robots.txt file at any time by navigating to yourdomain.com/robots.txt in a browser. Check that the file loads correctly, is served as plain text, and contains all the rules you expect.
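A quick command-line check (curl shown here; any HTTP client will do) lets you confirm the status code and Content-Type header in one go:

```
curl -I https://www.example.com/robots.txt
```

In the response, look for a 200 status and a Content-Type of text/plain; anything else (a redirect chain, an HTML error page, a 5xx) is worth fixing before crawlers encounter it.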
Third-Party Tools
Our robots.txt generator allows you to build and preview a robots.txt file before deploying it. The robots directives checker lets you verify that specific pages are returning the intended robots meta tags and HTTP headers. The site audit tool will flag if your robots.txt is blocking pages that appear in your sitemap — a common and significant error.
Google's URL Inspection Tool
In Google Search Console, the URL Inspection tool shows you whether a specific URL is blocked by robots.txt, alongside its indexing status, crawl history, and any canonical issues. This is the most direct way to confirm that a specific page is accessible to Googlebot.
Robots.txt SEO: Impact on Search Rankings
Robots.txt has a significant but indirect impact on SEO performance. Its primary value is in crawl efficiency: by preventing crawlers from wasting requests on low-value URLs, you direct more crawl activity toward your most important pages. For large sites, this has a measurable effect on how quickly new and updated content gets discovered and ranked.
On smaller sites (fewer than a few thousand pages), crawl budget is rarely a limiting factor. Google's crawlers are efficient, and for typical small business or blog websites, robots.txt is more of a hygiene measure than a performance lever. Still, even small sites benefit from keeping their robots.txt clean and free of errors.
Key SEO benefits of a well-maintained robots.txt file include:
- Preventing duplicate content — blocking parameter-based URLs, printer-friendly versions, and session ID variations prevents Googlebot from seeing multiple versions of the same content.
- Preserving crawl budget — directing crawlers away from pagination, faceted navigation, internal search, and admin areas preserves crawl budget for your valuable pages. Read more in our crawl budget guide.
- Improving crawl freshness — when crawlers are not wasting time on junk URLs, they revisit your important pages more frequently, picking up updates faster.
- Providing sitemap discovery — the Sitemap directive ensures crawlers can always find your XML sitemap, supporting broader URL discovery.
Common Robots.txt Mistakes
Even experienced webmasters make errors with robots.txt. The following mistakes are among the most frequently seen in SEO audits.
1. Blocking CSS and JavaScript Files
This is one of the most damaging mistakes you can make. Google renders web pages much like a browser does, processing HTML, CSS, and JavaScript to understand page content and layout. If your robots.txt blocks access to your stylesheets (/wp-content/themes/) or JavaScript files (/assets/js/), Googlebot cannot render your pages correctly. The result is that Google may misunderstand your content, fail to discover links, and potentially rank your pages lower because it cannot see what users see. Use the URL Inspection tool in Google Search Console to check the rendered page and any resources that could not be loaded, and ensure your CSS and JS directories are never disallowed.
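A sketch of the problem and one way around it; the two fragments below are alternatives for the same User-agent: * group, not a single file, and the paths are illustrative:

```
# Problematic: blocks the CSS and JS Googlebot needs to render pages
Disallow: /wp-content/themes/

# Safer: keep the block but carve out rendering assets with longer,
# more specific Allow rules, which win on precedence
Disallow: /wp-content/themes/
Allow: /wp-content/themes/*.css
Allow: /wp-content/themes/*.js
```

Often the simplest fix is to remove the Disallow entirely, since theme and asset directories rarely need blocking in the first place.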
2. Blocking the Entire Site with Disallow: /
Adding Disallow: / under User-agent: * blocks every crawler from every page on your entire site. This is appropriate only for staging environments and sites under maintenance that have not yet launched. It is an alarmingly common mistake on live production sites — often left in accidentally after a site migrates from staging to live. Run a site audit to check your robots.txt as part of your regular SEO maintenance.
3. Using Robots.txt Instead of Noindex for Sensitive Pages
Blocking a page in robots.txt does not remove it from Google's index. If Google has already crawled and indexed a page, adding a Disallow rule stops future crawls but the page may remain in search results indefinitely. Furthermore, Google can index a page it has never crawled if external links point to it — it just won't know the page's content. For pages you want removed from search results, you must use a noindex meta robots tag or X-Robots-Tag header, and the page must be crawlable so Google can read that instruction.
4. Case Sensitivity Errors
On Linux web servers (the majority of production web servers), file paths are case-sensitive. Disallow: /Admin/ will not block /admin/ if your URLs are lowercase. Always match the exact case of your URL paths.
5. Missing Trailing Slashes
Disallow: /news blocks /news, /newsletter, /news-archive, and anything else starting with /news. If you only intend to block the /news/ directory, use Disallow: /news/ with a trailing slash.
6. Conflicting Allow and Disallow Rules
When multiple rules match the same URL, Google applies the most specific rule (the longest matching path). If two rules are the same length, Allow takes precedence over Disallow. Unintentional conflicts can either accidentally block pages you want crawled or allow pages you intended to block. Always test specific URLs after making rule changes.
7. No Sitemap Directive
Omitting the Sitemap directive is a missed opportunity. Including your sitemap URL in robots.txt ensures every crawler that reads the file — not just Googlebot — can find your sitemap. This is particularly valuable for Bing and Yandex, where you may not have submitted your sitemap via their respective webmaster tools.
8. Forgetting Subdomains
Each subdomain requires its own robots.txt file. A robots.txt at www.example.com/robots.txt does not apply to blog.example.com or shop.example.com. If you run content on subdomains, ensure each has its own appropriately configured file.
9. Robots.txt on HTTPS and HTTP Separately
If your site is accessible on both http:// and https://, search engines treat these as separate origins, each with their own robots.txt. Typically this is not a problem if you have correct 301 redirects from HTTP to HTTPS — the crawler follows the redirect and reads the HTTPS robots.txt. But if your HTTP version is somehow still accessible without a redirect, ensure both origins are configured correctly.
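For reference, a minimal sketch of the kind of site-wide HTTP-to-HTTPS redirect the paragraph assumes, written for Apache with mod_rewrite enabled (nginx achieves the same with a return 301 in the port-80 server block):

```
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```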
Robots.txt and Security: Common Misconceptions
Robots.txt is not a security mechanism. It is a public file that any person or program can read — in fact, some malicious actors specifically read robots.txt to discover paths that the site owner considers sensitive. Using robots.txt to "hide" admin panels, API endpoints, or private content provides no real protection.
If a URL is listed in robots.txt under a Disallow rule, a human can still navigate directly to that URL. A malicious bot that ignores the Robots Exclusion Protocol will also access it freely. Legitimate search engine crawlers follow robots.txt, but not all bots do.
For genuinely sensitive areas of your website, the appropriate security measures are:
- Password protection or authentication (HTTP Basic Auth, login walls)
- IP allowlists for admin panels
- Server-level access controls (firewall rules, .htaccess deny directives)
- Private networks or VPNs for internal tools
That said, blocking your admin panel in robots.txt is still good practice — it prevents search engines from wasting crawl budget on pages they can never index, and avoids any chance of admin URLs appearing in search results (which could expose their existence even if the pages themselves are locked). It is a belt-and-braces measure, not a standalone security solution.
Robots.txt and Crawl Budget
Crawl budget is the number of URLs Googlebot will crawl on your site within a given time period. For most small and medium-sized websites, crawl budget is not a constraint — Google crawls everything without issue. For large websites — e-commerce sites with millions of product and category URLs, news sites with deep archives, or platforms with user-generated content — crawl budget management becomes critical.
Effective use of robots.txt is one of the primary tools for crawl budget optimisation. By blocking URLs that have no indexing value — thin pages, faceted navigation combinations, session IDs, internal search results — you direct Googlebot's attention toward pages that can actually rank. This improves the speed at which new and updated pages are discovered and re-crawled.
Beyond robots.txt, crawl budget is also influenced by:
- Site speed — faster servers get crawled more aggressively
- Crawl demand — popular pages are re-crawled more frequently
- Internal link structure — deeply buried pages are crawled less often
- Sitemap accuracy — keeping your sitemap clean of dead URLs helps
Our full guide on what is crawl budget covers this topic in detail, including how to analyse your crawl log data to identify budget waste.
Checking Your Robots.txt with RankNibbler
RankNibbler provides several tools that interact with robots.txt as part of a broader technical SEO workflow:
- The robots.txt generator lets you build a syntactically correct robots.txt file by choosing directives from a visual interface, then lets you copy or download the result ready for deployment.
- The robots directives checker fetches a URL and reports the robots meta tag, X-Robots-Tag header, and canonical tag — alongside whether the page is blocked by robots.txt.
- The site audit tool crawls your site and flags configuration issues including robots.txt blocking pages listed in your sitemap, missing sitemap directives, and pages that should be blocked but aren't.
For a comprehensive check of your entire SEO setup, run a full audit from the RankNibbler homepage. The audit checks title tags, meta descriptions, heading structure, image alt text, internal links, structured data, and more — across 30+ SEO checks in a single pass.
Robots.txt Frequently Asked Questions
Does robots.txt affect Google rankings?
Indirectly, yes. Robots.txt does not directly influence ranking signals, but by controlling which pages are crawled, it affects how efficiently Google can discover and process your important content. Blocking irrelevant or low-quality URLs improves crawl efficiency and can lead to faster indexing of new content. Conversely, accidentally blocking important pages in robots.txt will prevent them from being crawled and can cause ranking drops.
Can Google index a page that is blocked in robots.txt?
Yes. Google may index a URL without crawling it if external links point to that URL. In this case, Google may show the page in search results with a message like "A description of this page is not available" because it could not crawl the content. To fully remove a page from Google's index, you need to remove the Disallow rule, add a noindex meta robots tag, allow Googlebot to crawl the page and read the noindex instruction, and then wait for Google to re-process it.
What is the difference between robots.txt and a sitemap?
A robots.txt file tells crawlers what they should not access. A sitemap tells crawlers what you want them to index. They serve complementary roles: robots.txt restricts crawling of unwanted URLs, while a sitemap actively promotes crawling of your most important URLs. You can learn more in our guide on what is a sitemap.
Does robots.txt work for all bots?
Only bots that choose to follow the Robots Exclusion Protocol will honour your robots.txt rules. All major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo) follow the protocol. Many legitimate crawlers — SEO tools, academic research bots, web archiving services — also follow it. However, malicious scrapers and spam bots may deliberately ignore robots.txt. For protecting sensitive content, use server-level access controls rather than robots.txt.
How often does Google re-read my robots.txt file?
Google typically caches and re-fetches robots.txt approximately every 24 hours. This means changes you make may not be reflected in Googlebot's behaviour immediately. If you need Googlebot to re-read your robots.txt urgently (for example, if you have accidentally blocked your site), you can request a re-fetch in Google Search Console under Settings > Crawling.
Can I have more than one robots.txt file?
No. Each domain or subdomain can have only one robots.txt file, located at the root. If you have multiple subdomains, each requires its own robots.txt. You cannot have a robots.txt in a subdirectory — it will be ignored by search engines.
Should I block my /wp-admin/ in robots.txt?
Yes, but with care. The standard WordPress robots.txt blocks /wp-admin/ while specifically allowing /wp-admin/admin-ajax.php. The admin-ajax.php exception is important because many plugins and themes use this endpoint for front-end functionality, and blocking it can cause visual rendering issues that affect how Google sees your pages.
What happens if my robots.txt file has a syntax error?
Behaviour varies by crawler. Google's Googlebot is relatively lenient and will attempt to parse rules even with minor syntax errors. However, malformed rules may be silently ignored, meaning your intended blocks or permissions may not take effect. Bing and some other crawlers are stricter. Always validate your robots.txt syntax before deploying, using our robots.txt generator or Google Search Console's robots.txt report.
Is robots.txt required for SEO?
No, a robots.txt file is not strictly required. If no robots.txt exists, search engines assume everything is accessible. However, having a correctly configured robots.txt is considered best practice. At minimum, every site should have a file that includes the Sitemap directive, even if all the content rules are set to Allow. This ensures search engines can always find your sitemap.
Can I use robots.txt to block AI training bots?
Yes, with caveats. Several AI companies have published user-agent names for their training crawlers: GPTBot (OpenAI), Google-Extended (Google), CCBot (Common Crawl), and anthropic-ai (Anthropic), among others. You can add Disallow rules targeting these specific user-agents. However, compliance is voluntary — not all AI training crawlers declare their identity or follow robots.txt rules. This is an evolving area with no universal enforcement mechanism.
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Allow search engine crawlers
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
What should I always include in robots.txt?
At a minimum: a User-agent: * group that either allows or disallows access as appropriate, and a Sitemap: directive pointing to your XML sitemap. Beyond that, block admin paths, checkout flows, internal search results, and any URL patterns that generate significant numbers of low-value or duplicate pages. Use our SEO glossary for definitions of technical terms you encounter during this process.
How do I check if robots.txt is blocking a specific URL?
The most reliable method is Google Search Console's URL Inspection tool, which shows you explicitly whether a URL is blocked by robots.txt. You can also use the robots directives checker to see a URL's robots meta tag and indexing signals, or manually read your robots.txt and trace which rules would apply to the URL in question.
The RankNibbler site audit checks your robots.txt for sitemap references and will flag pages that are both listed in your sitemap and blocked by robots.txt — one of the most counterproductive configurations in technical SEO. For a complete overview of your site's health, visit the RankNibbler homepage.
Last updated: April 2026