What Is Robots.txt?
A robots.txt file is a plain-text file placed at the root of a website — for example, https://www.yoursite.com/robots.txt — that tells automated web crawlers which parts of the site they are and are not permitted to access. The file is part of the Robots Exclusion Protocol (REP), a voluntary standard that has been in use since 1994 and remains one of the most fundamental technical SEO instruments available to webmasters.
Every major search engine crawler — Googlebot, Bingbot, DuckDuckBot, YandexBot, Baiduspider, and hundreds of others — checks a site's robots.txt file before crawling any page. The file acts like a doorman for your entire website, greeting bots at the entrance and directing them toward or away from specific sections. Understanding what robots.txt is, what it can do and, critically, what it cannot do is essential knowledge for anyone working in SEO.
The robots.txt file is deceptively simple in its syntax, yet it influences two deeply important technical SEO concerns: crawl efficiency and crawl budget. Large websites with hundreds of thousands of URLs often rely on carefully constructed robots.txt rules to ensure search engines spend their crawl budget on the URLs that matter, rather than wasting requests on internal search results, session-based parameters, or duplicate content. You can learn more about this concept in our guide on what is crawl budget.
An important caveat up front: robots.txt controls crawling, not indexing. To keep a page out of search results entirely, use a noindex meta tag or the X-Robots-Tag HTTP header instead.
The History of Robots.txt and the Robots Exclusion Protocol
The Robots Exclusion Protocol (also called the Robots Exclusion Standard) was proposed in June 1994 by Martijn Koster as an informal agreement between webmasters and the operators of early web crawlers. The original motivation was simple: automated crawlers were hitting web servers hard, and site owners needed a way to communicate which pages should be left alone.
The protocol remained an informal standard for over two decades. In 2019, Google, along with other search engine companies, submitted a proposal to the Internet Engineering Task Force (IETF) to formalise the standard. In September 2022, RFC 9309 was published, making the Robots Exclusion Protocol an official internet standard. This formalisation clarified previously ambiguous behaviour around directive matching, wildcard support, and precedence rules.
Despite its age, robots.txt remains the primary mechanism for crawl control. It predates sitemaps, the meta robots tag, and the X-Robots-Tag header — all of which serve related but distinct purposes, as we will discuss later in this guide.
How Robots.txt Works
When a search engine crawler visits your domain for the first time, or when its crawl schedule prompts a revisit, the very first request it makes is to /robots.txt. This happens before any individual page is crawled. The crawler downloads the file, parses the rules, and uses those rules to decide which URLs to queue for crawling and which to ignore.
The processing sequence works as follows:
- The crawler requests https://www.yoursite.com/robots.txt.
- If the server returns a 200 OK response, the crawler reads the file and follows the rules that apply to its specific user-agent.
- If the server returns a 404 Not Found, the crawler assumes there are no restrictions and proceeds to crawl freely.
- If the server returns a 5xx server error, the crawler treats this as a temporary block and will not crawl the site until the file becomes accessible again. This is an important reason to ensure your robots.txt never throws a 500 error.
- If the server returns a 301 or 302 redirect, Google will follow the redirect up to five hops. A redirect chain to robots.txt is bad practice — always serve it directly.
Google caches the robots.txt file and re-fetches it periodically (typically every 24 hours, though this varies). Changes to your robots.txt may not be reflected immediately in crawler behaviour. If you need Googlebot to re-check your file urgently, you can request a re-crawl in Google Search Console.
One critical point deserves repeating: robots.txt controls crawling, not indexing. If Googlebot has previously crawled a page and it is in the index, adding a Disallow rule for that URL will stop future crawls but will not remove it from the index. Google may still show the page in search results based on signals it has already gathered — including anchor text from external links. For full indexing control, leave the page crawlable and apply a noindex directive so Google can actually read the instruction.
Robots.txt Syntax Rules
The robots.txt file must follow a specific syntax. Errors in formatting are common and can cause entire rule blocks to be silently ignored by crawlers. Here are the core rules:
- The file must be served as a plain text file with the MIME type text/plain.
- It must be located at the root of the domain: /robots.txt. You cannot place it in a subdirectory.
- Each directive is on its own line. A directive consists of a field name, a colon, a space, and a value.
- Lines beginning with # are comments and are ignored by crawlers. Use them to annotate your file.
- Directive names are case-insensitive (disallow and Disallow are equivalent). Path values are case-sensitive on case-sensitive servers (most Linux-based web servers).
- A blank line separates groups of rules. Each group starts with one or more User-agent lines and is followed by Allow and Disallow directives.
- There is no official maximum file size, but Google will only process the first 500 kibibytes (512,000 bytes) of a robots.txt file. Keep your file well under this limit.
- UTF-8 encoding is recommended. Non-ASCII characters in path values may behave unexpectedly across different crawlers.
Here is a minimal but valid robots.txt file:
```
# Allow all crawlers to access everything
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
And here is an example with comments and multiple groups:
```
# Block all bots from admin and staging areas
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /search?
Disallow: /checkout/
Allow: /

# Block only Bingbot from a specific section
User-agent: Bingbot
Disallow: /members/

Sitemap: https://www.example.com/sitemap.xml
```
All Robots.txt Directives Explained
User-agent
The User-agent directive specifies which crawler or group of crawlers the following rules apply to. It is always the first directive in a rule group. A single group can target one specific bot or multiple bots (by repeating the User-agent line), and a wildcard (*) targets all crawlers that are not otherwise named.
Common user-agent names include:
| User-agent Name | Crawler | Search Engine |
|---|---|---|
| Googlebot | Main web crawler | Google |
| Googlebot-Image | Image crawler | Google |
| Googlebot-Video | Video crawler | Google |
| Bingbot | Main web crawler | Microsoft Bing |
| DuckDuckBot | Main web crawler | DuckDuckGo |
| Slurp | Main web crawler | Yahoo |
| YandexBot | Main web crawler | Yandex |
| Baiduspider | Main web crawler | Baidu |
| * | All crawlers (wildcard) | Any |
When a crawler evaluates your robots.txt file, it first looks for a group that explicitly names its user-agent. If a specific group exists, the crawler uses those rules and ignores all other groups, including the wildcard group. The wildcard group only applies to crawlers that have no explicitly named group.
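For instance, in the sketch below (the paths are hypothetical), Googlebot matches its own named group and ignores the wildcard group entirely, so it may still crawl /beta/ even though the wildcard group blocks it:

```
# Googlebot reads only this group
User-agent: Googlebot
Disallow: /drafts/

# All other crawlers read this group
User-agent: *
Disallow: /drafts/
Disallow: /beta/
```

If you want a named crawler to obey the general restrictions as well, repeat those rules inside its own group; groups are not inherited or merged across different user-agent names.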
Crawlers that are not named by any group fall back to the User-agent: * group.
Disallow
The Disallow directive instructs the crawler not to access any URL that begins with the specified path. It is the most commonly used directive in the robots.txt file.
- Disallow: /admin/ — blocks all URLs under /admin/, such as /admin/dashboard and /admin/users/edit.
- Disallow: / — blocks the entire website. Use with extreme caution — only appropriate for staging environments or sites under maintenance.
- Disallow: (empty value) — allows full access; equivalent to having no Disallow rule at all.
- Disallow: /page.html — blocks a single specific page.
Matching is done by prefix. Disallow: /news will block /news, /newsletter, /news/article-1, and /news-archive — anything that begins with /news. If you only want to block the /news/ directory, use the trailing slash: Disallow: /news/.
Allow
The Allow directive is used to override a broader Disallow rule for a specific path. It is most useful when you want to block an entire directory but make an exception for one or more paths within it.
```
User-agent: *
Disallow: /private/
Allow: /private/press-releases/
```
In this example, everything under /private/ is blocked except /private/press-releases/, which is explicitly allowed. When a URL matches both an Allow and a Disallow rule, the more specific (longer) path takes precedence. If both paths are the same length, the Allow rule wins.
A related technique is Allow: /$, which explicitly permits crawling of the homepage itself (and nothing else) even when broad Disallow rules exist; a bare Allow: / simply restates the default of allowing everything not otherwise blocked.
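To make the longest-match rule concrete, the end-of-line comments below (comments are valid anywhere in robots.txt) count the pattern lengths for the example above:

```
User-agent: *
Disallow: /private/                # 9 characters
Allow: /private/press-releases/    # 24 characters, longer and more specific, so it wins
```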
Sitemap
The Sitemap directive tells crawlers where to find your XML sitemap. It is not strictly a rule — it is informational — but it is one of the most valuable lines you can include in a robots.txt file. You can specify multiple sitemap lines if you use a sitemap index or have separate sitemaps for different content types.
```
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-images.xml
Sitemap: https://www.example.com/sitemap-news.xml
```
The Sitemap directive is supported by Google, Bing, Yahoo, and Ask. It must contain a fully qualified absolute URL (including the protocol). Placing your sitemap URL in robots.txt supplements — but does not replace — submitting it in Google Search Console and Bing Webmaster Tools. Learn more about how sitemaps work in our guide on what is a sitemap.
Crawl-delay
The Crawl-delay directive requests that the crawler wait a specified number of seconds between consecutive requests to the server. This is intended to prevent aggressive crawlers from overloading web servers.
```
User-agent: *
Crawl-delay: 10
```
There are important limitations to be aware of. Google does not honour the Crawl-delay directive at all, and the crawl rate limiter that used to live in Google Search Console has been retired; Googlebot adjusts its crawl rate automatically based on how your server responds. Bing, Yandex, and some other crawlers do respect Crawl-delay. For most websites, Crawl-delay is not necessary — only add it if your server is genuinely struggling under crawler load.
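Because Google ignores the directive, it is usually better to scope any Crawl-delay to the specific crawlers that honour it rather than the wildcard group; a minimal sketch, with an arbitrary delay value:

```
# Googlebot ignores Crawl-delay, so target crawlers that respect it
User-agent: Bingbot
Crawl-delay: 5
```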
Host
The Host directive was introduced by Yandex to specify the preferred domain of a website when multiple domains resolve to the same content (for example, both www.example.com and example.com). It is a Yandex-specific directive and is not recognised by Google or Bing, which rely on redirects and canonical tags instead.
```
Host: www.example.com
```
Yandex has since deprecated the Host directive in favour of 301 redirects and its own webmaster tools settings, so for most websites you can safely omit it.
Wildcard Patterns in Robots.txt
Google and Bing both support two wildcard characters in robots.txt path values. Not all crawlers support wildcards, but for SEO purposes the two that matter most do.
Asterisk Wildcard (*)
The asterisk (*) matches any sequence of zero or more characters. It can appear anywhere in the path value.
| Pattern | What It Blocks |
|---|---|
| Disallow: /*? | All URLs containing a query string parameter (the ? character) |
| Disallow: /search/* | All URLs beginning with /search/ followed by anything |
| Disallow: /*.pdf$ | All PDF files (using both wildcards) |
| Disallow: /category/*/page/ | Paginated category pages across all category names |
End-of-String Anchor ($)
The dollar sign ($) at the end of a path value anchors the match to the end of the URL. Without it, path matching is prefix-based. With it, the rule only applies if the URL ends exactly with the specified string.
```
# Block only URLs ending in .pdf (not /pdf/ as a directory)
Disallow: /*.pdf$

# Block the /search page itself but not /search/ subdirectories
Disallow: /search$
```
Combining both wildcards gives you granular control over URL pattern matching without having to list every individual URL. This is particularly useful for blocking faceted navigation, session IDs, printer-friendly pages, and other dynamically generated URL patterns that can cause crawl budget waste.
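As a sketch of how these patterns combine in practice, the block below targets some typical faceted-navigation and session parameters; the parameter names (sort, filter, sessionid) are illustrative and should be replaced with whatever your platform actually generates:

```
User-agent: *
# Faceted navigation and sorting parameters
Disallow: /*?*sort=
Disallow: /*?*filter=
# Session IDs wherever they appear in the URL
Disallow: /*sessionid=
# Printer-friendly versions of any page
Disallow: /*/print$
```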
Common Robots.txt Examples
Allow Everything (Default Open Site)
```
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
Standard E-commerce Site
```
User-agent: *
Allow: /

# Block internal search, filters, and cart
Disallow: /search
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Block admin and system files
Disallow: /admin/
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-products.xml
```
Block a Specific Bot Entirely
```
# Block a scraping bot
User-agent: BadBot
Disallow: /

# Allow all other bots
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
Staging or Development Site (Block All)
```
# Staging environment - do not index
User-agent: *
Disallow: /
```
News Site with Images
```
User-agent: *
Allow: /
Disallow: /archive/
Disallow: /tag/
Disallow: /author/

# Allow Google Image bot to access everything
User-agent: Googlebot-Image
Allow: /

Sitemap: https://www.example.com/news-sitemap.xml
Sitemap: https://www.example.com/sitemap.xml
```
Robots.txt for WordPress
WordPress has special considerations when it comes to the robots.txt file for SEO. WordPress does not create a physical robots.txt file by default — instead it generates a virtual one dynamically. This virtual file has some sensible defaults, but many SEO plugins extend or override it.
The default WordPress-generated robots.txt looks something like this:
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/wp-sitemap.xml
```
The Allow: /wp-admin/admin-ajax.php line is important — this endpoint powers many front-end AJAX requests and blocking it can cause rendering issues. WordPress plugins such as Yoast SEO, Rank Math, and All in One SEO all provide interfaces to edit your robots.txt without needing to create the physical file manually. Key WordPress-specific paths you often want to block include:
- /wp-admin/ — the admin panel (block this, but keep admin-ajax.php allowed)
- /wp-login.php — the login page (blocking it from crawlers is fine; it cannot be indexed anyway)
- /?s= or /search/ — WordPress search result pages, which are typically low-quality duplicate content
- /feed/ — RSS feeds, though some sites choose to leave these crawlable
- /?author= — author archive pages if you do not want them indexed
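Pulling those paths together, a minimal sketch of an extended WordPress robots.txt might look like the following; treat it as a starting point to adapt for your theme and plugins rather than a drop-in file (the sitemap URL is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/
Disallow: /?author=
# Optional, since some sites prefer to keep feeds crawlable:
# Disallow: /feed/

Sitemap: https://www.example.com/wp-sitemap.xml
```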
If you want to create a physical robots.txt file in WordPress (which takes precedence over the dynamic one), simply upload a robots.txt file to the root of your server — the same directory that contains wp-config.php. Use our robots.txt generator to create a correctly formatted file.
Robots.txt for Shopify
Shopify has its own approach to robots.txt that differs significantly from WordPress. Until 2021, Shopify merchants had no way to customise their robots.txt file — it was entirely controlled by the platform. Shopify now allows merchants to edit robots.txt through a robots.txt.liquid template in their theme.
The default Shopify robots.txt blocks a sensible set of paths out of the box, including:
- /admin — the Shopify admin panel
- /cart — the shopping cart page
- /checkout — the checkout flow
- /orders — customer order pages
- /cgi-bin — legacy system paths
- *login*, *password* — login and password-protected pages
To customise your Shopify robots.txt, navigate to Online Store > Themes > Edit Code, and add or edit the robots.txt.liquid template. Common Shopify-specific customisations include blocking faceted navigation parameters (e.g. ?sort_by=, ?filter.p.m.*) and collection+tag combinations that generate large numbers of duplicate or thin pages.
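A minimal sketch of a robots.txt.liquid customisation, based on the Liquid objects Shopify documents for this template (robots.default_groups, group.user_agent, group.rules, group.sitemap); verify the exact syntax against Shopify's current documentation before deploying:

```liquid
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules %}
    {{ rule }}
  {%- endfor %}
  {%- comment -%} Append a custom rule to the wildcard group only {%- endcomment -%}
  {%- if group.user_agent.value == '*' %}
    {{ 'Disallow: /*?sort_by=*' }}
  {%- endif %}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif %}
{% endfor %}
```

Looping over the default groups preserves Shopify's built-in rules and only appends the extra Disallow line, which is generally safer than replacing the template output wholesale.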
Shopify generates your XML sitemap automatically, and the default robots.txt already references it at /sitemap.xml. You do not need to add it manually unless you have additional sitemaps.
Robots.txt vs Meta Robots vs X-Robots-Tag
One of the most common points of confusion in technical SEO is understanding the difference between the three main mechanisms for controlling how search engines crawl and index your site. Each operates at a different level and controls different things.
| Mechanism | Location | Controls Crawling | Controls Indexing | Page-Level |
|---|---|---|---|---|
| robots.txt | Root of domain | Yes | No (indirectly) | No (path-based) |
| Meta robots tag | HTML <head> | No | Yes | Yes |
| X-Robots-Tag | HTTP header | No | Yes | Yes (incl. non-HTML) |
Meta Robots Tag
The meta robots tag is an HTML element placed in the <head> section of a page. Common values include noindex (do not add this page to the search index), nofollow (do not follow links on this page), noarchive (do not show a cached version), and nosnippet (do not show a text snippet in search results). Because the tag lives on the page itself, a crawler must be able to access and render the page to read it — which is why blocking a page in robots.txt while relying on a meta noindex tag is a self-defeating strategy. You can check your pages' robots directives with our robots directives checker.
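For reference, a typical meta robots tag; it sits in the page's <head> and combines directive values as needed:

```html
<!-- Keep this page out of the index but still follow its links -->
<meta name="robots" content="noindex, follow">
```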
X-Robots-Tag
The X-Robots-Tag is an HTTP response header that serves the same purpose as the meta robots tag but is set at the server level rather than in the HTML. Its key advantage is that it works for non-HTML files — PDFs, images, and other file types that cannot contain an HTML <head> element. It supports the same directives as the meta robots tag.
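As an illustration, the header appears in the HTTP response like this:

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive
```

One common way to set it for every PDF on an Apache server is a FilesMatch block (this assumes mod_headers is enabled; nginx offers an equivalent add_header directive):

```
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
```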
When to Use Which
- Use robots.txt to stop crawlers from accessing pages you do not want crawled at all — admin panels, internal search results, duplicate parameter URLs, and staging areas.
- Use the meta robots noindex tag on pages you want crawlers to access but not include in search results — thank-you pages, login pages, paginated pages beyond page one, and low-value tag archives.
- Use X-Robots-Tag when you need indexing control on non-HTML resources, or when you want to apply directives site-wide without modifying individual page templates.
- Never rely on robots.txt alone to keep a page out of the index. A crawlable page carrying noindex will be removed from Google; a page that is only blocked in robots.txt may still appear in search results with limited information, and Google cannot read a noindex tag on a page it is not allowed to crawl.
Understanding the relationship between crawling and indexing is covered in depth in our guide on what is indexing.
How to Test Your Robots.txt File
Before deploying any changes to your robots.txt file, testing is essential. A single misplaced Disallow can silently block entire sections of your site from being crawled, with ranking consequences that may not surface for weeks.
Google Search Console Robots.txt Tester
Google Search Console provides a robots.txt report under Settings > Crawling. It shows the robots.txt files Google has fetched for your property, when each was last crawled, its fetch status, and any parsing warnings or errors, and it lets you request a re-crawl of the file. The older standalone robots.txt Tester, which let you test individual URLs against your rules, has been retired; use the URL Inspection tool for URL-level checks instead.
Manual Testing
You can view your live robots.txt file at any time by navigating to yourdomain.com/robots.txt in a browser. Check that the file loads correctly, is served as plain text, and contains all the rules you expect.
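A quick command-line check (curl shown here; any HTTP client will do) lets you confirm the status code and Content-Type header in one go:

```
curl -I https://www.example.com/robots.txt
```

In the response, look for a 200 status and a Content-Type of text/plain; anything else (a redirect chain, an HTML error page, a 5xx) is worth fixing before crawlers encounter it.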
Third-Party Tools
Our robots.txt generator allows you to build and preview a robots.txt file before deploying it. The robots directives checker lets you verify that specific pages are returning the intended robots meta tags and HTTP headers. The site audit tool will flag if your robots.txt is blocking pages that appear in your sitemap — a common and significant error.
Google's URL Inspection Tool
In Google Search Console, the URL Inspection tool shows you whether a specific URL is blocked by robots.txt, alongside its indexing status, crawl history, and any canonical issues. This is the most direct way to confirm that a specific page is accessible to Googlebot.
Robots.txt SEO: Impact on Search Rankings
Robots.txt has a significant but indirect impact on SEO performance. Its primary value is in crawl efficiency: by preventing crawlers from wasting requests on low-value URLs, you direct more crawl activity toward your most important pages. For large sites, this has a measurable effect on how quickly new and updated content gets discovered and ranked.
On smaller sites (fewer than a few thousand pages), crawl budget is rarely a limiting factor. Google's crawlers are efficient, and for typical small business or blog websites, robots.txt is more of a hygiene measure than a performance lever. Still, even small sites benefit from keeping their robots.txt clean and free of errors.
Key SEO benefits of a well-maintained robots.txt file include:
- Preventing duplicate content — blocking parameter-based URLs, printer-friendly versions, and session ID variations prevents Googlebot from seeing multiple versions of the same content.
- Preserving crawl budget — directing crawlers away from pagination, faceted navigation, internal search, and admin areas preserves crawl budget for your valuable pages. Read more in our crawl budget guide.
- Improving crawl freshness — when crawlers are not wasting time on junk URLs, they revisit your important pages more frequently, picking up updates faster.
- Providing sitemap discovery — the Sitemap directive ensures crawlers can always find your XML sitemap, supporting broader URL discovery.
Common Robots.txt Mistakes
Even experienced webmasters make errors with robots.txt. The following mistakes are among the most frequently seen in SEO audits.
1. Blocking CSS and JavaScript Files
This is one of the most damaging mistakes you can make. Google renders web pages much like a browser does, processing HTML, CSS, and JavaScript to understand page content and layout. If your robots.txt blocks access to your stylesheets (/wp-content/themes/) or JavaScript files (/assets/js/), Googlebot cannot render your pages correctly. The result is that Google may misunderstand your content, fail to discover links, and potentially rank your pages lower because it cannot see what users see. Use the URL Inspection tool in Google Search Console to check the rendered page and any resources that could not be loaded, and ensure your CSS and JS directories are never disallowed.
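A sketch of the problem and one way around it; the two fragments below are alternatives for the same User-agent: * group, not a single file, and the paths are illustrative:

```
# Problematic: blocks the CSS and JS Googlebot needs to render pages
Disallow: /wp-content/themes/

# Safer: keep the block but carve out rendering assets with longer,
# more specific Allow rules, which win on precedence
Disallow: /wp-content/themes/
Allow: /wp-content/themes/*.css
Allow: /wp-content/themes/*.js
```

Often the simplest fix is to remove the Disallow entirely, since theme and asset directories rarely need blocking in the first place.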
2. Blocking the Entire Site with Disallow: /
Adding Disallow: / under User-agent: * blocks every crawler from every page on your entire site. This is appropriate only for staging environments and sites under maintenance that have not yet launched. It is an alarmingly common mistake on live production sites — often left in accidentally after a site migrates from staging to live. Run a site audit to check your robots.txt as part of your regular SEO maintenance.
3. Using Robots.txt Instead of Noindex for Sensitive Pages
Blocking a page in robots.txt does not remove it from Google's index. If Google has already crawled and indexed a page, adding a Disallow rule stops future crawls but the page may remain in search results indefinitely. Furthermore, Google can index a page it has never crawled if external links point to it — it just won't know the page's content. For pages you want removed from search results, you must use a noindex meta robots tag or X-Robots-Tag header, and the page must be crawlable so Google can read that instruction.
4. Case Sensitivity Errors
On Linux web servers (the majority of production web servers), file paths are case-sensitive. Disallow: /Admin/ will not block /admin/ if your URLs are lowercase. Always match the exact case of your URL paths.
5. Missing Trailing Slashes
Disallow: /news blocks /news, /newsletter, /news-archive, and anything else starting with /news. If you only intend to block the /news/ directory, use Disallow: /news/ with a trailing slash.
6. Conflicting Allow and Disallow Rules
When multiple rules match the same URL, Google applies the most specific rule (the longest matching path). If two rules are the same length, Allow takes precedence over Disallow. Unintentional conflicts can either accidentally block pages you want crawled or allow pages you intended to block. Always test specific URLs after making rule changes.
7. No Sitemap Directive
Omitting the Sitemap directive is a missed opportunity. Including your sitemap URL in robots.txt ensures every crawler that reads the file — not just Googlebot — can find your sitemap. This is particularly valuable for Bing and Yandex, where you may not have submitted your sitemap via their respective webmaster tools.
8. Forgetting Subdomains
Each subdomain requires its own robots.txt file. A robots.txt at www.example.com/robots.txt does not apply to blog.example.com or shop.example.com. If you run content on subdomains, ensure each has its own appropriately configured file.
9. Robots.txt on HTTPS and HTTP Separately
If your site is accessible on both http:// and https://, search engines treat these as separate origins, each with their own robots.txt. Typically this is not a problem if you have correct 301 redirects from HTTP to HTTPS — the crawler follows the redirect and reads the HTTPS robots.txt. But if your HTTP version is somehow still accessible without a redirect, ensure both origins are configured correctly.
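For reference, a minimal sketch of the kind of site-wide HTTP-to-HTTPS redirect the paragraph assumes, written for Apache with mod_rewrite enabled (nginx achieves the same with a return 301 in the port-80 server block):

```
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```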
Robots.txt and Security: Common Misconceptions
Robots.txt is not a security mechanism. It is a public file that any person or program can read — in fact, some malicious actors specifically read robots.txt to discover paths that the site owner considers sensitive. Using robots.txt to "hide" admin panels, API endpoints, or private content provides no real protection.
If a URL is listed in robots.txt under a Disallow rule, a human can still navigate directly to that URL. A malicious bot that ignores the Robots Exclusion Protocol will also access it freely. Legitimate search engine crawlers follow robots.txt, but not all bots do.
For genuinely sensitive areas of your website, the appropriate security measures are:
- Password protection or authentication (HTTP Basic Auth, login walls)
- IP allowlists for admin panels
- Server-level access controls (firewall rules, .htaccess deny directives)
- Private networks or VPNs for internal tools
That said, blocking your admin panel in robots.txt is still good practice — it prevents search engines from wasting crawl budget on pages they can never index, and avoids any chance of admin URLs appearing in search results (which could expose their existence even if the pages themselves are locked). It is a belt-and-braces measure, not a standalone security solution.
Robots.txt and Crawl Budget
Crawl budget is the number of URLs Googlebot will crawl on your site within a given time period. For most small and medium-sized websites, crawl budget is not a constraint — Google crawls everything without issue. For large websites — e-commerce sites with millions of product and category URLs, news sites with deep archives, or platforms with user-generated content — crawl budget management becomes critical.
Effective use of robots.txt is one of the primary tools for crawl budget optimisation. By blocking URLs that have no indexing value — thin pages, faceted navigation combinations, session IDs, internal search results — you direct Googlebot's attention toward pages that can actually rank. This improves the speed at which new and updated pages are discovered and re-crawled.
Beyond robots.txt, crawl budget is also influenced by:
- Site speed — faster servers get crawled more aggressively
- Crawl demand — popular pages are re-crawled more frequently
- Internal link structure — deeply buried pages are crawled less often
- Sitemap accuracy — keeping your sitemap clean of dead URLs helps
Our full guide on what is crawl budget covers this topic in detail, including how to analyse your crawl log data to identify budget waste.
Checking Your Robots.txt with RankNibbler
RankNibbler provides several tools that interact with robots.txt as part of a broader technical SEO workflow:
- The robots.txt generator lets you build a syntactically correct robots.txt file by choosing directives from a visual interface, then lets you copy or download the result ready for deployment.
- The robots directives checker fetches a URL and reports the robots meta tag, X-Robots-Tag header, and canonical tag — alongside whether the page is blocked by robots.txt.
- The site audit tool crawls your site and flags configuration issues including robots.txt blocking pages listed in your sitemap, missing sitemap directives, and pages that should be blocked but aren't.
For a comprehensive check of your entire SEO setup, run a full audit from the RankNibbler homepage. The audit checks title tags, meta descriptions, heading structure, image alt text, internal links, structured data, and more — across 30+ SEO checks in a single pass.
Robots.txt Frequently Asked Questions
Does robots.txt affect Google rankings?
Indirectly, yes. Robots.txt does not directly influence ranking signals, but by controlling which pages are crawled, it affects how efficiently Google can discover and process your important content. Blocking irrelevant or low-quality URLs improves crawl efficiency and can lead to faster indexing of new content. Conversely, accidentally blocking important pages in robots.txt will prevent them from being crawled and can cause ranking drops.
Can Google index a page that is blocked in robots.txt?
Yes. Google may index a URL without crawling it if external links point to that URL. In this case, Google may show the page in search results with a message like "A description of this page is not available" because it could not crawl the content. To fully remove a page from Google's index, you need to remove the Disallow rule, add a noindex meta robots tag, allow Googlebot to crawl the page and read the noindex instruction, and then wait for Google to re-process it.
What is the difference between robots.txt and a sitemap?
A robots.txt file tells crawlers what they should not access. A sitemap tells crawlers what you want them to index. They serve complementary roles: robots.txt restricts crawling of unwanted URLs, while a sitemap actively promotes crawling of your most important URLs. You can learn more in our guide on what is a sitemap.
Does robots.txt work for all bots?
Only bots that choose to follow the Robots Exclusion Protocol will honour your robots.txt rules. All major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo) follow the protocol. Many legitimate crawlers — SEO tools, academic research bots, web archiving services — also follow it. However, malicious scrapers and spam bots may deliberately ignore robots.txt. For protecting sensitive content, use server-level access controls rather than robots.txt.
How often does Google re-read my robots.txt file?
Google typically caches and re-fetches robots.txt approximately every 24 hours. This means changes you make may not be reflected in Googlebot's behaviour immediately. If you need Googlebot to re-read your robots.txt urgently (for example, if you have accidentally blocked your site), you can request a re-fetch in Google Search Console under Settings > Crawling.
Can I have more than one robots.txt file?
No. Each domain or subdomain can have only one robots.txt file, located at the root. If you have multiple subdomains, each requires its own robots.txt. You cannot have a robots.txt in a subdirectory — it will be ignored by search engines.
Should I block my /wp-admin/ in robots.txt?
Yes, but with care. The standard WordPress robots.txt blocks /wp-admin/ while specifically allowing /wp-admin/admin-ajax.php. The admin-ajax.php exception is important because many plugins and themes use this endpoint for front-end functionality, and blocking it can cause visual rendering issues that affect how Google sees your pages.
What happens if my robots.txt file has a syntax error?
Behaviour varies by crawler. Google's Googlebot is relatively lenient and will attempt to parse rules even with minor syntax errors. However, malformed rules may be silently ignored, meaning your intended blocks or permissions may not take effect. Bing and some other crawlers are stricter. Always validate your robots.txt syntax before deploying, using our robots.txt generator or Google Search Console's robots.txt report.
Is robots.txt required for SEO?
No, a robots.txt file is not strictly required. If no robots.txt exists, search engines assume everything is accessible. However, having a correctly configured robots.txt is considered best practice. At minimum, every site should have a file that includes the Sitemap directive, even if all the content rules are set to Allow. This ensures search engines can always find your sitemap.
Can I use robots.txt to block AI training bots?
Yes, with caveats. Several AI companies have published user-agent names for their training crawlers: GPTBot (OpenAI), Google-Extended (Google), CCBot (Common Crawl), and anthropic-ai (Anthropic), among others. You can add Disallow rules targeting these specific user-agents. However, compliance is voluntary — not all AI training crawlers declare their identity or follow robots.txt rules. This is an evolving area with no universal enforcement mechanism.
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Allow search engine crawlers
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
What should I always include in robots.txt?
At a minimum: a User-agent: * group that either allows or disallows access as appropriate, and a Sitemap: directive pointing to your XML sitemap. Beyond that, block admin paths, checkout flows, internal search results, and any URL patterns that generate significant numbers of low-value or duplicate pages. Use our SEO glossary for definitions of technical terms you encounter during this process.
How do I check if robots.txt is blocking a specific URL?
The most reliable method is Google Search Console's URL Inspection tool, which shows you explicitly whether a URL is blocked by robots.txt. You can also use the robots directives checker to see a URL's robots meta tag and indexing signals, or manually read your robots.txt and trace which rules would apply to the URL in question.
The RankNibbler site audit checks your robots.txt for sitemap references and will flag pages that are both listed in your sitemap and blocked by robots.txt — one of the most counterproductive configurations in technical SEO. For a complete overview of your site's health, visit the RankNibbler homepage.
Last updated: April 2026