Robots.txt Tester
Fetch or paste /robots.txt, choose a User-Agent, and test multiple URLs. We show Allow/Disallow, the matching rule, and why it won.
Results
Notes & references
- Simulation follows common crawler behavior: user-agent group with the longest UA token match; within that group, the longest matching path rule wins; ties favor Allow.
- `*` matches any sequence of characters; `$` anchors the end of the path/query string.
- `Disallow:` with an empty value doesn’t block anything; it’s treated as “no path specified”.
- Robots.txt governs crawling, not indexing; blocked URLs can still appear in results if discovered via links, but without crawled content.
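The matching rules above can be sketched in a few lines. This is a minimal illustration, not a full parser: it assumes rules are already grouped for the chosen user-agent, translates `*` and `$` into a regex, picks the longest matching pattern, and breaks ties in favor of Allow.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    '*' matches any sequence; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def decide(rules, path):
    """rules: list of (directive, pattern), e.g. ('Disallow', '/private/*').
    Longest matching pattern wins; on equal length, Allow beats Disallow.
    An empty Disallow pattern never matches (it blocks nothing)."""
    best = None  # (pattern_length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # empty Disallow: "no path specified"
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return "Allow" if best is None or best[1] else "Disallow"

rules = [("Disallow", "/private/"), ("Allow", "/private/public$")]
print(decide(rules, "/private/docs"))    # Disallow
print(decide(rules, "/private/public"))  # Allow (longer pattern wins)
```

Note the tuple comparison: length is compared first, and `True > False` implements “ties favor Allow” without extra branching.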
FAQs: robots.txt & proper crawling
/robots.txt tells crawlers where they may crawl on a host. A clear file prevents wasted crawl budget on duplicate or non-public URLs (e.g., filters, faceted pages, admin areas) and reduces server load.
Each origin (protocol + host + port) needs its own file, located exactly at https://example.com/robots.txt.
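Because the file lives at the origin root, its location can be derived from any URL on the site by dropping the path, query, and fragment. A minimal sketch using the standard library:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """The robots.txt for a URL lives at the root of its origin
    (scheme + host + port); path, query, and fragment are dropped."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com:8443/shop/item?id=1"))
# https://example.com:8443/robots.txt
```

Note that `https://example.com` and `https://example.com:8443` are different origins, so each needs its own file.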
Blocking a URL doesn’t remove it from search results: robots.txt controls crawling, not indexing. If a disallowed URL is discovered via links, search engines may keep a URL-only listing.
To deindex, use noindex (meta/X-Robots-Tag) or return 404/410. Note: noindex in robots.txt is not supported.
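For reference, the two supported forms of `noindex` look like this (the header variant is useful for non-HTML files such as PDFs; the page must remain crawlable for either to be seen):

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

For the HTTP-header variant, the server sends `X-Robots-Tag: noindex` with the response.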
If robots.txt returns 404/410, crawlers treat it as “no restrictions” (everything may be crawled). On 5xx errors, some crawlers act conservatively and crawl less.
Syntax mistakes or misplaced directives can cause rules to be ignored, leading to over- or under-blocking. Keep fields like User-agent, Allow, Disallow well-formed.
The group whose user-agent token is the longest match for the crawler’s name applies (falling back to the `*` group if none matches). Within that group, the longest matching path wins; on ties, Allow beats Disallow.
Wildcards * match any sequence; $ anchors the end of the URL path/query.
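A hypothetical file illustrating group selection:

```
User-agent: googlebot-image
Disallow: /photos/

User-agent: googlebot
Disallow: /

User-agent: *
Allow: /
```

A crawler identifying as `Googlebot-Image` uses the first group (longest user-agent token match), `Googlebot` uses the second, and every other crawler falls back to the `*` group. Only the rules in the selected group apply; groups are not merged.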
Valid syntax ensures crawlers can reliably apply your intent. Broken lines, wrong casing, or mixing directives inside the wrong group can cause critical sections to be crawled or blocked unintentionally.
A valid file also avoids crawler fallback behavior (e.g., treating the file as missing or ignoring malformed lines) that skews crawl distribution.
Don’t block assets needed to render pages. Crawlers render to understand layout and quality signals; blocking core CSS/JS can degrade understanding and eligibility for rich features.
Sitemap: may appear anywhere and is just a hint; it doesn’t change Allow/Disallow logic. Host: isn’t part of the modern standard and is ignored by some crawlers—prefer canonical URLs and sitemaps.