Robots.txt Tester
Fetch or paste /robots.txt, choose a User-Agent, and test multiple URLs. We show Allow/Disallow, the matching rule, and why it won.
Results
Notes & references
- Simulation follows common crawler behavior: user-agent group with the longest UA token match; within that group, the longest matching path rule wins; ties favor Allow.
- `*` matches any sequence of characters; `$` anchors the end of the path/query string.
- `Disallow:` with an empty value doesn’t block anything; it’s treated as “no path specified”.
- Robots.txt governs crawling, not indexing; blocked URLs can still appear in results if discovered via links, but without crawled content.
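The matching rules above can be sketched in a few lines. This is a minimal illustration, not a full parser: it assumes rules are already grouped for the chosen user-agent, translates `*` and `$` into a regex, picks the longest matching pattern, and breaks ties in favor of Allow.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    '*' matches any sequence; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def decide(rules, path):
    """rules: list of (directive, pattern), e.g. ('Disallow', '/private/*').
    Longest matching pattern wins; on equal length, Allow beats Disallow.
    An empty Disallow pattern never matches (it blocks nothing)."""
    best = None  # (pattern_length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # empty Disallow: "no path specified"
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return "Allow" if best is None or best[1] else "Disallow"

rules = [("Disallow", "/private/"), ("Allow", "/private/public$")]
print(decide(rules, "/private/docs"))    # Disallow
print(decide(rules, "/private/public"))  # Allow (longer pattern wins)
```

Note the tuple comparison: length is compared first, and `True > False` implements “ties favor Allow” without extra branching.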
FAQs: robots.txt & proper crawling
/robots.txt tells crawlers where they may crawl on a host. A clear file prevents wasted crawl budget on duplicate or non-public URLs (e.g., filters, faceted pages, admin areas) and reduces server load.
Each origin (protocol + host + port) needs its own file, located exactly at https://example.com/robots.txt.
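Because the file lives at the origin root, its location can be derived from any URL on the site by dropping the path, query, and fragment. A minimal sketch using the standard library:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """The robots.txt for a URL lives at the root of its origin
    (scheme + host + port); path, query, and fragment are dropped."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com:8443/shop/item?id=1"))
# https://example.com:8443/robots.txt
```

Note that `https://example.com` and `https://example.com:8443` are different origins, so each needs its own file.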
Blocking a URL doesn’t remove it from search results: robots.txt controls crawling, not indexing. If a disallowed URL is discovered via links, search engines may keep a URL-only listing.
To deindex, use noindex (meta/X-Robots-Tag) or return 404/410. Note: noindex in robots.txt is not supported.
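For reference, the two supported forms of `noindex` look like this (the header variant is useful for non-HTML files such as PDFs; the page must remain crawlable for either to be seen):

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

For the HTTP-header variant, the server sends `X-Robots-Tag: noindex` with the response.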
If robots.txt returns 404/410, crawlers treat it as “no restrictions” (everything may be crawled). On 5xx errors, some crawlers act conservatively and crawl less.
Syntax mistakes or misplaced directives can cause rules to be ignored, leading to over- or under-blocking. Keep fields like User-agent, Allow, Disallow well-formed.
The group whose user-agent token is the longest match for the crawler’s name applies (falling back to the `*` group if none matches). Within that group, the longest matching path wins; on ties, Allow beats Disallow.
Wildcards * match any sequence; $ anchors the end of the URL path/query.
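A hypothetical file illustrating group selection:

```
User-agent: googlebot-image
Disallow: /photos/

User-agent: googlebot
Disallow: /

User-agent: *
Allow: /
```

A crawler identifying as `Googlebot-Image` uses the first group (longest user-agent token match), `Googlebot` uses the second, and every other crawler falls back to the `*` group. Only the rules in the selected group apply; groups are not merged.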
Valid syntax ensures crawlers can reliably apply your intent. Broken lines, wrong casing, or mixing directives inside the wrong group can cause critical sections to be crawled or blocked unintentionally.
A valid file also avoids crawler fallback behavior (e.g., treating the file as missing or ignoring malformed lines) that skews crawl distribution.
Don’t block assets needed to render pages. Crawlers render to understand layout and quality signals; blocking core CSS/JS can degrade understanding and eligibility for rich features.
Sitemap: may appear anywhere and is just a hint; it doesn’t change Allow/Disallow logic. Host: isn’t part of the modern standard and is ignored by some crawlers—prefer canonical URLs and sitemaps.