Robots.txt Tester

Fetch or paste /robots.txt, choose a User-Agent, and test multiple URLs. We show Allow/Disallow, the matching rule, and why it won.

We’ll fetch /robots.txt from the site.
We use Google-like matching: the group with the longest matching user-agent token applies.
Tip: You can paste 5–50 URLs to batch-check rules.


Notes & references

  • Simulation follows common crawler behavior: user-agent group with the longest UA token match; within that group, the longest matching path rule wins; ties favor Allow.
  • * matches any sequence; $ anchors the end of the path/query string.
  • Disallow: (empty) doesn’t block; it’s treated as “no path specified”.
  • Robots.txt governs crawling, not indexing; blocked URLs can still appear if discovered via links, but without content.
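The precedence rules above can be sketched in a few lines. This is a minimal, hypothetical matcher (the function names `pattern_to_regex` and `is_allowed` are ours, not a library API): it ignores empty Disallow lines, lets the longest matching pattern win, and breaks ties in favor of Allow.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.
    '*' matches any character sequence; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, pattern) pairs from the chosen user-agent group,
    e.g. [("disallow", "/private/"), ("allow", "/private/doc.html")].
    Longest matching pattern wins; on equal length, Allow beats Disallow."""
    best_len, allowed = -1, True            # no matching rule => allowed
    for directive, pattern in rules:
        if not pattern:                     # empty Disallow blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, allowed = length, (directive == "allow")
    return allowed
```

For example, `is_allowed([("disallow", "/private/"), ("allow", "/private/doc.html")], "/private/doc.html")` returns `True` because the Allow pattern is longer, while any other path under `/private/` stays blocked.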

FAQs: robots.txt & proper crawling

/robots.txt tells crawlers where they may crawl on a host. A clear file prevents wasted crawl budget on duplicate or non-public URLs (e.g., filters, faceted pages, admin areas) and reduces server load.

Each origin (protocol + host + port) needs its own file, located exactly at https://example.com/robots.txt.
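Deriving the governing robots.txt for any page URL is mechanical: keep the scheme, host, and port, and replace the rest with `/robots.txt`. A small sketch (the helper name `robots_url` is ours):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin that governs page_url.
    The origin is scheme + host + port, so the path/query are discarded."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Note that `robots_url("https://example.com:8443/shop?page=2")` yields `https://example.com:8443/robots.txt`; a different subdomain or port is a different origin and needs its own file.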

No. It controls crawling, not indexing. A Disallow rule doesn’t remove a URL from search if it’s discovered by links—search engines may keep a URL-only listing.

To deindex, use noindex (meta/X-Robots-Tag) or return 404/410. Note: noindex in robots.txt is not supported.

404/410: treated as “no restrictions” (everything allowed). 5xx: some crawlers may act conservatively and crawl less.
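One way to encode that status handling is a small lookup, shown here as a sketch of common crawler behavior (exact handling varies by crawler; the function name and the returned labels are our own convention):

```python
def robots_fetch_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl stance,
    following common crawler behavior rather than any one crawler's spec."""
    if 200 <= status < 300:
        return "parse-rules"       # file exists: apply its rules
    if status in (404, 410):
        return "allow-all"         # missing file: no restrictions
    if 500 <= status < 600:
        return "be-conservative"   # server error: many crawlers back off
    return "allow-all"             # other 4xx: commonly treated as no rules
```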

Syntax mistakes or misplaced directives can cause rules to be ignored, leading to over- or under-blocking. Keep fields like User-agent, Allow, Disallow well-formed.

The group with the longest matching user-agent token applies (fallback * if none). Within that group, the longest matching path wins; on ties, Allow beats Disallow.
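Group selection can be sketched by treating each group's user-agent token as a case-insensitive prefix of the crawler's product token and preferring the longest match, with `*` as the fallback. This is a simplified model of common behavior, and `choose_group` is a hypothetical helper, not a library function:

```python
def choose_group(groups: dict[str, list[str]], crawler: str) -> list[str]:
    """groups maps a robots.txt user-agent token to its rule lines.
    Pick the group whose token is the longest case-insensitive prefix of
    the crawler's product token; fall back to '*' if nothing matches."""
    crawler = crawler.lower()
    best_token, best_len = None, -1
    for token in groups:
        t = token.lower()
        if t != "*" and crawler.startswith(t) and len(t) > best_len:
            best_token, best_len = token, len(t)
    if best_token is None and "*" in groups:
        best_token = "*"
    return groups.get(best_token, [])
```

Given groups for `googlebot`, `googlebot-news`, and `*`, a crawler named `Googlebot-News` gets the `googlebot-news` group (longest token), `Googlebot-Image` falls back to `googlebot`, and any other crawler gets `*`.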

The * wildcard matches any character sequence; $ anchors the end of the URL path/query.

Valid syntax ensures crawlers can reliably apply your intent. Broken lines, wrong casing, or mixing directives inside the wrong group can cause critical sections to be crawled or blocked unintentionally.

A valid file also avoids support fallbacks (e.g., treating the file as missing or ignoring lines) that skew crawl distribution.

Don’t block assets needed to render pages. Crawlers render to understand layout and quality signals; blocking core CSS/JS can degrade understanding and eligibility for rich features.

Sitemap: may appear anywhere and is just a hint; it doesn’t change Allow/Disallow logic. Host: isn’t part of the modern standard and is ignored by some crawlers—prefer canonical URLs and sitemaps.

Robots.txt issues usually hide bigger crawl-budget problems.

A single wrong Disallow, a missing file on a subdomain, or outdated rules for filters and collections can quietly block money pages, waste crawl on junk parameters, and slow down how fast new content gets discovered.

  • Audit robots.txt across www / non-www / subdomains / staging for consistency.
  • Map robots.txt rules to real URLs (logs, sitemaps, internal links) — not just theory.
  • Fix “over-blocking” patterns that hide PDPs, category pages, filters, or blog clusters.
  • Reduce crawl waste on faceted URLs, search pages, tracking params, and duplicates.
  • Align robots.txt with canonical, sitemaps, hreflang, and performance goals.
  • See the 90-day Technical SEO & crawl plan
  • Book a robots.txt & crawl-budget review
  • Email Kiran with your tester JSON export

Based in Dubai, helping SaaS, marketplace & eCommerce teams across India, MENA, Europe, and the US turn crawl control into revenue.