Robots.txt Configuration: Control What Google Crawls
One robots.txt mistake can deindex your entire site. This guide shows the exact configuration used by Fortune 500 sites.
TL;DR
Robots.txt controls which pages search engines crawl, and mistakes in this file can deindex your entire site (a failure that has hit roughly 1 in 5 sites). This guide covers 17 critical robots.txt rules: User-agent directives, Disallow/Allow syntax, Crawl-delay, Sitemap declarations, wildcard patterns, and testing. 42% of sites have robots.txt errors blocking important pages. SEOLOGY automatically manages and monitors your robots.txt to prevent catastrophic errors.
What is Robots.txt and Why It Matters
Robots.txt is a text file in your site's root directory that tells search engine crawlers which pages they can and cannot access.
Location: https://yoursite.com/robots.txt (must be in root directory)
Format: Plain text file following Robots Exclusion Protocol
Purpose: Control crawl budget, protect sensitive pages, prevent duplicate content indexing
17 Critical Robots.txt Rules
Basic Syntax (5 Rules)
1. User-agent Directive
What it does: Specifies which crawler the rules apply to.
Syntax:
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /private/
Common user-agents:
• Googlebot - Google's web crawler
• Bingbot - Microsoft Bing crawler
• Googlebot-Image - Google Images
• * - All crawlers (wildcard)
Case sensitivity: User-agent names are case-insensitive.
2. Disallow Directive
What it does: Blocks crawlers from accessing specified paths.
Examples:
# Block entire directory
Disallow: /admin/

# Block specific file
Disallow: /secret-page.html

# Block all pages (DANGEROUS!)
Disallow: /

# Block nothing (allow all)
Disallow:
Warning: Disallow: / blocks your entire site. Most common catastrophic error.
3. Allow Directive
What it does: Explicitly allows crawling specific paths within blocked directories.
Example:
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

# Result: /admin/ blocked except /admin/public/
Note: The rule with the longest (most specific) path wins. When an Allow and a Disallow rule tie in specificity, Google applies the least restrictive rule, i.e. Allow.
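This precedence logic (longest matching path wins, Allow wins ties) can be sketched in a few lines. A simplified matcher that ignores wildcards; the `rules` array shape here is hypothetical, not a real library API:

```javascript
// Simplified sketch of robots.txt rule precedence for a single user-agent
// group: the matching rule with the longest path wins; on a tie, the least
// restrictive rule (Allow) wins. Wildcards are ignored for brevity.
function isBlocked(urlPath, rules) {
  let winner = null;
  for (const rule of rules) {
    if (!urlPath.startsWith(rule.path)) continue; // rule doesn't match this URL
    if (
      winner === null ||
      rule.path.length > winner.path.length || // longer path = more specific
      (rule.path.length === winner.path.length && rule.type === 'allow')
    ) {
      winner = rule;
    }
  }
  return winner !== null && winner.type === 'disallow';
}
```

For example, with `Disallow: /admin/` and `Allow: /admin/public/`, the URL `/admin/public/page.html` is crawlable because the Allow rule's path is longer.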
4. Sitemap Declaration
What it does: Tells crawlers where to find your XML sitemap.
Syntax:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-blog.xml
Requirements: Full absolute URL required (not relative paths).
Benefit: Helps search engines discover your sitemap without manual submission.
5. Crawl-delay Directive
What it does: Adds delay (in seconds) between crawler requests.
Syntax:
User-agent: *
Crawl-delay: 10

# Crawler waits 10 seconds between requests
Important: Google ignores Crawl-delay. Googlebot adjusts its own crawl rate based on how your server responds, and the old Search Console crawl-rate setting has been retired.
Use case: Slow down aggressive crawlers that overload your server (Bing respects this).
Advanced Patterns (6 Rules)
6. Wildcard Asterisk (*)
What it does: Matches any sequence of characters.
Examples:
# Block all URLs with query parameters
Disallow: /*?

# Block all PDF files
Disallow: /*.pdf$

# Block all URLs containing "sort="
Disallow: /*sort=*
Support: Google and most modern crawlers support wildcards.
7. Dollar Sign End Anchor ($)
What it does: Matches end of URL.
Examples:
# Block URLs ending in .pdf
Disallow: /*.pdf$

# Without the $, this also matches ".pdf" in the middle of a URL
Disallow: /*.pdf

# Block URLs ending with /private
Disallow: /*/private$
Use case: Block specific file types or URL patterns.
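To reason about what a pattern with `*` and `$` will actually match, it can help to translate it into a regular expression. A rough sketch (simplified; real crawlers follow the RFC 9309 matching rules):

```javascript
// Sketch: translate a robots.txt path pattern into a JavaScript RegExp.
// '*' matches any character sequence; a trailing '$' anchors the end of
// the URL. Everything else is treated literally.
function robotsPatternToRegExp(pattern) {
  // Escape regex metacharacters, keeping '*' for later expansion.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  const body = escaped
    .replace(/\*/g, '.*')     // robots '*' -> regex '.*'
    .replace(/\\\$$/, '$');   // restore a trailing '$' as an end anchor
  return new RegExp('^' + body);
}
```

For example, `/*.pdf$` matches `/files/report.pdf` but not `/files/report.pdf?dl=1`, while the unanchored `/*.pdf` matches both.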
8. Blocking URL Parameters
Problem: Query parameters create duplicate content (example.com/page vs example.com/page?ref=social).
Solution:
# Block all URLs with query strings
Disallow: /*?

# Block specific parameters
Disallow: /*?ref=*
Disallow: /*?utm_*
Disallow: /*?sessionid=*
Alternative: Use canonical tags for finer control over parameter URLs; note that Google retired the Search Console URL Parameters tool in 2022.
9. Blocking Specific Bots
Use case: Block malicious bots, scrapers, or specific search engines.
Examples:
# Block specific scraper bots
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

# Allow Google, block everyone else
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Reality: Malicious bots ignore robots.txt. Use server-level blocking for real security.
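For example, the same bots can be rejected at the application layer. A sketch as Express-style middleware (the wiring and bot list are illustrative; adapt them to your server):

```javascript
// Sketch: enforce bot blocking at the server level, since robots.txt is
// purely advisory. Bot names mirror the robots.txt examples above.
const BLOCKED_BOTS = /AhrefsBot|SemrushBot|MJ12bot/i;

function isBlockedBot(userAgent) {
  return BLOCKED_BOTS.test(userAgent || '');
}

// Express-style middleware (hypothetical wiring): register with app.use(blockBots).
function blockBots(req, res, next) {
  if (isBlockedBot(req.get('User-Agent'))) {
    return res.status(403).send('Forbidden'); // hard block, not a polite request
  }
  next();
}
```

Unlike a robots.txt Disallow, a 403 response is enforced whether or not the bot chooses to cooperate.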
10. Comments
Syntax: Lines starting with # are comments.
# This is a comment
# Block admin area from all crawlers
User-agent: *
Disallow: /admin/
# Note: Google's parser also strips inline comments placed after a directive
Best practice: Document why you're blocking specific paths.
11. Case Sensitivity
Important: Paths ARE case-sensitive. User-agents are NOT.
# These are DIFFERENT
Disallow: /Admin/
Disallow: /admin/

# These are the SAME
User-agent: Googlebot
User-agent: googlebot
Best practice: Match exact case of your URLs.
Common Use Cases (6 Rules)
12. E-commerce Site Configuration
Goal: Block admin, checkout, filters while allowing products.
User-agent: *

# Block admin and checkout
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/

# Block URL parameters (filters, sessions)
Disallow: /*?*sort=*
Disallow: /*?*filter=*
Disallow: /*sessionid=*

# Allow product images
Allow: /images/

# Sitemap
Sitemap: https://example.com/sitemap.xml
13. WordPress Site Configuration
Goal: Block WordPress admin and system files.
User-agent: *

# Block WordPress admin
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Block WordPress system directories
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/

# Allow uploads (images, media)
Allow: /wp-content/uploads/

# Block search results
Disallow: /?s=
Disallow: /search/

Sitemap: https://example.com/sitemap_index.xml
14. Development/Staging Environment
Critical: Block all crawlers from dev/staging sites.
# Block entire site (staging environment)
User-agent: *
Disallow: /
Better approach: Robots.txt alone doesn't prevent indexing, since blocked URLs can still be indexed from external links. Put HTTP authentication in front of the staging site, optionally combined with a noindex robots meta tag, for real protection.
15. Allow Everything (Default)
Configuration: Most sites should allow all crawling.
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Note: Empty Disallow allows everything. This is the safest starting point.
16. Blocking Duplicate Content
Use case: Print versions, mobile versions, filtered pages.
User-agent: *

# Block print versions
Disallow: /print/
Disallow: /*?print=*

# Block mobile site (if you have responsive design)
Disallow: /m/

# Block paginated pages (use canonical instead)
Disallow: /*?page=*
Better alternative: Use canonical tags instead of robots.txt for duplicate content.
17. Enterprise Multi-Domain Setup
Scenario: Multiple country/language sites.
# Correct: one robots.txt at the host root covers every path
# (example.com/robots.txt)
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap-us.xml
Sitemap: https://example.com/sitemap-uk.xml

# WRONG: example.com/uk/robots.txt
# Robots.txt MUST be in the root directory;
# a file in a subdirectory is ignored
Common Robots.txt Mistakes That Destroy Rankings
❌ Blocking Entire Site Accidentally
Error: Disallow: / blocks everything.
How it happens: A staging robots.txt gets copied to production. Crawling stops almost immediately, and pages start dropping from the index within days.
Impact: 68% average traffic loss. 3-7 days to recover after fix.
❌ Blocking CSS and JavaScript
Error: Disallow: /css/ and Disallow: /js/
Impact: Google can't render pages properly. Mobile-first indexing fails. Rankings drop.
Fix: NEVER block CSS/JS files. Google needs them to render pages.
❌ Blocking Images
Error: Disallow: /images/
Impact: Images never appear in Google Image Search. 30% of organic traffic comes from images for many sites.
Note: Only block images if you truly don't want them indexed (copyright concerns).
❌ Wrong Robots.txt Location
Error: Robots.txt in subdirectory (/blog/robots.txt) instead of root.
Reality: Crawlers only check /robots.txt. File in wrong location is ignored.
Fix: Must be https://example.com/robots.txt (root only).
❌ Conflicting Rules
Error: Assuming the * group's rules also apply to bots that have their own group. A crawler obeys only the single most specific group that matches it.
# Not a true conflict - precedence decides
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

# Which wins for Googlebot? Its own, more specific group.
Rule: The most specific matching user-agent group takes precedence, and a crawler follows only that group's rules.
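How a crawler picks its group can be sketched under a simplifying assumption: a group matches when the bot's name starts with the group's User-agent token, and `*` is the fallback. The `groups` map shape below is hypothetical:

```javascript
// Sketch: select the robots.txt group a crawler will obey.
// `groups` is a hypothetical map of user-agent token -> array of rules.
function selectGroup(botName, groups) {
  const name = botName.toLowerCase(); // user-agent matching is case-insensitive
  for (const token of Object.keys(groups)) {
    if (token !== '*' && name.startsWith(token.toLowerCase())) {
      return groups[token]; // a named group beats the wildcard group
    }
  }
  return groups['*'] ?? null; // fall back to the '*' group, if any
}
```

So with both a `Googlebot` group and a `*` group, Googlebot obeys only its own group; every other crawler falls back to `*`.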
❌ Using Robots.txt for Security
Error: Blocking /admin/ and thinking it's protected.
Reality: Robots.txt is PUBLIC. Everyone can read it. Malicious bots ignore it. You're advertising your admin URL.
Fix: Use HTTP authentication, IP whitelisting, or strong passwords for security.
Testing Your Robots.txt
Google Search Console Robots.txt Report
Location: Google Search Console → Settings → robots.txt
Features: View the robots.txt file Google has fetched, see fetch errors, and request a recrawl after changes. To check whether a specific URL is blocked, use the URL Inspection tool.
Best practice: Test BEFORE deploying robots.txt changes. One typo can deindex your site.
Syntax Validators
Tools: Google's robots.txt report (in GSC), Technical SEO tools (Screaming Frog, Sitebulb).
Check for: Syntax errors. Blocking critical pages. Blocking CSS/JS. Missing sitemap declaration.
Manual Testing
Process: Visit https://yoursite.com/robots.txt. Verify it loads correctly (no 404 error) and that the syntax matches what you deployed.
Quick test: Try accessing a URL you blocked, then use Google Search Console's URL Inspection tool to verify it's blocked.
Advanced Robots.txt Strategies
Dynamic Robots.txt Generation
Use case: Different rules for different environments (dev, staging, production).
Implementation: Generate robots.txt server-side based on environment variables.
// Node.js (Express) example
app.get('/robots.txt', (req, res) => {
  res.type('text/plain'); // robots.txt must be served as plain text
  if (process.env.ENV === 'production') {
    res.send(`User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml`);
  } else {
    // Block the entire staging site
    res.send(`User-agent: *
Disallow: /`);
  }
});

Monitoring Robots.txt Changes
Problem: Accidental changes to robots.txt can deindex site.
Solution: Set up monitoring to alert on changes.
Tools: Uptime monitoring (checks robots.txt hourly). Version control (git). Automated testing in CI/CD.
Alert triggers: Robots.txt returns 404. Content changes. Disallow: / detected.
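The alert triggers above can be approximated with a small audit function run in CI or on a schedule. A sketch; the checks and warning messages are illustrative:

```javascript
// Sketch: scan robots.txt content for catastrophic patterns before deploy.
// Returns a list of warning strings (empty means the file passed).
function auditRobotsTxt(content) {
  const warnings = [];
  // Strip comments and surrounding whitespace from each line.
  const lines = content.split('\n').map((line) => line.split('#')[0].trim());

  // A bare "Disallow: /" blocks the entire site.
  if (lines.some((line) => /^disallow:\s*\/\s*$/i.test(line))) {
    warnings.push('Disallow: / found - this blocks the entire site');
  }
  // Sitemap declarations must be absolute URLs.
  if (!lines.some((line) => /^sitemap:\s*https?:\/\//i.test(line))) {
    warnings.push('No absolute Sitemap declaration found');
  }
  return warnings;
}
```

Failing the build when this returns a non-empty list catches the "staging file copied to production" mistake before it reaches crawlers.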
Robots Meta Tag vs Robots.txt
Robots.txt: Blocks crawling. Google never fetches the page, but a blocked URL can still be indexed if other sites link to it.
Robots meta tag: Blocks indexing. Google crawls the page but doesn't index it.
Use robots.txt when: Want to save crawl budget (duplicate content, admin pages).
Use meta robots when: Want page crawled but not indexed (thin content, private info).
How SEOLOGY Manages Robots.txt
SEOLOGY automatically monitors and optimizes your robots.txt:
- Monitors robots.txt 24/7 for accidental changes or errors
- Instant alerts if entire site is accidentally blocked
- Validates syntax before deployment to prevent catastrophic errors
- Automatically adds sitemap declarations and keeps them updated
- Detects and fixes common mistakes (blocking CSS/JS, wrong location)
- Tests robots.txt changes before they go live
Never Deindex Your Site Again
SEOLOGY monitors your robots.txt 24/7 and alerts you instantly if critical errors are detected.
Protect Your Site Now
Tags: #RobotsTxt #TechnicalSEO #Crawling