
Robots.txt Configuration: Control What Google Crawls

Sarah Kim · October 12, 2024

One robots.txt mistake can deindex your entire site. This guide shows the exact configuration used by Fortune 500 sites.

TL;DR

Robots.txt controls which pages search engines crawl. Mistakes in this file can deindex your entire site (it has happened to 1 in 5 sites, per a Moz study). This guide covers 17 critical robots.txt rules: User-agent directives, Disallow/Allow syntax, Crawl-delay, Sitemap declarations, wildcard patterns, and testing. 42% of sites have robots.txt errors blocking important pages. SEOLOGY automatically manages and monitors your robots.txt to prevent catastrophic errors.

What is Robots.txt and Why It Matters

Robots.txt is a text file in your site's root directory that tells search engine crawlers which pages they can and cannot access.

  • 20% of websites have accidentally deindexed their entire site with a robots.txt error (Moz study)
  • 42% of sites have robots.txt errors blocking important pages from crawlers
  • 68% average traffic loss when robots.txt accidentally blocks an entire site
  • 3-7 days average time to recover from a robots.txt deindexing error

Location: https://yoursite.com/robots.txt (must be in root directory)
Format: Plain text file following Robots Exclusion Protocol
Purpose: Control crawl budget, protect sensitive pages, prevent duplicate content indexing

17 Critical Robots.txt Rules

Basic Syntax (5 Rules)

1. User-agent Directive

What it does: Specifies which crawler the rules apply to.

Syntax:

User-agent: Googlebot
Disallow: /admin/
User-agent: *
Disallow: /private/

Common user-agents:
Googlebot - Google's web crawler
Bingbot - Microsoft Bing crawler
Googlebot-Image - Google Images
* - All crawlers (wildcard)

Case sensitivity: User-agent names are case-insensitive.

2. Disallow Directive

What it does: Blocks crawlers from accessing specified paths.

Examples:

# Block entire directory
Disallow: /admin/
# Block specific file
Disallow: /secret-page.html
# Block all pages (DANGEROUS!)
Disallow: /
# Block nothing (allow all)
Disallow:

Warning: Disallow: / blocks your entire site. Most common catastrophic error.

3. Allow Directive

What it does: Explicitly allows crawling specific paths within blocked directories.

Example:

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/
# Result: /admin/ blocked except /admin/public/

Note: The most specific (longest) matching rule wins. When an Allow and a Disallow rule match with equal specificity, Google applies the Allow.
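This longest-match precedence can be sketched in a few lines. The following is an illustration only, not Google's production parser; the `isAllowed` function and the rule-object shape are invented for this example:

```javascript
// Among all rules whose path is a prefix of the URL, the longest wins;
// on a length tie between Allow and Disallow, Allow wins.
function isAllowed(url, rules) {
  // rules: array of { type: 'allow' | 'disallow', path: string }
  let winner = null;
  for (const rule of rules) {
    if (url.startsWith(rule.path)) {
      if (
        winner === null ||
        rule.path.length > winner.path.length ||
        (rule.path.length === winner.path.length && rule.type === 'allow')
      ) {
        winner = rule;
      }
    }
  }
  // No matching rule at all means the URL is crawlable
  return winner === null || winner.type === 'allow';
}

const rules = [
  { type: 'disallow', path: '/admin/' },
  { type: 'allow', path: '/admin/public/' },
];
console.log(isAllowed('/admin/settings', rules));    // false
console.log(isAllowed('/admin/public/docs', rules)); // true
console.log(isAllowed('/blog/post', rules));         // true
```

Running this against the example above confirms the behavior the comment describes: everything under /admin/ is blocked except /admin/public/.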

4. Sitemap Declaration

What it does: Tells crawlers where to find your XML sitemap.

Syntax:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-blog.xml

Requirements: Full absolute URL required (not relative paths).

Benefit: Helps search engines discover your sitemap without manual submission.

5. Crawl-delay Directive

What it does: Adds delay (in seconds) between crawler requests.

Syntax:

User-agent: *
Crawl-delay: 10
# Crawler waits 10 seconds between requests

Important: Google ignores Crawl-delay and manages crawl rate automatically (the manual crawl-rate setting in Search Console has been retired).

Use case: Slow down aggressive crawlers that overload your server (Bing respects this).

Advanced Patterns (6 Rules)

6. Wildcard Asterisk (*)

What it does: Matches any sequence of characters.

Examples:

# Block all URLs with query parameters
Disallow: /*?
# Block all PDF files
Disallow: /*.pdf$
# Block all URLs containing "sort="
Disallow: /*sort=*

Support: Google and most modern crawlers support wildcards.

7. Dollar Sign End Anchor ($)

What it does: Matches end of URL.

Examples:

# Block URLs ending in .pdf
Disallow: /*.pdf$
# Without the $ anchor, this blocks ANY URL containing ".pdf"
Disallow: /*.pdf
# Block URLs ending with /private
Disallow: /*/private$

Use case: Block specific file types or URL patterns.
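The * and $ semantics above can be approximated by converting a robots.txt pattern into a regular expression. This is a minimal sketch assuming Google's documented behavior (* matches any run of characters, a trailing $ anchors the end of the URL); the `patternToRegex` name is invented for this example:

```javascript
// Convert a robots.txt path pattern into a RegExp.
// '*' becomes '.*', a trailing '$' becomes an end anchor,
// everything else is treated literally.
function patternToRegex(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

console.log(patternToRegex('/*.pdf$').test('/files/report.pdf'));     // true
console.log(patternToRegex('/*.pdf$').test('/files/report.pdf?v=2')); // false
console.log(patternToRegex('/*.pdf').test('/files/report.pdf?v=2'));  // true
```

Note how the unanchored /*.pdf also matches URLs where ".pdf" appears mid-URL, which is exactly the pitfall the previous section warns about.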

8. Blocking URL Parameters

Problem: Query parameters create duplicate content (example.com/page vs example.com/page?ref=social).

Solution:

# Block all URLs with query strings
Disallow: /*?
# Block specific parameters
Disallow: /*?ref=*
Disallow: /*?utm_*
Disallow: /*?sessionid=*

Alternative: Use canonical tags for finer-grained control (Google has retired the Search Console URL Parameters tool, so canonicals are now the primary option).

9. Blocking Specific Bots

Use case: Block malicious bots, scrapers, or specific search engines.

Examples:

# Block specific scraper bots
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
# Allow Google, block everyone else
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /

Reality: Malicious bots ignore robots.txt. Use server-level blocking for real security.

10. Comments

Syntax: Lines starting with # are comments.

# This is a comment
# Block admin area from all crawlers
User-agent: *
Disallow: /admin/ # Inline comments after a directive are also allowed

Best practice: Document why you're blocking specific paths.

11. Case Sensitivity

Important: Paths ARE case-sensitive. User-agents are NOT.

# These are DIFFERENT
Disallow: /Admin/
Disallow: /admin/
# These are the SAME
User-agent: Googlebot
User-agent: googlebot

Best practice: Match exact case of your URLs.

Common Use Cases (6 Rules)

12. E-commerce Site Configuration

Goal: Block admin, checkout, filters while allowing products.

User-agent: *
# Block admin and checkout
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
# Block URL parameters (filters, sessions)
Disallow: /*?*sort=*
Disallow: /*?*filter=*
Disallow: /*sessionid=*
# Allow product images
Allow: /images/
# Sitemap
Sitemap: https://example.com/sitemap.xml

13. WordPress Site Configuration

Goal: Block WordPress admin and system files.

User-agent: *
# Block WordPress admin
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Block WordPress system directories
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
# Allow uploads (images, media)
Allow: /wp-content/uploads/
# Block search results
Disallow: /?s=
Disallow: /search/
Sitemap: https://example.com/sitemap_index.xml

14. Development/Staging Environment

Critical: Block all crawlers from dev/staging sites.

# Block entire site (staging environment)
User-agent: *
Disallow: /

Better approach: Put staging behind HTTP authentication so crawlers (and everyone else) can't reach it at all. Robots.txt alone is weak protection: a blocked URL can still be indexed if other sites link to it, and Google never sees a noindex tag on a page it isn't allowed to crawl.

15. Allow Everything (Default)

Configuration: Most sites should allow all crawling.

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

Note: Empty Disallow allows everything. This is the safest starting point.

16. Blocking Duplicate Content

Use case: Print versions, mobile versions, filtered pages.

User-agent: *
# Block print versions
Disallow: /print/
Disallow: /*?print=*
# Block mobile site (if you have responsive design)
Disallow: /m/
# Block paginated pages (use canonical instead)
Disallow: /*?page=*

Better alternative: Use canonical tags instead of robots.txt for duplicate content.

17. Enterprise Multi-Domain Setup

Scenario: Multiple country/language sites.

# example.com/robots.txt (root - the only location crawlers check)
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap-us.xml
Sitemap: https://example.com/sitemap-uk.xml

# WRONG: example.com/uk/robots.txt
# Robots.txt MUST be in the root directory - a copy in a
# subdirectory is simply ignored, so use one root file per domain

Common Robots.txt Mistakes That Destroy Rankings

❌ Blocking Entire Site Accidentally

Error: Disallow: / blocks everything.

How it happens: A staging robots.txt gets copied to production. Crawling stops immediately, and pages start dropping from the index within days.

Impact: 68% average traffic loss. 3-7 days to recover after fix.

❌ Blocking CSS and JavaScript

Error: Disallow: /css/ and Disallow: /js/

Impact: Google can't render pages properly. Mobile-first indexing fails. Rankings drop.

Fix: NEVER block CSS/JS files. Google needs them to render pages.

❌ Blocking Images

Error: Disallow: /images/

Impact: Images never appear in Google Image Search. 30% of organic traffic comes from images for many sites.

Note: Only block images if you truly don't want them indexed (copyright concerns).

❌ Wrong Robots.txt Location

Error: Robots.txt in subdirectory (/blog/robots.txt) instead of root.

Reality: Crawlers only check /robots.txt. File in wrong location is ignored.

Fix: Must be https://example.com/robots.txt (root only).

❌ Conflicting Rules

Error: Multiple conflicting User-agent blocks.

# WRONG - Conflicting rules
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
# Which wins? Googlebot's more specific rule wins.

Rule: Most specific user-agent takes precedence.
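Group selection can be sketched the same way: a crawler obeys only the group whose User-agent token best matches its name, falling back to *. This is an illustration of the precedence rule above, not any real crawler's implementation; `pickGroup` and the group-object shape are invented for this example:

```javascript
// Pick the rule group for a crawler: the longest User-agent token
// contained in the crawler's name wins; '*' is the fallback.
function pickGroup(crawlerName, groups) {
  // groups: array of { agent: string, rules: string[] }
  const name = crawlerName.toLowerCase(); // user-agent matching is case-insensitive
  let best = null;
  for (const group of groups) {
    const agent = group.agent.toLowerCase();
    if (agent !== '*' && name.includes(agent)) {
      if (best === null || agent.length > best.agent.length) best = group;
    }
  }
  if (best) return best;
  return groups.find((g) => g.agent === '*') || null;
}

const groups = [
  { agent: '*', rules: ['Disallow: /'] },
  { agent: 'Googlebot', rules: ['Disallow:'] },
];
console.log(pickGroup('Googlebot', groups).agent); // 'Googlebot'
console.log(pickGroup('Bingbot', groups).agent);   // '*'
```

This mirrors the "allow Google, block everyone else" example: Googlebot matches its named group and ignores the * block entirely.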

❌ Using Robots.txt for Security

Error: Blocking /admin/ and thinking it's protected.

Reality: Robots.txt is PUBLIC. Everyone can read it. Malicious bots ignore it. You're advertising your admin URL.

Fix: Use HTTP authentication, IP whitelisting, or strong passwords for security.

Testing Your Robots.txt

Google Search Console Robots.txt Report

Location: Google Search Console → Settings → Crawling → robots.txt

Features: View the robots.txt files Google has fetched, their fetch status, and any parsing issues. (The standalone robots.txt Tester has been retired; use the URL Inspection tool to check whether a specific URL is blocked.)

Best practice: Test BEFORE deploying robots.txt changes. One typo can deindex your site.

Syntax Validators

Tools: The Search Console robots.txt report, plus technical SEO crawlers (Screaming Frog, Sitebulb).

Check for: Syntax errors. Blocking critical pages. Blocking CSS/JS. Missing sitemap declaration.
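A basic version of these checks is easy to automate. The sketch below is a minimal lint pass, not a full parser; the `lintRobotsTxt` name and the directive whitelist are assumptions for this example:

```javascript
// Minimal robots.txt lint: flag unknown directives (typos) and the
// catastrophic "Disallow: /" under "User-agent: *".
const KNOWN = ['user-agent', 'disallow', 'allow', 'sitemap', 'crawl-delay'];

function lintRobotsTxt(content) {
  const problems = [];
  let currentAgent = null;
  content.split('\n').forEach((raw, i) => {
    const line = raw.replace(/#.*/, '').trim(); // strip comments
    if (!line) return;
    const colon = line.indexOf(':');
    const directive = colon === -1 ? line : line.slice(0, colon).trim().toLowerCase();
    const value = colon === -1 ? '' : line.slice(colon + 1).trim();
    if (!KNOWN.includes(directive)) {
      problems.push(`line ${i + 1}: unknown directive "${directive}"`);
      return;
    }
    if (directive === 'user-agent') currentAgent = value;
    if (directive === 'disallow' && value === '/' && currentAgent === '*') {
      problems.push(`line ${i + 1}: "Disallow: /" blocks the entire site for all crawlers`);
    }
  });
  return problems;
}

console.log(lintRobotsTxt('User-agent: *\nDisallow: /'));      // flags full-site block
console.log(lintRobotsTxt('User-agent: *\nDisalow: /admin/')); // flags the "Disalow" typo
```

Wire a check like this into CI so a bad robots.txt fails the build before it ever reaches production.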

Manual Testing

Process: Visit https://yoursite.com/robots.txt. Verify it loads (no 404) and that the syntax is correct.

Quick test: Pick a URL you blocked and run it through Google Search Console's URL Inspection tool to confirm it reports the page as blocked by robots.txt.

Advanced Robots.txt Strategies

Dynamic Robots.txt Generation

Use case: Different rules for different environments (dev, staging, production).

Implementation: Generate robots.txt server-side based on environment variables.

// Node.js (Express) example
const express = require('express');
const app = express();

app.get('/robots.txt', (req, res) => {
  res.type('text/plain');
  if (process.env.ENV === 'production') {
    res.send(`User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml`);
  } else {
    // Block the entire staging/dev site
    res.send(`User-agent: *
Disallow: /`);
  }
});

app.listen(3000);

Monitoring Robots.txt Changes

Problem: Accidental changes to robots.txt can deindex site.

Solution: Set up monitoring to alert on changes.

Tools: Uptime monitoring (checks robots.txt hourly). Version control (git). Automated testing in CI/CD.

Alert triggers: Robots.txt returns 404. Content changes. Disallow: / detected.
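The alert triggers above reduce to a small pure function that any scheduler can call. This is a sketch only; the `checkRobots` name and its arguments are invented for this example, and fetching/alert delivery are left to your monitoring tool:

```javascript
// Evaluate one monitoring sample: the HTTP status of /robots.txt,
// the previously seen content (or null on first run), and the current content.
function checkRobots(status, previous, current) {
  const alerts = [];
  if (status === 404) alerts.push('robots.txt returned 404');
  if (previous !== null && previous !== current) alerts.push('robots.txt content changed');
  // A line that is exactly "Disallow: /" blocks the whole site
  if (/^\s*Disallow:\s*\/\s*$/im.test(current)) alerts.push('full-site "Disallow: /" detected');
  return alerts;
}

const good = 'User-agent: *\nDisallow:';
const bad = 'User-agent: *\nDisallow: /';
console.log(checkRobots(200, good, good)); // [] - all clear
console.log(checkRobots(200, good, bad));  // content changed + full-site block
```

Run it hourly against the live file and page someone whenever the returned array is non-empty.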

Robots Meta Tag vs Robots.txt

Robots.txt: Blocks crawling. Google never fetches the page, but the URL can still appear in the index (without a snippet) if other pages link to it.

Robots meta tag: Blocks indexing. Google crawls the page but doesn't index it. The tag only works if the page is NOT blocked in robots.txt, because Google must crawl the page to see the tag.

Use robots.txt when: Want to save crawl budget (duplicate content, admin pages).

Use meta robots when: Want page crawled but not indexed (thin content, private info).

How SEOLOGY Manages Robots.txt

SEOLOGY automatically monitors and optimizes your robots.txt:

  • Monitors robots.txt 24/7 for accidental changes or errors
  • Instant alerts if entire site is accidentally blocked
  • Validates syntax before deployment to prevent catastrophic errors
  • Automatically adds sitemap declarations and keeps them updated
  • Detects and fixes common mistakes (blocking CSS/JS, wrong location)
  • Tests robots.txt changes before they go live

Never Deindex Your Site Again

SEOLOGY monitors your robots.txt 24/7 and alerts you instantly if critical errors are detected.

Protect Your Site Now


Tags: #RobotsTxt #TechnicalSEO #Crawling