Robots.txt: Robots.txt is a plain-text file placed in a site's root directory that tells search engine crawlers which pages they may crawl and which they should skip. It is used to keep crawlers out of selected parts of a website.

Web developers use robots.txt files to:

  • Discourage crawlers from going into private folders.
  • Keep crawlers from visiting less attractive or noteworthy content, which in turn gives crawlers more time to look at the important parts of a website that web developers would actually like to be ranked.
  • Save bandwidth by only allowing specific bots to crawl the site.
  • Prevent the 404 errors that are logged when crawlers request a robots.txt file that doesn't exist.
  • Tell bots where a sitemap is.
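
As a rough illustration of the uses listed above, the sketch below feeds a hypothetical robots.txt file to Python's standard urllib.robotparser module and asks how a rule-following crawler would read it. The paths, the bot names ExampleBot and HungryBot, and the example.com sitemap URL are all made up for illustration, and site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; every path, bot name and URL is invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search/

User-agent: HungryBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Any bot is asked to stay out of the private folder.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report.html"))  # False

# Pages that match no Disallow rule remain crawlable.
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post.html"))  # True

# One specific bot is asked to stay away from the whole site.
print(parser.can_fetch("HungryBot", "https://example.com/blog/post.html"))  # False

# The Sitemap line tells bots where the sitemap is.
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

The first group applies to every crawler, the second singles out one bot by name, and the Sitemap line sits outside both groups.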

All that being said, robots.txt files don't:

  • Stop content from being indexed. Crawlers may find a URL through external links and still index it and show it in search results.
  • Protect private content.
  • Guard against duplicate content indexing.
  • Block all robots. Not all crawlers follow robots.txt instructions, and those that don't may need to be kept out with a firewall.
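
The limits above follow from the fact that robots.txt is only a request: the check happens inside the crawler, not on the server. The sketch below, again using Python's standard urllib.robotparser module, shows what that check might look like in a rule-following crawler. The helper allowed_urls, the bot name PoliteBot, and the URLs are hypothetical names chosen for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_text, user_agent, urls):
    """Return only the URLs a rule-following crawler would request."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return [url for url in urls if rules.can_fetch(user_agent, url)]


# URLs a crawler might have discovered, for example through external links.
discovered = [
    "https://example.com/",
    "https://example.com/private/notes.html",
]

# A polite crawler filters its list before requesting anything.
polite = allowed_urls(ROBOTS_TXT, "PoliteBot", discovered)
print(polite)  # ['https://example.com/']

# A crawler that ignores robots.txt simply skips the check; the file
# itself does nothing to stop the request.
impolite = list(discovered)
print(impolite)
```

Nothing in this flow involves the server refusing a request, which is why private content needs real access controls rather than a Disallow rule.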

Basically, robots.txt files ask search engines' spiders not to crawl certain pages, so that crawlers use their time in the most efficient way possible. However, they're not perfect, and thus shouldn't be relied upon exclusively. To protect against malicious crawlers, sites may need firewalls, while private content may need to be kept offline.