Mark Gritter (markgritter) wrote,

Building robots.txt

Google searches for "Netflix X", where X is a movie title, usually find the Netflix page for that movie, even though Netflix's robots.txt prevents it from being crawled. (Not too surprising--- if there are enough inbound links, Google can index the page even if it never fetches the page itself.)
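For reference, the kind of rule at issue looks something like this (an illustrative excerpt, not Netflix's actual file):

User-agent: *
Disallow: /movies/

A compliant crawler won't fetch anything under /movies/, but Google can still list such a URL in its results based on the anchor text of inbound links.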

So out of curiosity I took a look at robots.txt, and not only is it commented, it also has a disabled section:

# Uncomment this when we start generating sitemaps again.
#Sitemap: http://movies.netflix.com/sitemap_Movies.xml.gz


The implication is that the deployment process for the website just does a "copy," not a "build." This is pretty common for websites--- you can find lots of comments and comment-disabled sections in HTML and JavaScript documents. http://www.tintri.com/robots.txt has a ton of boilerplate text that obviously came with the web server or framework.

There are exceptions: http://cnn.com/robots.txt doesn't have any comments, and http://facebook.com/robots.txt has comments that are directed at outside readers (as they should be)! But what surprises me most is that I couldn't quickly find any tools or best-practice guides for stripping out internally-directed comments, other than the JavaScript compaction and obfuscation tools whose main goal is reducing size.
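Such a tool would be nearly trivial to write, which makes the gap more surprising. Here is a minimal sketch in Python that strips '#' comments from a robots.txt at deploy time; the script and its place in a build step are my own assumption, not an existing tool:

import sys

def strip_comments(text):
    """Remove '#' comments from robots.txt while keeping group structure."""
    kept = []
    for line in text.splitlines():
        # A robots.txt comment runs from '#' to the end of the line.
        content = line.split("#", 1)[0].rstrip()
        if content:
            kept.append(content)
        elif not line.strip():
            # Preserve genuinely blank lines: they separate record groups.
            kept.append("")
        # Lines that were only a comment are dropped entirely.
    return "\n".join(kept) + "\n"

if __name__ == "__main__":
    # Usage: python strip_robots.py < robots.txt.src > robots.txt
    sys.stdout.write(strip_comments(sys.stdin.read()))

Run as a filter during deployment, that turns the plain "copy" into a one-line "build" without touching the source file.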
Tags: programming, web