Qoding: Using Pingdom to check robots.txt


In this post I'm going to explain our solution for automatically checking that a website is allowing itself to be indexed by search engines, in this case via the robots.txt (1) file. At a basic level, this file instructs web spiders (aka "bots") which parts of a site (if any) should be indexed (2).

Usually you specify certain areas that should be ignored by spiders. For example, the following snippet tells all user agents (i.e. spiders) not to index certain folders:

User-agent: *
Disallow: /App_Browsers/
Disallow: /bin/
Disallow: /Controls/
...


Sometimes it is useful to block spiders from indexing an entire site, for example on a test server. This prevents test server URLs from appearing in search results (something that can also be achieved via IP restrictions). To do this, the entire robots.txt file should contain only the following:

User-agent: *
Disallow: /


Now that we have explored a little of what the robots.txt file does, where does Pingdom come in?

Consider the deployment process: maybe you have changed the robots.txt file locally and committed it to source control by mistake (3). This then gets deployed to the production server, and your site is no longer crawled by search engines. In turn this will have a negative impact on your search engine presence, and unfortunately there will be a time lag before you notice the impact.

One solution is to check the file manually on every deploy. A much better solution is to automate this process (and better still, to integrate it into your Continuous Integration pipeline if you have one).

Quba uses Pingdom to monitor client sites for uptime; if a site goes down, a notification email is sent automatically. Unfortunately Pingdom doesn't provide a way out of the box to set up alerts for robots.txt files as well. It allows a string comparison to be added (for example, the page should contain "abcdef") but it doesn't allow a regex to be used. If you look back at the code examples above you will notice that they both contain:

Disallow: /

Therefore we can't just search for this string. To solve this we have developed a simple middleware MVC application that contains just one action:

public ActionResult CheckRobots(string key, string url)
{
    // Reject requests that don't supply the expected key
    if (key != KEY)
    {
        ViewBag.IsValid = "Invalid key";
        return View();
    }

    if (string.IsNullOrEmpty(url))
    {
        ViewBag.IsValid = "No URL supplied";
        return View();
    }

    // Check the robots.txt file of the supplied site and surface the result
    var robotsManager = new RobotsManager();
    ViewBag.IsValid = robotsManager.ValidateRobotsFile(url);
    return View();
}


The key in this method is there to prevent the endpoint from being spammed by bots (the irony!). The supplied URL is then passed to a manager class that checks the robots.txt file of that site.

public RobotsValidationEnum ValidateRobotsFile(string url)
{
    // Build the absolute URL of the robots.txt file for the supplied site
    var absoluteUrl = GenerateRobotsFileUrl(url);

    using (var client = new WebClient())
    {
        try
        {
            var downloadString = client.DownloadString(absoluteUrl);
            if (string.IsNullOrEmpty(downloadString))
            {
                return RobotsValidationEnum.Empty;
            }

            // Strip whitespace and quotes so the file collapses into one comparable line
            downloadString = Regex.Replace(downloadString, @"\s+", "").Replace("\"", "");

            // A file that collapses to exactly "User-agent:*Disallow:/" is blocking the whole site
            return downloadString.Equals("User-agent:*Disallow:/", StringComparison.InvariantCultureIgnoreCase)
                        ? RobotsValidationEnum.BlockingAll
                        : RobotsValidationEnum.Valid;
        }
        catch (Exception)
        {
            // Treat any download failure as a missing robots.txt
            return RobotsValidationEnum.NotFound;
        }
    }
}


This code uses the WebClient class to request the contents of the specified URL. The contents are then checked and an enum is returned. The value of this result is then assigned to the ViewBag.
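
The RobotsValidationEnum type and the GenerateRobotsFileUrl helper aren't shown above. They might look roughly like the following; this is a sketch based on how they are used, not the exact code from our solution:

// Sketch of the enum returned by ValidateRobotsFile; the member names match
// those used above, but the exact definition may differ.
public enum RobotsValidationEnum
{
    Valid,
    Empty,
    BlockingAll,
    NotFound
}

// Sketch of the helper that turns a host such as "www.quba.co.uk" into an
// absolute robots.txt URL. The scheme handling here is an assumption.
private static string GenerateRobotsFileUrl(string url)
{
    if (!url.StartsWith("http://", StringComparison.InvariantCultureIgnoreCase) &&
        !url.StartsWith("https://", StringComparison.InvariantCultureIgnoreCase))
    {
        url = "http://" + url;
    }

    return url.TrimEnd('/') + "/robots.txt";
}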

The view that is returned is probably the simplest view you will ever see, and only contains a single line of code:

@ViewBag.IsValid

All we are doing is writing the response to the screen. So now we have a way of validating the robots.txt file using a URL such as:

http://{robotcheckersite}/Robots/checkrobots/{key}/www.quba.co.uk
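
For a URL in that shape to reach the action, the key and url values need to be mapped as route segments. The post doesn't show our route registration, but a plausible sketch (the route name and defaults here are assumptions) would be:

// Hypothetical route registration so that /Robots/checkrobots/{key}/{url}
// binds to the CheckRobots action shown earlier.
routes.MapRoute(
    name: "CheckRobots",
    url: "Robots/checkrobots/{key}/{url}",
    defaults: new { controller = "Robots", action = "CheckRobots" });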

The final step is to set up an alert in Pingdom. The URL is set as follows:

[Screenshot: Pingdom-1.jpg, setting the check URL]

We then have to add in the check for the custom text:

[Screenshot: Pingdom-2.jpg, adding the custom "should contain" text check]

If Pingdom checks the middleware URL and the response contains anything other than "valid", an alert will be triggered according to your alert policy preferences.

(1) More information on robots.txt can be found at http://www.robotstxt.org/robotstxt.html
(2) Note that robots.txt is not a fixed standard; it is optional for spiders to follow it

(3) Quba makes use of dynamic robots.txt files, served via .ashx handlers, which also prevents erroneous commits (a rough sketch of this approach follows below)
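
To illustrate footnote (3), a dynamic robots.txt served from a generic handler might look something like this; the environment flag and rules shown are assumptions for the sketch, not our actual implementation:

// Hypothetical robots.ashx handler that serves a blocking robots.txt on
// non-production environments; the "Environment" appSetting is an assumption.
public class RobotsHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/plain";

        var isProduction = ConfigurationManager.AppSettings["Environment"] == "Production";

        context.Response.Write(isProduction
            ? "User-agent: *\nDisallow: /bin/\n"  // normal production rules
            : "User-agent: *\nDisallow: /\n");    // block everything on test servers
    }

    public bool IsReusable
    {
        get { return true; }
    }
}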

12 Jun 2015 · Ben Franklin · Listed in: Development Platforms