In search of 100% uptime


What is meant by 100% uptime, and is it an achievable goal?

For more and more companies, web infrastructure is becoming critical to how they do business; in many cases it is the entire business. Any unplanned downtime can therefore have far-reaching effects on income, reputation, customer retention and the efficiency of handling enquiries.

For this reason many clients demand 100% uptime, but how realistic is this?

What is 100% uptime?

I have found that this can be quite fuzzy, and requirements can vary greatly. Concepts I usually hear tend to be things like:

  • My site must be fully functional 24 hours a day, 365 days a year
  • If there is a catastrophic failure in X, Y or Z the site should still work
  • The site should work correctly even if we get 1000% more traffic than normal

The problem with these statements is that they are subjective. “Work correctly” can mean different things in different contexts.

For example, on a high traffic ecommerce site 10 minutes of downtime may result in a great deal of lost revenue. However, on a site serving up user generated content, 10 minutes of downtime may not have a financial impact, but it may cause reputational damage.

How about 99.9%?

So depending on your budget, 100% uptime might be slightly unrealistic. The next best thing? 99.9%.

If we look at some major hosting providers, their SLAs typically state 99.9% uptime.

In real terms, the 0.1% of downtime allowed by a 99.9% SLA equates to the following over the course of a calendar year (the short calculation below shows how these figures are derived):

  • 526 minutes per year
  • 8.76 hours per year
  • 43.8 minutes per month
  • 10.1 minutes per week
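
To make the arithmetic concrete, here is a minimal Python sketch (assuming a 365-day year) that converts an SLA percentage into the downtime it permits:

    # Convert an SLA uptime percentage into permitted downtime per year.
    # Assumes a 365-day year; a leap year adds a few extra minutes of allowance.
    HOURS_PER_YEAR = 365 * 24

    def allowed_downtime_hours(sla_percent: float) -> float:
        """Hours of downtime per year permitted by a given SLA."""
        return HOURS_PER_YEAR * (1 - sla_percent / 100)

    for sla in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime allows {hours:.2f} hours "
              f"({hours * 60:.0f} minutes) of downtime per year")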

To put this into context, there are times when downtime is unavoidable, for example during server infrastructure updates. If planned outside of core business hours (or when traffic is at its lowest), this amount of downtime can have very little impact on the effectiveness of the site.

Although this is a very low figure, it is up to you to decide whether it is acceptable for your business.

Achieving minimal downtime

Whether you are happy to accept 99.9% or you are striving for 100%, there are a number of techniques that can be used to keep downtime as low as possible.

Firstly, let’s look at how a basic hosting setup might be architected.

[Diagram: basic hosting setup. User → internet → firewall → web server → database server]

The user's device requests the website over the internet. This part of the transaction is simplified in the diagram above, but includes steps such as DNS lookups.

Within the hosting environment, traffic is first routed through a firewall and then on to the web server, which sends a response back to the client. Depending on the page requested, the web server also makes requests to the database server.

This setup has a number of possible failure points:

  • The internet; if this fails we are all doomed!
  • Web server
  • Database server
  • Firewall

Each of these devices could fail in a number of ways:

  • Hardware fault or corruption, rendering the device unusable or unstable
  • Bugs in software
  • Patching issues
  • Inability to cope with traffic, whether organic or from a DDoS attack
  • Network connectivity issues
  • Malicious software, such as a virus or other malware

Let’s take a look at how we can mitigate these failures.

Load balancing web servers

When designing a house, architects spread the load of the building over a number of load-bearing walls to prevent the structure from collapsing.

In server architecture, a load balancer can be installed between the source traffic and the destination servers. It can be either software-based or a physical unit installed in the rack.

The web server is then mirrored N times, resulting in multiple copies of the same server. These servers don’t have to be physically located in the same place, and could be based in different geographical locations (known as Geo Load Balancing). This is useful if the system receives traffic from users based throughout the world.

[Diagram: a load balancer distributing traffic across multiple mirrored web servers]

The load balancer uses an algorithm to send traffic to different web servers depending on the current traffic levels.

For example, under normal traffic the load balancer might send all users to server A. When traffic spikes to, say, 50% above the normal level, the load balancer spreads the load between servers A, B and C.

There are a number of different algorithms that can be used, as detailed on ServerFault.
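
To illustrate the idea, here is a minimal round-robin sketch in Python. The server addresses are hypothetical, and a real load balancer performs active health checks rather than relying on a simple flag:

    import itertools

    # Hypothetical pool of mirrored web servers.
    SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    healthy = {server: True for server in SERVERS}
    _rotation = itertools.cycle(SERVERS)

    def next_server() -> str:
        """Return the next healthy server, skipping any marked as down."""
        for _ in range(len(SERVERS)):
            server = next(_rotation)
            if healthy[server]:
                return server
        raise RuntimeError("no healthy servers available")

    # Simulate one node failing: traffic is spread over the remaining two.
    healthy["10.0.0.2"] = False
    print([next_server() for _ in range(4)])
    # ['10.0.0.1', '10.0.0.3', '10.0.0.1', '10.0.0.3']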

How does this relate to achieving a high uptime? 

  • Prevents high traffic from causing a site to fail
  • If a web server node is offline/dead, the load balancer will compensate by using the other web servers
  • In extreme cases, multiple load balancers can be used for redundancy

Three of the CMS platforms we support (Kentico, Sitecore and Sitefinity) include functionality for load balancing. However, they go about it in different ways, and some are more efficient than others, so it is worth investigating before choosing a platform.

Database mirroring

Even with the web server load spread across multiple nodes, there is still a failure point at the database level: all of the web servers point to a single database, so if it fails, the whole architecture fails.

The solution? More databases.

[Diagram: the web servers point to a master database, which syncs to a slave (mirror) database]

Different vendors have varying ways of mirroring databases, but the basic idea is that data is duplicated into a separate inactive database.

The application points at a master database, and this automatically syncs data into the slave database. Should the master database fail, the system recognises this and automatically switches the application to use the slave database.

A detailed description of SQL data mirroring can be found on Microsoft MSDN.
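
At the application level, the failover can be as simple as trying the master first and falling back to the slave. The sketch below illustrates the idea; the hostnames and the fake connect() are hypothetical stand-ins for a real driver call (with SQL Server mirroring, for example, the ADO.NET "Failover Partner" connection string keyword does this automatically):

    PRINCIPAL = "db-master.internal"
    MIRROR = "db-slave.internal"

    DOWN = {PRINCIPAL}  # simulate the master having failed

    def connect(host: str) -> str:
        """Stand-in for a real driver call such as pyodbc.connect(...)."""
        if host in DOWN:
            raise ConnectionError(f"{host} is unreachable")
        return f"connection to {host}"

    def get_connection() -> str:
        """Try the master first, then fail over to the slave mirror."""
        last_error = None
        for host in (PRINCIPAL, MIRROR):
            try:
                return connect(host)
            except ConnectionError as exc:
                last_error = exc
        raise RuntimeError("no database available") from last_error

    print(get_connection())  # -> connection to db-slave.internal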

Firewall redundancy

Firewalls are typically physical units, and as such can also fail. There are two main ways to cater for this:

  • Implement two firewalls in a similar way to load balancing
  • Defer firewall capabilities to the load balancer

Modern load balancers can also act as firewalls, so this is the simplest option to go for.

Patching and releases

Some instances of downtime are caused by proactive actions, rather than events outside of our control.

For example, the server may need patching to a newer version, or the application may require a code release.

The way this is handled really depends on whether we are striving for 100% or 99.9% uptime. 

The simplest approach is to show a maintenance page whilst code changes are being made. For patching, the load balancer can be used to point traffic to a temporary page. Technically the application is still up, but it isn’t available for its primary purpose.

If we are aiming for ultimate uptime, then the approaches already discussed can be used to reduce the impact. Patches and updates are released in a phased approach, so they only affect one server at a time. Whilst this is happening, the server in question is removed from the load balancing logic so it receives no traffic. Once all servers have been updated, the load balancing logic is reset to normal.

Note: for this to be possible the architecture must include at least 3 servers (both in terms of web and database). This means that whilst one is out of action being upgraded, the remaining infrastructure still has redundancy.
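
A minimal sketch of that phased approach is below; the lb_remove, lb_add and patch helpers are hypothetical stand-ins for your real load balancer API and update tooling:

    import time

    SERVERS = ["web-a", "web-b", "web-c"]

    def lb_remove(server: str) -> None:
        print(f"draining {server} from the load balancer")

    def lb_add(server: str) -> None:
        print(f"restoring {server} to the load balancer")

    def patch(server: str) -> None:
        print(f"patching {server}")
        time.sleep(0.1)  # stand-in for the real update work

    def rolling_update(servers: list) -> None:
        """Patch one server at a time; the rest keep serving traffic."""
        for server in servers:
            lb_remove(server)
            try:
                patch(server)
            finally:
                lb_add(server)  # always return the node to rotation

    rolling_update(SERVERS)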

Final thoughts

The techniques described here can come at significant cost. Increasing the number of databases and servers, and adding a load balancer, can significantly increase the monthly outlay.

It really comes down to whether the extra 0.1% is actually required, and the impact it will have on your business if you don’t achieve it.

Ben Franklin, 02 Jun 2016