One of the most important measurements for any application or infrastructure operator is “uptime”: how long an application or system has been up and running, or “available”. Availability is most often expressed as the percentage of time a service is up over a given period, typically a year, and described by the “class of nines” it achieves. For example, a service that is available 99.9% of the time is referred to as a “three nines” service. An uptime of 99.9% allows only about 10 minutes of downtime in a full week, or roughly 8.8 hours in a full year. Should scheduled downtime count against an application or system’s availability? It is debatable, though in a purist view of high availability, downtime is downtime, scheduled or not.
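To make the arithmetic behind the “nines” concrete, here is a minimal sketch that converts an availability percentage into a downtime budget for a chosen period. The function name `allowed_downtime` is my own, and the sketch assumes a 365-day year and treats all downtime, scheduled or not, the same way.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime budget implied by an availability target.

    availability_pct is a percentage such as 99.9 (three nines).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period * ((100.0 - availability_pct) / 100.0)


if __name__ == "__main__":
    week = timedelta(weeks=1)
    year = timedelta(days=365)  # assumption: non-leap year
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}%: {allowed_downtime(pct, week)} per week, "
              f"{allowed_downtime(pct, year)} per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why moving from three nines to five nines is much harder than the small change in the percentage suggests.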
Most organizations formalize their commitment to a specific availability objective through a Service Level Agreement (SLA). For example, AWS currently offers an SLA (https://aws.amazon.com/ec2/sla/) that covers virtual machines and block storage (EC2 and EBS) at 99.95% (often called “three and a half nines”), measured over a monthly billing cycle, and will offer some monetary discounts if it fails to meet this threshold of availability. But don’t get too excited about a big refund, since there are many caveats to what this really means for your application. Still, publishing an SLA represents a strong commitment to measuring and being accountable for your availability numbers, whether or not a financial commitment is tied to it.
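As a rough illustration of what a monthly threshold like 99.95% means in practice, the sketch below turns minutes of downtime in a billing month into an uptime percentage and compares it with a target. The function names and the 30-day month are assumptions of mine. A real SLA defines “unavailable” in its own specific terms, so this is not how any provider actually calculates credits.

```python
def monthly_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Return uptime for the month as a percentage of total minutes."""
    total_minutes = days_in_month * 24 * 60
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must fit within the month")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(downtime_minutes: float, sla_pct: float = 99.95,
              days_in_month: int = 30) -> bool:
    """Return True if the month's uptime is at or above the SLA target."""
    return monthly_uptime_pct(downtime_minutes, days_in_month) >= sla_pct


if __name__ == "__main__":
    for minutes in (0, 20, 25, 60):
        pct = monthly_uptime_pct(minutes)
        print(f"{minutes} min down -> {pct:.4f}% uptime, meets 99.95%: {meets_sla(minutes)}")
```

In a 30-day month, 99.95% leaves a budget of 21.6 minutes, so a single half-hour outage is enough to miss the target.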
Now that we’ve laid the groundwork for availability and SLAs, let’s move on to the topic at hand: the availability of your cloud-based applications, how to measure your underlying platform, and what you can do to improve your availability.
Traditionally, the upper limit on availability for an application was directly tied to the availability of the underlying infrastructure platform. A basic, bare-metal commodity server might offer two nines of availability, whereas a complex and expensive mainframe might offer four nines. In general, platform availability was a direct function of hardware cost, which the hardware vendors loved. And because most traditional applications are designed to assume that their reliability comes from the hardware platform, they are bounded by that hardware availability limit.
But, importantly, cloud-native applications break this dependency. Properly designed cloud-native applications take advantage of the programmatic and always-available capacity of a cloud platform to stay up and running in the face of hardware failures. Because redundant instances rarely fail at the same moment, the application stays available as long as at least one healthy instance remains and a failed one can be replaced quickly. So even though most cloud infrastructure platforms are constructed from two-nines hardware, it is entirely possible to build cloud-native applications on them that achieve four nines, five nines, or even higher, depending on need. If you want to understand how well this works, try building a simple cloud-native “canary” that uses the cloud APIs to self-heal within the local region and even across regions, and measure its uptime. You’ll see that this approach can deliver availability surpassing even a large mainframe platform.
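The jump from two-nines hardware to a four-nines application follows from basic probability. The sketch below is an idealized model, not a measurement: it assumes that replicas fail independently and that failover is instantaneous and always works, neither of which holds perfectly in practice. The function names are mine.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every component must be up.

    Each value is a fraction such as 0.99 (two nines).
    """
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of N interchangeable replicas where one is enough.

    Assumes independent failures and perfect failover.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at once.
    return 1.0 - (1.0 - availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99%: {redundant_availability(0.99, n):.6%}")
    print(f"two 99% components in series: {serial_availability([0.99, 0.99]):.4%}")
```

Under these assumptions, two 99% replicas give 99.99% and three give 99.9999%, while chaining two 99% components in series drops the total to 98.01%. Correlated failures, such as a whole zone or region going down, are the reason the canary in the text spreads across regions.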
What can you do to improve the availability of your application? First, identify every single point of failure and use the appropriate cloud services to provide redundancy where needed. Second, test your “crossover points” (where the redundant element comes into play) to ensure that they work properly in practice. Netflix famously did this when they invented Chaos Monkey to kill parts of their live application at random. Third, deploy the proper operational monitoring to detect a failure as soon as it occurs and to trigger automation that addresses it.
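The third step, detecting a failure and triggering automation, can be sketched as a small monitor that probes a service and calls a remediation hook after several consecutive failures. Everything here is hypothetical: `FailoverMonitor`, `probe`, and `remediate` are names I made up, and in a real deployment the remediation would call your cloud provider's API to replace or reroute around the failed instance.

```python
from typing import Callable


class FailoverMonitor:
    """Probe a service and trigger remediation after repeated failures."""

    def __init__(self, probe: Callable[[], bool],
                 remediate: Callable[[], None],
                 failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.probe = probe                # hypothetical health check
        self.remediate = remediate        # hypothetical self-healing action
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.remediations = 0

    def check_once(self) -> bool:
        """Run one probe and return True if the service looked healthy."""
        try:
            healthy = bool(self.probe())
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            healthy = False
        if healthy:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.remediate()
            self.remediations += 1
            self.consecutive_failures = 0  # give the replacement a fresh start
        return False
```

Requiring several consecutive failures before acting is a design choice that trades a slightly longer detection time for fewer false alarms. Feeding the monitor a probe that fails at random is also a cheap way to exercise the crossover points described above.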
Finally, it is important to remember that measuring and responding to your application’s availability relies on a solid and accurate operational monitoring platform. By solid, I mean that your monitoring platform also needs to be highly available, since any downtime for the monitoring platform will be reflected as downtime for your application. By accurate, I mean that your monitoring platform needs to view your application in exactly the same way as your users do; otherwise it may miss downtime that your users experience, or report downtime that they never see. So invest in buying, building, or using a solid and accurate monitoring platform for your applications.
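To see how monitoring gaps feed into the reported number, consider this small sketch that computes availability from a series of probe results. It assumes evenly spaced probes, with `True` for up, `False` for down, and `None` for an interval in which the monitor itself produced no result. The function name and the sample data are illustrative only.

```python
from typing import Optional, Sequence


def measured_availability(samples: Sequence[Optional[bool]],
                          count_missing_as_down: bool = True) -> float:
    """Return the percentage of probe intervals counted as up.

    When count_missing_as_down is True, intervals with no monitoring
    data are charged against the application, the conservative choice.
    """
    if not samples:
        raise ValueError("need at least one sample")
    up = sum(1 for s in samples if s is True)
    if count_missing_as_down:
        total = len(samples)
    else:
        total = sum(1 for s in samples if s is not None)
        if total == 0:
            raise ValueError("no intervals with monitoring data")
    return 100.0 * up / total


if __name__ == "__main__":
    # Illustrative data: 8 good probes, 1 failure, 1 monitoring gap.
    samples = [True] * 8 + [False] + [None]
    print(f"gaps counted as down: {measured_availability(samples):.1f}%")
    print(f"gaps ignored: {measured_availability(samples, False):.1f}%")
```

The same application scores differently depending on how gaps are treated, which is why the monitoring platform's own availability matters as much as the application's.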
At IBB, our Cloud and Software Transformation group has real-world experience in creating and running cloud-native applications and operational monitoring platforms. We understand the importance of accurate availability numbers and can help your enterprise provide proper SLAs for your cloud-native applications and cloud platforms.