The difference between Reliability and Availability

Subtle differences but closely related

Derek Hutson
4 min read · Aug 24, 2022

Another couple of common words that get confused and interchanged are reliability and availability. Although they both need to be taken into consideration when designing a well-architected framework, there are some key differences you need to be aware of.

First, let's look at what they actually mean, and then we can speak more practically about how to integrate them into your frameworks.

Availability refers to the percentage of time your resources are available to be used, regardless of how well they perform. If, for example, users get an error 30% of the time they access your website, then your architecture is not highly available. But if they get a response and a functioning site 99.99% of the time, regardless of site performance, your architecture is highly available.
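As a rough back-of-the-envelope illustration (the request counts and the 30-day month below are made-up numbers, not from any real system), availability is just the successful portion of requests or time divided by the total, expressed as a percentage:

```python
# Availability as a percentage of successful requests (hypothetical numbers).
total_requests = 1_000_000
failed_requests = 100

availability = (total_requests - failed_requests) / total_requests * 100
print(f"Availability: {availability:.2f}%")  # 99.99%

# "Four nines" (99.99%) over a 30-day month leaves roughly this much downtime:
allowed_downtime_minutes = 30 * 24 * 60 * (1 - 0.9999)
print(f"Allowed downtime: {allowed_downtime_minutes:.1f} minutes")  # ~4.3 minutes
```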

Reliability refers to the capability of your architecture to meet defined performance standards while conducting its intended function. As an example, if you have a load balancer handling a very large number of requests, you want it to maintain sub-millisecond response times with ultra-low latency. However, if you notice that during high-usage periods your load balancer tends to have higher latency and takes longer to handle the extra traffic, then it is not very reliable.
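Here is a minimal sketch of what "meeting a defined performance standard" can look like in code; the response times and the 1 ms target are hypothetical placeholders, and a real system would pull these numbers from its monitoring stack:

```python
import statistics

# Hypothetical load balancer response times (in milliseconds) and target.
response_times_ms = [0.4, 0.6, 0.5, 0.9, 1.2, 0.7, 0.8, 3.5, 0.6, 0.5]
target_p99_ms = 1.0  # the performance standard we defined for ourselves

# 99th-percentile latency: the value 99% of requests stay under.
p99 = statistics.quantiles(response_times_ms, n=100)[98]
verdict = "meets" if p99 <= target_p99_ms else "misses"
print(f"p99 latency: {p99:.2f} ms -> {verdict} the target")
```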

In the context of a well-architected framework, both of these concepts are very important. If you are running a business and your customers have poor experiences with latency or errors on your website, then there is a chance they will not return and will just go elsewhere, which will cost you money.

So how can you ensure that your services are highly available and reliable?

With regard to availability, you want to ensure that you have predefined disaster recovery strategies in place. With managed cloud services, your resources are unlikely to fail on their own; it will take some sort of event to knock out what you have provisioned, such as a natural disaster at an Availability Zone's data center or a security breach. The most common and straightforward way to guard against this is taking backups of your data and having secondary resources in a different Availability Zone ready to deploy in a worst-case scenario. You also want to consider what is tolerable for your RTO (Recovery Time Objective) and RPO (Recovery Point Objective). More information on those two terms can be found here.
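To make RPO a bit more concrete, here is a tiny sketch (the one-hour objective and the backup timestamp are made-up values): whatever was written since your last backup is what you stand to lose in a disaster.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical recovery point objective: we can tolerate losing <= 1 hour of data.
rpo = timedelta(hours=1)

# Pretend the last successful backup finished 45 minutes ago.
last_backup_at = datetime.now(timezone.utc) - timedelta(minutes=45)

# Everything written since that backup would be lost in a disaster right now.
potential_data_loss = datetime.now(timezone.utc) - last_backup_at
print(f"Potential data loss: {potential_data_loss}")
print(f"Within RPO: {potential_data_loss <= rpo}")
```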

Depending on your budget, there are four different strategies you can use to back up your resources:

  1. Backup and restore: The cheapest, but slowest, option. It involves provisioning ALL of your resources after an event occurs, using your existing backups, launch templates, etc.
  2. Pilot light: One of the most common DR strategies. It involves having a very small set of services already provisioned, but idle. When an event occurs you activate them, scale them, and provision other services as needed to meet a production workload.
  3. Warm standby: Similar to pilot light, but more costly and with better RPO/RTO. Your business-critical systems are always running, just at a smaller scale, with all the resources you need already provisioned. When an event occurs you switch your production workload here and scale as needed.
  4. Multi-site active/active: The most expensive, but with little to no downtime or data loss. This involves a replica of your production workloads and business-critical systems constantly running at another location, which you can switch over to be your primary infrastructure in a DR scenario.
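As one possible starting point for the "backup" half of backup and restore, the sketch below uses boto3 to create an AMI of a running EC2 instance; the instance ID is a placeholder, and in practice you would automate this on a schedule (for example with AWS Backup or a scheduled Lambda) rather than run it by hand:

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2")

# Create an AMI of the instance so it can be re-launched in another
# Availability Zone (or another Region, after copying the image) during recovery.
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Name=f"nightly-backup-{datetime.now(timezone.utc):%Y-%m-%d}",
    NoReboot=True,  # snapshot without stopping the instance
)
print("Created image:", response["ImageId"])
```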

For reliability, you need to make sure that you are using the appropriate services, and the appropriate amount of provisioned resources within those services. One of the best practices in this domain is to use auto scaling to automatically scale your workloads up to meet demand, and back down when demand falls, to save money (a short sketch of this follows the list below). There are two main ways to scale your infrastructure:

  1. Vertical scaling: Adding more capacity to existing resources, for example taking an EC2 instance you have and upgrading it to an instance with more memory, higher throughput, etc.
  2. Horizontal scaling: Adding separate resources to complement your existing resources and distribute workloads. For example, if you have 2 EC2 instances running but they are not meeting your performance requirements, you could add 2 more to reduce latency across all of them, leaving you with a total of 4 instances.
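To make the auto scaling idea concrete, here is a hedged sketch of attaching a target-tracking policy to an existing Auto Scaling group with boto3; the group name and the 50% CPU target are hypothetical, and the group scales horizontally (adds or removes instances) to stay near that target:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU utilization across the group near 50%; the Auto Scaling
# group launches or terminates instances to track that target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",  # hypothetical group name
    PolicyName="keep-cpu-near-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```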

There are of course pros and cons to each scaling strategy, and which one you use depends on the scenario and, of course, your budget. Keep in mind, however, that it is generally recommended to decouple your architecture, meaning you have separate services for separate functions. If you have one massive EC2 instance that handles 10 different tasks at once, it would be hard to diagnose and fix issues with a couple of those tasks without potentially affecting the others. On the other hand, if you have 10 EC2 instances and each one does one task, then when you run into issues it is easier to look at the specific instance where problems are arising and address just that one instance without directly impacting the others.

Hopefully that was helpful and you can at least take away the difference between reliability and availability. One of the reasons I love AWS is that they have clear and concise documentation. Some of this documentation covers the Reliability pillar of the Well-Architected Framework, which can be found here. If you have some time to read it, I would highly recommend it.

As always best of luck in your continued journey through cloud computing.
