With Amazon Web Services (AWS) experiencing a widespread issue with its web-based object storage platform yesterday (find out more), designing, building and deploying resilient and scalable applications on public cloud infrastructure is more important than ever. It’s no secret that public cloud platforms offer greater flexibility and simplicity, but ensuring applications are designed to be robust enough to handle common failure scenarios should always be a high-priority consideration for organisations.
The impact of yesterday’s issue was felt by a large number of established websites and brands, whose own online services were rendered unavailable as a result of the outage. The issue was attributed to instability in AWS’ S3 object storage service in its Northern Virginia region (US-EAST-1) and affected all five availability zones. It also affected a number of other AWS services in the same region that rely on S3. While AWS worked quickly to identify and resolve the issue, the resulting effects cascaded across the internet community extremely quickly and didn’t go unnoticed.
Designing applications and services that are both highly resilient and scalable is an essential component of any application architecture, and it is particularly important in the world of cloud computing. The principle that an application should be able to withstand the loss or failure of one or more components or resources, while inherently managing changes in demand, is a common theme that architects and engineers alike discuss at length. As a technology consulting provider, we encourage all our customers to adopt this principle just as strongly as we talk about securing applications.
Design for failure, everything else will follow…
Sometimes, we are told by customers that we are a little pessimistic because we talk about failure quite a lot. We don’t get offended and go on to explain that when designing cloud-based applications we always assume things will break from time to time. So, the best way to handle failure is to recognise it, plan for it and embrace it (even if it never happens!).
Resilience (noun)
The capacity to recover quickly from difficulties; toughness. (Oxford English Dictionary)
Public cloud computing platforms like AWS and Microsoft Azure allow customers to easily build and run web-based applications that are scalable, resilient and cost effective. However, applications must be designed to take advantage of the features and capabilities that these platforms have to offer. One of the major benefits of public cloud services today is geographical diversity, coupled with conceptually infinite scalability and proven mechanisms to protect applications against failure. This vast scalability unlocks a range of approaches to address the resilience challenges that are common in modern application architecture, but only if applications are designed correctly!
We work on a number of key design principles that help our customers implement resilient and scalable solutions that are measured against business objectives. These design principles include…
1. Decouple your application components.
As a rule, we believe that the more loosely coupled the components of a system are, the better the system will scale and handle failure. The key is to avoid introducing tight dependencies between the components that collectively make up an application or system. For example, a typical web application principle is to ensure that the web server is isolated from the application server and that the two components communicate asynchronously, as in the sketch below. Introducing ‘layers’ and ‘modules’ into a system that are entirely independent of each other is a fundamental method of addressing resilience and scalability challenges while significantly reducing overall complexity.
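To make the idea concrete, here is a minimal sketch in Python using boto3 and Amazon SQS. The queue URL and the process_order function are hypothetical placeholders; the point is simply that the web tier hands work to a queue rather than calling the application tier directly, so each side can fail, scale or restart independently.

```python
import json
import boto3

# Hypothetical queue URL -- substitute your own SQS queue.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/order-tasks"

sqs = boto3.client("sqs")

def enqueue_order(order: dict) -> None:
    """Web tier: hand work off to a queue instead of calling the
    application tier directly, so a slow or failed worker never
    blocks the user-facing request."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))

def worker_loop() -> None:
    """Application tier: consume work at its own pace; unprocessed
    messages simply wait in the queue if workers are down or scaling."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process_order(json.loads(msg["Body"]))  # your business logic
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

def process_order(order: dict) -> None:
    # Placeholder for the real application-tier work.
    print("processing", order)
```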
2. Stay stateless and distribute load.
A common and proven design pattern for cloud-based systems is to be as stateless as possible throughout the application stack. Managing state can be one of the biggest challenges in designing a resilient and scalable system, and it is usually ignored until late in the design lifecycle. One immediate benefit of a primarily stateless architecture is the ability to distribute transactional load across a scaled-out environment consisting of larger pools of resources. Traditional stateful systems are known for making recovery more difficult in cloud environments. However, it’s not always possible to avoid persisting state in modern applications. When it comes to storing and retrieving data, cloud platforms are designed to handle parallel operations, so it is of course possible to store, manage and maintain state (such as a user session) through dedicated, fit-for-purpose repositories that are designed to handle this type of data at scale, as the example below shows. The key is to manage state with as little data as possible and offload processing, which greatly improves performance, resilience and scalability.
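As an illustration, the sketch below externalises session state into a Redis-compatible store using the redis Python client. The hostname, key names and TTL are assumptions for the example; the pattern is that no request-handling process keeps state in memory, so any server in the pool can serve any request and the pool can grow, shrink or lose members without losing sessions.

```python
import json
import uuid
import redis

# Hypothetical external session store -- any managed Redis/ElastiCache
# endpoint would do; the host and key names here are placeholders.
sessions = redis.Redis(host="sessions.example.internal", port=6379)

SESSION_TTL_SECONDS = 1800  # keep as little state as possible, and let it expire

def create_session(user_id: str) -> str:
    """Any web server in the pool can create a session..."""
    session_id = str(uuid.uuid4())
    sessions.setex(f"session:{session_id}", SESSION_TTL_SECONDS,
                   json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str):
    """...and any other server can read it back, because no state
    lives in the web server process itself."""
    raw = sessions.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```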
3. Encourage elasticity.
Elasticity is one of the newer concepts that cloud computing has introduced and is critical in modern cloud-based application design. In essence, elasticity is the ability of an application to adjust its capacity and scale automatically, without any human intervention. There are a number of approaches to achieving this, but if an application is designed with elasticity in mind it becomes considerably easier to handle failure from the outset. For example, adopting ‘auto-scaling’ allows an application to identify current demand and seamlessly adjust to meet it (even if that means scaling down); a minimal example follows. While focused on scalability, this approach can dramatically improve system resilience and redundancy, as it enables dynamic configuration across a large number of possible scenarios, including component-level failure.
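As one possible example, the snippet below attaches a target-tracking scaling policy to an EC2 Auto Scaling group using boto3. The group name and target value are hypothetical, and the group itself (launch template, minimum and maximum size) is assumed to already exist; AWS then adds or removes instances to keep average CPU near the target with no human intervention.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical Auto Scaling group name -- the group itself is assumed
# to exist already with sensible minimum and maximum sizes.
response = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="keep-cpu-around-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # add instances above this, remove them below it
    },
)
print(response["PolicyARN"])
```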
4. Detect failure and recover fast.
One of the most significant advancements that comes with cloud computing is the ability to easily automate processes. Creating automated, repeatable deployment and configuration processes helps to reduce operational risk and improves performance. Applying this principle to detect and manage failures in a system is an extremely effective way of making an application robust. Understanding and monitoring the performance of an application is crucial, as it provides a mechanism to detect and recover quickly from foreseeable failures. Predefining failure scenarios and using automation to trigger graceful recovery processes, as sketched below, not only makes a system more resilient but can significantly improve performance and customer experience.
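By way of illustration only, the sketch below shows a simple health-check loop in Python: it polls a hypothetical /healthz endpoint and, after a configurable number of consecutive failures, triggers a recovery action. The endpoint, thresholds and recovery step are all placeholders; in a real deployment the same idea is usually delegated to load balancer health checks, CloudWatch alarms or equivalent platform features.

```python
import time
import requests

HEALTH_URL = "https://app.example.com/healthz"  # hypothetical endpoint
CHECK_INTERVAL = 30      # seconds between checks
FAILURE_THRESHOLD = 3    # consecutive failures before recovery is triggered

def healthy() -> bool:
    """Probe the application and treat any error or non-200 as unhealthy."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def trigger_recovery() -> None:
    # Placeholder: in practice this might recycle an instance, fail over
    # to another region, or invoke a predefined automation runbook.
    print("health check failed repeatedly -- starting graceful recovery")

failures = 0
while True:
    failures = 0 if healthy() else failures + 1
    if failures >= FAILURE_THRESHOLD:
        trigger_recovery()
        failures = 0
    time.sleep(CHECK_INTERVAL)
```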
5. Understand your application’s data.
In general, it’s a good principle to understand the data that a system produces and consumes. With cloud-based application architecture, it’s important to understand where this data resides, what purpose it serves and which components rely on it. There are many options when it comes to storing data in the cloud, from traditional relational databases and low-latency caches to object stores and NoSQL databases. Therefore, instead of storing data in a single database that every component relies on, data can be categorised and divided up according to its requirements and usage, as in the sketch below. By taking a holistic approach to managing and protecting the data components of a system, it is easier to address data availability challenges when considering failure scenarios and the impact they might have.
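As a small example of splitting data by purpose, the sketch below stores a bulky report artefact in object storage (S3 via boto3) and its small, frequently read summary in a cache (Redis), rather than pushing both into one database. The bucket name, hostname and TTL are placeholders; transactional records would typically live in a separate relational or NoSQL store chosen for that workload.

```python
import boto3
import redis

# All endpoints and bucket names below are placeholders for illustration.
cache = redis.Redis(host="cache.example.internal", port=6379)  # low-latency lookups
objects = boto3.client("s3")                                   # large, durable blobs

def save_report(report_id: str, pdf_bytes: bytes, summary: str) -> None:
    # Bulky, rarely read artefact goes to object storage.
    objects.put_object(Bucket="example-reports",
                       Key=f"{report_id}.pdf",
                       Body=pdf_bytes)
    # Small, frequently read summary goes to the cache with an expiry.
    cache.setex(f"report-summary:{report_id}", 3600, summary)
```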
So, in summary… designing, building and deploying highly resilient and scalable solutions is entirely achievable, and cloud computing is a great opportunity to do so in a cost-effective and proactive way. If your organisation is looking to build performance-driven and robust cloud-native applications, or is struggling with an existing strategy, then Prodera Group can help – please visit our Cloud Advisory Service for more information or contact us today to discuss your challenges with our experts.