Achieving Service Resilience on Cloud-Reliant Infrastructure

[tl;dr Rather than require banks to have multi-cloud or on-premise redundant strategies for critical digital services delivered via cloud, regulators should require firms to instead adhere to ‘well-architected’ guidelines to ensure technology resilience for a given provider or PaaS solution, as part of an overall industry-wide framework of service resilience that can accommodate very rare but impactful events such as cloud provider failure.]

Resilience is the ability of a ‘system’ of processes to recover quickly from a significant shock, ideally without any noticeable impact on those depending on that system. Systems are composed of people, technologies and environments (e.g., laws, physical locations, etc.), any of which can ‘fail’.

Financial regulators are particularly concerned with resilience, as failures in banking systems can cause business disruption, serious personal inconvenience, and even the potential for economic instability.

For many years, banks have had strict requirements to ensure critical systems (“applications”) could be recovered in the event of a disaster (specifically, a data center failure incurring loss or unavailability of all software, hardware and data). Enterprise-wide people and environment ‘failures’ are often handled within an overall operational risk and business continuity management framework, but technology organizations are accountable for managing technology failure risk – which is to say, for guaranteeing that an application will be available again within a fixed amount of time in the event of a major infrastructure failure or breach.

In recent times, regulators have been focusing more on business service resilience (see, for example, the UK’s Financial Conduct Authority statement on business service resilience). In the UK, a succession of high-profile losses of digital services (through which most UK customers interact with their bank) at several banks has put increased focus on the area of operational resilience. To date, most such incidents can be attributed to poor technical debt management rather than anything specifically cloud-related – but on cloud, a business-as-usual approach to technical debt will greatly amplify the likelihood and impact of failures.

So, as banks move their infrastructure to leverage public cloud capabilities, the question arises as to how to meet operational resilience demands in various cloud failure scenarios – from failure of individual hosted servers, to loss of a particular provider data center, to loss of a particular infrastructure service in a region, or a complete loss of all service globally from a particular provider.

Related to provision of service is stewardship of related data. Assuming a service cannot be provided without access to and availability of underlying data services, cloud resilience solutions must also address data storage and protection.

Cloud Resilience Approaches

There are multiple ways to address cloud provider availability risk, but all have negative implications for agility and innovation – i.e., the key reasons firms want to use the cloud in the first place.

Mitigating solutions could include the following:

  • Deploy to cloud but maintain on-premise redundant storage, processing and networking facilities to fall-back on (an approach used by Barclays)
  • Implement multi-cloud solutions so that new compute infrastructure can be spun up when needed on an alternate cloud provider, with full cross-cloud data replication always enabled
  • Maintain on-premise deployment for all production and resilience workloads, and only use cloud for non-production environments.

The disadvantage with the above is that all of these strategies require treating cloud as a lowest-common-denominator – i.e., infrastructure-as-a-service. This implies firms need to build or buy capabilities to enable on-premise or multi-cloud ‘platform-as-a-service’ – a significant investment, and likely only justifiable for larger firms. It also effectively excludes a cloud provider’s proprietary services from consideration in solution architectures.

It should be noted as well that multi-cloud data replication strategies can be expensive, as cloud providers charge for data egress, and maintaining a complete view of data in another cloud could consume considerable data egress resources on an on-going basis.
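The egress point above is simple arithmetic, and a rough sketch can make it concrete. The function name, the daily change volume and the price per GB below are all hypothetical placeholders rather than any provider's actual tariff; real pricing is tiered and varies by provider and region.

```python
def monthly_egress_cost(daily_change_gb: float, price_per_gb: float, days: int = 30) -> float:
    """Estimate the recurring cost of replicating changed data to another cloud.

    All inputs are assumptions supplied by the caller; this ignores tiered
    pricing, compression and any initial bulk copy.
    """
    if daily_change_gb < 0 or price_per_gb < 0 or days < 0:
        raise ValueError("inputs must be non-negative")
    return daily_change_gb * price_per_gb * days


# Hypothetical figures, for illustration only: 500 GB of changed data per day
# at an assumed egress price of 0.09 currency units per GB.
cost = monthly_egress_cost(daily_change_gb=500, price_per_gb=0.09)
print(f"Estimated monthly egress cost: {cost:,.2f}")
```

Even a crude model like this is useful for comparing the on-going cost of full cross-cloud replication against the cost of the failure it is meant to mitigate.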

The challenge gets more nuanced when software-as-a-service providers or business-process-as-a-service providers form part of the critical path for a regulated business service. Should a firm therefore have multiple SaaS or BPaaS providers to fall back on in the event of a complete failure of one such provider? In many cases, this is not feasible, but in other cases, the cost of keeping a backup service on ‘standby’ may be acceptably low.

Clearly, there is a point at which there are diminishing returns to actively mitigating these risks – but ignoring these ‘fat tail risks’ completely runs the risk that a business could disappear overnight with one failure in its ‘digital supply chain’.

Regulators are taking note of this in the UK and in the US, with increased focus on business resilience for key processes, and the role that cloud providers are likely to play in the future in this.

Architecting for Resiliency

Building resilient systems is hard. A key factor in any resilient system is decentralization. The most resilient systems are highly decentralized – there is no ‘brain’ or ‘heart’ that can fail and bring down the whole system. Examples of modern resilient systems are the Internet and Bitcoin. Both can survive significant failure, where the impact of failure is localised rather than systemic. (In the not-so-distant future, a truly resilient digitally-based financial infrastructure may have no choice but to be built on decentralized blockchain-like technologies…in other words, there may be diminishing cost/benefit returns for ‘too big to fail’ banks to implement digital resilience using traditional, centralised financial infrastructure architectures, even if cloud-based.)

Cloud providers, and services built on top of them, must by necessity be fundamentally resilient to failure. AWS, for example, ensures that all its regions are technically isolated from its other regions. A regional S3 outage in 2017 highlighted the dependency many businesses and services have on the S3 service. In response, Amazon changed its practices to mitigate the risk of a future failure, but a future regional outage of any cloud service is not only possible, it’s relatively likely (although generally less likely than the failure of an enterprise’s traditional data center, due to multiple redundant availability zones – aka data centers – in each region). So cloud providers typically provide regional isolation, and encourage customers (through programs like the AWS Well-Architected review or the Azure Architecture Center) to design the resilience of their systems around this, as well as providing hooks and interfaces to enable cross-region redundancy and fail-over of region-specific services.
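To illustrate what designing around regional isolation can look like from the application side, here is a minimal sketch of client-side fail-over across regions. It is not tied to any provider SDK: the region names, the `fetch_balance` function and the `AllRegionsFailed` exception are hypothetical stand-ins, and a real system would also have to deal with data replication and consistency between regions.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when no region in the list could serve the request."""


def call_with_failover(regions: Sequence[str], call: Callable[[str], T]) -> T:
    """Try each region in priority order and return the first successful result."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return call(region)
        except Exception as exc:  # a real client would catch narrower error types
            errors[region] = exc
    raise AllRegionsFailed(f"all regions failed: {errors}")


# Hypothetical stand-in for a region-specific service call.
def fetch_balance(region: str) -> str:
    if region == "region-a":
        raise ConnectionError("region-a unavailable")
    return f"balance served from {region}"


print(call_with_failover(["region-a", "region-b"], fetch_balance))
```

In practice this kind of logic is usually pushed down into DNS, load balancers or the platform itself rather than written by hand, which is part of the argument for relying on provider-native fail-over mechanisms.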

Along with architecting for resiliency, organizations need to validate the resiliency of their solutions. Techniques like chaos engineering, in which failures are deliberately injected into running systems, help ensure teams do not fear change or failure, and are confident in their system’s response to production failures.
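The idea behind chaos engineering can be shown with a toy fault-injection experiment. The sketch below wraps a hypothetical `get_quote` call so that it fails at random, then checks that a simple retry policy still delivers an acceptable success rate. The names, the 30% failure rate and the retry count are illustrative assumptions; real chaos tooling injects faults at the infrastructure level rather than in-process.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that was deliberately injected by the experiment."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts=5):
    """Call func up to `attempts` times, re-raising the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as exc:
            last_error = exc
    raise last_error


def get_quote():
    # Hypothetical business call; stands in for a real downstream service.
    return "quote-ok"


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_get_quote = with_chaos(get_quote, failure_rate=0.3, rng=rng)

results = []
for _ in range(1000):
    try:
        results.append(call_with_retry(flaky_get_quote))
    except InjectedFault:
        results.append(None)

success_rate = sum(r == "quote-ok" for r in results) / len(results)
print(f"success rate under injected faults: {success_rate:.3f}")
```

The point is less the retry logic itself than the habit: state a hypothesis about how the system behaves under failure, inject the failure, and measure whether the hypothesis holds.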

Multi-cloud Options

Cloud providers, no matter how much their customers ask for it, are not likely to make it simpler for multi-cloud services to be implemented by their customers. It is in their interest (and is indeed their strategy) to assume that their customers are wholly dependent on them, and as such to take resilience extremely seriously. If cloud providers were to view their competitors as the “disaster recovery” solution for customers, then this would be a significant step backwards for the industry. While it is not likely to happen imminently, there is no room for complacency; perhaps this is where some regulatory oversight of providers may, in the future, be beneficial.

There is, however, room in the market for multi-cloud specialists (and hybrid on-premise/cloud specialists). The biggest player in this space is VMware/Dell, with their recent (re)acquisition of Pivotal, which includes cloud-neutral technologies like (open-sourced) Cloud Foundry and Pivotal Container Service. VMware is going all-in on Kubernetes, as are IBM/Red Hat and others. Kubernetes offers enterprises or 3rd party vendors the opportunity to build their own on-premise ‘cloud’ infrastructure via a standard orchestration API, which in principle should also be able to run on any cloud provider’s IaaS offerings. These steps all point to a future in which, irrespective of whether the ‘cloud’ infrastructure provider is public or private, infrastructure is managed through container-based, software-provisioned platforms, and with that comes a fundamental shift in thinking about failure.

Cloud is not a Silver Bullet for Resilience

For enterprises, the choices on offer for implementing cloud are many and varied: all ultimately benefit organizations and their stakeholders by providing more flexibility at a lower cost than ever before – as long as systems on the cloud are architected for failure.

System availability (aka resilience) has traditionally been measured as mean-time-between-failure (MTBF), where MTBF is defined as mean-time-to-failure (MTTF) + mean-time-to-recover (MTTR).

Many on-premise systems are architected to maximize MTTF – i.e., aiming for a large mean-time-to-failure rather than a reduced mean-time-to-recover (MTTR). Distributed systems favor a small MTTR over a large MTTF, and cloud is no exception. Systems explicitly architected for a large MTTF and relatively large MTTR (hours, not seconds) will, in general, find it difficult to migrate to cloud without significant re-engineering – to the extent it is unlikely to be financially feasible. (MTTR for legacy systems is roughly equivalent to ‘recovery time objective (RTO)’ in business continuity planning, and system recovery in such systems is generally assumed to be a highly manual process.)
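The trade-off between MTTF and MTTR can be made concrete using the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR), i.e., MTTF divided by MTBF as defined above. The sketch below compares two hypothetical profiles; the specific hour figures are illustrative assumptions, not measurements.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical legacy profile: fails about once a year, takes 4 hours to recover.
legacy = availability(mttf_hours=365 * 24, mttr_hours=4)

# Hypothetical cloud-native profile: a component fails about once a month,
# but automated recovery takes around 30 seconds.
cloud_native = availability(mttf_hours=30 * 24, mttr_hours=30 / 3600)

print(f"legacy:       {legacy:.6f}")
print(f"cloud-native: {cloud_native:.6f}")
```

Under these assumed figures the cloud-native profile comes out ahead despite failing roughly twelve times as often, because recovery is several hundred times faster.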

Cloud-native applications have the advantage that they can be architected for low MTTR from the outset (the AWS Well-Architected reliability guidelines highlight this), taking into account the relatively small MTTF of foundational cloud components – in particular, virtual machines and Docker containers.

Going Serverless

It is easy to get lost in all of the infrastructure innovation happening to enable flexible, low cost and resilient systems. But ultimately enterprises don’t care about servers – they care about (business) services. Resilience planning should be built around service availability, and issues like orchestration, redundancy, fail-over, observability, load-balancing, data replication, etc., should all be provided via underlying platforms, so applications can focus to the fullest extent possible on business functionality, and not on infrastructure.

Organizations seem to be taking a primarily infrastructure-first approach to cloud to date. This is helpful to build essential cloud engineering and operations competencies, but is unlikely to yield major 10x benefits sought by digital transformation agendas, as the focus is still on infrastructure (specifically, servers, networks and storage). Instead, a more useful longer term approach would be to take a serverless-first stance in solution design – i.e., to highlight where resilient stateful services are needed, and which APIs, events and services need to be created or instantiated that use them. Identify 3rd party or internal infrastructure services which are ‘serverless’ (for example, AWS Cognito, AWS Lambda/Azure Functions/GCP Serverless, Office 365, ServiceNow, …) and plan architectures which use these across all phases of the SDLC. Where required services to achieve the ‘serverless’ target state do not exist, these capabilities should be built or bought. This is essentially the approach advocated by AWS, and should, in my view, form the core of enterprise digital technology roadmaps.

Resilient architectures will consist of multiple ‘serverless’ services, connected via many structured asynchronous event streams or synchronous APIs. Behind each event/API will be a highly automated, resilient platform capable of adjusting to demand and responding to failure through graceful degradation of services. The choice of orchestration platform technology, and its physical location, should largely be neutral to service consumers – in much the same way that users of AWS Lambda don’t know or (for the most part) care about the details of how it works behind the scenes. Standards like the open-sourced AWS ‘serverless application model’ (SAM) enable this for Lambda. Other emerging cloud-neutral serverless standards like Knative and OpenFaaS also exist.
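As a small illustration of graceful degradation behind a service API, the sketch below serves the last known good value, flagged as degraded, when the backend is unavailable. `DegradingService`, `Response` and the toggled `backend` function are hypothetical names invented for this example, and a production version would add cache expiry and narrower error handling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Response:
    value: Optional[str]  # None if there is no fresh or cached value
    degraded: bool        # True when the value did not come from a live call


class DegradingService:
    """Serve fresh data when possible, and the last good value otherwise."""

    def __init__(self, backend: Callable[[str], str]):
        self._backend = backend
        self._last_good: Dict[str, str] = {}

    def get(self, key: str) -> Response:
        try:
            value = self._backend(key)
        except Exception:  # a real service would catch narrower error types
            return Response(self._last_good.get(key), degraded=True)
        self._last_good[key] = value
        return Response(value, degraded=False)


# Hypothetical backend whose availability can be toggled for the demo.
state = {"up": True}


def backend(key: str) -> str:
    if not state["up"]:
        raise ConnectionError("backend unavailable")
    return f"fresh:{key}"


service = DegradingService(backend)
print(service.get("rates"))   # live response
state["up"] = False
print(service.get("rates"))   # same value, now flagged as degraded
```

The consumer sees a stable interface either way, and can decide for itself whether a degraded response is good enough for the business service in question.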

While these technologies are still maturing and are not yet robust enough to handle every demanding scenario, framing architecture in a ‘serverless’ way can be helpful to identify which technologies and providers to use where to provide overall resilience. In particular, using planning tools like Wardley Maps can help map out where and when it makes most business sense to transition from serverful on-premise current-state (custom infrastructure) to serverful (rental infrastructure/IaaS – private or public) cloud to the serverless target state (commodity PaaS/SaaS) or ultimately to BPaaS.

Summary

Fundamentally, the near term goal is to avoid restricting firms’ adoption of cloud by unnecessarily limiting their technology choices to a ‘least-common-denominator’ of provider services for infrastructure services that can be delivered cloud natively. For hybrid or multi-cloud solutions, a least-common-denominator approach is currently unavoidable, but a healthy, open market in Kubernetes Operators (Custom Resource Definitions) should, over time, close the on-premise vs cloud-provider-native infrastructure service gap and raise the level of service abstractions available at platform level.

Maximizing a given provider’s managed solutions – and adhering to ‘well-architected’ practices – is likely to provide sufficient cloud-native resilience, certainly well in excess of many existing on-premise practices. For private clouds, bespoke ‘well-architected’ guidance – supported by automated tooling – should be mandatory.

If any regulation is to happen regarding cloud utilization for regulated services, it should be to require firms to follow a provider’s well-architected practices to a very high degree – and that providers need to define such practices. In other words, new regulation is not needed to change the core engineering posture of the biggest cloud providers, but rather it should elevate the existing posture as the benchmark others must follow.

For private, hybrid or multi-cloud solutions an industry-wide accepted set of ‘well-architected’ guidance to take the guesswork out of resilience compliance would be welcome, coupled with an overarching framework for maintaining minimum digital service availability in the event of ‘fat-tail failures’ occurring (i.e., extremely rare but high impact failures).

For some regulated services, therefore, regulators will need to know which cloud providers and/or vendor infrastructure orchestration solutions are being used, to better manage concentration risk and ensure the failure of a cloud provider or vendor does not present a systemic risk. But for a given financial institution, the best option is likely to go ‘all-in’ either on a particular cloud provider, or on a cloud-neutral, open-sourced PaaS based on Kubernetes.
