Achieving Service Resilience on Cloud-Reliant Infrastructure

[tl;dr Rather than require banks to have multi-cloud or on-premise redundant strategies for critical digital services delivered via cloud, regulators should require firms to instead adhere to ‘well-architected’ guidelines to ensure technology resilience for a given provider or PaaS solution, as part of an overall industry-wide framework of service resilience that can accommodate very rare but impactful events such as cloud provider failure.]

Resilience is the ability of a ‘system’ of processes to recover quickly from a significant shock, ideally without any noticeable impact on those depending on that system. Systems comprise people, technologies and environments (e.g., laws, physical locations, etc), any of which can ‘fail’.

Financial regulators are particularly concerned with resilience, as failures in banking systems can cause business disruption, serious personal inconvenience, and even the potential for economic instability.

For many years, banks have had strict requirements to ensure critical systems (“applications”) could be recovered in the event of a disaster (specifically, a data center failure incurring loss or unavailability of all software, hardware and data). Enterprise-wide people and environment ‘failures’ are often handled within an overall operational risk and business continuity management framework, but technology organizations are accountable for managing technology failure risk – which is to say, guaranteeing that an application will be available again within a fixed amount of time in the event of a major infrastructure failure or breach.

In recent times, regulators have been focusing more on business service resilience (see, for example, the UK’s Financial Conduct Authority statement on business service resilience). In the UK, a succession of high-profile losses of digital services (through which most UK customers interact with their bank) at several banks has put increased focus on operational resilience. To date, most such incidents can be attributed to poor technical debt management rather than anything specifically cloud-related – but on cloud, a business-as-usual approach to technical debt will greatly amplify the likelihood and impact of failures.

So, as banks move their infrastructure to leverage public cloud capabilities, the question arises as to how to meet operational resilience demands in various cloud failure scenarios – from failure of individual hosted servers, to loss of a particular provider data center, to loss of a particular infrastructure service in a region, or a complete loss of all service globally from a particular provider.

Related to provision of service is stewardship of related data. Assuming a service cannot be provided without access to and availability of underlying data services, cloud resilience solutions must also address data storage and protection.

Cloud Resilience Approaches

There are multiple ways to address cloud provider availability risk, but all have negative implications for agility and innovation – i.e., the key reasons firms want to use the cloud in the first place.

Mitigating solutions could include the following:

  • Deploy to cloud but maintain on-premise redundant storage, processing and networking facilities to fall back on (an approach used by Barclays)
  • Implement multi-cloud solutions so that new compute infrastructure can be spun up when needed on an alternate cloud provider, with full cross-cloud data replication always enabled
  • Maintain on-premise deployment for all production and resilience workloads, and only use cloud for non-production environments.

The disadvantage with the above is that all of these strategies require treating cloud as a lowest-common-denominator – i.e., infrastructure-as-a-service. This implies firms need to build or buy capabilities to enable on-premise or multi-cloud ‘platform-as-a-service’ – a significant investment, and likely only justifiable for larger firms. It also effectively excludes a cloud provider’s proprietary services from consideration in solution architectures.

It should be noted as well that multi-cloud data replication strategies can be expensive: cloud providers charge for data egress, and maintaining a complete copy of data in another cloud can incur considerable egress charges on an ongoing basis.

The challenge gets more nuanced when software-as-a-service providers or business-process-as-a-service providers form part of the critical path for a regulated business service. Should a firm therefore have multiple SaaS or BPaaS providers to fall back on in the event of a complete failure of one such provider? In many cases, this is not feasible, but in other cases, the cost of keeping a backup service on ‘standby’ may be acceptably low.

Clearly, there is a point at which there are diminishing returns to actively mitigating these risks – but ignoring these ‘fat-tail risks’ completely runs the risk that a business could disappear overnight with one failure in its ‘digital supply chain’.

Regulators are taking note of this in the UK and in the US, with increased focus on business resilience for key processes, and the role that cloud providers are likely to play in the future in this.

Architecting for Resiliency

Building resilient systems is hard. A key factor in any resilient system is decentralization. The most resilient systems are highly decentralized – there is no ‘brain’ or ‘heart’ that can fail and bring down the whole system. Examples of modern resilient systems are the Internet and Bitcoin. Both can survive significant failure, where the impact of failure is localised rather than systemic. (In the not-so-distant future, a truly resilient digitally-based financial infrastructure may have no choice but to be built on decentralized blockchain-like technologies…in other words, there may be diminishing cost/benefit returns for ‘too big to fail’ banks to implement digital resilience using traditional, centralised financial infrastructure architectures, even if cloud-based.)

Cloud providers, and services built on top of them, must by necessity be fundamentally resilient to failure. AWS, for example, ensures that all its regions are technically isolated from its other regions. A regional S3 outage in 2017 highlighted the dependency many businesses and services have on the S3 service. Amazon’s response to the issue included changes in practices to mitigate the risk of a future failure, but a future regional outage of any cloud service is not only possible, it’s relatively likely (although generally less likely than the failure of an enterprise’s traditional data center, due to multiple redundant availability zones – aka data centers – in each region). So cloud providers typically provide regional isolation, and encourage (through programs like the AWS well-architected review or the Azure Architecture Center) customers to design the resilience of their systems around this, as well as providing hooks and interfaces to enable cross-region redundancy and fail-over of region-specific services.
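As a concrete illustration of the cross-region redundancy hooks mentioned above, the following is a minimal CDK sketch (TypeScript, assuming CDK v2 and its aws-dynamodb module; the stack, table and region names are illustrative rather than taken from any real deployment). It provisions a DynamoDB global table so that a single-region outage does not make the data unavailable:

```typescript
import { App, Stack, RemovalPolicy } from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

const app = new App();
const stack = new Stack(app, 'ResilienceDemoStack', {
  env: { region: 'eu-west-1' }, // primary region (illustrative)
});

// A global table: writes are replicated to the other regions, so losing one
// region degrades latency for some users but does not lose the service's data.
new dynamodb.Table(stack, 'CustomerTable', {
  partitionKey: { name: 'customerId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  replicationRegions: ['eu-west-2', 'us-east-1'], // cross-region replicas
  pointInTimeRecovery: true,                      // protects against corruption or accidental deletion
  removalPolicy: RemovalPolicy.RETAIN,
});

app.synth();
```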

Along with architecting for resiliency, organizations need to validate the resiliency of their solutions. Techniques like chaos engineering help ensure teams do not fear change or failure, and are confident in their system’s response to production failures.
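As a flavour of what that validation can look like in code, here is a small, library-agnostic TypeScript sketch (entirely illustrative – it does not represent any particular chaos-engineering tool). It wraps a dependency call and, outside production-critical paths, injects latency and failures so a team can observe how the surrounding system degrades and recovers:

```typescript
// Illustrative fault-injection wrapper for chaos-style testing.
interface ChaosConfig {
  enabled: boolean;     // only enable in environments where failure is acceptable
  failureRate: number;  // e.g. 0.05 => roughly 5% of calls throw
  maxLatencyMs: number; // upper bound on injected latency per call
}

async function withChaos<T>(call: () => Promise<T>, cfg: ChaosConfig): Promise<T> {
  if (cfg.enabled) {
    // Inject random latency to surface timeout and back-pressure behaviour.
    await new Promise((resolve) => setTimeout(resolve, Math.random() * cfg.maxLatencyMs));
    if (Math.random() < cfg.failureRate) {
      throw new Error('chaos: injected dependency failure');
    }
  }
  return call();
}

// Usage (hypothetical downstream call): verify the caller's retry/fallback behaviour.
// const product = await withChaos(() => fetchProduct('42'),
//   { enabled: true, failureRate: 0.05, maxLatencyMs: 200 });
```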

Multi-cloud Options

Cloud providers, no matter how much their customers ask for it, are not likely to make it simpler for multi-cloud services to be implemented by their customers. It is in their interest (and is indeed their strategy) to assume that their customers are wholly dependent on them, and as such to take resilience extremely seriously. If cloud providers were to view their competitors as the “disaster recovery” solution for customers, then this would be a significant step backwards for the industry. While it is not likely to happen imminently, there is no room for complacency; perhaps this is where some regulatory oversight of providers may, in the future, be beneficial.

There is, however, room in the market for multi-cloud specialists (and hybrid on-premise/cloud specialists). The biggest player in this space is VMware/Dell, with their recent (re)acquisition of Pivotal, which includes cloud-neutral technologies like (open-sourced) Cloud Foundry and Pivotal Container Service. VMware is going all-in on Kubernetes, as are IBM/Red Hat and others. Kubernetes offers enterprises or 3rd party vendors the opportunity to build their own on-premise ‘cloud’ infrastructure via a standard orchestration API, which in principle should also be able to run on any cloud provider’s IaaS offerings. These steps all point to a future where, irrespective of whether the ‘cloud’ infrastructure provider is public or private, infrastructure is container-based and software-provisioned – and with that comes a fundamental shift in thinking about failure.

Cloud is not a Silver Bullet for Resilience

For enterprises, the choices on offer for implementing cloud are many and varied: all ultimately benefit organizations and their stakeholders by providing more flexibility at a lower cost than ever before – as long as systems on the cloud are architected for failure.

System availability (aka resilience) has traditionally been measured as mean-time-between-failure (MTBF), where MTBF is defined as mean-time-to-failure (MTTF) + mean-time-to-recover (MTTR).

Many on-premise systems are architected to maximize MTTF – i.e., aiming for a large mean-time-to-failure rather than a small mean-time-to-recover (MTTR). Distributed systems favor a small MTTR over a large MTTF, and cloud is no exception. Systems explicitly architected for a large MTTF and relatively large MTTR (hours, not seconds) will, in general, find it difficult to migrate to cloud without significant re-engineering – to the extent it is unlikely to be financially feasible. (MTTR for legacy systems is roughly equivalent to ‘recovery time objective (RTO)’ in business continuity planning, and system recovery in such systems is generally assumed to be a highly manual process.)

Cloud-native applications have the advantage that they can be architected for low MTTR from the outset (the AWS well-architected reliability guidelines highlight this), taking into account a relatively small MTTF of foundational cloud components – in particular, virtual machines and Docker containers.
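To make the trade-off concrete, here is a small illustrative calculation (the numbers are invented): availability is MTTF / (MTTF + MTTR), i.e., MTTF / MTBF, so a system that fails ten times more often but recovers in seconds can still be more available than one that rarely fails but takes hours to recover.

```typescript
// Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF. Figures below are illustrative only.
function availability(mttfHours: number, mttrHours: number): number {
  return mttfHours / (mttfHours + mttrHours);
}

// A 'large MTTF' legacy system: fails rarely, but recovery is a ~4 hour manual process.
const legacy = availability(2000, 4);
// A 'small MTTR' cloud-native system: fails ten times as often, but recovers in ~36 seconds.
const cloudNative = availability(200, 0.01);

console.log(`legacy:       ${(legacy * 100).toFixed(3)}%`);      // ≈ 99.800%
console.log(`cloud-native: ${(cloudNative * 100).toFixed(3)}%`); // ≈ 99.995%
```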

Going Serverless

It is easy to get lost in all of the infrastructure innovation happening to enable flexible, low cost and resilient systems. But ultimately enterprises don’t care about servers – they care about (business) services. Resilience planning should be built around service availability, and issues like orchestration, redundancy, fail-over, observability, load-balancing, data replication, etc should all be provided via underlying platforms, so applications can focus to the fullest extent possible on business functionality, and not on infrastructure.

To date, organizations seem to have taken a primarily infrastructure-first approach to cloud. This is helpful for building essential cloud engineering and operations competencies, but is unlikely to yield the major 10x benefits sought by digital transformation agendas, as the focus is still on infrastructure (specifically, servers, networks and storage). Instead, a more useful longer-term approach is to take a serverless-first stance in solution design – i.e., to highlight where resilient stateful services are needed, and which APIs, events and services that use them need to be created or instantiated. Identify 3rd party or internal infrastructure services which are ‘serverless’ (for example, AWS Cognito, AWS Lambda/Azure Functions/Google Cloud Functions, Office 365, ServiceNow, …) and plan architectures which use these across all phases of the SDLC. Where the services required to achieve the ‘serverless’ target state do not exist, these capabilities should be built or bought. This is essentially the approach advocated by AWS, and should, in my view, form the core of enterprise digital technology roadmaps.

Resilient architectures will consist of multiple ‘serverless’ services, connected via many structured asynchronous event streams or synchronous APIs. Behind each event/API will be a highly automated, resilient platform capable of adjusting to demand and responding to failure through graceful degradation of services. The choice of orchestration platform technology, and its physical location, should largely be neutral to service consumers – in much the same way that users of AWS Lambda don’t know or (for the most part) care about the details of how it works behind the scenes. Standards like the open-sourced AWS ‘serverless application model’ (SAM) enable this for Lambda. Other emerging cloud-neutral serverless standards like Knative and OpenFaaS also exist.

While these technologies are still maturing and are not yet robust enough to handle every demanding scenario, framing architecture in a ‘serverless’ way can be helpful in identifying which technologies and providers to use where to provide overall resilience. In particular, planning tools like Wardley Maps can help map out where and when it makes most business sense to transition from a serverful on-premise current state (custom infrastructure), to serverful cloud (rented infrastructure/IaaS – private or public), to the serverless target state (commodity PaaS/SaaS), or ultimately to BPaaS.

Summary

Fundamentally, the near-term goal is to avoid restricting firms’ adoption of cloud by unnecessarily limiting their technology choices to a ‘least-common-denominator’ of infrastructure services that can be delivered cloud-natively. For hybrid or multi-cloud solutions, a least-common-denominator approach is currently unavoidable, but a healthy, open market in Kubernetes Operators (Custom Resource Definitions) should, over time, close the on-premise vs cloud-provider-native infrastructure service gap and raise the level of service abstractions available at platform level.

Maximizing a given provider’s managed solutions – and adhering to ‘well-architected’ practices – is likely to provide sufficient cloud-native resilience, certainly well in excess of many existing on-premise practices. For private clouds, bespoke ‘well-architected’ guidance – supported by automated tooling – should be mandatory.

If any regulation is to happen regarding cloud utilization for regulated services, it should be to require firms to follow a provider’s well-architected practices to a very high degree – and to require providers to define such practices. In other words, new regulation is not needed to change the core engineering posture of the biggest cloud providers; rather, it should elevate the existing posture as the benchmark others must follow.

For private, hybrid or multi-cloud solutions an industry-wide accepted set of ‘well-architected’ guidance to take the guesswork out of resilience compliance would be welcome, coupled with an overarching framework for maintaining minimum digital service availability in the event of ‘fat-tail failures’ occurring (i.e., extremely rare but high impact failures).

For some regulated services, therefore, regulators will need to know which cloud providers and/or vendor infrastructure orchestration solutions are being used, to better manage concentration risk and ensure the failure of a cloud provider or vendor does not present a systemic risk. But for a given financial institution, the best option is likely to go ‘all-in’ either on a particular cloud provider, or on a cloud-neutral, open-sourced PaaS based on Kubernetes.

AWS CDK – why it’s worth looking at

[tl;dr AWS CDK provides a means for developers to consume compliant, reusable cloud infrastructure components in a way that matches their SDLC, improving developer experience and reducing the risk of silo’ing cloud infrastructure development and operations.]

Some weeks ago, Amazon launched the AWS Cloud Development Kit, or CDK. This article provides my initial (neutral) thoughts on the potential impact and relevance of the CDK for organizations building and deploying solutions on the cloud.

Developer Experience

First, does it work? The CDK had been in beta for quite a while, and when Amazon makes a product generally available, it generally means it meets a very high bar in terms of quality and stability – and that certainly is the case with the CDK. The examples all worked as expected, although the lambda code pipeline example took some mental gyrations to understand the full implications of what was actually being done – specifically, that the build/deploy pipeline created by the CDK can include running the CDK to generate templates that can be used as an input into that code-deployment pipeline – even if the resulting template is not deployed by the CDK.

All in all, the out-of-the-box experience for the CDK was excellent.

A polyglot framework

Secondly, the CDK is a software development framework. This means it uses ‘traditional’ programming languages (imperative, not declarative), it uses SDLC processes all application developers are familiar with (i.e., build/test/deploy cycles), and it provides many software abstractions that serve to hide (unnecessary) complexity from developers, while enabling developers to build ‘safe’ solutions.

The framework itself has been developed in Typescript, with an interesting technology called ‘jsii‘ used to generate native libraries in other programming languages (specifically, Java, Python, and C#/.NET as well as Javascript).

The polyglot nature of the framework is critical, as cloud infrastructure (as exposed to consumers) must be neutral to any specific programming language. But the most widely used languages already have at least one popular framework for abstracting the infrastructure necessary for building distributed applications (such as data stores, message queues, service discovery & routing, configuration, logging/tracing, caching, etc). For example, Java has the Spring framework, C# has .NET Core, Python has Django, and Javascript has multiple frameworks based on NodeJS.

So, the question is, should developers now learn to use the CDK or focus instead on language-specific frameworks?

What the CDK is – and what it is not

To answer this question, we need to be clear on what the CDK is, and what it is not. CDK is a compiler to generate CloudFormation code. If one considers CloudFormation templates to be the ‘assembly language’ for the AWS cloud ‘processor’, then what is important is that CDK generates high-quality CloudFormation templates – period.

To that extent, the CDK only needs to be as efficient as it takes to generate valid CloudFormation templates. It does not execute those templates, so CDK code will never have run-time performance sensitivities (except perhaps, as with traditional compilers, in build toolchains).

For this reason, Typescript seems to have been a pragmatic choice to implement the CDK in. Run-time performance is not the key factor here – rather, it is the creation of flexible, adaptable constructs that avoids the need for developers to write CloudFormation YAML/JSON. Languages that use the CDK libraries should expect the same performance criteria to apply.
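A minimal sketch of this ‘compiler’ relationship (TypeScript, using current CDK v2-style imports; the construct names are illustrative): a few lines of imperative code which `cdk synth` turns into a considerably longer CloudFormation template.

```typescript
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';

const app = new App();
const stack = new Stack(app, 'HelloCdkStack');

// One construct in code...
new sqs.Queue(stack, 'WorkQueue', {
  visibilityTimeout: Duration.seconds(300),
});

// ...which `cdk synth` 'compiles' into the corresponding CloudFormation resources,
// with sensible defaults filled in by the construct library.
app.synth();
```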

Why CDK is necessary

After working through several of the examples, and observing the complexity of the CloudFormation code the CDK generates – and the simplicity of the example code – it is clear that having developers write CloudFormation templates is no more sustainable than having developers write assembly language. CloudFormation (as with Azure Resource Manager and GCP Cloud Deployment Manager) is excellent for small, well-defined projects, but rapidly gets complex when expanded to many applications with complex infrastructure inter-dependencies.

In particular, ensuring the security and compliance of templates becomes very complex when templates are hand-crafted. While services like CloudSploit offer to statically scan CloudFormation templates for security vulnerabilities, it would be much better to ensure secure CloudFormation code was written in the first place.

Through Constructs, Stacks, and Apps, the CDK allows enterprise engineering teams to provide libraries of secure, compliant infrastructure components to developers that can be safely deployed.
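For example, here is a sketch of the kind of pre-hardened construct an enterprise engineering team might publish (the name and policy choices are illustrative, not a standard AWS construct):

```typescript
import { Construct } from 'constructs';
import { RemovalPolicy } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';

// A pre-hardened bucket: encryption, versioning and public-access blocking are baked in,
// so application teams cannot accidentally deploy a non-compliant bucket.
export class CompliantBucket extends Construct {
  public readonly bucket: s3.Bucket;

  constructor(scope: Construct, id: string) {
    super(scope, id);
    this.bucket = new s3.Bucket(this, 'Bucket', {
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      versioned: true,
      enforceSSL: true,
      removalPolicy: RemovalPolicy.RETAIN,
    });
  }
}

// Application code then simply writes: new CompliantBucket(stack, 'ReportBucket');
```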

For this reason, as well as the familiarity of CDK constructs to developers, CDK is likely to end up being more popular than having developers hand-craft CloudFormation templates. The complexity of CloudFormation risks enterprises splitting teams into specialists and non-specialists, reverting organizations back to silo’d infrastructure anti-patterns.

However, having specialist engineering teams focused on building, publishing and maintaining high-quality reusable constructs is a good thing, and this is likely what most organizations’ enterprise engineering teams should focus on (as well as the larger global community of CDK developers).

With respect to language-specific frameworks, perhaps it is only a matter of time before these frameworks generate the cloud-native templates that the software level abstractions can map directly onto. This may mean that the application footprint for such applications could get much smaller in future, as the framework abstractions are increasingly implemented by cloud constructs. Indeed, as observed by Adrian Cockroft, many of the open-source microservice components developed by Netflix ended up being absorbed by AWS, greatly simplifying the Netflix-specific code-base.

If this outlook proves correct, the correct approach for organizations already committed to a microservices framework would be to stick with it, rather than have their business-facing application developers learn CDK.

With respect to Terraform, the most popular cross-cloud provisioning, deployment and configuration tool, its principal benefit is a consistent SDLC workflow across cloud providers. Organizations need to decide if a single end-to-end SDLC for application and infrastructure developers on a single cloud (using CDK) provides more benefits than a single infrastructure SDLC across multiple cloud providers, but with a different SDLC for application developers.

CDK & Serverless

For architectures which are fundamentally not ‘serverless’ in nature, the CDK presents a conflict: by allowing infrastructure to be specified and built as part of the developer lifecycle, where does responsibility for managing infrastructure lie?

The reality is, most organizations still exist in a ‘serverful’ world – where infrastructure environments, even if cloud-based, are ‘pets’ and not ‘cattle’. Environments tend to be created and managed over the long-term, especially where datastores are involved. Stacks tend to be stable across environments, changing only with new releases of software. Separate teams (from developers) are responsible for the ongoing health, security and cost of environments. These teams are likely to be much more comfortable with configuration and scripting than outright coding, using tools like Chef, Puppet, Ansible or AWS OpsWorks. They may prefer developers or architects to request infrastructure components via tools like AWS Service Catalog or ServiceNow, so that infrastructure code is firmly managed away from developers, and the benefits of SDLC-friendly CDK may be less obvious to them.

Generating and maintaining safe, secure and compliant cloud stacks is a vibrant area of growth, and CDK is unlikely to monopolise this – rather, it may spur the growth of 3rd party solutions. 3rd parties that aim to simplify and standardize cloud infrastructure management (such as Pulumi) will have a role to play, particularly for polyglot language and multi-cloud environments, but ‘serverful’ platform and infrastructure teams need to decide what infrastructure building blocks to expose to developers, and how.

With serverless, this dynamic changes significantly, and the CDK can safely become part of the development team’s SDLC. Indeed, a potential goal for enterprises moving towards a serverless target state (i.e., applications consisting of composable services with no fixed/bespoke infrastructure) is to use CDK constructs to define those business-level services as infrastructure components. The concept of a platform-as-a-service to integrate software-as-a-service is worth exploring as this space matures, particularly with the advent of services like AWS EventBridge.

In the meantime, behind every ‘serverless’ service lie many servers. Teams have many options as to how best to automate this underlying infrastructure, and CDK is another tool in the toolbox to enable this.

Conclusion

AWS CDK is ground-breaking technology that is a big step towards improving the developer experience and capabilities on (AWS) cloud. Other cloud providers will likely follow suit or risk widening the gap between cloud infrastructure teams and application development teams. Organizations should consider investing in building and publishing CDK construct libraries to be used by application teams – constructs which can be verified to be secure and with sufficient guardrails to allow less-experienced engineers to safely experiment with solutions.

In the meantime, as cloud platforms extend their capabilities, expect language-specific microservices frameworks to get simpler and smaller (or at least more modular in implementation), enabling application developers to fully exploit a given cloud provider’s platform services. Teams relying on these frameworks should understand and drive the roadmap for how these frameworks leverage cloud-native services, and ensure they align with their wider platform cloud/infrastructure automation strategy.

Bending The Serverless Spoon

“Do not try and bend the spoon. That’s impossible. Instead, only realize the truth… THERE IS NO SPOON. Then you will see that it is not the spoon that bends, it is yourself.” — The Matrix

[tl;dr To change the world around them, organizations should change themselves by adopting serverless + agile as a target. IT organizations should embrace serverless to optimize and automate IT workflows and processes before introducing it for critical business applications.]

“Serverless” is the latest shiny new thing to come on the architectural scene. An excellent (opinionated) analysis on what ‘serverless’ means has been written by Jeremy Daly, a serverless evangelist – the basic conclusion being that ‘serverless’ is ultimately a methodology/culture/mindset.

If we accept that as a reasonable definition, how does this influence how we think about solution design and engineering, given that generations of computer engineers have grown up with servers front-and-center of design thinking?

In other words, how do we bend our way of thinking of a problem space to serverless-first, and use that understanding to help make better architectural decisions – especially with respect to virtual machines, containers, and orchestration, and distributed systems in general?

Worked Example

To provide some insight into the practicalities of building and running a serverless application, I used a worked example, “Building a Serverless App Using Athena and AWS Lambda” by Epsagon, a serverless monitoring specialist. This uses the open-source Serverless framework to simplify the creation of serverless infrastructure on a given cloud provider. This example uses AWS.

Note to those attempting to follow this exercise: not all the required code was provided in the version I used, so the tutorial does require some (javascript) coding skills to fill the gaps. The code that worked for me (with copious logging..) can be found here.

This worked example focuses on two reference-data oriented architectural patterns:

  • The transactional creation via a RESTful API of a uniquely identifiable ‘product’ with an ad-hoc set of attributes, including but not limited to ‘ProductId’, ‘Name’ and ‘Color’.
  • The ability to query all ‘products’ which share specific attributes – in this case, a shared name.

In addition, the ability to create/initialize shared state (in the form of a virtual database table) is also handled.
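The tutorial’s own code is linked above; purely to give a flavour of the ‘create product’ path, here is a minimal hand-written sketch (in TypeScript rather than the tutorial’s JavaScript; the environment variable, field names and key layout are illustrative). It is an API Gateway-triggered Lambda that stores the product as a JSON object in S3, where Athena can later query it:

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

const s3 = new S3Client({});
const BUCKET = process.env.PRODUCTS_BUCKET!; // illustrative: injected via the Serverless framework config

// POST /products – persist the product to S3 so Athena can query it later.
export const createProduct = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const product = JSON.parse(event.body ?? '{}'); // e.g. { ProductId, Name, Color, ... }
  if (!product.ProductId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'ProductId is required' }) };
  }
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: `products/${product.ProductId}.json`,
    Body: JSON.stringify(product),
    ContentType: 'application/json',
  }));
  return { statusCode: 201, body: JSON.stringify(product) };
};
```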

Problem-domain Non-Functional Characteristics

Conceptually, the architecture has the following elements:

  • Public, anonymous RESTful APIs for product creation and query
    • APIs could be defined in OpenAPI 3.0, but by default are created implicitly from the Serverless framework configuration
  • Durable storage of product data information
    • Variable storage cost structure based on access frequency can be added through configuration
    • Long-term archiving/backup obligations can be met without using any other services.
  • Very low data management overhead
  • Highly resilient and available infrastructure
    • Additional multi-regional resilience can be added via Application Load Balancer and deploying Lambda functions to multiple regions
    • S3 and Athena are globally resilient
  • Scalable architecture
    • No fixed constraint on number of records that can be stored
    • No fixed constraint on number of concurrent users using the APIs (configurable)
    • No fixed constraint on the number of concurrent users querying Athena (configurable)
  • No servers to maintain (no networks, servers, operating systems, software, etc)
  • Costs based on utilization
    • If nobody is updating or querying the database, then no infrastructure is being used and no charges (beyond storage) are incurred
  • Secure through AWS IAM permissioning and S3 encryption.
    • Many more security authentication, authorization and encryption options available via API Gateway, AWS Lambda, AWS S3, and AWS Athena.
  • Comprehensive log monitoring via CloudWatch, with ability to add alerts, etc.

For a couple of days’ coding, that’s a lot of non-functional goodness… and overall the development experience was pretty good (albeit not CI/CD optimized – I used Microsoft’s VS Code IDE on a MacBook, and the locally installed Serverless framework to deploy). Of course, I needed to be online and connected to AWS, but this seemed like a minor issue (for this small app). I did not attempt to deploy any serverless mock services locally.

So, even for a slightly contrived use case like the above, there are clear benefits to using serverless.

Why bend the spoon?

There are a number of factors that typically need to be taken into consideration when designing solutions which tend to drive architectures away from ‘serverless’ towards ‘serverful’. Typically, these revolve around resource management (i.e., network, compute, storage) and state management (i.e., transactional state changes).

The fundamental issue that application architects need to deal with in any solution architecture is the ‘impedance mismatch’ between general purpose storage services, and applications. Applications and application developers fundamentally want to treat all their data objects as if they are always available, in-memory (i.e., fast to access) and globally consistent, forcing engineers to optimize infrastructure to meet that need. This generally precludes using general-purpose or managed services, and results in infrastructure being tightly coupled with specific application architectures.

The simple fact is that a traditional well-written, modular 3-tier (GUI, business logic, data store) monolithic architecture will always outperform a distributed system – for the set of users and use-cases it is designed for. But these architectures are (arguably) increasingly rare in enterprises for a number of reasons, including:

  • Business processes are increasing in complexity (aka features), consisting of multiple independently evolving enterprise functions that must also be highly digitally cohesive with each other.
  • More and more business functions are being provided by third-parties that need close (digital) integration with enterprise processes and systems, but are otherwise managed independently.
  • There are many, disparate consumers of (digital) process data outputs – in some cases enabling entirely new business lines or customer services.
  • (Digital) GUI users extend well outside the corporate network, to mobile devices as well as home networks, third-party provider networks, etc.

All of the above conspire to drive even the most well-architected monolithic application towards a ‘ball-of-mud’ architecture.

Underpinning all of this is the real motivation behind modern (cloud-native) infrastructure: in a digital age, infrastructure needs to be capable of being ‘internet scale’ – supporting all 4.3+ billion internet users, and growing.

Such scale demands serverless thinking. However, businesses that do not aspire to internet-scale usage still have key concerns:

  • Ability to cope with sudden demand spikes in b2c services (e.g., due to marketing campaigns, etc), and increased or highly variable utilisation of b2b services (e.g., due to b2b customers going digital themselves)
  • Provide secure and robust services to their customers when they need them, that are resilient to risks
  • Ability to continuously innovate on products and services to retain customers and remain competitive
  • Comply with all regulatory obligations without impeding ability to change, including data privacy and protection
  • Ability to reorganize how internal capabilities are provisioned and provided with minimal impact to any of the above.

Without serverless thinking, meeting all of these sometimes-conflicting needs becomes very complex, and will consume ever more enterprise IT engineering capacity.

Note: for firms to really understand where serverless should fit in their overall investment strategy, Wardley Maps are a very useful strategic planning tool.

Bending the Spoon

Bending the spoon means rethinking how we architect systems. It fundamentally means closing the gap between models and implementation, and recognizing that where an architecture is deficient, the instinctive reaction to fix or change what you control needs to be overcome: i.e., drive the change to the team (or service provider) where the issue properly belongs. This requires out-of-the-box thinking – and perhaps is a decision that should not be taken by individual teams on their own unless they really understand their service boundaries.

This approach may require teams to scale back new features, or modify roadmaps, to accommodate what can currently be appropriately delivered by the team, and accepting what cannot.

Most firms fail at this – because typically senior management focus on the top-line output and not on the coherence of the value-chain enabling it. But this is what ‘being digital’ is all about.

Everyone wants to be serverless

The reality is, the goal of all infrastructure teams is to avoid developers having to worry about their infrastructure. So while technologies like Docker initially aimed to democratize deployment, infrastructure engineering teams are working to ensure developers never need to know how to build or manage a docker image, configure a virtual machine, manage a network or storage device, etc, etc. This even extends to hiding the specifics of IaaS services exposed by cloud providers.

Organizations that are evaluating Kubernetes, OpenFaaS, or Knative, or which use services such as AWS Fargate, AWS ECS, Azure Container Service, etc, ultimately want to minimize the knowledge developers need to have of the infrastructure they are working on.

Unfortunately for infrastructure teams, most developers still develop applications using the ‘serverful’ model – i.e., they want to know what containers are running where, how they are configured, how they interact, how they are discovered, etc. Developers also want to run containers on their own laptop whenever they can, and deploy applications to authorized environments whenever they need to.

Developers also build applications which require complex configuration which is often hand-constructed between and across environments, as performance or behavioural issues are identified and ‘patched’ (i.e., worked around instead of directing the problem to the ‘right’ team/codebase).

At the same time, developers do not want anything to do with servers…containers are as close as they want to get to infrastructure, but containers are just an abstraction of servers – they are most definitely not ‘serverless’.

To be Serverless, Be Agile

Serverless solutions are still in the early stages of maturity. For many problems that require a low-cost, resilient and always-available solution, but are not particularly performance sensitive (i.e., are naturally asynchronous and eventually consistent), serverless solutions are ideal.

In particular, IT processes (the proverbial shoes for the cobbler’s children) would benefit significantly from extensive use of serverless, as the management overhead of serverless solutions will be significantly less than that of other solutions. Integrating bespoke serverless solutions with workflows managed by tools like ServiceNow could be a significant game changer for existing IT organizations.

However, mainstream use of serverless technologies and solutions for business critical enterprise applications is still some way away – but if IT departments develop skills in it, it won’t be long before it finds its way into critical business solutions.

For broader use of serverless, firms need to be truly agile. Work for teams needs to come equally from other dependent teams as from top-down sources. Teams themselves need to be smaller (and ‘senior’ staff need to rethink their roles), and also be prepared to split or plateau. And feature roadmaps need to be driven as much by capabilities as by imagined needs.

Conclusion

Organizations already know they need to be ‘agile’. To truly change the world (bend the spoon), serverless and agile together will enable firms to change themselves, and so shape the world around them.

Unfortunately, for many organizations it is still easier to try to bend the spoon. For those who understand they need to change, adopting the ‘serverless’ mindset is key to success, even if – at least initially – true serverless solutions remain a challenge to realize in organizations dealing with legacy (serverful) architectures.

The cloudy future of data management & governance

[tl;dr The cloud enables novel ways of handling an expected explosion in data store types and instances, allowing stakeholders to know exactly what data is where at all times without human process dependencies.]

Data management & governance is a big and growing concern for more and more organizations of all sizes. Effective data management is critical for compliance, resilience, and innovation.

Data governance is necessary to know what data you have, when you got it, where it came from, where it is being used, and whether it is of good quality or not.

While the field is relatively mature, the rise of cloud-based services and service-enabled infrastructure will, I believe, fundamentally change the nature of how data is managed in the future and enable greater agility if leveraged effectively.

Data Management Meta-Data

Data and application architects are concerned with ensuring that applications use the most appropriate data storage solution for the problem being solved. To better manage cost and complexity, firms tend to converge on a handful of data management standards (such as Oracle or SQL Server for databases; NFS or NTFS for filesystems; Netezza or Teradata for data warehousing; Hadoop/HDFS for data processing, etc). Expertise is concentrated in central teams that manage provisioning, deployments, and operations for each platform. This introduces dependencies that project teams must plan around, and requires forward planning and long-term commitment – so not particularly agile.

Keeping up with data storage technology is a challenge – technologies like key/value stores, graph databases, columnar databases, object stores, and document databases exist because they represent varying datasets in a more natural way for applications to consume, reducing or eliminating the ‘impedance mismatch’ between how applications view state and how that state is stored.

In particular, many datastore technologies are designed to scale up rather than out; i.e., the only way to make them perform faster is to add more CPU/memory, or faster IO hardware. While this keeps applications simpler, it requires significant forward planning and longer-term commitments to scale up, and is out of the control of application development teams. Cloud-based services can typically handle scale-out transparently, although applications may need to be aware of the data dimensions across which scale-out happens (e.g., sharding by primary key, etc).

Provisioning a new datastore on-premise is mostly ticket driven, and fulfillment within enterprises is still largely done by humans rather than software – which means an “infrastructure-as-code” approach is not feasible.

Data Store Manageability vs Application Complexity

Most firms decide that it is better to simplify the data landscape such that fewer datastore solutions are available, but to resource those solutions so that they are properly supported to handle business critical production workloads with maximum efficiency.

The trade-off is in the applications themselves, where the data storage solutions available end up driving the application architecture, rather than the application architecture (i.e., requirements) dictating the most appropriate data store solution, which would result in the lowest impedance mismatch.

A typical example of an impedance mismatch is object-oriented applications (written in, say, C++ or Java) which use relational databases. Here, object/relational mapping technologies such as Hibernate or Gigaspaces are used to map the application view of the data (which likes to view data as in-memory objects) to the relational view. These middle layers, while useful for naturally relational data, can be overly expensive to maintain and operate if what your application really needs is a more appropriate type of datastore (e.g., graph).

This mismatch gets exacerbated in a microservices environment where each microservice is responsible for its own persistence, and individual microservices are written in the language most appropriate for the problem domain. Typical imperative, object-oriented languages implementing transactional systems will lean heavily towards relational databases and ORMs, whereas applications dealing with multi-media, graphs, very-large objects, or simple key/value pairs will not benefit from this architecture.

The rise of event-driven architectures (in particular, transactional ‘sagas‘, and ‘aggregates‘ from DDD) will also tend to move architectures away from ‘kitchen-sink’ business object definitions maintained in a single code-base into multiple discrete but overlapping schemas maintained by different code-bases, and triggered by common or related events. This will ultimately lead to an increase in the number of independently managed datastores in an organisation, all of which need management and governance across multiple environments.

For on-premise solutions, the pressure to keep the number of datastore options down, while dealing with an explosion in instances, is going to limit application data architecture choices, increase application complexity (to cope with datastore impedance mismatch), and reduce the benefits from migrating to a microservices architecture (shared datastores favor a monolithic architecture).

Cloud Changes Everything

So how does cloud fundamentally change how we deal with data management and governance? The most obvious benefit cloud brings is around the variety of data storage services available, covering all the typical use cases applications need. Capacity and provisioning is no longer an operational concern, as it is handled by the cloud provider. So data store resource requirements can now be formulated in code (e.g., in CloudFormation, Terraform, etc).

This, in principle, allows applications (microservices) to choose the most appropriate storage solution for their problem domain, and to minimize the need for long-term forward planning.

Using code to specify and provision database services also has another advantage: cloud service providers typically offer the means to tag all instantiated services with your own meta-data. So you can define and implement your own data management tagging standards, and enforce these using tools provided by the cloud provider. These can be particularly useful when integrating with established data discovery tools, which depend on reliable meta-data sources. For example, tags can be defined based on a data ontology defined by the chief data office (see my previous article on CDO).
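A minimal sketch of what that can look like in practice (TypeScript with CDK v2; the tag keys and values are illustrative and would come from the firm’s own data ontology):

```typescript
import { App, Stack, Tags } from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

const app = new App();
const stack = new Stack(app, 'CustomerDataStack');

const table = new dynamodb.Table(stack, 'CustomerTable', {
  partitionKey: { name: 'customerId', type: dynamodb.AttributeType.STRING },
});

// Governance metadata travels with the resource itself, so discovery tooling can find it
// without any human process dependency. Keys and values are illustrative.
Tags.of(table).add('data-owner', 'retail-banking-cdo');
Tags.of(table).add('data-classification', 'confidential-pii');
Tags.of(table).add('retention-policy', '7-years');

app.synth();
```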

These mechanisms can be highly automated via service catalogs (such as AWS Service Catalog or ServiceNow), which allow compliant stacks to be provisioned without requiring developers to directly access the cloud providers APIs.

Let a thousand flowers bloom

The obvious downside to letting teams select their storage needs is the likely explosion of data stores – even if they are selected from a managed service catalog. But the expectation is that each distinct store would be relatively simple – at least compared to relational stores which support many application use cases and queries in a single database.

In on-premise situations, data integration is also a real challenge – usually addressed by a myriad of ad-hoc jobs and processes whose purpose is to extract data from one system and send it to another (i.e., ETL). Usually no meta-data exists around these processes, except that afforded by proprietary ETL systems.

In best case integration scenarios, ‘glue’ data flows are implemented in enterprise service buses that generally will have some form of governance attached – but which usually has the undesirable side-effect of introducing yet another dependency for development teams which needs planning and resourcing. Ideally, teams want to be able to use ‘dumb’ pipes for messaging, and be able to self-serve their message governance, such that enterprise data governance tools can still know what data is being published/consumed, and by whom.

Cloud provides two main game-changing capabilities for managing data complexity at scale. Specifically:

  • All resources that manage data can be tagged with appropriate meta-data – without needing to, for example, examine tables or know anything about the specifics about the data service. This can also extend to messaging services.
  • Serverless functions (e.g., AWS Lambda, Azure Functions, etc) can be used to implement ‘glue’ logic, and can themselves be tagged and managed in an automated way. Serverless functions can also be used to do more intelligent updates of data management meta-data – for example, update a specific repository when a particular service is instantiated, etc. Serverless functions can be viewed as on-demand microservices which may have their own data stores – usually provided via a managed service.
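As a sketch of the second point (all names hypothetical): a serverless ‘glue’ function, triggered whenever a tagged data resource is created or changed, records it in a metadata catalog – here a DynamoDB table standing in for whatever governance store the firm actually uses:

```typescript
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { EventBridgeEvent } from 'aws-lambda';

const ddb = new DynamoDBClient({});
const CATALOG_TABLE = process.env.CATALOG_TABLE ?? 'data-catalog'; // hypothetical governance store

// Triggered (e.g. via EventBridge) when a data resource is created or re-tagged;
// keeps the data management catalog current without a human in the loop.
export const handler = async (event: EventBridgeEvent<string, any>): Promise<void> => {
  const detail = event.detail ?? {};
  await ddb.send(new PutItemCommand({
    TableName: CATALOG_TABLE,
    Item: {
      resourceArn: { S: detail.resourceArn ?? 'unknown' },
      dataOwner: { S: detail.tags?.['data-owner'] ?? 'unassigned' },
      classification: { S: detail.tags?.['data-classification'] ?? 'unclassified' },
      recordedAt: { S: new Date().toISOString() },
    },
  }));
};
```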

Data, Data Everywhere

By adopting a cloud-enabled microservice architecture, using datastore services provisioned by code, applying event driven architecture, leveraging serverless functions, and engaging with the chief data officer for meta-data standards, it will be possible to have an unprecedented up-to-date view of what data exists in an organization and where. It may even address static views of data in motion (through tagging queue and notification topic resources). The data would be maintained via policies and rules implemented in service catalog templates and lambda functions triggered automatically by cloud configuration changes, so it would always be current and correct.

The CDO, as well as data and enterprise architects, would be the chief consumer of this metadata – either directly or as inputs into other applications, such as data governance tools, etc.

Conclusion

The ultimate goal is to avoid data management and governance processes which rely on reactive human (IT) input to maintain high-quality data management metadata. Reliable metadata can give rise to a whole new range of capabilities for stakeholders across the enterprise, and finally take IT out of the loop for business-as-usual data management queries, freeing up valuable resources for building even more data-driven applications.

The future of modularity is..serverless

[tl;dr As platform solutions evolve and improve, the pressure for firms to reduce costs, increase agility and be resilient to failure will drive teams to adopt modern infrastructure platform solutions, and in the process decompose and simplify monoliths, adopt microservices and ultimately pave the way to building naturally modular systems on serverless platforms.]

“Modularity” – the (de)composition of complex systems into independently composable or replaceable components without sacrificing performance, security or usability – is an architectural holy grail.

Businesses may be modular (commonly expressed through capability maps), and IT systems can be modular. IT modularity can also be described as SOA (Service Oriented Architecture), although because of many aborted attempts at (commercializing) SOA in the past, the term is no longer in fashion. Ideally, the relationship between business ‘modules’ and IT application modules should be fully aligned (assuming the business itself has a coherent underlying business architecture).

Microservices are the latest manifestation of SOA, but this is born from a fundamentally different way of thinking about how applications are developed, tested, deployed and operated – without the need for proprietary vendor software.

Serverless takes the microservices concept one step further, by removing the need for developers (or, indeed, operators) to worry about looking after individual servers – whether virtual or physical.

A brief history of microservices

Commercial manifestations of microservices have been around for quite a while – for example Spring Boot, or OSGi for Java – but these have very commercial roots, and implement a framework tied to a particular language. Firms may successfully implement these technologies, but they will need to have already gone through much of the microservices stone soup journey. It is not possible to ‘buy’ a microservices culture from a technology vendor.

Because microservices are intended to be independently testable and deployable components, a microservices architecture inherently rejects the notion of a common framework for implementing/supporting the microservices natures of an application. This therefore puts the onus on the infrastructure platform to provide all the capabilities needed to build and run microservices.

So, capabilities like naming, discovery, orchestration, encryption, load balancing, retries, tracing, logging, monitoring, etc which used to be handled by language-specific frameworks are now increasingly the province of the ‘platform’. This greatly reduces the need for complex, hard-to-learn frameworks, but places a lot of responsibility on the platform, which must handle these requirements in a language-neutral way.

Currently, the most popular ‘platforms’ are the major cloud providers (Azure, Google, AWS, Digital Ocean, etc), IaaS vendors (e.g., VMWare, HPE), core platform building blocks such as Kubernetes, and platform solutions such as Pivotal Cloud Foundry, OpenShift and Mesosphere. (IBM’s BlueMix/Cloud is likely to be superseded by Red Hat’s OpenShift.)

The latter solutions previously had their own underlying platform solutions (e.g., OSGi for BlueMix, Bosh for PKS), but most platform vendors have now shifted to use Kubernetes under the hood. These solutions are intended to work in multiple cloud environments or on-premise, and therefore in principle allow developers to avoid caring about whether their applications are deployed on-premise or on-cloud in an IaaS-neutral way.

Decomposing Monolithic Architectures

With the capabilities these platforms offer, developers will be incentivized to decompose their applications into logical, distributed functional components, because the marginal additional cost of maintaining/monitoring each new process is relatively low (albeit definitely not zero). This approach is naturally amenable to supporting event driven architectures, as well as more conventional RESTful and RPC architectures (such as gRPC), as running processes can be mapped naturally to APIs, services and messages.

But not all processes need to be running constantly – and indeed, many processes are ‘out-of-band’ processes, which serve as ‘glue’ to connect events that happen in one system to another system: if events are relatively infrequent (e.g., less than one every few seconds), then no resources need to be used in-between events. So provisioning long-running docker containers etc may be overkill for many of these processes – especially if the ‘state’ required by those processes can be made available in a low-latency, highly available long-running infrastructure service such as a high-performance database or cache.

Functions on Demand

Enter ‘serverless’, which aims to specify the resources required to execute a single piece of code (basically a functional monolith) on-demand in a single package – roughly the equivalent of, for example, a declarative service in OSGi. The runtime in which the code runs is not the concern of the developer in a serverless architecture. There are no VMs, containers or side-cars – only functions communicating via APIs and events.

Currently, the serverless offerings by the major cloud providers are really only intended for ‘significant’ functions which justify the separate allocation of compute, storage and network resources needed to run them. A popular use case is ‘transformational’ functions which convert binary data from one form to another – e.g., create a thumbnail image from a full image – which may temporarily require a lot of CPU or memory. In contrast, an OSGi Declarative Service, for example, could be instantiated by the runtime inside the same process/memory space as the calling service – a handy technique for validating a modular architecture without worrying about the increased failure modes of a distributed system, while allowing the system to be readily scaled out in the future.
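A sketch of that thumbnail use case (assuming an S3-triggered Lambda and the `sharp` image library; the destination bucket naming is illustrative): compute is only consumed while a resize is actually running.

```typescript
import { S3Client, GetObjectCommand, PutObjectCommand } from '@aws-sdk/client-s3';
import { S3Event } from 'aws-lambda';
import sharp from 'sharp';

const s3 = new S3Client({});

// Invoked on upload to the source bucket; no server sits waiting for images to arrive.
export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    const original = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    const body = Buffer.from(await original.Body!.transformToByteArray());

    const thumbnail = await sharp(body).resize({ width: 200 }).toBuffer();

    await s3.send(new PutObjectCommand({
      Bucket: `${bucket}-thumbnails`, // illustrative destination bucket
      Key: key,
      Body: thumbnail,
    }));
  }
};
```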

Modular Architectures vs Distributed Architectures

Serverless functions can be viewed as ‘modules’ by another name – albeit modules that happen to require separately allocated memory, compute and storage to the calling component. While this is a natural fit for browser-based applications, it is not a great fit for monolithic applications that would benefit from modular architectures, but not necessarily benefit from distributed architectures. For legacy applications, the key architectural question is whether it is necessary or appropriate to modularize the application prior to distributing the application or migrating it to an orchestration platform such as Kubernetes, AWS ECS, etc.

As things currently stand, the most appropriate (lowest risk) migration route for complex monolithic applications is likely to be a migration of some form to one of the orchestrated platforms identified above. By allowing the platform to take care of ‘non-functional’ features (such as naming, resilience, etc), perhaps the monolith can be simplified. Over time, the monolith can then be decomposed into modular ‘microservices’ aligned by APIs or events, and perhaps eventually some functionality could decompose into true serverless functions.

Serverless and Process Ownership

Concurrently with decomposing the monolith, a (significant) subset of features – mainly those not built directly using the application code-base, or which straddle two applications – may be meaningfully moved to serverless solutions without depending on the functional decomposition of the monolith.

It’s interesting to note that such an architectural move may allow process owners to own these serverless functions, rather than relying on application owners, where often, in large enterprises, it isn’t even clear which application owner should own a piece of ‘glue’ code, or be accountable when such code breaks due to a change in a dependent system.

In particular, existing ‘glue’ code which relies on centralized enterprise service buses or equivalent would benefit massively from being migrated to a serverless architecture. This not only empowers teams that look after the processes the glue code supports, but also ensures optimal infrastructure resource allocation, as ESBs can often be heavy consumers of infrastructure resources. (Note that a centralized messaging system may still be needed, but this would be a ‘dumb pipe’, and should itself be offered as a service.)

Serverless First Architecture

Ultimately, nirvana for most application developers and businesses is a ‘serverless-first’ architecture, where delivery velocity is only limited by the capabilities of the development team, and solutions scale both in function and in usage seamlessly without significant re-engineering. It is fair to say that serverless is a long way from achieving this nirvana (technologies like ‘AIOps’ have a long way to go), and most teams still have to shift from monolithic to modular and distributed thinking, while still knowing when a monolith is the most appropriate solution for a given problem.

As platform solutions improve and mature, however, and the pressure mounts on businesses whose value proposition is not in the platform engineering space to reduce costs, increase agility and be increasingly resilient to failures of all kinds, the path from monolith to orchestrated microservices to serverless (and perhaps ‘low-code’) applications seems inevitable.
