Making sense of serverless

The now-annual ServerlessConf NYC event was held in October 2019. This was a great opportunity to assess the current state of ‘serverless’ – what it is, why it’s important, whether and how it should factor into enterprise cloud planning, and what challenges exist around the serverless space.

The content below is informed by presentations given during the conference, but opinions and conclusions are my own.

What exactly is serverless?

‘Serverless’ architectures are often equated with ‘functions-as-a-service’ (FaaS), but ‘serverless’ extends far beyond FaaS. At the simplest – and most pragmatic – level, ‘serverless’ architectures do not have any ‘servers’ (i.e., long-running virtual and/or physical machines) to manage – no OS patches, no upgrades, no capacity to manage, etc. Operational overhead related to infrastructure management is zero, which means development teams are 100% responsible for their serverless application architectures.

Of course, the servers do not disappear: they are merely 100% managed by ‘someone else’, with little or no engagement required between development teams and those who manage the serverless infrastructure.

Architecturally, serverless is more than simply no infrastructure to manage. ‘Pure’ serverless architectures exhibit the following characteristics:

  • Are event-driven
  • Are function-oriented
  • Use managed services
  • Are scalable to zero and up

It’s worth noting that managed services in a serverless architecture may not themselves be built on ‘serverless’ architectures. However, the technology used to deliver each managed service is not exposed in a specific serverless solution architecture.
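As an illustrative sketch (all names are hypothetical), the first two characteristics – event-driven and function-oriented – amount to writing stateless handlers that a managed runtime invokes per event:

```python
import json

def handle_order_event(event: dict) -> dict:
    """Hypothetical event handler: invoked per event by a managed
    runtime, holds no server state, and scales from zero with demand."""
    body = event.get("body", {})
    order = json.loads(body) if isinstance(body, str) else body
    # Compute the order total; persistence would go to a managed service.
    total = sum(item["price"] * item["qty"] for item in order.get("items", []))
    return {"statusCode": 200, "body": json.dumps({"total": total})}
```

There is no process lifecycle in this code at all – capacity, scaling and patching are the runtime’s problem, not the handler’s.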

Why is serverless important?

From a purely economic perspective, serverless enables the introduction of true value-driven businesses (see Simon Wardley’s perspective here), enabling a whole new economics around automation and infrastructure. Essentially, if no value is being delivered, then no cost is being incurred on a serverless architecture (assuming usage of services constitutes ‘value’ delivered).

Because of the traditionally high cost of supporting and maintaining application infrastructure, businesses have historically put a lot of effort into planning new software; new features tend to increase both change and operating costs, usually more than linearly. This makes businesses more change-averse over time, leading to over-planning, a lack of agility and significantly reduced pace of innovation delivery.

Currently (as of 2019), the fashion is to invest in cloud computing to dampen change and operating costs – although this investment is still predominantly in the IaaS space (i.e., using the cloud’s superior economics for compute, storage and networking). Enterprises moving beyond IaaS are faced with either committing to a specific cloud provider’s infrastructure PaaS solutions (such as AWS ECS, Fargate, etc), or investing a lot of effort in building and operating their own orchestration and runtime solutions (usually using Kubernetes as the enabler).

But the war to deliver customer value is being fought above the runtime or infrastructure PaaS, as suggested by Simon Wardley. The risk to many enterprises is that they win the battle for infrastructure PaaS but lose the war to deliver customer value. Indeed, one conference presentation (entitled ‘Killing Kubernetes’) gave a real-world example of how running down the Kubernetes path prematurely can cause a team to lose sight of the customer value to be delivered. (The team in the end decided to go full serverless, and ditch Kubernetes.)

Using a tool like Wardley Maps enables clarity of thought with respect to critical platform-level components – which battles it makes sense to fight, and where it makes sense to leverage what industry innovation will provide.

As Ben Kehoe describes, the point of serverless is to provide focus on business value – it’s not about functions, technology, cost or operations.

Changing Build vs Buy Mindsets

Many enterprises have a policy of ‘buy over build’ – i.e., buy (and customize) a solution rather than build a solution. Customized off-the-shelf solutions have their advantages, but often ultimately lead to businesses being constrained by vendor roadmaps, or by the cost of upgrading/keeping pace with vendor advances. In particular, vendor software is optimized for configurability, whereas what enterprises need is extensible software rather than configurable software.

Serverless provides organizations which do not have deep engineering expertise with a path towards ‘build over buy’. Functions and workflows proprietary to a business can be implemented with serverless functions, while the use of managed services minimizes the need for infrastructure expertise. Integrating 3rd-party software-as-a-service solutions also becomes second nature in a serverless environment, particularly with the advent of integration tools such as AWS EventBridge. Such an architecture is readily extensible, and better suited to meeting enterprise needs.

Serverless and the Enterprise

Serverless architectures are being actively used by all kinds of businesses with great success (aCloudGuru, a sponsor of ServerlessConf NYC, is just one example of a successful serverless user). Architectures which rely exclusively on managed services (such as AWS S3, Aurora, Lambda, StepFunctions, DynamoDB, etc) can be considered serverless.

Many enterprises are partially serverless on the cloud, choosing to leverage a cloud provider’s managed service offering as part of a ‘serverful’ solution architecture (e.g., using S3 with EC2). But these architectures do not provide the full benefits of a true ‘serverless’ architecture, as considerable effort is still required to manage the non-serverless elements.

It should be noted as well that just because traditional enterprise software is made available as a ‘managed service’, doesn’t mean that the enterprise overhead of managing that service is reduced: if the cloud provider still exposes all the configurable aspects of the software, there will not be a significant benefit in moving to the managed service. (Microsoft enterprise applications offered on Azure seem to suffer from this affliction.)

Fundamentally, serverless is not yet ready to take on all enterprise workloads – there are many constraints and conflicts between standardizing the serverless runtimes (necessary to allow them to be managed efficiently at scale) and the customization needs of enterprises. In particular, how systems manage state with sufficient performance is likely to remain a challenge – although this is certainly solvable, as emerging architectural best practices for a serverless world establish themselves (the traditional model-view-controller pattern being a poor fit for serverless applications).

For most enterprises, therefore, business solutions should be planned as if performant platform solutions exist, with clarity on what the functions are, what the managed services are, and what platform capabilities are assumed. These can then drive further investment decisions to build out (or buy/rent) these capabilities. A ‘serverless first’ mindset is key to this.

Underpinning all of this is organizational design – in particular, the concepts and ideas espoused in Team Topologies map very well to this approach.

Decomposing the Monolith

A key use case for serverless is enabling the decomposition of legacy monolithic architectures. Most enterprises do not have the skills or expertise to successfully migrate complex monolithic architectures to microservices, as this requires significant skill in developing and managing highly distributed systems. While technologies like CloudFoundry and SpringBoot go a long way towards minimizing the cognitive load for application developers, organizations require considerable investment to make these technologies available as true managed services across an enterprise.

Serverless offers a route to decompose monolithic architectures without first building out the full capabilities needed to deploy serverful microservice architectures. It allows enterprises to experiment with service-based business solutions without incurring significant or hard-to-reverse costs in infrastructure and/or skills. Once a decomposed architecture begins to prove its worth, it may be unavoidable (for now) to move to serverful microservices at the back-end to scale out, but the business value proposition should be clear by then.

Serverless Challenges

Serverless architectures have their own challenges that organizations need to be prepared to handle – challenges that differ from those involved in building serverful architectures.

Key challenges exist around:

  • Security
  • Local development and testing
  • Debugging, tracing, monitoring and alerting
  • Limit Management
  • Resilience
  • Lock-in
  • Integration testing
  • Serverless infrastructure-as-code

The above are the challenges specifically raised during the conference. Other challenges may yet reveal themselves.

Security

Serverless requires a different security model than traditional infrastructure. Specifically, security for serverless centers around security of functions, security of data, and security of configuration.

Key attack surfaces for serverless are event data injection, unauthorized deployments, and dependency poisoning. In particular, over-privileged permissions present a significant attack surface. A good list of attack surfaces is published by Palo Alto/Puresec, a sponsor of the conference.
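As one small mitigation for event data injection, a function can validate incoming event data against a strict schema before acting on it. The sketch below is purely illustrative – the field names and rules are invented:

```python
def validate_payment_event(event: dict) -> dict:
    """Reject malformed or injected event data before any processing.
    All field names and rules here are hypothetical."""
    allowed = {"account_id", "amount", "currency"}
    unknown = set(event) - allowed
    if unknown:
        # Refuse anything with unexpected fields rather than silently ignoring them.
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    if not isinstance(event.get("account_id"), str) or not event["account_id"].isalnum():
        raise ValueError("invalid account_id")
    amount = event.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("invalid amount")
    if event.get("currency") not in {"USD", "EUR", "GBP"}:
        raise ValueError("invalid currency")
    return event
```

Strict allow-listing of fields and values keeps a poisoned or crafted event from reaching the function’s business logic.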

Serverless components therefore need their own security solutions, as part of an over-arching defense-in-depth security strategy.

Local development and testing

By definition, managed services cannot be run in local (laptop/desktop) development environments, and neither can serverless runtimes such as lambda. Instead, development is expected to happen directly on the cloud, which can cause issues for developers who are periodically disconnected from the internet.

For some development teams, the ability to code and test away from the cloud is important, and in this regard, cloud providers are beginning to standardize more on OCI containers in their serverless runtimes, allowing developers to run these containers locally on their laptops as well as on standard orchestrated environments such as Kubernetes. Azure and GCP seem to be leading the way in this space, but AWS is constantly improving its lambda runtimes and offering developers more ways to customize them, so it may eventually offer the same capabilities.

The challenge, however, will be to maintain the benefits of serverless while ensuring teams do not end up managing containers as the new ‘servers’ – a trap many teams are likely to fall into.

Debugging, tracing, monitoring and alerting

The challenges here are not unique to serverless – microservices architectures have these challenges in spades. While cloud providers typically provide managed services to assist with these (e.g., AWS X-Ray, AWS CloudWatch, etc.), a rich ecosystem of 3rd parties also helps to address these needs.

In general, while it is possible to get by with provider-native solutions, it may be best to augment team capabilities with a vendor solution, such as Lumigo, the Serverless Framework, Datadog, Epsagon, etc.

Limit Management

All serverless services have limits, usually defined per account. These limits protect lambdas and managed services (such as AWS DynamoDB) from being overloaded by rogue applications.

Usually, limits can be increased, but may need a service request to the cloud provider. Limits can also be imposed per account at an enterprise level (for example, via AWS Organizational Units).

It is important that the service limits are known and understood, as incorrectly assuming no limits may have a material impact on a solution architecture. While serverless solutions can scale, they cannot scale infinitely.
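One practical consequence: clients should expect throttling and retry with backoff rather than assume unlimited capacity. A minimal sketch (the exception type and delays are illustrative, not a specific SDK’s API):

```python
import time

class ThrottlingError(Exception):
    """Raised by a (hypothetical) client when a service limit is hit."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.1):
    """Retry a throttled call with exponential backoff, giving the
    service room to recover instead of amplifying the overload."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise                      # budget exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Real SDKs often build this in, but the solution architecture still needs to account for what happens when retries are exhausted.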

Resilience

Resilience for managed services is different from resilience of functions-as-a-service. Managed services need to be available at all times – but the manner and means by which such services maintain availability is generally opaque to the user. Some services may be truly global, but cloud providers tend to make managed services resilient within a specific region (through multiple availability zones in that region), which requires solution architectures to allow for redundancy across multiple regions in the event a single region fails in its entirety. Recovery in these scenarios may not need to be 100% automated, depending on recovery time objectives.

For functions-as-a-service (lambdas), if an invocation fails, it should be safe for the runtime to try again – i.e., event processing should be idempotent. The runtime then provides most of the resilience.
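A minimal sketch of idempotent event handling – here an in-memory set stands in for what would, in a real function, be a durable store keyed by event id:

```python
# In production this would be a durable store (e.g. a table keyed by event id);
# an in-memory set is used here only to keep the sketch self-contained.
processed = set()

def handle_event(event: dict) -> str:
    """Process an event at most once per event id, so the runtime can
    safely redeliver the same event after a failure."""
    event_id = event["id"]
    if event_id in processed:
        return "skipped"       # duplicate delivery: side effects not repeated
    # ... perform the actual side effect here ...
    processed.add(event_id)
    return "processed"
```

With this shape, a retried or duplicated delivery is harmless, which is exactly what lets the runtime retry freely.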

However, if a lambda depends on a ‘traditional’ service (i.e., not in itself dynamically scalable), there may be resilience issues. For example, a lambda connecting to a traditional relational database via SQL may run out of available server-side connections.

Resource constraints apply to any API which is not fronting a serverless architecture. Lambdas therefore need sufficient resilience built in (e.g., the circuit breaker pattern) so that constraints in other APIs do not cause the lambda to fail.
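As an illustrative sketch of the circuit breaker pattern (the class and thresholds are hypothetical, not from any particular library): after repeated downstream failures, calls fail fast for a cool-down period instead of hammering the constrained API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    calls fail fast for `reset_after` seconds instead of continuing to
    hammer a constrained downstream API."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream unavailable")
            self.opened_at = None   # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0           # success resets the failure count
        return result
```

The fast failure gives the lambda a chance to degrade gracefully (e.g., queue the work) while the downstream system recovers.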

Lock-in

Many enterprises are reluctant to use a particular cloud provider’s serverless model, as such models tend to be highly proprietary and cloud-provider specific; moving to another cloud, or enabling a solution to run on any cloud provider, could therefore involve considerable re-engineering expense.

Firms which are constrained by regulatory or other drivers to avoid provider lock-in have options available. Firms can use multi-cloud serverless frameworks such as the Serverless Framework. In addition, vendors are appearing in the multi-cloud messaging space, with the likes of TriggerMesh offering a serverless multi-cloud event bus.

Some cloud providers are making source code for their lambda services publicly available – for example, Google Cloud Functions and Azure Functions. Open-source serverless solutions such as Google’s Knative and OpenFaaS are also available. In addition, some vendors, such as Platform9, provide a completely independent solution for lambdas, for organizations which want to deploy lambdas internally – for example, on Kubernetes.

Other mechanisms to minimize the effect of lock-in include the use of standard OCI or docker containers to host serverless functions, which may allow containers to run in other orchestration environments without requiring significant rework. (This doesn’t really help if the container relies on external provider-specific managed services, however.)

Regardless of steps taken to avoid lock-in, cloud providers will offer managed services that are proprietary to them: once software is built to leverage such a managed service, you have a form of lock-in (in much the same way, for example, you may be locked in to Oracle or Microsoft databases once you commit to using their proprietary features).

As such, focusing on avoiding lock-in is, for many firms, going to result in unnecessary complexity. It may be better to exploit a given cloud provider, and manage the business risk associated with a complete provider outage. For regulated services, however, regulators may want to ensure regulated firms are not overly concentrated in one provider.

Integration testing

Integration testing is never easy to fully automate – this is partly why there is so much focus on microservices, as each microservice is an independently testable and deployable component. The same applies to lambdas. But each lambda may itself depend on multiple managed services, so how should those be tested? An excellent piece on serverless testing by Paul Johnston describes the challenge well:

The test boundaries for unit testing a FaaS Function appears to be very close to an integration test versus a component test within a microservice approach.

In essence, because all serverless features are available through APIs, it *should* be easier to build and maintain integration tests, but for now it is still harder than it ought to be.
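A sketch of what such an API-driven, integration-style test can look like – the handler, event shape and storage stub are entirely hypothetical, with a dict standing in for a managed storage service:

```python
def resize_image_handler(event: dict, storage: dict) -> dict:
    """Hypothetical handler under test: reads a key from a storage
    service (stubbed here as a dict) and records a derived output."""
    key = event["key"]
    storage[key + ".thumb"] = storage[key][:4]   # stand-in for real resizing
    return {"status": "ok", "output": key + ".thumb"}

def test_resize_integration():
    storage = {"cat.png": b"rawimagebytes"}      # seeded test fixture
    result = resize_image_handler({"key": "cat.png"}, storage)
    # Assert on the observable outputs, just as a cloud-side test would
    # assert on the managed service's state after invoking the function.
    assert result["status"] == "ok"
    assert storage["cat.png.thumb"] == b"rawi"
```

Against a real cloud, the same test shape applies: seed the managed service via its API, invoke the function, then assert on the resulting state through the API.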

Serverless Infrastructure as Code

There is a growing sense of dissatisfaction with the limitations of traditional YAML-based configuration languages with respect to serverless – in particular, that the lifecycle and dependencies of serverless resources are not properly represented in existing infrastructure configuration languages. Ben Kehoe gives a flavor of the issues, but this is a complex topic likely to get more attention in the future.

Summary

The key value proposition of serverless is that it permits application developers to focus more on delivering customer value, and to spend less time dealing with infrastructure concerns such as managing servers.

The time is right for organizations to start entering the serverless mindset, and to assess business solutions in the strategic context offered by serverless – whether that means ultimately using external services or designing internal services in a serverless way.

ServerlessConf 2019 was informative, and the presentations were generally accessible to a wide audience. For many presentations, it was not necessary to be a cloud engineer to understand the content and to appreciate the potential transformational opportunities of serverless in the coming years.

I hope that in future events, a broader coalition of business strategic planners and doers will be in attendance. It is definitely not a KubeCon, but engineering advances made at events like KubeCon will make the serverless vision possible, while freeing serverless practitioners from the complexities of managing containers, orchestrators and servers.


Decentralized Applications in the Cloud

[tl;dr Are decentralized applications implemented using cloud provider managed services really decentralized, and are they ready for prime time usage? Proof-of-work is not cloud-friendly, and proof-of-stake still immature. Open source blockchain solutions provide mechanisms to ensure secure operational independence of organizations on the same network, even if they are using the same cloud provider. Integrating decentralized applications with cloud-native applications may help build highly integrated platform eco-systems.]

A “decentralized application” (or ‘dapp’) is a technology solution to a business process that spans a number of independent, geographically distributed, and possibly untrustworthy organizations. A list of dapps can be found here, but most current dapps are somewhat frivolous in nature – similar to the websites that appeared on the WWW in the early nineties after Mosaic was released.

A decentralized application relies on cryptographic techniques to ensure trust (i.e., immutability of history, validity of transactions), and uses peer-to-peer communication mechanisms to avoid a centralized bottleneck.

Today, in the financial world in particular, regulatory-mandated organizations such as clearing houses, SEFs and the SWIFT financial network broker trust between independent organizations. These are centralized services where trust is assumed (by rule of law). Other trust brokers, apart from government, include industry consortia, community groups, non-governmental organizations, etc. While trust brokers will always exist (in non-anarchic societies!), the way in which they participate in enterprise value chains will likely change significantly in the coming years, as cloud- and blockchain-based technologies enable processes to extend beyond traditional enterprise boundaries.

The most famous decentralized application is Bitcoin, based on blockchain technology, which assumes zero trust between individual participants. Many decentralized applications rely on blockchain technology to provide trust, or at least consensus. The concept of ‘blocks’ in a blockchain is a practical optimization for truly decentralized applications; applications which merely require immutability and verifiability of transactions do not require blocks, but will then have centrally managed state. (AWS’s new QLDB service is an example of this.)

The concept of ‘blockchain-as-a-service’ (BaaS) has been around for a while – as evidenced by the number of companies operating in this space; these are principally tools to enable the building of decentralized applications and aim to isolate businesses from the complexities of blockchain development.

Recently, the major cloud service providers have entered the ‘BaaS’ space – AWS and Azure offer BaaS solutions, currently leveraging open-source offerings from Hyperledger and Ethereum – neither of which rely on Bitcoin technology, although both are blockchain-based. Other blockchain solutions, such as R3’s Corda, have yet to be added as foundational cloud provider services, but Corda nodes can be deployed on any IaaS cloud provider (e.g., AWS, Azure, GCP, etc.).

Implications of Blockchain on the cloud

There are some significant implications with respect to the use of blockchain technology in the cloud. The first is related to how trust is achieved effectively on cloud-based infrastructure, the second to how operationally decentralized applications built on cloud infrastructure really are.

Achieving Trust

In Bitcoin, trust is achieved through ‘proof-of-work’ – i.e., an untrusted node has to consume considerable resources (mostly power) to prove it has correctly validated a block of transactions. Multiple nodes compete for this validation, and there is a reward for the ‘winner’, which a majority of nodes must accept. The basic idea is that it would be extremely expensive, in terms of resource utilization and alliance building, to successfully validate ‘fake’ transactions.
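The mechanism can be illustrated with a toy proof-of-work sketch. Real Bitcoin mining hashes block headers with double SHA-256 at vastly higher difficulty, but the essential asymmetry – expensive mining, cheap verification – is the same:

```python
import hashlib

def mine(block_data: str, difficulty: int = 4) -> int:
    """Toy proof-of-work: find a nonce whose SHA-256 hash over the block
    data starts with `difficulty` hex zeros. Work grows exponentially
    with difficulty, which is what makes cheating prohibitively costly."""
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{block_data}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(block_data: str, nonce: int, difficulty: int = 4) -> bool:
    """Verification is cheap: a single hash, versus many for mining."""
    digest = hashlib.sha256(f"{block_data}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Any node can verify a claimed solution in one hash, while producing it takes (on average) 16^difficulty attempts – the economic core of proof-of-work.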

The problem for cloud providers is that this is a very inefficient use of cloud resources. ‘Blockchain-as-a-service’ for proof-of-work-based protocols (i.e., Bitcoin and its derivatives) is essentially provided by mining pools – entities (people, organizations) with specialized hardware who come together to solve Bitcoin proof-of-work puzzles and share the proceeds as a group. This is not the business the major cloud providers are in.

Therefore, proof-of-work is essentially not feasible to provide on the public cloud. Instead, other variants, such as ‘proof-of-stake’, or permissioned ledgers, are necessary. Technically, this precludes ‘Ethereum-as-a-service’ from being offered by cloud providers, as Ethereum still relies on proof-of-work, although Ethereum’s intention is to (eventually) move to proof-of-stake.

Unfortunately, these non-proof-of-work-based alternatives are still relatively immature and unproven. In particular, there is a potential issue around combinatorial complexity management in permissioned ledgers, if each organization has to define and manage its own ‘universe’ of participants and the data they have access to. (Note: this is not seen as an issue for public ledgers, as all participants have equal access to content on the network, so combinatorial complexity does not increase with the number of participants. Also, the content of public ledgers is transparent to everyone.)

Decentralized Operations

If all application instances in a decentralized blockchain service are running on the same cloud, how ‘decentralized’ is that really – from an operational perspective?

Clearly, the decentralized blockchain service will fail if the cloud provider itself suffers an outright failure (i.e., all regions simultaneously unavailable). This could bring entire industries or supply chains to their knees – but it is also (we hope) very unlikely, given the focus that cloud providers rightly have on resilience. We are putting a lot of trust in the cloud providers to do the right thing – although at some point, to allay systemic risk concerns, this may need to be enshrined in law (in much the same way utilities have legal obligations for service provision).

Managed solutions like AWS Managed Blockchain currently assume the blockchain service is entirely hosted in the provider’s cloud, and that every organization with nodes on the network has its own cloud account in which it operates infrastructure it controls.

In this architecture, every organization wholly controls its own infrastructure (in its own AWS account), the blockchain software running on that infrastructure, and the ledger data stored within it.

The blockchain service software itself (in the AWS case, Hyperledger Fabric) provides mechanisms to ensure new (ledger-updating) code is only deployed to the blockchain network when a quorum of participants have cryptographically signed the new code. This provides guarantees that ‘rogue’ code running on another member’s node cannot force an invalid consensus.
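At its core, the quorum rule reduces to a simple check. The sketch below is purely illustrative of the idea, not Hyperledger Fabric’s actual endorsement-policy API:

```python
def quorum_approved(approvals: set, members: set, quorum_fraction: float = 0.5) -> bool:
    """Hypothetical sketch of a quorum rule: new ledger-updating code is
    accepted only if more than `quorum_fraction` of the known members
    have (verifiably) signed it."""
    valid = approvals & members          # ignore signatures from unknown parties
    return len(valid) > len(members) * quorum_fraction
```

In a real network the approvals would be cryptographic signatures verified against each member’s registered identity, so a single rogue member cannot manufacture a quorum.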

In this way, individual organizations retain control over the operation of their part of the blockchain network, while remaining sufficiently decoupled from other organizations and how they operate their parts of the network. This achieves operational decentralization – the equivalent of each organization running its nodes in its own data center.

Achieving Decentralized Goals

A decentralized application is intended to remove process bottlenecks, or eliminate the ‘middle-man’ by ensuring multiple independent parties have the same reliable, consistent view of a shared dataset that cannot be tampered with, without relying on a single central 3rd party.

We have established that while the cloud service provider centralizes infrastructure and shared blockchain services, control over data (ledger state) and applications is truly decentralized. As long as the cloud provider itself is available, individual participants in the network can fail or be unavailable without impacting overall service availability.

Managing State in Decentralised Ledgers

A key challenge in any transactional system is transactionally recording both that an event occurred and the impact of that event – e.g., recording that Bob requested to transfer $X from his account to Alice’s, as well as updating both account balances. But state is application-specific, and many applications only need to know the event happened, preferring to manage their own view of state – analytics, machine learning and AI, for example, have very different states to manage for the same event.

For speed and integrity, some state may need to be managed in the decentralized ledger, but this should ideally be kept to a minimum. Instead, the event log is key, and most applications will likely need to maintain their own dynamic view of state which can be queried according to various application needs.
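A sketch of this event-log-first approach: each application replays the shared, immutable event log to build its own view of state (the transfer events here are hypothetical):

```python
def project_balances(events: list) -> dict:
    """Rebuild an application-specific view (account balances) by
    replaying the event log, rather than storing the view itself
    on the ledger."""
    balances = {}
    for ev in events:   # events are assumed to be in committed order
        balances[ev["from"]] = balances.get(ev["from"], 0) - ev["amount"]
        balances[ev["to"]] = balances.get(ev["to"], 0) + ev["amount"]
    return balances
```

An analytics or ML application would replay the same log into a completely different projection – the ledger stays minimal while each consumer owns its own queryable state.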

While blockchain solutions like Hyperledger offer some means to manage state (e.g., using CouchDB to capture and query complex JSON objects), treating a distributed ledger like a traditional operational datastore is likely to cause problems down the line.

Hyperledger Fabric is addressing this need by introducing the concept of the EventHub, which in principle allows organizations to build ‘listeners’ to act on committed events. Perhaps in the future this may be integrated with AWS EventBridge, allowing events committed to the decentralized ledger to be integrated into more traditional (non-decentralized) application workflows – enabling a powerful ecosystem of decentralized and cloud-native applications across multiple organizations to work together in a trusted manner.

Summary

As non proof-of-work blockchain solutions mature, decentralized business applications enabled via public cloud will likely be a major area of growth, as industries and special interest groups collaborate to solve common problems while retaining operational independence.

The nascent services provided by AWS and Azure are not yet ready for mission-critical, large-scale use, but look to be an excellent way for organizations to experiment with decentralized applications and re-imagine many legacy processes that currently rely on centralized intermediaries.

Finally, integration of decentralized applications with enterprise applications is likely to be a significant area of growth in the future.


Achieving Service Resilience on Cloud-Reliant Infrastructure

[tl;dr Rather than require banks to have multi-cloud or on-premise redundant strategies for critical digital services delivered via cloud, regulators should require firms to instead adhere to ‘well-architected’ guidelines to ensure technology resilience for a given provider or PaaS solution, as part of an overall industry-wide framework of service resilience that can accommodate very rare but impactful events such as cloud provider failure.]

Resilience is the ability of a ‘system’ of processes to recover quickly from a significant shock, ideally without any noticeable impact on those depending on that system. Systems are comprised of people, technologies and environments (e.g., laws, physical locations, etc), any of which can ‘fail’.

Financial regulators are particularly concerned with resilience, as failures in banking systems can cause business disruption, serious personal inconvenience, and even the potential for economic instability.

For many years, banks have had strict requirements to ensure critical systems (“applications”) could be recovered in the event of a disaster (specifically, a data center failure incurring loss or unavailability of all software, hardware and data). Enterprise-wide people and environment ‘failure’ are often handled within an overall operational risk and business continuity management framework, but technology organizations are accountable for managing technology failure risk – which is to say, an application will be guaranteed to be available within a fixed amount of time in the event of a major infrastructure failure or breach.

In recent times, regulators have been focusing more on business service resilience (see, for example, the UK’s Financial Conduct Authority statement on business service resilience). In the UK, a succession of high-profile loss of digital services (through which most UK customers interact with their bank) in several banks has put increased focus on the area of operational resilience. To date, most such incidents can be attributed to poor technical debt management rather than anything specifically cloud-related – but on cloud, a business-as-usual approach to technical debt will greatly amplify the likelihood and impact of failures.

So, as banks move their infrastructure to leverage public cloud capabilities, the question arises as to how to meet operational resilience demands in various cloud failure scenarios – from failure of individual hosted servers, to loss of a particular provider data center, to loss of a particular infrastructure service in a region, or a complete loss of all service globally from a particular provider.

Related to provision of service is stewardship of related data. Assuming a service cannot be provided without access to and availability of underlying data services, cloud resilience solutions must also address data storage and protection.

Cloud Resilience Approaches

There are multiple ways to address cloud provider availability risk, but all have negative implications for agility and innovation – i.e., the key reasons firms want to use the cloud in the first place.

Mitigating solutions could include the following:

  • Deploy to cloud but maintain on-premise redundant storage, processing and networking facilities to fall-back on (an approach used by Barclays)
  • Implement multi-cloud solutions so that new compute infrastructure can be spun up when needed on an alternate cloud provider, with full cross-cloud data replication always enabled
  • Maintain on-premise deployment for all production and resilience workloads, and only use cloud for non-production environments.

The disadvantage is that all of these strategies treat cloud as a lowest common denominator – i.e., as infrastructure-as-a-service. This implies firms need to build or buy capabilities to enable on-premise or multi-cloud ‘platform-as-a-service’ – a significant investment, likely only justifiable for larger firms. It also effectively excludes a cloud provider’s proprietary services from consideration in solution architectures.

It should be noted as well that multi-cloud data replication strategies can be expensive: cloud providers charge for data egress, and maintaining a complete copy of data in another cloud can incur considerable egress charges on an ongoing basis.
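The scale of ongoing egress cost is easy to estimate. A back-of-envelope sketch – the per-GB price used here is an illustrative assumption, not any provider's quoted tariff:

```python
def monthly_egress_cost_usd(gb_replicated_per_day: float,
                            price_per_gb: float = 0.09) -> float:
    """Back-of-envelope cost of continuous cross-cloud replication.

    price_per_gb is an illustrative assumption; real tariffs are tiered
    and vary by provider and destination.
    """
    return gb_replicated_per_day * 30 * price_per_gb

# Replicating 500 GB of changed data per day to a second cloud:
cost = monthly_egress_cost_usd(500)
assert round(cost) == 1350  # ~$1,350/month before storage and compute
```

Even modest change volumes, replicated continuously, become a material line item – which is why "always-on" multi-cloud replication is usually reserved for the most critical data sets.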

The challenge gets more nuanced when software-as-a-service providers or business-process-as-a-service providers form part of the critical path for a regulated business service. Should a firm therefore have multiple SaaS or BPaaS providers to fall back on in the event of a complete failure of one such provider? In many cases, this is not feasible, but in other cases, the cost of keeping a backup service on ‘standby’ may be acceptably low.

Clearly, there is a point at which there are diminishing returns to actively mitigating these risks – but ignoring these ‘fat-tail risks’ completely runs the risk that a business could disappear overnight after one failure in its ‘digital supply chain’.

Regulators are taking note of this in the UK and in the US, with increased focus on business resilience for key processes, and the role that cloud providers are likely to play in the future in this.

Architecting for Resiliency

Building resilient systems is hard. A key factor in any resilient system is decentralization: the most resilient systems are highly decentralized, with no ‘brain’ or ‘heart’ whose failure can bring down the whole system. Examples of modern resilient systems are the Internet and Bitcoin – both can survive significant failure, with the impact of failure localised rather than systemic. (In the not-so-distant future, a truly resilient digitally-based financial infrastructure may have no choice but to be built on decentralized, blockchain-like technologies… in other words, there may be diminishing cost/benefit returns for ‘too big to fail’ banks implementing digital resilience on traditional, centralised financial infrastructure architectures, even if cloud-based.)

Cloud providers, and services built on top of them, must by necessity be fundamentally resilient to failure. AWS, for example, ensures that each of its regions is technically isolated from the others. A regional S3 outage in 2017 highlighted the dependency many businesses and services have on the S3 service. Amazon’s response included changes to operational practices to mitigate the risk of a recurrence, but a future regional outage of any cloud service is not only possible, it is relatively likely (although generally less likely than the failure of an enterprise’s traditional data center, thanks to multiple redundant availability zones – i.e., data centers – in each region). So cloud providers typically provide regional isolation, encourage customers (through programs like the AWS Well-Architected review or the Azure Architecture Center) to design their systems’ resilience around it, and provide hooks and interfaces to enable cross-region redundancy and fail-over of region-specific services.

Along with architecting for resiliency, organizations need to validate the resiliency of their solutions. Techniques like chaos-engineering ensure teams do not fear change or failure, and are confident in their system’s response to production failures.
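At its core, a chaos experiment wraps a dependency with fault injection and asserts that the system still degrades gracefully. A minimal sketch – the `fetch_rates` dependency and fallback logic are hypothetical, for illustration only:

```python
import random

def flaky(fn, failure_rate=0.5, rng=random.random):
    """Wrap a dependency so a fraction of calls raise (fault injection)."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected failure")
        return fn(*args, **kwargs)
    return wrapped

def get_rates_with_fallback(fetch_rates, cached_rates):
    """Business logic under test: degrade to a cached value, never fail."""
    try:
        return fetch_rates()
    except ConnectionError:
        return cached_rates  # graceful degradation

# Chaos experiment: even with 100% injected failure, the service answers.
always_fail = flaky(lambda: {"EURUSD": 1.10}, failure_rate=1.0)
assert get_rates_with_fallback(always_fail, {"EURUSD": 1.08}) == {"EURUSD": 1.08}
```

Production chaos tooling (e.g., the Netflix-originated Chaos Monkey approach) applies the same idea at the infrastructure level: terminate real instances and verify that service-level behaviour is unaffected.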

Multi-cloud Options

Cloud providers, no matter how much their customers ask for it, are not likely to make multi-cloud services simpler for their customers to implement. It is in their interest (and is indeed their strategy) to assume that their customers are wholly dependent on them, and consequently to take resilience extremely seriously. Treating a competitor as the “disaster recovery” site for customers would be a significant step backwards for the industry. While none of this is likely to change imminently, there is no room for complacency; perhaps this is where some regulatory oversight of providers may, in the future, be beneficial.

There is, however, room in the market for multi-cloud specialists (and hybrid on-premise/cloud specialists). The biggest player in this space is VMware/Dell, with its recent (re)acquisition of Pivotal, which includes cloud-neutral technologies like the open-sourced Cloud Foundry and Pivotal Container Service. VMware is going all-in on Kubernetes, as are IBM/Red Hat and others. Kubernetes offers enterprises and third-party vendors the opportunity to build their own on-premise ‘cloud’ infrastructure behind a standard orchestration API, which in principle can also run on any cloud provider’s IaaS offerings. These steps all point to a future where, irrespective of whether the ‘cloud’ infrastructure provider is public or private, container-based, software-provisioned infrastructure management is the norm – and with it comes a fundamental shift in thinking about failure.

Cloud is not a Silver Bullet for Resilience

For enterprises, the choices on offer for implementing cloud are many and varied: all ultimately benefit organizations and their stakeholders by providing more flexibility at a lower cost than ever before – as long as systems on the cloud are architected for failure.

System availability (aka resilience) has traditionally been measured via mean-time-between-failures (MTBF), where MTBF is defined as mean-time-to-failure (MTTF) plus mean-time-to-recover (MTTR).

Many on-premise systems are architected to maximize MTTF – i.e., aiming for a long time between failures rather than a fast recovery. Distributed systems favor a small MTTR over a large MTTF, and cloud is no exception. Systems explicitly architected for a large MTTF and a relatively large MTTR (hours, not seconds) will, in general, be difficult to migrate to cloud without significant re-engineering – often to the extent that it is not financially feasible. (MTTR for legacy systems is roughly equivalent to the ‘recovery time objective’ (RTO) in business continuity planning, and recovery in such systems is generally assumed to be a highly manual process.)
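The arithmetic behind this trade-off is simple: steady-state availability is MTTF / (MTTF + MTTR), so shrinking recovery time can buy as much availability as stretching time-to-failure. A quick check with illustrative numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Legacy posture: fail rarely (1 year MTTF) but recover manually (4h MTTR).
legacy = availability(8760, 4)
# Cloud-native posture: fail often (1 week MTTF) but recover in 30 seconds.
cloud = availability(168, 30 / 3600)

assert cloud > legacy  # small MTTR beats large MTTF here
```

With these (assumed) figures, the legacy system achieves roughly "three nines" while the frequently-failing, fast-recovering system achieves better than "four nines" – which is why cloud-native design optimizes recovery, not failure avoidance.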

Cloud-native applications have the advantage that they can be architected for a low MTTR from the outset (the AWS Well-Architected reliability guidelines highlight this), taking into account the relatively small MTTF of foundational cloud components – in particular, virtual machines and containers.

Going Serverless

It is easy to get lost in all of the infrastructure innovation happening to enable flexible, low-cost and resilient systems. But ultimately enterprises don’t care about servers – they care about (business) services. Resilience planning should be built around service availability, and concerns like orchestration, redundancy, fail-over, observability, load-balancing and data replication should all be provided by underlying platforms, so that applications can focus to the fullest extent possible on business functionality, not infrastructure.

To date, organizations seem to have taken a primarily infrastructure-first approach to cloud. This is helpful for building essential cloud engineering and operations competencies, but is unlikely to yield the 10x benefits sought by digital transformation agendas, as the focus is still on infrastructure (specifically servers, networks and storage). A more useful longer-term approach is to take a serverless-first stance in solution design – i.e., to identify where resilient stateful services are needed, and which APIs, events and services that use them need to be created or instantiated. Identify third-party or internal infrastructure services which are ‘serverless’ (for example, AWS Cognito, AWS Lambda/Azure Functions/Google Cloud Functions, Office 365, ServiceNow, …) and plan architectures which use these across all phases of the SDLC. Where the services required to achieve the ‘serverless’ target state do not exist, those capabilities should be built or bought. This is essentially the approach advocated by AWS, and should, in my view, form the core of enterprise digital technology roadmaps.

Resilient architectures will consist of multiple ‘serverless’ services, connected via many structured asynchronous event streams or synchronous APIs. Behind each event/API will be a highly automated, resilient platform capable of adjusting to demand and responding to failure through graceful degradation of services. The choice of orchestration platform technology, and its physical location, should be largely neutral to service consumers – in much the same way that users of AWS Lambda don’t know or (for the most part) care about the details of how it works behind the scenes. Standards like the open-sourced AWS Serverless Application Model (SAM) enable this for Lambda; other emerging cloud-neutral serverless standards, such as Knative and OpenFaaS, also exist.
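The “many services connected by events” shape can be sketched in a platform-neutral way: each service is a stateless handler that consumes a structured event and emits another, with no knowledge of the platform that delivers them. A toy illustration (the event shapes and field names are invented for this example, not any framework's contract):

```python
import json

def handler(event: dict) -> dict:
    """A platform-neutral, stateless event handler: validate, transform, emit.

    Accepts either a raw event dict or an API-gateway-style envelope with a
    JSON string under "body" (a common serverless convention).
    """
    order = json.loads(event["body"]) if "body" in event else event
    # Emit a downstream event; the orchestration platform handles delivery,
    # retries and scaling - none of that appears in the business logic.
    return {
        "type": "OrderPriced",
        "orderId": order["orderId"],
        "total": sum(i["price"] * i["qty"] for i in order["items"]),
    }

evt = {"orderId": "o-1", "items": [{"price": 2.5, "qty": 4}, {"price": 1.0, "qty": 1}]}
assert handler(evt) == {"type": "OrderPriced", "orderId": "o-1", "total": 11.0}
```

Because the handler owns no infrastructure, the same function can sit behind a synchronous API, an event stream, or a local test harness unchanged – which is precisely the neutrality argued for above.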

While these technologies are still maturing and not yet robust enough for every demanding scenario, framing architecture in a ‘serverless’ way helps identify which technologies and providers to use where to provide overall resilience. In particular, planning tools like Wardley Maps can help map out where and when it makes most business sense to transition from a serverful on-premise current state (custom infrastructure), to serverful cloud (rental infrastructure/IaaS – private or public), to the serverless target state (commodity PaaS/SaaS), or ultimately to BPaaS.

Summary

Fundamentally, the near-term goal is to avoid restricting firms’ adoption of cloud by unnecessarily limiting their technology choices to a ‘least-common-denominator’ of provider services for infrastructure services that can be delivered cloud-natively. For hybrid or multi-cloud solutions, a least-common-denominator approach is currently unavoidable, but a healthy, open market in Kubernetes Operators (built on Custom Resource Definitions) should, over time, close the gap between on-premise and cloud-provider-native infrastructure services and raise the level of service abstraction available at platform level.

Maximizing a given provider’s managed solutions – and adhering to ‘well-architected’ practices – is likely to provide sufficient cloud-native resilience, certainly well in excess of many existing on-premise practices. For private clouds, bespoke ‘well-architected’ guidance – supported by automated tooling – should be mandatory.

If any regulation is to happen regarding cloud utilization for regulated services, it should be to require firms to follow a provider’s well-architected practices to a very high degree – and to require that providers define such practices. In other words, new regulation is not needed to change the core engineering posture of the biggest cloud providers; rather, it should elevate their existing posture as the benchmark others must follow.

For private, hybrid or multi-cloud solutions an industry-wide accepted set of ‘well-architected’ guidance to take the guesswork out of resilience compliance would be welcome, coupled with an overarching framework for maintaining minimum digital service availability in the event of ‘fat-tail failures’ occurring (i.e., extremely rare but high impact failures).

For some regulated services, therefore, regulators will need to know which cloud providers and/or vendor infrastructure orchestration solutions are being used, to better manage concentration risk and ensure the failure of a cloud provider or vendor does not present a systemic risk. But for a given financial institution, the best option is likely to go ‘all-in’ either on a particular cloud provider, or on a cloud-neutral, open-sourced PaaS based on Kubernetes.

Achieving Service Resilience on Cloud-Reliant Infrastructure

AWS CDK – why it’s worth looking at

[tl;dr AWS CDK provides a means for developers to consume compliant, reusable cloud infrastructure components in a way that matches their SDLC, improving developer experience and reducing the risk of silo’ing cloud infrastructure development and operations.]

Some weeks ago, Amazon launched the AWS Cloud Development Kit, or CDK. This article provides my initial (neutral) thoughts on the potential impact and relevance of the CDK for organizations building and deploying solutions on the cloud.

Developer Experience

First, does it work? The CDK had been in beta for quite a while, and when Amazon makes a product generally available it generally meets a high bar for quality and stability – that is certainly the case with the CDK. The examples all worked as expected, although the Lambda code-pipeline example took some mental gyrations to understand the full implications of what was actually being done – specifically, that the build/deploy pipeline created by the CDK can itself run the CDK to generate templates that feed into that same pipeline, even if the resulting template is not deployed by the CDK.

All in all, the out-of-the-box experience for the CDK was excellent.

A polyglot framework

Secondly, the CDK is a software development framework. This means it uses ‘traditional’ programming languages (imperative, not declarative), it uses SDLC processes all application developers are familiar with (i.e., build/test/deploy cycles), and it provides many software abstractions that serve to hide (unnecessary) complexity from developers, while enabling developers to build ‘safe’ solutions.

The framework itself has been developed in TypeScript, with an interesting technology called ‘jsii’ used to generate native libraries in other programming languages (specifically Java, Python, and C#/.NET, as well as JavaScript).

The polyglot nature of the framework is critical, as cloud infrastructure (as exposed to consumers) must be neutral to any specific programming language. Meanwhile, the most-used languages each have at least one popular framework for abstracting the infrastructure needed to build distributed applications (data stores, message queues, service discovery & routing, configuration, logging/tracing, caching, etc.): Java has Spring, C# has .NET Core, Python has Django, and JavaScript has multiple frameworks based on Node.js.

So, the question is, should developers now learn to use the CDK or focus instead on language-specific frameworks?

What the CDK is – and what it is not

To answer this question, we need to be clear on what the CDK is, and what it is not. CDK is a compiler to generate CloudFormation code. If one considers CloudFormation templates to be the ‘assembly language’ for the AWS cloud ‘processor’, then what is important is that CDK generates high-quality CloudFormation templates – period.

To that extent, the CDK only needs to be as efficient as it takes to generate valid CloudFormation templates. It does not execute those templates, so CDK code will never have run-time performance sensitivities (except perhaps, as with traditional compilers, in build toolchains).

For this reason, TypeScript seems to have been a pragmatic implementation choice for the CDK. Run-time performance is not the key factor here – rather, it is the creation of flexible, adaptable constructs that avoid the need for developers to write CloudFormation YAML/JSON by hand. Languages that use the CDK libraries should expect the same performance criteria to apply.
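The “compiler” framing can be illustrated with a toy construct that synthesizes a declarative template from imperative code. This is an illustration of the idea only – the real CDK API lives in the AWS CDK libraries and is far richer; the classes below are invented for this sketch:

```python
import json

class Stack:
    """Toy stack: collects resources and synthesizes a template."""
    def __init__(self):
        self.resources = {}

    def synth(self) -> str:
        # The 'assembly language' output: a CloudFormation-shaped document.
        return json.dumps({"Resources": self.resources}, indent=2)

class Bucket:
    """Toy construct: an imperative object that registers declarative state."""
    def __init__(self, stack: Stack, logical_id: str, versioned: bool = False):
        props = {}
        if versioned:
            props["VersioningConfiguration"] = {"Status": "Enabled"}
        stack.resources[logical_id] = {"Type": "AWS::S3::Bucket",
                                       "Properties": props}

stack = Stack()
Bucket(stack, "Artifacts", versioned=True)
template = json.loads(stack.synth())
assert template["Resources"]["Artifacts"]["Type"] == "AWS::S3::Bucket"
```

The point is that the construct code is tested, reviewed and versioned like any other software, while the synthesized template – the artifact the cloud actually executes – is generated output, just as object code is for a compiler.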

Why CDK is necessary

After working through several of the examples, and observing the complexity of the CloudFormation code the CDK generates – and the simplicity of the example code – it is clear that having developers write CloudFormation templates is no more sustainable than having developers write assembly language. CloudFormation (like Azure Resource Manager and GCP Cloud Deployment Manager) is excellent for small, well-defined projects, but rapidly gets complex when expanded to many applications with complex infrastructure inter-dependencies.

In particular, ensuring the security and compliance of templates becomes very difficult when templates are hand-crafted. While services like CloudSploit offer static scanning of CloudFormation templates for security violations, it would be much better to ensure secure CloudFormation code is written in the first place.

Through Constructs, Stacks, and Apps, the CDK allows enterprise engineering teams to provide libraries of secure, compliant infrastructure components to developers that can be safely deployed.

For this reason, as well as the familiarity of CDK constructs to developers, CDK is likely to end up being more popular than having developers hand-craft CloudFormation templates. The complexity of CloudFormation risks enterprises splitting teams into specialists and non-specialists, reverting organizations back to silo’d infrastructure anti-patterns.

However, having specialist engineering teams focused on building, publishing and maintaining high-quality reusable constructs is a good thing, and this is likely what most organizations’ enterprise engineering teams should focus on (along with the larger global community of CDK developers).

With respect to language-specific frameworks, perhaps it is only a matter of time before these frameworks generate the cloud-native templates that their software-level abstractions map directly onto. This may mean that the application footprint for such applications could get much smaller in future, as the framework abstractions are increasingly implemented by cloud constructs. Indeed, as observed by Adrian Cockcroft, many of the open-source microservice components developed by Netflix ended up being absorbed by AWS, greatly simplifying the Netflix-specific code-base.

If this outlook proves correct, the correct approach for organizations already committed to a microservices framework would be to stick with it, rather than have their business-facing application developers learn CDK.

With respect to Terraform, the most popular cross-cloud provisioning, deployment and configuration tool, its principal benefit is a consistent SDLC workflow across cloud providers. Organizations need to decide whether a single end-to-end SDLC for application and infrastructure developers on a single cloud (using CDK) provides more benefit than a single infrastructure SDLC across multiple cloud providers, with a different SDLC for application developers.

CDK & Serverless

For architectures which are fundamentally not ‘serverless’ in nature, the CDK presents a conflict: by allowing infrastructure to be specified and built as part of the developer lifecycle, where does responsibility for managing infrastructure lie?

The reality is, most organizations still exist in a ‘serverful’ world – one where an infrastructure environment, even if cloud-based, is a ‘pet’ and not ‘cattle’. Environments tend to be created and managed over the long term, especially where datastores are involved. Stacks tend to be stable across environments, changing only with new software releases. Teams separate from the developers are responsible for the ongoing health, security and cost of environments. These teams are likely to be much more comfortable with configuration and scripting than with outright coding, using tools like Chef, Puppet, Ansible or AWS OpsWorks. They may prefer developers or architects to request infrastructure components via tools like AWS Service Catalog or ServiceNow, so that infrastructure code is firmly managed away from developers – and the benefits of an SDLC-friendly CDK may be less obvious to them.

Generating and maintaining safe, secure and compliant cloud stacks is a vibrant area of growth, and CDK is unlikely to monopolise this – rather, it may spur the growth of 3rd party solutions. 3rd parties that aim to simplify and standardize cloud infrastructure management (such as Pulumi) will have a role to play, particularly for polyglot language and multi-cloud environments, but ‘serverful’ platform and infrastructure teams need to decide what infrastructure building blocks to expose to developers, and how.

With serverless, this dynamic changes significantly, and the CDK can safely become part of the development team’s SDLC. Indeed, a potential goal for enterprises moving towards a serverless target state (i.e., applications consisting of composable services with no fixed/bespoke infrastructure) is to use CDK constructs to define those business-level services as infrastructure components. A platform-as-a-service to integrate software-as-a-service is a concept worth exploring as this space matures, particularly with the advent of services like AWS EventBridge.

In the meantime, behind every ‘serverless’ service lie many servers. Teams have many options for how best to automate this underlying infrastructure, and the CDK is another tool in the toolbox for doing so.

Conclusion

AWS CDK is ground-breaking technology that takes a big step towards improving the developer experience and capabilities on (AWS) cloud. Other cloud providers will likely follow suit, or risk widening the gap between their cloud infrastructure teams and application development teams. Organizations should consider investing in building and publishing CDK construct libraries for use by application teams – constructs which can be verified to be secure, with sufficient guardrails to allow less-experienced engineers to safely experiment with solutions.

In the meantime, as cloud platforms extend their capabilities, expect language-specific microservices frameworks to get simpler and smaller (or at least more modular in implementation), enabling application developers to fully exploit a given cloud provider’s platform services. Teams relying on these frameworks should understand and drive the roadmap for how these frameworks leverage cloud-native services, and ensure they align with their wider platform cloud/infrastructure automation strategy.


Bending The Serverless Spoon

“Do not try and bend the spoon. That’s impossible. Instead, only realize the truth… THERE IS NO SPOON. Then you will see that it is not the spoon that bends, it is yourself.” — The Matrix

[tl;dr To change the world around them, organizations should change themselves by adopting serverless + agile as a target. IT organizations should embrace serverless to optimize and automate IT workflows and processes before introducing it for critical business applications.]

“Serverless” is the latest shiny new thing to come on the architectural scene. An excellent (opinionated) analysis on what ‘serverless’ means has been written by Jeremy Daly, a serverless evangelist – the basic conclusion being that ‘serverless’ is ultimately a methodology/culture/mindset.

If we accept that as a reasonable definition, how does this influence how we think about solution design and engineering, given that generations of computer engineers have grown up with servers front-and-center of design thinking?

In other words, how do we bend our way of thinking of a problem space to serverless-first, and use that understanding to help make better architectural decisions – especially with respect to virtual machines, containers, and orchestration, and distributed systems in general?

Worked Example

To provide some insight into the practicalities of building and running a serverless application, I used a worked example, “Building a Serverless App Using Athena and AWS Lambda” by Epsagon, a serverless monitoring specialist. This uses the open-source Serverless framework to simplify the creation of serverless infrastructure on a given cloud provider. This example uses AWS.

Note to those attempting to follow this exercise: not all the required code was provided in the version I used, so the tutorial does require some (JavaScript) coding skills to fill the gaps. The code that worked for me (with copious logging) can be found here.

This worked example focuses on two reference-data oriented architectural patterns:

  • The transactional creation via a RESTful API of a uniquely identifiable ‘product’ with an ad-hoc set of attributes, including but not limited to ‘ProductId’, ‘Name’ and ‘Color’.
  • The ability to query all ‘products’ which share specific attributes – in this case, a shared name.

In addition, the ability to create/initialize shared state (in the form of a virtual database table) is also handled.
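Stripped of the AWS services, the two patterns above reduce to “append a record with a generated id” and “scan by attribute”. A local stand-in using only the standard library – the in-memory list replaces S3+Athena purely for illustration of the access patterns:

```python
import uuid

TABLE: list[dict] = []  # stand-in for S3 objects queried via Athena

def create_product(attrs: dict) -> dict:
    """POST /products - transactional create with a generated ProductId."""
    record = {"ProductId": str(uuid.uuid4()), **attrs}
    TABLE.append(record)
    return record

def query_by_name(name: str) -> list:
    """GET /products?name=... - query all products sharing an attribute."""
    return [r for r in TABLE if r.get("Name") == name]

create_product({"Name": "widget", "Color": "red"})
create_product({"Name": "widget", "Color": "blue"})
create_product({"Name": "gadget"})
assert {r["Color"] for r in query_by_name("widget")} == {"red", "blue"}
```

In the real architecture, `create_product` is a Lambda function writing objects to S3, and `query_by_name` is a Lambda function issuing a SQL query to Athena over those objects – but the contract exposed through the API is exactly this simple.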

Problem-domain Non-Functional Characteristics

Conceptually, the architecture has the following elements:

  • Public, anonymous RESTful APIs for product creation and query
    • APIs could be defined in OpenAPI 3.0, but by default are created implicitly from the framework configuration
  • Durable storage of product data information
    • Variable storage cost structure based on access frequency can be added through configuration
    • Long-term archiving/backup obligations can be met without using any other services.
  • Very low data management overhead
  • Highly resilient and available infrastructure
    • Additional multi-regional resilience can be added via Application Load Balancer and deploying Lambda functions to multiple regions
    • S3 and Athena are highly resilient, multi-AZ managed services
  • Scalable architecture
    • No fixed constraint on number of records that can be stored
    • No fixed constraint on number of concurrent users using the APIs (configurable)
    • No fixed constraint on the number of concurrent users querying Athena (configurable)
  • No servers to maintain (no networks, servers, operating systems, software, etc)
  • Costs based on utilization
    • If nobody is updating or querying the database, then no infrastructure is being used and no charges (beyond storage) are incurred
  • Secure through AWS IAM permissioning and S3 encryption.
    • Many more security authentication, authorization and encryption options available via API Gateway, AWS Lambda, AWS S3, and AWS Athena.
  • Comprehensive log monitoring via CloudWatch, with ability to add alerts, etc.

For a couple of days’ coding, that’s a lot of non-functional goodness – and overall the development experience was pretty good (albeit not CI/CD-optimized: I used Microsoft’s VS Code IDE on a MacBook, and the locally installed Serverless framework to deploy). Of course, I needed to be online and connected to AWS, but this seemed like a minor issue (for this small app). I did not attempt to deploy any serverless mock services locally.

So, even for a slightly contrived use case like the above, there are clear benefits to using serverless.

Why bend the spoon?

There are a number of factors that typically need to be taken into consideration when designing solutions, and which tend to drive architectures away from ‘serverless’ towards ‘serverful’. Typically, these revolve around resource management (i.e., network, compute, storage) and state management (i.e., transactional state changes).

The fundamental issue that application architects need to deal with in any solution architecture is the ‘impedance mismatch’ between general-purpose storage services and applications. Applications and application developers fundamentally want to treat all their data objects as if they are always available, in-memory (i.e., fast to access) and globally consistent, forcing engineers to optimize infrastructure to meet that need. This generally precludes using general-purpose or managed services, and results in infrastructure being tightly coupled with specific application architectures.

The simple fact is that a traditional well-written, modular 3-tier (GUI, business logic, data store) monolithic architecture will always outperform a distributed system – for the set of users and use-cases it is designed for. But these architectures are (arguably) increasingly rare in enterprises for a number of reasons, including:

  • Business processes are increasing in complexity (aka features), consisting of multiple independently evolving enterprise functions that must also be highly digitally cohesive with each other.
  • More and more business functions are being provided by third-parties that need close (digital) integration with enterprise processes and systems, but are otherwise managed independently.
  • There are many, disparate consumers of (digital) process data outputs – in some cases enabling entirely new business lines or customer services.
  • (Digital) GUI users extend well outside the corporate network, to mobile devices as well as home networks, third-party provider networks, etc.

All of the above conspire to drive even the most well-architected monolithic application towards a ‘ball-of-mud‘ architecture.

Underpinning all of this is the real motivation behind modern (cloud-native) infrastructure: in a digital age, infrastructure needs to be capable of being ‘internet scale’ – supporting all 4.3+ billion internet users, and growing.

Such scale demands serverless thinking. However, businesses that do not aspire to internet-scale usage still have key concerns:

  • Ability to cope with sudden demand spikes in b2c services (e.g., due to marketing campaigns, etc), and increased or highly variable utilisation of b2b services (e.g., due to b2b customers going digital themselves)
  • Provide secure and robust services to their customers when they need it, that is resilient to risks
  • Ability to continuously innovate on products and services to retain customers and remain competitive
  • Comply with all regulatory obligations without impeding ability to change, including data privacy and protection
  • Ability to reorganize how internal capabilities are provisioned and provided with minimal impact to any of the above.

Without serverless thinking, meeting all of these sometimes-conflicting needs becomes very complex, and will consume ever more enterprise IT engineering capacity.

Note: for firms to really understand where serverless should fit in their overall investment strategy, Wardley Maps are a very useful strategic planning tool.

Bending the Spoon

Bending the spoon means rethinking how we architect systems. It fundamentally means closing the gap between models and implementation, and recognizing that where an architecture is deficient, the instinctive reaction to fix or change what you control needs to be overcome: i.e., drive the change to the team (or service provider) where the issue properly belongs. This requires out-of-the-box thinking – and perhaps is a decision that should not be taken by individual teams on their own unless they really understand their service boundaries.

This approach may require teams to scale back new features, or modify roadmaps, to accommodate what can currently be appropriately delivered by the team, and accepting what cannot.

Most firms fail at this – because typically senior management focus on the top-line output and not on the coherence of the value-chain enabling it. But this is what ‘being digital’ is all about.

Everyone wants to be serverless

The reality is, the goal of all infrastructure teams is to avoid developers having to worry about their infrastructure. So while technologies like Docker initially aimed to democratize deployment, infrastructure engineering teams are working to ensure developers never need to know how to build or manage a docker image, configure a virtual machine, manage a network or storage device, etc, etc. This even extends to hiding the specifics of IaaS services exposed by cloud providers.

Organizations that are evaluating Kubernetes, OpenFaaS or Knative, or which use services such as AWS Fargate, AWS ECS or Azure Container Service, ultimately want to minimize the knowledge developers need of the infrastructure they are working on.

Unfortunately for infrastructure teams, most developers still develop applications using the ‘serverful’ model – i.e., they want to know what containers are running where, how they are configured, how they interact, how they are discovered, etc. Developers also want to run containers on their own laptop whenever they can, and deploy applications to authorized environments whenever they need to.

Developers also build applications which require complex configuration which is often hand-constructed between and across environments, as performance or behavioural issues are identified and ‘patched’ (i.e., worked around instead of directing the problem to the ‘right’ team/codebase).

At the same time, developers do not want anything to do with servers…containers are as close as they want to get to infrastructure, but containers are just an abstraction of servers – they are most definitely not ‘serverless’.

To be Serverless, Be Agile

Serverless solutions are still in the early stages of maturity. For problems that require a low-cost, resilient and always-available solution, but are not particularly performance-sensitive (i.e., are naturally asynchronous and eventually consistent), serverless solutions are ideal.

In particular, internal IT processes (the proverbial shoes for the cobbler’s children) would benefit significantly from extensive use of serverless, as the management overhead of serverless solutions will be significantly less than that of other solutions. Integrating bespoke serverless solutions with workflows managed by tools like ServiceNow could be a significant game changer for existing IT organizations.

However, mainstream use of serverless technologies and solutions for business critical enterprise applications is still some way away – but if IT departments develop skills in it, it won’t be long before it finds its way into critical business solutions.

For broader use of serverless, firms need to be truly agile. Work needs to come to teams as much from other dependent teams as from top-down sources. Teams themselves need to be smaller (and ‘senior’ staff need to rethink their roles), and be prepared to split or plateau. And feature roadmaps need to be driven as much by capabilities as by imagined needs.

Conclusion

Organizations already know they need to be ‘agile’. To truly change the world (bend the spoon), serverless and agile together will enable firms to change themselves, and so shape the world around them.

Unfortunately, for many organizations it is still easier to try to bend the spoon… for those who understand they need to change, adopting the ‘serverless’ mindset is key to success, even if – at least initially – true serverless solutions remain a challenge to realize in organizations dealing with legacy (serverful) architectures.


The changing role of data lakes

[tl;dr A single data lake, data warehouse or data pipeline to “rule them all” is less useful in hybrid cloud environments, where it can be feasible to query ‘serverless’ cloud-native data sources directly rather than rely on traditional orchestrated batch extracts. Pipeline complexity can be reduced by open extensions to SQL such as the recently announced AWS PartiQL language. Opportunities exist to integrate enterprise human-oriented data governance and meta-data platforms with data pipelines using serverless technologies.]

The need for Data Lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. The data lake concept was created to address a number of issues with traditional data analytics and reporting solutions, specifically:

  • the growing number of applications across an enterprise depending on a given dataset;
  • business and regulatory drivers for governing dataset discovery, quality, creation and/or consumption;
  • the increasing difficulty IT teams have in responding in a timely manner to growing business demand for access to high-quality datasets.

The data lake allows data to be made available from its source without making any assumptions about its use. This is particularly critical when the data originates from batch extracts of load-sensitive OLTP databases, most of which are still operating on-premise. Streaming data pipelines, while growing in popularity, are not as common as batch-driven pipelines – although this should change over time as digital platform architectures become more event-driven in nature.

Data lakes are a key component in data pipelines, a construct (or set of constructs) that provides consolidation of data from multiple sources and makes it available for use. A data pipeline can be orchestrated (via a scheduler) or choreographed (responding to events) – the more jobs a pipeline has to do, the more complex the orchestration or choreography, which has implications for supportability. So reducing the number of jobs a pipeline has to support is key to managing data pipeline complexity.
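The orchestration/choreography distinction can be sketched in a few lines of Python – all job names and event types below are invented for illustration, not taken from any specific pipeline tool:

```python
from collections import defaultdict

# --- Orchestrated: a scheduler invokes each job in a fixed order. ---
def run_orchestrated(raw):
    return publish(validate(ingest(raw)))

# --- Choreographed: jobs subscribe to events; adding a job is just
# another subscription, with no central run order to maintain. ---
subscribers = defaultdict(list)

def on(event_type):
    """Decorator registering a handler for an event type."""
    def register(handler):
        subscribers[event_type].append(handler)
        return handler
    return register

def emit(event_type, payload):
    for handler in subscribers[event_type]:
        handler(payload)

# The jobs themselves (toy implementations).
def ingest(raw):
    return {"rows": raw}

def validate(staged):
    return {"rows": [r for r in staged["rows"] if r is not None]}

published = []

def publish(clean):
    published.append(clean)
    return clean

@on("object.landed")
def _ingest(raw):
    emit("object.staged", ingest(raw))

@on("object.staged")
def _validate(staged):
    emit("object.validated", validate(staged))

@on("object.validated")
def _publish(clean):
    publish(clean)
```

In the choreographed version, adding another consumer of ‘object.validated’ events is a one-line subscription rather than a scheduler change – which is the property that keeps the number of jobs a pipeline must coordinate manageable.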

The Components of a Data Lake

A data lake consists of a few key components:

  • A storage repository – durable, resilient storage of data objects (virtual: no; traditional: yes)
  • An ingestion mechanism – a means to upload content to the repository, with no transformation (virtual: no; traditional: yes)
  • A tagging & metadata mechanism – a means to associate metadata with data objects, including user-defined tags (virtual: yes; traditional: yes)
  • A metadata search mechanism – a means to search objects in the data lake based on metadata and tags, not content (virtual: yes; traditional: yes)
  • A query engine – a means to search the content of objects in the data lake (virtual: yes; traditional: partially)
  • An access control mechanism – a means to ensure that users can only access datasets, and parts of datasets, that they are entitled to see, and to audit all activity (virtual: yes; traditional: yes)

In effect, data lakes have become a kind of data warehouse – the main significant difference being that input sources into data lakes tend to be familiar files – CSVs, Avro, JSON, etc. from multiple sources rather than highly optimized domain-specific schemas – i.e., no assumptions are made about how (or why) the data in the data lake will be consumed. Data lakes also do not concern themselves with scheduling or orchestration.

Data warehouses, data warehouses everywhere…

For mature data use cases (i.e., situations where relatively stable, well-known data requirements exist), and where consistent high performance is material to meeting customer needs, data warehouses are still the best solution. A data warehouse stores and manages all of its data locally, and only relies on the data lake as an initial ingestion point.

A data warehouse will transform datasets to the form needed for the specific use cases it supports, and will optimize performance for the consumption of those datasets. Modern data warehouses will use ML/AI techniques to optimize performance rather than relying on human database specialists. But, as this approach is compute intensive, such solutions are more amenable to cloud environments than on-premise environments. Snowflake is an example of this model. As more traditional data warehouses (e.g., Oracle Exadata) move to the cloud, we can expect these to also get ‘smarter’ – however, data gravity will mean such solutions will need to be fundamentally multi-cloud compatible.

For on-premise data warehouses, the tendency is for business lines or functions to create ‘one data warehouse to rule them all’ – mainly because of the traditionally significant storage and compute infrastructure and resources necessary to support data warehouses. Consequently, considerable effort is spent on defining and maintaining high-performance, appropriately normalized, enterprise data models that can be used in as many enterprise use cases as possible.

In a hybrid/cloud world, multiple data warehouses become more feasible – and, in larger organizations, inevitable. As more enterprise data becomes available in these dynamically scalable, cloud-based (or HDFS/Hadoop-based) data warehouses (such as AWS EMR, AWS Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse), ‘virtual data warehouses’ avoid the need to move data from its source for query handling, allowing data storage and egress costs to be kept to a minimum – especially if assisted by machine-learning techniques.

Virtual Data Warehouses

Virtual Data warehouse technologies have been around for a while, allowing users to manage and query multiple data sources through a common logical access point. For on-premise solutions, virtual data warehouses have limited use cases, as the cost/effort of scaling out in-house solutions can be prohibitive and not particularly agile in nature, precluding experimental use cases.

On hybrid or cloud environments, virtual data warehouses can leverage the scalability of cloud-native data warehouses, driving queries to the relevant engine for execution, and then leveraging its own scalable infrastructure for executing join queries.

Technologies like Dremio reflect the state of the art in cloud-based data warehouses, which push down queries to the source system where possible, but can process them in-memory directly from a data lake or other source if not.

However, there is one thing that all data warehouses have in common: they leverage SQL and (implicitly) a relational view of the data. Standard ANSI SQL queries are generally supported by all data warehouses, but this means some data cannot be queried if it is not in a tabular form amenable to SQL processing.

Extending SQL with PartiQL

Enter PartiQL, an open-source project sponsored by Amazon to drive extensions to standard SQL that can cope with non-relational data types, including structured, unstructured, nested, and schemaless (NoSQL, Document).

Historically, all data ingested into a data lake had to be transformed into a format that could be queried by SQL-like commands or processed by typical data warehouse bulk-upload tools. This adds complexity to data pipelines (i.e., more jobs), and may also force premature schema design (i.e., forcing the design of an optimal schema before all critical use cases are fully understood).

PartiQL potentially allows tools such as Snowflake, Dremio (as well as the tools AWS uses internally) to query data using SQL-like syntax, but to also include non-relational data in those queries so they can avoid those separate transformation steps, aiding pipeline complexity reduction.
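As a rough sketch (the query text is illustrative PartiQL syntax, and the dataset and field names are invented for the example), PartiQL lets the FROM clause iterate over nested collections – semantics that look like this in plain Python:

```python
# Nested, schemaless-style data that standard SQL struggles with.
employees = [
    {"name": "Ana", "projects": [{"title": "datalake"}, {"title": "iam"}]},
    {"name": "Bob", "projects": [{"title": "billing"}]},
]

# Illustrative PartiQL: the FROM clause unnests each employee's
# projects collection, so nested data can be filtered and projected
# without a separate transformation job.
QUERY = """
SELECT e.name, p.title
FROM employees AS e, e.projects AS p
WHERE p.title LIKE '%lake%'
"""

# Equivalent semantics in Python: unnest, then filter and project.
rows = [
    {"name": e["name"], "title": p["title"]}
    for e in employees
    for p in e["projects"]
    if "lake" in p["title"]
]
```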

PartiQL claims to be fully ANSI-compliant, but extended in specific ways to support alternative data formats. While not an official ISO/ANSI standard, it may become a de-facto standard – especially as the language has already been used in anger, with success, within AWS. This will provide a skill path for relational data warehouse experts to become proficient in leveraging modern data pipelines without committing to one specific vendor’s technology.

Technologies like PartiQL will make it much easier to include event-sourced streams into a data pipeline, as events are defined as nested or other non-relational structures. As more data pipelines become event driven rather than batch-driven, having a standard like PartiQL will be key. (It will be interesting to see if Confluent’s KSQL and PartiQL will converge to a single event-stream query standard.)

As PartiQL has only just been released, it’s too soon to tell how the big data ecosystem or ISO/ANSI will respond. Expect more on this topic in the future. For now, virtual data warehouses must rely on their proprietary SQL extensions.

Non-SQL Data Processing

Considerable investment is being made by third party vendors in advanced technology focused on making distributed, scalable processing of SQL (or SQL-like) queries fast and reliable with little or no human tuning required. As such, it is wise to pick a vendor that demonstrates a clear strategy in this space and continues to invest in SQL as the lingua franca of transformation logic.

However, for use cases for which SQL is not appropriate, distributed computing platforms like Spark are still needed. The expectation here is that such platforms will ingest data from a data lake, and output results into a data lake. In some cases, the distributed computing platform offers its own storage (e.g., HDFS), but increasingly it is more appropriate to question whether data needs to reside permanently in an HDFS cluster rather than in a data lake. For example, Amazon’s EMR service allows Hadoop clusters to be created ephemerally, and to consume their initial dataset from AWS S3 repositories or other data sources.

Enforcing Enterprise Data Collaboration and Governance

Note that all data warehouse solutions (virtual or not) must support some form of meta-data tagging and management used by their SQL query engines – otherwise they cannot act as a virtual database source (generally an ODBC end-point that applications can connect to directly). This tagging can be automated if sources include meta-data (e.g., field headers, Avro schema definitions, etc.), but can be enhanced by human tagging, which is increasingly augmented by machine learning to help identify, for example, where data may be sensitive.
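A minimal sketch of that automated tagging step, assuming an Avro-style record schema; the sensitivity heuristic here is a naive stand-in for the ML-assisted classification mentioned above:

```python
# Field names treated as sensitive by this toy heuristic (in practice,
# an ML-assisted classifier would do this job).
SENSITIVE_HINTS = {"ssn", "salary", "dob"}

def derive_tags(avro_schema):
    """Derive per-field metadata tags from an Avro record schema."""
    tags = {}
    for field in avro_schema.get("fields", []):
        name = field["name"]
        tags[name] = {
            "type": field["type"],
            "sensitive": name.lower() in SENSITIVE_HINTS,
        }
    return tags

# An example Avro record schema (fields invented for illustration).
schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "salary", "type": "double"},
    ],
}

tags = derive_tags(schema)
```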

But data governance needs extend beyond the needs of the virtual data warehouse query engines, and this is where there are still gaps to be filled in the current enterprise data management tools.

Tools from vendors like Alation, Waterline, Informatica, Collibra, etc. were created to augment people’s ability to properly tag content in the data lake with meaningful information to make it discoverable and governable. Consistent tagging in principle allows tag-based governance rules to be defined to automatically enforce data governance policies in data consumers. This data, coupled with schema information which can be derived directly from data sources, is all the information needed to allow users (or developers) to source the data they need in a secure, compliant way.

But meta-data for data governance has humans as the primary user (e.g. CDOs, business/data analysts, process owners, etc) – or, as Alation describes it – meta-data for human collaboration.

Currently, there are no accepted standards for ensuring the consistency of ‘meta-data for human collaboration’ with ‘meta-data for query execution’.

Ideally, the human-oriented tools would generate standard events that tools in the data pipeline could pick up and act on (via, for example, something like AWS EventBridge), thereby avoiding the need for data governance personnel to oversee multiple data pipelines directly…
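Such an integration might look like the following sketch – the event source, detail type, and payload fields are hypothetical, not a published schema:

```python
import json

def governance_event(dataset, tag, value):
    """Build an EventBridge-style entry describing a tag change.

    "example.governance" and "DatasetTagChanged" are invented names
    standing in for whatever a real governance tool would publish.
    """
    return {
        "Source": "example.governance",
        "DetailType": "DatasetTagChanged",
        "Detail": json.dumps(
            {"dataset": dataset, "tag": tag, "value": value}
        ),
    }

# In a real integration this entry would be sent with:
#   boto3.client("events").put_events(Entries=[entry])
# and pipeline-side rules would route it to enforcement jobs.
entry = governance_event("hr.employees", "sensitivity", "confidential")
```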

Summary

With the advent of cloud-based managed compute and data storage services, a multi-data warehouse and pipeline strategy is viable and may even be desirable, potentially involving multiple data lakes.

Solutions like PartiQL have the potential to eliminate many transformation job phases and greatly simplify data pipeline complexity in a standardized way, leveraging existing SQL skills rather than requiring new skills.

To ensure consistent governance across multiple data pipelines, a serverless event-based approach to connecting human data governance solutions with cloud-native data pipeline solutions may be the way forward – for example, using AWS EventBridge to action events originating from SaaS-based data governance services with data pipelines.


Why AWS EventBridge changes everything..

“Events, dear boy, events”

Harold Macmillan

[tl;dr AWS EventBridge may encourage SaaS businesses to formally define and manage public event models that other businesses can design into their workflows. In turn, this may enable businesses to achieve agility goals by decomposing their organizations into smaller, event-driven “cells” with workflows empowered by multiple SaaS capabilities.]

Last week, Amazon formally announced the launch of the AWS EventBridge service. What makes this announcement so special?

The biggest single technical benefit is the avoidance of the need for webhooks or polling APIs. (See here for a good explanation of the difference.)

Webhooks are generally not considered a scalable solution for SaaS services, as significant engineering is required to make them robust, and consuming applications need to be designed to handle webhook API calls.

HTTP-based APIs exposed by 3rd party services can be polled by applications that need to know if state has changed, but this polling consumes resources even when nothing changes. Again, this creates scalability issues for both the SaaS provider and the consuming application.
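A toy illustration of the waste (all names invented): the consumer spends a request on every polling tick, even when the state has not changed:

```python
def poll(fetch_state, ticks):
    """Poll a state-returning callable once per tick.

    Returns (total calls made, number of actual state changes seen) -
    the gap between the two is pure overhead.
    """
    calls, changes, last = 0, 0, None
    for _ in range(ticks):
        calls += 1
        state = fetch_state()
        if state != last:
            changes += 1
            last = state
    return calls, changes

# Five polls against a resource that only changes state twice.
states = iter(["a", "a", "a", "b", "b"])
calls, changes = poll(lambda: next(states), 5)
```

An event-driven model inverts this: the producer pushes only when something changes, so consumer work scales with changes rather than with polling frequency.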

In both cases, the principal metaphor connecting the SaaS service and the consuming application is the ‘service interface’ abstraction – i.e., executing an operation on a resource. As such, this is a technical solution to a technical problem.

From APIs to Events

While this ‘service-based’ model of distributed programming is extremely powerful, it is not an appropriate abstraction for connecting behaviors across multiple services in a value chain. To align with business-level concepts such as Business Process Modelling, event-driven architectures are becoming more and more popular to model complex workflows both within and between organizations.

This trend is accelerated by the desire of organizations to become more “agile”. Increasingly organizations are recognizing this must manifest itself as breaking down the organization into more manageable, semi-autonomous “cells” (see this article from McKinsey as an example). With cells, the event metaphor fits naturally: cells can decide which events they care about, and also decide what events they in turn create that other cells may use.

3rd party service providers (i.e., SaaS companies such as SalesForce, Workday, Office365, ServiceNow, Datadog, etc) empower organizational cells and enable them to achieve far more than a small cell otherwise could. The “cell” concept cannot be fully realized unless every cell has the ability to define and control how it uses these services to achieve its own mission.

In addition, as value/supply chains become more complex, and more (3rd party or internal) providers are embedded in those workflows, the need for a more natural, adaptable way of integrating processes has become evident.

But event-driven architectures require a common ‘bus’ – a target-neutral means to allow zero or more consumers to express an interest in receiving events published on the bus. Historically this has been impractical to do at scale between organizations (or even within them) without requiring all parties to agree on a neutral 3rd party to manage the bus – at the additional risk of creating a change bottleneck: hence the historic preference for point-to-point HTTP-based standards.

Services like the AWS EventBridge for the first time allow autonomous SaaS solutions to publish a formal event model that can be consumed programmatically and seamlessly included in local (cell-specific) workflows. In addition, this event model can be neutral to the underlying technology and cloud provider.

How it works and what makes it different

The key feature of the EventBridge is the separation of the publisher from the consumer, and the way that business rules for routing and transforming events are handled.

Once an organization (or AWS account) has registered as a consumer with the publisher (the owner of the “event source”), a logical “event bus” is created to represent all events for that org/account. The consuming org/account can then set up whatever routing and transformation rules it needs for any internal consumers of those events, without any further dependency on the publishing organization. So consumer organizations/accounts have full control over what is published internally to consuming applications.
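Conceptually, a rule’s event pattern selects events by matching fields against lists of acceptable values, recursing into nested objects. The simplified matcher below captures only that core idea – real EventBridge patterns support further operators (prefix, numeric ranges, anything-but, etc.), and the rule shown uses invented names:

```python
def matches(pattern, event):
    """Return True if the event satisfies the pattern.

    A pattern field matches when the event's value is one of the listed
    candidates; nested dicts are matched recursively. Fields absent
    from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:  # list of acceptable values
            return False
    return True

# A hypothetical rule: only 'created' or 'updated' events from one SaaS.
rule = {
    "source": ["example.saas"],
    "detail": {"status": ["created", "updated"]},
}
```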

With appropriate guard-rails in place, individual teams (“cells”) can define and configure their own routing rules, and not rely on any centralized team – a key weakness in many legacy ESB solutions.

Note that EventBridge has predefined service limits – it offers reasonably high throughput (400 events/sec) but is a high-latency service (around 0.5 sec). So low-latency use cases such as electronic trading are not, at this point, appropriate for EventBridge.

The use of EventBridge for internal enterprise event handling should be considered carefully: the limit of 100 event buses per account essentially caps the number of publishers any one account can handle at 100. For most use cases this should be more than enough, but many large organizations may have far more than 100 ‘publishers’ on their ESB. If each publisher can be viewed as part of an end-to-end business value-stream, then any value-stream with more than 100 components (i.e., unique event models) is likely to be overly complex. In practice, a ‘publisher’ is likely to be an enterprise application, so some significant complexity reduction and consolidation (of event models, if not actual code) would be needed before such organizations could use EventBridge internally effectively.

The AWS Way (also, the Cloud Way)

It’s worth noting that key to Amazon’s success is its ability to “eat its own dogfood”. Every service in Amazon and AWS is built atop other services. No service is allowed to get so big and bloated it cannot be managed effectively. Abstractions are ‘clean’ – rather than add bells and whistles to an existing service, a new service is created which leverages the underlying service or services.

AWS has consistently required every service to have and maintain an API model, which – for asynchronous/autonomous services – leads naturally to an event model. This in turn has made it natural for AWS EventBridge to come out-of-the-box with a number of events already emitted by AWS services that can be leveraged for customer solutions. (For now, many of these events are limited to generic CloudTrail-related events – specifically tracking API calls – but in the future it’s reasonable to expect more service-specific events to be made available.)

AWS does have one key advantage over other major cloud providers such as Google GCP and Microsoft Azure: it set out to build a business (an online marketplace) using these services. So its strategy was (and is) driven by its vision for how to build a globally scalable online business – not by the need to provide technology services to businesses. To this extent, it’s hard to see Google and Microsoft being anything other than followers of AWS’s lead.

A Prediction..

Businesses which also follow the Amazon-inspired growth/innovation and organization model will likely have a better chance of succeeding in the digital age. And it is for these businesses that EventBridge will have the most impact – far beyond the technological improvements afforded by the use of events vs webhooks/APIs.

Consequently, as more SaaS companies are on-boarded onto the AWS EventBridge eco-system, we can expect more event models to be published. Tools for managing and evolving event models will evolve and improve so they become more accessible and useful for non-traditional IT folks (i.e., process and workflow designers) – currently, the only way to see event model definitions seems to be by actually creating business rules.

This increased focus on SaaS integrations may (perhaps) inspire firms to re-organize their internal capabilities along similar lines, as internal service providers, empowering cells across the organization and with a published and accessible software-driven event model – noting that while events may be published and received digitally, they can still be actioned by humans for non-digital processes (e.g., complex pricing decision making, responding to help desk requests, etc).

The roster of SaaS firms signing up to EventBridge over the coming months will hopefully bear out this prediction. A good sense of what services could be onboarded can be had by looking at all the SaaS (and IoT) services integrated by IFTTT.

In the meantime, it is time to explore the re-imagined integration opportunities afforded by AWS EventBridge..
