Making sense of serverless

The now annual Serverlessconf NYC (2019) event was held in NYC in October. This was a great opportunity to assess the current state of ‘serverless’ – what it is, why it’s important, whether and how it should factor in enterprise cloud planning, and what challenges there are around the serverless space.

The content below is informed by presentations given during the conference, but opinions and conclusions are my own.

What exactly is serverless

‘Serverless’ architectures are often correlated with ‘functions-as-a-service’ (FaaS) but ‘serverless’ extends far beyond FaaS. At the simplest – and most pragmatic – level, ‘serverless’ architectures do not have any ‘servers’ (i.e., long-running virtual and/or physical machines) to manage – i.e., no OS patches, no upgrades, no capacity to manage, etc. Operational overhead related to infrastructure management is zero. This means development teams are 100% responsible for their serverless application architectures.

Of course, the servers do not disappear: they are merely 100% managed by ‘someone else’, with little or no engagement required between development teams and those who manage the serverless infrastructure.

Architecturally, serverless is more than simply no infrastructure to manage. ‘Pure’ serverless architectures exhibit the following characteristics:

  • Are event-driven
  • Are function-oriented
  • Use managed services
  • Are scalable to zero and up

It’s worth noting that managed services in a serverless architecture may not themselves be using ‘serverless’ architectures. However, the technology used to deliver each managed service is not exposed in a specific serverless solution architecture.

Why is serverless important?

From a purely economic perspective, serverless enables the introduction of true value-driven businesses (see Simon Wardley’s perspective here), enabling a whole new economics around automation and infrastructure. Essentially, if no value is being delivered, then no cost is being incurred on a serverless architecture (assuming usage of services constitutes ‘value’ delivered).
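To make the ‘no value, no cost’ point concrete, here is a rough sketch comparing a pay-per-use bill with an always-on server bill. The prices and function names are hypothetical, picked only for illustration, and are not any provider’s actual rates.

```python
# Hypothetical prices, chosen only for illustration (not real provider rates).
PRICE_PER_MILLION_REQUESTS = 0.20   # pay-per-use charge per 1M invocations
FIXED_SERVER_COST_PER_MONTH = 50.0  # always-on server charge per month


def serverless_monthly_cost(requests: int) -> float:
    """Cost is proportional to usage, so zero requests means zero cost."""
    return requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS


def serverful_monthly_cost(requests: int) -> float:
    """An always-on server costs the same whether or not it is used."""
    return FIXED_SERVER_COST_PER_MONTH


def break_even_requests() -> int:
    """Monthly request volume at which both models cost the same."""
    return round(FIXED_SERVER_COST_PER_MONTH / PRICE_PER_MILLION_REQUESTS * 1_000_000)
```

Under these assumed numbers an idle month costs nothing on the pay-per-use model, while the server bill is unchanged. The break-even volume is where the conversation about moving to a serverful back-end would start.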

Because of the traditionally high cost of supporting and maintaining application infrastructure, businesses have historically put a lot of effort into planning new software; new features tend to increase both change and operating costs, usually more than linearly. This makes businesses more change-averse over time, leading to over-planning, a lack of agility and significantly reduced pace of innovation delivery.

Currently (as of 2019), the fashion is to invest in cloud computing to dampen change and operating costs – although this investment is still predominantly in the IaaS space (i.e., using the cloud’s superior economics for compute, storage and networking). Enterprises moving beyond IaaS are faced with either committing to a specific cloud provider’s infrastructure PaaS solutions (such as AWS ECS, Fargate, etc), or investing a lot of effort in building and operating their own orchestration and runtime solutions (usually using Kubernetes as the enabler).

But the war to deliver customer value is being fought above the runtime or infrastructure PaaS, as suggested by Simon Wardley. The risk to many enterprises is that they win the battle for infrastructure PaaS but lose the war to deliver customer value. Indeed, one conference presentation (entitled ‘Killing Kubernetes’) gave a real-world example of how running down the Kubernetes path prematurely can cause a team to lose sight of the customer value to be delivered. (The team in the end decided to go full serverless, and ditch Kubernetes.)

Using a tool like Wardley Maps enables clarity of thought with respect to critical platform-level components, and which battles it makes sense to fight, vs leveraging what industry innovation will provide.

As Ben Kehoe describes, the point of serverless is to provide focus on business value – it’s not about functions, technology, cost or operations.

Changing Build vs Buy Mindsets

Many enterprises have a policy of ‘buy over build’ – i.e., buy (and customize) a solution rather than build a solution. Customized off-the-shelf solutions have their advantages, but often ultimately lead to businesses being constrained by vendor roadmaps, or by the cost of upgrading/keeping pace with vendor advances. In particular, vendor software is optimized for configurability, whereas what enterprises need is extensible software rather than configurable software.

Serverless provides organizations which do not have a depth in engineering expertise with a path towards ‘build over buy’. Functions and workflows proprietary to a business can be implemented as serverless functions, while use of managed services can minimize the need for infrastructure expertise. Integrating 3rd party software-as-a-service solutions also becomes second nature in a serverless environment, particularly with the advent of integration tools such as AWS EventBridge. Such an architecture is readily extensible, and better suited to meet enterprise needs.

Serverless and the Enterprise

Serverless architectures are being actively used by all kinds of businesses with great success (aCloudGuru, a sponsor of ServerlessConf NYC, is just one example of a successful serverless user). Architectures which rely exclusively on managed services (such as AWS S3, Aurora, Lambda, StepFunctions, DynamoDB, etc) can be considered serverless.

Many enterprises are partially serverless on the cloud, choosing to leverage a cloud provider’s managed service offering as part of a ‘serverful’ solution architecture (e.g., using S3 with EC2). But these architectures do not provide the full benefits of a true ‘serverless’ architecture, as considerable effort is still required to manage the non-serverless elements.

It should be noted as well that just because traditional enterprise software is made available as a ‘managed service’, doesn’t mean that the enterprise overhead of managing that service is reduced: if the cloud provider still exposes all the configurable aspects of the software, there will not be a significant benefit in moving to the managed service. (Microsoft enterprise applications offered on Azure seem to suffer from this affliction.)

Fundamentally, serverless is not yet ready to take on all enterprise workloads – there are many constraints and conflicts between standardizing the serverless runtimes (necessary to allow them to be managed efficiently at scale) and the customization needs of enterprises. In particular, how systems manage state with sufficient performance is likely to remain a challenge – although this is certainly solvable, as emerging architectural best practices for a serverless world establish themselves (the traditional model-view-controller model being a poor model for serverless applications).

For most enterprises, therefore, business solutions should be planned as if performant platform solutions exist, being clear on what the functions are, what the managed services are, and what platform capabilities are assumed. These can then drive further investment decisions to build out (or buy/rent) these capabilities. A ‘serverless first’ mindset is key to this.

Underpinning all of this is organizational design – in particular, the concepts and ideas espoused in Team Topologies map very well to this approach.

Decomposing the Monolith

A key use case for serverless is enabling the decomposition of legacy monolithic architectures. Most enterprises do not have the skills or expertise to successfully migrate complex monolithic architectures to microservices, as this requires some skills in developing on and managing highly distributed systems. While technologies like CloudFoundry and SpringBoot go a long way towards minimizing the cognitive load for application developers, organizations require considerable investment to make these technologies available as true managed services across an enterprise.

Serverless offers a route to decompose monolithic architectures without first building out the full capabilities needed to deploy serverful microservice architectures. It allows enterprises to experiment with service-based business solutions without incurring significant or hard-to-reverse costs in infrastructure and/or skills. Once a decomposed architecture begins to prove its worth, it may be unavoidable (for now) to move to serverful microservices at the back-end to scale out, but the business value proposition should be clear by then.

Serverless Challenges

Serverless architectures have their own challenges that organizations need to be prepared to handle, which differ from the challenges of building serverful architectures.

Key challenges exist around:

  • Security
  • Local development and testing
  • Debugging, tracing, monitoring and alerting
  • Limit Management
  • Resilience
  • Lock-in
  • Integration testing
  • Serverless infrastructure-as-code

The above are the challenges specifically raised during the conference. Other challenges may yet reveal themselves.

Security

Serverless requires a different security model than traditional infrastructure. Specifically, security for serverless centers around security of functions, security of data, and security of configuration.

Key attack surfaces for serverless are event data injection, unauthorized deployments, and dependency poisoning. In particular, over-privileged permissions present a significant attack surface. A good list of attack surfaces is published by Palo Alto/Puresec, a sponsor of the conference.
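As an illustration of the over-privileged permissions problem, the sketch below flags wildcard grants in a policy document. It assumes the JSON shape used by AWS IAM policies (Statement, Effect, Action, Resource); the function names, the example policy and the account number in the ARN are placeholders of my own, and a real review would need far more than this one check.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM-style policies allow a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def find_overprivileged_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant a wildcard action or wildcard resource."""
    flagged = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            flagged.append(statement)
    return flagged


# Hypothetical policy: the first statement is far broader than a function needs.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        },
    ],
}
```

Running find_overprivileged_statements(example_policy) returns only the first statement, which is the kind of grant worth narrowing before deployment.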

Serverless components therefore need their own security solutions, as part of an over-arching defense-in-depth security strategy.

Local development and testing

By definition, managed services cannot be available on local (laptop/desktop) development environments, and neither are serverless runtimes such as lambda. Instead, development is expected to happen directly on the cloud, which can cause issues for developers who are periodically disconnected from the internet.

For some development teams, the ability to code and test away from the cloud is important, and in this regard, cloud providers are beginning to standardize more on OCI containers in their serverless runtimes, and allow developers to run these containers locally on their laptop, as well as on standard orchestrated environments such as Kubernetes. Azure and GCP seem to be leading the way in this space, but AWS is always improving its lambda run-times and offering developers more ways to customize them, so this may eventually lead to AWS offering the same features.

The challenge, however, will be to maintain the benefits of serverless while ensuring teams do not end up managing containers as the new ‘servers’…a trap many teams are likely to fall into.

Debugging, tracing, monitoring and alerting

The challenges here are not unique to serverless – microservices architectures have these challenges in spades. While cloud providers typically provide managed services to assist with these (e.g., AWS X-Ray, AWS CloudWatch, etc), a rich eco-system of 3rd parties also helps to address these needs.

In general, while it is possible to get by with provider-native solutions, it may be best to augment team capabilities with a vendor solution, such as Lumigo, Serverless framework, Datadog, Epsagon, etc.

Limit Management

All serverless services have limits, usually defined per account. This protects lambdas and managed services (such as AWS DynamoDB) from being overloaded by rogue applications.

Usually, limits can be increased, but may need a service request to the cloud provider. Limits can also be imposed per account at an enterprise level (for example, via AWS Organizational Units).

It is important that the service limits are known and understood, as incorrectly assuming no limits may have a material impact on a solution architecture. While serverless solutions can scale, they cannot scale infinitely.
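One way to sanity-check a design against a concurrency limit is Little’s law: the number of in-flight executions is roughly the arrival rate multiplied by the average execution time. The sketch below applies that rule of thumb; the function names and the limit used in the example are hypothetical, and real limits should be taken from the provider’s documentation for the account in question.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Little's law: in-flight work = arrival rate x average time in the system."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def fits_within_limit(requests_per_second: float,
                      avg_duration_seconds: float,
                      concurrency_limit: int) -> bool:
    """True if the expected peak load stays at or under the (assumed) limit."""
    return required_concurrency(requests_per_second, avg_duration_seconds) <= concurrency_limit
```

For example, 400 requests per second at 0.25 seconds each needs about 100 concurrent executions, which fits an assumed limit of 1,000; at 8,000 requests per second the same function needs about 2,000 and does not.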

Resilience

Resilience for managed services is different from resilience of functions-as-a-service. Managed services need to be available at all times – but the manner and means by which such services maintain availability is generally opaque to the user. Some services may be truly global, but cloud providers tend to make managed services resilient within a specific region (through multiple availability-zones in a region), which requires solution architectures to allow for redundancy across multiple regions in the event a single region fails in its entirety. Recovery in these scenarios may not need to be 100% automated, depending on recovery time objectives.

For functions-as-a-service (lambdas), if an invocation fails, it should be safe for the runtime to try again (i.e., idempotent processing of events). So the runtime provides most of the resilience.
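A minimal sketch of idempotent event processing follows, assuming each event carries a unique id field. The class name is my own, and the in-memory dictionary stands in for what would need to be a durable store shared across invocations in a real deployment.

```python
from typing import Callable


class IdempotentHandler:
    """Wraps a handler so that a retried event is processed at most once."""

    def __init__(self, handler: Callable[[dict], str]) -> None:
        self._handler = handler
        # In-memory stand-in for a durable store keyed by event id.
        self._results: dict[str, str] = {}

    def handle(self, event: dict) -> str:
        event_id = event["id"]
        if event_id in self._results:
            # Duplicate delivery: return the recorded result, skip side effects.
            return self._results[event_id]
        result = self._handler(event)
        self._results[event_id] = result
        return result
```

With this wrapper, a retry of the same event by the runtime returns the earlier result instead of repeating a side effect such as charging a customer twice.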

However, if a lambda depends on a ‘traditional’ service (i.e., not in itself dynamically scalable), there may be resilience issues. For example, a lambda connecting to a traditional relational database via SQL may run out of available server-side connections.

Resource constraints apply to any API which is not fronting a serverless architecture. So lambdas need to ensure sufficient resilience (e.g., circuit breaker pattern) is built-in so that constraints in other APIs do not cause the lambda to fail.
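The circuit breaker pattern mentioned above can be sketched as follows. This is a deliberately small version under my own assumptions (a failure count threshold and a fixed cool-down); the class and parameter names are illustrative, and production libraries handle more states and edge cases.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when calls are rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("dependency assumed unavailable")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Too many consecutive failures: open the circuit.
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

A function would wrap its database or API call in breaker.call(...), so that once the dependency is clearly struggling, further invocations fail fast rather than piling more connections onto it.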

Lock-in

Many enterprises are reluctant to use a particular cloud provider’s serverless model as these models tend to be very proprietary and cloud-provider specific, and therefore moving to another cloud, or enabling a solution to run on any cloud-provider, could involve considerable re-engineering expense.

Firms which are constrained by regulatory or other drivers to avoid provider lock-in have options available. Firms can use multi-cloud serverless frameworks such as Serverless. In addition, there are vendors appearing in the multi-cloud messaging space, with vendors like TriggerMesh offering a serverless multi-cloud event bus.

Some cloud providers are making source code for their lambda services available publicly – for example Google Cloud Functions and Azure Functions. Open-source serverless solutions such as Google’s Knative and OpenFaaS are also available. In addition, some vendors, such as Platform9, provide a completely independent solution for lambdas, for organizations which want to deploy lambdas internally – for example, on Kubernetes.

Other mechanisms to minimize the effect of lock-in include the use of standard OCI or docker containers to host serverless functions, which may allow containers to run in other orchestration environments without requiring significant rework. (This doesn’t really help if the container relies on external provider-specific managed services, however.)

Regardless of steps taken to avoid lock-in, some cloud providers may include managed services that may be proprietary to them: once software is built to leverage such a managed service, you have a form of lock-in (in much the same way, for example, you may be locked-in to Oracle or Microsoft databases once you commit to using proprietary features of them).

As such, focusing on avoiding lock-in is, for many firms, going to result in unnecessary complexity. It may be better to exploit a given cloud provider, and manage the business risk associated with a complete provider outage. For regulated services, however, regulators may want to ensure regulated firms are not overly concentrated in one provider.

Integration testing

Integration testing is never easy to fully automate – it is partly the reason why there is so much focus on microservices, as each microservice is an independently testable and deployable component. The same applies for lambdas. But each lambda may itself depend on multiple managed services, so how to test those? An excellent piece on serverless testing by Paul Johnston describes the challenge well:

The test boundaries for unit testing a FaaS Function appears to be very close to an integration test versus a component test within a microservice approach.

In essence, because all serverless features are available through APIs, it *should* be easier to build and maintain integration tests, but for now it is still harder than it ought to be.

Serverless Infrastructure as Code

There is a growing sense of dissatisfaction with the limitations of traditional YAML-based configuration languages with respect to serverless – in particular, that the lifecycle and dependencies of serverless resources are not properly represented in existing infrastructure configuration languages. Ben Kehoe gives a flavor of the issues, but this is a complex topic likely to get more attention in the future.

Summary

The key value proposition of serverless is that it permits application developers to focus more on delivering customer value, and to spend less time dealing with infrastructure concerns such as managing servers.

The time is right for organizations to start entering the serverless mindset, and to assess business solutions in the strategic context offered by serverless – whether that means ultimately using external services or designing internal services in a serverless way.

ServerlessConf 2019 was informative, and the presentations were generally accessible to a wide audience. For many presentations, it was not necessary to be a cloud engineer to understand the content and to appreciate the potential transformational opportunities of serverless in the coming years.

I hope that in future events, a broader coalition of business strategic planners and do-ers will be in attendance. It is definitely not a Kubecon, but engineering advances made at events like Kubecon will make the serverless vision possible, while freeing serverless practitioners from the complexities of managing containers, orchestrators and servers.

AWS CDK – why it’s worth looking at

[tl;dr AWS CDK provides a means for developers to consume compliant, reusable cloud infrastructure components in a way that matches their SDLC, improving developer experience and reducing the risk of silo’ing cloud infrastructure development and operations.]

Some weeks ago, Amazon launched the AWS Cloud Development Kit, or CDK. This article provides my initial (neutral) thoughts on the potential impact and relevance of the CDK for organizations building and deploying solutions on the cloud.

Developer Experience

First, does it work? The CDK was in beta for quite a while, and when Amazon makes a product generally available, it usually means it meets a very high bar in terms of quality and stability; that certainly is the case with the CDK. The examples all worked as expected, although the lambda code pipeline example took some mental gyrations to understand the full implications of what was actually being done – specifically, that the build/deploy pipeline created by the CDK can include running the CDK to generate templates that can be used as an input into that code-deployment pipeline – even if the resulting template is not deployed by the CDK.

All in all, the out-of-the-box experience for the CDK was excellent.

A polyglot framework

Secondly, the CDK is a software development framework. This means it uses ‘traditional’ programming languages (imperative, not declarative), it uses SDLC processes all application developers are familiar with (i.e., build/test/deploy cycles), and it provides many software abstractions that serve to hide (unnecessary) complexity from developers, while enabling developers to build ‘safe’ solutions.

The framework itself has been developed in Typescript, with an interesting technology called ‘jsii‘ used to generate native libraries in other programming languages (specifically, Java, Python, and C#/.NET as well as Javascript).

The polyglot nature of the framework is critical, as cloud infrastructure (as exposed to consumers) must be neutral to any specific programming language. But the most used languages have at least one popular framework for abstracting infrastructure necessary for building distributed applications (such as data stores, message queues, service discovery & routing, configuration, logging/tracing, caching, etc). For example, Java has the Spring framework, C# has .NET Core, Python has Django, Javascript has multiple frameworks based on NodeJS.

So, the question is, should developers now learn to use the CDK or focus instead on language-specific frameworks?

What the CDK is – and what it is not

To answer this question, we need to be clear on what the CDK is, and what it is not. CDK is a compiler to generate CloudFormation code. If one considers CloudFormation templates to be the ‘assembly language’ for the AWS cloud ‘processor’, then what is important is that CDK generates high-quality CloudFormation templates – period.

To that extent, the CDK only needs to be as efficient as it takes to generate valid CloudFormation templates. It does not execute those templates, so CDK code will never have run-time performance sensitivities (except perhaps, as with traditional compilers, in build toolchains).

For this reason, Typescript seems to have been a pragmatic choice to implement the CDK in. Run-time performance is not the key factor here – rather, it is the creation of flexible, adaptable constructs that avoid the need for developers to write CloudFormation YAML/JSON. Languages that use the CDK libraries should expect the same performance criteria to apply.

Why CDK is necessary

After working through several of the examples, and observing the complexity of the CloudFormation code the CDK generates – and the simplicity of the example code – it is clear that having developers write CloudFormation templates is no more sustainable than having developers write assembly language. CloudFormation (as with Azure Resource Manager and GCP Cloud Deployment Manager) is excellent for small, well-defined projects, but rapidly gets complex when expanded to many applications with complex infrastructure inter-dependencies.

In particular, ensuring the security and compliance of templates becomes very complex when templates are hand-crafted. While services like CloudSploit offer to statically scan CloudFormation templates for security breaches, it would be much better to ensure secure CloudFormation code was written in the first place.

Through Constructs, Stacks, and Apps, the CDK allows enterprise engineering teams to provide libraries of secure, compliant infrastructure components to developers that can be safely deployed.
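To illustrate the idea of a construct library that compiles down to a template, here is a toy sketch. It deliberately does not use the real CDK API: the SecureBucket and Stack classes are my own stand-ins, written only to show how a reusable component can bake in secure defaults and emit CloudFormation-style JSON. The resource type and property names follow the CloudFormation AWS::S3::Bucket resource.

```python
import json


class SecureBucket:
    """Toy 'construct': a storage bucket with encryption and no public access."""

    def __init__(self, logical_id: str) -> None:
        self.logical_id = logical_id

    def to_resource(self) -> dict:
        # Secure defaults are fixed here, so callers cannot forget them.
        return {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketEncryption": {
                    "ServerSideEncryptionConfiguration": [
                        {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
                    ]
                },
                "PublicAccessBlockConfiguration": {
                    "BlockPublicAcls": True,
                    "BlockPublicPolicy": True,
                    "IgnorePublicAcls": True,
                    "RestrictPublicBuckets": True,
                },
            },
        }


class Stack:
    """Toy 'stack': collects constructs and synthesizes one template."""

    def __init__(self) -> None:
        self._constructs: list[SecureBucket] = []

    def add(self, construct: SecureBucket) -> None:
        self._constructs.append(construct)

    def synth(self) -> str:
        resources = {c.logical_id: c.to_resource() for c in self._constructs}
        return json.dumps({"Resources": resources}, indent=2)
```

An application developer would write two lines (create a Stack, add a SecureBucket) and get a couple of dozen lines of template with the compliance settings already applied, which is the trade the real CDK offers at much larger scale.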

For this reason, as well as the familiarity of CDK constructs to developers, CDK is likely to end up being more popular than having developers hand-craft CloudFormation templates. The complexity of CloudFormation risks enterprises splitting teams into specialists and non-specialists, reverting organizations back to silo’d infrastructure anti-patterns.

However, having specialist engineering teams focused on building, publishing and maintaining high-quality reusable constructs is a good thing, and this is likely what most organizations’ enterprise engineering teams should focus on (as well as the larger global community of CDK developers).

With respect to language-specific frameworks, perhaps it is only a matter of time before these frameworks generate the cloud-native templates that the software level abstractions can map directly onto. This may mean that the application footprint for such applications could get much smaller in future, as the framework abstractions are increasingly implemented by cloud constructs. Indeed, as observed by Adrian Cockcroft, many of the open-source microservice components developed by Netflix ended up being absorbed by AWS, greatly simplifying the Netflix-specific code-base.

If this outlook proves correct, the correct approach for organizations already committed to a microservices framework would be to stick with it, rather than have their business-facing application developers learn CDK.

With respect to Terraform, the most popular cross-cloud provisioning, deployment and configuration tool, its principal benefit is consistent SDLC workflow across cloud providers. Organizations need to decide if a single end-to-end SDLC for application and infrastructure developers on a single cloud (using CDK) provides more benefits than a single infrastructure SDLC across multiple cloud providers, but with a different SDLC for application developers.

CDK & Serverless

For architectures which are fundamentally not ‘serverless’ in nature, the CDK presents a conflict: by allowing infrastructure to be specified and built as part of the developer lifecycle, where does responsibility for managing infrastructure lie?

The reality is, most organizations still exist in a ‘serverful’ world – where infrastructure environments, even if cloud-based, are ‘pets’ and not ‘cattle’. Environments tend to be created and managed over the long-term, especially where datastores are involved. Stacks tend to be stable across environments, changing only with new releases of software. Separate teams (from developers) are responsible for the ongoing health, security and cost of environments. These teams are likely to be much more comfortable with configuration and scripting than outright coding, using tools like Chef, Puppet, Ansible or AWS OpsWorks. They may prefer developers or architects to request infrastructure components via tools like AWS Service Catalog or ServiceNow, so that infrastructure code is firmly managed away from developers, and the benefits of SDLC-friendly CDK may be less obvious to them.

Generating and maintaining safe, secure and compliant cloud stacks is a vibrant area of growth, and CDK is unlikely to monopolise this – rather, it may spur the growth of 3rd party solutions. 3rd parties that aim to simplify and standardize cloud infrastructure management (such as Pulumi) will have a role to play, particularly for polyglot language and multi-cloud environments, but ‘serverful’ platform and infrastructure teams need to decide what infrastructure building blocks to expose to developers, and how.

With serverless, this dynamic changes significantly, and the CDK can safely become part of the development team’s SDLC. Indeed, a potential goal for enterprises moving towards a serverless target state (i.e., applications consisting of composable services with no fixed/bespoke infrastructure) is to use CDK constructs to define those business-level services as infrastructure components. The concept of a platform-as-a-service to integrate software-as-a-service is a concept worth exploring as this space matures, particularly with the advent of services like AWS EventBridge.

In the meantime, behind every ‘serverless’ service lie many servers. Teams have many options as to how best to automate this underlying infrastructure, and CDK is another tool in the toolbox to enable this.

Conclusion

AWS CDK is ground-breaking technology that is a big step towards improving the developer experience and capabilities on (AWS) cloud. Other polyglot cloud providers will likely follow suit or risk widening the gap between cloud infrastructure teams and application development teams. Organizations should consider investing in building and publishing CDK construct libraries to be used by application teams – constructs which can be verified to be secure and with sufficient guardrails to allow less-experienced engineers to safely experiment with solutions.

In the meantime, as cloud platforms extend their capabilities, expect language-specific microservices frameworks to get simpler and smaller (or at least more modular in implementation), enabling application developers to fully exploit a given cloud provider’s platform services. Teams relying on these frameworks should understand and drive the roadmap for how these frameworks leverage cloud-native services, and ensure they align with their wider platform cloud/infrastructure automation strategy.

Why AWS EventBridge changes everything..

“Events, dear boy, events”

Harold Macmillan

[tl;dr AWS EventBridge may encourage SaaS businesses to formally define and manage public event models that other businesses can design into their workflows. In turn, this may enable businesses to achieve agility goals by decomposing their organizations into smaller, event-driven “cells” with workflows empowered by multiple SaaS capabilities.]

Last week, Amazon formally announced the launch of the AWS EventBridge service. What makes this announcement so special?

The biggest single technical benefit is the avoidance of the need for webhooks or polling APIs. (See here for a good explanation of the difference.)

Webhooks are generally not considered a scalable solution for SaaS services, as significant engineering is required to make them robust, and consuming applications need to be designed to handle web-hook API calls.

HTTP-based APIs exposed by 3rd party services can be polled by applications that need to know if state has changed, but this polling consumes resources even when nothing changes. Again, this has scalability issues on both the SaaS provider as well as the application consumer.

In both cases, the principal metaphor connecting the SaaS and the consuming application is the ‘service interface’ abstraction – i.e., executing an operation on a resource. As such, this is a technical solution to a technical problem.

From APIs to Events

While this ‘service-based’ model of distributed programming is extremely powerful, it is not an appropriate abstraction for connecting behaviors across multiple services in a value chain. To align with business-level concepts such as Business Process Modelling, event-driven architectures are becoming more and more popular to model complex workflows both within and between organizations.

This trend is accelerated by the desire of organizations to become more “agile”. Increasingly organizations are recognizing this must manifest itself as breaking down the organization into more manageable, semi-autonomous “cells” (see this article from McKinsey as an example). With cells, the event metaphor fits naturally: cells can decide which events they care about, and also decide what events they in turn create that other cells may use.

3rd party service providers (i.e., SaaS companies such as SalesForce, Workday, Office365, ServiceNow, Datadog, etc) empower organizational cells and enable them to achieve far more than a small cell otherwise could. The “cell” concept cannot be fully realized unless every cell has the ability to define and control how it uses these services to achieve its own mission.

In addition, as value/supply chains become more complex, and more (3rd party or internal) providers are embedded in those workflows, the need for a more natural, adaptable way of integrating processes has become evident.

But event-driven architectures require a common ‘bus’ – a target-neutral means to allow zero or more consumers to express an interest in receiving events published on the bus. This historically has been impractical to do at scale between organizations (or even within organizations) without requiring all parties to agree on a neutral 3rd party to manage the bus, and at the additional risk of creating a change bottleneck: hence the historic preference for point-to-point HTTP-based standards.
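The ‘bus’ idea can be sketched in a few lines. This is a single-process, in-memory toy of my own, intended only to show that a publisher addresses the bus rather than any particular consumer, and that an event may have zero, one or many receivers; it says nothing about how a managed service implements this at scale.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-memory bus: publishers never reference consumers directly."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        """Register a consumer's interest in one type of event."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        """Deliver to every interested consumer; return how many received it."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

Publishing an event nobody has subscribed to is not an error here: publish simply returns 0, which is the ‘zero or more consumers’ property in practice.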

Services like the AWS EventBridge for the first time allow autonomous SaaS solutions to publish a formal event model that can be consumed programmatically and seamlessly included in local (cell-specific) workflows. In addition, this event model can be neutral to the underlying technology and cloud provider.

How it works and what makes it different

The key feature of the EventBridge is the separation of the publisher from the consumer, and the way that business rules to manage the routing and transformation of events are handled.

Once an organization (or AWS account) has registered as a consumer with the publisher (the owner of the “event source”), a logical “event bus” is created to represent all events for that org/account. The consuming org/account can then set up whatever routing and transformation rules it needs for any internal consumers of those events, without any further dependency on the publishing organization. So consumer organizations/accounts have full control over what is published internally to consuming applications.

With appropriate guard-rails in place, individual teams (“cells”) can define and configure their own routing rules, and not rely on any centralized team – a key weakness in many legacy ESB solutions.

Note that the EventBridge has predefined service limits – it has reasonably high throughput (400 events/sec), but is a high-latency service (0.5 sec). So low-latency use cases such as electronic trading are not, at this point, an appropriate fit for EventBridge.

The use of EventBridge for internal enterprise event handling should be considered carefully: the limit of 100 event buses per account essentially caps the number of publishers that can be handled by any one account at 100. For most use cases, this should be more than enough, but many large organizations may have many more than 100 ‘publishers’ publishing on their ESB. If each publisher can be viewed as a part of an end-to-end business value-stream, then any value-stream with more than 100 components (i.e., unique event models) is likely to be overly complex. In practice, a ‘publisher’ is likely to be an enterprise application: therefore some significant complexity reduction and consolidation (of event models, if not actual code) would be needed to ensure such organizations can use EventBridge internally effectively.

The AWS Way (also, the Cloud Way)

It’s worth noting that key to Amazon’s success is its ability to “eat its own dogfood”. Every service in Amazon and AWS is built atop other services. No service is allowed to get so big and bloated it cannot be managed effectively. Abstractions are ‘clean’ – rather than add bells and whistles to an existing service, a new service is created which leverages the underlying service or services.

AWS has consistently required every service to have and maintain an API model, which – for asynchronous/autonomous services – leads naturally to an event model. This in turn has made it natural for AWS EventBridge to come out-of-the-box with a number of events already emitted by AWS services that can be leveraged for customer solutions. (For now, many of these events are limited to generic CloudTrail-related events – specifically tracking API calls – but in the future it’s reasonable to expect more service-specific events to be made available.)

AWS does have one key advantage over other major cloud providers such as Google GCP and Microsoft Azure: it set out to build a business (an online marketplace) using these services. So its strategy was (and is) driven by its vision for how to build a globally scalable online business – not by the need to provide technology services to businesses. To this extent, it’s hard to see Google and Microsoft being anything other than followers of AWS’s lead.

A Prediction..

Businesses which also follow the Amazon-inspired growth/innovation and organization model will likely have a better chance of succeeding in the digital age. And it is for these businesses that EventBridge will have the most impact – far beyond the technological improvements afforded by the use of events vs webhooks/APIs.

Consequently, as more SaaS companies are on-boarded onto the AWS EventBridge eco-system, we can expect more event models to be published. Tools for managing and evolving event models will evolve and improve so they become more accessible and useful for non-traditional IT folks (i.e., process and workflow designers) – currently, the only way to see event model definitions seems to be by actually creating business rules.

This increased focus on SaaS integrations may (perhaps) inspire firms to re-organize their internal capabilities along similar lines, as internal service providers, empowering cells across the organization and with a published and accessible software-driven event model – noting that while events may be published and received digitally, they can still be actioned by humans for non-digital processes (e.g., complex pricing decision making, responding to help desk requests, etc).

The roster of SaaS firms signing up to EventBridge over the coming months will hopefully bear out this prediction. A good sense of what services could be onboarded can be had by looking at all the SaaS (and IoT) services integrated by IFTTT.

In the meantime, it is time to explore the re-imagined integration opportunities afforded by AWS EventBridge..

Transforming IT: From a solution-driven model to a capability-driven model

[tl;dr Moving from a solution-oriented to a capability-oriented model for software development is necessary to enable enterprises to achieve agility, but has substantial impacts on how enterprises organise themselves to support this transition.]

Most organisations which manage software change as part of their overall change portfolio take a project-oriented approach to delivery: the project goals are set up front, and a solution architecture and delivery plan are created in order to achieve the project goals.

Most organisations also fix project portfolios on a yearly basis, and deviating from this plan can often be very difficult for organisations to cope with – at least partly because such plans are intrinsically tied into financial planning and cost-saving techniques such as capitalisation of expenses, etc, which reduce bottom-line cost to the firm of the investment (even if it says nothing about the value added).

As the portfolio of change projects grows every year, due to many extraneous factors (business opportunities, revenue protection, regulatory demand, maintenance, exploration, digital initiatives, etc), cross-project dependency management becomes increasingly difficult. It becomes even more complex to manage solution architecture dependencies within that overall dependency framework.

What results is a massive set of compromises that ends up with building solutions that are sub-optimal for pretty much every project, and an investment in technology that is so enterprise-specific, that no other organisation could possibly derive any significant value from it.

While it is possible that even that sub-optimal technology can yield significant value to the organisation as a whole, this benefit may be short lived, as the cost-effective ability to change the architecture must inevitably decrease over time, reducing agility and therefore the ability to compete.

So a balance needs to be struck between delivering enterprise value (even at the expense of individual projects) and maintaining relative technical and business agility. By relative I mean relative to peers in the same competitive sector…sectors which are themselves being disrupted by innovative technology firms which are very specialist and agile within their domain.

The concept of ‘capabilities’ realised through technology ‘products’, in addition to the traditional project/program management approach, is key to this. In particular, it recognises the following key trends:

  • Infrastructure- and platform-as-a-service
  • Increasingly tech-savvy work-force
  • Increasing controls on IT by regulators, auditors, etc
  • Closer integration of business functions led by ‘digital’ initiatives
  • The replacement of the desktop by mobile & IoT (Internet of Things)
  • The tension between innovation and standards in large organisations

Enterprises are adapting to all the above by recognising that the IT function cannot be responsible for both technical delivery and ensuring that all technology-dependent initiatives realise the value they were intended to realise.

As a result, many aspects of IT project and programme management are no longer driven out of the ‘core’ IT function, but by domain-specific change management functions. IT itself must consolidate its activities to focus on those activities that can only be performed by highly qualified and expert technologists.

The inevitable consequence of this transformation is that IT becomes more product driven, where a given product may support many projects. As such, IT needs to be clear on how to govern change for that product, to lead it in a direction that is most appropriate for the enterprise as a whole, and not just for any particular project or business line.

A product must provide capabilities to the stakeholders or users of that product. In the past, those capabilities were entirely decided by whatever IT built and delivered: if IT delivered something that in practice wasn’t entirely fit for purpose, then business functions had no alternative but to find ways to work around the system deficiencies – usually creating more complexity (through end-user-developed applications in tools like Excel etc) and more expense (through having to hire more people).

By taking a capability-based approach to product development, however, IT can give business functions more options and ways to work around inevitable IT shortfalls without compromising controls or data integrity – e.g., through controlled APIs and services, etc.

So, while solutions may explode in number and complexity, the number of products can be controlled – with individual businesses being more directly accountable for the complexity they create, rather than ‘IT’.

This approach requires a step-change in how traditional IT organisations manage change. Techniques from enterprise architecture, scaled agile, and DevOps are all key enablers for this new model of structuring the IT organisation.

In particular, except for product-strategy (where IT must be the leader), IT must get out of the business of deciding the relative value/importance of individual product changes requested by projects, which historically IT has been required to do. By imposing a governance structure to control the ‘epics’ and ‘stories’ that drive product evolution, projects and stakeholders have some transparency into when the work they need will be done, and demand can be balanced fairly across stakeholders in accordance with their ability to pay.

If changes implemented by IT do not end up delivering value, it should not be because IT delivered the wrong thing, but rather the right thing was delivered for the wrong reason. As long as IT maintains its product roadmap and vision, such mis-steps can be tolerated. But they cannot be tolerated if every change weakens the ability of the product platform to change.

Firms which successfully balance between the project and product view of their technology landscape will find that productivity increases, complexity is reduced and agility increases massively. This model also lends itself nicely to bounded domain development, microservices, use of container technologies and automated build/deployment – all of which will likely feature strongly in the enterprise technology platform of the future.

The changes required to support this are significant – in terms of financial governance, delivery oversight, team collaborations, and the roles of senior managers and leaders. But organisations must be prepared to make this transition, as historical approaches to enterprise IT software development are clearly unsustainable.

Culture, Collaboration & Capabilities vs People, Process & Technology

[TL;DR The term ‘people, process and technology’ has been widely understood to represent the main dimensions impacting how organisations can differentiate themselves in a fast-changing technology-enabled world. This article argues that this expression may be misinterpreted with the best of intentions, leading to undesirable/unintended outcomes. The alternative, ‘culture, collaboration and capability’ is proposed.]

People, process & technology

When teams, functions or organisations are under-performing, the underlying issues can usually be narrowed down to one or more of the dimensions of people, process and technology.

Unfortunately, these terms can lead to an incorrect focus. Specifically,

  • ‘People’ can be understood to mean individuals who are under-performing or somehow disruptive to overall performance
  • ‘Process’ can be understood to mean formal business processes, leading to a focus on business process design
  • ‘Technology’ can be understood to mean engineering or legacy technology challenges which are resolvable only by replacing or updating existing technology

In many cases, this may in fact be the approach needed: fire the disruptive individual, redesign business processes using Six Sigma experts, or find another vendor selling technology that will solve all your engineering challenges.

In general, however, these approaches are neither practical nor desirable. Removing people can be fraught with challenges, and should only be used as a last resort. Firms using this as a way to solve problems will rapidly build up a culture of distrust and self-preservation.

Redesigning business processes using Six Sigma or other techniques may work well in very mature, well understood, highly automatable situations. However, in most dynamic business situations, no sooner has the process been optimised than it requires changing again. In addition, highly optimised processes may cause the so-called ‘local optimisation’ problem, where the sum of the optimised parts yields something far from an optimised whole.

Technology is generally not easy to replace: some technologies are significantly embedded in an organisation, with a large investment in people and skills to support the technology. But technologies change faster than people can adapt, and business environments change even quicker than technology changes. So replacing technologies comes at a massive cost (and risk) of replacing functionally rich existing systems with relatively immature new technology, and replacing existing people with people who may be familiar with the technology, but less so with your organisation. And what to do with the folks who have invested so much of their careers in the ‘old’ technology? (Back to the ‘people’ problem.)

A new meme

In order to effect change within a team, department or organisation, the focus on ‘people, process and technology’ needs to be adapted to ‘culture, collaboration and capabilities’. The following sections lay out what the subtle difference is and how it could change how one approaches solving certain types of performance challenges.

Culture

When we talk about ‘people’, we are really not talking about individuals, but about cultures. The culture of a team, department or organisation has a more significant impact on how people collectively perform than anything else.

Changing culture is hard: for a ‘bad’ culture caught early enough, it may be possible to simply replace the most senior person who is (consciously or not) leading the creation of the undesirable cultural characteristics. But once a culture is established, even replacing senior leadership does not guarantee it will change.

Changing culture requires the willing participation of most of the people involved. For this to happen, people need to realise that there is a problem, and they need to be open to new cultural leadership. Then, it is mainly a case of finding the right leadership to establish the new norms and carry them forward – something which can be difficult for some senior managers to do, particularly when they have an arms-length approach to management.

Typical ‘bad’ cultures in a (technology) organisation include poor practices such as lack of testing discipline, poor collaboration with other groups focused on different concerns (such as stability, infrastructure, etc), a lack of transparency into how work is done, or even a lack of collaboration within members of the same team (i.e., a ‘hero’ based approach to development).

Changing these can be notoriously difficult, especially if the firm is highly dependent on this team and what it does.

Collaboration

Processes are, ultimately, a way to formalise how different people collaborate at different times. Processes formalise collaborations, but often collaborations happen before the process is formalised – especially in high-performance teams who are aware of their environment and are open to collaboration.

Many challenges in teams, departments or organisations can be boiled down to collaboration challenges: people not understanding how (or even if) they should be collaborating, how often, how closely, etc.

In most organisations, ‘cooperation’ is a necessity: there are many different functions, most of which depend on each other. So there is a minimum level of cooperation in order to get certain things done. But this cooperation does not necessarily extend to collaboration, which is cooperation based on trust and a deeper understanding of the reasons why a collaboration is important.

Collaboration ultimately serves to strengthen relationships and improve the overall performance of the team, department or organisation.

Collaborations can be captured formally using business process design notation (such as BPMN) but often these treat roles as machines, not people, and can lead to forgetting the underlying goal: people need to collaborate in order to meet the goals of the organisation. Process design often aims to define people’s roles so narrowly that the individuals may as well be a machine – and as technology advances, this is exactly what is happening in many cases.

People will naturally resist this; defining processes in terms of collaborations will change the perspective and result in a more sustainable and productive outcome.

Capabilities

Much has been written here about ‘capabilities’, particularly when it comes to architecture. In this article, I am narrowing my definition to anything that allows an individual (or group of individuals) to perform better than they otherwise would.

From a technology perspective, particular technologies provide developers with capabilities they would not have otherwise. These capabilities allow developers to offer value to other people who need software developed to help them do their job, and who in turn offer capabilities to other people who need those people to perform that job.

When a given capability is ‘broken’ (for example, where people do not understand a particular technology very well, and so it limits their capabilities rather than expands them), then it ripples up to everybody who depends directly or indirectly on that capability: systems become unstable, change takes a long time to implement, users of systems become frustrated and unable to do their jobs, the clients of those users become frustrated at people being paid to do a job not being able to do it.

In the worst case, this can bring a firm to its knees, unable to survive in an increasingly dynamic, fast-changing world where the weakest firms do not survive long.

Technology should *always* provide a capability: the capability to deliver value in the right hands. When it is no longer able to achieve that role in the eyes of the people who depend on it (or when the ‘right hands’ simply cannot be found), then it is time to move on quickly.

Conclusion

Many of today’s innovations in technology revolve around culture, collaboration and capabilities. An agile, disciplined culture, where collaboration between systems reflects collaborations between departments and vice-versa, and where technologies provide people with the capabilities they need to do their jobs, is what every firm strives for (or should be striving for).

For new startups, much of this is a given – this is, after all, how they differentiate themselves against more established players. For larger organisations that have been around for a while, the challenge has been, and continues to be, how to drive continuous improvement and change along these three axes, while remaining sensitive to the capacity of the firm to absorb disruption to the people and technologies that those firms have relied on to get them to the (presumably) successful state they are in today.

Get it wrong and firms could rapidly lose market share and become over-taken by their upstart competitors. Get it right, and those upstart competitors will be bought out by the newly agile established players.

Scaled Agile needs Slack

[tl;dr In order to effectively scale agile, organisations need to ensure that a portion of team capacity is explicitly set aside for enterprise priorities. A re-imagined enterprise architecture capability is a key factor in enabling scaled agile success.]

What’s the problem?

From an architectural perspective, Agile methodologies are heavily dependent on business- (or function-) aligned product owners, who tend to be very focused on *their* priorities – and not the enterprise’s priorities (i.e., other functions or businesses that may benefit from the work the team is doing).

This results in very inward-focused development, and where dependencies on other parts of the organisation are identified, these (in the absence of formal architecture governance) tend to be minimised where possible, if necessary through duplicative development. And other teams requiring access to the team’s resources (e.g., databases, applications, etc) are served on a best-effort basis – often causing those teams to seek other solutions instead, or work without formal support from the team they depend on.

This, of course, leads to architectural complexity, leading to reduced agility all round.

The Solution?

If we accept the premise that, from an architectural perspective, teams are the main consideration (it is where domain and technical knowledge resides), then the question is how to get the right demand to the right teams, in as scalable and agile a manner as possible?

In agile teams, the product backlog determines their work schedule. The backlog usually has a long list of items awaiting prioritisation, and part of the Agile processes is to be constantly prioritising this backlog to ensure high-value tasks are done first.

Well-known management writing such as The Mythical Man-Month has influenced Agile’s goal to keep team sizes small (e.g., 5-9 people for scrum). So when new work comes, adding people is generally not a scalable option.

So, how to reconcile the enterprise’s needs with the Agile team’s needs?

One approach would be to ensure that every team pays an ‘enterprise’ tax – i.e., in prioritising backlog items, at least, say, 20% of work-in-progress items must be for the benefit of other teams. (Needless to say, such work should be done in such a way as to preserve product architectural integrity.)

20% may seem like a lot – especially when there is so much work to be done for immediate priorities – but it cuts both ways. If *every* team allows 20% of their backlog to be for other teams, then every team has the possibility of using capacity from other teams – in effect, increasing their capacity by much more than they could do on their own. And by doing so they are helping achieve enterprise goals, reducing overall complexity and maximising reuse – resulting in a reduction in project schedule over-runs, higher quality resulting architecture, and overall reduced cost of change.
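
The arithmetic behind this can be sketched in a few lines of Python. The figures used (10 teams, 40 story points per sprint) are hypothetical, every team is assumed to have the same capacity, and `pool_from_others` is only an upper bound: in practice a single team would never claim the whole allowance of every other team.

```python
def capacity_view(teams: int, points_per_team: int, tax_percent: int = 20) -> dict[str, float]:
    """Summarise per-sprint capacity under an enterprise 'tax'.

    Assumes every team has the same capacity, which is a simplification.
    """
    given = points_per_team * tax_percent / 100   # set aside for other teams
    local = points_per_team - given               # kept for the team's own priorities
    return {
        "local": local,
        "given": given,
        # Upper bound on what one team could draw from all the other teams.
        "pool_from_others": given * (teams - 1),
        # Total capacity across the enterprise reserved for cross-team work.
        "enterprise_total": given * teams,
    }


view = capacity_view(teams=10, points_per_team=40)
print(view)
```

Under these assumptions each team gives up 8 points per sprint, while the allowance held by the other nine teams adds up to 72 points, which is the sense in which the tax ‘cuts both ways’.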

Slack does not mean Under-utilisation

The concept of ‘Slack’ is well described in the book ‘Slack: Getting Past Burn-out, Busywork, and the Myth of Total Efficiency’. In effect, in an Agile sense, we are talking about organisational slack, and not team slack. Teams, in fact, will continue to be 100% utilised, as long as their backlog consists of more high-value items than they can deliver. The backlog owner – e.g., scrum master – can obviously embed local team slack into how a particular team’s backlog is managed.

Implications for Project & Financial Management

Project managers are used to getting funding to deliver a project, and then to be able to bring all those resources to bear to deliver that project. The problem is, that is neither agile, nor does it scale – in an enterprise environment, it leads to increasingly complex architectures, resulting in projects getting increasingly more expensive, increasingly late, or delivering increasingly poor quality.

It is difficult for a project manager to accept that 20% of their budget may actually be supporting other (as yet unknown) projects. So perhaps the solution here is to have Enterprise Architecture account for the effective allocation of that spending in an agile way? (i.e., based on how teams are prioritising and delivering those enterprise items on their backlog). An idea worth considering..

Note that the situation is a little different for planned cross-business initiatives, where product owners must actively prioritise the needs of those initiatives alongside their local needs. Such planned work does not count in the 20% enterprise allowance, but rather counts as part of how the team’s cost to the enterprise is formally funded. It may result in a temporary increase in resources on the team, but in this case discipline around ‘staff liquidity’ is required to ensure the team can still function optimally after the temporary resource boost has gone.

The challenge regarding project-oriented financial planning is that, once a project’s goals have been achieved, what’s left is the team and underlying architecture – both of which need to be managed over time. So some dissociation between transitory project goals and longer term team and architecture goals is necessary to manage complexity.

For smaller, non-strategic projects – i.e., no incoming enterprise dependencies – the technology can be maintained on a lights-on basis.

Enterprise architecture can be seen as a means to assess the relevance of a team’s work to the enterprise – i.e., managing both incoming and outgoing team dependencies. The higher the enterprise relevance of the team, the more critical it is that the team is managed well over time – i.e., team structure changes must be carefully managed, and not left entirely to the discretion of individual managers.

Conclusion

By ensuring that every project that purports to be Agile has a mandatory allowance for enterprise resource requirements, teams can have confidence that there is a route for them to get their dependencies addressed through other agile teams, in a manner that is independent of annual budget planning processes or short-term individual business priorities.

The effectiveness of this approach can be governed and evaluated by Enterprise Architecture, which would then allow enterprise complexity goals to be addressed without concentrating such spending within the central EA function.

In summary, to effectively scale agile, an effective (and possibly rethought) enterprise architecture capability is needed.

Making good architectural moves

[tl;dr Every change is an opportunity to make the ‘right’ architectural move to improve complexity management and to maintain an acceptable overall cost of change.]

Accompanying every new project, business requirement or product feature is an implicit or explicit ‘architectural move’ – i.e., a change to your overall architecture that moves it from a starting state to another (possibly interim) state.

The art of good architecture is making the ‘right’ architectural moves over time. The art of enterprise architecture is being able to effectively identify and communicate what a ‘right’ move actually looks like from an enterprise perspective, rather than leaving such decisions solely to the particular implementation team – who, it must be assumed, are best qualified to identify the right move from the perspective of the relevant domain.

The challenge is the limited inputs that enterprise architects have, namely:

  • Accumulated skill/knowledge/experience from past projects, including any architectural artefacts
  • A view of the current enterprise priorities based on the portfolio of projects underway
  • A corporate strategy and (ideally) individual business strategies, including a view of the environment the enterprise operates in (e.g., regulatory, commercial, technological, etc)

From these inputs, architects need to guide the overall architecture of the enterprise to ensure every project or deliverable results in a ‘good’ move – or at least not a ‘bad’ move.

In this situation, it is difficult if not impossible to measure the actual success of an architecture capability. This is because, in many respects, the beneficiaries of a ‘good’ enterprise architecture (at least initially) are the next deliverables (projects/requirements/features), and only rarely the current deliverables.

Since the next projects to be done are generally unknown (i.e., the business situation may change between the time the current projects complete and the time the next projects start), it is rational for people to focus exclusively on delivering the current projects. This makes it even more important that the current projects are in some way delivering the ‘right’ architectural moves.

In many organisations, the typical engagement with enterprise architecture is often late in the architectural development process – i.e., at a ‘toll-gate’ or formal architectural review. And the focus is often on ‘compliance’ with enterprise standards, principles and guidelines. Given that such guidelines can get quite detailed, it can get quite difficult for anyone building a project start architecture (PSA) to come up with an architecture that will fully comply: the first priority is to develop an architecture that will work, and is feasible to execute within the project constraints of time, budget and resources.

Only then does it make sense to formally apply architectural constraints from enterprise architecture – at least some of which may negatively impact the time, cost, resource or feasibility needs of the project – albeit to the presumed benefit of the overall landscape. Hence the need for board-level sponsorship for such reviews, as otherwise the project’s needs will almost always trump enterprise needs.

The approach espoused by an interesting new book, Chess and the Art of Enterprise Architecture, is that enterprise architects need to focus more on design and less on principles, guidelines, roadmaps, etc. Such an approach involves enterprise architects more closely in the creation and evolution of project (start) architectures, which represent the architectural basis for all the work the project does (although they do not necessarily lay out the detailed solution architecture).

This approach is also acceptable for planning processes which are more agile than waterfall. In particular, it acknowledges that not every architectural ‘move’ is necessarily independently ‘usable’ by end users of an agile process. In fact, some user stories may require several architectural moves to fully implement. The question is whether the user story is itself validated enough to merit doing the architectural moves necessary to enable it, as otherwise those architectural moves may be premature.

The alternative, then, is to ‘prototype’ the user story, so users can evaluate it – but at the cost of non-conformance with the project architecture. This is also known as ‘technical debt’, and where teams are mature and disciplined enough to pay down technical debt when needed, it is a good approach. But users (and sometimes product owners) struggle to tell the difference between an (apparently working) prototype and a production solution that is fully architecturally compliant, and it often happens that project teams move on to the next visible deliverable without removing the technical debt.

In applications where the end-user is a person or set of persons, this may be acceptable in the short term, but where the end-user could be another application (interacting via, for example, an API invoked by either a GUI or an automated process), then such technical debt will likely cause serious problems if not addressed. At the very least, it will make future changes harder (unmanaged dependencies, lack of automated testing), and may present significant scalability challenges.

So, what exactly constitutes a ‘good’ architectural move? In general, this is what the project start architecture should aim to capture. A good basic principle could be that architectural commitments should be postponed for as long as possible, by taking steps to minimise the impact of changed architectural decisions (this is a ‘real-option’ approach to architectural change management). Over the long term, this reduces the cost of change.
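As a minimal sketch of what postponing a commitment can look like at the code level: the business rule below is written against a small storage interface rather than a specific database, so the choice of database can be deferred (or reversed) at low cost. All names here (`CustomerStore`, `InMemoryCustomerStore`, `register_customer`) are hypothetical and purely illustrative; this is one possible way to keep an option open, not a prescribed pattern.

```python
from typing import Dict, Optional, Protocol


class CustomerStore(Protocol):
    """Hypothetical interface: all the business logic knows about storage."""

    def save(self, customer_id: str, name: str) -> None: ...

    def find(self, customer_id: str) -> Optional[str]: ...


class InMemoryCustomerStore:
    """Cheap stand-in, usable until a real database commitment is justified."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def save(self, customer_id: str, name: str) -> None:
        self._rows[customer_id] = name

    def find(self, customer_id: str) -> Optional[str]:
        return self._rows.get(customer_id)


def register_customer(store: CustomerStore, customer_id: str, name: str) -> bool:
    """Business rule written against the interface, not a specific database."""
    if store.find(customer_id) is not None:
        return False  # already registered, nothing to do
    store.save(customer_id, name.strip())
    return True
```

Swapping in a database-backed store later would mean writing one new class with the same two methods; `register_customer` and anything built on it would be unaffected, which is where the reduced cost of change comes from.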

In addition, project start architectures may need to make explicit where architectural commitments must be made (e.g., for a specific database, PaaS or integration solution, etc.) – i.e., areas where change will be expensive.

Other things the project start architecture may wish to capture or at least address (as part of enterprise design) could include:

  • Cataloguing data semantics and usage
    • to support data governance and big data initiatives
  • Management of business context & scope (business area, product, entity, processes, etc.)
    • to minimise unnecessary redundancy and process duplication
  • Controlled exposure of data and behaviour to other domains
    • to better manage dependencies and relationships with other domains
  • Compliance with enterprise policies around security and data management
    • to address operational risk
  • Automated build, test & deploy processes
    • to ensure continued agility and responsiveness to change, and maximise operational stability
  • Minimal lock-in to specific solution architectures (minimise solution architecture impact on other domains)
    • to minimise vendor lock-in and maximise solution options

The Chess book mentioned above includes a good description of a project start architecture (PSA), in the context of a PRINCE2 project framework. Note that the approach also works for Agile, but the PSA should set the boundaries within which the agile team can operate: if those boundaries must be crossed to deliver a user story, then enterprise design architects should be brought back into the discussion to establish the best way forward.

In summary, every change is an opportunity to make the ‘right’ architectural move to improve complexity management and to maintain an acceptable overall cost of change.

Making good architectural moves