Bending The Serverless Spoon

“Do not try and bend the spoon. That’s impossible. Instead, only realize the truth… THERE IS NO SPOON. Then you will see that it is not the spoon that bends, it is yourself.” — The Matrix

[tl;dr To change the world around them, organizations should change themselves by adopting serverless + agile as a target. IT organizations should embrace serverless to optimize and automate IT workflows and processes before introducing it for critical business applications.]

“Serverless” is the latest shiny new thing to arrive on the architectural scene. An excellent (opinionated) analysis of what ‘serverless’ means has been written by Jeremy Daly, a serverless evangelist – the basic conclusion being that ‘serverless’ is ultimately a methodology/culture/mindset.

If we accept that as a reasonable definition, how does this influence how we think about solution design and engineering, given that generations of computer engineers have grown up with servers front-and-center of design thinking?

In other words, how do we bend our way of thinking of a problem space to serverless-first, and use that understanding to help make better architectural decisions – especially with respect to virtual machines, containers, and orchestration, and distributed systems in general?

Worked Example

To provide some insight into the practicalities of building and running a serverless application, I used a worked example, “Building a Serverless App Using Athena and AWS Lambda” by Epsagon, a serverless monitoring specialist. It uses the open-source Serverless framework to simplify the creation of serverless infrastructure on a given cloud provider – in this case, AWS.

Note to those attempting to follow this exercise: not all the required code was provided in the version I used, so the tutorial does require some (JavaScript) coding skills to fill the gaps. The code that worked for me (with copious logging..) can be found here.
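To give a flavour of the kind of code involved, here is a minimal sketch of a product-creation Lambda handler – this is not the tutorial’s code, and the bucket name and key layout are illustrative assumptions; it simply shows the general shape of a function that persists products to S3, where Athena can query them:

```javascript
// Minimal sketch (not the tutorial's code): a Lambda handler that persists a
// 'product' as a JSON object in S3, where Athena can later query it.
// The bucket name and key layout are illustrative assumptions.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.createProduct = async (event) => {
  const product = JSON.parse(event.body);
  if (!product.ProductId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'ProductId is required' }) };
  }
  // One object per product; Athena treats the S3 prefix as a virtual table.
  await s3.putObject({
    Bucket: process.env.PRODUCTS_BUCKET, // assumed to be injected via serverless.yml
    Key: `products/${product.ProductId}.json`,
    Body: JSON.stringify(product),
  }).promise();
  return { statusCode: 201, body: JSON.stringify(product) };
};
```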

This worked example focuses on two reference-data oriented architectural patterns:

  • The transactional creation via a RESTful API of a uniquely identifiable ‘product’ with an ad-hoc set of attributes, including but not limited to ‘ProductId’, ‘Name’ and ‘Color’.
  • The ability to query all ‘products’ which share specific attributes – in this case, a shared name.

In addition, the ability to create/initialize shared state (in the form of a virtual database table) is also handled.
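Together with the handler sketched earlier, a Serverless framework configuration wiring these three capabilities (create, query, initialize) to Lambda functions and API Gateway endpoints might look roughly like the sketch below – the service name, paths, bucket, and runtime are illustrative assumptions, not the tutorial’s actual configuration:

```yaml
# Illustrative serverless.yml sketch; names and paths are assumptions.
service: products-service

provider:
  name: aws
  runtime: nodejs10.x
  environment:
    PRODUCTS_BUCKET: my-products-bucket   # hypothetical S3 bucket backing the Athena table

functions:
  createProduct:
    handler: handler.createProduct
    events:
      - http:                 # API Gateway endpoint, created automatically on deploy
          path: products
          method: post
  queryProducts:
    handler: handler.queryProducts
    events:
      - http:
          path: products
          method: get
  createTable:
    handler: handler.createTable   # initializes the virtual (Athena) database table
```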

Problem-domain Non-Functional Characteristics

Conceptually, the architecture has the following elements:

  • Public, anonymous RESTful APIs for product creation and query
    • APIs could be defined in OpenAPI 3.0, but by default are created automatically by the Serverless framework via API Gateway
  • Durable storage of product data information
    • Variable storage cost structure based on access frequency can be added through configuration
    • Long-term archiving/backup obligations can be met without using any other services.
  • Very low data management overhead
  • Highly resilient and available infrastructure
    • Additional multi-regional resilience can be added via Application Load Balancer and deploying Lambda functions to multiple regions
    • S3 and Athena are highly available and resilient across multiple availability zones by design
  • Scalable architecture
    • No fixed constraint on number of records that can be stored
    • No fixed constraint on number of concurrent users using the APIs (configurable)
    • No fixed constraint on the number of concurrent users querying Athena (configurable)
  • No servers to maintain (no networks, servers, operating systems, software, etc)
  • Costs based on utilization
    • If nobody is updating or querying the database, then no infrastructure is being used and no charges (beyond storage) are incurred
  • Secure through AWS IAM permissioning and S3 encryption.
    • Many more authentication, authorization, and encryption options are available via API Gateway, AWS Lambda, AWS S3, and AWS Athena.
  • Comprehensive log monitoring via CloudWatch, with ability to add alerts, etc.

For a couple of days’ coding, that’s a lot of non-functional goodness..and overall the development experience was pretty good (albeit not CI/CD optimized: I used Microsoft’s VS Code IDE on a MacBook, and the locally installed Serverless framework to deploy). Of course, I needed to be online and connected to AWS, but this seemed like a minor issue (for this small app). I did not attempt to deploy any serverless mock services locally.

So, even for a slightly contrived use case like the above, there are clear benefits to using serverless.

Why bend the spoon?

A number of factors typically need to be taken into consideration when designing solutions, and these tend to drive architectures away from ‘serverless’ towards ‘serverful’. Typically, they revolve around resource management (i.e., network, compute, storage) and state management (i.e., transactional state changes).

The fundamental issue that application architects need to deal with in any solution architecture is the ‘impedance mismatch’ between general-purpose storage services and applications. Applications and application developers fundamentally want to treat all their data objects as if they are always available, in-memory (i.e., fast to access) and globally consistent, forcing engineers to optimize infrastructure to meet that need. This generally precludes using general-purpose or managed services, and results in infrastructure being tightly coupled with specific application architectures.

The simple fact is that a traditional well-written, modular 3-tier (GUI, business logic, data store) monolithic architecture will always outperform a distributed system – for the set of users and use-cases it is designed for. But these architectures are (arguably) increasingly rare in enterprises for a number of reasons, including:

  • Business processes are increasing in complexity (aka features), consisting of multiple independently evolving enterprise functions that must also be highly digitally cohesive with each other.
  • More and more business functions are being provided by third-parties that need close (digital) integration with enterprise processes and systems, but are otherwise managed independently.
  • There are many, disparate consumers of (digital) process data outputs – in some cases enabling entirely new business lines or customer services.
  • (Digital) GUI users extend well outside the corporate network, to mobile devices as well as home networks, third-party provider networks, etc.

All of the above conspire to drive even the most well-architected monolithic application towards a ‘ball-of-mud’ architecture.

Underpinning all of this is the real motivation behind modern (cloud-native) infrastructure: in a digital age, infrastructure needs to be capable of being ‘internet scale’ – supporting all 4.3+ billion internet users, and growing.

Such scale demands serverless thinking. However, businesses that do not aspire to internet-scale usage still have key concerns:

  • Ability to cope with sudden demand spikes in B2C services (e.g., due to marketing campaigns) and with increased or highly variable utilization of B2B services (e.g., due to B2B customers going digital themselves)
  • Ability to provide secure and robust services to customers when they need them, resilient to risk
  • Ability to continuously innovate on products and services to retain customers and remain competitive
  • Ability to comply with all regulatory obligations, including data privacy and protection, without impeding the ability to change
  • Ability to reorganize how internal capabilities are provisioned and provided, with minimal impact on any of the above.

Without serverless thinking, meeting all of these sometimes conflicting needs becomes very complex, and will consume ever more enterprise IT engineering capacity.

Note: for firms to really understand where serverless should fit in their overall investment strategy, Wardley Maps are a very useful strategic planning tool.

Bending the Spoon

Bending the spoon means rethinking how we architect systems. It fundamentally means closing the gap between models and implementation, and recognizing that where an architecture is deficient, the instinctive reaction to fix or change what you control needs to be overcome: i.e., drive the change to the team (or service provider) where the issue properly belongs. This requires out-of-the-box thinking – and perhaps is a decision that should not be taken by individual teams on their own unless they really understand their service boundaries.

This approach may require teams to scale back new features, or modify roadmaps, to accommodate what can currently be appropriately delivered by the team, and accepting what cannot.

Most firms fail at this – because typically senior management focus on the top-line output and not on the coherence of the value-chain enabling it. But this is what ‘being digital’ is all about.

Everyone wants to be serverless

The reality is, the goal of all infrastructure teams is to save developers from having to worry about infrastructure. So while technologies like Docker initially aimed to democratize deployment, infrastructure engineering teams are working to ensure developers never need to know how to build or manage a Docker image, configure a virtual machine, or manage a network or storage device. This even extends to hiding the specifics of IaaS services exposed by cloud providers.

Organizations that are evaluating Kubernetes, OpenFaaS, or Knative, or which use services such as AWS Fargate, AWS ECS, or Azure Container Service, ultimately want to minimize the knowledge developers need of the infrastructure they are working on.

Unfortunately for infrastructure teams, most developers still develop applications using the ‘serverful’ model – i.e., they want to know what containers are running where, how they are configured, how they interact, how they are discovered, etc. Developers also want to run containers on their own laptop whenever they can, and deploy applications to authorized environments whenever they need to.

Developers also build applications which require complex configuration that is often hand-constructed between and across environments, as performance or behavioural issues are identified and ‘patched’ (i.e., worked around instead of directing the problem to the ‘right’ team/codebase).

At the same time, developers do not want anything to do with servers…containers are as close as they want to get to infrastructure, but containers are just an abstraction of servers – they are most definitely not ‘serverless’.

To be Serverless, Be Agile

Serverless solutions are still in the early stages of maturity. For problems that require a low-cost, resilient and always-available solution, but are not particularly performance-sensitive (i.e., are naturally asynchronous and eventually consistent), serverless solutions are ideal.

In particular, IT processes (the proverbial shoes for the cobbler’s children) would benefit significantly from extensive use of serverless, as the management overhead of serverless solutions will be significantly less than that of other solutions. Integrating bespoke serverless solutions with workflows managed by tools like ServiceNow could be a significant game changer for existing IT organizations.

However, mainstream use of serverless technologies and solutions for business critical enterprise applications is still some way away – but if IT departments develop skills in it, it won’t be long before it finds its way into critical business solutions.

For broader use of serverless, firms need to be truly agile. Work for teams needs to come as much from other dependent teams as from top-down sources. Teams themselves need to be smaller (and ‘senior’ staff need to rethink their roles), and be prepared to split or plateau. And feature roadmaps need to be driven as much by capabilities as by imagined needs.

Conclusion

Organizations already know they need to be ‘agile’. To truly change the world (bend the spoon), serverless and agile together will enable firms to change themselves, and so shape the world around them.

Unfortunately, for many organizations it is still easier to try to bend the spoon. For those who understand they need to change, adopting the ‘serverless’ mindset is key to success, even if – at least initially – true serverless solutions remain a challenge to realize in organizations dealing with legacy (serverful) architectures.


The changing role of data lakes

[tl;dr A single data lake, data warehouse or data pipeline to “rule them all” is less useful in hybrid cloud environments, where it can be feasible to query ‘serverless’ cloud-native data sources directly rather than rely on traditional orchestrated batch extracts. Pipeline complexity can be reduced by open extensions to SQL such as the recently announced, Amazon-sponsored PartiQL language. Opportunities exist to integrate enterprise human-oriented data governance and meta-data platforms with data pipelines using serverless technologies.]

The need for Data Lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. The data lake concept was created to address a number of issues with traditional data analytics and reporting solutions, specifically:

  • the growing number of applications across an enterprise depending on a given dataset;
  • business and regulatory drivers for governing dataset discovery, quality, creation and/or consumption;
  • the increasing difficulty IT teams have in responding in a timely manner to growing business demand for access to high-quality datasets.

The data lake allows data to be made available from its source without making any assumptions about its use. This is particularly critical when the data originates from batch extracts of load-sensitive OLTP databases, most of which are still operating on-premise. Streaming data pipelines, while growing in popularity, are not as common as batch-driven pipelines – although this should change over time as digital platform architectures become more event-driven in nature.

Data lakes are a key component in data pipelines, a construct (or set of constructs) that provides consolidation of data from multiple sources and makes it available for use. A data pipeline can be orchestrated (via a scheduler) or choreographed (responding to events) – the more jobs a pipeline has to do, the more complex the orchestration or choreography, which has implications for supportability. So reducing the number of jobs a pipeline has to support is key to managing data pipeline complexity.

The Components of a Data Lake

A data lake consists of a few key components:

| Feature | Description | Virtual | Traditional |
|---|---|---|---|
| A storage repository | Durable, resilient storage of data objects. | No | Yes |
| An ingestion mechanism | A means to upload content to the repository (no transformation). | No | Yes |
| A tagging & metadata mechanism | A means to associate metadata with data objects, including user-defined tags. | Yes | Yes |
| A metadata search mechanism | A means to search objects in the data lake based on metadata and tags (not content). | Yes | Yes |
| A query engine | A means to search the content of objects in the data lake. | Yes | Partially |
| An access control mechanism | A means to ensure that users can only access datasets and parts of datasets that they are entitled to see, and to audit all activity. | Yes | Yes |

In effect, data lakes have become a kind of data warehouse – the most significant difference being that input sources into data lakes tend to be familiar file formats (CSV, Avro, JSON, etc.) from multiple sources, rather than highly optimized domain-specific schemas – i.e., no assumptions are made about how (or why) the data in the data lake will be consumed. Data lakes also do not concern themselves with scheduling or orchestration.

Data warehouses, data warehouses everywhere…

For mature data use cases (i.e., situations where relatively stable, well-known data requirements exist), and where consistent high performance is material to meeting customer needs, data warehouses are still the best solution. A data warehouse stores and manages all of its data locally, and only relies on the data lake as an initial ingestion point.

A data warehouse will transform datasets to the form needed for the specific use cases it supports, and will optimize performance for the consumption of those datasets. Modern data warehouses will use ML/AI techniques to optimize performance rather than relying on human database specialists. But, as this approach is compute intensive, such solutions are more amenable to cloud environments than on-premise environments. Snowflake is an example of this model. As more traditional data warehouses (e.g., Oracle Exadata) move to the cloud, we can expect these to also get ‘smarter’ – however, data gravity will mean such solutions will need to be fundamentally multi-cloud compatible.

For on-premise data warehouses, the tendency is for business lines or functions to create ‘one data warehouse to rule them all’ – mainly because of the traditionally significant storage and compute infrastructure and resources necessary to support data warehouses. Consequently, considerable effort is spent on defining and maintaining high-performance, appropriately normalized enterprise data models that can be used in as many enterprise use cases as possible.

In a hybrid/cloud world, multiple data warehouses become more feasible – and, in fact, will be inevitable in larger organizations. As more enterprise data becomes available in dynamically scalable, cloud-based (or HDFS/Hadoop-based) data warehouses (such as AWS EMR, AWS Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse), ‘virtual data warehouses’ avoid the need to move data from its source for query handling, allowing data storage and egress costs to be kept to a minimum, especially if assisted by machine-learning techniques.

Virtual Data Warehouses

Virtual data warehouse technologies have been around for a while, allowing users to manage and query multiple data sources through a common logical access point. For on-premise solutions, virtual data warehouses have had limited applicability, as the cost/effort of scaling out in-house solutions can be prohibitive and not particularly agile, precluding experimental use cases.

On hybrid or cloud environments, virtual data warehouses can leverage the scalability of cloud-native data warehouses, driving queries to the relevant engine for execution, and then using their own scalable infrastructure to execute join queries.

Technologies like Dremio reflect the state of the art in cloud-based data warehouses, which push down queries to the source system where possible, but can process them in-memory directly from a data lake or other source if not.

However, there is one thing that all data warehouses have in common: they leverage SQL and (implicitly) a relational view of the data. Standard ANSI SQL queries are generally supported by all data warehouses, but this means that some data cannot be queried if it is not in a tabular form amenable to SQL processing.

Extending SQL with PartiQL

Enter PartiQL, an open-source project sponsored by Amazon to drive extensions to standard SQL that can cope with non-relational data types, including structured, unstructured, nested, and schemaless (NoSQL, Document).

Historically, all data ingested into a data lake had to be transformed into a format that could be queried by SQL-like commands or processed by typical data warehouse bulk-upload tools. This adds complexity to data pipelines (i.e., more jobs), and may also force premature schema design (i.e., forcing the design of an optimal schema before all critical use cases are fully understood).

PartiQL potentially allows tools such as Snowflake and Dremio (as well as the tools AWS uses internally) to query data using SQL-like syntax, while also including non-relational data in those queries – avoiding separate transformation steps and so helping to reduce pipeline complexity.
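As an illustrative sketch (the table and field names here are invented), a PartiQL query can unnest a document’s embedded array directly in the FROM clause, querying semi-structured data without a prior flattening job:

```sql
-- Hypothetical PartiQL query over semi-structured 'order' documents,
-- where each order contains a nested 'items' array.
SELECT o.orderId, i.sku, i.quantity
FROM orders AS o, o.items AS i      -- unnests the items array per order
WHERE o.customer.region = 'EU'
  AND i.quantity > 10
```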

PartiQL claims to be fully ANSI-compliant, but extended in specific ways to support alternate data formats. While not an official ISO/ANSI standard, it may have the ability to become a de-facto standard – especially as the language has already been used in anger with success within AWS. This will provide a skill path for relational data warehouse experts to become proficient in leveraging modern data pipelines without committing to one specific vendor’s technology.

Technologies like PartiQL will make it much easier to include event-sourced streams into a data pipeline, as events are defined as nested or other non-relational structures. As more data pipelines become event driven rather than batch-driven, having a standard like PartiQL will be key. (It will be interesting to see if Confluent’s KSQL and PartiQL will converge to a single event-stream query standard.)

As PartiQL has only just been released, it’s too soon to tell how the big data ecosystem or ISO/ANSI will respond. Expect more on this topic in the future. For now, virtual data warehouses must rely on their proprietary SQL extensions.

Non-SQL Data Processing

Considerable investment is being made by third-party vendors in advanced technology focused on making distributed, scalable processing of SQL (or SQL-like) queries fast and reliable, with little or no human tuning required. As such, it is wise to pick a vendor that demonstrates a clear strategy in this space and continues to invest in SQL as the lingua franca of transformation logic.

However, for use cases for which SQL is not appropriate, distributed computing platforms like Spark are still needed. The expectation here is that such platforms will ingest data from a data lake, and output results into a data lake. In some cases, the distributed computing platform offers its own storage (e.g., HDFS), but increasingly it is more appropriate to question whether data needs to reside permanently in an HDFS cluster rather than in a data lake. For example, Amazon’s EMR service allows Hadoop clusters to be created ephemerally, consuming their initial dataset from AWS S3 repositories or other data sources.
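As a hedged sketch of what this looks like in practice (cluster sizing, names, and the script location are all assumptions), an ephemeral EMR cluster can be created via the AWS SDK, configured to terminate itself once its Spark step has consumed its input from S3:

```javascript
// Sketch: launch an ephemeral EMR cluster that runs one Spark step against
// data in S3 and terminates when done. All names/sizes are illustrative.
const AWS = require('aws-sdk');
const emr = new AWS.EMR({ region: 'us-east-1' });

emr.runJobFlow({
  Name: 'ephemeral-spark-job',
  ReleaseLabel: 'emr-5.25.0',
  ServiceRole: 'EMR_DefaultRole',
  JobFlowRole: 'EMR_EC2_DefaultRole',
  Applications: [{ Name: 'Spark' }],
  Instances: {
    MasterInstanceType: 'm5.xlarge',
    SlaveInstanceType: 'm5.xlarge',
    InstanceCount: 3,
    KeepJobFlowAliveWhenNoSteps: false, // cluster terminates after the step completes
  },
  Steps: [{
    Name: 'transform',
    ActionOnFailure: 'TERMINATE_CLUSTER',
    HadoopJarStep: {
      Jar: 'command-runner.jar',
      Args: ['spark-submit', 's3://my-data-lake/jobs/transform.py'], // hypothetical script
    },
  }],
}).promise()
  .then((res) => console.log('Started cluster:', res.JobFlowId))
  .catch(console.error);
```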

Enforcing Enterprise Data Collaboration and Governance

Note that all data warehouse solutions (virtual or not) must support some form of meta-data tagging and management used by their SQL query engines – otherwise they cannot act as a virtual database source (generally an ODBC end-point that applications can connect to directly). This tagging can be automated if sources include meta-data (e.g., field headers, Avro schema definitions, etc), and can be enhanced by human tagging, which is increasingly augmented by machine learning to help identify, for example, where data may be sensitive.

But data governance needs extend beyond the needs of the virtual data warehouse query engines, and this is where there are still gaps to be filled in the current enterprise data management tools.

Tools from vendors like Alation, Waterline, Informatica, Collibra, etc, were created to augment people’s ability to properly tag content in the data lake with meaningful information, to make it discoverable and governable. Consistent tagging in principle allows tag-based governance rules to be defined to automatically enforce data governance policies in data consumers. This data, coupled with schema information which can be derived directly from data sources, is all the information needed to allow users (or developers) to source the data they need in a secure, compliant way.

But meta-data for data governance has humans as its primary users (e.g., CDOs, business/data analysts, process owners, etc) – or, as Alation describes it, meta-data for human collaboration.

Currently, there are no accepted standards for ensuring the consistency of ‘meta-data for human collaboration’ with ‘meta-data for query execution’.

Ideally, the human-oriented tools would generate standard events that tools in the data pipeline could pick up and act on (via, for example, something like AWS EventBridge), thereby avoiding the need for data governance personnel to oversee multiple data pipelines directly…
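For example (a sketch under assumed names – no governance vendor currently publishes such an event model), a governance tool could emit a ‘dataset tagged’ event onto EventBridge that downstream pipeline components subscribe to:

```javascript
// Sketch: a data-governance tool publishing a 'DatasetTagged' event to
// EventBridge for pipeline components to act on. The source, detail-type
// and payload fields are hypothetical.
const AWS = require('aws-sdk');
const eventbridge = new AWS.EventBridge();

eventbridge.putEvents({
  Entries: [{
    Source: 'com.example.governance',            // hypothetical event source
    DetailType: 'DatasetTagged',
    Detail: JSON.stringify({
      dataset: 's3://my-data-lake/customers/',   // hypothetical dataset location
      tag: 'pii',
      policy: 'mask-before-consumption',
    }),
  }],
}).promise()
  .then(() => console.log('Governance event published'))
  .catch(console.error);
```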

Summary

With the advent of cloud-based managed compute and data storage services, a multi-data warehouse and pipeline strategy is viable and may even be desirable, potentially involving multiple data lakes.

Solutions like PartiQL have the potential to eliminate many transformation job phases and greatly simplify data pipeline complexity in a standardized way, leveraging existing SQL skills rather than requiring new skills.

To ensure consistent governance across multiple data pipelines, a serverless event-based approach to connecting human data governance solutions with cloud-native data pipeline solutions may be the way forward – for example, using AWS EventBridge to action events originating from SaaS-based data governance services with data pipelines.


Why AWS EventBridge changes everything..

“Events, dear boy, events”

Harold Macmillan

[tl;dr AWS EventBridge may encourage SaaS businesses to formally define and manage public event models that other businesses can design into their workflows. In turn, this may enable businesses to achieve agility goals by decomposing their organizations into smaller, event-driven “cells” with workflows empowered by multiple SaaS capabilities.]

Last week, Amazon formally announced the launch of the AWS EventBridge service. What makes this announcement so special?

The biggest single technical benefit is the avoidance of the need for webhooks or polling APIs. (See here for a good explanation of the difference.)

Webhooks are generally not considered a scalable solution for SaaS services, as significant engineering is required to make them robust, and consuming applications need to be designed to handle webhook API calls.

HTTP-based APIs exposed by 3rd-party services can be polled by applications that need to know if state has changed, but this polling consumes resources even when nothing changes. Again, this presents scalability issues for both the SaaS provider and the application consumer.

In both cases, the principal metaphor connecting the SaaS and the consuming application is the ‘service interface’ abstraction – i.e., executing an operation on a resource. As such, this is a technical solution to a technical problem.

From APIs to Events

While this ‘service-based’ model of distributed programming is extremely powerful, it is not an appropriate abstraction for connecting behaviors across multiple services in a value chain. To align with business-level concepts such as Business Process Modelling, event-driven architectures are becoming more and more popular to model complex workflows both within and between organizations.

This trend is accelerated by the desire of organizations to become more “agile”. Increasingly organizations are recognizing this must manifest itself as breaking down the organization into more manageable, semi-autonomous “cells” (see this article from McKinsey as an example). With cells, the event metaphor fits naturally: cells can decide which events they care about, and also decide what events they in turn create that other cells may use.

3rd party service providers (i.e., SaaS companies such as SalesForce, Workday, Office365, ServiceNow, Datadog, etc) empower organizational cells and enable them to achieve far more than a small cell otherwise could. The “cell” concept cannot be fully realized unless every cell has the ability to define and control how it uses these services to achieve its own mission.

In addition, as value/supply chains become more complex, and more (3rd-party or internal) providers are embedded in those workflows, the need for a more natural, adaptable way of integrating processes has become evident.

But event-driven architectures require a common ‘bus’ – a target-neutral means of allowing zero or more consumers to express an interest in receiving events published on the bus. This has historically been impractical to do at scale between organizations (or even within organizations) without requiring all parties to agree on a neutral 3rd party to manage the bus, at the additional risk of creating a change bottleneck: hence the historic preference for point-to-point HTTP-based standards.

Services like the AWS EventBridge for the first time allow autonomous SaaS solutions to publish a formal event model that can be consumed programmatically and seamlessly included in local (cell-specific) workflows. In addition, this event model can be neutral to the underlying technology and cloud provider.

How it works and what makes it different

The key feature of EventBridge is the separation of the publisher from the consumer, and the way that business rules for managing the routing and transformation of events are handled.

Once an organization (or AWS account) has registered as a consumer with the publisher (the owner of the “event source”), a logical “event bus” is created to represent all events for that org/account. The consuming org/account can then set up whatever routing and transformation rules it needs for any internal consumers of those events, without any further dependency on the publishing organization. So consumer organizations/accounts have full control over what is published internally to consuming applications.

With appropriate guard-rails in place, individual teams (“cells”) can define and configure their own routing rules, and not rely on any centralized team – a key weakness in many legacy ESB solutions.
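A consuming team’s routing setup might look like the following sketch, using the AWS SDK – the partner bus name, event pattern, and target ARN are all illustrative assumptions:

```javascript
// Sketch: a consuming 'cell' routes only the partner events it cares about
// to its own Lambda function. Bus name, pattern and ARNs are illustrative.
const AWS = require('aws-sdk');
const eventbridge = new AWS.EventBridge();

async function routeHighPriorityTickets() {
  // A rule on the partner event bus, matching a subset of published events.
  await eventbridge.putRule({
    Name: 'high-priority-tickets',
    EventBusName: 'aws.partner/example.saas/tickets', // hypothetical partner bus
    EventPattern: JSON.stringify({
      'detail-type': ['TicketCreated'],
      detail: { priority: ['high'] },
    }),
  }).promise();

  // Deliver matched events to a team-owned consumer.
  await eventbridge.putTargets({
    Rule: 'high-priority-tickets',
    EventBusName: 'aws.partner/example.saas/tickets',
    Targets: [{
      Id: 'escalation-lambda',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:escalate', // illustrative
    }],
  }).promise();
}

routeHighPriorityTickets().catch(console.error);
```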

Note that EventBridge has predefined service limits – it offers reasonably high throughput (400 events/sec) but is a high-latency service (around 0.5 sec). So low-latency use cases such as electronic trading are not, at this point, appropriate for EventBridge.

The use of EventBridge for internal enterprise event handling should be considered carefully: the limit of 100 event buses per account essentially caps the number of publishers any one account can consume from at 100. For most use cases, this should be more than enough, but many large organizations may have many more than 100 ‘publishers’ publishing on their ESB. If each publisher can be viewed as a part of an end-to-end business value-stream, then any value-stream with more than 100 components (i.e., unique event models) is likely to be overly complex. In practice, a ‘publisher’ is likely to be an enterprise application, so some significant complexity reduction and consolidation (of event models, if not actual code) would be needed for such organizations to use EventBridge effectively internally.

The AWS Way (also, the Cloud Way)

It’s worth noting that key to Amazon’s success is its ability to “eat its own dogfood”. Every service in Amazon and AWS is built atop other services. No service is allowed to get so big and bloated that it cannot be managed effectively. Abstractions are ‘clean’ – rather than adding bells and whistles to an existing service, a new service is created which leverages the underlying service or services.

AWS has consistently required every service to have and maintain an API model, which – for asynchronous/autonomous services – leads naturally to an event model. This in turn has made it natural for AWS EventBridge to come out-of-the-box with a number of events already emitted by AWS services that can be leveraged for customer solutions. (For now, many of these events are limited to generic CloudTrail-related events – specifically tracking API calls – but in the future it’s reasonable to expect more service-specific events to be made available.)

AWS does have one key advantage over other major cloud providers such as Google GCP and Microsoft Azure: it set out to build a business (an online marketplace) using these services. So its strategy was (and is) driven by its vision for how to build a globally scalable online business – not by the need to provide technology services to businesses. To this extent, it’s hard to see Google and Microsoft being anything other than followers of AWS’s lead.

A Prediction..

Businesses which also follow the Amazon-inspired growth/innovation and organization model will likely have a better chance of succeeding in the digital age. And it is for these businesses that EventBridge will have the most impact – far beyond the technological improvements afforded by the use of events vs webhooks/APIs.

Consequently, as more SaaS companies are on-boarded onto the AWS EventBridge eco-system, we can expect more event models to be published. Tools for managing and evolving event models will evolve and improve so they become more accessible and useful for non-traditional IT folks (i.e., process and workflow designers) – currently, the only way to see event model definitions seems to be by actually creating business rules.

This increased focus on SaaS integrations may inspire firms to re-organize their internal capabilities along similar lines – as internal service providers, empowering cells across the organization, each with a published and accessible software-driven event model – noting that while events may be published and received digitally, they can still be actioned by humans for non-digital processes (e.g., complex pricing decisions, responding to help desk requests, etc).

The roster of SaaS firms signing up to EventBridge over the coming months will hopefully bear out this prediction. A good sense of what services could be onboarded can be had by looking at all the SaaS (and IoT) services integrated by IFTTT.

In the meantime, it is time to explore the re-imagined integration opportunities afforded by AWS EventBridge..


Why IoT is changing Enterprise Architecture

[tl;dr The discipline of enterprise architecture as an aid to business strategy execution has failed for most organizations, but is finding a new lease of life in the Internet of Things.]

The strategic benefits of having an enterprise-architecture based approach to organizational change – at least in terms of business models and shared capabilities needed to support those models – have been the subject of much discussion in recent years.

However, enterprise architecture as a practice (as espoused by The Open Group and others) has never managed to break beyond its role as an IT-focused endeavor.

In the meantime, less technology-minded folks are beginning to discuss business strategy using terms like ‘modularity’, which is a first step towards bridging the gap between the business folks and the technology folks. And technology-minded folks are looking at disruptive business strategy through the lens of the ‘Internet of Things‘.

Business Model Capability Decomposition

Just like manufacturing-based industries decomposed their supply-chains over the past 30+ years (driving an increasingly modular view of manufacturing), knowledge-based industries are going through a similar transformation.

Fundamentally, knowledge-based industries are based on the transfer and management of human knowledge or understanding. So, for example, when you pay for something, there is an understanding on both sides that the payment has happened. Technology allows such information to be captured and managed at scale.

But the ‘units’ of knowledge have only slowly been standardized, and knowledge-based organizations are incentivized to ensure they are the only ones able to act on the information they have gathered – often with disastrous social and economic consequences (e.g., the financial crisis of 2008).

Hence, regulators are stepping in to ensure that at least some of this ‘knowledge’ is available in a form that allows governments to ensure such situations do not arise again.

In the FinTech world, every service provided by big banks is being attacked by nimble competitors able to take advantage of new, more meaningful technology-enabled means of engaging with customers, and who are willing to make at least some of this information more accessible so that they can participate in a more flexible, dynamic ecosystem.

For these upstart FinTech firms, they often have a stark choice to make in order to succeed. Assuming they have cleared the first hurdle of actually having a product people want, at some point, they must decide whether they are competing directly with the big banks, or if they are providing a key part of the electronic financial knowledge ecosystem that big banks must (eventually) be part of.

In the end, what matters is their approach to data: how they capture it (i.e., ‘UX’), what they do with it, how they manage it, and how it is leveraged for competitive and commercial advantage (without falling foul of privacy laws etc). Much of the rest is noise from businesses trying to get attention in an increasingly crowded space.

Historically, many ‘enterprise architecture’ or strategy departments fail to have impact because firms do not treat data (or information, or knowledge) as an asset, but rather as something to be freely and easily created and shunted around, leaving a trail of complexity and lost opportunity cost wherever it goes. So this attitude must change before ‘enterprise architecture’ as a concept will have a role in boardroom discussions, and firms change how they position IT in their business strategy. (Regulators are certainly driving this for certain sectors like finance and health.)

Internet of Things

Why does the Internet Of Things (IoT) matter, and where does IoT fit into all this?

At one level, IoT presents a large opportunity for firms which see the potential implied by the technologies underpinning IoT; the technology can provide a significant level of convenience and safety to many aspects of a modern, digitally enabled life.

But fundamentally, IoT is about having a large number of autonomous actors collaborating in some way to deliver a particular service, which is of value to some set of interested stakeholders.

But this sounds a lot like what a ‘company’ is. So IoT is, in effect, a company where the actors are technology actors rather than human actors. They need some level of orchestration. They need a common language for communication. And they need laws/protocols that govern what’s permitted and what is not.

If enterprise architecture is all about establishing the functional, data and protocol boundaries between discrete capabilities within an organization, then EA for IoT is the same thing but for technical components, such as sensors or driverless cars, etc.

So IoT seems a much more natural fit for EA thinking than traditional organizations, especially as, unlike departments in traditional ‘human’ companies, technical components like standards: they like fixed protocols, fixed functional boundaries and well-defined data sets. And while the ‘things’ themselves may not be organic, their behavior in such an environment could exhibit ‘organic’ characteristics.

So, IoT and the benefits of an enterprise architecture-oriented approach to business strategy do seem like a match made in heaven.

The Converged Enterprise

For information-based industries in particular, there appears to be an inevitable convergence: as IoT and the standards, protocols and governance underpinning it mature, so too will the ‘modular’ aspects of existing firms operating models, and the eco-system of technology-enabled platforms will mature along with it. Firms will be challenged to deliver value by picking the most capable components in the eco-system around which to deliver unique service propositions – and the most successful of those solutions will themselves become the basis for future eco-systems (a Darwinian view of software evolution, if you will).

The converged enterprise will consist of a combination of human and technical capabilities collaborating in well-defined ways. Some capabilities will be highly human, others highly technical, some will be in-house, some will be part of a wider platform eco-system.

In such an organization, enterprise architects will find a natural home. In the meantime, enterprise architects must choose their starting point, behavioral or structural: focusing first on decomposed business capabilities and finishing with IoT (behavioral->structural), or focusing first on IoT and finishing with business capabilities (structural->behavioral).

Technical Footnote

I am somewhat intrigued at how the OSGi Alliance has over the years shifted its focus from basic Java applications, to discrete embedded systems, to enterprise systems, and now to IoT. OSGi (disappointingly, IMO) has had a patchy record in changing how firms build enterprise software – much of this is due to a culture of undisciplined dependency management in the software industry, which is very, very hard to break.

IoT raises the bar on dependency management: you simply cannot comprehensively test software updates against the potentially hundreds of thousands or millions of components running that software. The ability to reliably change modules without forcing a test of all dependent instantiated components is a necessity. As enterprises get more complex and digitally interdependent, standards such as OSGi will become more critical to the plumbing of enabling technologies. But as evidenced above, for folks who have tried and failed to change how enterprises treat their technology-enabled business strategy, it’s a case of FIETIOT – Failed In Enterprise, Trying IoT. And this, indeed, seems a far more rational use of an enterprise architect’s time, as things currently stand.

Transforming IT: From a solution-driven model to a capability-driven model

[tl;dr Moving from a solution-oriented to a capability-oriented model for software development is necessary to enable enterprises to achieve agility, but has substantial impacts on how enterprises organise themselves to support this transition.]

Most organisations which manage software change as part of their overall change portfolio take a project-oriented approach to delivery: the project goals are set up front, and a solution architecture and delivery plan are created in order to achieve the project goals.

Most organisations also fix project portfolios on a yearly basis, and deviating from this plan can often be very difficult for organisations to cope with – at least partly because such plans are intrinsically tied into financial planning and cost-saving techniques, such as capitalisation of expenses, which reduce the bottom-line cost of the investment to the firm (even if this says nothing about the value added).

As the portfolio of change projects rises every year, due to many extraneous factors (business opportunities, revenue protection, regulatory demand, maintenance, exploration, digital initiatives, etc), cross-project dependency management becomes increasingly difficult. It becomes even more complex to manage solution architecture dependencies within that overall dependency framework.

What results is a massive set of compromises that ends up building solutions that are sub-optimal for pretty much every project, and an investment in technology that is so enterprise-specific that no other organisation could possibly derive any significant value from it.

While it is possible that even that sub-optimal technology can yield significant value to the organisation as a whole, this benefit may be short lived, as the cost-effective ability to change the architecture must inevitably decrease over time, reducing agility and therefore the ability to compete.

So a balance needs to be struck, between delivering enterprise value (even at the expense of individual projects) while maintaining relative technical and business agility. By relative I mean relative to peers in the same competitive sector…sectors which are themselves being disrupted by innovative technology firms which are very specialist and agile within their domain.

The concept of ‘capabilities’ realised through technology ‘products’, in addition to the traditional project/program management approach, is key to this. In particular, it recognises the following key trends:

  • Infrastructure- and platform-as-a-service
  • Increasingly tech-savvy work-force
  • Increasing controls on IT by regulators, auditors, etc
  • Closer integration of business functions led by ‘digital’ initiatives
  • The replacement of the desktop by mobile & IoT (Internet of Things)
  • The tension between innovation and standards in large organisations

Enterprises are adapting to all the above by recognising that the IT function cannot be responsible for both technical delivery and ensuring that all technology-dependent initiatives realise the value they were intended to realise.

As a result, many aspects of IT project and programme management are no longer driven out of the ‘core’ IT function, but by domain-specific change management functions. IT itself must consolidate its activities to focus on those activities that can only be performed by highly qualified and expert technologists.

The inevitable consequence of this transformation is that IT becomes more product driven, where a given product may support many projects. As such, IT needs to be clear on how to govern change for that product, to lead it in a direction that is most appropriate for the enterprise as a whole, and not just for any particular project or business line.

A product must provide capabilities to the stakeholders or users of that product. In the past, those capabilities were entirely decided by whatever IT built and delivered: if IT delivered something that in practice wasn’t entirely fit for purpose, then business functions had no alternative but to find ways to work around the system deficiencies – usually creating more complexity (through end-user-developed applications in tools like Excel etc) and more expense (through having to hire more people).

By taking a capability-based approach to product development, however, IT can give business functions more options and ways to work around inevitable IT shortfalls without compromising controls or data integrity – e.g., through controlled APIs and services, etc.

So, while solutions may explode in number and complexity, the number of products can be controlled – with individual businesses being more directly accountable for the complexity they create, rather than ‘IT’.

This approach requires a step-change in how traditional IT organisations manage change. Techniques from enterprise architecture, scaled agile, and DevOps are all key enablers for this new model of structuring the IT organisation.

In particular, except for product-strategy (where IT must be the leader), IT must get out of the business of deciding the relative value/importance of individual product changes requested by projects, which historically IT has been required to do. By imposing a governance structure to control the ‘epics’ and ‘stories’ that drive product evolution, projects and stakeholders have some transparency into when the work they need will be done, and demand can be balanced fairly across stakeholders in accordance with their ability to pay.

If changes implemented by IT do not end up delivering value, it should not be because IT delivered the wrong thing, but rather the right thing was delivered for the wrong reason. As long as IT maintains its product roadmap and vision, such mis-steps can be tolerated. But they cannot be tolerated if every change weakens the ability of the product platform to change.

Firms which successfully balance between the project and product view of their technology landscape will find that productivity increases, complexity is reduced and agility increases massively. This model also lends itself nicely to bounded domain development, microservices, use of container technologies and automated build/deployment – all of which will likely feature strongly in the enterprise technology platform of the future.

The changes required to support this are significant – in terms of financial governance, delivery oversight, team collaboration, and the roles of senior managers and leaders. But organisations must be prepared to make this transition, as historical approaches to enterprise IT software development are clearly unsustainable.


Scaled Agile needs Slack

[tl;dr In order to effectively scale agile, organisations need to ensure that a portion of team capacity is explicitly set aside for enterprise priorities. A re-imagined enterprise architecture capability is a key factor in enabling scaled agile success.]

What’s the problem?

From an architectural perspective, Agile methodologies are heavily dependent on business- (or function-) aligned product owners, which tend to be very focused on *their* priorities – and not the enterprise’s priorities (i.e., other functions or businesses that may benefit from the work the team is doing).

This results in very inward-focused development, and where dependencies on other parts of the organisation are identified, these (in the absence of formal architecture governance) tend to be minimised where possible, if necessary through duplicative development. And other teams requiring access to the team’s resources (e.g., databases, applications, etc) are served on a best-effort basis – often causing those teams to seek other solutions instead, or work without formal support from the team they depend on.

This, of course, leads to architectural complexity, leading to reduced agility all round.

The Solution?

If we accept the premise that, from an architectural perspective, teams are the main consideration (it is where domain and technical knowledge resides), then the question is how to get the right demand to the right teams in as scalable and agile a manner as possible.

In agile teams, the product backlog determines the work schedule. The backlog usually has a long list of items awaiting prioritisation, and part of the Agile process is to constantly prioritise this backlog to ensure high-value tasks are done first.

Well-known management research such as The Mythical Man-Month has influenced Agile’s goal of keeping team sizes small (e.g., 5-9 people for scrum). So when new work comes in, adding people is generally not a scalable option.

So, how to reconcile the enterprise’s needs with the Agile team’s needs?

One approach would be to ensure that every team pays an ‘enterprise’ tax – i.e., in prioritising backlog items, at least, say, 20% of work-in-progress items must be for the benefit of other teams. (Needless to say, such work should be done in such a way as to preserve product architectural integrity.)

20% may seem like a lot – especially when there is so much work to be done for immediate priorities – but it cuts both ways. If *every* team allows 20% of their backlog to be for other teams, then every team has the possibility of using capacity from other teams – in effect, increasing their capacity by much more than they could do on their own. And by doing so they are helping achieve enterprise goals, reducing overall complexity and maximising reuse – resulting in a reduction in project schedule over-runs, higher quality resulting architecture, and overall reduced cost of change.

Slack does not mean Under-utilisation

The concept of ‘slack’ is well described in the book ‘Slack: Getting Past Burn-out, Busywork, and the Myth of Total Efficiency‘. In effect, in an Agile sense, we are talking about organisational slack, not team slack. Teams, in fact, will continue to be 100% utilised, as long as their backlog consists of more high-value items than they can deliver. The backlog owner – e.g., scrum master – can obviously embed local team slack into how a particular team’s backlog is managed.

Implications for Project & Financial Management

Project managers are used to getting funding to deliver a project, and then to be able to bring all those resources to bear to deliver that project. The problem is, that is neither agile, nor does it scale – in an enterprise environment, it leads to increasingly complex architectures, resulting in projects getting increasingly more expensive, increasingly late, or delivering increasingly poor quality.

It is difficult for a project manager to accept that 20% of their budget may actually be supporting other (as yet unknown) projects. So perhaps the solution here is to have Enterprise Architecture account for the effective allocation of that spending in an agile way? (i.e., based on how teams are prioritising and delivering those enterprise items on their backlog). An idea worth considering..

Note that the situation is a little different for planned cross-business initiatives, where product owners must actively prioritise the needs of those initiatives alongside their local needs. Such planned work does not count in the 20% enterprise allowance, but rather counts as part of how the team’s cost to the enterprise is formally funded. It may result in a temporary increase in resources on the team, but in this case discipline around ‘staff liquidity’ is required to ensure the team can still function optimally after the temporary resource boost has gone.

The challenge regarding project-oriented financial planning is that, once a project’s goals have been achieved, what’s left is the team and underlying architecture – both of which need to be managed over time. So some dissociation between transitory project goals and longer term team and architecture goals is necessary to manage complexity.

For smaller, non-strategic projects – i.e., no incoming enterprise dependencies – the technology can be maintained on a lights-on basis.

Enterprise architecture can be seen as a means to assess the relevance of a team’s work to the enterprise – i.e., managing both incoming and outgoing team dependencies. The higher the enterprise relevance of the team, the more important it is that the team is managed well over time – i.e., team structure changes must be carefully managed, and not left entirely to the discretion of individual managers.

Conclusion

By ensuring that every project that purports to be Agile has a mandatory allowance for enterprise resource requirements, teams can have confidence that there is a route for them to get their dependencies addressed through other agile teams, in a manner that is independent of annual budget planning processes or short-term individual business priorities.

The effectiveness of this approach can be governed and evaluated by Enterprise Architecture, which would then allow enterprise complexity goals to be addressed without concentrating such spending within the central EA function.

In summary, to effectively scale agile, an effective (and possibly rethought) enterprise architecture capability is needed.


Making good architectural moves

[tl;dr Every change is an opportunity to make the ‘right’ architectural move to improve complexity management and to maintain an acceptable overall cost of change.]

Accompanying every new project, business requirement or product feature is an implicit or explicit ‘architectural move’ – i.e., a change to your overall architecture that moves it from a starting state to another (possibly interim) state.

The art of good architecture is making the ‘right’ architectural moves over time. The art of enterprise architecture is being able to effectively identify and communicate what a ‘right’ move actually looks like from an enterprise perspective, rather than leaving such decisions solely to the particular implementation team – who, it must be assumed, are best qualified to identify the right move from the perspective of the relevant domain.

The challenge is the limited inputs that enterprise architects have, namely:

  • Accumulated skill/knowledge/experience from past projects, including any architectural artefacts
  • A view of the current enterprise priorities based on the portfolio of projects underway
  • A corporate strategy and (ideally) individual business strategies, including a view of the environment the enterprise operates in (e.g., regulatory, commercial, technological, etc)

From these inputs, architects need to guide the overall architecture of the enterprise to ensure every project or deliverable results in a ‘good’ move – or at least not a ‘bad’ move.

In this situation, it is difficult if not impossible to measure the actual success of an architecture capability. This is because, in many respects, the beneficiaries of a ‘good’ enterprise architecture (at least initially) are the next deliverables (projects/requirements/features), and only rarely the current deliverables.

Since the next projects to be done are generally unknown (i.e., the business situation may change between the time the current projects complete and the time the next projects start), it is rational for people to focus exclusively on delivering the current projects. This makes it even more important that the current projects deliver the ‘right’ architectural moves.

In many organisations, the typical engagement with enterprise architecture happens late in the architectural development process – i.e., at a ‘toll-gate’ or formal architectural review – and the focus is often on ‘compliance’ with enterprise standards, principles and guidelines. Given that such guidelines can get quite detailed, it can be difficult for anyone building a project start architecture (PSA) to come up with an architecture that will fully comply: the first priority is to develop an architecture that will work, and that is feasible to execute within the project constraints of time, budget and resources.

Only then does it make sense to formally apply architectural constraints from enterprise architecture – at least some of which may negatively impact the time, cost, resource or feasibility needs of the project – albeit to the presumed benefit of the overall landscape. Hence the need for board-level sponsorship for such reviews, as otherwise the project’s needs will almost always trump enterprise needs.

The approach espoused by an interesting new book, Chess and the Art of Enterprise Architecture, is that enterprise architects need to focus more on design and less on principles, guidelines, roadmaps, etc. Such an approach involves enterprise architects more closely in the creation and evolution of project (start) architectures, which represents the architectural basis for all the work the project does (although it does not necessarily lay out the detailed solution architecture).

This approach is also acceptable for planning processes which are more agile than waterfall. In particular, it acknowledges that not every architectural ‘move’ is necessarily independently ‘usable’ by end users of an agile process. In fact, some user stories may require several architectural moves to fully implement. The question is whether the user story is itself validated enough to merit doing the architectural moves necessary to enable it, as otherwise those architectural moves may be premature.

The alternative, then, is to ‘prototype’ the user story, so users can evaluate it – but at the cost of non-conformance with the project architecture. This is also known as ‘technical debt’, and where teams are mature and disciplined enough to pay down technical debt when needed, it is a good approach. But users (and sometimes product owners) struggle to tell the difference between an (apparently working) prototype and a production solution that is fully architecturally compliant, and it often happens that project teams move on to the next visible deliverable without removing the technical debt.

In applications where the end-user is a person or set of persons, this may be acceptable in the short term, but where the end-user could be another application (interacting via, for example, an API invoked by either a GUI or an automated process), then such technical debt will likely cause serious problems if not addressed. At the very least, it will make future changes harder (unmanaged dependencies, lack of automated testing), and may present significant scalability challenges.

So, what exactly constitutes a ‘good’ architectural move? In general, this is what the project start architecture should aim to capture. A good basic principle could be that architectural commitments should be postponed for as long as possible, by taking steps to minimise the impact of changed architectural decisions (this is a ‘real-option‘ approach to architectural change management). Over the long term, this reduces the cost of change.

In addition, project start architectures may need to make explicit where architectural commitments must be made (e.g., for a specific database, PaaS or integration solution, etc) – i.e., areas where change will be expensive.

Other things the project start architecture may wish to capture or at least address (as part of enterprise design) could include:

  • Cataloging data semantics and usage
    • to support data governance and big data initiatives
  • Management of business context & scope (business area, product, entity, processes, etc)
    • to minimize unnecessary redundancy and process duplication
  • Controlled exposure of data and behaviour to other domains
    • to better manage dependencies and relationships with other domains
  • Compliance with enterprise policies around security and data management
    • to address operational risk
  • Automated build, test & deploy processes
    • to ensure continued agility and responsiveness to change, and maximise operational stability
  • Minimal lock-in to specific solution architectures (minimise solution architecture impact on other domains)
    • to minimize vendor lock-in and maximize solution options

The Chess book mentioned above includes a good description of a PSA, in the context of a PRINCE2 project framework. Note that the approach also works for Agile, but the PSA should set the boundaries within which the agile team can operate: if those boundaries must be crossed to deliver a user story, then enterprise design architects should be brought back into the discussion to establish the best way forward.

In summary, every change is an opportunity to make the ‘right’ architectural move to improve complexity management and to maintain an acceptable overall cost of change.
