Bending The Serverless Spoon

“Do not try and bend the spoon. That’s impossible. Instead, only realize the truth… THERE IS NO SPOON. Then you will see that it is not the spoon that bends, it is yourself.” — The Matrix

[tl;dr To change the world around them, organizations should change themselves by adopting serverless + agile as a target. IT organizations should embrace serverless to optimize and automate IT workflows and processes before introducing it for critical business applications.]

“Serverless” is the latest shiny new thing to arrive on the architectural scene. An excellent (opinionated) analysis of what ‘serverless’ means has been written by Jeremy Daly, a serverless evangelist; his basic conclusion is that ‘serverless’ is ultimately a methodology/culture/mindset.

If we accept that as a reasonable definition, how does this influence how we think about solution design and engineering, given that generations of computer engineers have grown up with servers front-and-center of design thinking?

In other words, how do we bend our way of thinking about a problem space to serverless-first, and use that understanding to help make better architectural decisions – especially with respect to virtual machines, containers, orchestration, and distributed systems in general?

Worked Example

To provide some insight into the practicalities of building and running a serverless application, I followed a worked example, “Building a Serverless App Using Athena and AWS Lambda” by Epsagon, a serverless monitoring specialist. The tutorial uses the open-source Serverless framework to simplify the creation of serverless infrastructure on a given cloud provider – in this case, AWS.

Note to those attempting to follow this exercise: not all the required code was provided in the version I used, so the tutorial does require some (JavaScript) coding skills to fill the gaps. The code that worked for me (with copious logging…) can be found here.

This worked example focuses on two reference-data oriented architectural patterns:

  • The transactional creation via a RESTful API of a uniquely identifiable ‘product’ with an ad-hoc set of attributes, including but not limited to ‘ProductId’, ‘Name’ and ‘Color’.
  • The ability to query all ‘products’ which share specific attributes – in this case, a shared name.

In addition, the ability to create/initialize shared state (in the form of a virtual database table) is also handled.
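
To make the shape of the worked example concrete, below is a minimal sketch of what the ‘create product’ Lambda handler might look like: it accepts an arbitrary set of attributes and writes a JSON object to an S3 bucket that backs the Athena table. The bucket variable, key layout and handler name are illustrative assumptions, not the exact code from the Epsagon tutorial.

```javascript
// handler.js – a sketch of the 'create product' function, not the tutorial's exact code.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

module.exports.createProduct = async (event) => {
  const product = JSON.parse(event.body);                  // ad-hoc attribute set allowed
  const productId = product.ProductId || `${Date.now()}`;  // naive unique id, for the sketch only

  // One JSON object per product; Athena queries the bucket as an external table.
  await s3.putObject({
    Bucket: process.env.PRODUCTS_BUCKET,                   // assumed environment variable
    Key: `products/${productId}.json`,
    Body: JSON.stringify({ ProductId: productId, ...product }),
    ContentType: 'application/json'
  }).promise();

  return { statusCode: 201, body: JSON.stringify({ ProductId: productId }) };
};
```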

Problem-domain Non-Functional Characteristics

Conceptually, the architecture has the following elements:

  • Public, anonymous RESTful APIs for product creation and query
    • APIs could be defined in OpenAPI 3.0, but by default are created implicitly by the Serverless framework configuration
  • Durable storage of product data information
    • Variable storage cost structure based on access frequency can be added through configuration
    • Long-term archiving/backup obligations can be met without using any other services.
  • Very low data management overhead
  • Highly resilient and available infrastructure
    • Additional multi-regional resilience can be added via Application Load Balancer and deploying Lambda functions to multiple regions
    • S3 and Athena are globally resilient
  • Scalable architecture
    • No fixed constraint on number of records that can be stored
    • No fixed constraint on number of concurrent users using the APIs (configurable)
    • No fixed constraint on the number of concurrent users querying Athena (configurable)
  • No servers to maintain (no networks, servers, operating systems, software, etc)
  • Costs based on utilization
    • If nobody is updating or querying the database, then no infrastructure is being used and no charges (beyond storage) are incurred
  • Secure through AWS IAM permissioning and S3 encryption.
    • Many more security authentication, authorization and encryption options available via API Gateway, AWS Lambda, AWS S3, and AWS Athena.
  • Comprehensive log monitoring via CloudWatch, with ability to add alerts, etc.

For a couple of days’ coding, that’s a lot of non-functional goodness, and overall the development experience was pretty good (albeit not CI/CD optimized – I used Microsoft’s VS Code IDE on a MacBook, and the locally installed Serverless framework to deploy). Of course, I needed to be online and connected to AWS, but this seemed like a minor issue (for this small app). I did not attempt to deploy any serverless mock services locally.
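
For context, the sketch below shows how such a stack can be declared for the Serverless framework; I have expressed it as a serverless.js module (the framework also accepts this JavaScript form as an alternative to serverless.yml). Service, bucket and handler names are illustrative, not the tutorial’s exact configuration.

```javascript
// serverless.js – an illustrative declaration of the API + Lambda + S3/Athena stack.
module.exports = {
  service: 'products-api',
  provider: {
    name: 'aws',
    runtime: 'nodejs10.x',
    region: 'eu-west-1',
    environment: { PRODUCTS_BUCKET: 'my-products-bucket' },
    // IAM permissions granted to the functions (S3 writes, Athena queries)
    iamRoleStatements: [
      { Effect: 'Allow', Action: ['s3:PutObject', 's3:GetObject'], Resource: 'arn:aws:s3:::my-products-bucket/*' },
      { Effect: 'Allow', Action: ['athena:StartQueryExecution', 'athena:GetQueryExecution', 'athena:GetQueryResults'], Resource: '*' }
    ]
  },
  functions: {
    createProduct: {
      handler: 'handler.createProduct',
      events: [{ http: { path: 'products', method: 'post' } }] // API Gateway endpoint created by the framework
    },
    queryProducts: {
      handler: 'handler.queryProducts',
      events: [{ http: { path: 'products/{name}', method: 'get' } }]
    }
  }
};
```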

So, even for a slightly contrived use case like the above, there are clear benefits to using serverless.

Why bend the spoon?

There are a number of factors that typically need to be taken into consideration when designing solutions, and these tend to drive architectures away from ‘serverless’ towards ‘serverful’. Typically, they revolve around resource management (i.e., network, compute, storage) and state management (i.e., transactional state changes).

The fundamental issue that application architects need to deal with in any solution architecture is the ‘impedance mismatch’ between general purpose storage services, and applications. Applications and application developers fundamentally want to treat all their data objects as if they are always available, in-memory (i.e., fast to access) and globally consistent, forcing engineers to optimize infrastructure to meet that need. This generally precludes using general-purpose or managed services, and results in infrastructure being tightly coupled with specific application architectures.

The simple fact is that a traditional well-written, modular 3-tier (GUI, business logic, data store) monolithic architecture will always outperform a distributed system – for the set of users and use-cases it is designed for. But these architectures are (arguably) increasingly rare in enterprises for a number of reasons, including:

  • Business processes are increasing in complexity (aka features), consisting of multiple independently evolving enterprise functions that must also be highly digitally cohesive with each other.
  • More and more business functions are being provided by third-parties that need close (digital) integration with enterprise processes and systems, but are otherwise managed independently.
  • There are many, disparate consumers of (digital) process data outputs – in some cases enabling entirely new business lines or customer services.
  • (Digital) GUI users extend well outside the corporate network, to mobile devices as well as home networks, third-party provider networks, etc.

All of the above conspire to drive even the most well-architected monolithic application towards a ‘ball-of-mud’ architecture.

Underpinning all of this is the real motivation behind modern (cloud-native) infrastructure: in a digital age, infrastructure needs to be capable of being ‘internet scale’ – supporting all 4.3+ billion internet users, and growing.

Such scale demands serverless thinking. However, businesses that do not aspire to internet-scale usage still have key concerns:

  • Ability to cope with sudden demand spikes in b2c services (e.g., due to marketing campaigns, etc), and increased or highly variable utilisation of b2b services (e.g., due to b2b customers going digital themselves)
  • Ability to provide secure and robust services to customers when they need them, resilient to a range of risks
  • Ability to continuously innovate on products and services to retain customers and remain competitive
  • Comply with all regulatory obligations without impeding ability to change, including data privacy and protection
  • Ability to reorganize how internal capabilities are provisioned and provided with minimal impact to any of the above.

Without serverless thinking, meeting all of these sometimes conflicting needs becomes very complex, and will consume ever more enterprise IT engineering capacity.

Note: for firms to really understand where serverless should fit in their overall investment strategy, Wardley Maps are a very useful strategic planning tool.

Bending the Spoon

Bending the spoon means rethinking how we architect systems. It fundamentally means closing the gap between models and implementation, and recognizing that where an architecture is deficient, the instinctive reaction to fix or change what you control needs to be overcome: i.e., drive the change to the team (or service provider) where the issue properly belongs. This requires out-of-the-box thinking – and perhaps is a decision that should not be taken by individual teams on their own unless they really understand their service boundaries.

This approach may require teams to scale back new features, or modify roadmaps, to accommodate what can currently be appropriately delivered by the team, and accepting what cannot.

Most firms fail at this – because typically senior management focus on the top-line output and not on the coherence of the value-chain enabling it. But this is what ‘being digital’ is all about.

Everyone wants to be serverless

The reality is that the goal of all infrastructure teams is to stop developers having to worry about infrastructure. So while technologies like Docker initially aimed to democratize deployment, infrastructure engineering teams are working to ensure developers never need to know how to build or manage a Docker image, configure a virtual machine, manage a network or storage device, etc. This even extends to hiding the specifics of IaaS services exposed by cloud providers.

Organizations that are evaluating Kubernetes, OpenFaaS, or Knative, or which use services such as AWS Fargate, AWS ECS, Azure Container Service, etc, ultimately want to minimize the knowledge developers need to have of the infrastructure they are working on.

Unfortunately for infrastructure teams, most developers still develop applications using the ‘serverful’ model – i.e., they want to know what containers are running where, how they are configured, how they interact, how they are discovered, etc. Developers also want to run containers on their own laptop whenever they can, and deploy applications to authorized environments whenever they need to.

Developers also build applications which require complex configuration, often hand-constructed between and across environments as performance or behavioural issues are identified and ‘patched’ (i.e., worked around instead of directing the problem to the ‘right’ team/codebase).

At the same time, developers do not want anything to do with servers…containers are as close as they want to get to infrastructure, but containers are just an abstraction of servers – they are most definitely not ‘serverless’.

To be Serverless, Be Agile

Serverless solutions are still in the early stages of maturity. For many problems that require a low-cost, resilient and always-available solution, but are not particularly performance sensitive (i.e., are naturally asynchronous and eventually consistent), serverless solutions are ideal.

In particular, IT processes (the proverbial shoes for the cobbler’s children) would benefit significantly from extensive use of serverless, as the management overhead of serverless solutions will be significantly less than that of other solutions. Integrating bespoke serverless solutions with workflows managed by tools like ServiceNow could be a significant game changer for existing IT organizations.

However, mainstream use of serverless technologies and solutions for business critical enterprise applications is still some way away – but if IT departments develop skills in it, it won’t be long before it finds its way into critical business solutions.

For broader use of serverless, firms need to be truly agile. Work for teams needs to come as much from other dependent teams as from top-down sources. Teams themselves need to be smaller (and ‘senior’ staff need to rethink their roles), and also be prepared to split or plateau. And feature roadmaps need to be driven as much by capabilities as by imagined needs.

Conclusion

Organizations already know they need to be ‘agile’. To truly change the world (bend the spoon), serverless and agile together will enable firms to change themselves, and so shape the world around them.

Unfortunately, for many organizations it is still easier to try to bend the spoon. For those who understand they need to change, adopting the ‘serverless’ mindset is key to success, even if – at least initially – true serverless solutions remain a challenge to realize in organizations dealing with legacy (serverful) architectures.


The changing role of data lakes

[tl;dr A single data lake, data warehouse or data pipeline to “rule them all” is less useful in hybrid cloud environments, where it can be feasible to query ‘serverless’ cloud-native data sources directly rather than rely on traditional orchestrated batch extracts. Pipeline complexity can be reduced by open extensions to SQL such as the recently announced AWS PartiQL language. Opportunities exist to integrate enterprise human-oriented data governance and meta-data platforms with data pipelines using serverless technologies.]

The need for Data Lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. The data lake concept was created to address a number of issues with traditional data analytics and reporting solutions, specifically:

  • the growing number of applications across an enterprise depending on a given dataset;
  • business and regulatory drivers for governing dataset discovery, quality, creation and/or consumption;
  • the increasing difficulty IT teams have in responding in a timely manner to growing business demand for access to high-quality datasets.

The data lake allows data to be made available from its source without making any assumptions about its use. This is particularly critical when the data originates from batch extracts of load-sensitive OLTP databases, most of which are still operating on-premise. Streaming data pipelines, while growing in popularity, are not as common as batch-driven pipelines – although this should change over time as digital platform architectures become more event-driven in nature.

Data lakes are a key component in data pipelines, a construct (or set of constructs) that provides consolidation of data from multiple sources and makes it available for use. A data pipeline can be orchestrated (via a scheduler) or choreographed (responding to events) – the more jobs a pipeline has to do, the more complex the orchestration or choreography, which has implications for supportability. So reducing the number of jobs a pipeline has to support is key to managing data pipeline complexity.
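
As an illustration of the ‘choreographed’ style, the sketch below shows a single pipeline job implemented as an S3-triggered Lambda function: it responds to a new object landing in a raw zone and writes a curated copy, with no scheduler involved. The bucket layout and the transformation are illustrative assumptions.

```javascript
// A sketch of one 'choreographed' pipeline job: triggered by S3 event notifications.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  for (const record of event.Records) { // S3 event notifications arrive in batches
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    const raw = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    const cleaned = raw.Body.toString('utf8').trim(); // placeholder transformation

    await s3.putObject({
      Bucket: bucket,
      Key: key.replace('raw/', 'curated/'),
      Body: cleaned
    }).promise();
  }
};
```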

The Components of a Data Lake

A data lake consists of a few key components:

  • A storage repository – durable, resilient storage of data objects. (Virtual: No; Traditional: Yes)
  • An ingestion mechanism – a means to upload content to the repository, with no transformation. (Virtual: No; Traditional: Yes)
  • A tagging & metadata mechanism – a means to associate metadata with data objects, including user-defined tags. (Virtual: Yes; Traditional: Yes)
  • A metadata search mechanism – a means to search objects in the data lake based on metadata and tags, not content. (Virtual: Yes; Traditional: Yes)
  • A query engine – a means to search the content of objects in the data lake. (Virtual: Yes; Traditional: Partially)
  • An access control mechanism – a means to ensure that users can only access the datasets (and parts of datasets) they are entitled to see, and to audit all activity. (Virtual: Yes; Traditional: Yes)

In effect, data lakes have become a kind of data warehouse – the most significant difference being that input sources into data lakes tend to be familiar files (CSV, Avro, JSON, etc.) from multiple sources, rather than highly optimized domain-specific schemas – i.e., no assumptions are made about how (or why) the data in the data lake will be consumed. Data lakes also do not concern themselves with scheduling or orchestration.

Datawarehouses, datawarehouses everywhere…

For mature data use cases (i.e., situations where relatively stable, well-known data requirements exist), and where consistent high performance is material to meeting customer needs, data warehouses are still the best solution. A data warehouse stores and manages all of its data locally, and only relies on the data lake as an initial ingestion point.

A data warehouse will transform datasets to the form needed for the specific use cases it supports, and will optimize performance for the consumption of those datasets. Modern data warehouses will use ML/AI techniques to optimize performance rather than relying on human database specialists. But, as this approach is compute intensive, such solutions are more amenable to cloud environments than on-premise environments. Snowflake is an example of this model. As more traditional data warehouses (e.g., Oracle Exadata) move to the cloud, we can expect these to also get ‘smarter’ – however, data gravity will mean such solutions will need to be fundamentally multi-cloud compatible.

For on-premise data warehouses, the tendency is for business lines or functions to create ‘one data warehouse to rule them all’ – mainly because of the traditionally significant storage and compute infrastructure and resources necessary to support data warehouses. Consequently considerable effort is spent on defining and maintaining high performance, appropriately normalized, enterprise data models that can be used in as many enterprise use cases as possible.

In a hybrid/cloud world, multiple data warehouses become more feasible – and, in fact, will be inevitable in larger organizations. As more enterprise data becomes available in these dynamically scalable, cloud-based (or HDFS/Hadoop-based) data warehouses (such as AWS EMR, AWS Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse), ‘virtual data warehouses’ can avoid the need to move data from its source for query handling, allowing data storage and egress costs to be kept to a minimum, especially if assisted by machine-learning techniques.

Virtual Data Warehouses

Virtual Data warehouse technologies have been around for a while, allowing users to manage and query multiple data sources through a common logical access point. For on-premise solutions, virtual data warehouses have limited use cases, as the cost/effort of scaling out in-house solutions can be prohibitive and not particularly agile in nature, precluding experimental use cases.

On hybrid or cloud environments, virtual data warehouses can leverage the scalability of cloud-native data warehouses, driving queries to the relevant engine for execution, and then leveraging its own scalable infrastructure for executing join queries.

Technologies like Dremio reflect the state of the art in cloud-based data warehouses, which push down queries to the source system where possible, but can process them in-memory directly from a data lake or other source if not.

However, there is one thing that all data warehouses have in common: they leverage SQL and (implicitly) a relational view of the data. Standard ANSI SQL queries are generally supported by all data warehouses, but this may mean that some data cannot be queried if it is not in a tabular form amenable to SQL processing.

Extending SQL with PartiQL

Enter PartiQL, an open-source project sponsored by Amazon to drive extensions to standard SQL that can cope with non-relational data types, including structured, unstructured, nested, and schemaless (NoSQL, Document).

Historically, all data ingested into a data lake had to be transformed into a format that could be queried by SQL-like commands or processed by typical data warehouse bulk-upload tools. This adds complexity to data pipelines (i.e., more jobs), and may also force premature schema design (i.e., forcing the design of an optimal schema before all critical use cases are fully understood).

PartiQL potentially allows tools such as Snowflake and Dremio (as well as the tools AWS uses internally) to query data using SQL-like syntax, while also including non-relational data in those queries, so they can avoid those separate transformation steps, aiding pipeline complexity reduction.
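
As a flavour of what this looks like, below is an illustrative PartiQL-style query over a nested ‘orders’ document, modelled on the examples published with the language; treat the syntax as a sketch rather than a statement verified against any particular engine.

```javascript
// An illustrative PartiQL-style query over nested data, shown as a string constant.
// The document shape ('orders' with an 'items' array) is a hypothetical example.
const query = `
  SELECT o.orderId, i.sku, i.qty
  FROM orders AS o, o.items AS i   -- correlate/unnest the 'items' array of each order
  WHERE i.qty > 10
`;
```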

PartiQL claims to be fully ANSI-compliant, but extended in specific ways to support alternate data formats. While not an official ISO/ANSI standard, it may have the ability to become a de-facto standard – especially as the language has already been used in anger with success within AWS. This will provide a skill path for relational data warehouse experts to become proficient in leveraging modern data pipelines without committing to one specific vendor’s technology.

Technologies like PartiQL will make it much easier to include event-sourced streams into a data pipeline, as events are defined as nested or other non-relational structures. As more data pipelines become event driven rather than batch-driven, having a standard like PartiQL will be key. (It will be interesting to see if Confluent’s KSQL and PartiQL will converge to a single event-stream query standard.)

As PartiQL has only just been released, it’s too soon to tell how the big data ecosystem or ISO/ANSI will respond. Expect more on this topic in the future. For now, virtual data warehouses must rely on their proprietary SQL extensions.

Non-SQL Data Processing

Considerable investment is being made by third-party vendors in advanced technology focused on making distributed, scalable processing of SQL (or SQL-like) queries fast and reliable with little or no human tuning required. As such, it is wise to pick a vendor demonstrating a clear strategy in this space, and one continuing to invest in SQL as the lingua franca of transformation logic.

However, for use cases for which SQL is not appropriate, distributed computing platforms like Spark are still needed. The expectation here is that such platforms will ingest data from a data lake, and output results into a data lake. In some cases, the distributed computing platform offers its own storage (e.g., HDFS), but increasingly it is more appropriate to question whether data needs to reside permanently in an HDFS cluster rather than in a data lake. For example, Amazon’s EMR service allows Hadoop clusters to be created ephemerally, consuming their initial dataset from AWS S3 repositories or other data sources.

Enforcing Enterprise Data Collaboration and Governance

Note that all data warehouse solutions (virtual or not) must support some form of meta-data tagging and management used by their SQL query engines – otherwise they cannot act as a virtual database source (generally an ODBC end-point that applications can connect to directly). This tagging can be automated if sources include meta-data (e.g., field headers, Avro schema definitions, etc.), but can be enhanced by human tagging, which is increasingly augmented by machine learning to help identify, for example, where data may be sensitive.

But data governance needs extend beyond the needs of the virtual data warehouse query engines, and this is where there are still gaps to be filled in the current enterprise data management tools.

Tools from vendors like Alation, Waterline, Informatica, Collibra etc were created to augment people’s ability to properly tag content in the data-lake with meaningful information to make it discoverable and governable. Consistent tagging in principle allows tag-based governance rules to be defined to automatically enforce data governance policies in data consumers. This data, coupled with schema information which can be derived directly from data-sources, is all the information needed to allow users (or developers) to source the data they need in a secure, compliant way.

But meta-data for data governance has humans as the primary user (e.g., CDOs, business/data analysts, process owners, etc.) – or, as Alation describes it, meta-data for human collaboration.

Currently, there are no accepted standards for ensuring the consistency of ‘meta-data for human collaboration’ with ‘meta-data for query execution’.

Ideally, the human-oriented tools would generate standard events that tools in the data pipeline could pick up and act on (via, for example, something like AWS EventBridge), thereby avoiding the need for data governance personnel to oversee multiple data pipelines directly…

Summary

With the advent of cloud-based managed compute and data storage services, a multi-data warehouse and pipeline strategy is viable and may even be desirable, potentially involving multiple data lakes.

Solutions like PartiQL have the potential to eliminate many transformation job phases and greatly simplify data pipeline complexity in a standardized way, leveraging existing SQL skills rather than requiring new skills.

To ensure consistent governance across multiple data pipelines, a serverless event-based approach to connecting human data governance solutions with cloud-native data pipeline solutions may be the way forward – for example, using AWS EventBridge to action events originating from SaaS-based data governance services with data pipelines.


Why AWS EventBridge changes everything..

“Events, dear boy, events”

Harold Macmillan

[tl;dr AWS EventBridge may encourage SaaS businesses to formally define and manage public event models that other businesses can design into their workflows. In turn, this may enable businesses to achieve agility goals by decomposing their organizations into smaller, event-driven “cells” with workflows empowered by multiple SaaS capabilities.]

Last week, Amazon formally announced the launch of the AWS EventBridge service. What makes this announcement so special?

The biggest single technical benefit is the avoidance of the need for webhooks or polling APIs. (See here for a good explanation of the difference.)

Webhooks are generally not considered a scalable solution for SaaS services, as significant engineering is required to make them robust, and consuming applications need to be designed to handle webhook API calls.

HTTP-based APIs exposed by 3rd party services can be polled by applications that need to know if state has changed, but this polling consumes resources even when nothing changes. Again, this has scalability implications for both the SaaS provider and the application consumer.

In both cases, the principal metaphor connecting the SaaS and the consuming application is the ‘service interface’ abstraction – i.e., executing an operation on a resource. As such, this is a technical solution to a technical problem.

From APIs to Events

While this ‘service-based’ model of distributed programming is extremely powerful, it is not an appropriate abstraction for connecting behaviors across multiple services in a value chain. To align with business-level concepts such as Business Process Modelling, event-driven architectures are becoming more and more popular to model complex workflows both within and between organizations.

This trend is accelerated by the desire of organizations to become more “agile”. Increasingly, organizations are recognizing that this must manifest itself as breaking the organization down into more manageable, semi-autonomous “cells” (see this article from McKinsey as an example). With cells, the event metaphor fits naturally: cells can decide which events they care about, and also decide what events they in turn create that other cells may use.

3rd party service providers (i.e., SaaS companies such as SalesForce, Workday, Office365, ServiceNow, Datadog, etc) empower organizational cells and enable them to achieve far more than a small cell otherwise could. The “cell” concept cannot be fully realized unless every cell has the ability to define and control how it uses these services to achieve its own mission.

In addition, as value/supply chains become more complex, and more (3rd party or internal) providers are embedded in those workflows, the need for a more natural, adaptable way of integrating processes has become evident.

But event-driven architectures require a common ‘bus’ – a target-neutral means to allow zero or more consumers to express an interest in receiving events published on the bus. This has historically been impractical to do at scale between organizations (or even within organizations) without requiring all parties to agree on a neutral 3rd party to manage the bus, and at the additional risk of creating a change bottleneck: hence the historic preference for point-to-point HTTP-based standards.

Services like the AWS EventBridge for the first time allow autonomous SaaS solutions to publish a formal event model that can be consumed programmatically and seamlessly included in local (cell-specific) workflows. In addition, this event model can be neutral to the underlying technology and cloud provider.

How it works and what makes it different

The key feature of EventBridge is the separation of the publisher from the consumer, and the way that the business rules which manage the routing and transformation of events are handled.

Once an organization (or AWS account) has registered as a consumer with the publisher (the owner of the “event source”), a logical “event bus” is created to represent all events for that org/account. The consuming org/account can then set up whatever routing and transformation rules it needs for any internal consumers of those events, without any further dependency on the publishing organization. So consumer organizations/accounts have full control over what is published internally to consuming applications.
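
A minimal sketch of that consumer-side setup, using the AWS SDK for JavaScript: a rule on the partner event bus matches only the events a particular ‘cell’ cares about and routes them to a Lambda function owned by that cell. The partner bus name, event pattern and target ARN are illustrative assumptions.

```javascript
// A sketch of consumer-side routing on a partner event bus.
const AWS = require('aws-sdk');
const events = new AWS.EventBridge();

async function routeTicketEventsToCell() {
  const busName = 'aws.partner/example-saas.com/1234/tickets'; // hypothetical partner event bus

  // Match only the events this 'cell' cares about...
  await events.putRule({
    Name: 'new-ticket-to-support-cell',
    EventBusName: busName,
    EventPattern: JSON.stringify({ 'detail-type': ['TicketCreated'] })
  }).promise();

  // ...and route them to a Lambda function owned by the cell.
  await events.putTargets({
    Rule: 'new-ticket-to-support-cell',
    EventBusName: busName,
    Targets: [{ Id: 'support-workflow', Arn: 'arn:aws:lambda:eu-west-1:111122223333:function:handleTicket' }]
  }).promise();
}
```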

With appropriate guard-rails in place, individual teams (“cells”) can define and configure their own routing rules, and not rely on any centralized team – a key weakness in many legacy ESB solutions.

Note that EventBridge has predefined service limits – it offers reasonably high throughput (400 events/sec), but is a high-latency service (around 0.5 sec). So low-latency use cases such as electronic trading are not, at this point, an appropriate fit for EventBridge.

The use of EventBridge for internal enterprise event handling should be considered carefully: the limit of 100 event buses per account essentially caps the number of publishers that can be handled by any one account at 100. For most use cases, this should be more than enough, but many large organizations may have many more than 100 ‘publishers’ publishing on their ESB. If each publisher can be viewed as a part of an end-to-end business value-stream, then any value-stream with more than 100 components (i.e., unique event models) is likely to be overly complex. In practice, a ‘publisher’ is likely to be an enterprise application: therefore some significant complexity reduction and consolidation (of event models, if not actual code) would be needed for such organizations to use EventBridge internally effectively.

The AWS Way (also, the Cloud Way)

It’s worth noting that key to Amazon’s success is its ability to “eat its own dogfood”. Every service in Amazon and AWS is built atop other services. No service is allowed to get so big and bloated that it cannot be managed effectively. Abstractions are ‘clean’ – rather than add bells and whistles to an existing service, a new service is created which leverages the underlying service or services.

AWS has consistently required every service to have and maintain an API model, which – for asynchronous/autonomous services – leads naturally to an event model. This in turn has made it natural for AWS EventBridge to come out-of-the-box with a number of events already emitted by AWS services that can be leveraged for customer solutions. (For now, many of these events are limited to generic CloudTrail-related events – specifically tracking API calls – but in the future it’s reasonable to expect more service-specific events to be made available.)

AWS does have one key advantage over other major cloud providers such as Google GCP and Microsoft Azure: it set out to build a business (an online marketplace) using these services. So its strategy was (and is) driven by its vision for how to build a globally scalable online business – not by the need to provide technology services to other businesses. To that extent, it’s hard to see Google and Microsoft being anything other than followers of AWS’s lead.

A Prediction..

Businesses which also follow the Amazon-inspired growth/innovation and organization model will likely have a better chance of succeeding in the digital age. And it is for these businesses that EventBridge will have the most impact – far beyond the technological improvements afforded by the use of events vs webhooks/APIs.

Consequently, as more SaaS companies are on-boarded onto the AWS EventBridge eco-system, we can expect more event models to be published. Tools for managing and evolving event models will evolve and improve so they become more accessible and useful for non-traditional IT folks (i.e., process and workflow designers) – currently, the only way to see event model definitions seems to be by actually creating business rules.

This increased focus on SaaS integrations may (perhaps) inspire firms to re-organize their internal capabilities along similar lines, as internal service providers, empowering cells across the organization and with a published and accessible software-driven event model – noting that while events may be published and received digitally, they can still be actioned by humans for non-digital processes (e.g., complex pricing decision making, responding to help desk requests, etc).

The roster of SaaS firms signing up to EventBridge over the coming months will hopefully bear out this prediction. A good sense of what services could be onboarded can be had by looking at all the SaaS (and IoT) services integrated by IFTTT.

In the meantime, it is time to explore the re-imagined integration opportunities afforded by AWS EventBridge..


The cloudy future of data management & governance

[tl;dr The cloud enables novel ways of handling an expected explosion in data store types and instances, allowing stakeholders to know exactly what data is where at all times without human process dependencies.]

Data management & governance is a big and growing concern for more and more organizations of all sizes. Effective data management is critical for compliance, resilience, and innovation.

Data governance is necessary to know what data you have, when you got it, where it came from, where it is being used, and whether it is of good quality or not.

While the field is relatively mature, the rise of cloud-based services and service-enabled infrastructure will, I believe, fundamentally change the nature of how data is managed in the future and enable greater agility if leveraged effectively.

Data Management Meta-Data

Data and application architects are concerned with ensuring that applications use the most appropriate data storage solution for the problem being solved. To better manage cost and complexity, firms tend to converge on a handful of data management standards (such as Oracle or SQL Server for databases; NFS or NTFS for filesystems; Netezza or Teradata for data warehousing; Hadoop/HDFS for data processing, etc). Expertise is concentrated around central teams that manage provisioning, deployments, and operations for each platform. This introduces dependencies that project teams must plan around. It also requires forward planning and long-term commitment – so it is not particularly agile.

Keeping up with data storage technology is a challenge – technologies like key/value stores, graph databases, columnar databases, object stores, and document databases exist because they represent varying datasets in a more natural way for applications to consume, reducing or eliminating the ‘impedance mismatch‘ between how applications view state and how that state is stored.

In particular, many datastore technologies are designed to scale up rather than out; i.e., the only way to make them perform faster is to add more CPU/memory or faster IO hardware. While this keeps applications simpler, it requires significant forward planning and longer-term commitments to scale up, and is out of the control of application development teams. Cloud-based services can typically handle scale-out transparently, although applications may need to be aware of the data dimensions across which scale-out happens (e.g., sharding by primary key, etc).

On-premise, provisioning requests for a new datastore are mostly ticket-driven, and fulfillment is still largely performed by humans rather than by software – which means an “infrastructure-as-code” approach is not feasible within most enterprises.

Data Store Manageability vs Application Complexity

Most firms decide that it is better to simplify the data landscape such that fewer datastore solutions are available, but to resource those solutions so that they are properly supported to handle business critical production workloads with maximum efficiency.

The trade-off is in the applications themselves, where the available data storage solutions end up driving the application architecture, rather than the application architecture (i.e., the requirements) dictating the most appropriate data store solution – the one with the lowest impedance mismatch.

A typical example of an impedance mismatch is object-oriented applications (written in, say, C++ or Java) which use relational databases. Here, object/relational mapping technologies such as Hibernate or Gigaspaces are used to map the application view of the data (which likes to view data as in-memory objects) to the relational view. These middle layers, while useful for naturally relational data, can be overly expensive to maintain and operate if what your application really needs is a more appropriate type of datastore (e.g., graph).

This mismatch gets exacerbated in a microservices environment where each microservice is responsible for its own persistence, and individual microservices are written in the language most appropriate for the problem domain. Typical imperative, object-oriented languages implementing transactional systems will lean heavily towards relational databases and ORMs, whereas applications dealing with multi-media, graphs, very-large objects, or simple key/value pairs will not benefit from this architecture.

The rise of event-driven architectures (in particular, transactional ‘sagas‘, and ‘aggregates‘ from DDD) will also tend to move architectures away from ‘kitchen-sink’ business object definitions maintained in a single code-base into multiple discrete but overlapping schemas maintained by different code-bases, and triggered by common or related events. This will ultimately lead to an increase in the number of independently managed datastores in an organisation, all of which need management and governance across multiple environments.

For on-premise solutions, the pressure to keep the number of datastore options down, while dealing with an explosion in instances, is going to limit application data architecture choices, increase application complexity (to cope with datastore impedance mismatch), and reduce the benefits from migrating to a microservices architecture (shared datastores favor a monolithic architecture).

Cloud Changes Everything

So how does cloud fundamentally change how we deal with data management and governance? The most obvious benefit cloud brings is the variety of data storage services available, covering all the typical use cases applications need. Capacity and provisioning are no longer an operational concern, as they are handled by the cloud provider. So data store resource requirements can now be formulated in code (e.g., in CloudFormation, Terraform, etc).

This, in principle, allows applications (microservices) to choose the most appropriate storage solution for their problem domain, and to minimize the need for long-term forward planning.

Using code to specify and provision database services also has another advantage: cloud service providers typically offer the means to tag all instantiated services with your own meta-data. So you can define and implement your own data management tagging standards, and enforce these using tools provided by the cloud provider. These can be particularly useful when integrating with established data discovery tools, which depend on reliable meta-data sources. For example, tags can be defined based on a data ontology defined by the chief data office (see my previous article on CDO).
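
A minimal sketch of this tagging in code, assuming a simple CDO-defined ontology (DataDomain, Sensitivity, Owner); the tag keys and values are illustrative, not a real standard.

```javascript
// Applying data management tags to a storage resource at provisioning time.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function tagDataStore(bucketName) {
  await s3.putBucketTagging({
    Bucket: bucketName,
    Tagging: {
      TagSet: [
        { Key: 'DataDomain', Value: 'customer' },     // illustrative ontology values
        { Key: 'Sensitivity', Value: 'pii' },
        { Key: 'Owner', Value: 'payments-team' }
      ]
    }
  }).promise();
}
```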

These mechanisms can be highly automated via service catalogs (such as AWS Service Catalog or ServiceNow), which allow compliant stacks to be provisioned without requiring developers to directly access the cloud provider’s APIs.

Let a thousand flowers bloom

The obvious downside to letting teams select their storage needs is the likely explosion of data stores – even if they are selected from a managed service catalog. But the expectation is that each distinct store would be relatively simple – at least compared to relational stores which support many application use cases and queries in a single database.

In on-premise situations, data integration is also a real challenge – usually addressed by a myriad of ad-hoc jobs and processes whose purpose is to extract data from one system and send it to another (i.e., ETL). Usually no meta-data exists around these processes, except that afforded by proprietary ETL systems.

In best case integration scenarios, ‘glue’ data flows are implemented in enterprise service buses that generally will have some form of governance attached – but which usually has the undesirable side-effect of introducing yet another dependency for development teams which needs planning and resourcing. Ideally, teams want to be able to use ‘dumb’ pipes for messaging, and be able to self-serve their message governance, such that enterprise data governance tools can still know what data is being published/consumed, and by whom.

Cloud provides two main game-changing capabilities for managing data complexity at scale. Specifically:

  • All resources that manage data can be tagged with appropriate meta-data – without needing to, for example, examine tables or know anything about the specifics of the data service. This can also extend to messaging services.
  • Serverless functions (e.g., AWS Lambda, Azure Functions, etc) can be used to implement ‘glue’ logic, and can themselves be tagged and managed in an automated way. Serverless functions can also be used to do more intelligent updates of data management meta-data – for example, updating a specific repository when a particular service is instantiated (see the sketch below). Serverless functions can be viewed as on-demand microservices which may have their own data stores – usually provided via a managed service.
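
As an example of the second capability, the sketch below shows a hypothetical ‘glue’ function that registers newly created S3 buckets in a data catalogue table, triggered by an event rule matching the corresponding CloudTrail API call; the table and attribute names are illustrative assumptions.

```javascript
// A 'glue' function keeping a data management catalogue current.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const bucketName = event.detail.requestParameters.bucketName; // from the CloudTrail 'CreateBucket' event payload

  await dynamo.put({
    TableName: 'data-catalogue',        // hypothetical catalogue table
    Item: {
      resourceId: `s3://${bucketName}`,
      resourceType: 's3-bucket',
      registeredAt: new Date().toISOString()
    }
  }).promise();
};
```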

Data, Data Everywhere

By adopting a cloud-enabled microservice architecture, using datastore services provisioned by code, applying event driven architecture, leveraging serverless functions, and engaging with the chief data officer for meta-data standards, it will be possible to have an unprecedented up-to-date view of what data exists in an organization and where. It may even address static views of data in motion (through tagging queue and notification topic resources). The data would be maintained via policies and rules implemented in service catalog templates and lambda functions triggered automatically by cloud configuration changes, so it would always be current and correct.

The CDO, as well as data and enterprise architects, would be the chief consumer of this metadata – either directly or as inputs into other applications, such as data governance tools, etc.

Conclusion

The ultimate goal is to avoid data management and governance processes which rely on reactive human (IT) input to maintain high-quality data management metadata. Reliable metadata can give rise to a whole new range of capabilities for stakeholders across the enterprise, and finally take IT out of the loop for business-as-usual data management queries, freeing up valuable resources for building even more data-driven applications.


The future of modularity is..serverless

[tl;dr As platform solutions evolve and improve, the pressure for firms to reduce costs, increase agility and be resilient to failure will drive teams to adopt modern infrastructure platform solutions, and in the process decompose and simplify monoliths, adopt microservices and ultimately pave the way to building naturally modular systems on serverless platforms.]

“Modularity” – the (de)composition of complex systems into independently composable or replaceable components without sacrificing performance, security or usability – is an architectural holy grail.

Businesses may be modular (commonly expressed through capability maps), and IT systems can be modular. IT modularity can also be described as SOA (Service Oriented Architecture), although because of many aborted attempts at (commercializing) SOA in the past, the term is no longer in fashion. Ideally, the relationship between business ‘modules’ and IT application modules should be fully aligned (assuming the business itself has a coherent underlying business architecture).

Microservices are the latest manifestation of SOA, but this is born from a fundamentally different way of thinking about how applications are developed, tested, deployed and operated – without the need for proprietary vendor software.

Serverless takes the microservices concept one step further, by removing the need for developers (or, indeed, operators) to worry about looking after individual servers – whether virtual or physical.

A brief history of microservices

Commercial manifestations of microservices have been around for quite a while – for example Spring Boot, or OSGi for Java – but these have very commercial roots, and implement a framework tied to a particular language. Firms may successfully implement these technologies, but they will need to have already gone through much of the microservices stone soup journey. It is not possible to ‘buy’ a microservices culture from a technology vendor.

Because microservices are intended to be independently testable and deployable components, a microservices architecture inherently rejects the notion of a common framework for implementing/supporting the microservices natures of an application. This therefore puts the onus on the infrastructure platform to provide all the capabilities needed to build and run microservices.

So, capabilities like naming, discovery, orchestration, encryption, load balancing, retries, tracing, logging, monitoring, etc which used to be handled by language-specific frameworks are now increasingly the province of the ‘platform’. This greatly reduces the need for complex, hard-to-learn frameworks, but places a lot of responsibility on the platform, which must handle these requirements in a language-neutral way.

Currently, the most popular ‘platforms’ are the major cloud providers (Azure, Google, AWS, Digital Ocean, etc), IaaS vendors (e.g., VMWare, HPE), core platform building blocks such as Kubernetes, and platform solutions such as Pivotal Cloud Foundry, OpenShift and Mesosphere. (IBM’s BlueMix/Cloud is likely to be superseded by Red Hat’s OpenShift.)

The latter solutions previously had their own underlying platform solutions (e.g., OSGi for BlueMix, BOSH for PKS), but most platform vendors have now shifted to use Kubernetes under the hood. These solutions are intended to work in multiple cloud environments or on-premise, and therefore in principle allow developers to avoid caring about whether their applications are deployed on-premise or in the cloud, in an IaaS-neutral way.

Decomposing Monolithic Architectures

With the capabilities these platforms offer, developers will be incentivized to decompose their applications into logical, distributed functional components, because the marginal additional cost of maintaining/monitoring each new process is relatively low (albeit definitely not zero). This approach is naturally amenable to supporting event driven architectures, as well as more conventional RESTful and RPC architectures (such as gRPC), as running processes can be mapped naturally to APIs, services and messages.

But not all processes need to be running constantly – and indeed, many processes are ‘out-of-band’ processes, which serve as ‘glue’ to connect events that happen in one system to another system. If events are relatively infrequent (e.g., less than one every few seconds), then no resources need to be used in between events. So provisioning long-running Docker containers, etc., may be overkill for many of these processes – especially if the ‘state’ required by those processes can be made available in a low-latency, highly available, long-running infrastructure service such as a high-performance database or cache.

Functions on Demand

Enter ‘serverless’, which aims to specify, in a single package, the resources required to execute a single piece of code (basically a functional monolith) on demand – roughly the equivalent of, for example, a declarative service in OSGi. The runtime in which the code runs is not the concern of the developer in a serverless architecture. There are no VMs, containers or side-cars – only functions communicating via APIs and events.

Currently, the serverless offerings of the major cloud providers are really only intended for ‘significant’ functions which justify the separate allocation of compute, storage and network resources needed to run them. A popular use case is ‘transformational’ functions which convert binary data from one form to another – e.g., creating a thumbnail image from a full image – and which may temporarily require a lot of CPU or memory. In contrast, an OSGi Declarative Service, for example, could be instantiated by the runtime inside the same process/memory space as the calling service – a handy technique for validating a modular architecture without worrying about the increased failure modes of a distributed system, while allowing the system to be readily scaled out in the future.
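
A sketch of that thumbnail use case as an S3-triggered Lambda function, assuming the ‘sharp’ image library is packaged with the function; the bucket layout and sizes are illustrative.

```javascript
// A 'transformational' function: create a thumbnail for each uploaded image.
const AWS = require('aws-sdk');
const sharp = require('sharp');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  const record = event.Records[0].s3;
  const key = decodeURIComponent(record.object.key.replace(/\+/g, ' '));

  const original = await s3.getObject({ Bucket: record.bucket.name, Key: key }).promise();

  // CPU/memory is only allocated for the duration of this invocation.
  const thumbnail = await sharp(original.Body).resize(128).png().toBuffer();

  await s3.putObject({
    Bucket: record.bucket.name,
    Key: `thumbnails/${key}`,
    Body: thumbnail,
    ContentType: 'image/png'
  }).promise();
};
```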

Modular Architectures vs Distributed Architectures

Serverless functions can be viewed as ‘modules’ by another name – albeit modules that happen to require separately allocated memory, compute and storage to the calling component. While this is a natural fit for browser-based applications, it is not a great fit for monolithic applications that would benefit from modular architectures, but not necessarily benefit from distributed architectures. For legacy applications, the key architectural question is whether it is necessary or appropriate to modularize the application prior to distributing the application or migrating it to an orchestration platform such as Kubernetes, AWS ECS, etc.

As things currently stand, the most appropriate (lowest risk) migration route for complex monolithic applications is likely to be a migration of some form to one of the orchestrated platforms identified above. By allowing the platform to take care of ‘non-functional’ features (such as naming, resilience, etc), perhaps the monolith can be simplified. Over time, the monolith can then be decomposed into modular ‘microservices’ aligned by APIs or events, and perhaps eventually some functionality could decompose into true serverless functions.

Serverless and Process Ownership

Concurrently with decomposing the monolith, a (significant) subset of features – mainly those not built directly using the application code-base, or which straddle two applications – may be meaningfully moved to serverless solutions without depending on the functional decomposition of the monolith.

It’s interesting to note that such an architectural move may allow process owners to own these serverless functions, rather than relying on application owners, where often, in large enterprises, it isn’t even clear which application owner should own a piece of ‘glue’ code, or be accountable when such code breaks due to a change in a dependent system.

In particular, existing ‘glue’ code which relies on centralized enterprise service buses or equivalent would benefit massively from being migrated to a serverless architecture. This not only empowers teams that look after the processes the glue code supports, but also ensures optimal infrastructure resource allocation, as ESBs can often be heavy consumers of infrastructure resources. (Note that a centralized messaging system may still be needed, but this would be a ‘dumb pipe’, and should itself be offered as a service.)

Serverless First Architecture

Ultimately, nirvana for most application developers and businesses is a ‘serverless-first’ architecture, where delivery velocity is limited only by the capabilities of the development team, and solutions scale both in function and in usage seamlessly without significant re-engineering. It is fair to say that serverless is a long way from achieving this nirvana (technologies like ‘AIOps‘ have a long way to go), and most teams still have to shift from monolithic to modular and distributed thinking, while recognizing when a monolith remains the most appropriate solution for a given problem.

As platform solutions improve and mature, however, and the pressure mounts on businesses whose value proposition is not in the platform engineering space to reduce costs, increase agility and be increasingly resilient to failures of all kinds, the path from monolith to orchestrated microservices to serverless (and perhaps ‘low-code’) applications seems inevitable.


Message Evolution in High Performance Messaging Environments

tl;dr

Moving to an event-driven architecture in a high-performance environment has specific needs that do not yet have widely standardized solutions, and as such require a high degree of focus on both software engineering and business architecture.

Context

Event- or message-driven applications exist in at least two contexts – an application-specific context, and a domain or enterprise context. For high-performance applications, latency is typically more sensitive within the application context, and less sensitive in the domain/enterprise context.

For the purposes of this article, the application-specific context is assumed to relate to components that are typically deployed together when a new feature is released – i.e., there is high coupling and high cohesion between components. All application-context components are generally tested and deployed together as a unit.

Application Context

The high coupling and high cohesion of the application context is usually a compromise made to achieve low-latency performance requirements, as normal microservice architecture best practice states that services should be independently deployable and hence loosely coupled. This impacts the overall agility of the architecture, but fundamentally, with automated configuration, testing and deployment, it should be manageable by a ‘pizza-sized’ team without losing the integrity of the platform or the performance of the network communications.

A given application can reasonably require a single/common version of a highly optimized serialization library, and related message schema definitions, to be used by all components at any given time, as such deployments can be enforced through deployment/configuration automation processes.

In general, these deployments do not require meta-data to be included in messages, as the meta-data will be explicit in the application code. High-performance binary-encoded protocols like Protocol Buffers and Avro can handle a certain amount of schema evolution, but in general for highly coupled applications, all components should use the same version of serialized objects and be deployed simultaneously.
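
To illustrate the kind of evolution binary codecs can handle, the sketch below uses Avro via the avsc npm library (an assumed choice, not one named in the article): data written with a v1 schema is decoded against a v2 reader schema that adds a defaulted field.

```javascript
// Avro schema evolution sketch using the avsc library (assumed choice).
const avro = require('avsc');

// Writer (v1) schema: what the producing component encodes.
const writerType = avro.Type.forSchema({
  type: 'record',
  name: 'Trade',
  fields: [
    { name: 'id', type: 'string' },
    { name: 'qty', type: 'int' }
  ]
});

// Reader (v2) schema: adds a field with a default, so v1 data remains readable.
const readerType = avro.Type.forSchema({
  type: 'record',
  name: 'Trade',
  fields: [
    { name: 'id', type: 'string' },
    { name: 'qty', type: 'int' },
    { name: 'venue', type: 'string', default: 'UNKNOWN' }
  ]
});

const resolver = readerType.createResolver(writerType);
const encoded = writerType.toBuffer({ id: 'T-1', qty: 100 });
const decoded = readerType.fromBuffer(encoded, resolver, true); // noCheck: buffer was written with v1
console.log(decoded); // { id: 'T-1', qty: 100, venue: 'UNKNOWN' }
```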

Application messaging contexts can get complicated, and may include both low-latency and normal-latency scenarios, with different degrees of cohesion and coupling. The key point is that messaging and deployment decisions lie within one ‘pizza-sized’ team and are not subject to wider enterprise governance. When a single pizza-sized team eventually becomes insufficient for the application’s complexity, technical debt relating to these decisions will be revealed as teams split, and will need to be addressed, as governance, agility and latency needs may all be in conflict.

Domain/Enterprise Context

The domain context may include many pizza-sized application teams. For maximum agility, these disparate applications should have relatively low (functional) cohesion and low coupling, but each may consist of a number of high cohesion microservices – the level of coupling depending on the extent to which message codecs are optimised for network performance or agility.

To maximize decoupling and ensure maximum independence of testing, deployment, configuration and operation, the codecs used in the enterprise context should be as flexible as possible.

A consequence of this approach is that, to ensure decoupling, there will in many cases need to be some translation between events in the application context and events in the domain/enterprise context. Typically this is done by a separate event-driven component that performs the necessary translations, consuming from one messaging channel and publishing to another (sketched below). The additional overhead of this should be weighed against the agility cost of trying to maintain application schema consistency at enterprise scale, which can initially be considerable as teams begin to adopt event-driven architecture. (The legacy of Enterprise Service Buses, and the messaging bottlenecks they often cause, shows how extreme this cost can be.)
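
As a concrete (non-authoritative) illustration of such a translator, the sketch below assumes Kafka as the messaging technology and hypothetical channel names (app.trades.internal, enterprise.trades.v1); the real work lies in the field-level mapping stubbed out in toEnterpriseFormat.

  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;

  public class ApplicationToEnterpriseTranslator {
      public static void main(String[] args) {
          Properties consumerProps = new Properties();
          consumerProps.put("bootstrap.servers", "localhost:9092"); // assumed local broker
          consumerProps.put("group.id", "app-to-enterprise-translator");
          consumerProps.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
          consumerProps.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");

          Properties producerProps = new Properties();
          producerProps.put("bootstrap.servers", "localhost:9092");
          producerProps.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          producerProps.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
               KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
              consumer.subscribe(List.of("app.trades.internal")); // application-context channel (hypothetical)
              while (true) {
                  for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                      // Re-shape the application event into the enterprise representation.
                      String enterpriseEvent = toEnterpriseFormat(rec.value());
                      producer.send(new ProducerRecord<>("enterprise.trades.v1", rec.key(), enterpriseEvent));
                  }
              }
          }
      }

      // Placeholder: a real translator re-maps fields to the enterprise schema and semantics.
      static String toEnterpriseFormat(String appEvent) {
          return appEvent;
      }
  }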

Message Standards & Governance

In general, there is a trade-off between agility and data architecture compliance at the domain/enterprise level. In order to avoid the insertion of another technology team between producers and consumers, it is generally best to follow the microservice best practice of ‘dumb pipes and smart end-points’ – i.e., any compliance with standards is not enforced by the messaging infrastructure but instead at the application (or ‘project’) level.

It is feasible to develop run-time tools to assess the data architecture compliance of messages over the bus – in many circumstances this may offer the best balance between compliance and agility, especially if they run in lower environments prior to production deployment.
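
For example, such a run-time check can be as simple as a validator run against messages tapped from a channel in lower environments. The sketch below is illustrative only: it assumes JSON payloads and a hypothetical set of required envelope fields standing in for whatever the data architecture actually mandates.

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import java.util.List;

  public class MessageComplianceChecker {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      // Hypothetical enterprise data-architecture rule: every published event
      // must carry these envelope fields.
      private static final List<String> REQUIRED_FIELDS =
          List.of("eventId", "eventTime", "producer", "schemaVersion");

      /** Returns the rule violations for one message payload; an empty list means compliant. */
      public static List<String> check(String jsonPayload) throws Exception {
          JsonNode message = MAPPER.readTree(jsonPayload);
          return REQUIRED_FIELDS.stream()
              .filter(field -> !message.hasNonNull(field))
              .map(field -> "missing required field: " + field)
              .toList();
      }
  }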

Enterprise Message Characteristics

Messages in the enterprise domain have some specific characteristics:

  • No assumptions should be made about the consuming applications, in terms of the languages, libraries or frameworks used, except to the extent that the serialization mechanism is supported.
  • Messages should be readily readable by authorized human readers irrespective of schema versions (in particular, operations & support staff).
  • Messages (or parts of messages) should not be readable by unauthorized readers (human or computer).

When it comes to choosing a serialization technology, it is all about compromises: there is no silver bullet, and there are trade-offs between performance, flexibility, readability, precision, etc. An excellent read on this topic is Martin Kleppmann’s book ‘Designing Data-Intensive Applications’.

Schema Evolution using a Message Bus

Schema evolution is one of the biggest challenges any technology team has to deal with. Databases, events/messages and RESTful APIs all require schemas to be managed.

Microservices aim to minimize the complexity of managing database schema evolution by ensuring that any application depending on a particular dataset accesses that data only through a microservice. In effect, the database has only one reader/writer, and so schema evolution can be tied to deployments of that one application – much easier to manage.

However, this pushes inter-application schema evolution to the messaging or API layer. For breaking schema changes (i.e., where a new version of a schema is incompatible with a prior version), two principal approaches can be considered:

  • A single ‘channel’ (topic or URI) handles all schema versions, and each consumer must be able to handle all schema versions received over that channel
  • Each schema version has its own message channel (topic or URI), and new consumers are created specifically to consume messages from that channel.

Note that the approaches above only address the technical aspects of decoding incompatible message versions: they do not address semantic changes, which can only be resolved at the application level.

Single Channel

In this case, the consumer must have multiple versions of the deserializer ‘built-in’, so it can interpret the version header and invoke the correct deserializer. For many languages, such as Java, this is difficult, as it requires supporting multiple versions of the same classes in the same process.

It is, in principle, doable using OSGi, but otherwise, the consuming application may be forced to delegate incoming messages to other processes for deserializing, which could be expensive. Alternatively, the IDL parser for the serializer could generate unique encoding/decoding classes for each version of a schema so they could reside in the same process. However, message meta-data indicating the correct schema version would need to be very reliable to ensure this works well.
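
A minimal sketch of the dispatch step follows, with hypothetical decodeV1/decodeV2 stubs standing in for the version-specific (possibly generated) codec classes; the reliability caveat above applies to wherever schemaVersion is read from the message meta-data.

  import java.util.Map;
  import java.util.function.Function;

  public class VersionedMessageDispatcher {

      /** Hypothetical internal representation shared by all schema versions. */
      record TradeEvent(String tradeId, double notional, String currency) {}

      // One decoder per schema version 'built in' to the consumer.
      private final Map<Integer, Function<byte[], TradeEvent>> decoders = Map.of(
          1, VersionedMessageDispatcher::decodeV1,
          2, VersionedMessageDispatcher::decodeV2);

      /** Reads the version from the message header and routes to the matching decoder. */
      TradeEvent dispatch(int schemaVersion, byte[] payload) {
          Function<byte[], TradeEvent> decoder = decoders.get(schemaVersion);
          if (decoder == null) {
              throw new IllegalArgumentException("unsupported schema version: " + schemaVersion);
          }
          return decoder.apply(payload);
      }

      // Stubs for the version-specific deserializers (e.g., generated codec classes).
      private static TradeEvent decodeV1(byte[] payload) {
          throw new UnsupportedOperationException("plug in the v1 deserializer here");
      }

      private static TradeEvent decodeV2(byte[] payload) {
          throw new UnsupportedOperationException("plug in the v2 deserializer here");
      }
  }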

Multiple Channel

In this case, each new (breaking) schema definition will have its own channel (topic/URI), such that new processes specifically built to consume those schema messages can be deployed that subscribe to that channel.

This avoids the need for delegating deserialization, and may be easier to debug when issues occur. However, it can add complexity to channel/topic namespaces, and mechanisms may need to be in place to ensure that all expected consumers are running and that there are no accidental ‘orphan’ messages being published (i.e., messages for which no consumer is active).
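
A sketch of the per-version consumer, again assuming Kafka and a hypothetical enterprise.trades.vN topic naming convention:

  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import java.util.List;
  import java.util.Properties;

  public class VersionedTopicConsumer {
      public static void main(String[] args) {
          // Hypothetical convention: each breaking schema change gets its own topic,
          // e.g. enterprise.trades.v1, enterprise.trades.v2, ...
          int schemaVersion = Integer.parseInt(args[0]);
          String topic = "enterprise.trades.v" + schemaVersion;

          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");           // assumed local broker
          props.put("group.id", "trades-consumer-v" + schemaVersion); // one consumer group per version
          props.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer",
              "org.apache.kafka.common.serialization.ByteArrayDeserializer");

          try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(List.of(topic));
              // ... poll loop built against exactly one version of the codec,
              // so no in-process version dispatch is needed.
          }
      }
  }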

Architectural Implications of EDA

Fundamentally, to handle an enterprise-wide event-driven architecture, organizations must be fully committed to implementing a microservices architecture. At the simplest level, this means that the cost and overhead of deploying and operating new application components is very low: orchestration, configuration, logging, monitoring and control mechanisms are all standardized across deployed components, so there is no operational resistance to deploying separate processes for each message type and/or channel as needed, or to deploying various adaptors/gateways to cope with potentially multiple incompatible enterprise consumers.

Implementing any form of EDA without addressing the above is unlikely to substantially improve business agility or reduce lead times. Instead, it could increase co-dependencies across components, reducing overall system availability and stability, and requiring coordinated integration testing and deployments on a periodic basis (every 2-3 months, for example).

Conclusion

The points above are based on observing the architectural evolution of systems I have been directly involved with, the challenges faced by multiple teams moving to an event-driven architecture, and the lessons learned from that process.

While a number of issues are squarely in the technical domain, some of the hardest decisions relate to what should be considered part of the ‘application’ domain versus what belongs in the ‘enterprise’ domain. Usually, there will be strong business drivers behind merging applications previously in separate domains into a single domain – and this will have implications for team size, message standards, etc. Fundamentally, however, IT should not attempt to draw technical message or data boundaries around applications that are not directly aligned with business architecture goals.

In essence, if business architects and/or product owners are not directly involved in dictating messaging standards (including semantic definitions of fields) across applications, then Conway’s Law applies: messaging standards remain local to the teams that use them, with many message flows existing as bilateral agreements between applications.

This naturally gives rise to a ‘spaghetti’ architecture, but if this reflects how business processes are actually aligned and communicate, and the business is happy with this, then all IT can do is manage it, not eliminate it.


Becoming a financial ‘super-power’ through emerging technologies

Recently, the Tabb Forum published an article (login needed) proposing 4 key emerging technology strategies that would enable market participants to keep pace with a trading environment that is constantly changing.

Specifically, the emerging technologies are:

  • AI for Risk Management
  • Increased focus on data and analytics
  • Accepting public cloud
  • Being open to industry standardization

It is worth noting that in the author’s view, these emerging technologies are table stakes rather than differentiators – i.e., the market will pivot around growing use of these technologies, rather than individual firms gaining a (temporary) advantage by using them. In the long term this seems plausible, but it does not (IMO) preclude opportunistic use cases in the short/medium term, and firms that can use these technologies effectively will, in effect, acquire banking ‘super-powers’.

AI for Risk Management

The key idea here is that AI could be used to augment current risk management processes, and offer new insights into where market participants may be carrying risk, or provide new insights into who to trade with and what to trade.

Current risk management processes are brute-force, involving complex calculations with multiple inputs and many outputs for different scenarios. In addition, human judgement is needed to apply various adjustments (known as XVA) to model-computed valuations to account for trade-specific context.

For AI to be used effectively for risk management, certain key technical capabilities need to be in place – specifically:

  • Data lineage, semantics and quality management
  • Feedback loops between pre-trade, trade and post-trade analytics

Many financial firms are already addressing data lineage, semantics and quality management through compliance with regulations such as BCBS239. However, these capabilities need to be infused into a firm’s architecture (processes as well as technology) to be useful for AI use cases. Currently, the available tools are generally not highly integrated with each other or with the systems that depend on them, and the human processes around these tools are still maturing.

With respect to developing machine learning models, an AI system needs to understand what outcomes happened in the past in order to make predictions or suggestions. For most traders today, such knowledge is encapsulated in complex spreadsheets that they amend over time as new insights are discovered. These spreadsheets are often proprietary to the trader, and over time become increasingly unmaintainable as calculations become interdependent and changing one calculation has a high chance of breaking another. This impedes a trader’s ability to keep their models aligned with their understanding of the markets.

Clearly another approach is needed. A key challenge is how to augment a trader’s capabilities with AI-enabled tooling, without at the same time suggesting the trader is himself/herself surplus to requirements. (This is a challenge all AI solutions face.)

One approach would be to bias AI algorithms, at some level, towards individual traders’ decision-making and learning processes, and to tie the use of such algorithms to the continued employment of those traders.

Brute-force AI learning based on all data passing through pre-trade (including market data), trade and post-trade systems is possible, but the skills in selecting critical data points for different contexts are at least as valuable as basic trading skills, and the infrastructure cost of doing this is still considerable.

Increased Focus on Data and Analytics

The key point being made by the author is that managing data should be a key strategic function, rather than being left to individual areas to manage on a per-application basis.

Again, this is tied to efforts relating to data lineage, semantics and quality. Efforts in this space can be directed at specific areas (such as risk management), but every function has growing analytics needs that traditional warehouse- and analytics-based solutions cannot keep pace with – especially when, as is increasingly the case, every functional domain has its own agenda to introduce AI/machine learning into its processes to improve customer experience and/or regulatory compliance.

As Risk Management consumes ever more data points from across the firm to refine risk predictions, it is not unreasonable for the Risk Management function to take leadership of the requirements and/or technology for data management for a firm’s broader needs. However, this requires significant investment to make data management tools and infrastructure available as an easy-to-use service across multiple domains, which is essentially what is needed. Hence, the third key emerging technology..

Accept the public cloud

The traditional way of developing technology has been to

  1. identify the requirements
  2. propose an architecture
  3. procure and provision the infrastructure
  4. develop and/or procure the software
  5. test and deploy
  6. iterate

Each of these steps takes a considerable amount of time in traditional organizations. Anything learned after deploying the software and getting user feedback necessarily has to be addressed within the constraints of the previously defined architecture and infrastructure – often incurring technical debt in the process.

Particularly for data and analytics use cases, this process is unsustainable. Rather than adapting applications to the infrastructure (as this approach requires), the infrastructure should adapt to the application, and if it proves inappropriate, it should be disposed of without any concern about ‘sunk cost’, ‘amortization’, etc.

The public cloud is the only viable way to evolve complex data and analytics architectures: it allows infrastructure to be constantly reviewed and realigned with application needs, minimizing technical debt in the process.

The discovery of which data management and analytics ‘services’ need to be built and made available to users across a firm is, today, a process of learning and iteration (in the true ‘agile’ sense). Traditional solutions preclude such agility, but embracing public cloud enables it.

One point not raised in the article is the impact of ‘serverless’ technologies, which could be a game-changer for some use cases. While serverless in general could be taken to represent the virtualization of infrastructure, it specifically addresses solutions where development teams do not need to manage *any* infrastructure (virtual or otherwise) – i.e., they just consume a service, and costs are directly related to usage, such that zero usage means zero cost.

‘Serverless’ is not necessarily restricted to public cloud providers – a firm’s internal IT could provide this capability. As standards mature in this space, firms should start to think about how client needs could be better met through adoption of serverless technologies, which would require a substantial rethink of the traditional software development lifecycle.

Be open-minded about industry standardization

Conversation around industry standards today is driven by blockchain/distributed ledger technology and/or the efficient processing of digital assets. Industry standardization efforts have always been a feature of capital markets, mostly driven by the need to integrate with clearing houses and exchanges – i.e., centralized services. Firms generally view their data and processing models as proprietary, and are fundamentally resistant to their commoditization. After all, if all firms ran the same software, where would their differentiation be?

The ISDA Common Domain Model (CDM) seems to be quite different, in that it is not specifically driven by the need to integrate with a third party, but rather by the need for every party to integrate with every other party – e.g., in an over-the-counter model.

Historically, the easiest way to regulate a market has been to introduce a clearing house or regulated intermediary that ensures transparency of all market activity to the regulators. However, this is undesirable for products negotiated directly between trading parties or for many digital products, so some way is needed to square this circle. Irrespective of the underlying technology, only a common data model can give both market participants and regulators a shared view of transactions. Blockchain/DLT may well be an enabling technology for digitally cleared assets, but other cryptography-based solutions that ensure only the participants and the regulators have access to commonly defined critical data may also arise.

Initially, integrations and adaptors for the CDM will need to be built with existing applications, but eventually most systems will need to natively adopt the CDM. This should eventually give rise to ‘derivatives-as-a-service’ platforms, with firms differentiating based on product innovations, and potentially major participants offering their platforms to other firms as a service.

Conclusion

Firms which can thread all four of these themes together to provide a common platform will indeed, in my view, have a major advantage over firms which do not. Judicious use of the public cloud, strong common data models to drive platform evolution, shared services across business functions for data management and analytics, and AI/machine learning to augment users’ work in different domains (starting with trading) all combine to yield extraordinary benefits. The big question is how quickly this will happen, and how to effectively balance investment in existing application portfolios against new initiatives that can properly leverage these technologies, even as they continue to evolve.
