The changing role of data lakes

[tl;dr A single data lake, data warehouse or data pipeline to “rule them all” is less useful in hybrid cloud environments, where it can be feasible to query ‘serverless’ cloud-native data sources directly rather than rely on traditional orchestrated batch extracts. Pipeline complexity can be reduced by open extensions to SQL such as the recently announced AWS PartiQL language. Opportunities exist to integrate enterprise human-oriented data governance and meta-data platforms with data pipelines using serverless technologies.]

The need for Data Lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. The data lake concept was created to address a number of issues with traditional data analytics and reporting solutions, specifically:

  • the growing number of applications across an enterprise depending on a given dataset;
  • business and regulatory drivers for governing dataset discovery, quality, creation and/or consumption;
  • the increasing difficulty IT teams have in responding in a timely manner to growing business demand for access to high quality datasets.

The data lake allows data to be made available from its source without making any assumptions about its use. This is particularly critical when the data originates from batch extracts of load-sensitive OLTP databases, most of which are still operating on-premise. Streaming data pipelines, while growing in popularity, are not as common as batch-driven pipelines – although this should change over time as more digital platform architectures become more event-driven in nature.

Data lakes are a key component in data pipelines, a construct (or set of constructs) that provides consolidation of data from multiple sources and makes it available for use. A data pipeline can be orchestrated (via a scheduler) or choreographed (responding to events) – the more jobs a pipeline has to do, the more complex the orchestration or choreography, which has implications for supportability. So reducing the number of jobs a pipeline has to support is key to managing data pipeline complexity.
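To make the orchestration/choreography distinction concrete, here is a minimal Python sketch. The three jobs and the event names (file_landed, extracted, transformed) are hypothetical and not taken from any particular pipeline product. The point is only that the orchestrated version has one place that knows the whole plan, while the choreographed version spreads that knowledge across subscriptions.

```python
from collections import defaultdict

def extract(log):
    log.append("extract")

def transform(log):
    log.append("transform")

def load(log):
    log.append("load")

# Orchestration: a central scheduler knows every job and the order to run them in.
def run_orchestrated(log):
    for job in (extract, transform, load):
        job(log)

# Choreography: no central plan; each job subscribes to the event that triggers it.
class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, log):
        for handler in self.handlers[event_name]:
            handler(log)

def run_choreographed(log):
    bus = EventBus()

    def on_file_landed(log):
        extract(log)
        bus.publish("extracted", log)

    def on_extracted(log):
        transform(log)
        bus.publish("transformed", log)

    def on_transformed(log):
        load(log)

    bus.subscribe("file_landed", on_file_landed)
    bus.subscribe("extracted", on_extracted)
    bus.subscribe("transformed", on_transformed)
    bus.publish("file_landed", log)

orchestrated_log, choreographed_log = [], []
run_orchestrated(orchestrated_log)
run_choreographed(choreographed_log)
```

Both runs execute the same three jobs in the same order. Every additional job adds either a line to the scheduler's plan or another subscription to reason about, which is why the job count drives supportability.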

The Components of a Data Lake

A data lake consists of a few key components:

Feature | Description | Virtual | Traditional
A storage repository | Durable, resilient storage of data objects. | No | Yes
An ingestion mechanism | A means to upload content to the repository (no transformation) | No | Yes
A tagging & metadata mechanism | A means to associate metadata with data objects, including user-defined tags. | Yes | Yes
A metadata search mechanism | A means to search objects in the data lake based on metadata and tags (not content) | Yes | Yes
A query engine | A means to search the content of objects in the data lake | Yes | Partially
An access control mechanism | A means to ensure that users can only access datasets and parts of data sets that they are entitled to see, and to audit all activity. | Yes | Yes

In effect, data lakes have become a kind of data warehouse – the main significant difference being that input sources into data lakes tend to be familiar files – CSVs, Avro, JSON, etc. from multiple sources rather than highly optimized domain-specific schemas – i.e., no assumptions are made about how (or why) the data in the data lake will be consumed. Data lakes also do not concern themselves with scheduling or orchestration.

Data warehouses, data warehouses everywhere…

For mature data use cases (i.e., situations where relatively stable, well-known data requirements exist), and where consistent high performance is material to meeting customer needs, data warehouses are still the best solution. A data warehouse stores and manages all of its data locally, and only relies on the data lake as an initial ingestion point.

A data warehouse will transform datasets to the form needed for the specific use cases it supports, and will optimize performance for the consumption of those datasets. Modern data warehouses will use ML/AI techniques to optimize performance rather than relying on human database specialists. But, as this approach is compute intensive, such solutions are more amenable to cloud environments than on-premise environments. Snowflake is an example of this model. As more traditional data warehouses (e.g., Oracle Exadata) move to the cloud, we can expect these to also get ‘smarter’ – however, data gravity will mean such solutions will need to be fundamentally multi-cloud compatible.

For on-premise data warehouses, the tendency is for business lines or functions to create ‘one data warehouse to rule them all’ – mainly because of the traditionally significant storage and compute infrastructure and resources necessary to support data warehouses. Consequently considerable effort is spent on defining and maintaining high performance, appropriately normalized, enterprise data models that can be used in as many enterprise use cases as possible.

In a hybrid/cloud world, multiple data warehouses become more feasible – and in fact, will be inevitable in larger organizations. As more enterprise data becomes available in these dynamically scalable, cloud-based (or HDFS/Hadoop based) data warehouses (such as AWS EMR, AWS Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse), ‘virtual data warehouses’ avoid the need to move data from its source for query handling, allowing data storage and egress costs to be kept to a minimum, especially if assisted by machine-learning techniques.

Virtual Data Warehouses

Virtual data warehouse technologies have been around for a while, allowing users to manage and query multiple data sources through a common logical access point. For on-premise solutions, virtual data warehouses have limited use cases, as the cost/effort of scaling out in-house solutions can be prohibitive and not particularly agile in nature, precluding experimental use cases.

In hybrid or cloud environments, virtual data warehouses can leverage the scalability of cloud-native data warehouses, driving queries to the relevant engine for execution, and then leveraging their own scalable infrastructure for executing join queries.

Technologies like Dremio reflect the state of the art in cloud-based data warehouses, which push down queries to the source system where possible, but can process them in-memory directly from a data lake or other source if not.
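The push-down idea can be sketched with two in-memory SQLite databases standing in for separate source engines. This is an illustrative assumption and not how Dremio or any specific product works internally: the table names, columns and the virtual_join helper are all hypothetical.

```python
import sqlite3

# Two independent "source engines", each holding part of the data.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 250.0), (2, 11, 40.0), (3, 10, 900.0)],
)

customers_db = sqlite3.connect(":memory:")
customers_db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, region TEXT)")
customers_db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(10, "Acme", "EMEA"), (11, "Globex", "APAC")],
)

def virtual_join(min_amount, region):
    # Push each filter down to the engine that owns the data, so that only
    # matching rows leave the source.
    orders = orders_db.execute(
        "SELECT order_id, customer_id, amount FROM orders WHERE amount >= ?",
        (min_amount,),
    ).fetchall()
    customers = customers_db.execute(
        "SELECT customer_id, name FROM customers WHERE region = ?",
        (region,),
    ).fetchall()

    # The cross-source join runs in the virtual layer itself.
    names = dict(customers)
    return sorted(
        (order_id, names[customer_id], amount)
        for order_id, customer_id, amount in orders
        if customer_id in names
    )

result = virtual_join(min_amount=100.0, region="EMEA")
```

Only the filtered rows cross the boundary between each source and the virtual layer, which is where the savings in data movement and egress cost would come from.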

However, there is one thing that all data warehouses have in common: they leverage SQL and (implicitly) a relational view of the data. Standard ANSI SQL queries are generally supported by all data warehouses, but this may mean that some data cannot be queried if it is not in a tabular form amenable to SQL processing.

Extending SQL with PartiQL

Enter PartiQL, an open-source project sponsored by Amazon to drive extensions to standard SQL that can cope with non-relational data types, including structured, unstructured, nested, and schemaless (NoSQL, Document).

Historically, all data ingested into a data lake had to be transformed into a format that could be queried by SQL-like commands or processed by typical data warehouse bulk-upload tools. This adds complexity to data pipelines (i.e., more jobs), and may also force premature schema design (i.e., forcing the design of an optimal schema before all critical use cases are fully understood).

PartiQL potentially allows tools such as Snowflake, Dremio (as well as the tools AWS uses internally) to query data using SQL-like syntax, but to also include non-relational data in those queries so they can avoid those separate transformation steps, aiding pipeline complexity reduction.
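As a rough illustration of what querying nested data without a prior flattening job looks like, the comment in the sketch below shows a PartiQL-style query that unnests a collection held inside each row, and the Python beneath it emulates the same semantics by hand. The dataset and field names are hypothetical, and the Python is only a stand-in for a query engine, not a PartiQL implementation.

```python
# Nested, document-style records as they might land in a data lake.
employees = [
    {"id": 3, "name": "Bob Smith", "projects": [
        {"name": "AWS Redshift security"},
        {"name": "AWS Aurora security"},
    ]},
    {"id": 4, "name": "Susan Smith", "projects": []},
    {"id": 6, "name": "Jane Smith", "projects": [
        {"name": "AWS Redshift query optimization"},
    ]},
]

# PartiQL-style query being emulated:
#   SELECT e.name AS employee, p.name AS project
#   FROM employees AS e, e.projects AS p
#   WHERE p.name LIKE '%security%'
def security_projects(rows):
    return [
        {"employee": e["name"], "project": p["name"]}
        for e in rows                 # FROM employees AS e
        for p in e["projects"]        # , e.projects AS p  (unnest the nested list)
        if "security" in p["name"]    # WHERE p.name LIKE '%security%'
    ]

matches = security_projects(employees)
```

With plain SQL, the nested projects list would typically first have to be flattened into its own table by a separate pipeline job. Being able to range over e.projects directly in the FROM clause is what removes that step.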

PartiQL claims to be fully ANSI-compliant, but extended in specific ways to support alternate data formats. While not an official ISO/ANSI standard, it may have the ability to become a de-facto standard – especially as the language has already been used in anger with success within AWS. This will provide a skill path for relational data warehouse experts to become proficient in leveraging modern data pipelines without committing to one specific vendor’s technology.

Technologies like PartiQL will make it much easier to include event-sourced streams into a data pipeline, as events are defined as nested or other non-relational structures. As more data pipelines become event driven rather than batch-driven, having a standard like PartiQL will be key. (It will be interesting to see if Confluent’s KSQL and PartiQL will converge to a single event-stream query standard.)

As PartiQL has only just been released, it’s too soon to tell how the big data ecosystem or ISO/ANSI will respond. Expect more on this topic in the future. For now, virtual data warehouses must rely on their proprietary SQL extensions.

Non-SQL Data Processing

Considerable investment is being made by third party vendors on advanced technology focused on making distributed, scalable processing of SQL (or SQL-like) queries fast and reliable with little or no human tuning required. As such, it is wise to pick a vendor demonstrating a clear strategy in this space, and continuing to invest in SQL as the lingua-franca of transformation logic.

However, for use cases for which SQL is not appropriate, distributed computing platforms like Spark are still needed. The expectation here is that such platforms will ingest data from a data lake, and output results into a data lake. In some cases, the distributed computing platform offers its own storage (e.g., HDFS), but increasingly it is more appropriate to question whether data needs to reside permanently in an HDFS cluster rather than in a data lake. For example, Amazon’s EMR service allows Hadoop clusters to be created ephemerally, and to consume their initial dataset from AWS S3 repositories or other data sources.

Enforcing Enterprise Data Collaboration and Governance

Note that all data warehouse solutions (virtual or not) must support some form of meta-data tagging and management used by their SQL query engines – otherwise they cannot act as a virtual database source (generally an ODBC end-point that applications can connect directly to). This tagging can be automated if sources include meta-data (e.g., field headers, Avro schema definitions, etc.), but can be enhanced by human tagging, which is increasingly augmented by machine-learning to help identify, for example, where data may be sensitive.

But data governance needs extend beyond the needs of the virtual data warehouse query engines, and this is where there are still gaps to be filled in the current enterprise data management tools.

Tools from vendors like Alation, Waterline, Informatica, Collibra, etc. were created to augment people’s ability to properly tag content in the data lake with meaningful information to make it discoverable and governable. Consistent tagging in principle allows tag-based governance rules to be defined to automatically enforce data governance policies in data consumers. This data, coupled with schema information which can be derived directly from data sources, is all the information needed to allow users (or developers) to source the data they need in a secure, compliant way.
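A tag-based governance rule can be as simple as a lookup from tag to required entitlement. The sketch below is a minimal illustration under that assumption. The dataset names, tags and entitlement names are all hypothetical and do not reflect how any of the vendors above model policies.

```python
# Hypothetical tags, as a catalogue tool or a human steward might assign them.
dataset_tags = {
    "clients.csv": {"pii", "emea"},
    "trades.avro": {"market-data"},
}

# Hypothetical policy: a governed tag maps to the entitlement a consumer must hold.
required_entitlement = {
    "pii": "pii-reader",
    "market-data": "market-data-reader",
}

def can_read(dataset, entitlements):
    # Access is allowed only if every governed tag on the dataset is covered
    # by an entitlement the user holds. Tags without a rule are unrestricted.
    needed = {
        required_entitlement[tag]
        for tag in dataset_tags[dataset]
        if tag in required_entitlement
    }
    return needed <= set(entitlements)

analyst = {"market-data-reader"}
analyst_can_read_trades = can_read("trades.avro", analyst)
analyst_can_read_clients = can_read("clients.csv", analyst)
```

Because the rule is written against tags rather than against individual datasets, a newly tagged dataset is governed as soon as it is tagged. That only holds if tagging is consistent, which is the condition the paragraph above depends on.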

But meta-data for data governance has humans as the primary user (e.g. CDOs, business/data analysts, process owners, etc) – or, as Alation describes it – meta-data for human collaboration.

Currently, there are no accepted standards for ensuring the consistency of ‘meta-data for human collaboration’ with ‘meta-data for query execution’.

Ideally, the human-oriented tools would generate standard events that tools in the data pipeline could pick up and act on (via, for example, something like AWS EventBridge), thereby avoiding the need for data governance personnel to oversee multiple data pipelines directly…
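As a sketch of what such an integration might look like, the event below borrows the source / detail-type / detail envelope that EventBridge events use. Everything else is an assumption made for illustration: the source name, the event type, the detail fields and the handler are hypothetical, since no such standard governance event exists today.

```python
# A hypothetical governance event in an EventBridge-style envelope.
classification_changed = {
    "source": "example.governance-catalog",
    "detail-type": "Dataset Classification Changed",
    "detail": {"dataset": "clients.csv", "column": "email", "classification": "sensitive"},
}

# Pipeline-side state: the columns each dataset's pipeline must mask.
masked_columns = {}

def handle_governance_event(event):
    """Apply a governance decision to pipeline configuration."""
    if event["detail-type"] != "Dataset Classification Changed":
        return False  # not an event this pipeline cares about
    detail = event["detail"]
    columns = masked_columns.setdefault(detail["dataset"], set())
    if detail["classification"] == "sensitive":
        columns.add(detail["column"])
    else:
        columns.discard(detail["column"])
    return True

handled = handle_governance_event(classification_changed)
```

The steward records the decision once in the governance tool, and every subscribed pipeline updates its own masking configuration without anyone touching the pipelines directly.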

Summary

With the advent of cloud-based managed compute and data storage services, a multi-data warehouse and pipeline strategy is viable and may even be desirable, potentially involving multiple data lakes.

Solutions like PartiQL have the potential to eliminate many transformation job phases and greatly simplify data pipeline complexity in a standardized way, leveraging existing SQL skills rather than requiring new skills.

To ensure consistent governance across multiple data pipelines, a serverless event-based approach to connecting human data governance solutions with cloud-native data pipeline solutions may be the way forward – for example, using AWS EventBridge to action events originating from SaaS-based data governance services with data pipelines.


Why AWS EventBridge changes everything..

“Events, dear boy, events”

Harold Macmillan

[tl;dr AWS EventBridge may encourage SaaS businesses to formally define and manage public event models that other businesses can design into their workflows. In turn, this may enable businesses to achieve agility goals by decomposing their organizations into smaller, event-driven “cells” with workflows empowered by multiple SaaS capabilities.]

Last week, Amazon formally announced the launch of the AWS EventBridge service. What makes this announcement so special?

The biggest single technical benefit is the avoidance of the need for webhooks or polling APIs. (See here for a good explanation of the difference.)

Webhooks are generally not considered a scalable solution for SaaS services, as significant engineering is required to make them robust, and consuming applications need to be designed to handle web-hook API calls.

HTTP-based APIs exposed by 3rd party services can be polled by applications that need to know if state has changed, but this polling consumes resources even when nothing changes. Again, this has scalability issues for both the SaaS provider and the application consumer.

In both cases, the principal metaphor connecting the SaaS and the consuming application is the ‘service interface’ abstraction – i.e., executing an operation on a resource. As such, this is a technical solution to a technical problem.

From APIs to Events

While this ‘service-based’ model of distributed programming is extremely powerful, it is not an appropriate abstraction for connecting behaviors across multiple services in a value chain. To align with business-level concepts such as Business Process Modelling, event-driven architectures are becoming more and more popular to model complex workflows both within and between organizations.

This trend is accelerated by the desire of organizations to become more “agile”. Increasingly organizations are recognizing this must manifest itself as breaking down the organization into more manageable, semi-autonomous “cells” (see this article from McKinsey as an example). With cells, the event metaphor fits naturally: cells can decide which events they care about, and also decide what events they in turn create that other cells may use.

3rd party service providers (i.e., SaaS companies such as Salesforce, Workday, Office 365, ServiceNow, Datadog, etc.) empower organizational cells and enable them to achieve far more than a small cell otherwise could. The “cell” concept cannot be fully realized unless every cell has the ability to define and control how it uses these services to achieve its own mission.

In addition, as value/supply chains become more complex, and more (3rd party or internal) providers are embedded in those workflows, the need for a more natural, adaptable way of integrating processes has become evident.

But event-driven architectures require a common ‘bus’ – a target-neutral means to allow zero or more consumers to express an interest in receiving events published on the bus. This historically has been impractical to do at scale between organizations (or even within organizations) without requiring all parties to agree on a neutral 3rd party to manage the bus, and at the additional risk of creating a change bottleneck: hence the historic preference for point-to-point HTTP-based standards.

Services like the AWS EventBridge for the first time allow autonomous SaaS solutions to publish a formal event model that can be consumed programmatically and seamlessly included in local (cell-specific) workflows. In addition, this event model can be neutral to the underlying technology and cloud provider.

How it works and what makes it different

The key feature of the EventBridge is the separation of the publisher from the consumer, and the way that business rules to manage the routing and transformation of events are handled.

Once an organization (or AWS account) has registered as a consumer with the publisher (the owner of the “event source”), a logical “event bus” is created to represent all events for that org/account. The consuming org/account can then set up whatever routing and transformation rules it needs for any internal consumers of those events, without any further dependency on the publishing organization. So consumer organizations/accounts have full control over what is published internally to consuming applications.
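Routing rules of this kind are expressed as event patterns: a JSON-like structure in which each field lists the values it will accept. The sketch below is a deliberately simplified matcher that handles exact values and nested fields only, which is far less than the real service supports. The rule names, the example.crm source and the event contents are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of acceptable values.
            return False
    return True

# Hypothetical rules, each owned by a different team ("cell").
rules = {
    "billing-team": {"source": ["example.crm"], "detail-type": ["Invoice Paid"]},
    "support-team": {"source": ["example.crm"], "detail": {"priority": ["high", "urgent"]}},
}

def route(event):
    # Deliver the event to every target whose pattern it satisfies.
    return sorted(target for target, pattern in rules.items() if matches(pattern, event))

event = {
    "source": "example.crm",
    "detail-type": "Ticket Opened",
    "detail": {"priority": "urgent", "customer": "Acme"},
}
targets = route(event)
```

Each team owns only its own entry in rules, and the publisher never needs to know which teams are listening or what they filter on.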

With appropriate guard-rails in place, individual teams (“cells”) can define and configure their own routing rules, and not rely on any centralized team – a key weakness in many legacy ESB solutions.

Note that the EventBridge has predefined service limits – it has a reasonably high throughput (400 events/sec), but is a high latency service (0.5 sec). So low-latency use cases such as electronic trading are not, at this point, an appropriate use case for EventBridge.

The use of EventBridge for internal enterprise event handling should be considered carefully: the limit of 100 event buses per account essentially caps the number of publishers that can be handled by any one account at 100. For most use cases, this should be more than enough, but many large organizations may have many more than 100 ‘publishers’ publishing on their ESB. If each publisher can be viewed as a part of an end-to-end business value-stream, then any value-stream with more than 100 components (i.e., unique event models) is likely to be overly complex. In practice, a ‘publisher’ is likely to be an enterprise application: therefore some significant complexity reduction and consolidation (of event models, if not actual code) would be needed to ensure such organizations can use EventBridge internally effectively.
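A quick back-of-the-envelope check against these figures can help decide whether a workload is a plausible fit. The sketch below uses only the limits quoted above (which may differ from, or lag behind, the service's actual quotas) and assumes one event bus per publisher. The two example workloads are hypothetical.

```python
# Figures quoted in the text above; actual service quotas may differ or change.
MAX_EVENTS_PER_SEC = 400
TYPICAL_LATENCY_SEC = 0.5
MAX_EVENT_BUSES = 100

def breached_limits(peak_events_per_sec, latency_budget_sec, publishers):
    """Return the limits a workload would breach (an empty list if none)."""
    breaches = []
    if peak_events_per_sec > MAX_EVENTS_PER_SEC:
        breaches.append("throughput")
    if latency_budget_sec < TYPICAL_LATENCY_SEC:
        breaches.append("latency")
    if publishers > MAX_EVENT_BUSES:
        breaches.append("event buses")
    return breaches

# Sustained throughput at the quoted rate over a full day.
events_per_day = MAX_EVENTS_PER_SEC * 24 * 60 * 60

# Hypothetical workloads: an HR workflow and an electronic trading feed.
hr_workflow = breached_limits(peak_events_per_sec=50, latency_budget_sec=5.0, publishers=12)
trading = breached_limits(peak_events_per_sec=10_000, latency_budget_sec=0.001, publishers=3)
```

At the quoted rate, a single account could carry roughly 34.5 million events a day, which is ample for business workflow events but not for market data.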

The AWS Way (also, the Cloud Way)

It’s worth noting that key to Amazon’s success is its ability to “eat its own dogfood”. Every service in Amazon and AWS is built atop other services. No service is allowed to get so big and bloated it cannot be managed effectively. Abstractions are ‘clean’ – rather than add bells and whistles to an existing service, a new service is created which leverages the underlying service or services.

AWS has consistently required every service to have and maintain an API model, which – for asynchronous/autonomous services – leads naturally to an event model. This in turn has made it natural for AWS EventBridge to come out-of-the-box with a number of events already emitted by AWS services that can be leveraged for customer solutions. (For now, many of these events are limited to generic CloudTrail-related events – specifically tracking API calls – but in the future it’s reasonable to expect more service-specific events to be made available.)

AWS does have one key advantage over other major cloud providers such as Google GCP and Microsoft Azure: it set out to build a business (an online marketplace) using these services. So its strategy was (and is) driven by its vision for how to build a globally scalable online business – not by the need to provide technology services to businesses. To this extent, it’s hard to see Google and Microsoft being anything other than followers of AWS’s lead.

A Prediction..

Businesses which also follow the Amazon-inspired growth/innovation and organization model will likely have a better chance of succeeding in the digital age. And it is for these businesses that EventBridge will have the most impact – far beyond the technological improvements afforded by the use of events vs webhooks/APIs.

Consequently, as more SaaS companies are on-boarded onto the AWS EventBridge eco-system, we can expect more event models to be published. Tools for managing and evolving event models will evolve and improve so they become more accessible and useful for non-traditional IT folks (i.e., process and workflow designers) – currently, the only way to see event model definitions seems to be by actually creating business rules.

This increased focus on SaaS integrations may (perhaps) inspire firms to re-organize their internal capabilities along similar lines, as internal service providers, empowering cells across the organization and with a published and accessible software-driven event model – noting that while events may be published and received digitally, they can still be actioned by humans for non-digital processes (e.g., complex pricing decision making, responding to help desk requests, etc).

The roster of SaaS firms signing up to EventBridge over the coming months will hopefully bear out this prediction. A good sense of what services could be onboarded can be had by looking at all the SaaS (and IoT) services integrated by IFTTT.

In the meantime, it is time to explore the re-imagined integration opportunities afforded by AWS EventBridge..


The Meaning of Data

[tl;dr The Semantic Web may be a CDO’s best friend in their efforts to help the business realise the full value of enterprise data.]

Large organisations with complex legacy infrastructure are faced with a dilemma: the technology that allowed them to grow and capture market share has reached a turning point in terms of the value returned from additional spending. The only practical investments that can be made are to shore up (protect) those investments by performing essential maintenance, focusing on security, and (perhaps) moving to lower-cost, cloud-based infrastructure. Mandatory (usually regulatory) enhancements also need to be done.

But in terms of protecting existing revenue or capturing new markets, legacy applications are often seen as a burden. Businesses are increasingly looking outside their core IT capability to address their needs – especially in technologies that can help pinpoint opportunities in the first place (i.e., analysing and processing the digital foot-prints left by customers and prospective customers, whether those foot-prints are internal to the firm or external).

Technology is both the problem and the solution. And herein lies the dilemma: internal IT organisations cannot possibly advise on all possible technology-enabled solutions for a business. Rather, businesses need to become more “tech-savvy” – a term which is bandied about a lot these days.

What does “tech-savvy” actually mean, and where is the line drawn between a “tech-savvy” business person, and the professional IT service provider? And what form should this ‘line’ take?

Arguably, most businesses have always been tech-savvy: they knew by-and-large where technology was necessary to grow their business. So for the most successful firms, there is a *lot* of technology. Does that mean those businesses are tech-savvy?

Yes, if those firms are able to manage the complexity of their IT – in other words, to be able to adapt their IT to shifting market needs, and to incorporate/absorb innovations into their architecture without significantly reducing that agility. In practice, few firms have such mastery of their IT (and, by implication, their processes and business information systems).

So a new form of “tech savvy” is needed, that allows business folks to leverage opportunities to both find customers and meet their needs (profitably), while preventing a build-up of unsustainable complexity of processes and systems.

In essence, tech-savvy business folk need to get better at understanding data. And IT needs to get much better at making meaningful business data available to their business – irrespective of where that business data may actually reside.

What does this mean in practice? It’s all about semantics – something previously in the domain of data geeks. But major initiatives like the semantic web (led by Tim Berners-Lee, of WWW fame) are making semantics useful to traditionally non-technical people.

The tools available to business folks for finding out what information (i.e., data + context) is available are very poor, and even when the required data is found (i.e., it is known where it is) it can be difficult to extract it and use it in a productive way.

A good example of this may be observed in the proliferation of Excel spreadsheets and Access databases in organisations that support critical business functions. These tools are necessary to support those areas as the data they need from various systems is not where they need it when they need it – and that is often due to business needs changing faster than IT can capture, absorb and deliver requirements.

Over time (at least in principle), all meaningful business data will (must) end up being processed exclusively by controlled enterprise systems, but this doesn’t mean waiting until IT has made the necessary changes in order to effectively govern that data and make maximum use of it.

A key principle behind the semantic web is that data can be anywhere. In an enterprise scenario, that means the data could exist in:

  • an internal enterprise system
  • a file system (e.g., Excel sheet or Access database)
  • a website (file download, or website screen scraping)
  • a commercial information provider (Reuters, Bloomberg, etc)
  • a business partner/supplier
  • etc

To take advantage of this, meta-data (i.e., data about data) needs to be made available to businesses, and it needs to be business-relevant and agnostic to internal IT systems and processes.

More than this, the tools needed to discover useful information, as well as to retrieve and process that information need to be available to tech-savvy business folk. The right set of tools, which are suitably agnostic to specific architectures and systems, can allow businesses to explore new opportunities and business models, while allowing the IT systems and platforms to catch up and evolve over time – noting that there is no presumption that the eventual source of useful business data originates from, or is stewarded by, internal systems managed by IT.

Such tools also need to enforce compliance to ensure only the right people get access to the right data, and (if necessary) limit the ability for people to pull data outside of the retrieval platform. In essence, they provide a controlled sandbox in which businesses can distill value from information, whatever its source.

Many organisations may approach this by implementing a ‘data lake’. This may (perhaps) be a workable solution for all data assets managed by the IT organisation. But it is not feasible for many sets of data which are not managed by the IT organisation, but which still have business utility.

Emerging standards and technologies are evolving to meet this need for a data-centric view of information systems – in particular, RDF, OWL and related ‘ontology’ languages, as well as standards like SPARQL which enable the discovery and retrieval of data originating from multiple sources. But tools are still very primitive and generally require too much technology savviness. However, efforts like, for example, the Callimachus Project give a hint at the potential of data-driven applications.

At some point, it will not be unreasonable to ask an IT organisation to expose all of its data assets semantically via SPARQL end-points (with appropriate access controls), and to provide tools to businesses to allow them to explore that data and (where permissible) incorporate them into spreadsheets, models and other tools in ways that allow the business to realise value from that data without requiring IT change requests. Developing the capabilities to understand semantic data, and use it to commercial advantage, will take time but it will be a worthwhile investment (and arguably before folks start spending money on ‘big data’ projects that have no wider context).
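To give a flavour of what a semantic query over data assets looks like, the comment in the sketch below shows a small SPARQL query, and the Python beneath it emulates the matching of its two triple patterns over an in-memory set of triples. The ex: vocabulary, the datasets and the owners are all hypothetical. A real endpoint would be queried with a SPARQL client rather than a hand-written matcher like this one.

```python
# A tiny set of (subject, predicate, object) triples. All names are hypothetical.
triples = {
    ("ex:crm_accounts", "ex:classifiedAs", "ex:ClientData"),
    ("ex:crm_accounts", "ex:ownedBy", "ex:SalesOps"),
    ("ex:pricing_sheet", "ex:classifiedAs", "ex:ClientData"),
    ("ex:pricing_sheet", "ex:ownedBy", "ex:DeskA"),
    ("ex:fx_rates", "ex:classifiedAs", "ex:MarketData"),
    ("ex:fx_rates", "ex:ownedBy", "ex:Treasury"),
}

# SPARQL query being emulated:
#   PREFIX ex: <http://example.org/>
#   SELECT ?dataset ?owner
#   WHERE {
#     ?dataset ex:classifiedAs ex:ClientData .
#     ?dataset ex:ownedBy ?owner .
#   }
def match(pattern, binding):
    """Yield extensions of `binding` under which `pattern` matches a triple."""
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it agrees with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate

def select(patterns):
    # Join the patterns one at a time, carrying variable bindings forward.
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings

rows = select([
    ("?dataset", "ex:classifiedAs", "ex:ClientData"),
    ("?dataset", "ex:ownedBy", "?owner"),
])
owners = sorted((r["?dataset"], r["?owner"]) for r in rows)
```

Note that ex:pricing_sheet could just as easily describe a spreadsheet on a shared drive as a table in an enterprise system. The query does not care where the data lives, only how it is described.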

In fact, I would go so far as to say that any provider (startup or established) delivering technology services to a business should provide SPARQL endpoints as a matter of course. Making data useful and available to the business will help the business realise the value of data and get more involved in how it is captured, processed and stored in the future – and make it easier to incorporate 3rd party solution providers into business operations.

In a nutshell, the Semantic Web may be the CDO’s best friend in their efforts to help the business realise the full value of enterprise data.


The 3 pillars of a ‘digital’ strategy

The term ‘digital’ is bandied around quite a lot, so it is useful to be quite precise about what is meant by it.

I believe a ‘digital’ strategy for a business must formally address all of the following 3 pillars if it is to be considered a true ‘digital’ strategy and to achieve the goals of the business:

  • Customer- or client-centricity
  • Data is an asset
  • Achieve and maintain business agility

Customer- or client-centricity – this is about understanding the client’s needs, and in effect providing the whole capability of the organisation to meet the client’s needs effectively and efficiently. The client is assumed to be the primary source of revenue, and the balance between meeting the firm’s wants (i.e., to get clients and to make as much money as possible from them) and the client’s needs (i.e., to get the service they need) will very much shape any client-centricity programme.

Many ‘digital’ efforts tend to focus exclusively on this space, as this involves big data, social, mobile, etc, etc. It is about building (mobile or traditional desktop) applications to meet client needs, it is about providing clients a richer user experience, it is about clients being able to interact with the firm through a single portal tuned for their needs, and not the firm’s own wants. It is about joining processes and functions which historically have acted as islands.

Client-centricity efforts are usually led by the CEO and/or the COO.

Data is an asset

‘Data is an asset’ implies you ‘know what you got’ in terms of knowing what data the firm has and where it is.

This is becoming a bigger priority for more and more firms, as they realise that key strategic goals such as knowing your client’s needs, meeting demanding regulatory obligations or improving operational excellence are all but impossible to achieve without some stewardship of the vast amounts of data that the organisation captures every day.

Stewardship means putting in place principles, practices and tools to allow data to be discovered and accessed when and where it is needed, and to avoid the creation of redundant or duplicative processes or systems.

Data stewardship needs to be led at the level of COO at least, and potentially CEO.

Achieve and maintain business agility

Business agility is the ability to rapidly and sustainably respond to change. Many businesses have been agile in the past, in the sense that they see market opportunities and do the work needed to take advantage of them, usually within a 1 year timeframe and often much less.

However, as the complexity in an organisation grows, the ability to respond rapidly to change decreases, until eventually sclerotic processes and systems force a massive re-investment in technology with the corresponding high cost and high risk.

So maintaining business agility is a complexity management activity: first to control the complexity and then to manage it.

As with the others, business agility needs to be led at the level of COO at least, and potentially CEO. It should not be the responsibility of the CIO on their own.

Summary

At a minimum, all three of the above threads need to be in place to achieve a successful digital transformation effort. The need for the first two threads is generally well understood at the board and ‘CXO’ level of firms.

But here’s the question: what are they doing about business agility and complexity management? The answer is buried in the murky world of ‘enterprise architecture’, a discipline which has never quite settled down into a steady-state agreement of what it is or what it should focus on, or who should be accountable for doing it.

In a digital context Enterprise Architecture should be first and foremost about complexity management in the context of business agility – especially when viewed in light of the other two executive areas of focus. (This also applies in other non-digital business contexts, such as Mergers & Acquisitions, or Regulatory Compliance.)

Business agility requires agility in its technology. The way to achieve and maintain agility is through managing complexity *continuously* – the complexity of business processes, and complexity of the supporting IT.

A proven way to address complexity is to partition or modularise activities, from the enterprise all the way down, and to establish principles for identifying partitions and for governing change within and across partitions.

This requires new practices and skills which are specifically focused on these principles. More mature IT organisations may already have these skills, but the practice is a business-technology endeavour: in the end, in the same way the CEO demands focus on customers and data, they must also demand business agility and hold their business and IT to account for achieving it.

An excellent book on this topic is Roger Sessions’ ‘Simple Architectures for Complex Enterprises’, some of whose ideas will feature strongly in future posts. A related enabler is the ‘Scaled Agile Framework (SAFe)’, which drives business agility goals into the execution of projects and programs.


Understanding Big Data

Big Data is a Big Topic. So I’m trying to get my head around a few basic concepts. 

My interest stems from the following areas:

  • Managing data complexity – most organisations have more data than they know what to do with. In particular, data semantics is a big problem, as is the ability to find and access the data that is needed.
  • Machine learning – the ability to infer meaning from data and use this in highly automated processes on a real-time or near-real-time basis – e.g., Amazon/Netflix recommendation engines. Any process which needs a human ‘4-eyes’ check is a good candidate for this. This article from IBM is a good synopsis of open source technologies that can aid machine learning. Note – this is distinct from business intelligence, which (today) assumes that the report produced by the system is the end product; i.e., business intelligence is not in and of itself intended to be part of an automated process. But the lines between business intelligence and machine learning can be expected to blur.
  • Data Discovery – for many organisations, finding the data you need is a big challenge. Graph databases, triple-stores and open standards like RDF offer a way to make data useful to, and accessible to, non-architects. In large corporate environments, for example, these technologies can enable the creation of a useful who’s-who of experts in different technologies, recognising that the universe of technologies is constantly changing, and many technologies are closely related to each other or tend to be used together. Data discovery initiatives like Datahub and Linked Data are worth watching, as are the W3C efforts around the Semantic Web.
  • Modularity and Data Persistence – the relationship between data and services is historically a challenging one, with the natural tendency to have business logic as close to the data as possible (e.g., stored procedures). The sheer number of alternative data storage and retrieval options means that it is even more important to separate the implementation of modules from their APIs: by all means (if you must) mix data and logic in the implementation, but do not expose the data in any way except via the module API, or you will lose control of the data. This means more and more data should be exposed via services, and business logic should access the data via these services only. In principle, this allows business functionality to be exposed as modules, and data services to support multiple modules without compromising principles of modularity. It also allows a degree of flexibility over which of the many persistence solutions should be used for a given problem.
  • Containers – many database technologies today can be deployed into a self-contained environment, as they expose their interfaces through open APIs (such as REST). So they can be isolated from the technology and architecture of the rest of your platform (in much the same way your Oracle database can be on Unix while your clients run Windows). Technologies like Docker and Mesos enable distributed databases to be built on commodity technology, enabling capacity and resilience to be added horizontally (by adding more commodity nodes) rather than vertically (more big iron). The relationship between these technologies and modular, service-oriented architectures is still rather immature. However, the trend is evident and has significant implications for architectural design.
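
To make the modularity and data persistence point above a little more concrete, here is a minimal Python sketch of data being reachable only through a module API, with the persistence choice hidden behind it. All of the names here (CustomerStore, InMemoryStore, SqliteStore, CustomerService) are hypothetical and exist purely for illustration; this is one possible shape under those assumptions, not a recommended design.

```python
import sqlite3
from typing import Optional, Protocol


class CustomerStore(Protocol):
    """Persistence interface (hypothetical, for illustration only)."""

    def put(self, customer_id: str, name: str) -> None: ...

    def get(self, customer_id: str) -> Optional[str]: ...


class InMemoryStore:
    """Keeps rows in a plain dict; handy for tests."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, customer_id: str, name: str) -> None:
        self._rows[customer_id] = name

    def get(self, customer_id: str) -> Optional[str]:
        return self._rows.get(customer_id)


class SqliteStore:
    """Keeps rows in SQLite; defaults to a throwaway in-memory database."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS customer (id TEXT PRIMARY KEY, name TEXT)"
        )

    def put(self, customer_id: str, name: str) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)",
            (customer_id, name),
        )
        self._db.commit()

    def get(self, customer_id: str) -> Optional[str]:
        row = self._db.execute(
            "SELECT name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return row[0] if row else None


class CustomerService:
    """The module API: the only route other modules have to customer data."""

    def __init__(self, store: CustomerStore) -> None:
        self._store = store

    def register(self, customer_id: str, name: str) -> None:
        # Business rules live here, not in the callers and not in the store.
        if not name.strip():
            raise ValueError("name must not be empty")
        self._store.put(customer_id, name.strip())

    def display_name(self, customer_id: str) -> str:
        name = self._store.get(customer_id)
        return name if name is not None else "(unknown customer)"
```

Callers only ever see CustomerService, so swapping InMemoryStore for SqliteStore (or any other persistence option) should not change their behaviour. That is the flexibility over persistence solutions described in the bullet.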

I don’t pretend to fully understand the nature or implications of all of the above. The real world will decide what is useful and what is not. But there are a number of trends here that are key:

  • Increasing focus on data semantics and data discovery
  • Massive innovation in database technologies – no one-size-fits-all solution
  • Technologies to support infrastructure management are advancing in lock-step with the advances in database technologies
  • Technologies to be able to do something useful with all this data on a (near) real-time basis are also improving dramatically.

All of the above is mainly concerned with data-at-rest. How data gets from where it is now to where it is needed, without resorting to building point-to-point interfaces, is a whole different subject.

 

 
