The changing role of data lakes

[tl;dr A single data lake, data warehouse or data pipeline to “rule them all” is less useful in hybrid cloud environments, where it can be feasible to query ‘serverless’ cloud-native data sources directly rather than rely on traditional orchestrated batch extracts. Pipeline complexity can be reduced by open extensions to SQL such as the recently announced AWS PartiQL language. Opportunities exist to integrate enterprise human-oriented data governance and meta-data platforms with data pipelines using serverless technologies.]

The need for Data Lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. The data lake concept was created to address a number of issues with traditional data analytics and reporting solutions, specifically:

  • the growing number of applications across an enterprise depending on a given dataset;
  • business and regulatory drivers for governing dataset discovery, quality, creation and/or consumption;
  • the increasing difficulty IT teams have in responding in a timely manner to growing business demand for access to high quality datasets.

The data lake allows data to be made available from its source without making any assumptions about its use. This is particularly critical when the data originates from batch extracts of load-sensitive OLTP databases, most of which are still operating on-premise. Streaming data pipelines, while growing in popularity, are not as common as batch-driven pipelines – although this should change over time as more digital platform architectures become more event-driven in nature.

Data lakes are a key component in data pipelines, a construct (or set of constructs) that provides consolidation of data from multiple sources and makes it available for use. A data pipeline can be orchestrated (via a scheduler) or choreographed (responding to events) – the more jobs a pipeline has to do, the more complex the orchestration or choreography, which has implications for supportability. So reducing the number of jobs a pipeline has to support is key to managing data pipeline complexity.
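The difference between orchestration and choreography can be sketched in a few lines of Python. This is only an illustration: the job names (extract, load, report) and the event names are hypothetical, and a real pipeline would use a scheduler or a message bus rather than in-process calls.

```python
from typing import Callable, Dict, List

log: List[str] = []

# Hypothetical pipeline jobs; each one simply records that it ran.
def extract() -> None:
    log.append("extract")

def load() -> None:
    log.append("load")

def report() -> None:
    log.append("report")

def run_orchestrated(jobs: List[Callable[[], None]]) -> None:
    """Orchestration: a central scheduler owns the ordering of every job."""
    for job in jobs:
        job()

class EventBus:
    """Choreography: jobs subscribe to events and react; no central plan exists."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[], None]]] = {}

    def subscribe(self, event: str, handler: Callable[[], None]) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def publish(self, event: str) -> None:
        for handler in self._handlers.get(event, []):
            handler()

def build_choreographed_bus() -> EventBus:
    bus = EventBus()

    # Each job announces its own completion; downstream jobs listen for it.
    def on_file_arrived() -> None:
        extract()
        bus.publish("extracted")

    def on_extracted() -> None:
        load()
        bus.publish("loaded")

    bus.subscribe("file_arrived", on_file_arrived)
    bus.subscribe("extracted", on_extracted)
    bus.subscribe("loaded", report)
    return bus
```

Calling run_orchestrated([extract, load, report]) and build_choreographed_bus().publish("file_arrived") run the same three jobs in the same order. In the first case the ordering lives in one place; in the second it is spread across the subscriptions, which is why every additional job makes either style harder to support.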

The Components of a Data Lake

A data lake consists of a few key components:

Feature | Description | Virtual | Traditional
A storage repository | Durable, resilient storage of data objects. | No | Yes
An ingestion mechanism | A means to upload content to the repository (no transformation). | No | Yes
A tagging & metadata mechanism | A means to associate metadata with data objects, including user-defined tags. | Yes | Yes
A metadata search mechanism | A means to search objects in the data lake based on metadata and tags (not content). | Yes | Yes
A query engine | A means to search the content of objects in the data lake. | Yes | Partially
An access control mechanism | A means to ensure that users can only access datasets and parts of datasets that they are entitled to see, and to audit all activity. | Yes | Yes

In effect, data lakes have become a kind of data warehouse – the main difference being that input sources into data lakes tend to be familiar file formats (CSV, Avro, JSON, etc.) from multiple sources rather than highly optimized domain-specific schemas – i.e., no assumptions are made about how (or why) the data in the data lake will be consumed. Data lakes also do not concern themselves with scheduling or orchestration.

Data warehouses, data warehouses everywhere…

For mature data use cases (i.e., situations where relatively stable, well-known data requirements exist), and where consistent high performance is material to meeting customer needs, data warehouses are still the best solution. A data warehouse stores and manages all of its data locally, and only relies on the data lake as an initial ingestion point.

A data warehouse will transform datasets to the form needed for the specific use cases it supports, and will optimize performance for the consumption of those datasets. Modern data warehouses will use ML/AI techniques to optimize performance rather than relying on human database specialists. But, as this approach is compute intensive, such solutions are more amenable to cloud environments than on-premise environments. Snowflake is an example of this model. As more traditional data warehouses (e.g., Oracle Exadata) move to the cloud, we can expect these to also get ‘smarter’ – however, data gravity will mean such solutions will need to be fundamentally multi-cloud compatible.

For on-premise data warehouses, the tendency is for business lines or functions to create ‘one data warehouse to rule them all’ – mainly because of the traditionally significant storage and compute infrastructure and resources necessary to support data warehouses. Consequently considerable effort is spent on defining and maintaining high performance, appropriately normalized, enterprise data models that can be used in as many enterprise use cases as possible.

In a hybrid/cloud world, multiple data warehouses become more feasible – and in fact, will be inevitable in larger organizations. As more enterprise data becomes available in these dynamically scalable, cloud-based (or HDFS/Hadoop based) data warehouses (such as AWS EMR, AWS Redshift, Snowflake, Google BigQuery, Azure SQL Data Warehouse), ‘virtual data warehouses’ avoid the need to move data from its source for query handling, allowing data storage and egress costs to be kept to a minimum, especially if assisted by machine-learning techniques.

Virtual Data Warehouses

Virtual data warehouse technologies have been around for a while, allowing users to manage and query multiple data sources through a common logical access point. For on-premise solutions, virtual data warehouses have limited use cases, as the cost and effort of scaling out in-house solutions can be prohibitive and not particularly agile in nature, precluding experimental use cases.

In hybrid or cloud environments, virtual data warehouses can leverage the scalability of cloud-native data warehouses, driving queries to the relevant engine for execution, and then leveraging their own scalable infrastructure for executing join queries.

Technologies like Dremio reflect the state of the art in cloud-based data warehouses, which push down queries to the source system where possible, but can process them in-memory directly from a data lake or other source if not.
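To make the push-down idea concrete, here is a minimal sketch in Python. It is not how Dremio or any other product is implemented: two in-memory SQLite databases stand in for separate source engines, and the table names, columns and data are all hypothetical. The filter is executed inside the source engine, and only the cross-source join runs in the "virtual" layer.

```python
import sqlite3
from typing import Dict, List, Tuple

def make_source(ddl: str, insert_sql: str, rows: List[tuple]) -> sqlite3.Connection:
    """Create an in-memory SQLite database standing in for one remote engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Two independent "warehouses" (hypothetical tables and data).
orders_db = make_source(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 250.0), (2, 11, 40.0), (3, 10, 900.0)],
)
customers_db = make_source(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(10, "Acme"), (11, "Globex")],
)

def large_orders_with_names(min_amount: float) -> List[Tuple[int, str, float]]:
    # Push-down: the filter runs inside the source engine,
    # so only matching rows ever leave it.
    orders = orders_db.execute(
        "SELECT order_id, customer_id, amount FROM orders "
        "WHERE amount >= ? ORDER BY order_id",
        (min_amount,),
    ).fetchall()
    # Ask the second source only for the customers the surviving orders refer to.
    ids = sorted({customer_id for _, customer_id, _ in orders})
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    names: Dict[int, str] = dict(
        customers_db.execute(
            "SELECT customer_id, name FROM customers "
            f"WHERE customer_id IN ({placeholders})",
            ids,
        ).fetchall()
    )
    # The cross-source join happens in the virtual layer's own memory.
    return [(order_id, names[cid], amount) for order_id, cid, amount in orders]
```

For example, large_orders_with_names(100.0) moves two order rows and one customer row between systems rather than both full tables, which is where the savings in egress cost would come from.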

However, there is one thing that all data warehouses have in common: they leverage SQL and (implicitly) a relational view of the data. Standard ANSI SQL queries are generally supported by all data warehouses, but relying on standard SQL may mean that some data cannot be queried if it is not in a tabular form amenable to SQL processing.

Extending SQL with PartiQL

Enter PartiQL, an open-source project sponsored by Amazon to drive extensions to standard SQL that can cope with non-relational data types, including structured, unstructured, nested, and schemaless (NoSQL, document) data.

Historically, all data ingested into a data lake had to be transformed into a format that could be queried by SQL-like commands or processed by typical data warehouse bulk-upload tools. This adds complexity to data pipelines (i.e., more jobs), and may also force premature schema design (i.e., forcing the design of an optimal schema before all critical use cases are fully understood).

PartiQL potentially allows tools such as Snowflake and Dremio (as well as the tools AWS uses internally) to query data using SQL-like syntax, but to also include non-relational data in those queries. This lets them avoid separate transformation steps, which helps reduce pipeline complexity.
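As a rough illustration of what querying nested data without a prior flattening step looks like, the Python sketch below emulates a PartiQL-style query over document-shaped records. The data, the field names and the function name are hypothetical, and the query in the comment is illustrative syntax rather than something taken from a specific product.

```python
from typing import Any, Dict, List

# Nested, document-style records of the kind a data lake might hold (hypothetical data).
employees: List[Dict[str, Any]] = [
    {"name": "Bob", "projects": [{"name": "AWS Redshift security"}, {"name": "Lambda tuning"}]},
    {"name": "Susan", "projects": []},
    {"name": "Jane", "projects": [{"name": "S3 security audit"}]},
]

# A PartiQL-style query over this data might read (illustrative syntax):
#   SELECT e.name AS employeeName, p.name AS projectName
#   FROM employees AS e, e.projects AS p
#   WHERE p.name LIKE '%security%'

def security_projects(records: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Emulate the query above: unnest each employee's projects, then filter."""
    return [
        {"employeeName": e["name"], "projectName": p["name"]}
        for e in records
        for p in e["projects"]
        if "security" in p["name"]
    ]
```

Without an extension of this kind, the nested projects list would first have to be flattened into its own table by a separate pipeline job before a standard SQL join could express the same question.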

PartiQL claims to be fully ANSI-compliant, but extended in specific ways to support alternate data formats. While not an official ISO/ANSI standard, it may have the ability to become a de-facto standard – especially as the language has already been used in anger with success within AWS. This will provide a skill path for relational data warehouse experts to become proficient in leveraging modern data pipelines without committing to one specific vendor’s technology.

Technologies like PartiQL will make it much easier to include event-sourced streams into a data pipeline, as events are defined as nested or other non-relational structures. As more data pipelines become event driven rather than batch-driven, having a standard like PartiQL will be key. (It will be interesting to see if Confluent’s KSQL and PartiQL will converge to a single event-stream query standard.)

As PartiQL has only just been released, it’s too soon to tell how the big data ecosystem or ISO/ANSI will respond. Expect more on this topic in the future. For now, virtual data warehouses must rely on their proprietary SQL extensions.

Non-SQL Data Processing

Considerable investment is being made by third party vendors on advanced technology focused on making distributed, scalable processing of SQL (or SQL-like) queries fast and reliable with little or no human tuning required. As such, it is wise to pick a vendor demonstrating a clear strategy in this space, and continuing to invest in SQL as the lingua-franca of transformation logic.

However, for use cases for which SQL is not appropriate, distributed computing platforms like Spark are still needed. The expectation here is that such platforms will ingest data from a data lake, and output results into a data lake. In some cases, the distributed computing platform offers its own storage (e.g., HDFS), but increasingly it is more appropriate to question whether data needs to reside permanently in an HDFS cluster rather than in a data lake. For example, Amazon’s EMR service allows Hadoop clusters to be created ephemerally, and to consume their initial dataset from AWS S3 repositories or other data sources.

Enforcing Enterprise Data Collaboration and Governance

Note that all data warehouse solutions (virtual or not) must support some form of meta-data tagging and management used by their SQL query engines – otherwise they cannot act as a virtual database source (generally an ODBC end-point that applications can connect directly to). This tagging can be automated if sources include meta-data (e.g., field headers, Avro schema definitions, etc.), but can be enhanced by human tagging, which is increasingly augmented by machine-learning to help identify, for example, where data may be sensitive.
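A minimal sketch of this two-stage tagging, assuming a CSV source whose header row is the only meta-data available. The list of name hints and the tag names are hypothetical; a real platform would use far richer rules or a trained model to spot sensitive fields.

```python
import csv
import io
from typing import Dict, List

# Hypothetical naming hints used to flag possibly sensitive fields.
SENSITIVE_HINTS = ("email", "ssn", "dob", "phone")

def tags_from_csv_header(csv_text: str) -> Dict[str, List[str]]:
    """Derive per-field tags from the header row alone, without reading the data."""
    header = next(csv.reader(io.StringIO(csv_text)))
    tags: Dict[str, List[str]] = {}
    for field in header:
        name = field.strip().lower()
        field_tags = ["auto"]
        if any(hint in name for hint in SENSITIVE_HINTS):
            field_tags.append("sensitive")
        tags[name] = field_tags
    return tags

def apply_human_tags(
    tags: Dict[str, List[str]], overrides: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Merge tags added by a data steward on top of the automated ones."""
    merged = {field: list(values) for field, values in tags.items()}
    for field, extra in overrides.items():
        existing = merged.setdefault(field, [])
        for tag in extra:
            if tag not in existing:
                existing.append(tag)
    return merged
```

Here tags_from_csv_header plays the part of the automated step and apply_human_tags the part of the human enhancement; the automated result is left untouched so that both views remain available.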

But data governance needs extend beyond the needs of the virtual data warehouse query engines, and this is where there are still gaps to be filled in the current enterprise data management tools.

Tools from vendors like Alation, Waterline, Informatica, Collibra, etc. were created to augment people’s ability to properly tag content in the data lake with meaningful information to make it discoverable and governable. Consistent tagging in principle allows tag-based governance rules to be defined to automatically enforce data governance policies in data consumers. These tags, coupled with schema information which can be derived directly from data sources, are all the information needed to allow users (or developers) to source the data they need in a secure, compliant way.
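Tag-based enforcement of this kind can be sketched as follows. The rule structure, tag names and entitlement names are all hypothetical and are not drawn from any of the vendors mentioned; the point is only that once fields are tagged consistently, a policy can be expressed against tags rather than against individual datasets.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

@dataclass(frozen=True)
class Rule:
    """Anyone reading a field carrying `tag` must hold `required_entitlement`."""
    tag: str
    required_entitlement: str

# Hypothetical policy and field tags.
RULES: List[Rule] = [
    Rule("sensitive", "pii-reader"),
    Rule("financial", "finance-reader"),
]
FIELD_TAGS: Dict[str, Set[str]] = {
    "customer_id": {"key"},
    "email": {"sensitive"},
    "balance": {"financial"},
}

def visible_fields(
    entitlements: Iterable[str],
    field_tags: Dict[str, Set[str]] = FIELD_TAGS,
    rules: List[Rule] = RULES,
) -> List[str]:
    """Return the fields a consumer may read, given the entitlements they hold."""
    held = set(entitlements)
    allowed = []
    for field, tags in field_tags.items():
        # Collect every entitlement demanded by a rule that matches one of the field's tags.
        needed = {r.required_entitlement for r in rules if r.tag in tags}
        if needed <= held:
            allowed.append(field)
    return allowed
```

A consumer holding only "pii-reader" would see customer_id and email but not balance, without anyone having written a rule that names those fields.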

But meta-data for data governance has humans as the primary user (e.g. CDOs, business/data analysts, process owners, etc) – or, as Alation describes it – meta-data for human collaboration.

Currently, there are no accepted standards for ensuring the consistency of ‘meta-data for human collaboration’ with ‘meta-data for query execution’.

Ideally, the human-oriented tools would generate standard events that tools in the data pipeline could pick up and act on (via, for example, something like AWS EventBridge), thereby avoiding the need for data governance personnel to oversee multiple data pipelines directly…
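What a pipeline-side handler for such events might look like is sketched below. The "source", "detail-type" and "detail" keys follow the general shape of an EventBridge event envelope, but the source name, the event type and the layout of the detail payload are assumptions made up for this example; as noted above, no standard for them exists.

```python
from typing import Any, Dict, Set

# Tags held by one pipeline's query engine (hypothetical in-memory stand-in).
pipeline_tags: Dict[str, Set[str]] = {}

def handle_governance_event(event: Dict[str, Any]) -> bool:
    """Apply a tag change published by a governance tool; return True if applied."""
    # The values checked here and the layout of "detail" are assumptions of this sketch.
    if event.get("source") != "example.governance":
        return False
    if event.get("detail-type") != "DatasetTagChanged":
        return False
    detail = event["detail"]
    tags = pipeline_tags.setdefault(detail["dataset"], set())
    tags.update(detail.get("added", []))
    tags.difference_update(detail.get("removed", []))
    return True
```

Each pipeline would run its own copy of a handler like this, so a single change made in the human-oriented tool reaches every query engine without governance staff touching the pipelines themselves.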

Summary

With the advent of cloud-based managed compute and data storage services, a multi-data warehouse and pipeline strategy is viable and may even be desirable, potentially involving multiple data lakes.

Solutions like PartiQL have the potential to eliminate many transformation job phases and greatly simplify data pipeline complexity in a standardized way, leveraging existing SQL skills rather than requiring new skills.

To ensure consistent governance across multiple data pipelines, a serverless event-based approach to connecting human data governance solutions with cloud-native data pipeline solutions may be the way forward – for example, using AWS EventBridge to action events originating from SaaS-based data governance services with data pipelines.


Strategic Theme # 5/5: Machine Learning

[tl;dr Business intelligence techniques coupled with advanced data semantics can dynamically improve automated or automatable processes through machine learning. But 2015 is still mainly about exploring the technologies and use cases behind machine learning.]

Given the other strategic themes outlined in this blog (lean enterprise, enterprise modularity, continuous delivery & system thinking), machine learning seems to be a strange addition. Indeed, it is a very specialist area, about which I know very little.

What is interesting about machine learning (at least in the enterprise sense) is that it leans heavily on two major data trends: big data and semantic data. It also has a significant impact on the technology that is the closest equivalent to machine learning in wide use today: business intelligence (aka human learning).

Big Data

Big data is a learning area for many organisations right now, as it has many potential benefits. Architecturally, I see big data as an innovative means of co-locating business logic with data in a scalable manner. The traditional (non big-data) approach to co-locating business logic with data is via stored procedures. But everyone knows (by now) that while stored-procedure based solutions can enable rapid prototyping and delivery, they are not a scalable solution. Typically (after all possible database optimisations have been done) the only way to resolve performance issues related to stored procedures is to buy bigger, faster infrastructure. Which usually means major migrations, etc.

Also, it is generally a very bad idea to include business logic in the database: this is why so much effort has been expended in developing frameworks which make the task of modelling database structures in the middle tier so much easier.

Big data allows business logic to be maintained in the ‘middle’ tier (or at least not the database tier) although it changes the middle tier concept from the traditional centralised application server architecture to a fundamentally distributed cluster of nodes, using tools like Spark, Mesos and Zookeeper to keep the nodes running as a single logical machine. (This is different from clustering application servers for reasons of resilience or performance, where as much as possible the clustering is hidden from the application developers through often proprietary frameworks.)

While the languages (like Pig, Hive, Cascading, Impala, F#, Python, Scala, Julia/R, etc.) used to develop such applications continue to evolve, there is still some way to go before sophisticated big-data frameworks equivalent to JEE/Blueprint and Ruby on Rails on traditional 3-tier architectures are developed. And clearly ‘big data’ languages are optimised for queries and not transactions.

Generally speaking, traditional 3-tier frameworks still make sense for transactional components, but for components which require querying/interpreting data, big data languages and infrastructure make a lot more sense. So increasingly we can see more standard application architectures using both (with sophisticated messaging technologies like Storm helping keep the two sides in sync).

An interesting subset of the ‘big data’ category (or, more accurately, the category of databases known as NoSQL) is graph databases. These are key for machine learning, as will be explained below. However, graph databases are still evolving when it comes to truly horizontal scaling, and while they are the best fit for implementing machine learning, they do not yet fit smoothly on top of ‘conventional’ big data architectures.

Semantic Data

Semantic data has been around for a while, but only within very specialist areas focused on AI and related spheres. It has gotten more publicity in recent years through Tim Berners-Lee promoting the concept of the semantic web. It requires discipline managing information about data – or meta-data.

Initiatives like Linked Data, platforms like datahub.io, standards like RDF, coupled with increasing demand for Open Data are helping develop the technologies, tools and skillsets needed to make use of the power of semantic data.

Today, standard semantic ontologies – which aim to provide consistency of data definitions – by industry are thin on the ground, but they are growing. However, the most sophisticated ontologies are still private: for example, Wolfram Alpha has a very sophisticated machine learning engine (which forms part of Apple’s Siri capability), and they use an internally developed ontology to interpret meaning. Wolfram Alpha have said that as soon as reliable industry standards emerge, they would be happy to use those, but right now they may be leading the field in terms of general ontology development (with mobile voice tools like Apple Siri etc close behind).

Semantic data is interesting from an enterprise perspective, as it requires knowing about what data you have, and what it means. ‘Meaning’ is quite subtle, as the same data field may be interpreted in different ways at different times by different consumers. For example, the concept of a ‘trade’ is fundamental to investment banking, yet the semantic variations of the ‘trade’ concept in different contexts are quite significant.
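One way to picture this is as a small set of subject/predicate/object statements, loosely in the spirit of RDF triples. The sketch below is purely illustrative: the context names, the predicates and the two meanings given for ‘trade’ are invented for the example and are not taken from any industry ontology.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements about what "trade" means to different consumers.
TRIPLES: List[Triple] = [
    ("trade@front-office", "isA", "executed order"),
    ("trade@settlement", "isA", "obligation to deliver"),
    ("trade@front-office", "sameConceptAs", "trade"),
    ("trade@settlement", "sameConceptAs", "trade"),
]

def objects(subject: str, predicate: str, triples: List[Triple] = TRIPLES) -> List[str]:
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def meanings_of(concept: str, triples: List[Triple] = TRIPLES) -> Dict[str, Optional[str]]:
    """Map each context-specific term linked to `concept` to its stated meaning."""
    result: Dict[str, Optional[str]] = {}
    for s, p, o in triples:
        if p == "sameConceptAs" and o == concept:
            found = objects(s, "isA", triples)
            result[s] = found[0] if found else None
    return result
```

Calling meanings_of("trade") returns one entry per context, which makes the variation explicit instead of leaving it buried in each consuming system.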

As regulated organisations are increasingly under pressure to improve their data governance, firms have many different reasons to get on top of their data:

  • to stay in business they need to meet regulatory needs;
  • to protect against reputational risk due to lost or stolen data;
  • to provide advanced services to clients that anticipate their needs or respond more quickly to client requests;
  • to anticipate and react to market changes and opportunities before the competition;
  • to integrate systems and processes efficiently with service providers and partners both internally and externally;
  • to increase process automation and minimise unnecessary human touch-points;
  • etc.

A co-ordinated, centrally led effort to gather and maintain knowledge about enterprise data is necessary, supported by federated, bottom-up efforts which tend to be project focused.

Using and applying all the gathered meta-data is a challenge and an opportunity, and will remain high on the enterprise agenda for years to come.

Business Intelligence

Business intelligence solutions can be seen as a form of ‘human learning’. They help people understand a situation from data, which can then aid decision making processes. Eventually, decisions feed into system requirements for teams to implement.

But, in general, business intelligence solutions are not appropriate as machine learning solutions. In most cases, the integrations are fairly unsophisticated (generally batch ETL), and computational ability is optimised for non-technical users to define and execute. The reports and views created in BI tools are not optimised to be included as part of a high performance application architecture, unlike big data tools.

As things stand today, the business intelligence and machine learning worlds are separate and should remain so, although eventually some convergence is inevitable. However, both benefit from the same data governance efforts.

Conclusions

Machine learning is a big topic, which ideally executes in the same context as the other strategic themes. But for 2015, this technology is still in the ‘exploratory’ stages, so localised experiments will be necessary before the technology and business problems they actually solve can be fully exploited.
