The Meaning of Data

[tl;dr The Semantic Web may be a CDO's best friend in their efforts to help the business realise the full value of enterprise data.]

Large organisations with complex legacy infrastructure face a dilemma: the technology that allowed them to grow and capture market share has reached a turning point in terms of the value returned from additional spending. The only practical investments left are defensive ones: shoring up (protecting) existing investments through essential maintenance, a focus on security, and (perhaps) a move to lower-cost, cloud-based infrastructure. Mandatory (usually regulatory) enhancements also need to be done.

But in terms of protecting existing revenue or capturing new markets, legacy applications are often seen as a burden. Businesses are increasingly looking outside their core IT capability to address their needs – especially for technologies that can help pinpoint opportunities in the first place (i.e., analysing and processing the digital footprints left by customers and prospective customers, whether those footprints are internal to the firm or external).

Technology is both the problem and the solution. And herein lies the dilemma: internal IT organisations cannot possibly advise on every technology-enabled solution available to the business. Rather, businesses need to become more “tech-savvy” – a term that is bandied about a lot these days.

What does “tech-savvy” actually mean, and where is the line drawn between a “tech-savvy” business person and the professional IT service provider? And what form should this ‘line’ take?

Arguably, most businesses have always been tech-savvy: they have, by and large, known where technology was necessary to grow their business. So for the most successful firms, there is a *lot* of technology. Does that mean those businesses are tech-savvy?

Yes – if those firms are able to manage the complexity of their IT: in other words, to adapt their IT to shifting market needs, and to incorporate/absorb innovations into their architecture without significantly reducing that agility. In practice, few firms have such mastery of their IT (and, by implication, of their processes and business information systems).

So a new form of “tech-savvy” is needed – one that allows business folks to leverage opportunities to both find customers and meet their needs (profitably), while preventing a build-up of unsustainable complexity in processes and systems.

In essence, tech-savvy business folk need to get better at understanding data. And IT needs to get much better at making meaningful business data available to their business – irrespective of where that business data may actually reside.

What does this mean in practice? It’s all about semantics – something previously in the domain of data geeks. But major initiatives like the Semantic Web (led by Tim Berners-Lee, of WWW fame) are making semantics useful to traditionally non-technical people.

The tools available to business folks for discovering what information (i.e., data + context) exists are very poor, and even when the required data is found (i.e., its location is known), it can be difficult to extract it and use it in a productive way.

A good example of this may be observed in the proliferation of Excel spreadsheets and Access databases supporting critical business functions in organisations. These tools are necessary because the data those functions need from various systems is not where they need it, when they need it – and that is often because business needs change faster than IT can capture, absorb and deliver requirements.

Over time (at least in principle), all meaningful business data will (must) end up being processed exclusively by controlled enterprise systems. But this doesn’t mean waiting until IT has made the necessary changes before effectively governing that data and making maximum use of it.

A key principle behind the semantic web is that data can be anywhere. In an enterprise scenario, that means the data could exist in:

  • an internal enterprise system
  • a file system (e.g., Excel sheet or Access database)
  • a website (file download, or website screen scraping)
  • a commercial information provider (Reuters, Bloomberg, etc)
  • a business partner/supplier
  • etc

To take advantage of this, meta-data (i.e., data about data) needs to be made available to businesses, and it needs to be business-relevant and agnostic to internal IT systems and processes.
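
As a minimal sketch of what business-relevant, system-agnostic meta-data could look like, the following Python fragment uses the rdflib library and Dublin Core terms to describe a dataset without saying anything about the system that holds it (the dataset URI and property values are purely illustrative assumptions):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()

# Illustrative identifier only -- in practice each data asset gets a stable URI
dataset = URIRef("http://example.com/data/customer-positions")

# Business-relevant descriptions, deliberately silent on which system holds the data
g.add((dataset, DCTERMS.title, Literal("End-of-day customer positions")))
g.add((dataset, DCTERMS.description,
       Literal("Net positions per customer account, updated nightly")))
g.add((dataset, DCTERMS.publisher, Literal("Operations")))

print(g.serialize(format="turtle"))
```

The point is that a business user reading this description learns what the data is and who owns it, without needing to know (or care) whether it lives in a warehouse, a spreadsheet or a vendor feed.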

More than this, the tools needed to discover useful information, as well as to retrieve and process that information, need to be available to tech-savvy business folk. The right set of tools – suitably agnostic to specific architectures and systems – can allow businesses to explore new opportunities and business models while the IT systems and platforms catch up and evolve over time, noting that there is no presumption that the eventual source of useful business data originates from, or is stewarded by, internal systems managed by IT.

Such tools also need to enforce compliance, ensuring only the right people get access to the right data, and (if necessary) limiting the ability for people to pull data outside of the retrieval platform. In essence, they provide a controlled sandbox in which businesses can distil value from information, whatever its source.

Many organisations may approach this by implementing a ‘data lake’. This may (perhaps) be a workable solution for the data assets managed by the IT organisation, but it is not feasible for the many sets of data which are not managed by IT yet still have business utility.

Emerging standards and technologies are evolving to meet this need for a data-centric view of information systems – in particular RDF, OWL and related ‘ontology’ languages, as well as standards like SPARQL, which enable the discovery and retrieval of data originating from multiple sources. But tools are still very primitive and generally demand too much technical savvy. However, efforts like the Callimachus Project give a hint of the potential of data-driven applications.

At some point, it will not be unreasonable to ask an IT organisation to expose all of its data assets semantically via SPARQL end-points (with appropriate access controls), and to provide tools that allow the business to explore that data and (where permissible) incorporate it into spreadsheets, models and other tools in ways that realise value from the data without requiring IT change requests. Developing the capabilities to understand semantic data, and to use it to commercial advantage, will take time, but it will be a worthwhile investment (and arguably one to make before folks start spending money on ‘big data’ projects that have no wider context).
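
To give a flavour of what this could look like for a tech-savvy business user, here is a minimal sketch using Python’s SPARQLWrapper library – the endpoint URL and the vocabulary in the query are illustrative assumptions, not a real service:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical internal endpoint -- any SPARQL-compliant end-point would do
sparql = SPARQLWrapper("http://data.example.com/sparql")
sparql.setReturnFormat(JSON)

# Find datasets published by Operations (vocabulary is illustrative)
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?dataset ?title WHERE {
        ?dataset dct:title ?title ;
                 dct:publisher "Operations" .
    }
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```

The same query works unchanged whether the endpoint fronts a relational database, a triple-store or a federation of sources – which is precisely the system-agnosticism the business needs.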

In fact, I would go so far as to say that any provider (startup or established) delivering technology services to a business should provide SPARQL endpoints as a matter of course. Making data useful and available to the business will help the business realise the value of data and get more involved in how it is captured, processed and stored in the future – and make it easier to incorporate 3rd party solution providers into business operations.

In a nutshell, the Semantic Web may be the CDO's best friend in their efforts to help the business realise the full value of enterprise data.


Strategic Theme # 5/5: Machine Learning

[tl;dr Business intelligence techniques coupled with advanced data semantics can dynamically improve automated or automatable processes through machine learning. But 2015 is still mainly about exploring the technologies and use cases behind machine learning.]

Given the other strategic themes outlined in this blog (lean enterprise, enterprise modularity, continuous delivery & system thinking), machine learning seems to be a strange addition. Indeed, it is a very specialist area, about which I know very little.

What is interesting about machine learning (at least in the enterprise sense) is that it leans heavily on two major data trends: big data and semantic data. It also has a significant impact on the technology that is the closest equivalent to machine learning in wide use today: business intelligence (aka human learning).

Big Data

Big data is a learning area for many organisations right now, as it has many potential benefits. Architecturally, I see big data as an innovative means of co-locating business logic with data in a scalable manner. The traditional (non-big-data) approach to co-locating business logic with data is via stored procedures. But everyone knows (by now) that while stored-procedure-based solutions can enable rapid prototyping and delivery, they are not scalable. Typically (after all possible database optimisations have been done), the only way to resolve performance issues related to stored procedures is to buy bigger, faster infrastructure – which usually means major migrations, etc.

Also, it is generally a very bad idea to include business logic in the database: this is why so much effort has been expended in developing frameworks which make the task of modelling database structures in the middle tier so much easier.

Big data allows business logic to be maintained in the ‘middle’ tier (or at least not the database tier), although it changes the middle-tier concept from the traditional centralised application server architecture to a fundamentally distributed cluster of nodes, using tools like Spark, Mesos and ZooKeeper to keep the nodes running as a single logical machine. (This is different from clustering application servers for reasons of resilience or performance, where as much as possible the clustering is hidden from application developers through often proprietary frameworks.)

While the languages used to develop such applications (Pig, Hive, Cascading, Impala, F#, Python, Scala, Julia/R, etc) continue to evolve, there is still some way to go before big-data frameworks emerge that are as sophisticated as JEE/Blueprint or Ruby on Rails are for traditional 3-tier architectures. And clearly ‘big data’ languages are optimised for queries, not transactions.

Generally speaking, traditional 3-tier frameworks still make sense for transactional components, but for components which require querying/interpreting data, big data languages and infrastructure make a lot more sense. So increasingly we can expect standard application architectures to use both (with sophisticated messaging technologies like Storm helping to keep the two sides in sync).
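
As a sketch of the query side of such a hybrid architecture – assuming Spark, and with an invented file path and column names – the following PySpark fragment ships a simple aggregation out to the cluster holding the data, rather than burying it in stored procedures:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("trade-volumes").getOrCreate()

# Business logic lives in the distributed tier: the aggregation is shipped
# to the nodes holding the data, not executed inside the database
trades = spark.read.parquet("hdfs:///data/trades")  # illustrative path
daily_volume = (trades
                .groupBy("trade_date", "instrument")
                .agg(F.sum("quantity").alias("volume")))

daily_volume.show()
```

Scaling here means adding commodity nodes, not buying a bigger database server.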

An interesting subset of the ‘big data’ category (or, more accurately, of the category of databases known as NoSQL) is graph databases. These are key for machine learning, as will be explained below. However, graph databases are still evolving when it comes to truly horizontal scaling, and while they are the best fit for implementing machine learning, they do not yet sit smoothly on top of ‘conventional’ big data architectures.

Semantic Data

Semantic data has been around for a while, but only within very specialist areas focused on AI and related spheres. It has received more publicity in recent years through Tim Berners-Lee’s promotion of the Semantic Web. It requires discipline in managing information about data – or meta-data.

Initiatives like Linked Data, platforms like datahub.io and standards like RDF, coupled with increasing demand for Open Data, are helping develop the technologies, tools and skillsets needed to make use of the power of semantic data.

Today, industry-standard semantic ontologies – which aim to provide consistency of data definitions within an industry – are thin on the ground, but they are growing. However, the most sophisticated ontologies are still private: for example, Wolfram Alpha has a very sophisticated machine learning engine (which forms part of Apple’s Siri capability), and it uses an internally developed ontology to interpret meaning. Wolfram Alpha has said that as soon as reliable industry standards emerge, it would be happy to use them, but right now it may be leading the field in terms of general ontology development (with mobile voice tools like Apple’s Siri etc close behind).

Semantic data is interesting from an enterprise perspective, as it requires knowing about what data you have, and what it means. ‘Meaning’ is quite subtle, as the same data field may be interpreted in different ways at different times by different consumers. For example, the concept of a ‘trade’ is fundamental to investment banking, yet the semantic variations of the ‘trade’ concept in different contexts are quite significant.
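
As an illustration of how an ontology can make those semantic variations explicit, here is a minimal rdflib sketch modelling two context-specific views of the same underlying ‘trade’ concept (the namespace and vocabulary are invented purely for illustration):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.com/ontology/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# One underlying concept...
g.add((EX.Trade, RDF.type, RDFS.Class))

# ...with different meanings for different consumers
g.add((EX.ExecutedTrade, RDFS.subClassOf, EX.Trade))
g.add((EX.ExecutedTrade, RDFS.comment,
       Literal("Front-office view: an agreed exchange at an agreed price")))

g.add((EX.SettlementObligation, RDFS.subClassOf, EX.Trade))
g.add((EX.SettlementObligation, RDFS.comment,
       Literal("Operations view: the cashflows and deliveries due on settlement")))

print(g.serialize(format="turtle"))
```

Both views remain linked to the shared concept, so a consumer can always ask which interpretation of ‘trade’ a given data field carries.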

Regulated organisations are increasingly under pressure to improve their data governance, and firms have many different reasons to get on top of their data:

  • to stay in business by meeting regulatory requirements
  • to protect against reputational risk due to lost or stolen data
  • to provide advanced services to clients that anticipate their needs or respond more quickly to client requests
  • to anticipate and react to market changes and opportunities before the competition
  • to integrate systems and processes efficiently with service providers and partners, both internally and externally
  • to increase process automation and minimise unnecessary human touch-points
  • etc

A co-ordinated, centrally led effort to gather and maintain knowledge about enterprise data is necessary, supported by federated, bottom-up efforts which tend to be project-focused.

Using and applying all the gathered meta-data is a challenge and an opportunity, and will remain high on the enterprise agenda for years to come.

Business Intelligence

Business intelligence solutions can be seen as a form of ‘human learning’. They help people understand a situation from data, which can then aid decision making processes. Eventually, decisions feed into system requirements for teams to implement.

But, in general, business intelligence solutions are not appropriate as machine learning solutions. In most cases the integrations are fairly unsophisticated (generally batch ETL), and the computations are optimised for non-technical users to define and execute. Unlike big data tools, the reports and views created in BI tools are not designed to be included as part of a high-performance application architecture.

As things stand today, the business intelligence and machine learning worlds are separate and should remain so, although eventually some convergence is inevitable. However, both benefit from the same data governance efforts.

Conclusions

Machine learning is a big topic, and ideally it should be pursued in the context of the other strategic themes. But for 2015 this technology is still in the ‘exploratory’ stage, so localised experiments will be necessary before the technology, and the business problems it can actually solve, can be fully exploited.


Understanding Big Data

Big Data is a Big Topic. So I’m trying to get my head around a few basic concepts. 

My interest stems from the following areas:

  • Managing data complexity – most organisations have more data than they know what to do with. In particular, data semantics is a big problem, as is the ability to find and access the data that is needed.
  • Machine learning – the ability to infer meaning from data and use it in highly automated processes on a real-time or near-real-time basis – e.g., Amazon/Netflix recommendation engines (a toy sketch of the idea follows this list). Any process which needs a human ‘4-eyes’ check is a good candidate for this. This article from IBM is a good synopsis of open source technologies that can aid machine learning. Note – this is distinct from business intelligence, which (today) assumes that the report produced by the system is the end product; i.e., business intelligence is not in and of itself intended to be part of an automated process. But the lines between business intelligence and machine learning can be expected to blur.
  • Data Discovery – for many organisations, finding the data you need is a big challenge. Graph databases, triple-stores and open standards like RDF offer a way to make data useful to, and accessible by, non-architects. In large corporate environments, for example, these technologies can enable the creation of a useful who’s-who of experts in different technologies, recognising that the universe of technologies is constantly changing, and many technologies are closely related to each other or tend to be used together. Data discovery initiatives like Datahub and Linked Data are worth watching, as are the W3C’s efforts around the Semantic Web.
  • Modularity and Data Persistence – the relationship between data and services is historically a challenging one, with a natural tendency to keep business logic as close to the data as possible (e.g., stored procedures, etc). The sheer number of alternative data store/retrieval options means that it is even more important to separate the implementation of modules from their APIs: by all means (if you must) mix data and logic in the implementation, but do not expose the data any other way except via the module API, or you will lose control of it. This means more and more data should be exposed via services, and business logic should access the data via these services only. In principle, this allows business functionality to be exposed as modules, and data services to support multiple modules, without compromising principles of modularity. It also allows a degree of flexibility over which of the many persistence solutions should be used for a given problem.
  • Containers – many database technologies today can be deployed into a self-contained environment, as they expose their interfaces through open APIs (such as REST, etc). So they can be isolated from the technology and architecture of the rest of your platform (in much the same way your Oracle database can run on Unix while your clients run Windows). Technologies like Docker and Mesos enable distributed databases to be built on commodity technology, enabling capacity and resilience to be added horizontally (by adding more commodity nodes) rather than vertically (more big iron). The relationship between these technologies and modular, service-oriented architectures is still rather immature... however, the trend is evident and has significant implications for architectural design.
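
As a toy sketch of the recommendation-engine idea mentioned in the machine learning bullet above, here is an item-based similarity calculation in Python/numpy – the ratings matrix is invented purely for illustration, and real engines work at vastly larger scale:

```python
import numpy as np

# Invented user x item ratings matrix (0 = not rated), for illustration only
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns
norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)

# Score unseen items for user 0 by similarity-weighted ratings
user = ratings[0]
scores = similarity @ user
scores[user > 0] = -np.inf  # don't recommend items already rated
print("Recommend item:", int(np.argmax(scores)))
```

The same idea – inferring a user’s likely interest from patterns in the data – is what distinguishes machine learning from a BI report that a human must read and act on.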

I don’t pretend to fully understand the nature or implications of all of the above – the real world will decide what is useful and what is not. But there are a number of key trends here:

  • Increasing focus on data semantics and data discovery
  • Massive innovation in database technologies – no one-size fits all solution
  • Technologies to support infrastructure management are advancing in lock-step with the advances in database technologies
  • Technologies to be able to do something useful with all this data on a (near) real-time basis are also improving dramatically.

All of the above is mainly concerned with data-at-rest: how data gets from where it is now to where it is needed, without resorting to building point-to-point interfaces, is a whole different subject.
