Big Data is a Big Topic. So I’m trying to get my head around a few basic concepts.
My interest stems from the following areas:
- Managing data complexity – most organisations have more data than they know what to do with. In particular, data semantics is a big problem, as is the ability to find and access the data that is needed.
- Machine learning – the ability to infer meaning from data and use this in highly automated processes on a real-time or near-real-time basis – e.g., Amazon/Netflix recommendation engines. Any process which needs a human ‘4-eyes’ check is a good candidate for this. This article from IBM is a good synopsis of open source technologies that can aid machine learning. Note – this is distinct from business intelligence, which (today) assumes that the report produced by the system is the end product; i.e., business intelligence is not in and of itself intended to be part of an automated process. But the lines between business intelligence and machine learning can be expected to blur.
- Data Discovery – for many organisations, finding the data you need is a big challenge. Graph databases, triple-stores and open standards like RDF offer a way to make that data useful and accessible to non-architects. In large corporate environments, for example, these technologies can enable the creation of a useful who’s-who of experts in different technologies, recognising that the universe of technologies is constantly changing, and that many technologies are closely related to each other or tend to be used together. Data discovery initiatives like Datahub and Linked Data are worth watching, as are the W3C’s efforts around the Semantic Web.
- Modularity and Data Persistence – the relationship between data and services is historically a challenging one, with the natural tendency to have business logic as close to the data as possible (e.g., stored procedures). The sheer number of alternative data storage/retrieval options means that it is even more important to separate the implementation of modules from their APIs: by all means (if you must) mix data and logic in the implementation, but do not expose the data in any way except via the module API, or you will lose control of the data. This means more and more data should be exposed via services, and business logic should access the data via these services only. In principle, this allows business functionality to be exposed as modules, and data services to support multiple modules without compromising principles of modularity. It also allows a degree of flexibility over which of the many persistence solutions should be used for a given problem.
- Containers – many database technologies today can be deployed into a self-contained environment, as they expose their interfaces through open APIs (such as REST). So they can be isolated from the technology and architecture of the rest of your platform (in much the same way your Oracle database can be on Unix while your clients run Windows). Technologies like Docker and Mesos enable distributed databases to be built on commodity technology, enabling capacity and resilience to be added horizontally (by adding more commodity nodes) rather than vertically (more big iron). The relationship between these technologies and modular, service-oriented architectures is still rather immature; however, the trend is evident and has significant implications for architectural design.
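To make the modularity point above a little more concrete, here is a minimal Python sketch of hiding the persistence choice behind a module API. Everything in it is hypothetical and chosen only for illustration: the `CustomerStore` interface, the two implementations and the `greeting` function are not taken from any real system. The business logic sees only the API, so the store behind it can be swapped without touching the callers.

```python
import sqlite3
from typing import Optional, Protocol


class CustomerStore(Protocol):
    """Hypothetical module API: the only way callers may reach customer data."""

    def add(self, customer_id: str, name: str) -> None: ...

    def get_name(self, customer_id: str) -> Optional[str]: ...


class InMemoryCustomerStore:
    """Keeps customers in a plain dict (illustrative; handy for tests)."""

    def __init__(self) -> None:
        self._rows = {}

    def add(self, customer_id: str, name: str) -> None:
        self._rows[customer_id] = name

    def get_name(self, customer_id: str) -> Optional[str]:
        return self._rows.get(customer_id)


class SqliteCustomerStore:
    """Keeps customers in SQLite; the table layout never leaks to callers."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS customer (id TEXT PRIMARY KEY, name TEXT)"
        )

    def add(self, customer_id: str, name: str) -> None:
        # Using the connection as a context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)",
                (customer_id, name),
            )

    def get_name(self, customer_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return row[0] if row else None


def greeting(store: CustomerStore, customer_id: str) -> str:
    """Business logic that depends on the API, not on the storage technology."""
    name = store.get_name(customer_id)
    return f"Hello, {name}!" if name is not None else "Hello, guest!"
```

Calling `greeting` with either store gives the same result for the same data, which is the flexibility over persistence solutions that the bullet describes. In a real platform the API would more likely sit behind a network service than an in-process interface, but the principle is the same.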
I don’t pretend to fully understand the nature or implications of all of the above; the real world will decide what is useful and what is not. But there are a number of trends here that are key:
- Increasing focus on data semantics and data discovery
- Massive innovation in database technologies – no one-size-fits-all solution
- Technologies to support infrastructure management are advancing in lock-step with the advances in database technologies
- Technologies to be able to do something useful with all this data on a (near) real-time basis are also improving dramatically.
All of the above is mainly concerned with data-at-rest: how data gets from where it is now to where it is needed, without resorting to building point-to-point interfaces, is a whole different subject.