Agile quantitative analytics in Financial Services

[tl;dr technology is empowering quantitative analysts to do much more on their own. IT organisations will need to think about how to make these capabilities available to their users, and how to incorporate them into IT strategies around big data, cloud computing, security and data governance.]

The quest for positive (financial) returns in investments is helping drive considerable innovation in the space of quantitative analytics. This, coupled with the ever-decreasing capital investment required to do number-crunching, has created demand for ‘social’ analytics – where algorithms are shared and discussed amongst practitioners rather than kept sealed behind the closed doors of corporate research & trading departments.

I am not a quant, but I have in the past built systems that gave quantitative research analysts an alternative (to Excel) vehicle for capturing and publishing their models. While Excel is hard to beat for experimenting with new ideas, from a quantitative analyst's perspective it suffers from many deficiencies, including:

  • Spreadsheets get complex very quickly and are hard to maintain
  • They are not very efficient for back-end (server-side) use
  • They cannot be efficiently incorporated into scalable automated workflows
  • Models cannot be distributed or shared without losing control of the model
  • Integrating spreadsheets with multiple large data sets can be cumbersome and memory inefficient at best, and impossible at worst (constrained by machine memory limits)

QuantCon was created to provide a forum for quantitative analysts to discuss and share tools and techniques for quantitative research, with a particular focus on the sharing and distribution of models (either outcomes or logic, or both).

Some key themes from QuantCon which I found interesting were:

  • The emergence of social analytics platforms that can execute strategies on your venue of choice (e.g. quantopian.com)
  • The search for uncorrelated returns & innovation in (algorithmic) investment strategies
  • Back-testing as a means of validating algorithms – and the perils of assuming backtests would execute at the same prices in real life
  • The rise of freely available interactive model distribution tools such as the Jupyter project (similar to Mathematica Notebooks)
  • The evolution of probabilistic programming and machine learning – in particular the PyMC3 extensions to Python (a minimal sketch follows this list)
  • The rise in the number of free and commercial data sources (APIs) of data points (signals) that can be included in models
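
To make the probabilistic programming point concrete, here is a minimal sketch of the kind of model PyMC3 enables. The synthetic returns series, prior parameters and variable names are all assumptions for illustration, not a recommended model.

```python
import numpy as np
import pymc3 as pm

# Hypothetical daily strategy returns (synthetic data, for illustration only)
np.random.seed(42)
returns = np.random.normal(loc=0.0005, scale=0.01, size=250)

with pm.Model() as model:
    # Priors over the mean and volatility of daily returns (assumed values)
    mu = pm.Normal('mu', mu=0.0, sd=0.01)
    sigma = pm.HalfNormal('sigma', sd=0.05)

    # Likelihood of the observed returns
    pm.Normal('obs', mu=mu, sd=sigma, observed=returns)

    # Draw posterior samples
    trace = pm.sample(1000, tune=1000, cores=1, progressbar=False)

print(pm.summary(trace, var_names=['mu', 'sigma']))
```

The point is less the model itself than the workflow: the whole thing is plain Python, so it can be versioned, shared in a Jupyter notebook, and run server-side – none of which is easy with a spreadsheet.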

From an architectural perspective, there are some interesting implications. Specifically:

  • There really is no limit to what a single quant with access to multiple data sources (either internal or external) and access to platform- or infrastructure-as-a-service capabilities can do.
  • Big data technologies make it very easy to ingest, transform and process multiple data sources (static or real-time) at very low cost (but raise governance concerns).
  • It has never been cheaper or easier to efficiently and safely distribute, publish or share analytics models (although the tools for this are still evolving).
  • The line between the ‘IT developer’ and the user has never been more blurred.

Over the coming years, we can expect analytics (and business intelligence in general) capabilities to become key functions within every functional domain in an organisation, integrated both into the function itself through feedback loops and into more conventional MIS reporting. Some of the basic building blocks will be the same as they are today, but the key difference is that the users of these tools will be technologists specifically supporting business needs – i.e., part of 'the business' and not part of 'IT'.

In the past, businesses have been supported by vendors providing expensive but easy-to-use tools that allow non-technical people to work with large datasets. The IT folks were very specifically supporting the core data warehouse & business intelligence infrastructure, or providing technical support for the development of particular reports. In these cases, the clients were typically non-technical, and the platform could only evolve as quickly as the software vendors evolved.

The emerging (low-cost) tools for quantitative analytics will usher in a post-Excel world of innovation, scale and distribution that will empower users, give rise to whole new business models, and itself be a big driver of how enterprise IT defines its role in the agile, unbundled, decentralised and as-a-service technology landscape of the near future.

Traditional IT organisations often saw the 'application development' functions as business-aligned, and were comfortable with the client-oriented nature of providing technical infrastructure services to development teams. However, internal development teams supporting other (business-aligned) development teams is a fairly new concept, and will likely best be done by external specialist providers. This is a good example of how IT's biggest future role (apart from governance) will be sourcing relevant providers, and ensuring business technologists are able to do their job effectively and efficiently in that environment.

In summary, technology is empowering quantitative analysts to do much more on their own. IT organisations will need to think about how to make these capabilities available to their users, and how to incorporate them into IT strategies around big data, cloud computing, security and data governance.


Strategic Theme # 5/5: Machine Learning

[tl;dr Business intelligence techniques coupled with advanced data semantics can dynamically improve automated or automatable processes through machine learning. But 2015 is still mainly about exploring the technologies and use cases behind machine learning.]

Given the other strategic themes outlined in this blog (lean enterprise, enterprise modularity, continuous delivery & system thinking), machine learning seems to be a strange addition. Indeed, it is a very specialist area, about which I know very little.

What is interesting about machine learning (at least in the enterprise sense) is that it leans heavily on two major data trends: big data and semantic data. It also has a significant impact on the technology that is the closest equivalent to machine learning in wide use today: business intelligence (aka human learning).

Big Data

Big data is a learning area for many organisations right now, as it has many potential benefits. Architecturally, I see big data as an innovative means of co-locating business logic with data in a scalable manner. The traditional (non-big-data) approach to co-locating business logic with data is via stored procedures. But everyone knows (by now) that while stored-procedure-based solutions can enable rapid prototyping and delivery, they are not a scalable solution. Typically (after all possible database optimisations have been done), the only way to resolve performance issues related to stored procedures is to buy bigger, faster infrastructure – which usually means major migrations, etc.

Also, it is generally a very bad idea to include business logic in the database: this is why so much effort has been expended in developing frameworks which make the task of modelling database structures in the middle tier so much easier.
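
As a small illustration of that middle-tier approach, the sketch below maps a hypothetical trades table into application code with SQLAlchemy, so business logic can live alongside the mapped objects rather than in stored procedures. Table and column names are invented for the example.

```python
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Trade(Base):
    """Hypothetical mapping of a 'trades' table into the middle tier."""
    __tablename__ = "trades"
    id = Column(Integer, primary_key=True)
    instrument = Column(String)
    notional = Column(Float)

# In-memory SQLite as a stand-in for a real database
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Trade(instrument="FX_SPOT", notional=1_000_000))
    session.commit()
    # Business logic operates on mapped objects, not on stored procedures
    gross = sum(t.notional for t in session.query(Trade).all())
    print(gross)
```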

Big data allows business logic to be maintained in the 'middle' tier (or at least not the database tier), although it changes the middle-tier concept from the traditional centralised application server architecture to a fundamentally distributed cluster of nodes, using tools like Spark, Mesos and Zookeeper to keep the nodes running as a single logical machine. (This is different from clustering application servers for reasons of resilience or performance, where as much as possible the clustering is hidden from application developers, often through proprietary frameworks.)

While the languages (like Pig, Hive, Cascading, Impala, F#, Python, Scala, Julia/R, etc.) used to develop such applications continue to evolve, there is still some way to go before big-data frameworks emerge that are as sophisticated as JEE/Blueprint or Ruby on Rails are for traditional 3-tier architectures. And clearly 'big data' languages are optimised for queries, not transactions.

Generally speaking, traditional 3-tier frameworks still make sense for transactional components, but for components which require querying or interpreting data, big data languages and infrastructure make a lot more sense. So increasingly we can expect to see standard application architectures using both (with stream-processing technologies like Storm helping to keep the two sides in sync).
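
As a rough illustration of the query side of such an architecture, the PySpark sketch below pushes aggregation logic out to the cluster, close to the data. The dataset path and column names are assumptions for the example, not a reference design.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trade-analytics").getOrCreate()

# Hypothetical trade dataset; path and columns are illustrative only
trades = spark.read.parquet("hdfs:///data/trades")

# Query-style business logic executes on the cluster nodes, next to the data
daily = (trades
         .groupBy("trade_date", "desk")
         .agg(F.sum("notional").alias("gross_notional"),
              F.count("*").alias("trade_count")))

daily.write.mode("overwrite").parquet("hdfs:///data/trade_summary")
```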

An interesting subset of the 'big data' category (or, more accurately, of the category of databases known as NoSQL) is graph databases. These are key for machine learning, as will be explained below. However, graph databases are still evolving when it comes to truly horizontal scaling, and while they are the best fit for implementing machine learning, they do not yet fit smoothly on top of 'conventional' big data architectures.

Semantic Data

Semantic data has been around for a while, but only within very specialist areas focused on AI and related spheres. It has gained more publicity in recent years through Tim Berners-Lee's promotion of the concept of the semantic web. It requires discipline in managing information about data – or metadata.

Initiatives like Linked Data, platforms like datahub.io and standards like RDF, coupled with increasing demand for Open Data, are helping develop the technologies, tools and skillsets needed to make use of the power of semantic data.

Today, industry-standard semantic ontologies – which aim to provide consistency of data definitions within an industry – are thin on the ground, but they are growing. However, the most sophisticated ontologies are still private: for example, Wolfram Alpha has a very sophisticated machine learning engine (which forms part of Apple's Siri capability), and it uses an internally developed ontology to interpret meaning. Wolfram Alpha has said that as soon as reliable industry standards emerge, it would be happy to use them, but right now it may be leading the field in terms of general ontology development (with mobile voice tools like Apple's Siri close behind).

Semantic data is interesting from an enterprise perspective, as it requires knowing about what data you have, and what it means. ‘Meaning’ is quite subtle, as the same data field may be interpreted in different ways at different times by different consumers. For example, the concept of a ‘trade’ is fundamental to investment banking, yet the semantic variations of the ‘trade’ concept in different contexts are quite significant.
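
A toy example of what that looks like in practice: using RDF (here via the rdflib library), the same underlying trade record can be annotated with different meanings for different consumers, and then queried by meaning rather than by table and column. All namespaces and identifiers below are hypothetical.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

# Hypothetical namespaces for two consumers of the same 'trade' record
RISK = Namespace("http://example.com/risk#")
OPS = Namespace("http://example.com/operations#")
DATA = Namespace("http://example.com/data#")

g = Graph()
trade = DATA["trade/12345"]

# The same record carries different semantics in different contexts
g.add((trade, RDF.type, RISK.Exposure))
g.add((trade, RDF.type, OPS.SettlementInstruction))
g.add((trade, RISK.notional, Literal(1000000)))

# Query by meaning ("everything risk regards as an exposure"), not by schema
for subject in g.subjects(RDF.type, RISK.Exposure):
    print(subject, g.value(subject, RISK.notional))
```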

Regulated organisations are under increasing pressure to improve their data governance, and firms have many different reasons to get on top of their data:

  • to stay in business, they need to meet regulatory needs;
  • to protect against reputational risk due to lost or stolen data;
  • to provide advanced services to clients that anticipate their needs or respond more quickly to client requests;
  • to anticipate and react to market changes and opportunities before the competition;
  • to integrate systems and processes efficiently with service providers and partners, both internally and externally;
  • to increase process automation and minimise unnecessary human touch-points;
  • and so on.

A co-ordinated, centrally led effort to gather and maintain knowledge about enterprise data is necessary, supported by federated, bottom-up efforts which tend to be project-focused.

Using and applying all the gathered meta-data is a challenge and an opportunity, and will remain high on the enterprise agenda for years to come.

Business Intelligence

Business intelligence solutions can be seen as a form of ‘human learning’. They help people understand a situation from data, which can then aid decision making processes. Eventually, decisions feed into system requirements for teams to implement.

But, in general, business intelligence solutions are not appropriate as machine learning solutions. In most cases the integrations are fairly unsophisticated (generally batch ETL), and the computational capabilities are optimised for non-technical users to define and execute. Unlike big data tools, the reports and views created in BI tools are not designed to be embedded in a high-performance application architecture.

As things stand today, the business intelligence and machine learning worlds are separate and should remain so, although eventually some convergence is inevitable. However, both benefit from the same data governance efforts.

Conclusions

Machine learning is a big topic, and ideally it should be pursued in the same context as the other strategic themes. But for 2015 this technology is still in the 'exploratory' stage, so localised experiments will be necessary before the technology, and the business problems it can actually solve, are understood well enough to be fully exploited.


Understanding Big Data

Big Data is a Big Topic. So I’m trying to get my head around a few basic concepts. 

My interest stems from the following areas:

  • Managing data complexity – most organisations have more data than they know what to do with. In particular, data semantics is a big problem, as is the ability to find and access the data that is needed.
  • Machine learning – the ability to infer meaning from data and use this in highly automated processes on a real-time or near-real-time basis – e.g., Amazon/Netflix recommendation engines (a minimal recommender sketch follows this list). Any process which needs a human '4-eyes' check is a good candidate for this. This article from IBM is a good synopsis of open source technologies that can aid machine learning. Note – this is distinct from business intelligence, which (today) assumes that the report produced by the system is the end product; i.e., business intelligence is not in and of itself intended to be part of an automated process. But the lines between business intelligence and machine learning can be expected to blur.
  • Data Discovery – for many organisations, finding the data you need is a big challenge. Graph databases, triple-stores and open standards like RDF offer a way for these to be useful to, and accessible to, non-architects. In large corporate environments, for example, these technologies can enable the creation of a useful who's-who of experts in different technologies, recognising that the universe of technologies is constantly changing, and many technologies are closely related to each other or tend to be used together. Data discovery initiatives like Datahub and Linked Data are worth watching, as are the W3C's efforts around the Semantic Web.
  • Modularity and Data Persistence – the relationship between data and services is historically a challenging one, with the natural tendency to have business logic as close to the data as possible (e.g., stored procedures, etc.). The sheer number of alternative data store/retrieval options means that it is even more important to separate the implementation of modules from their APIs: by all means (if you must) mix data and logic in the implementation, but do not expose the data any way except via the module API, or you will lose control of the data. This means more and more data should be exposed via services, and business logic should access the data via these services only. In principle, this allows business functionality to be exposed as modules, and data services to support multiple modules without compromising principles of modularity. It also allows a degree of flexibility over which of the many persistence solutions should be used for a given problem.
  • Containers – many database technologies today can be deployed into a self-contained environment, as they expose their interfaces through open APIs (such as REST). So they can be isolated from the technology and architecture of the rest of your platform (in much the same way your Oracle database can be on Unix, and your clients running Windows, etc.). Technologies like Docker and Mesos enable distributed databases to be built on commodity technology, enabling capacity and resilience to be added horizontally (by adding more commodity nodes) rather than vertically (more big iron). The relationship between these technologies and modular, service-oriented architectures is still rather immature; however, the trend is evident and has significant implications for architectural design.
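
To make the recommendation-engine point above a bit more concrete, here is a minimal item-similarity sketch in plain NumPy. The ratings matrix is a tiny invented example; real systems work with sparse data at a vastly larger scale and with far more sophisticated models.

```python
import numpy as np

# Tiny invented user-item ratings matrix (rows = users, columns = items)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Cosine similarity between item columns
norms = np.linalg.norm(ratings, axis=0)
norms[norms == 0] = 1.0           # avoid division by zero for unrated items
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Score items for user 0 by similarity-weighted ratings; skip items already rated
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf
print("recommend item:", int(np.argmax(scores)))
```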

I don’t pretend to fully understand the nature or implications of all of the above – the real world will decide what is useful and what is not. But there are a number of trends here that are key:

  • Increasing focus on data semantics and data discovery
  • Massive innovation in database technologies – no one-size fits all solution
  • Technologies to support infrastructure management are advancing in lock-step with the advances in database technologies
  • Technologies for doing something useful with all this data on a (near) real-time basis are also improving dramatically.

All of the above is mainly concerned with data-at-rest; how data gets from where it is now to where it is needed, without resorting to building point-to-point interfaces, is a whole different subject.
