Data Myths: 2 – A data lake will solve all my digital problems

Getting new digital functionality working is complicated, especially when it involves integrating with existing systems from different vendors. The idea of copying all this data into a single ‘data lake’ is very attractive; everything will be in one place, your architecture drawings look really clean, you don’t need to worry about how to connect to legacy systems or about overloading them, and the performance should be blisteringly fast. A modern one-stop shop for all future development. I know this because this was my dream when we started Eigen 12 years ago! We were trying to build the one unifying data model that would enable all our functionality dreams. But here’s what we’ve found over the last 12 years.

Sometimes it is the right answer, but a significant portion of the time it’s not. 

Myth 2: A data lake will solve all my digital problems 

So why is that? Well, here are our experiences.

A data lake is another system that you have to build and manage. Configuring the data lake at the start of a project adds a lot of complexity (and the data lake always needs some new feature for every project!). Then, once in production, someone has to make sure that the data lake always has the latest data and is a 100% match against the source systems. This is a very big job!
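
To give a flavour of that ongoing sync burden, here is a minimal, illustrative sketch in Python of the kind of reconciliation check someone has to run and maintain to prove the lake matches the source systems. The tag names, timestamps and counts are invented; in practice both snapshots would come from queries against the real systems.

```python
from datetime import datetime

# Toy snapshots of "latest value per tag": (last timestamp, row count).
# In reality these would come from queries against each system.
source = {
    "PUMP_101.FLOW": (datetime(2020, 5, 11, 9, 0, 0), 15231),
    "PUMP_101.PRESSURE": (datetime(2020, 5, 11, 9, 0, 0), 15230),
}
lake = {
    "PUMP_101.FLOW": (datetime(2020, 5, 11, 8, 45, 0), 15102),  # lagging behind the source
    "PUMP_101.PRESSURE": (datetime(2020, 5, 11, 9, 0, 0), 15230),
}

def reconcile(source, lake):
    """Report tags that are missing from the lake, stale, or have mismatched row counts."""
    issues = []
    for tag, (src_ts, src_count) in source.items():
        if tag not in lake:
            issues.append(f"{tag}: missing from the data lake")
            continue
        lake_ts, lake_count = lake[tag]
        if lake_ts < src_ts:
            issues.append(f"{tag}: lake is stale ({lake_ts} < {src_ts})")
        if lake_count != src_count:
            issues.append(f"{tag}: row count mismatch ({lake_count} vs {src_count})")
    return issues

for issue in reconcile(source, lake):
    print(issue)
```

Multiply that by every source system, every tag and every schema change, and you can see why keeping the lake in step is a full-time job.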

The data lake has to have some sort of structure (schema); whether it’s SQL or NoSQL, you have to decide how to structure the data in the data lake. It’s virtually impossible to design a data lake without reference to the future use cases, and usually the first use case will drive most of the decisions. This means your data lake is heavily biased towards the first use cases, and future use cases will have to make changes to make it work for them.
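
As a hypothetical illustration (the use cases and field names below are invented, not from any real project), a lake structure shaped by a first use case such as daily production reporting rarely fits a later one such as high-frequency condition monitoring:

```python
from dataclasses import dataclass
from datetime import date, datetime

# Structure chosen for use case 1: daily production reporting.
# One row per asset per day is all that use case needs.
@dataclass
class DailyProduction:
    asset_id: str
    day: date
    total_flow_m3: float

# Use case 2 (vibration monitoring) needs sub-second samples per sensor,
# which the daily roll-up above simply cannot represent, so the lake
# structure has to change before the second project can even start.
@dataclass
class VibrationSample:
    sensor_id: str
    timestamp: datetime   # millisecond resolution
    amplitude_mm_s: float
```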

It adds another layer of complexity to your application design and support. When you design a new function you really need to understand the raw data and how it is generated and processed in the source system. If you can’t even access that, you will be designing with one arm tied behind your back. If you have to go through a third party or central team to answer every question, it will significantly slow the process down and add cost.

You have a new single point of failure. OK, it will be in the cloud and hyper-resilient, but if the data gets corrupted or there is a problem with access, everything is affected.

When a new application goes into support, how do you troubleshoot a problem with the data? Who is responsible? The users will always blame the UI, as that is where they interact, but someone then has to troubleshoot the data through the data lake and out to the source system.

The source systems usually have a lot of functionality, in addition to the data, that app designers can leverage. For example, real-time historians can provide raw or interpolated values and are highly optimised for retrieval of time-series datasets. The data lake needs to replicate all these features, otherwise the app will have to recreate them, which makes the app more expensive to build and usually results in slower performance. For example, if I want to plot 1,000 datapoints over 1 year, a real-time historian will just give me 1,000 points, which is a few kilobytes. A SQL database will give me all the data, which could be hundreds of megabytes.
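
Some rough back-of-the-envelope arithmetic shows the scale of the difference. The assumptions here are ours for illustration only: one sample per second in the source system and roughly 16 bytes per stored point (an 8-byte timestamp plus an 8-byte value).

```python
# Back-of-the-envelope arithmetic only: assumes the tag is sampled once per
# second and each stored point costs ~16 bytes (8-byte timestamp + 8-byte value).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60     # ~31.5 million raw samples per tag per year
BYTES_PER_POINT = 16

full_extract_mb = SECONDS_PER_YEAR * BYTES_PER_POINT / 1e6
interpolated_kb = 1000 * BYTES_PER_POINT / 1e3   # a historian returns ~1 point per pixel

print(f"Full SQL extract: {SECONDS_PER_YEAR:,} rows, ~{full_extract_mb:.0f} MB")
print(f"Interpolated historian request: 1,000 points, ~{interpolated_kb:.0f} KB")
```

Under those assumptions the full extract is around half a gigabyte per tag per year, against a few tens of kilobytes for the interpolated request; either way, the difference is several orders of magnitude.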

written by

Murray Callander

posted on

May 11, 2020
