Sometimes it is the right answer, but a significant portion of the time it’s not.
Myth 2: A data lake will solve all my digital problems
So why is that? Well, here's our experience.
A data lake is another system that you have to build and manage. Configuring the data lake at the start of a project adds a lot of complexity. (And the data lake always needs some new feature for every project!) Then, once in production, someone has to make sure that the data lake always has the latest data and is a 100% match against the source systems. This is a very big job!
The data lake has to have some sort of structure (schema). Whether it's SQL or NoSQL, you have to decide how to structure the data in the data lake. It's virtually impossible to design a data lake without reference to future use cases, and usually the first use case will drive most of the decisions. This means your data lake is heavily biased towards the first use case, and future use cases will have to make changes to make it work for them.
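As a hypothetical illustration of that bias (the sensor names and layout here are invented), suppose the first use case is trending a single sensor, so the lake gets keyed by sensor ID. A later "plant-wide snapshot" use case then has to scan every sensor's entire history:

```python
# Hypothetical lake layout chosen for the FIRST use case:
# trend one sensor over time -> data keyed by sensor_id.
lake = {
    "pump_01": [("2024-01-01T00:00", 3.2), ("2024-01-01T00:01", 3.3)],
    "pump_02": [("2024-01-01T00:00", 7.1), ("2024-01-01T00:01", 7.0)],
}

# Use case 1 is cheap: a single key lookup.
trend = lake["pump_01"]

# Use case 2 arrives later: "all readings across the plant at time T".
# With this layout it must walk every sensor's whole history.
def snapshot(lake, ts):
    return {sensor: value
            for sensor, rows in lake.items()
            for t, value in rows
            if t == ts}

snap = snapshot(lake, "2024-01-01T00:00")
# A layout keyed by timestamp would simply invert the trade-off,
# making use case 2 cheap and use case 1 expensive.
```

Neither layout is wrong; the point is that whichever use case comes first tends to pick the keys, and everyone after that pays for it.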
It adds another layer of complexity to your application design and support. When you design a new function you really need to understand the raw data: how it is generated and processed in the source system. If you can't ever access that, you will be designing with one arm tied behind your back. If you have to go through a third party or a central team to answer every question, it will significantly slow the process down and add cost.
You have a new single point of failure. OK, it will be in the cloud and hyper-resilient, but if the data gets corrupted or there is a problem with access, everything is affected.
When a new application goes into support, how do you troubleshoot a problem with the data? Who is responsible? The users will always blame the UI, as that is where they interact, but someone then has to trace the data back through the data lake and out to the source system.
The source systems usually have a lot of functionality, in addition to the data, that app designers can leverage. For example, real-time historians can provide raw or interpolated values and are highly optimised for retrieval of time-series datasets. The data lake needs to replicate all these features; otherwise the app will have to recreate them, which makes the app more expensive to build and usually results in slower performance. For example, if I want to plot 1,000 datapoints over one year, a real-time historian will just give me 1,000 points, which is a few kilobytes. A SQL database will give me all the raw samples, which could be hundreds of megabytes.
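The size gap above can be sketched with synthetic data (the functions below are a stand-in for the historian and the SQL query, not any real product's API). One year of 1-minute samples is roughly 525,600 rows; a historian-style interpolated read collapses that to 1,000 evenly spaced points on the server side:

```python
import numpy as np

def raw_query(timestamps, values, start, end):
    """SQL-style read: return every raw sample in the time window."""
    mask = (timestamps >= start) & (timestamps <= end)
    return timestamps[mask], values[mask]

def interpolated_query(timestamps, values, start, end, n_points=1000):
    """Historian-style read: n evenly spaced, linearly interpolated values."""
    grid = np.linspace(start, end, n_points)
    return grid, np.interp(grid, timestamps, values)

# One year of 1-minute samples (~525,600 rows) of a noisy daily cycle.
one_year = 365 * 24 * 60
ts = np.arange(one_year, dtype=np.float64)  # minutes since start of year
vals = np.sin(ts / 1440.0) + np.random.default_rng(0).normal(0.0, 0.1, one_year)

_, raw = raw_query(ts, vals, 0, one_year - 1)
_, small = interpolated_query(ts, vals, 0, one_year - 1)

print(f"raw rows: {len(raw):,} (~{raw.nbytes / 1e6:.1f} MB of values alone)")
print(f"interpolated: {len(small):,} points (~{small.nbytes / 1e3:.0f} KB)")
```

Even counting only one float64 column, the raw read is thousands of times larger than what the chart actually needs, and a real table (timestamps, quality flags, extra columns) widens the gap further.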