As digital enablement becomes a pressing concern, digital twin technology is becoming increasingly prevalent, and twin graphs and knowledge graphs are increasingly in use. So which of these options is the best approach for your digital strategy’s data layer when you’re digitising operations? Both come with their own pros and cons, so how can you choose the one that’s right for you? Here, we take a closer look so you can make a well-informed decision.
Cloud Based Data Lakes And Directly Connecting To A Source System – The Potential Issues
When it comes to data visualisation, the architecture of cloud based data lakes looks simple on paper, which makes the concept appealing to anyone looking for an effective business intelligence tool. After all, if all of the data flows into the lake, people only need to connect to a single location to view everything – surely that makes a data lake a convenient data visualisation tool to harness? In practice, however, things aren’t that simple.
Simply copying all of the data into this kind of system takes a long time, and speed is typically of the essence when it comes to delivering something useful for users. In reality, therefore, what happens is a cyclical process in which the source system and use case are prototyped against each other to define which data set needs copying up to the data lake. That data is then copied up before development begins against the data lake. Inevitably, what follows is an iterative process of discovering elements that have been incorrectly transferred or are missing altogether.
In practice, this means that every use case must be built with direct reference to the source system anyway. Developing new functionality solely against a data lake will always present difficulties because of nuances in the data which have to be understood. In reality, data lakes are inevitably built on a use-case by use-case basis, with every new use case expanding the data lake’s scope.
Although directly connecting to a source system offers some convenience benefits, it presents some issues too. There are often security concerns around giving direct access to the database, especially for systems that sit on sensitive networks. If networks have only been designed for a handful of users, or run on old hardware, it’s unlikely that they will be able to cope with multiple simultaneous connections or large ad-hoc requests. Furthermore, the data structures may be obscure, with information that the user sees being generated in the application UI itself rather than stored in the database.
What Are The Advantages Of Copying Data Sets Into A Data Warehouse Or Cloud Based Data Lake?
If you’re considering data integration into a cloud based data lake or data warehouse, you need to be aware of the advantages of this option:
- Performance – this is usually the top selling point of data warehouses or cloud based data lakes. Large volumes of data can be provided to multiple users rapidly.
- Ability to query easily across several data sources – since all data is in one location, every query can be written in the same language.
- Easier control of user access – there is only a single access point and granular control is usually possible over the data which is available to each user who tries to engage with the system.
What Are The Disadvantages Of Copying Data Sets Into A Data Warehouse Or Cloud Based Data Lake?
Integrating data into a cloud based data lake or data warehouse also presents some issues including:
- Quality Assurance efforts – it requires a lot of ongoing work to ensure the data lake remains consistent. With gigabytes of data being generated on a daily basis, keeping the data lake’s data consistent represents a huge additional workload. This raises the issue of whether or not the data stored in the cloud is up to date.
- Entropy of information increases with copying – every time information is copied from the source system, there is an opportunity for detail to be lost or corrupted, so the entropy level will always increase.
- In practice the data lake can never have all of the data – there will inevitably be a data set which is required but is missing from the data lake, for example in the case of detailed data being kept by vendors with only aggregate or summary data being available externally.
- Slow to develop against – to define the dataset required in the data lake an extra stage is required in the development process.
What Are The Advantages Of Connecting Directly To The Source Systems?
If you would rather avoid fusing data in a data lake, you need to know whether connecting directly to source systems is a better option. This option presents the following benefits:
- Integrity of data – the data visualised is the data in the source systems.
- Easy troubleshooting – when the data in the visualisation layer is wrong there’s only a single place that requires checking.
- Simpler data flows.
- Trying new things is easier as there is no extra layer required. This means fewer people are involved and less co-ordination is required allowing development to proceed rapidly.
What Are The Disadvantages Of Connecting Directly To The Source Systems?
- Multiple simultaneous clients may not be supported – depending on the source system’s performance, it may be incapable of handling multiple clients simultaneously, especially 10- or 20-year-old legacy systems.
- Tracking can be problematic – if many user level systems connect across multiple source systems tracking access can be difficult.
Case Study – Corrective Maintenance
If work on a facility is both planned and tracked in a maintenance management system then every piece of work or job will have its own associated workorder. Within this workorder, there’ll be schedules, budgets, lists of dependencies and tasks, among other things.
The workorder will also have its own lifecycle in which it progresses through various stages from initial concept to scheduled and completed or cancelled. Interaction with the source system will only ever reveal the status of the workorder at the present time, so only a single version of the object will be available. However, if this data is then copied to a data lake, there are two options:
- Creating a new object each time the workorder’s status changes to allow the status to be seen at any point in time;
- Updating the status against the existing object. This will require a mechanism to detect changes at all levels in the source system.
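To make the two options concrete, here is a minimal Python sketch of both storage strategies. The class and field names (WorkorderHistoryStore, valid_from and so on) are purely illustrative assumptions, not part of any particular maintenance management system:

```python
from datetime import datetime, timezone

class WorkorderHistoryStore:
    """Option 1: append a new, timestamped record on every status change,
    so the workorder's status can be seen at any point in time."""

    def __init__(self):
        self.rows = []

    def record_status(self, workorder_id, status):
        self.rows.append({
            "workorder_id": workorder_id,
            "status": status,
            "valid_from": datetime.now(timezone.utc),
        })


class WorkorderCurrentStore:
    """Option 2: keep a single record per workorder and overwrite its
    status in place, so only the current state is available."""

    def __init__(self):
        self.rows = {}

    def record_status(self, workorder_id, status):
        self.rows[workorder_id] = {"workorder_id": workorder_id, "status": status}
```

With option 1, two status changes produce two rows; with option 2, the same changes leave a single row holding only the latest status – which is exactly the trade-off between history and simplicity described above.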
Which option do you choose?
The answer will probably depend on the use case itself. If analysing the time that each workorder spends in its various stages is important, then a time-stamped history of each object change is necessary. On the other hand, if users only ever need the current status of each workorder, option 2 will be the best choice.
Issues can arise if you opt for option 1 when users would have been better served by option 2: the user experience becomes considerably more complex, since all of the object’s superseded versions have to be filtered out.
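That filtering step might look something like the following hypothetical sketch, which keeps only the most recent version of each workorder, assuming each row carries a workorder_id and an orderable valid_from timestamp:

```python
def latest_versions(rows):
    """Return only the most recent version of each workorder from a
    versioned history table (option 1). Each row is assumed to be a dict
    with 'workorder_id' and an orderable 'valid_from' value."""
    latest = {}
    for row in rows:
        wid = row["workorder_id"]
        if wid not in latest or row["valid_from"] > latest[wid]["valid_from"]:
            latest[wid] = row
    return list(latest.values())
```

Every dashboard or report built on an option 1 store has to repeat a step like this, which is the extra complexity the paragraph above warns about.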
So, Which Option Is Best?
So, when it comes to deciding whether to copy or not, the answer really depends on your own use case.
If you place more importance on performance than on the accuracy of any single record, a data lake represents the best option. Machine learning research and algorithm training are ideal examples, since huge volumes of data are required and users don’t want to hit the source systems repeatedly. Also, since training datasets are historic and won’t change, keeping them up to date is no problem.
Conversely, if you place more importance on accuracy than performance, for example in the case of interactive dashboards revealing job status information in real time, it’s better to connect directly to the sources. It’s faster to develop and simpler to maintain and support in the long-term.
A Possible Third Way
To make things even more complex, there is a third option – an integration layer or API that effectively connects directly to the source system while also adding an additional security layer (and possibly caching too). This can solve the issues surrounding data quality and security while also helping to improve performance.
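As a rough illustration of the idea, the sketch below combines the two roles such a layer plays – access control and caching – in front of a slow source system. All names here (CachingApiLayer, fetch_from_source, the user list) are assumptions for illustration, not a reference design:

```python
import time

class CachingApiLayer:
    """Minimal sketch of an integration layer: checks user access, then
    serves reads from a short-lived cache so the source system is only
    queried when a cached value has expired."""

    def __init__(self, fetch_from_source, allowed_users, ttl=60.0):
        self._fetch = fetch_from_source      # callable: key -> data from the source
        self._allowed = set(allowed_users)   # the additional security layer
        self._ttl = ttl                      # cache lifetime in seconds
        self._cache = {}                     # key -> (expires_at, value)

    def get(self, user, key):
        if user not in self._allowed:
            raise PermissionError(f"{user} may not query the source system")
        hit = self._cache.get(key)
        now = time.monotonic()
        if hit and hit[0] > now:
            return hit[1]                    # serve from cache, no source load
        value = self._fetch(key)             # the single point touching the source
        self._cache[key] = (now + self._ttl, value)
        return value
```

Repeated requests within the cache lifetime never reach the source system, which is how such a layer protects fragile legacy systems while keeping the data’s origin unambiguous.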
Because every API must be configured on a case-by-case basis, there isn’t a single place where all the available data can be seen, which may seem like a drawback – but it’s worth remembering that, in practice, data lakes also require an initial definition with reference to the source system. On the downside, these layers add another potential single point of failure to the application architecture. This must be considered when developing support and troubleshooting procedures, so that the layer’s business owner can make sufficient resources available to support clients.
Nevertheless, one fact holds true whichever option you choose: to validate any use case and ensure that users can trust the results, it will always be necessary to check it against raw source data. It’s therefore very likely that some way of accessing raw data in the source system will continue to be required, so that any future issues or queries can be investigated.