Just so you know, I am of course going to bust this myth but there are ways you can speed up the process that I’ll mention at the end.
Oh well, maybe a bit late to the party on that one but anyway, I can pretty much guarantee that an industrial asset that’s been operating for years will have built up a “Data Debt”: loads and loads of information spread across different systems, in inconsistent formats, in custom databases and archives, with missing fields and strange naming conventions…
And when it comes to digitising work processes and automating things to lower costs, this Data Debt has to be paid off. It’s a really annoying obstacle holding you back from all the cool new possibilities, and no-one wants to spend any time or money sorting it out if at all possible because it’s just not sexy. Like the foundations of a house, no-one will ever see all that hard work.
There have probably also been multiple previous attempts to solve these data problems before but all have fallen foul of some technical, security or behavioural issue. It’s so tempting to hope that maybe someone has an AI tool that can be run over your data and organise it for you at minimal cost.
Myth 3: AI can untangle my data
Around 15 years ago Autonomy claimed exactly this, but it turned out to be a myth. In reality this is a really, really hard problem for AI to solve. Here’s why, and then I’ll talk about some of the ways you can at least speed it up:
AI has to be trained to do a job: the first thing you need is lots of examples of all the data problems you are trying to sort out so that you can train an AI tool to find and fix them.
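To make that concrete, here is a minimal sketch (every tag name and value below is invented for illustration) of the kind of labelled training data an AI tool would need before it could fix anything:

```python
# Hypothetical sketch of what "lots of examples" means in practice: a
# training set pairing raw values from a legacy system with the value a
# human says they should have been. All names and values are invented.
labelled_examples = [
    ("PMP-001a", "PMP-001A"),  # inconsistent casing
    ("P 001",    "PMP-001"),   # missing prefix, stray space
    ("PMP001B",  "PMP-001B"),  # missing separator
    ("???",      None),        # unrecoverable: even a human can't fix it
]

# A useful model needs hundreds or thousands of such pairs *per problem
# type*, and a human has to create every one of them.
print(f"{len(labelled_examples)} labelled examples so far")
```

Multiply that labelling effort by every distinct data problem in every system and the cost becomes clear.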
Data is in lots of different formats, and very often spreadsheets: understanding what data is in a well-structured relational database is one thing; extracting data from spreadsheets is a whole different challenge!
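To illustrate the spreadsheet problem, here is a minimal sketch (the CSV export and its contents are invented) of the junk-row filtering a typical sheet forces on you before you even reach the actual records:

```python
import csv
import io

# A typical spreadsheet export mixes a title row, blank rows and a totals
# row in with the real records. This example export is invented.
raw = """Monthly Inspection Log,,
,,
Tag,Date,Result
PMP-001,2023-01-05,OK
PMP-002,2023-01-06,FAIL
,,
TOTAL,2,
"""

records = []
for row in csv.reader(io.StringIO(raw)):
    # Keep only rows that look like real records: a non-empty first cell
    # that isn't the title, the header or the totals line.
    if row and row[0] not in ("", "Tag", "TOTAL", "Monthly Inspection Log"):
        records.append({"tag": row[0], "date": row[1], "result": row[2]})

print(records)
```

A relational database would have handed you those two records directly; the spreadsheet makes you reverse-engineer the layout first, and every sheet has a different layout.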
The data isn’t always accessible: even worse than being in different formats is being inaccessible. Whether it’s in PDF scans of documents or sitting on users’ hard drives, you don’t know what you don’t know. Yes, OK, there are tools like Google’s Cloud Vision API that can convert images to text, but processing that text is still a challenge.
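As a sketch of the “processing that text” step (the tag format and the specific OCR confusions handled are assumptions for illustration), even once OCR has produced text you still need code like this to repair classic misreads:

```python
import re

# Hypothetical sketch: OCR commonly reads the letter O where a digit 0
# belongs, and a lowercase l where a 1 belongs. Repair those inside the
# numeric part of an equipment tag. The "ABC-123" tag format is invented.
def fix_tag(ocr_text: str) -> str:
    """Normalise a tag where OCR confused O/0 and l/1 after the hyphen."""
    def fix_digits(match):
        return match.group(0).replace("O", "0").replace("l", "1")
    # Only touch the characters after the hyphen, where digits belong.
    return re.sub(r"(?<=-)[0-9Ol]+", fix_digits, ocr_text)

print(fix_tag("PMP-OO1"))  # -> PMP-001
print(fix_tag("VLV-1l2"))  # -> VLV-112
```

And that is just one confusion in one field format; the long tail of OCR errors is what makes scanned documents so expensive to ingest.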
The relationships between the data may not be written down anywhere: getting data out of a system is one thing, but knowing what it means, its context, is just as important. Information without context is meaningless. Field names in databases are rarely descriptive, and often the relationships between the data in different systems are just understood by people; they aren’t written down anywhere.
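A tiny sketch of what that tribal knowledge looks like once it finally gets written down (all IDs and filenames here are invented): the maintenance system and the document archive use different identifiers for the same asset, and only a hand-built crosswalk connects them.

```python
# Hypothetical sketch: nothing in either system records that maintenance
# asset "EQ-0042" is the same pump the archive files under "P-101".
# That mapping lives in people's heads until someone writes it down:
maintenance_to_archive = {
    "EQ-0042": "P-101",
    "EQ-0043": "P-102",
}

maintenance_records = [("EQ-0042", "seal replaced"), ("EQ-0043", "overhaul")]
archive_docs = {"P-101": "pump_datasheet.pdf", "P-102": "pump_manual.pdf"}

# Only once the crosswalk exists can the two systems be joined:
for eq_id, work in maintenance_records:
    doc = archive_docs[maintenance_to_archive[eq_id]]
    print(f"{eq_id}: {work} (see {doc})")
```

No AI tool can infer that dictionary from the data alone; it has to come out of someone’s head.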
Quantitative vs qualitative: let’s say it was possible to have an AI tool automatically categorise all your data. How will you know if it found everything, and whether it did it correctly? If you now build systems on top of this that make implicit assumptions about the veracity of the data, how can you be sure the results are trustworthy? Has it really returned all the instances of ‘x’? And what do you do if you find it hasn’t?
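One pragmatic answer, sketched below with invented numbers, is to manually review a random sample of the AI’s output and estimate precision and recall from it; that tells you roughly how far to trust the rest:

```python
# Hypothetical sketch: for each sampled record, pair what the AI said
# with what a human reviewer says. All labels here are invented.
# (ai_found_x, human_confirms_x)
reviewed_sample = [
    (True, True), (True, True), (True, False),  # AI found 3, 1 wrongly
    (False, True),                              # AI missed one
    (False, False), (False, False),             # correctly left alone
]

true_pos = sum(1 for ai, human in reviewed_sample if ai and human)
false_pos = sum(1 for ai, human in reviewed_sample if ai and not human)
false_neg = sum(1 for ai, human in reviewed_sample if not ai and human)

precision = true_pos / (true_pos + false_pos)  # of what it found, how much is right
recall = true_pos / (true_pos + false_neg)     # of what exists, how much it found

print(f"precision={precision:.2f} recall={recall:.2f}")
```

The catch, of course, is that the review sample itself is manual work, and a low recall estimate sends you straight back to cleaning data by hand.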
So what can I do to speed up the process?
The good news is that there are now a lot of tools that can help you untangle your data and pay off the “Data Debt” more quickly, from Data Wrangler from Trifacta through to Python libraries such as Dora and ftfy (https://mode.com/blog/python-data-cleaning-libraries/). Finding scripting-literate people with an engineering, data science or programming background will greatly speed up building your information ingestion and processing workflows.
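As a sketch of the kind of scripting that pays the debt down (the date formats and values below are invented, standing in for the three formats three different legacy systems might use):

```python
from datetime import datetime

# Hypothetical sketch: normalise dates recorded three different ways
# across legacy systems into one ISO format. The formats are assumptions.
FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d-%b-%y"]

def normalise_date(raw: str) -> str:
    """Return an ISO date, trying each known legacy format in turn."""
    raw = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass  # wrong format, try the next one
    raise ValueError(f"unrecognised date: {raw!r}")

for raw in ["05/01/2023", "2023-01-06", " 7-Jan-23 "]:
    print(normalise_date(raw))
```

An afternoon of scripting like this, by someone who knows the data, routinely beats months of trying to train a model to do the same job.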
And if you have a big enough project to justify it, you could build a dedicated AI tool, as Equinor have done to extract data from incident reports (SPE-195750-MS, Equinor ML for Operational Risk, https://www.onepetro.org/conference-paper/SPE-195750-MS), but that was built for a specific purpose and it certainly wasn’t cheap or fast.
So maybe I should qualify this myth and say that AI can be used to sort out your data, but it’s probably the most expensive and slowest way of doing it!