This is the elephant in the Big Data room…
How much “Bad Data” is out there?
Wait a minute, I hear you say, do you mean that not all of this data is accurate? What is the point of salivating over terabytes of Big Data if we don’t understand which bits are inaccurate, duplicated, incorrect or incomplete?
That is the right question to be asking.
The consequences of cyber attacks are well documented, but in a world that is increasingly reliant on data, the costs of Bad Data could be even higher. The U.S. economy is estimated to lose $3 trillion per year to Bad Data. I don’t need to tell you that this is a lot of money…
As Big Data grows ever more ambitious in its scale, might we be forgetting that it is worthless unless it has a reasonable degree of accuracy? You may be able to draw conclusions, but are they the right conclusions? In many crucial industries, from healthcare to banking, big data mistakes can cause havoc. They could even cost lives.
Research from Experian Data Quality shows that inaccurate data has a direct impact on the bottom line of 88% of companies, with the average company losing 12% of its revenue. Corporate data is growing by 40% per year, and with improvements in technology, that rate is increasing. There is a race to the top in terms of data volumes and complexity, but data quality is often the poor relation standing to one side, pleading, “what about me, guys?”
Billions have been thrown at the National Health Service in the UK in an attempt to get it to modernise its IT systems. What is the main reason behind the failure (apart from gross incompetence)? It cannot cope with the variety and validity of the data scattered all over the organisation.
If your data is dirty, you need to give it a clean. This, in my view, is the biggest challenge that Big Data faces. It doesn’t matter how many super-intelligent Data Scientists you hire – they have to be confident that they are making sound judgements based on sound numbers.
Practising good data management should be fundamental to the Big Data movement. Executives should question the validity of the data before they are impressed by its size. Systems should be designed with this in mind, and data collection should be simplified to minimise the risks of deficient data.
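To make this concrete, here is a minimal sketch of the kind of validation-at-collection the paragraph describes: checking records for missing fields, malformed values and duplicates before they enter a larger pipeline. The field names and rules are hypothetical illustrations, not a real schema.

```python
# Sketch: reject bad records at collection time rather than cleaning later.
# The schema (id, email, amount) is an assumed example for illustration.
import re

REQUIRED_FIELDS = {"id", "email", "amount"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    """Return a list of problems found in a single record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    email = record.get("email", "")
    if email and not EMAIL_RE.match(email):
        problems.append(f"malformed email: {email!r}")
    if not isinstance(record.get("amount", 0), (int, float)):
        problems.append("non-numeric amount")
    return problems

def clean(records):
    """Split records into accepted/rejected, dropping duplicate ids."""
    seen_ids = set()
    accepted, rejected = [], []
    for r in records:
        problems = validate(r)
        if r.get("id") in seen_ids:
            problems.append(f"duplicate id: {r.get('id')}")
        if problems:
            rejected.append((r, problems))
        else:
            seen_ids.add(r["id"])
            accepted.append(r)
    return accepted, rejected

records = [
    {"id": 1, "email": "a@example.com", "amount": 9.99},
    {"id": 1, "email": "a@example.com", "amount": 9.99},  # duplicate id
    {"id": 2, "email": "not-an-email", "amount": 5.00},   # malformed email
    {"id": 3, "amount": "ten"},                           # missing field, bad type
]
accepted, rejected = clean(records)
print(len(accepted), len(rejected))  # 1 accepted, 3 rejected
```

The point is not this particular schema but the principle: cheap, explicit checks at the point of entry mean the Data Scientists downstream inherit numbers they can actually trust.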
The industry is moving a little too fast in my opinion. If anything is going to make it crash spectacularly, it will be Bad Data. The potential of the industry to do good for society and humanity is too big to sweep these issues under the carpet. Let’s think about them and do something about them before it is too late.
You can’t trust the insights when you can’t trust the inputs.