At A Glance – Dirty Data

Data makes the world go round, but with more data come more mistakes

In the era of Big Data, incomplete, inaccurate, or inconsistent information is difficult to avoid, and it can distort insights. This misleading information has come to be known as ‘dirty data’, and it is usually the result of defective data management and storage. The Data Warehouse Institute estimates that dirty data costs US businesses more than $600bn a year.

Dirty data is one of many terms associated with the data revolution. It differs from dark data in that dark data is merely overlooked or underused – dirty data is not passive, and can actively skew insights for the worse. It doesn’t take a data scientist to work out why this is problematic. Insights based on misinformation can be especially damaging for businesses that rely on them to make strategic decisions. If a decision-making process is flawed, then the decisions it produces will be too.

Despite this, there are steps that can be taken to reduce the amount of dirty data. Firstly, businesses can train their algorithms to recognise anomalies. While algorithms are invaluable tools, it’s also possible to apply common sense and human insight to anomalous entries: sometimes, data that appears to be dirty is actually accurate, and it takes human investigation to work this out. In some cases, incorrect data can even be turned to the company’s benefit. Google, for example, has kept data about misspellings to improve its spellcheck functions.

Perhaps the most useful way to prevent the build-up of dirty data, however, is to pursue a policy of careful data entry. If time and attention are given to data input, mistakes are less likely to be made. There will always be some level of human error, but it can be reduced by setting out clear guidelines – and, where possible, by checking entries against those guidelines automatically (both ideas are sketched below). This will only become more important as Big Data continues to run the world.
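
To make the first idea concrete, here is a minimal sketch of automated anomaly flagging in Python. The DataFrame and its order_value column are hypothetical, and the median-based rule is just one of many possible approaches; in keeping with the point above, flagged rows are routed to a human reviewer rather than deleted, since some apparent anomalies turn out to be accurate.

# A sketch of automated anomaly flagging, assuming a pandas DataFrame
# with a hypothetical 'order_value' column.
import pandas as pd

def flag_anomalies(df: pd.DataFrame, column: str, threshold: float = 3.5) -> pd.DataFrame:
    # Modified z-score: distance from the median, scaled by the median
    # absolute deviation (MAD), which is robust to the very outliers
    # being hunted. The 0.6745 factor makes the score comparable to a
    # standard z-score for normally distributed data.
    median = df[column].median()
    mad = (df[column] - median).abs().median()
    out = df.copy()
    out['needs_review'] = 0.6745 * (df[column] - median).abs() / mad > threshold
    return out

# Flagged rows go to a human reviewer rather than being deleted.
orders = pd.DataFrame({'order_value': [42.0, 39.5, 41.2, 40.8, 4100.0]})
print(flag_anomalies(orders, 'order_value'))

And for the second idea, a sketch of enforcing data-entry guidelines at the point of input, so that problems are caught before they ever reach storage. The field names and rules are hypothetical stand-ins for whatever a business’s own guidelines specify.

# A sketch of validating a single data-entry record against simple,
# hypothetical guidelines before it is stored.
def validate_entry(record: dict) -> list[str]:
    problems = []
    if not record.get('customer_id'):
        problems.append('customer_id is missing')
    value = record.get('order_value')
    if not isinstance(value, (int, float)) or value < 0:
        problems.append('order_value must be a non-negative number')
    return problems

# Records with problems are rejected or routed for correction up front,
# rather than cleaned up later.
print(validate_entry({'customer_id': '', 'order_value': -5}))
# -> ['customer_id is missing', 'order_value must be a non-negative number']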