How Clean Is Your Data? 

Without clean and trusted data, analytics is impossible

Data oils the wheels of business. Without relevant and accurate data, it is virtually impossible to function as an organisation. But therein lies the problem – collecting data is only half of the battle. If information is outdated, duplicated, or simply wrong, then it becomes more of a hindrance than a help. The answer? Good data hygiene.

Suman Nambiar, Head of Strategy at Mindtree’s Partners and Offerings branch, discusses data hygiene and the techniques and processes used to maintain clean and trustworthy data. With so much data floating around in the digital ether, how can organisations make sure that their datasets aren’t starting to smell?

Dirty data, dirty decisions

All businesses are obligated to make sure that the data they collect is safe and secure, but also that it is clean. Worryingly, in 2017, Harvard Business Review research claimed that just three per cent of the companies surveyed met basic data quality standards. Thanks to recent regulations, knowing precisely where data has come from and where it is stored is all the more important – especially when it comes to personally identifiable information (PII).

“Most businesses are striving to become insight-driven organisations,” says Nambiar, who has over two decades of experience in business development. “To do this, they need to apply different analytical techniques which include information such as descriptive insights and automated insights. At the core of using this data successfully, there must be trustworthy or clean data, which is the essence of data hygiene.”

Without good data hygiene, organisations take a considerable number of risks. At the lowest level, they are likely to miss out on impactful insights. Another consequence of allowing a build-up of dirty data is non-compliance. Given the recent regulatory attention to data protection – namely GDPR – it makes good business sense to prepare for incoming legislation. Failing to do so can mean paying a hefty fine and facing stakeholder disillusionment. Ultimately, basing decisions on inaccurate or confused datasets will damage the business.

Cleaning up your data act

Organisations can take a number of steps to ensure that they have good data hygiene. The first, says Nambiar, is to make sure that the data they are working with is accessible and transparent.

“Usually, good data hygiene has to be supported by a data infrastructure – an example of such an infrastructure would be a data lake. Organisations need to back this with data standardised rules which help set the ground rules. This in turn should be backed by a data auditing exercise which helps organisations understand where the data comes from as well as if there are any deviations in the data.”
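As a rough illustration of the auditing exercise Nambiar describes, the Python sketch below counts where records came from and flags any values that deviate from a set of standardised rules. The field names, the `source` key and the rules themselves are assumptions made for this example, not anything prescribed by Mindtree.

```python
from collections import Counter

# Hypothetical standardised rules: each field maps to a check that
# returns True when the value conforms. Real rules would be set by the business.
RULES = {
    "email": lambda v: isinstance(v, str) and v.count("@") == 1 and "." in v.split("@")[-1],
    "country": lambda v: isinstance(v, str) and len(v) == 2 and v.isupper(),
}


def audit(records):
    """Count records per source and list every deviation from RULES."""
    # Where does the data come from? Records with no source are counted as "unknown".
    sources = Counter(r.get("source", "unknown") for r in records)
    # Which values break the rules? Store (record index, field, offending value).
    deviations = []
    for i, record in enumerate(records):
        for field, check in RULES.items():
            if not check(record.get(field)):
                deviations.append((i, field, record.get(field)))
    return sources, deviations


# Made-up sample data for the example.
records = [
    {"source": "crm", "email": "ana@example.com", "country": "GB"},
    {"source": "web_form", "email": "not-an-email", "country": "gb"},
    {"email": "li@example.com", "country": "IN"},
]
sources, deviations = audit(records)
```

Here the second record would be flagged twice, once for its email and once for its lower-case country code, while the third would show up as having no recorded source.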

Even so, it’s easy for data lakes to turn into data swamps without proper classification. One of the most effective technological remedies for disparate data is AI.

“We are seeing organisations applying automated techniques backed by artificial intelligence and machine learning for data discovery, matching and cleansing data,” says Nambiar. “While we can have definitive guidelines, without an automated framework it is going to be costly and, more importantly, time-consuming, and could also risk human errors.”

Not all organisations have the luxury of in-house machine learning capabilities – but through open-source software like Apache Griffin, they can access a more economical solution.

“Apache Griffin offers a unified process to measure data quality from different perspectives, which helps build trusted data assets. It offers a set of well-defined data quality domain models, which can cover most data quality problems and help users define their quality criteria,” says Nambiar. “By extending the domain-specific language (DSL), users are even able to implement their own specific features and functions.”

The need to have good data hygiene has presented an opportunity for businesses like Informatica and Talend, which have released specific products for corporate data cleansing. This involves removing data that is incorrect, incomplete, duplicated, or that has been formatted in the wrong way.
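To make those four kinds of dirty data concrete, here is a minimal Python sketch of a cleansing pass. It is not how Informatica or Talend work internally; the `name` and `email` fields, the simple email check and the choice to deduplicate on email are all assumptions made for illustration.

```python
def cleanse(records, required=("name", "email")):
    """Return records with formatting fixed and incomplete, incorrect
    or duplicated rows removed. Field names are hypothetical."""
    seen_emails = set()
    clean = []
    for record in records:
        # Wrongly formatted: trim stray whitespace and lower-case the email.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        if isinstance(row.get("email"), str):
            row["email"] = row["email"].lower()
        # Incomplete: skip rows with a missing or empty required field.
        if any(not row.get(field) for field in required):
            continue
        # Incorrect: a deliberately simple, assumed validity check.
        if not isinstance(row["email"], str) or row["email"].count("@") != 1:
            continue
        # Duplicated: keep only the first row seen for each email.
        if row["email"] in seen_emails:
            continue
        seen_emails.add(row["email"])
        clean.append(row)
    return clean


# Made-up sample data for the example.
raw = [
    {"name": " Ana Silva ", "email": "Ana@Example.com "},  # badly formatted
    {"name": "Ana Silva", "email": "ana@example.com"},     # duplicate
    {"name": "", "email": "bo@example.com"},               # incomplete
    {"name": "Li Wei", "email": "li.example.com"},         # incorrect
    {"name": "Sam Okoro", "email": "sam@example.com"},
]
cleaned = cleanse(raw)
```

Of the five sample rows, only the first Ana Silva entry (with its formatting tidied) and Sam Okoro would survive. Note that formatting is fixed before duplicates are checked, otherwise the two Ana Silva rows would not be recognised as the same person.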

A data detox

The availability of ready-to-use software has made life much easier for SMEs by taking away the expense of sourcing talent and embarking on an internal data cleanse. That said, even once the appropriate infrastructure and technologies have been put in place, good data hygiene is heavily reliant on the behaviour of all employees. There is no silver bullet: keeping data clean is a constant and crucial chore.

“One has to understand that this is a continuous challenge,” explains Nambiar. “When data is generated and modified on a continual basis, the quality measurement needs to be too.”
