The Challenge Of Data Curation

Why the wide variety and diverse origins of business data call for specialist data curation

Data underpins business intelligence (BI) and analysis, and businesses are now dealing with greater volumes of it than ever before.

In most organisations, data is typically collected from a diverse range of sources, and a carefully curated data set is a lot more valuable than the sum of its parts. Yet there are many challenges involved in curating data sets that inform BI and enable analytical activity.

A mixed bag

Typically, enterprises store information in a variety of different applications and business systems. As a consequence, organisations must adopt a range of data curation processes to unify data from each of these sources. A database can be directly queried for information, whereas extracting data from a business application might require a deeper understanding of the application’s data export APIs or its database schema.

The data curation process often deals with heterogeneous data formats and diverse data sets from different business units. One of the biggest challenges is to ensure data formats are normalised to fit into an organised structure before they can be used to run analytics.
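As a rough illustration of what normalising formats can involve, the sketch below maps records from two business units onto one shared structure. The field names, date formats and sample values are invented for the example, and a real curation pipeline would need far more validation than this.

```python
from datetime import datetime

# Hypothetical exports from two business units (all names and values are invented).
sales_rows = [{"Customer": " Acme Ltd ", "OrderDate": "03/02/2021", "Total": "1,250.50"}]
support_rows = [{"customer_name": "acme ltd", "opened": "2021-02-05", "amount": 80}]


def normalise(row, mapping, date_format):
    """Map one source row onto the shared schema: customer, date, amount."""
    customer = str(row[mapping["customer"]]).strip().lower()
    date = datetime.strptime(row[mapping["date"]], date_format).date().isoformat()
    amount = float(str(row[mapping["amount"]]).replace(",", ""))
    return {"customer": customer, "date": date, "amount": amount}


# Each source declares how its own fields map onto the shared schema.
sales_mapping = {"customer": "Customer", "date": "OrderDate", "amount": "Total"}
support_mapping = {"customer": "customer_name", "date": "opened", "amount": "amount"}

unified = (
    [normalise(r, sales_mapping, "%d/%m/%Y") for r in sales_rows]
    + [normalise(r, support_mapping, "%Y-%m-%d") for r in support_rows]
)
```

Once both sources share the same field names, date format and number format, they can be analysed together as a single data set.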

Best practices for curating data sets for BI and analytics users

Organisations need to recognise the importance of data curators, at both the organisational and departmental level. Curators not only check the validity and relevance of the data being collected for analysis in their respective departments, but also ensure the data fits a uniform structure. Individuals who have experience handling data sets and a good understanding of how inter-departmental processes fit into the overall business are therefore ideal for the role.

Data collection points that are relevant to the business should also be identified by data curators. Since BI and analytics users largely depend on structured data, good practice requires structuring data at the source. This means setting up data cleansing processes at the points where data is generated, resulting in structured data – ready for consumption by analytics platforms or for storage in data warehouses.

Meaningful analysis

Enterprises are generating data at an unprecedented rate and from new sources, such as IoT systems, that didn’t exist a few years ago. To run meaningful analysis that benefits the business, curated data sets should be minimal in complexity and should make it easy for analytics users to discover insights that can help steer the business in the right direction.

However, we have to accept that not all information is relevant for analysis. Organisations should be quick to separate useful data from information that is not relevant to the objective they are trying to achieve.

To distinguish relevant data from information that doesn’t add any business value, enterprises must establish clear goals for what they expect to achieve by running analytics. Neglecting this step could make it difficult to find useful answers buried under heaps of irrelevant data.

Data curation is an ongoing process of organising and maintaining quality data for repeated use over time. Enterprises should always include processes that periodically check the quality of the data being managed.
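A periodic quality check can start out very simply. The sketch below counts missing values and duplicate records in a list of records. The function name, field names and sample rows are illustrative assumptions rather than a prescribed method, and a scheduler of the organisation's choice would run a check like this at regular intervals.

```python
def quality_report(rows, required_fields):
    """Return basic quality metrics for a list of dict records."""
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for row in rows:
        # Count empty or absent values per required field.
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
        # Treat rows with identical required-field values as duplicates.
        key = tuple(row.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"total": len(rows), "missing": missing, "duplicates": duplicates}


# Invented sample records for illustration only.
sample_rows = [
    {"customer": "acme ltd", "amount": 80.0},
    {"customer": "acme ltd", "amount": 80.0},
    {"customer": "", "amount": 12.0},
]
report = quality_report(sample_rows, ["customer", "amount"])
```

Tracking these figures over time gives curators an early warning when a data source starts to degrade.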

Identifying the right tools for curation

As business systems evolve, the range of platforms that hold information relevant for decision making will increase. Data curation tools should have out-of-the-box connectors or the ability to gather data from different sources, without a lot of tweaking under the hood.

The right solution should provide a good balance between user-driven and machine learning capabilities. Machine learning can automate many repetitive tasks that would otherwise be performed manually, but when it comes to deciding the relevance of data, humans still do a better job. Tools that can learn from human input can evolve to judge the relevance of data sets over time, thus increasing the level of automation.

The right solution should include extract, transform and load (ETL) capabilities to read the data from each source, convert the extracted data from its original form into the required form, and then write the data into the target database.
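The three ETL steps can be sketched with nothing more than the Python standard library. In this minimal example the CSV text, the `orders` table and its column names are assumptions made for illustration, and an in-memory SQLite database stands in for the target database. Commercial curation tools wrap the same pattern in connectors and scheduling.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be a file, an API or another database.
SOURCE_CSV = 'customer,amount\n Acme Ltd ,"1,250.50"\nBeta GmbH,80\n'


def extract(text):
    """Read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Convert raw rows into the form the target expects."""
    return [
        (row["customer"].strip().lower(), float(row["amount"].replace(",", "")))
        for row in rows
    ]


def load(records, conn):
    """Write the transformed records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real target database
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT customer, amount FROM orders ORDER BY customer").fetchall()
```

Keeping the three steps as separate functions makes it easier to swap in a new source or target without touching the transformation logic.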

An increasingly important process

As data becomes increasingly crucial to the decisions businesses make, it is important that organisations invest in the tools to overcome data curation challenges.

The case for investment is clear, since these tools quickly pay for themselves. By enabling greater analysis and significant, previously unseen insights, they support faster decision making and deliver the best possible business results.

For the latest insights from DISRUPTIONHUB, sign up to our weekly newsletter.