At A Glance – Labelled Data

Not all data is created equal

Machine learning models are invaluable tools for understanding data. Deep learning, for instance, is a subset of machine learning that builds Artificial Neural Networks (ANNs) that can make decisions on their own. However, before these models can handle information, they must be trained, and in most cases that training relies on labelled data.

Labelled data is data that has been assigned a label to indicate its topic or content. Essentially all companies have unstructured datasets that can be understood using AI algorithms, but first those algorithms need to be exposed to labelled data. Supervised machine learning models, which learn from worked examples, can’t learn to complete new tasks without it. For this reason, and because it has already been organised, labelled data is far more useful than raw, unlabelled alternatives. It is also harder to obtain and more expensive.
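
To make the difference concrete, here is a minimal Python sketch of labelled versus unlabelled data. The example texts, the "finance" and "healthcare" labels, and the simple word-counting approach are all assumptions chosen for illustration. Real systems use far larger datasets and more capable models.

```python
from collections import Counter, defaultdict

# Hypothetical labelled examples: each text is paired with a human-assigned label.
labelled = [
    ("invoice overdue payment reminder", "finance"),
    ("quarterly payment and invoice summary", "finance"),
    ("patient appointment and prescription", "healthcare"),
    ("prescription refill for patient", "healthcare"),
]

# Unlabelled data: the same kind of text, but with no label attached.
unlabelled = ["overdue invoice notice", "new patient prescription"]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    return word_counts


def predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    scores = {
        label: sum(counts[word] for word in text.split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


model = train(labelled)
predictions = [predict(model, text) for text in unlabelled]
print(predictions)
```

Without the labels in the first list, the toy model would have nothing to learn from. The labels are what connect each piece of raw text to a category.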

Usually, human teams are responsible for taking unlabelled data and tagging it with the appropriate categories. This is a time-consuming and specialist task, which requires the curation of enough examples to reliably teach algorithms. Businesses can either employ people to do this internally, or outsource the work to an automated data classification vendor. Outsourcing is often off-limits to organisations that handle sensitive data – think finance, or healthcare. As the data economy develops, more companies will be eager to hire for or outsource data labelling. As such, providing labelled data presents a real opportunity for startups and data engineers.

At the same time, data classification software is likely to gradually automate the labelling process. Google’s AutoML system, for example, produced a "child" network called NASNet in 2017 that was reported to reach 82.7 per cent accuracy on the ImageNet image classification benchmark. Eventually, perhaps similar techniques could be applied to all raw data.
