Why your data science projects aren’t working – and what you can do about it
AI and data science are often talked about as the jewel in the crown of our digital future. But present-day AI projects seem blighted by failure, from Amazon’s ‘sexist’ recruitment tool to IBM Watson’s shortcomings in healthcare. Plenty of good AI toolkits are freely available, including those from old timers (IBM, Microsoft) and newer kids (Amazon, Google). But too often the applications built with them end up either misguided or downright dangerous. Why is this?
There’s an obvious analogy to the early days of software development. Many early products were prone to bugs or technical debt once deployed. Eventually we learnt that success needed proper processes. Software is now a highly professional global industry, with standards, development methodologies, project management frameworks and industry accreditations. AI and data science need to go through a similar professionalisation process to tackle their own high failure rates, and build user trust.
Why do we need a professional data science framework?
There are various red flags that should alert companies to the need for professional processes: multiple underperforming data science projects running over budget, falling behind schedule, or failing to deliver value; or budget flowing into a never-ending stream of proofs of concept, few of which are ever deployed into the enterprise. The point of a framework is to keep such projects on track, catch failures early, and ensure those with potential do not fall at predictable hurdles.
What challenges must a professional data science framework address?
Data science is all about difficult choices. At the core of any professional framework lies guidance on making the right decision at each point, from proofs of concept to deployed enterprise solutions. A good framework will take a ‘fail early’ approach, with multiple stage-checks which spot problems before they become serious and ensure appropriate levels of rigour when using data. The framework must have in-built agility to allow rapid experimentation, and be capable of integrating effectively into the complex landscapes of enterprise IT organisations.
It must also be technology-neutral, i.e., applicable to any tool or toolkit. Data science moves too quickly to be limited to specific tools and techniques, but a good framework also guards against ‘shiny and new’ for the sake of it. Finally, any framework in complex areas such as AI and machine learning needs a considered focus on guidance for independent thinking, rather than direct prescription. To explore how such a framework can deliver, let’s look at a recent project we worked on, which used AI to predict component failures in engines.
Data science frameworks in action
In high-performance engines, e.g., on planes, there is a need to keep close tabs on the physical properties of engine materials. Wear and tear could cause disastrous engine failures if not spotted early, but regularly taking engines out of service for inspection creates costly downtime, and often turns out to be unnecessary.
To reduce manual inspections, a manufacturer of specialist engine parts wanted to build a model that could spot finely nuanced changes in sensor data, collected from the component while in operation, and use them to predict when a failure was likely to occur. The project used Tessella’s own data science framework, RAPIDE, which has been developed over many years.
RAPIDE has six phases designed to ensure that projects: are business Ready; only use data that passes Advance screening; Pinpoint the real factors driving outcomes; Identify and evaluate multiple models, methods and toolsets; Develop models where trust is the equal of raw predictive power; and Evolve capability upon contact with the real world.
The first steps included a ‘readiness assessment’: evaluating the existing understanding of system behaviour and conducting a detailed review of data quality and completeness. This highlighted that the manufacturer’s training data was too sparse and variable in quality for the chosen technique – a very common situation.
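A first-pass data-quality review of this kind can be partly automated. The sketch below is purely illustrative (the column names and the missingness threshold are our own assumptions, not details of the actual project): it summarises, per sensor channel, how much data is missing and how variable the readings are, which is often enough to flag channels that are too sparse to train on.

```python
import numpy as np
import pandas as pd

def readiness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise missingness and variability per column, as a first
    pass at judging whether training data is dense enough to model."""
    return pd.DataFrame({
        "missing_frac": df.isna().mean(),   # fraction of missing readings
        "n_unique": df.nunique(),           # distinct values observed
        "std": df.std(numeric_only=True),   # spread of the readings
    })

# Toy sensor log with one sparse channel (hypothetical column names)
df = pd.DataFrame({
    "vibration": [0.1, 0.2, np.nan, 0.15, np.nan, np.nan],
    "temperature": [310, 312, 311, 309, 313, 310],
})
report = readiness_report(df)

# Flag channels with more than, say, 30% missing data as not ready
too_sparse = report.index[report["missing_frac"] > 0.3].tolist()
```

Here `vibration` would be flagged as too sparse, mirroring the kind of finding the readiness assessment surfaced.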
This was followed by ‘advanced data screening’, which revealed that it was only possible to obtain the required insights through a combination of component sensor data fused with other engine measurements. This allowed the project to make early changes to ensure it had the right data for the desired outcome before it went any further.
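Fusing component sensor data with other engine measurements typically means aligning streams logged on different clocks. As a minimal sketch, assuming two hypothetical streams (`strain` from the component, `rpm` from the engine), pandas’ `merge_asof` pairs each sensor reading with the most recent engine measurement:

```python
import pandas as pd

# Hypothetical component sensor readings (component clock)
sensor = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:05",
                            "2024-01-01 00:10"]),
    "strain": [1.2, 1.3, 1.8],
})

# Hypothetical engine-level measurements (a different, coarser clock)
engine = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 00:01", "2024-01-01 00:06"]),
    "rpm": [8000, 8200],
})

# Align each sensor reading with the latest engine measurement at or
# before it; readings with no prior engine measurement get NaN
fused = pd.merge_asof(sensor, engine, on="time")
```

Both frames must be sorted by the merge key; the first sensor reading predates any engine measurement, so its `rpm` is left missing rather than guessed.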
The new denser datasets provided enough information to pinpoint what specific factors were driving increased risk of failure and decouple them from other correlations which appeared related at first, but proved to be incidental.
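One standard way to separate genuine drivers from incidental correlations is partial correlation: regress out the suspected true driver and see whether the remaining association survives. The toy example below uses invented variables (`wear` as the true driver, `temp` as an incidentally correlated signal) to show the pattern, not the project’s actual analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
wear = rng.normal(size=n)                      # hypothetical true driver
temp = wear + rng.normal(size=n)               # correlated via wear, not causal
risk = 2.0 * wear + rng.normal(scale=0.1, size=n)

# Raw correlation makes temp look strongly related to failure risk...
raw = np.corrcoef(temp, risk)[0, 1]

# ...but after regressing out the true driver from both variables,
# the residual association (the partial correlation) vanishes.
resid_temp = temp - np.polyval(np.polyfit(wear, temp, 1), wear)
resid_risk = risk - np.polyval(np.polyfit(wear, risk, 1), wear)
partial = np.corrcoef(resid_temp, resid_risk)[0, 1]
```

`raw` comes out strongly positive while `partial` is near zero, which is the signature of a correlation that is incidental rather than causal.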
The team could then investigate candidate algorithms, assess them, and down-select the most effective for this specific problem. Only the most successful model was taken forward to training and validation, using the upgraded training dataset. The resulting model could spot a heightened risk of failure within hours of the first pre-failure event, instead of the weeks or months common to traditional manual inspection regimes.
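A common way to run this kind of down-selection is to score each candidate model with cross-validation and keep the best performer. The sketch below uses scikit-learn with a synthetic stand-in dataset and two arbitrary candidates; the real project’s candidate list and scoring criteria would of course differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a labelled failure-risk dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Two hypothetical candidate algorithms under consideration
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each candidate
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

# Down-select: only the best-scoring model goes forward
best = max(scores, key=scores.get)
```

Cross-validation gives each candidate a fair comparison on the same folds before committing training effort to a single model.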
Frameworks reduce failure
This illustrates the power of a framework to deliver a genuine and trusted AI model. A good framework ensures the problem is properly understood, the right data is collected, the causal link between drivers and outcomes is identified, the most suitable technology is used, and the model is trained and deployed correctly.
Companies now have ready access to powerful AI and data science tools. But creating AI without the guidance of a professional framework means a high risk of project failure, creating baseless ‘insights’ that are misleading and probably damaging to the business.
Developing a professional data science framework is hard. Software engineering frameworks offer a good starting point but need considerable enhancement. As AI goes mainstream, now is a critical time for the industry to create and adopt rigorous professional frameworks for data science and AI, crystallising best practice and ensuring we can trust AI to deliver as expected. Otherwise, history will repeat itself, and we’ll make the same mistakes we did at the start of previous technology revolutions.
This article was co-written by:
David Dungate, Lead Data Science Consultant, Tessella
Matt Jones, Lead Data Science & Analytics Strategist, Tessella
Tessella’s whitepaper RAPIDE, A professional governance framework for data science, provides further information on the importance of data science frameworks and on Tessella’s own framework.
For more on AI advancements and data science, sign up for our free weekly newsletter here.