Looking Beyond The Personal Voice Assistant

The future of voice?

There is much buzz, and arguably a degree of hype, over the disruptive potential of AI-driven voice technology. Thanks to significant improvements in speech recognition accuracy and natural language processing (NLP), and the growing popularity of smart speaker devices in the home, the voice economy is a hot topic with analysts and the media. Will 2019 be a significant growth year for the technology, bringing wider adoption and breakthrough capabilities? Or not?

What is the reality of the technology's capability – can we separate the novelty factor from real benefit? Whilst smart speakers might be useful in the home for playing music, activating smart home devices, checking the weather and telling jokes, when will we see practical use cases and benefits for business? And for society?

Are technologists just looking for a new hyper-growth sector as part of the AI wave, or is the underlying technology really capable of changing our lives by offering genuine productivity gains and, potentially, social and health benefits?

Hello computer!

Cinema has long been enthralled with the idea of conversing with a super computer:

  • From the infamous HAL in 2001: A Space Odyssey, with its eerily soft, calm voice and conversational style;
  • To a somewhat bemused Scotty in Star Trek IV – The Voyage Home (1986), who was initially baffled by the lack of a voice interface to a Mac and mistook the computer mouse for a microphone – “How quaint”;
  • To Iron Man, where Tony Stark first developed JARVIS, his conversational AI system that ran his business, managed security and helped with Avengers combat; and later FRIDAY, when JARVIS was destroyed.

My very own FRIDAY

While none of us are (or are likely to be!) an egomaniacal Tony Stark trying to save the world with JARVIS (or latterly FRIDAY) at our side responding to our every command, I’d suggest that a genuinely voice-responsive personal assistant becoming a reality would go some way towards the voice economy coming of age. For me, voice technology in much of its current form is little more than a novelty. But give me a personal assistant that can be tasked with all that tedious work admin as and when I ask, and that would be truly valuable:

  • Book, arrange and confirm my meetings
  • Check my diary and book my train travel and hotels
    • Work out where I need to go and when;
    • Know my preferences – and just get on with it
  • Voice-driven learning
    • More than just a CBT learning package or a podcast
    • The learning assistant would regularly (say, every 30 minutes) test me by quizzing me on the key topics, remembering what I’ve answered correctly or incorrectly, and making sure to retest me on weak areas (a toy sketch of this retest logic follows the list below). That would go a long way towards making learning stick rather than being forgotten.
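Purely to illustrate that retest idea – this is not any real assistant's API, and all the names are hypothetical – a minimal sketch of how such a learning assistant might weight its questions towards weak topics could look like this:

```python
import random
from collections import defaultdict

class LearningAssistant:
    """Toy quiz scheduler: retest the topics the learner keeps getting wrong."""

    def __init__(self, questions):
        # questions: {topic: [(prompt, expected_answer), ...]}
        self.questions = questions
        self.wrong_counts = defaultdict(int)

    def pick_topic(self):
        # Weight topics by how often they were answered incorrectly,
        # so weak areas come up more frequently in later quizzes.
        topics = list(self.questions)
        weights = [1 + self.wrong_counts[t] for t in topics]
        return random.choices(topics, weights=weights, k=1)[0]

    def quiz(self, answer_fn):
        """Ask one question (answer_fn plays the learner) and record the result."""
        topic = self.pick_topic()
        prompt, expected = random.choice(self.questions[topic])
        correct = answer_fn(prompt).strip().lower() == expected.lower()
        if not correct:
            self.wrong_counts[topic] += 1
        elif self.wrong_counts[topic] > 0:
            self.wrong_counts[topic] -= 1
        return topic, correct
```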

Outside of work, anything to do with call queueing or appointment bookings would be great. I quite like the idea of my FRIDAY being on hold (whilst my call is “important”) and connecting me when a human is ready to speak to me. Whilst my call might be important to whoever I have to contact, my time is much more important. Let my assistant have the stilted conversation with the IVR’s speech recognition platform. And that’s perhaps the promise that Google Duplex offers, although at this stage Duplex is constrained to certain domains and certain appointment types. By constraining Duplex to closed domains, which are narrow enough to explore extensively, the platform can carry out more natural conversations after being deeply trained in those domains. But it cannot carry out general conversations. Not yet, anyway.

The novelty factor – does it wear off?

Whilst you do see media analysts waxing lyrical about how conversational AI will become as much a part of our routine as the browser, that voice interfaces will be embedded in our lives, and that the novelty factor will wear off, others are less convinced. They see the technology as run of the mill – not doing anything beyond what you can already do by other means, so not particularly clever.

Many consumers don’t find their smart speakers particularly useful – an Ovum survey indicated that 31% of respondents gave their assistants an average or poor rating for this reason. Lack of perceived utility or benefit is also the main reason consumers don’t have AI assistants in the first place: 47% of respondents in the Ovum survey cited this as their reason for non-adoption, and 38% of respondents in a Voicebot.ai survey said they were simply not interested.

If that’s the case, then why has smart speaker adoption exceeded previous analyst expectations, with just under 20% of US adults now using smart speakers? Growth in the UK, whilst modest, has also increased – to 10% ownership.

I think the main reason for the healthy adoption is that the devices are not expensive, they are somewhat useful for integrating with smart home devices, and they are genuinely novel fun for the price. But then again, has anything to do with smart homes yet been particularly revolutionary compared to, say, the benefits that home appliances like the washing machine, dishwasher and vacuum cleaner brought when they were introduced?

Certainly the likes of Amazon, Apple and Google are keen to integrate their smart speakers into the home environment, but they don’t actually make a great deal of profit from selling the speakers themselves. Underneath it all, perhaps they just want to lock you into their ecosystem for wider cross-sell across the more lucrative areas of their portfolios, whether that is Prime, Apple TV or iPhones. Is this self-generated hype, when the current reality is that what you’re really getting with many systems is just a clever voice-form IVR rather than the ability to converse with HAL?

High expectations when it comes to conversation

Forrester published a report in October last year stating that voice interfaces based on speaking and listening are wildly immature and far from conversational. The problem is that most voice interfaces are just a form in disguise, which might be okay for the simple tasks the likes of Alexa can handle, but they tend to disappoint after a while unless you restrict what you ask for and how you ask it. And that defeats the point. We humans tend to anthropomorphise machines and have higher expectations of voice as an interface. As computers, mobiles and tablets emerged and became mainstream, we were okay with learning how to use a keyboard and mouse, and to swipe and tap. These were new behaviours necessary to use the devices, so we were open to learning these fresh techniques. Speaking and conversing are different, however; they are an ingrained part of our being. Our expectations of voice technology are therefore entrenched and very high, often leading to disappointment if the platform does not work in the way we would intuitively expect it to.
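To make the “form in disguise” point concrete, here is a minimal, hypothetical sketch of what typically sits behind such a dialogue – the “conversation” is really just a loop prompting for whichever required slots are still empty:

```python
# Why many voice interfaces feel like "a form in disguise": behind the
# conversation there is a fixed set of slots, and the system keeps
# prompting until every slot is filled. (Hypothetical slots and prompts.)
REQUIRED_SLOTS = ["destination", "date", "time"]
PROMPTS = {
    "destination": "Where would you like to travel to?",
    "date": "What date do you want to travel?",
    "time": "Roughly what time of day?",
}

def next_prompt(filled_slots):
    """Return the next question to ask, or None once the 'form' is complete."""
    for slot in REQUIRED_SLOTS:
        if slot not in filled_slots:
            return PROMPTS[slot]
    return None

# The user has only given a destination, so the assistant asks for the date next.
print(next_prompt({"destination": "Manchester"}))
```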

Google Duplex – the potential tipping point?

However, deep neural networks have enabled Google to take voice into a more natural-sounding conversational space. Although Duplex works on closed domains, there are two key innovative components, developed by Google and DeepMind, that make it sound so human-like:

  • Recurrent neural networks: provide a structure that exhibits “memory” of previous events and adjusts its output based on the sequence of inputs rather than a single, one-off input. This gives Duplex the ability to keep track of the conversation, both in terms of its status (greeting, request or clarification) and in dealing with any confusion or uncertainty (a toy illustration of this “memory” idea follows the list below).
  • Tacotron and WaveNet text-to-speech engines: Google generates human-like speech from text using convolutional neural networks trained only on speech examples and the corresponding text transcripts, capturing not just the pronunciation of words but also various subtleties of human speech, including volume, speed and intonation. The capability also manages natural speech latency and the “hmm”s and “ah”s. WaveNet models the raw audio waveform one sample at a time to produce more natural-sounding speech.
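This is obviously not Duplex’s actual model, but a toy sketch (in Python/NumPy, with made-up dimensions and weights) can illustrate the “memory” idea: the hidden state is carried forward from turn to turn, so each output depends on the whole conversation so far rather than just the latest input.

```python
import numpy as np

# Toy recurrent cell: the hidden state h summarises everything said so far,
# so the output at each turn depends on the whole sequence of inputs rather
# than a single one-off input. Dimensions and weights are made up.
rng = np.random.default_rng(0)
INPUT_DIM, HIDDEN_DIM = 8, 16
W_xh = rng.normal(scale=0.1, size=(HIDDEN_DIM, INPUT_DIM))
W_hh = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))
b_h = np.zeros(HIDDEN_DIM)

def rnn_step(x, h):
    """One step: fold the current input x into the running conversation state h."""
    return np.tanh(W_xh @ x + W_hh @ h + b_h)

# Feed a short sequence of (embedded) utterances - e.g. greeting, request,
# clarification - and the final state reflects all of them.
h = np.zeros(HIDDEN_DIM)
for utterance_embedding in rng.normal(size=(3, INPUT_DIM)):
    h = rnn_step(utterance_embedding, h)
```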

Google believes Duplex takes a significant step in allowing people to interact with technology as naturally as they interact with each other, even if it is currently only for specific appointment booking scenarios. They also recognise there are still some limitations to be researched – such as pronunciation of some difficult words, randomly generated strange noises, generating audio in real-time and expressing emotion.

Of course, the promise of Duplex has also raised some ethical concerns – is it right to fool somebody into thinking they are talking to a real person rather than a machine, and could the capability be open to abuse? Google say they recognise this and the need for transparency.

The challenges with machine learning

To think machine learning will solve everything for conversational AI voice systems is a fallacy, according to a recent report by PullString co-founder and CTO, Dr Martin Reddy. The report highlights scalability issues with machine-learned solutions; for example, Amazon’s Alexa has a limit of 250 intents.

Natural language production tasks for conversational AI do not lend themselves as well to machine learning, because they require stateful memory, contextual awareness, integration with external knowledge bases and custom logic programming – all of which are poorly suited to purely machine-learned solutions. These are big challenges that remain, and solving them is critical to delivering a believable and engaging conversational experience.

The skill of conversation … and linguistics

Developing workable conversational voice systems is not just down to good developer skills and machine learning expertise – a much wider skill set is required to deliver solutions to the level of quality that humans expect. As Forrester outlines, teaching computers human language remains notoriously difficult. Can we really expect computers to quickly achieve the sophistication and complexity of language and speech that the human brain has had millions of years to develop?

Words and sentences can mean different things depending on context, word choice, order and punctuation, and different languages have their own unique structures, syntax and cultural variations. Much machine learning development is focused on working out user intent, but that is still in its infancy and we still experience failures – there is a reason #alexafail is a regularly trending hashtag on Twitter.

We can see how voice interfaces and speaker systems work reasonably well for simple queries, effectively steering the user through the “known knowns”. This is a key point when considering conversational AI for customer service scenarios: clearly the technology is not suitable for complaints or complexity. When we complain, we tend to get angry and emotional, and we tend to tell long, rambling stories. Emotional context and long sentences are hard to parse. Sarcasm also tends to throw algorithms – the literal words and the intended meaning simply don’t line up.

In addition to intent understanding and sentiment analysis, there are many other non-trivial problems that represent decades of AI research challenges – such as crafting generated dialogue with tone, style and personality, and handling conversational unknowns or unexpected requests and responses. These cover several of the long-term AI challenges such as abstract reasoning, logical deduction, self-learning and goal solving.

It is a fallacy to think that it is machines or deep learning algorithms that do all the work. When Alexa, Siri, Cortana or the Google Assistant responds to your queries, it is not magic. The companies building these services have many employees behind the scenes manually defining intent models, training them with relevant utterances and then connecting them to hand-authored responses (or enabling third-party developers to do the same) – a simple sketch of this pattern follows below. A broad range of skills, over and above the many research challenges, is needed to develop conversational AI solutions to our entrenched levels of expectation. This includes conversational AI designers, conversation script writers and voice-bot brand designers (personality, character and voice). The other critical skill set is linguistics – linguistics underpins any approach to understanding the use of language in conversational AI systems – yet the Forrester report flags a critical skills shortage in this area.
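As a rough illustration of that manual work (not how any specific platform implements it – real systems use trained classifiers rather than the naive word overlap below), intents are essentially defined as example utterances wired to hand-authored responses:

```python
# Intents defined by example utterances and wired to hand-authored responses.
# Real platforms train statistical classifiers over the utterances; the naive
# word-overlap matcher below just keeps this sketch self-contained.
INTENTS = {
    "check_weather": {
        "utterances": ["what's the weather", "will it rain today", "weather forecast"],
        "response": "Here's today's forecast...",
    },
    "book_meeting": {
        "utterances": ["book a meeting", "set up a call", "schedule a meeting"],
        "response": "Sure - who do you want to meet, and when?",
    },
}

def classify(user_text):
    """Pick the intent whose example utterances share the most words with the input."""
    words = set(user_text.lower().split())
    best, best_score = None, 0
    for name, intent in INTENTS.items():
        score = max(len(words & set(u.split())) for u in intent["utterances"])
        if score > best_score:
            best, best_score = name, score
    return best  # None means "unknown" - the unhappy path that trips assistants up

print(INTENTS[classify("can you book a meeting with Sam")]["response"])
```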

Smart speakers and voice assistants can do smart and useful stuff

You might think I’m sceptical about the current state of conversational AI technology. I do, however, recognise that there are many beneficial skills and capabilities where the technology is starting to make a difference beyond the usual use cases, such as:

  • Charity donations by voice – for easier and wider donating
  • Any activity where being hands-free is critical – which is why there is notable uptake in the connected car
  • Healthcare in particular, where the technology can offer quality-of-life benefits for the elderly and those with disabilities:
    • Medical prescription adherence – reminders to take medication at the prescribed times and dosage
    • Check-ins and remote patient care monitoring
    • Medical appointment reminders
    • To a certain extent, reducing the impact of social isolation through access to content (not that different from other channels, except that Alexa is like a virtual friend!)
    • Playback of news, weather and other content for the visually impaired

Often, elderly people try to use Alexa but get frustrated and eventually give up, because it keeps asking users – who may have slow or interrupted speech or be hard of hearing – to repeat themselves. The problem is that the user is having to adapt to the characteristics of the technology rather than the other way around. That is why I think we’ll see products like miicare, with its inbuilt digital assistant, evolve to meet the increasing demands of an ageing society and the need to address self-care and isolation for this growing population.

Other cool voice and conversational AI stuff

Voice biometrics is now maturing significantly as a key capability. It is likely to be augmented with facial and gesture recognition for multi-modal authentication as part of an integrated conversational UI platform. People will become more and more comfortable with biometric authentication and the idea of their voice being their password, which can be combined with a selfie from their phone to further verify identity. This will provide a significantly better experience than current procedures, where you typically verify who you are via an IVR when calling a contact centre, only to have to repeat it all over again when you eventually speak to an agent.

Authentication can be passive as well as active, with continuous authentication working in the background throughout the conversation. AI algorithms can keep analysing the voice for emotion, sentiment and vocal characteristics against stored voiceprints, checking for anything unusual as an additional fraud measure.
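As a hedged sketch of the idea – assuming some speaker-embedding model exists, with the random vectors below standing in for its output – passive verification boils down to repeatedly comparing each utterance’s embedding with the voiceprint enrolled for the caller:

```python
import numpy as np

# Passive, continuous voice authentication boiled down to its simplest form:
# compare the embedding of each new utterance against the caller's enrolled
# voiceprint and flag anything that drifts too far. The random vectors stand
# in for the output of a real (unspecified) speaker-embedding model.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(utterance_embedding, enrolled_voiceprint, threshold=0.8):
    """Re-run this check on every utterance for continuous authentication."""
    return cosine_similarity(utterance_embedding, enrolled_voiceprint) >= threshold

enrolled = np.random.rand(256)           # stored voiceprint (stand-in data)
for snippet in np.random.rand(5, 256):   # embeddings of successive utterances
    if not verify(snippet, enrolled):
        print("Unusual voice characteristics - flag for additional fraud checks")
```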

Established speech recognition vendors such as Nuance and NICE are focusing much of their development on these areas, and we are seeing disruptors in the market, such as Utopia Thinking Systems, who develop their solutions using deep learning AI.

This is where I think we’ll see a noticeable transition of voice technology into the mainstream during 2019 and 2020.

In summary

There are still many challenges to be overcome before we’ll see the likes of HAL or JARVIS in the mainstream. Whilst new skills will continue to be developed for the likes of Alexa, voice applications will remain simple until the experts can get computers to converse as we do across a broad range of domains, and can work around the constraints of machine learning as well as exploiting its benefits. However, even if the applications are simple, there are still many beneficial use cases for both business and society that are being developed and can continue to be. We are unlikely to see anything revolutionary beyond Google Duplex in 2019 or 2020, but we will see an evolution. We should also see voice biometrics, combined with other modes, become more common as a way of addressing fraud and identity risks.

In any case, the voice economy will evolve over the next few years and will become a complementary channel to the Web as we continue to use mobile devices more and more.
