Imagine if speech recognition and image recognition could work simultaneously…
It’s not hard to be impressed by the incremental improvement of machine learning and deep neural networks. However, the devil’s in the data – to carry out a task, networks must be trained on extensive, task-specific datasets. When it comes to speech recognition, systems rely on countless manual transcriptions. The same is true for images. If you want a system to recognise a photo of a cat, you had better show it hundreds of pictures of other cats first. Fortunately, a new machine learning model that recognises both images and speech could ease this time-consuming and labour-intensive process.
‘Natural’ machine learning
A team of MIT computer scientists revealed the development of a dual-functioning convolutional neural network (CNN). The model was trained on 400,000 image-caption pairs, building on earlier research that matched words with pixel patches. Currently, the system can recognise several hundred different word-object pairs. In the grand scheme of things, that isn’t a lot – but it does herald a fundamental shift in the development of machine learning. Until now, deep neural networks have been unable to carry out speech and object recognition at the same time. But following the efforts of the MIT researchers, these two functions could now work together to save manual effort and make machine learning more natural. This would undoubtedly lead to more accurate natural language processing and image comprehension, which would have far-reaching effects across industries.
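The core idea behind matching spoken words to pixel patches can be sketched as a dual-encoder: one branch embeds image regions, another embeds audio frames, and their similarity scores link words to objects. The sketch below is purely illustrative – the random stand-in encoders, function names, and shapes are assumptions, not the MIT team’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_image_patches(image, dim=64, grid=8):
    # Stand-in for an image CNN: one embedding vector per spatial cell.
    # A real model would compute these from the image pixels.
    return rng.standard_normal((grid * grid, dim))

def embed_audio_frames(spectrogram, dim=64, frames=32):
    # Stand-in for an audio CNN: one embedding vector per time frame.
    # A real model would compute these from the speech spectrogram.
    return rng.standard_normal((frames, dim))

def matchmap(image, spectrogram):
    # Dot-product similarity between every audio frame and image patch,
    # linking moments in the spoken caption to regions of the image.
    v = embed_image_patches(image)       # (patches, dim)
    a = embed_audio_frames(spectrogram)  # (frames, dim)
    return a @ v.T                       # (frames, patches)

def clip_similarity(image, spectrogram):
    # Max over patches, mean over frames: one scalar score that training
    # could push higher for matching image-caption pairs than mismatched ones.
    m = matchmap(image, spectrogram)
    return float(m.max(axis=1).mean())
```

Training on hundreds of thousands of such pairs, without any manual transcriptions, is what lets word-object associations emerge on their own.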
Type no more
MIT’s model has positive implications for the development of deep learning, saving manual effort while improving both speech and object recognition. Translation is the obvious beneficiary, as the system could remove the need for a bilingual annotator by learning how different languages refer to the same objects. Over time, the model itself would become bilingual. From a business perspective, this could make international B2B and B2C interactions much easier by clarifying conversations. Thousands of the world’s languages cannot be handled by mainstream translation platforms simply because they lack bilingual annotations, but MIT’s project could eventually provide the solution.
Another area that could benefit from advanced speech and image recognition is healthcare. Radiologists could search for specific details in X-rays, ultrasounds, MRIs and other scans using verbal cues, reducing the time spent manually reviewing images. Consumer-facing industries like retail could even use the technology to offer ‘shop by speech’, letting customers check stock and item availability aloud. A more immediate implication is the gradual abandonment of written transcriptions in favour of voice commands, edging closer to the point where machine-to-human interactions are predominantly based on speech.
Machine learning is advancing steadily, offering tangible proof that neural networks are becoming ever more capable. As machine learning techniques improve, models will require less input data. Not only will this make it easier for computer scientists to train them, but it will encourage the democratisation of the technology. That said, merging speech and object recognition is by no means easy, and the research remains at an early stage. Even so, the disruption it could bring to translation and many other industries is clear. On a societal level, bypassing text commands and refining voice cues marks another important milestone on the journey towards verbal interaction between humans and machines.