It’s been hard to miss some of the big contenders using voice technology for home automation and personal assistants. Siri now comes baked into all Apple products, while Amazon has Alexa, Microsoft has Cortana and Google uses voice in Google Assistant and search.
In fact, this article was written using the voice typing feature built into Google Docs! There are also a number of DIY methods for handling voice control; Google, Amazon, IBM and Microsoft all have SaaS solutions to handle each stage in the process independently.
Speech recognition software has come a long way over the last few years; until 4 years ago the best we could do was around a 70% hit rate, but now some companies claim to be closing in on the elusive 95% accuracy1 (the point at which it matches human ability). This is a huge milestone for speech recognition. If the software understands your speech as well as a human does, it’s possible to avoid a lot of the all-too-familiar comedy-of-errors situations that frustrate early adopters of the technology.2
This is still only the first part of the story. It is all well and good taking in speech and converting it (however accurately) to text, but the next challenge is understanding that text so you can make informed decisions on what an appropriate response would be.
Fortunately this is something that’s been in the works for a long time; NLP (natural language processing) has been improving steadily for many years and is used heavily across the industry, most notably in chat interfaces like Facebook Messenger bots. Another great example is IBM’s Jeopardy-crushing Watson, which uses NLP to identify meaning in complex questions; see Watson in action.3
Most of the offerings in this space use a similar methodology to interpret commands: identify the intent (the thing the user wants to do) and trigger an action and response. The Cortana documentation describes this as “Cortana presents skills in response to user requests by employing an intent model. An intent model applies machine learning to detect utterances from a set of training data used to derive the user’s intent from the request (along with entities, or specific data from the intent). Based on the request, this info is then passed to the skill, which enables Cortana to provide a suitable customised user experience.”4
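To make the utterance/intent/entity idea above concrete, here is a deliberately minimal sketch in Python. Real platforms train statistical models from their sample utterances; this toy version just pattern-matches a few hand-written training utterances, with {placeholders} standing in for entities. The intent names ("GetWeather", "SetTimer") and slot names are invented for this example, not any platform’s actual schema.

```python
import re

# A few training utterances per intent; {name} marks an entity slot.
TRAINING_UTTERANCES = {
    "GetWeather": [
        "what is the weather in {city}",
        "how is the weather in {city}",
    ],
    "SetTimer": [
        "set a timer for {duration}",
        "start a {duration} timer",
    ],
}

def compile_pattern(pattern):
    # Turn "{city}" into a named capture group, e.g. (?P<city>.+)
    return re.compile("^" + re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern) + "$")

def parse(utterance):
    """Return (intent, entities) for the first training pattern that matches."""
    text = utterance.lower().strip(" ?!.")
    for intent, patterns in TRAINING_UTTERANCES.items():
        for pattern in patterns:
            match = compile_pattern(pattern).match(text)
            if match:
                return intent, match.groupdict()
    return None, {}

parse("What is the weather in Edinburgh?")
# -> ("GetWeather", {"city": "edinburgh"})
```

The conversation platforms do exactly this mapping for you (far more robustly, generalising beyond the literal training phrases) and hand your code only the resolved intent and entities.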
Now if that sounds like an awful lot of work, don’t worry; all the major “conversation platforms” do the heavy lifting for you, leaving you to focus on developing engaging user experiences. Alexa, Cortana and Google Home all provide simple user interfaces to map requests to actionable intents; Siri, however, provides a set list of domain-specific intents that your app can subscribe to. On top of that they all provide developer SDKs or toolkits to kickstart your project. Due to the pluggable nature of Alexa, Google Assistant and Cortana there are already lots of skills available to augment and integrate voice control with any number of third-party services and other apps you have on your device. IFTTT (If This Then That) has also recently launched channels supporting both Alexa and Google Assistant, opening up the world of custom automation to everyone.
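Once the platform has resolved the intent, your skill’s job is simply to act on it and return something to say. The sketch below shows that shape in Python: a dispatch table of handlers and a helper that builds a spoken response. The JSON layout loosely follows Alexa’s custom-skill response format, but treat the field names and the "HelloIntent" handler as illustrative rather than authoritative.

```python
def build_response(speech_text):
    """Wrap plain text in a minimal spoken-response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": True,
        },
    }

# Map resolved intent names to handler functions (hypothetical intents).
HANDLERS = {
    "HelloIntent": lambda slots: build_response("Hello from the skill!"),
}

def handle_request(intent_name, slots=None):
    """Dispatch a resolved intent to its handler, with a polite fallback."""
    handler = HANDLERS.get(intent_name)
    if handler is None:
        return build_response("Sorry, I don't know how to do that yet.")
    return handler(slots or {})
```

This is all a basic custom skill really is: the platform does the speech recognition and intent resolution, and your backend is a small function from intent plus entities to a response.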
Now that we’ve reached a tipping point, it’s probably time to ask what the point of all this effort was. Humans love to talk (I am certainly no exception). The average human spends the first three months of their life communicating entirely through cries and screams, at which point they come to realise it may not be the most effective way of getting their point across. By 12 to 18 months most children have managed to form simple words, and it’s been suggested that 15-month-old children start to add intonation when asking a question, showing quite an advanced understanding of speech patterns. If you think about how we currently communicate with machines and mobiles (typing, scrolling and clicking/gestures), skills that we developed much later in life, it is relatively obvious that speech would be a far more natural medium for human-computer interaction.
Arthur C. Clarke said “any sufficiently advanced technology is indistinguishable from magic.” Those words have never been more fitting. As a society, we have long dreamed of fantasy worlds where spoken words cause magical effects in our surroundings. The rapid advances in voice technology bring that closer to reality than ever.
Now that the technology has caught up with expectations, there is a shift from early adopters with niche use cases to a much more open and competitive market. There are some obvious situations where voice control may not be appropriate; there are still issues with distinguishing between voices, filtering out background noise, and privacy. For these reasons the main contexts in which voice makes sense are in the home, while driving, or in private. Also, as I have experienced while writing this article, voice works great for distinct, relatively short sentences, much like when speaking with another human. Long-form text can be problematic (typing is definitely quicker and easier if you stray beyond a paragraph of text).
If we just look at the technology, Google Assistant (at time of writing) seems to have the best speech recognition and also understands context. You can ask it for suggestions while reading a text from a friend about dinner plans, and it will be smart enough to understand and suggest relevant restaurants in the area; currently no other assistant offers this contextual awareness.5 It can also handle chained questions: after asking “What is the weather in Edinburgh?”, you can follow up with “What about tomorrow?” and the assistant will retain the information from the previous request and tell you tomorrow’s weather in Edinburgh.
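That follow-up behaviour can be sketched in a few lines: the assistant remembers the entities from the previous turn, so a request that only supplies “tomorrow” can inherit the city from the earlier question. This is purely illustrative (real assistants track far richer dialogue state), and the class and slot names are made up for the example.

```python
class DialogueContext:
    """Remember entities across turns so follow-up questions can omit them."""

    def __init__(self):
        self.entities = {}

    def resolve(self, intent, entities):
        # New entities override remembered ones; anything the user didn't
        # say this turn falls back to what the previous turn established.
        merged = {**self.entities, **entities}
        self.entities = merged
        return intent, merged

ctx = DialogueContext()
ctx.resolve("GetWeather", {"city": "Edinburgh", "day": "today"})
ctx.resolve("GetWeather", {"day": "tomorrow"})
# Second call resolves with city "Edinburgh" carried over from the first.
```

The hard part in practice is knowing when to carry context over and when to discard it (a new topic should not inherit yesterday’s city), which is where Google Assistant currently appears to be ahead.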
There is, however, currently a clear winner when it comes to how much you can do with voice control: Amazon’s Alexa has over 15,000 skills (as of July this year 6) and Amazon has since run a developer initiative to increase that number further! Siri has relatively limited functionality when compared to the other big conversation platforms, with Cortana equally struggling; Microsoft currently lists fewer than 100 skills on its site.7 Google hasn’t released numbers on the actions available for its Assistant, but from a browse the count seems to sit somewhere between Cortana’s and Alexa’s. This may be surprising, as Google, Apple and Microsoft all have heavily used operating systems that come with their assistant of choice baked in, but having a solution that works across your iPhone, Amazon Echo and Windows laptop is clearly enough to swing people in Amazon’s direction!
We will see how well this holds up as Google and Microsoft push their assistants cross-device as well (Apple is still keeping very quiet about any plans to extend Siri’s availability). This has already started, with Microsoft and Amazon recently announcing that Cortana and Alexa will be able to talk to each other, making all Alexa skills available on the 100 million Windows 10 machines that come with Cortana built in!
With the largest set of skills available, and now access to a huge market through Cortana, Amazon and their Alexa assistant are in a very strong position. In the Waracle office we have had a lot of experience with Amazon Alexa already. We recently hosted Amazon’s David Low, Principal Evangelist for Amazon Alexa, in our Dundee office to discuss skill development and its uses in more detail. The office has been awash with ideas for how we can use voice recognition to further enhance our customers’ apps and services. Improvements in the conversation platform space are coming thick and fast at the moment; there is certainly a lot of momentum behind the uptake of voice assistants, and it doesn’t show any signs of slowing down. There may never be a better time to jump on the bandwagon! If you want to talk more about voice, mobile app development and IoT, contact us today.
Glasgow Alexa fans will be delighted to know that the city will host an Amazon Alexa meetup at ScottishPower HQ on Wednesday 24th January 2018. Spaces are limited, so please register your seat at this event, which is co-hosted by our team at Waracle mobile app developers.
1 Microsoft just broke this milestone (English language only) https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/
2 Sketch of Scottish people using voice controlled lift https://www.youtube.com/watch?v=sAz_UvnUeuU
3 IBM Watson crushing it on Jeopardy (Engadget) https://www.youtube.com/watch?v=WFR3lOm_xhE
4 Cortana docs https://docs.microsoft.com/en-us/cortana/getstarted
5 Only available on compatible Android devices
7 Cortana skills https://www.microsoft.com/en-us/windows/cortana/cortana-skills/