Intelligent personal assistant from Hey Machine Learning Company
Virtual assistants ship with many modern operating systems and services and make our lives easier. They help launch applications, manage IoT systems, find information on the Internet, and book visits to various establishments. Thanks to Artificial Intelligence (AI), assistants can also hold a conversation and learn as they work with you.
A Virtual Digital Assistant (VDA) is software that performs various tasks for the user. The first such system was demonstrated in 1962 at the World's Fair in Seattle. IBM, then the largest computer manufacturer, introduced the Shoebox machine, which recognized 16 spoken words, including the digits zero through nine. The closest predecessor of today's assistants, however, was the one installed on the first smartphone, the IBM Simon, in 1994.
Among modern gadgets, the first full-fledged VDA for smartphones was Siri (Speech Interpretation and Recognition Interface) from Apple, presented at the «Let's Talk iPhone» event on October 4, 2011.
The intelligent virtual assistant market grew from $352M in 2012 to $572.2M in 2014. By 2019, it is expected to reach $2,126.4M, at an annual growth rate of 30.6 percent from 2013 onward. Speech recognition technology is progressing too: in 1970 it correctly recognized words in 10% of cases, in 2010 in 70%, and in 2016 in 90%.
In this article, we describe our new project, Pizza Bot, which we built in less than four weeks. It is a web service for smartphones and PCs that orders pizza at the press of a button. The project is based on Machine Learning (ML) and Natural Language Processing (NLP) technology.
NLP technology to understand human speech
The history of Natural Language Processing dates back to the earliest days of modern AI science. In his article «Computing Machinery and Intelligence», mathematician Alan Turing named a machine's ability to converse with a person as the main measure of its «intelligence». Today this is an important task for developers of NLP systems, but not the only one.
NLP combines several technologies that solve algorithmic problems associated with processing natural language:
Extracting facts from text, from finding stop words to syntactic analysis of literature
Voice recognition and voice-to-text conversion
Text classification
Text or speech generation
Sentiment analysis of texts
In practice, all these tasks are narrowly specialized and are solved by separate algorithms that are constantly improving. Some areas of NLP, such as voice recognition, are often based on Hidden Markov models. They break speech into components, identify phonemes, perform statistical analysis, and output the most likely version of what was said as text. Developers also use Artificial Intelligence algorithms and Neural Networks.
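To make the idea of "the most likely result" concrete, here is a toy sketch of Viterbi decoding, the standard way to find the most probable hidden-state sequence in a Hidden Markov model. It is not the code of any real recognizer: the two states, the two observation labels, and all probabilities are made-up assumptions chosen only for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: each audio frame is either silence or speech,
# and all we observe is whether its energy is low or high.
states = ("silence", "speech")
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

path = viterbi(("low", "high", "high", "low"), states, start_p, trans_p, emit_p)
print(path)  # ['silence', 'speech', 'speech', 'silence']
```

A real speech recognizer works with phoneme-level states and probabilities learned from recorded speech, but the decoding step follows the same principle.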
For Natural Language Processing in our project, we used the Google Cloud Speech-to-Text service, which recognizes human speech in real time with an accuracy of 94 percent and transforms it into text.
The API recognizes 120 languages and is able to extract meaning from words and analyze the context of what was said. For example, the system understands that "Houston" in the phrase "Call Houston" refers to a person, while in "Let's go to Houston" it refers to the city.
We implemented two ways for Pizza Bot to recognize speech and convert it into text. The first is streaming: we send audio fragments to Google Cloud Speech-to-Text, and the service returns them to us as text. If a service does not offer streaming, there is a second option: we send the entire audio file to the API and get the whole recognized text back. With this approach, however, we have to track silence ourselves in order to understand when the operator has finished a phrase and the bot can respond. To do this, we compare the absolute values of the sound amplitudes. If the operator is silent for 0.5 seconds, we consider the phrase complete and send it for processing.
Thanks to Google Cloud Speech-to-Text, we taught Pizza Bot to understand human speech and convert it into text. How did we teach it to speak? The Amazon Polly service helped us.
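The silence check can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes the audio arrives as a list of signed integer PCM samples, and the function name and the amplitude threshold of 500 are hypothetical. Only the 0.5-second window comes from the description above.

```python
def phrase_finished(samples, sample_rate, threshold=500, silence_seconds=0.5):
    """Return True if the last `silence_seconds` of audio stay below `threshold`.

    `samples` is assumed to be a sequence of signed integer PCM values;
    `threshold` is a hypothetical amplitude below which we call a sample silent.
    """
    window = max(1, int(sample_rate * silence_seconds))
    if len(samples) < window:
        return False  # not enough audio yet to decide
    # Compare the absolute amplitude of every sample in the trailing window.
    return all(abs(s) < threshold for s in samples[-window:])


# One second of audio at 16 kHz: half a second of speech, then half a second of quiet.
loud = [3000, -3000] * 4000
quiet = [10, -10] * 4000
print(phrase_finished(loud + quiet, 16000))         # True
print(phrase_finished(loud + quiet[:4000], 16000))  # False
```

In a live system this check would run on a rolling buffer as new audio chunks arrive, and the threshold would need tuning for the microphone and background noise.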
Text-to-Speech technology to synthesize speech from text
Converting text into natural-sounding speech is called speech synthesis, or Text-to-Speech. The task is solved in several stages. Before Pizza Bot can read a text aloud, a special algorithm converts all numbers into words and expands abbreviations. It then breaks the text into phrases, groups of words that are read with continuous intonation. To find them, the system is guided by punctuation marks and set expressions.
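The first stage, text normalization, might look like the sketch below. It is a simplified English-only illustration, not the algorithm of any particular service: the abbreviation table is hypothetical, and only numbers from 0 to 99 are spelled out, with larger ones left unchanged.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

# Hypothetical abbreviation table; a real system would use a much larger dictionary.
ABBREVIATIONS = {"St.": "Street", "Apt.": "Apartment"}


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def normalize(text):
    """Expand known abbreviations, then replace small numbers with words."""
    for abbreviation, full in ABBREVIATIONS.items():
        text = text.replace(abbreviation, full)

    def spell(match):
        n = int(match.group())
        return number_to_words(n) if n < 100 else match.group()

    return re.sub(r"\d+", spell, text)


print(normalize("Deliver 2 pizzas to 15 Main St., Apt. 42"))
# Deliver two pizzas to fifteen Main Street, Apartment forty-two
```

Production systems also handle dates, prices, ordinals, and context-dependent readings, which is why this stage usually relies on curated rules and dictionaries.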
A phonetic transcription is composed for every word. To understand how to read a word and where to place the stress, the system consults dictionaries compiled by humans. If a word is missing from the dictionary, the system builds its transcription from academic rules. Then the number of frames is calculated. A frame is a 25-millisecond fragment described by a set of parameters, for example which phoneme it belongs to, what position it occupies in that phoneme, and which syllable the phoneme belongs to. Using data about the phrase and the sentence, the system sets the correct intonation.
The algorithm then uses an acoustic model to read the prepared text. The model establishes correspondences between phonemes with certain characteristics and sounds. Thanks to Machine Learning, the acoustic model knows how to pronounce each phoneme and give the sentence the correct intonation. The more data the model learns from, the better the result.
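As a rough illustration of the frame calculation, the sketch below splits a sequence of phonemes into 25-millisecond frames and records, for each frame, which phoneme it belongs to and its position inside that phoneme. The phoneme symbols, their durations, and the `Frame` fields are assumptions made for the example. A real synthesizer predicts durations with a model and stores many more parameters per frame.

```python
import math
from dataclasses import dataclass


@dataclass
class Frame:
    index: int                # position of the frame in the whole utterance
    phoneme: str              # phoneme this frame belongs to
    position_in_phoneme: int  # 0 for the first frame of the phoneme, and so on


def frames_for(phonemes, frame_ms=25):
    """Split (symbol, duration_ms) pairs into fixed-length frames."""
    frames = []
    for symbol, duration_ms in phonemes:
        # A partly filled last frame still counts, so round up.
        count = math.ceil(duration_ms / frame_ms)
        for position in range(count):
            frames.append(Frame(len(frames), symbol, position))
    return frames


# Hypothetical durations: "p" lasts 50 ms, "i" lasts 110 ms.
frames = frames_for([("p", 50), ("i", 110)])
print(len(frames))  # 7
print(frames[2])    # Frame(index=2, phoneme='i', position_in_phoneme=0)
```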
The Amazon Polly service copes well with converting text into natural speech. Using advanced Deep Learning technologies, it synthesizes speech that is practically indistinguishable from a human voice. It also includes many natural-sounding voices for different languages. To voice Pizza Bot, we chose the female voice Tatyana.
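A call to Amazon Polly could look like the sketch below. It is a minimal example under stated assumptions, not the project's actual code: the helper name `synthesize_speech` and the MP3 output format are our choices for illustration, while `synthesize_speech(Text=..., OutputFormat=..., VoiceId=...)` is the Polly client method from the AWS SDK for Python (boto3). The client is passed in as a parameter so the helper can be tried without AWS credentials.

```python
def synthesize_speech(polly_client, text, voice_id="Tatyana"):
    """Ask Polly to read `text` aloud and return the audio as MP3 bytes.

    `polly_client` is expected to behave like boto3.client("polly").
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # The audio comes back as a stream that has to be read into memory.
    return response["AudioStream"].read()


# Usage with real AWS credentials configured (not executed here):
#
#   import boto3
#   audio = synthesize_speech(boto3.client("polly"), "Ваша пицца уже в пути")
#   with open("reply.mp3", "wb") as f:
#       f.write(audio)
```

Passing the client in also makes it easy to substitute a stub in tests or to switch regions and credentials in one place.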
To order pizza with one touch, we used the Dash Button smart device from the American company Amazon.com. The device is designed for ordering goods and can be placed in any convenient corner of the house. The button has a built-in Wi-Fi interface through which you connect it to your smartphone, and all operations are performed through a companion mobile application. The logic for handling Dash Button clicks can be configured to count or track items, to call or message someone, or to turn certain devices on and off. We set it up so that when you press the button, a signal is sent to the server and Pizza Bot orders the pizza.
Virtual Assistant Pizza Bot for quick pizza ordering
What sets our service apart is its simplicity. You just need to press the button!
This technology is scalable. It can be customized for any task and any language you need, from ordering water for an office in France to booking a table at a restaurant in Korea. Our Pizza Bot is one example.