According to MarketsandMarkets Research, the speech recognition market was estimated at $7.5 billion at the beginning of 2019. It is projected to grow by roughly 19% a year, reaching $21.5 billion in 2024. Speech recognition technologies are clearly among the most valued and sought after.
Speech recognition is not a new technology. The first speech recognition device appeared in 1952. In 1963, Septrons, devices that could carry out voice commands, were introduced in the United States. They were used successfully in the defense industry, allowing combat helicopter pilots to use voice control. In mass markets and large commercial projects, voice solutions became popular with the arrival of smartphones and the spread of voice interfaces in mobile phones, IoT devices, fitness gadgets, and car systems.
Then banks and large enterprises became interested in speech recognition. For example, financial institutions are still considering authenticating customers by voice. Perhaps in the future, to sign a payment document or approve a transaction, you’ll simply have to call the bank and talk to an automated system. Developers have reduced the voice recognition error rate to as little as 2%, a record figure that ensures high reliability.
Voice technologies are applied not only in banks but also in other business processes, for example in customer service for the sale of goods or services and in technical support. There, voice recognition simplifies various operations. For example, online customers may dictate their credit card information to the system to make a purchase. The recognition accuracy for a single digit is 99.1%, while the accuracy for the text information on a card is 93.3%.
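To see why a high per-digit accuracy still matters a great deal for long numbers, here is a minimal arithmetic sketch. It assumes, purely for illustration, that recognition errors on individual digits are independent and that a card number has 16 digits; neither assumption comes from the figures above, and the function name is hypothetical.

```python
def sequence_accuracy(per_item_accuracy: float, length: int) -> float:
    """Chance that every item in a sequence is recognized correctly.

    Assumes (for illustration only) that errors on individual items are
    independent, so the per-item probabilities simply multiply.
    """
    if not 0.0 <= per_item_accuracy <= 1.0:
        raise ValueError("per_item_accuracy must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return per_item_accuracy ** length


if __name__ == "__main__":
    digit_accuracy = 0.991   # single-digit accuracy quoted in the text
    card_digits = 16         # assumption: a typical card number length
    whole_number = sequence_accuracy(digit_accuracy, card_digits)
    print(f"Chance of getting all {card_digits} digits right: {whole_number:.1%}")
```

Under these assumptions, small per-digit error rates compound quickly, which is why such systems usually read the number back to the customer for confirmation.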
A similar voice recognition technology is also used in call centers. It makes it possible to recognize a user’s request, route the call to the right employee, and perform simple actions through a voice menu.
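The routing step can be sketched very roughly as keyword matching on the recognized text. The department names and keyword lists below are hypothetical, and a production system would more likely use a trained intent classifier; this is only meant to show the shape of the idea.

```python
import re

# Hypothetical departments and trigger keywords, chosen for illustration.
ROUTES = {
    "billing": {"invoice", "payment", "charge", "refund"},
    "technical_support": {"error", "broken", "internet", "password"},
    "sales": {"buy", "order", "price", "upgrade"},
}
DEFAULT_ROUTE = "operator"  # fall back to a human when nothing matches


def route_request(transcript: str) -> str:
    """Pick the department whose keywords appear most often in the transcript."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    best_route, best_hits = DEFAULT_ROUTE, 0
    for route, keywords in ROUTES.items():
        hits = len(words & keywords)
        if hits > best_hits:  # ties keep the earlier route
            best_route, best_hits = route, hits
    return best_route


if __name__ == "__main__":
    for phrase in ("I want a refund for my last payment",
                   "My internet is broken",
                   "Hello, is anyone there?"):
        print(phrase, "->", route_request(phrase))
```

Falling back to a live operator when no keyword matches keeps recognition mistakes from trapping the caller in the voice menu.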
Today, voice systems do a good job of transcribing clear speech in tasks such as generating television subtitles or converting voice to text for messaging. At the same time, recognition systems are still not ready for even simple hearing tests.
In the near future, speech recognition could become an important technology alongside the global satellite Internet projects now under active development. Where the communication channel is limited, a voice call could be transmitted as text and then turned back into speech on the receiving side.
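A back-of-the-envelope calculation shows why sending text instead of audio is attractive on a narrow channel. All numbers below are assumptions made for this sketch, not figures from the article: a compressed voice stream of about 8 kbit/s, a speaking rate of 150 words per minute, and about 6 bytes per word including the trailing space.

```python
def text_bitrate_bps(words_per_minute: float, bytes_per_word: float) -> float:
    """Bits per second needed to send speech as plain text."""
    return words_per_minute * bytes_per_word * 8 / 60


def savings_ratio(voice_bps: float, words_per_minute: float,
                  bytes_per_word: float) -> float:
    """How many times less bandwidth the text form needs than the audio form."""
    return voice_bps / text_bitrate_bps(words_per_minute, bytes_per_word)


if __name__ == "__main__":
    VOICE_BPS = 8_000        # assumption: low-bitrate voice codec
    WORDS_PER_MINUTE = 150   # assumption: ordinary conversational pace
    BYTES_PER_WORD = 6       # assumption: average English word plus a space

    text_bps = text_bitrate_bps(WORDS_PER_MINUTE, BYTES_PER_WORD)
    ratio = savings_ratio(VOICE_BPS, WORDS_PER_MINUTE, BYTES_PER_WORD)
    print(f"Text stream: {text_bps:.0f} bit/s, about {ratio:.0f}x less than audio")
```

The estimate ignores protocol overhead and any data needed to reproduce the speaker's voice on the receiving side, so the real saving would be smaller.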
From a machine learning point of view, speech recognition consists of many stages. First, noise and interference must be removed from the original audio stream. In the cleaned recording, phonemes are identified. Phonemes are the perceptually distinct units of sound in a language, and they can then be assembled, more or less reliably, into words, phrases, and sentences. For greater accuracy, other data sources are used, such as images of the speaker’s face or other voice recordings with known transcripts.
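The step of assembling phonemes into words can be illustrated with a toy decoder. The sketch below uses a tiny hand-written pronunciation lexicon with ARPAbet-style symbols and a greedy longest-match rule; both are assumptions for illustration. Real recognizers weigh many competing hypotheses probabilistically rather than matching greedily.

```python
from typing import Dict, List, Sequence, Tuple

# Toy pronunciation lexicon (hypothetical, ARPAbet-style symbols).
LEXICON: Dict[Tuple[str, ...], str] = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("HH", "AH"): "huh",
}
MAX_WORD_LEN = max(len(pron) for pron in LEXICON)
UNKNOWN = "<unk>"


def phonemes_to_words(phonemes: Sequence[str]) -> List[str]:
    """Greedily match the longest known pronunciation at each position."""
    words: List[str] = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_WORD_LEN, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + n])
            if candidate in LEXICON:
                words.append(LEXICON[candidate])
                i += n
                break
        else:
            # No pronunciation starts here: emit a placeholder and skip one phoneme.
            words.append(UNKNOWN)
            i += 1
    return words


if __name__ == "__main__":
    print(phonemes_to_words(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
```

Greedy matching fails as soon as two words share a prefix in an ambiguous way, which is one reason the text describes the assembly as only "more or less" reliable.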
Recognizing the meaning of what is said is a separate, large task. For this reason, the role of voice assistants is still limited to simple commands. Understanding meaning is far more complex and is currently implemented only for individual components, such as the object, emotion, or tone of an utterance.
The reverse process, speech synthesis, uses machine learning to generate speech quickly and efficiently from samples of real people’s voices. There are already startups that offer voice substitution, for example dubbing texts in the voices of historical figures.
As for language support, English and Chinese dominate speech recognition technologies. This is due to the volume of investment in speech recognition from Chinese companies and the US. However, there are free voice engines that allow other companies to join the technology race. Mozilla DeepSpeech and Kaldi, a completely open source solution, are among them.
However, to add support for a language other than English or Chinese to the free solutions, you need about 10 thousand hours of speech for training. The data must be labeled and must include recordings of varied dialogues; only then can the acoustic and language models be trained well.
There are many speech recognition services today, and most of them focus on English. The main problem is the lack of additional mechanisms for interpreting the recognition results. Such systems return several candidate transcriptions, and the correct one, if present, is not necessarily the one the service ranks as most probable.
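One simple form such an interpretation mechanism could take is re-ranking the candidate list with domain knowledge. The sketch below is an illustration under stated assumptions: the hypotheses, scores, domain vocabulary, and bonus weight are all made up, and no particular service's API is implied.

```python
from typing import List, Set, Tuple

Hypothesis = Tuple[str, float]  # (transcript, score reported by the recognizer)


def rescore(hypotheses: List[Hypothesis], domain_terms: Set[str],
            bonus: float = 0.1) -> List[Hypothesis]:
    """Re-rank candidates, adding a fixed bonus for every in-domain word."""
    rescored = []
    for text, score in hypotheses:
        matches = sum(1 for word in text.lower().split() if word in domain_terms)
        rescored.append((text, score + bonus * matches))
    # Highest adjusted score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical n-best list: the recognizer's top choice is wrong.
    n_best = [
        ("transfer money to my saving the count", 0.48),
        ("transfer money to my savings account", 0.45),
        ("transfer monkey to my savings account", 0.07),
    ]
    banking_terms = {"transfer", "money", "savings", "account"}
    best_text, _ = rescore(n_best, banking_terms)[0]
    print(best_text)
```

Here the second candidate overtakes the first because it contains more words from the banking vocabulary. The bonus weight would need tuning on real data.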
As a result, people are uncomfortable using these technologies because they have to speak unnaturally and slowly. If the program can’t recognize what has been said, the text needs to be repeated again and again, which can be annoying. The user gets the feeling that the technology is flawed, and this negative effect can undermine the success of an implementation if people refuse to communicate with the robot. Moreover, in the tasks where the technology could be applied, the alternative is often low-skilled human work, whose cost may not be high enough to make automation worthwhile.
Developers today lack experience in how to properly build a dialogue between a person and a machine. One interesting pattern is that adults talk to a robot as if it were, to put it mildly, a silly person. In response to questions, they give long explanations and start speaking slowly, which causes various problems in building a dialogue, although they behave normally when having the same dialogue with live operators. Children, by contrast, behave as naturally as possible and have no problems communicating.
Speech recognition technologies are actively developing. Tasks that were previously considered impracticable have already been solved. For example, voice recognition during simultaneous conversation has already been implemented, and smooth speech synthesis that approaches the level of human speech is in use. Experts believe that the next three years will bring significant technological growth in speech recognition. As a result, there will be many voice-based solutions in the field of business automation.
Thanks to the development of these technologies, speech recognition, voice biometrics, and voice control have become reliable tools. For most tasks, speech recognition does its job. Difficulties remain with recognizing telephone conversations and with separating mono recordings into stereo, but there is progress here as well.
Today’s popular voice assistants are unlikely to become as widespread as was expected a year ago, but the task of processing recordings of people’s calls remains relevant and will gain value in the service economy.