Deep Learning Based Speech Recognition

RUTUJA JAMDAR
4 min read · May 28, 2021


What is deep learning?

Deep learning can be considered a subset of machine learning: a field in which computer algorithms learn and improve on their own from data. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains.
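To make "multiple processing layers" concrete, here is a minimal NumPy sketch (illustrative only: the weights are random rather than trained) of an input flowing through stacked layers, each computing a more abstract representation of the one before it:

```python
import numpy as np

def relu(x):
    # The non-linearity between layers is what lets each layer
    # learn a genuinely new level of abstraction
    return np.maximum(0, x)

rng = np.random.default_rng(0)
# Three "processing layers", each just a weight matrix here
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 4))
W3 = rng.normal(size=(4, 2))

x = rng.normal(size=16)   # raw input features
h1 = relu(x @ W1)         # first-level representation
h2 = relu(h1 @ W2)        # second-level representation
y = h2 @ W3               # output scores
print(y.shape)            # (2,)
```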

In this article we will discuss speech recognition using deep learning.

So, what’s speech recognition?

Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace, or reduce the reliance on, standard keyboard input.

Regardless of differences in architecture, all of the most widespread voice assistants, such as Google Assistant, Cortana, and Siri, are powered by a deep-neural-network-based engine at the backend.

Wondering how Alexa works?

It starts with signal processing, which gives Alexa as many chances as possible to make sense of the audio by cleaning the signal. Signal processing is one of the most important challenges in far-field audio.

The idea is to enhance the target signal, which means being able to identify ambient noise, like the TV, and minimize it. To resolve these issues, seven microphones are used to identify roughly where the signal is coming from so the device can focus on it. Acoustic echo cancellation then subtracts the device's own audio output, so that only the important signal remains.
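Amazon hasn't published Alexa's exact audio front end, but delay-and-sum beamforming is the textbook way a microphone array focuses on one direction. A minimal NumPy sketch, with made-up delays (a real device estimates them from the direction of arrival):

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Align each microphone by its estimated delay, then average.

    mics   : array of shape (n_mics, n_samples)
    delays : per-microphone delay in samples toward the target direction
    """
    aligned = [np.roll(sig, -d) for sig, d in zip(mics, delays)]
    # The target signal adds up coherently; uncorrelated noise averages out
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(1)
mics = rng.normal(size=(7, 16000))   # 7 simulated channels, 1 s at 16 kHz
delays = [0, 2, 4, 1, 3, 2, 0]       # hypothetical values
enhanced = delay_and_sum(mics, delays)
```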

The next task is “Wake Word Detection”: determining whether the user said one of the words the device is programmed to wake on, such as “Alexa”. The system needs to minimize false positives and false negatives, which could lead to accidental purchases and angry customers. This is really complicated, as it needs to handle pronunciation differences, and it needs to run on the device itself, which has limited CPU power.
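As a rough illustration, wake word detection can be framed as sliding a short window over the incoming audio and scoring each chunk with a small classifier. In this sketch, `model.score` is a hypothetical stand-in for the on-device neural network, not a real Alexa API:

```python
def detect_wake_word(audio, model, sr=16000, win_s=1.0, hop_s=0.1, threshold=0.8):
    """Slide a 1-second window over `audio` and score each chunk.

    `model.score(chunk)` is assumed to return the probability
    that the chunk contains the wake word.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    for start in range(0, len(audio) - win + 1, hop):
        p = model.score(audio[start:start + win])
        # The threshold trades false positives (accidental wake-ups)
        # against false negatives (ignored users)
        if p >= threshold:
            return start / sr   # time in seconds where the wake word begins
    return None
```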

If the wake word is detected, the signal is then sent to the speech recognition software in the cloud, which takes the audio and converts it to text. The output space here is huge, as it covers all the words in the English language, and the cloud is the only technology capable of scaling sufficiently. This is further complicated by the number of people who use the Echo for music: many artists spell their names differently from the ordinary words they sound like.

To convert the audio into text, Alexa analyzes characteristics of the user’s speech, such as frequency and pitch, and turns them into feature values.
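Amazon doesn't disclose its exact feature pipeline, but mel-frequency cepstral coefficients (MFCCs) are a classic choice for exactly this kind of frequency-derived feature. A sketch using the open-source librosa library (the file name is a placeholder):

```python
import librosa

# Load speech audio, resampled to 16 kHz
audio, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: a compact summary of the spectral envelope,
# standing in for the "frequency and pitch" features described above
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)
```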

A decoder will then determine the most likely sequence of words, given the input features and the model, which is split into two pieces. The first piece is the prior, which gives you the most likely word sequence based on a huge amount of existing text, without looking at the features; the other is the acoustic model, which is trained with deep learning by looking at pairings of audio and transcripts. These are combined, and dynamic decoding is applied, which has to happen in real time.
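Conceptually, the decoder scores each candidate transcript by combining the two models. A toy sketch with made-up log-probabilities and an assumed language model weight:

```python
def combined_score(acoustic_logprob, lm_logprob, lm_weight=0.8):
    """Mix the two models' scores for one candidate transcript.

    acoustic_logprob : log P(audio features | words), from the acoustic model
    lm_logprob       : log P(words), from the prior / language model
    lm_weight        : tunable balance between the two (assumed value)
    """
    return acoustic_logprob + lm_weight * lm_logprob

# Two candidate transcripts that sound alike (all numbers are made up)
candidates = {
    "play jazz": combined_score(acoustic_logprob=-12.0, lm_logprob=-3.0),
    "clay jazz": combined_score(acoustic_logprob=-11.5, lm_logprob=-9.0),
}
print(max(candidates, key=candidates.get))  # "play jazz": the prior breaks the tie
```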

Alexa, Siri, and Cortana are not the only systems that use deep learning; YouTube uses it too.

Let’s take a look:

Auto-generated YouTube subtitles (YouTube CC) are one example of speech recognition.

How does it work?

The algorithms that decode and transcribe audio into text are trained on vast amounts of data.

The two models at work here are:

Acoustic model — it represents a mapping between audio and phonemes (the sound units of which words are composed).

Language model — it represents the probability of word sequences, i.e., how likely words are to follow one another in a sentence.

A phonetic dictionary maps between the acoustic model and the language model. The acoustic model is a deep neural network trained on hours of conversation, while the language model is trained on billions of words.
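A toy illustration of how these pieces fit together: a phonetic dictionary holding pronunciations, and a bigram language model scoring which word is more likely in context (all entries and probabilities are made up):

```python
phonetic_dict = {              # word -> phoneme sequence
    "speech": ["S", "P", "IY", "CH"],
    "beach":  ["B", "IY", "CH"],
}

bigram_lm = {                  # P(word | previous word), illustrative numbers
    ("recognize", "speech"): 0.30,
    ("recognize", "beach"):  0.01,
}

def lm_prob(prev, word):
    # Tiny floor probability for word pairs never seen in training
    return bigram_lm.get((prev, word), 1e-6)

# Two acoustically similar words: the language model breaks the tie
for word in ("speech", "beach"):
    print(word, phonetic_dict[word], lm_prob("recognize", word))
```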

The subtitles (YouTube CC) that we see appearing on YouTube videos are courtesy of automatic speech recognition. Through deep learning, automatic speech recognition models can efficiently generate subtitles with up to 95% accuracy (a figure that will only increase in the coming times).
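YouTube's pipeline is proprietary, but you can experiment with cloud-backed speech-to-text yourself using the open-source SpeechRecognition package (the file name is a placeholder):

```python
import speech_recognition as sr  # third-party "SpeechRecognition" package

recognizer = sr.Recognizer()
with sr.AudioFile("video_audio.wav") as source:
    audio = recognizer.record(source)   # read the whole file

try:
    # Sends the audio to Google's free web speech API and returns text
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"API request failed: {e}")
```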

Improvements in deep learning can further enhance automatic speech recognition in YouTube subtitles and improve their accuracy. Since deep learning models learn from data and experience, this is very much possible: the more the data, the higher the accuracy.

The intense competition we’re seeing between these tech giants and the increasing number of companies jumping in to build in this space suggest that we still have a long road ahead of us.

I hope you now have an idea of how speech recognition based on deep learning works.
