Introduction to Speech Recognition Algorithms
Consider all the ways to cook an egg. Using mostly the same ingredients, we can prepare it sunny side up or over easy. We can make a diner-style omelet, a fancy French omelet, or even a Japanese rolled omelet. Maybe they differ slightly in seasoning or the type of fat we use to coat the pan, but the real distinction between these preparations is technique.
We see this same theme play out in computer science with the development of new and improved speech recognition algorithms. While many technologists rightly attribute the recent “AI explosion” to the rise of big data alongside advances in computing power, especially graphics processing units (GPUs) for machine learning (ML), we can’t ignore the profound effects of hard work by data scientists, researchers, and academics around the world. Yes, using a stove instead of a campfire helps us to cook an egg, but that doesn’t tell the whole story when it comes to differentiating a soft-boiled egg from a souffle.
That’s why we’re going to give you a quick rundown of speech recognition algorithms, both past and present. We’ve done away with the dense jargon and indecipherable math that fills so many similar articles on the web. This introduction is written for you, the curious reader who doesn’t have a PhD in computer science.
Before we get started, let’s return for a moment to our cooking analogy to simplify our central concept: the algorithm. Just like a recipe, an algorithm is nothing more than a set of ordered steps. First do this, then do that. Computer algorithms usually rely on a complex series of conditionals—if the pancake batter is too thick, add more milk; else add more flour. You get the point.
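To make the pancake-batter idea concrete, here's a toy sketch of that recipe as code. The function, its name, and the thickness scale are all invented for illustration; the point is simply that an algorithm is ordered steps plus conditionals.

```python
# A recipe-style algorithm as code: nudge a hypothetical pancake batter
# toward a target thickness, one conditional step at a time.
def adjust_batter(thickness, target=5):
    steps = []
    while thickness != target:
        if thickness > target:      # too thick: thin it out
            steps.append("add milk")
            thickness -= 1
        else:                       # too thin: bulk it up
            steps.append("add flour")
            thickness += 1
    return steps

print(adjust_batter(7))  # ['add milk', 'add milk']
```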
The Old Way
Automatic speech recognition (ASR) is one of the oldest applications of artificial intelligence because it’s so clearly useful. Being able to use voice to give a computer input is much easier and more intuitive than using a mouse, keyboard, or touchscreen. While we’re not going to be able to account for every method we’ve tried to get computers to become better listeners, we are going to give you an overview of the two major algorithms that dominated ASR until very recently.
Keep in mind that these speech recognition engines were Frankenstein creations that required multiple models to turn speech into text. An acoustic model digested soundwaves and translated them into phonemes, the basic building blocks of language; a pronunciation model mapped those phonemes to words, attempting to account for the vast variations in speech that result from everything from geography to age; and a language model pieced the words together into likely sentences. The result of this multi-pronged approach was a system that was as fragile as it was finicky. We’re sure you’ve dealt with this mess on a customer service hotline.
These systems primarily relied on two types of algorithms. First, n-gram models use the previous words as context to try to figure out a given word. An n-gram is a sequence of n words, so a bi-gram model (n=2) predicts each word from the single word that came before it, while a tri-gram model (n=3) looks at the previous two. While higher values of n lead to greater accuracy, since the computer has more context to work with, it simply isn’t practical to use a large number for n because the computational overhead is too much. Either we need such a powerful computer that the costs aren’t worth it, or the system becomes sluggish to the point of unusability.
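Here's a minimal bi-gram model in a few lines of Python. The corpus is made up for illustration: we count which word follows which, then predict the most common continuation.

```python
from collections import Counter, defaultdict

# Toy bi-gram model: count word-to-next-word transitions in a tiny
# invented corpus, then predict the most likely next word.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # return the most common continuation seen in training
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' — it follows "the" twice in the corpus
```

A real system would use probabilities and smoothing rather than raw counts, but the core bookkeeping looks just like this.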
The other algorithm is the Hidden Markov Model (HMM). An HMM assumes that behind the sounds we can observe lies a sequence of hidden states—in speech recognition, typically the phonemes or words actually being spoken—that we can’t measure directly. The “hidden” part refers to exactly those unobservable states: the algorithm uses probabilities and statistics to infer the most likely hidden sequence from the audio it can observe. The “Markov” part is a simplifying assumption—each state depends only on the one immediately before it, not on the entire history. If you’ve ever watched a dictation feature revise its guess as you keep talking, you’ve seen this kind of probabilistic inference in action.
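To show what "inferring hidden states" means, here's a toy HMM with two hidden part-of-speech states and a handful of invented probabilities, decoded with the classic Viterbi algorithm. Every number here is made up for illustration; real speech HMMs use phoneme states and acoustic features, not words.

```python
# Toy HMM: hidden part-of-speech states emit observable words.
# All probabilities below are invented for illustration.
states = ["Noun", "Verb"]
start = {"Noun": 0.6, "Verb": 0.4}
trans = {"Noun": {"Noun": 0.3, "Verb": 0.7},
         "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit = {"Noun": {"dogs": 0.6, "run": 0.1, "fast": 0.3},
        "Verb": {"dogs": 0.1, "run": 0.8, "fast": 0.1}}

def viterbi(words):
    # best[s] = (probability, path) of the best state path ending in s
    best = {s: (start[s] * emit[s][words[0]], [s]) for s in states}
    for w in words[1:]:
        best = {s: max(
            (best[p][0] * trans[p][s] * emit[s][w], best[p][1] + [s])
            for p in states) for s in states}
    return max(best.values())[1]

print(viterbi(["dogs", "run"]))  # ['Noun', 'Verb']
```

Given the observed words "dogs run", the model decides the most likely hidden sequence is noun-then-verb, even though it never sees those tags directly.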
The New Way
Today’s state-of-the-art speech recognition algorithms leverage deep learning to create a single, end-to-end model that’s more accurate, faster, and easier to deploy on smaller machines like smartphones and internet of things (IoT) devices such as smart speakers. The main algorithm that we use is the artificial neural network, a many-layered (hence deep) architecture that’s loosely modeled on the workings of our brains.
Larry Hardesty at MIT gives us a good overview of how the magic happens: “To each of its incoming connections, a node will assign a number known as a ‘weight.’ When the network is active, the node receives a different data item—a different number—over each of its connections and multiplies it by the associated weight. It then adds the resulting products together, yielding a single number. If that number is below a threshold value, the node passes no data to the next layer. If the number exceeds the threshold value, the node ‘fires,’ which in today’s neural nets generally means sending the number—the sum of the weighted inputs—along all its outgoing connections.”
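The node Hardesty describes fits in a few lines of code. This is a bare-bones sketch with arbitrary example numbers: weight each input, sum, and pass the sum along only if it clears a threshold.

```python
# One neural-network node: multiply each input by its weight, sum the
# products, and "fire" (pass the sum on) only past a threshold.
def node(inputs, weights, threshold=1.0):
    total = sum(x * w for x, w in zip(inputs, weights))
    return total if total > threshold else 0.0

print(round(node([0.5, 0.8, 0.2], [1.0, 2.0, 0.5]), 2))  # 2.2 — fires
print(node([0.1, 0.1, 0.1], [1.0, 2.0, 0.5]))            # 0.0 — stays silent
```

Modern networks usually replace the hard threshold with a smooth activation function, but the weighted-sum-then-decide structure is the same.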
While most neural networks “feed-forward,” meaning that nodes only send their output to nodes that are further down in the chain, the specific algorithms that we use for speech processing work a little differently. Dubbed the Recurrent Neural Network (RNN), these algorithms are ideal for sequential data like speech because they’re able to “remember” what came before and use their previous output as input for their next move. Since words generally appear in the context of a sentence, knowing what came before and recycling that information into the next prediction goes a long way towards accurate speech recognition.
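The "remembering" trick can be sketched in one function. This is a deliberately stripped-down recurrence with scalar weights chosen arbitrarily; real RNNs use matrices of learned weights, but the shape of the update is the same: the new hidden state mixes the current input with the previous hidden state.

```python
import math

# One recurrent step: the new hidden state blends the current input with
# the previous hidden state, so earlier inputs echo into later steps.
def rnn_step(x, h_prev, w_in=0.5, w_rec=0.9):
    return math.tanh(w_in * x + w_rec * h_prev)

h = 0.0
for x in [1.0, 0.0, 0.0]:  # one real input, then two silent steps
    h = rnn_step(x, h)
    print(round(h, 3))     # stays above zero: the first input persists
```

Notice that even after the input goes quiet, the hidden state carries a fading trace of what came before—exactly the context a word-by-word recognizer needs.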
Now, there’s one last algorithm that we need to mention to give you a full overview of speech recognition algorithms. This one solves a very specific problem with training speech recognition models. Remember that ML models learn from data; for instance, an image classifier can tell the difference between cats and dogs after we feed it pictures that we label as either “cat” or “dog.” For speech recognition, this amounts to feeding it hours upon hours of audio and the corresponding ground-truth transcripts that were written by a human transcriptionist.
But how does the machine know which words in the transcript correspond to which sounds in the audio? This problem is even further compounded by the fact that our rate of speaking is anything but constant. Maybe we slow down for effect or speed up when we realize that our allotted presentation time is almost through. Either way, the rate at which we say the same word can vary dramatically. In technical terms, we call this a problem of alignment.
To solve this conundrum, we employ Connectionist Temporal Classification (CTC). This algorithm uses a probabilistic approach to align the labels (transcripts) with the training data (audio). How exactly this works is beyond the scope of this article, but suffice it to say that this is a key ingredient for training a neural network to perform speech recognition tasks.
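One concrete piece of CTC is easy to show, though: the network emits a label for every slice of audio (including a special "blank" symbol), and CTC collapses repeated labels and blanks so that many different frame-level alignments all map to the same transcript. The frame sequences below are invented for illustration.

```python
# CTC's collapse rule: merge consecutive repeats, then drop blanks ("-").
# Many frame-by-frame alignments collapse to the same final transcript.
BLANK = "-"

def ctc_collapse(frames):
    out = []
    prev = None
    for f in frames:
        if f != prev and f != BLANK:  # skip repeats of the previous frame
            out.append(f)             # and skip blanks entirely
        prev = f
    return "".join(out)

# Two different alignments of the same spoken word, at different speeds:
print(ctc_collapse(list("hh-e-lll-ll-oo")))  # hello
print(ctc_collapse(list("h-ee-l-l--o---")))  # hello
```

This is how CTC sidesteps the alignment problem: the model never has to know exactly which audio frame belongs to which letter, because any reasonable alignment collapses to the right answer.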
When we add it all up, recurrent neural networks alongside CTC have enabled huge breakthroughs in speech recognition technology. Our systems are able to handle large vocabularies, incredible variations in speaker dialect and pronunciation, and even operate in real-time thanks to these algorithms.
In truth, there’s really no single factor that’s responsible for these advances. Yes, the software that we’ve described plays a huge role, but the hardware it runs on and the data it learns from are all equal parts of the equation. These factors have a symbiotic relationship; they grow and improve together. It’s a virtuous feedback loop.
However, that also means getting started can feel harder than ever. Between the vast quantities of data, the complex algorithms, and access to supercomputers in the cloud, established players in the industry have a huge head-start on anyone who is trying to catch up.
And that’s exactly why we’ve decided to offer our speech-to-text API, Rev.ai, to the developer community. You don’t have to build a full ASR engine to build a custom application that includes state-of-the-art voice integration. Get started today.