What is a Language Model in Speech Recognition?

Whether it’s made of clay or code, a model is a representation. Models help us to understand the world by stripping away the unnecessary or the circumstantial, enabling us to look more closely and clearly at the modeled subject. In our case, we use computers—and therefore math—to build a model of a subject that’s long been considered antithetical to the mathematician’s rigid confines.

Language ebbs and flows; its rules often appear more as suggestion than mandate, and each of us carries a unique way of speaking that’s bred in a place and evolves inside of us as we travel through life.

All of this is to say that teaching a computer to recognize and use language is really hard. We take language processing for granted because it comes so naturally to us, but creating a statistical language model, one that assigns probabilities to sequences of words, takes a lot more effort.

But don’t we build models and other applications in programming languages? If we can tell a computer what to do in Python, JavaScript, or C++, why can’t we just as easily give it input sequences in English, Spanish, or Japanese?

Dr. Jason Brownlee from Machine Learning Mastery clarifies this question by distinguishing between formal languages that “can be fully specified” and natural languages, which “are not designed; they emerge, and therefore there is no formal specification.” He calls language a “moving target” that involves “vast numbers of terms that can be used in ways that introduce all kinds of ambiguities.”

Mathematicians will recognize an isomorphism, a one-to-one mapping of one group onto another, between formal languages and the series of bits that computers rely on at their core. Lacking this direct correspondence, our best bet with natural language is to do what we often do in the face of uncertainty: play the odds. By applying statistical analysis via computational linguistics and technologies like machine learning (ML) algorithms, we can enable our computers to at least make good guesses. 

Our obsession is making those guesses better and better until they’re as good as yours and mine.

Also keep in mind that a language model is only one part of a complete Automatic Speech Recognition (ASR) engine. Language models rely on acoustic models to convert analog speech waves into the discrete, digital phonemes that form the building blocks of words. Other key components include lexicons, which glue these two models together by controlling what sounds we recognize and what words we predict, and pronunciation models, which handle differences in accent, dialect, age, gender, and the many other factors that make our voices unique.

A Language Model’s Life Story

Now let’s take a closer look at the lifecycle of a typical language model. This section will give you a general overview of how we create, use, and improve these models for everything from live streaming speech recognition to voice user interfaces for smart devices.

First we need our raw materials: data and code. The audio recordings in our datasets make up a corpus, and we’ll also need ground-truth transcripts of those recordings to serve as the reference against which we compare our results. Data scientists then write code using a variety of algorithms, which we describe in the next section on the different types of language models. This ML code is usually written in Python using frameworks like TensorFlow and PyTorch.

At this stage, developers can also use other techniques to shorten their timeline or achieve better results. For instance, transfer learning lets us reuse a pretrained model on a new problem. So, even though languages like English and Spanish aren’t identical, we can still leverage an English language model to kickstart a Spanish one. Another form of pretraining initializes certain values so that they already point in a useful direction, rather than beginning with ones that are completely random.

Now it’s time to train our model. While the specific technique will vary with the algorithm we choose (for instance, whether we’re using supervised or unsupervised learning), the principles are the same. We give the model input audio, it generates text, and we check that text against the ground-truth transcript. If a word is right, the model becomes more likely to guess that word in the future; if not, it becomes less likely to do so. We repeat this learning cycle hundreds of thousands, millions, or even billions of times on high-powered cloud computers.

We then assess our model’s performance using metrics like Word Error Rate (WER) and decide how to proceed with the next iteration. We’ll tune and tweak the code, change parameters, source new data, and run it again. At companies like Rev that specialize in ASR and language models, this iterative process never ends.
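To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between the model's output and the ground-truth transcript, divided by the number of words in the ground truth. The function name and the sample sentences are illustrative assumptions, and production scoring tools usually normalize punctuation and casing more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("the" -> "a") and one insertion ("today"),
# so 2 errors over 6 reference words.
reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat today"
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.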

Still, once we’ve achieved suitable performance levels, our model is ready for the big time: deployment. This could mean putting it on a server where users can access it for computer-generated transcripts, running it on a smartphone or smart speaker to improve a voice assistant’s listening skills, or plugging it in via a speech-to-text API for any number of custom applications.

Now that our model is up and running, we can use it for inference, the technical term for turning inputs into outputs. When a language model receives phonemes as an input sequence, it uses its learned probabilities to “infer” the right words. Many ML models can also continue to learn and improve during their operational lifetimes, enabling them to pick up new words and become more adept at serving individual users.
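As a rough illustration of that inference step, the sketch below assumes the acoustic model and lexicon have already proposed a few candidate words that sound alike, and lets the language model break the tie using the previous word. Every number here is invented for illustration, and `pick_word` is a hypothetical helper, not part of any real ASR library.

```python
import math

# Hypothetical acoustic scores: how well each candidate matches the sounds.
# Homophones score almost identically, so acoustics alone cannot decide.
acoustic_scores = {"there": 0.34, "their": 0.33, "they're": 0.33}

# Hypothetical language-model probabilities P(word | previous word).
lm_probs = {
    ("over", "there"): 0.20,
    ("over", "their"): 0.02,
    ("over", "they're"): 0.001,
    ("lost", "their"): 0.15,
    ("lost", "there"): 0.01,
    ("lost", "they're"): 0.001,
}


def pick_word(previous_word, candidates, floor=1e-6):
    """Return the candidate with the highest combined log score."""
    def score(word):
        # Unseen word pairs fall back to a small floor probability.
        lm = lm_probs.get((previous_word, word), floor)
        return math.log(candidates[word]) + math.log(lm)
    return max(candidates, key=score)


print(pick_word("over", acoustic_scores))  # there
print(pick_word("lost", acoustic_scores))  # their
```

Working in log space is a common design choice here: it turns the product of probabilities into a sum and avoids numerical underflow on long sequences.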

Types of Language Models

There are two major types of language models. The older variety uses traditional statistical techniques like Hidden Markov Models (HMMs) and n-grams to assign probability distributions to word sequences. These models rely heavily on context, using their short-term memory of previous words to inform how they parse the next. A bigram model, for instance, conditions each prediction on the one previous word, while a trigram model uses the previous two. An n-gram model, therefore, uses the previous n − 1 words to make its predictions.
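A tiny bigram model shows the idea in practice. This is a minimal sketch using plain maximum-likelihood counts over a made-up three-sentence corpus; the function names are illustrative, and real systems train on far larger corpora and add smoothing so that unseen word pairs do not get a probability of exactly zero.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def bigram_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if never seen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


def sentence_prob(counts, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(counts, prev, word)
    return prob


# Made-up toy corpus.
corpus = [
    "i want to recognize speech",
    "i want to go home",
    "we want to recognize words",
]
model = train_bigram(corpus)

# "recognize" follows "to" in 2 of the 3 training sentences.
print(bigram_prob(model, "to", "recognize"))
print(sentence_prob(model, "i want to recognize words"))
# "wreck" never appears in the corpus, so the whole sentence scores 0.0.
print(sentence_prob(model, "i want to wreck beach"))
```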

These models do have a few main drawbacks. While higher values of n give us better results, they also lead to higher computational overhead and RAM usage. This makes the models difficult to deploy on resource-light devices like those found in the internet of things (IoT). Keeping n small, in turn, means that they have a really hard time drawing on what came at the beginning of a sentence, paragraph, or section. They are also completely dependent on the training corpus, meaning that they can never infer words that weren’t in it. Lastly, they’re very dependent on the performance of the other models that make up an ASR engine.

On the other hand, deep learning language models use artificial neural networks to create a many-layered system that most data scientists consider the current state of the art. Deep learning algorithms are more flexible, faster to train, and don’t require as many resources for deployment.

One specific technology, the Transformer network, is especially prominent in language modeling because it is built around a mechanism called attention. While an n-gram model always looks at the same fixed window of previous words, a neural network with attention gives more weight to the important words. Just like you might be skimming this article right now to pick out the important bits, these systems operate along similar lines.
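The weighting idea can be sketched with scaled dot-product attention, the form used in Transformer networks: each word gets a score from the dot product of a query vector with that word's key vector, and a softmax turns the scores into weights that sum to one. The two-dimensional vectors below are invented toy values; in a real network they are learned during training and have hundreds of dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Toy example: hypothetical vectors for three context words.
words = ["the", "bank", "river"]
keys = [[0.1, 0.0], [1.0, 0.2], [0.9, 1.0]]
values = keys  # reuse the keys as values to keep the sketch short
query = [1.0, 1.0]

weights, context = attention(query, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these toy values, "river" receives the largest weight and "the" the smallest, regardless of where each word sits in the sequence.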

Another key benefit of the deep learning approach is end-to-end (E2E) modeling. In this architecture, we do away with the various models—acoustic, lexicon, language, etc.—and bring them all into a single model. These models are more efficient and less fragile than their fragmented counterparts.

Final Thoughts on Language Models

Especially for deep learning language models, data is EVERYTHING. Of course it’s important that the quality is of a high standard, but what often sets a great model apart from a good one is the sheer volume of data that we use to train it. Most ASR developers rely on standard corpora like LibriSpeech and the WSJ corpus, but these can only get you so far.

And that’s exactly why Rev outperforms tech giants like Microsoft and Amazon on ASR benchmarks measured by WER. Our team of over 50,000 human transcriptionists works on Rev’s premium human transcription and captioning services. From a data science perspective, that’s a lot of high-quality ground-truth transcripts. Not only do we have a world-class team of speech engineers and computer scientists, but they get to work with the best ingredients on the market.

The ASR solutions that we produce are only the beginning. From there, developers can use them for Natural Language Processing (NLP): tasks that require genuine language understanding. Essentially, while ASR converts audio into text, NLP digests that text to render meaning. To see this in action, you can check out this subreddit that was entirely generated by AI bots or this website that lets you play with a neural network to auto-complete your sentences and predict the next word.

Whether you need live captioning for your business’s teleconferences or better voice integration for your custom mobile app, you’re going to need ASR. Don’t start from scratch. Visit our services page to find out more about how you can leverage our best-in-class speech recognition solutions.