What is Automatic Speech Recognition Technology? The Ultimate Guide to ASR
When we look at the history of computer science, we see clear generational lines that are defined by their input method. How does information travel from our brains into the computer? From the early punch card computers, to the familiar keyboard, to the latest touch screens that we carry in our pockets, we can trace advances in computation to the ways we interact with the digital. As is always the case with technology, our question is “what comes next?”
Answer: the human voice. Automatic Speech Recognition (ASR) is the tech that’s enabling this transition. Essentially, ASR is all about using computers to transform the spoken word into the written one.
This is a huge step, both in terms of the opportunities that it creates and the challenges that we have to overcome to achieve it. To give you an analogy, consider the evolution of language itself. The written word didn’t appear for tens of thousands of years after the emergence of spoken language, but, when it did, it began a new epoch of civilization. The point is that going from speaking to writing is really hard, but the consequences are just as significant.
Even without getting into the sci-fi future—and don’t worry, we’ll get there—there’s a wellspring of opportunity in the here-and-now that savvy business people, developers, and others are tapping into. ASR is already being put to good use.
We’re going to show you how. But first, let’s go through the basics.
Demystifying Automatic Speech Recognition
Today’s ASR falls within the domain of machine learning (ML), which, in turn, is a form of artificial intelligence (AI). The latter refers broadly to computers that simulate thought, while the former is a specific technology that tries to achieve AI’s goals by training a computer to learn on its own. Basically, instead of attempting to code the rules for translating speech input into text output, we train an ML model by feeding large datasets into an algorithm, such as a convolutional neural network (CNN), which loosely mimics the human brain’s architecture. The model becomes progressively better at inference, the process of turning inputs into outputs or, in our case, speech into text.
Another key distinction is the difference between automatic speech recognition and natural language processing (NLP). ASR concerns itself with converting speech data into text data, while NLP seeks to “understand” language to fuel other actions. They’re easy to conflate because they often appear together; a smart speaker, for instance, uses ASR to convert voice commands into a usable format and NLP to figure out what we’re asking it to do. Hence, NLP is more concerned with meaning than ASR.
Finally, let’s dive into some specific ASR terminologies and technologies. Most ASR begins with an acoustic model to represent the relationship between audio signals and the basic building blocks of words. Just like a digital thermometer converts an analog temperature reading into numeric data, an acoustic model transforms sound waves into bits that a computer can use. From there, language and pronunciation models take that data, apply computational linguistics, and consider each sound in sequence and in context to form words and sentences.
The latest research, however, is stepping away from this multi-algorithm approach in favor of using a single neural network called an end-to-end (E2E) model. Michelle Huang, one of Rev’s Senior Speech Scientists, explains that “one exciting thing we’re working on right now is end-to-end speech recognition. That would enable us to quickly expand to more non-English language due to the ease of training new models.” Other advantages include reduced decoding time and joint optimization with downstream NLP.
Another key term is speaker diarization, which enables an ASR computer to determine which speaker is speaking at which time. Not only is this crucial for use cases like generating transcripts from conference calls with multiple speakers, but it also avoids the confusion of combining two speakers’ simultaneous speech into a single nonsensical caption.
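To make diarization concrete, here is a minimal sketch of its final step: attaching speaker labels to recognized words. It assumes that an ASR system has already produced word timestamps and that a diarization system has already produced speaker turns. The `Word` and `Turn` types and the `label_words` function are hypothetical names used for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(word: Word, turn: Turn) -> float:
    """Seconds during which the word and the speaker turn coincide."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def label_words(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Assign each word to the turn it overlaps most, then merge
    consecutive words from the same speaker into one line."""
    lines: list[tuple[str, list[str]]] = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t), default=None)
        if best is None or overlap(word, best) == 0.0:
            speaker = "unknown"  # no turn covers this word
        else:
            speaker = best.speaker
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(word.text)
        else:
            lines.append((speaker, [word.text]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in lines]


# Illustrative data only: two speakers on a short call.
words = [
    Word("hello", 0.0, 0.4),
    Word("there", 0.4, 0.8),
    Word("hi", 1.0, 1.2),
    Word("everyone", 1.2, 1.8),
]
turns = [Turn("Speaker A", 0.0, 0.9), Turn("Speaker B", 0.9, 2.0)]
transcript = label_words(words, turns)
```

Real diarization systems must also decide where the turns begin and end, which is the hard part. This sketch only shows how the two outputs are combined into a readable transcript.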
Once we’ve built our ASR, we also need a way to evaluate it. Word Error Rate (WER) is the gold-standard for ASR benchmarking because it tells us how accurately our model can do its job by comparing the output to a ground-truth transcript created by a human. Here’s the formula:
Word Error Rate = (insertions + deletions + substitutions) / number of words in the reference transcript
Simply put, it gives us the share of words that the ASR got wrong, measured against the reference transcript. A lower WER, therefore, translates to higher fidelity. We’ll come back to WER later when we see how different ASR providers stack up.
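As a rough illustration of how WER can be computed, the sketch below counts the minimum number of word-level insertions, deletions, and substitutions with a standard edit-distance calculation. It assumes simple lowercase, whitespace-based tokenization. Production benchmarks usually apply more careful text normalization first, and the `word_error_rate` function name is ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (insertions + deletions + substitutions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the"), one substitution ("lights" -> "light"), and one
# insertion ("please") against a five-word reference.
wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light please")
```

Here `wer` comes out to 3 / 5 = 0.6. Because insertions count as errors, WER can exceed 1.0 when the ASR output is much longer than the reference.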
How We’re Using ASR Right Now
Now it’s time for the good stuff: the brilliant applications and innovative use cases that we’re seeing from a variety of industries. Since ASR is such a generally useful technology, it’s impossible to list every application. These are some of our favorites.
Generating closed captions is the most obvious place to start. Captioning comes in two forms: offline and live. Whether it’s for movies, television, video games, or any other form of media, offline ASR creates captions ahead of time to aid comprehension and make media more accessible to viewers who are deaf or hard of hearing. In contrast, live ASR lets us stream captions in real time with latency on the order of seconds. This makes it ideal for live TV, presentations, or video calls.
ASR is also great for creating transcripts after the fact. In addition to the standard lectures, podcasts, etc., one of the most innovative uses that we’re seeing is companies creating transcripts of Zoom calls and other virtual meetings. There are a few key benefits. First, text is much easier to search than audio, enabling us to easily reference important moments or pull out quotes. Second, a transcript takes much less time to review than a recording. Lastly, transcripts are easier to share if someone misses a meeting.
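The searchability benefit is easy to see in code. The sketch below assumes a transcript stored as a list of (start time in seconds, text) pairs, which is a hypothetical layout chosen for illustration. The `find_mentions` and `format_timestamp` helpers are likewise our own names.

```python
def find_mentions(segments: list[tuple[float, str]], keyword: str) -> list[tuple[float, str]]:
    """Return every segment whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [(start, text) for start, text in segments if needle in text.lower()]


def format_timestamp(seconds: float) -> str:
    """Render a start time as MM:SS so a reader can jump to it in the recording."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Illustrative meeting transcript, not real data.
segments = [
    (12.0, "Welcome to the quarterly review"),
    (95.5, "The budget is up ten percent"),
    (301.0, "Let's revisit the Budget next week"),
]
mentions = [(format_timestamp(start), text) for start, text in find_mentions(segments, "budget")]
```

Finding the same two moments in the raw audio would mean listening to the whole recording. With a transcript, it is a single substring search.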
The medical community is another heavy adopter of ASR tech. According to a Wharton Customer Analytics whitepaper, “Physicians are increasingly relying on AI-assisted technologies that convert voice dictated clinical notes into machine-understandable electronic medical records and, combined with analysis of diagnostic images in such disease areas as cancer, neurology, and cardiology, relevant information is being uncovered for decision making.” Along similar lines, the Covid pandemic accelerated the rise of telehealth, where ASR has been crucial for screening and triaging remote patients.
Call centers are also employing ASR to drive better customer outcomes. Besides contact centers that employ fully automated chatbots, uses include monitoring customer support interactions, analyzing initial contacts to more quickly resolve issues, and improving employee training. McKinsey research found that “companies have already applied advanced analytics to reduce average handle time by up to 40 percent, increase self-service containment rates by 5 to 20 percent, cut employee costs by up to $5 million, and boost the conversion rate on service-to-sales calls by nearly 50 percent, all while improving customer satisfaction and employee engagement.”
Software developers are also putting ASR to good use. For instance, a mobile app developer can leverage Rev.ai’s speech-to-text APIs to integrate ASR functionality without paying for the overhead of a data science team or hours of high-powered cloud compute to train a new model. In turn, their users can enjoy a more seamless and intuitive user experience by navigating the app with their voices.
While ASR plays a role in app categories across the board, it takes front and center in translation apps. This technology is well on the way to creating a “universal translator” that breaks down language barriers and makes both travel and cross-border communication more accessible.
Lastly, we have the Internet of Things (IoT), and this is a big one. The IoT includes all the physical “smart” devices that increasingly inhabit our world. These range from smart home devices like thermostats and speakers to Industrial Internet of Things (IIoT) devices that optimize manufacturing processes and drive improved automation. Voice is quickly becoming the best way for users to interact with the IoT. By simply saying “turn on the lights,” or “turn up the temperature,” we’re able to control our environment in real time, all without ever having to look at a screen or press a button.
If we had to make a bet, we’d say the smart money is on ASR being pivotal to the implementation and adoption of the IoT at scale. But before we get ahead of ourselves by discussing the immense opportunities created by ASR, we’re going to have to overcome some serious challenges to get there.
The Future of ASR: Challenges and Opportunities
The first pressing challenge that both ASR and AI more generally face is inclusivity and equitability. Technology must serve all of us equally, but research shows bias in services ranging from AI-based financial services that are less likely to provide loans for minorities, to search engines that reinforce racism, to disparities in voice recognition software.
One report concludes, “Our results point to hurdles faced by African Americans in using increasingly widespread tools driven by speech recognition technology. More generally, our work illustrates the need to audit emerging machine-learning systems to ensure they are broadly inclusive.” More specifically, they found that the top five ASR systems “exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers.”
Remember, ML models learn from their training datasets, so when black voices are missing from that data, the ASR cannot accurately parse their speech. Besides bringing more diverse developers into tech, we must also employ more diverse training datasets that represent different accents, vernaculars, and speakers. You can read more about Rev’s initiatives on this front in this blog post.
Privacy is another major sticking point for ASR’s widespread adoption. To put it bluntly, surveillance is incompatible with democracy, and, as technologists, it’s our responsibility to innovate in a way that benefits our society. Plus, it’s just good business. The Wharton paper makes the case:
“The adoption of voice-enabled technology for common use-cases by consumers at home, in their vehicles, at work, at the store, or in any setting that offers convenience, is contingent on consumers trusting the privacy of their data…We note that going forward, as companies rely on first-party data relationships, earning consumer trust will become paramount. We foresee companies gaining competitive advantage by taking lead in engendering trust and incorporating Privacy by Design (Pbd) to ensure that personally identifiable information (PII) in systems, processes, and products is protected.”
Looking specifically at ASR, a paper titled Design Choices for X-vector Based Speaker Anonymization explains that “privacy protection methods for speech fall into four broad categories: deletion, encryption, distributed learning, and anonymization. The VoicePrivacy initiative specifically promotes the development of anonymization methods which aim to suppress personally identifiable information in speech while leaving other attributes such as linguistic content intact.”
Finally, there’s also a host of purely technical challenges to overcome. “The reality is that speech recognition systems today still struggle on audios that humans are capable of transcribing accurately,” says Nischal Bhandari of Rev. “Complicating factors include overlapping speech, diversity in pronunciation, and the ever-changing nature of language.”
If we can overcome these challenges, ASR will provide an incredible amount of opportunity. Many of these will come from ASR at the edge, which means we’ll run the end-to-end models on lower-powered computers that are closer to the data source rather than on high-powered computers in the cloud. There are a few key benefits: lower latency, more personalized—and therefore accurate—models, and better privacy protections since voice data doesn’t have to travel over a network.
Take, for instance, Apple’s Neural Engine, a custom chip that enables iPhones to handle certain ML tasks at the edge. In an interview with Ars Technica, Apple’s AI chief, John Giannandrea, explains, “I understand this perception of bigger models in data centers somehow are more accurate, but it’s actually wrong…It’s better to run the model close to the data, rather than moving the data around…it’s also privacy preserving.” Of course, Apple isn’t the only one running ML on the edge; with new hardware such as NVIDIA’s Jetson embedded computing modules hitting the market, developers can now run Rev’s ASR just about anywhere.
What this amounts to is a key step towards ambient computing, a state where computers are so ubiquitous that we forget that they’re even there. This vision comes from Mark Weiser’s seminal 1991 paper, The Computer for the 21st Century, which he begins by writing, “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.” Weiser’s thoughts on ubiquitous computers have remained a guiding light for Silicon Valley for the past three decades, and now we’re closer than ever before to achieving his dream. In practical terms, this means employing ASR to control an ambient IoT with our voices.
Another breakthrough on the horizon is improved affective computing, the art of breaking down speech patterns and other communications to detect the undercurrents of emotion and thought that run beneath the words. “When people speak, there is so much information in how things are said beyond what is said,” explains Victor Yap of Rev.
This includes “intonations, pauses, speed of speaking, and word choice, which can convey emotions and secret meanings.” He points to the case of a woman calling 911 to “order a pizza,” when really she was a victim of domestic violence. The operator was able to understand her meaning because of how she spoke, not the words themselves. As we make more advances in ASR tech, we expect chatbots to take on more of these capabilities.
Finally, we must note that ASR will be a pivotal part of any AI system that’s eventually able to pass the Turing test. Alan Turing, a founding figure of computer science who helped break the Nazis’ Enigma code, proposed this test as a way to determine whether a machine can truly think. Essentially, it involves a human holding two conversations, one with another person and one with a machine. If the human can’t determine which one is the machine, then Turing concludes it must be a thinking machine.
While top AI researchers submit their work in an annual competition, nobody has yet come close to actually passing the Turing test. However, if and when we do cross the threshold, ASR will play an important role in that conversation.
For all its complexities, challenges, and technicalities, ASR is really just about one simple objective: helping computers listen to us. We take this quality for granted in each other, but, when we stop to think about it, we realize just how important this capability truly is. As children, we learn by listening to our parents and teachers. We expand our minds by listening to the people we encounter, and we maintain our relationships by listening to one another.
Simply put, getting machines to listen is a big deal. It’s deceptively powerful, even if we only consider the present day use cases and ignore the vast opportunities that it will bring. At the same time, we must remember that with great power comes great responsibility. As technologists, we’re responsible for upholding our users’ privacy, for developing technologies without bias or prejudice, and for creating systems that benefit us all.
At Rev, we take these commitments as seriously as we do the quality of our ASR technology. Just as Rev leads the industry in speech recognition accuracy with a WER well below tech giants such as Google, Microsoft, and Amazon, we’re also leading the way in thinking critically about how these technologies will be applied to our daily lives.
Are you a developer who’s looking for an ASR speech to text API that’s fast, accurate, and easy to integrate? Are you a businessperson, lecturer, or podcaster looking for transcription services? Or do you simply need captions for your original content? Rev has you covered. Contact us today to learn more.