What is ASR? Rev’s Guide to Automatic Speech Recognition Technology
Artificial intelligence is changing the way we teach, learn, work, and function as a society, and speech technology is a big part of that change. Automatic Speech Recognition (ASR) is tech that uses AI to transform the spoken word into the written one.
While ASR might seem like the stuff of science fiction – don’t worry, we’ll get there later – it opens up plenty of opportunity in the here and now that savvy business people, developers, and others are tapping into. ASR is already being put to good use.
Want to learn how you can get in on the fun? We’re going to show you how. But first, let’s go through the basics.
What is ASR (Automatic Speech Recognition)?
To put it simply, ASR is a technology that uses machine learning (ML) and artificial intelligence (AI) to convert human speech into text. It’s a common technology that many of us encounter every day – think Siri, Okay Google or any speech dictation software.
It’s different from natural language processing (NLP), another AI-based language technology, in that ASR aims to convert speech data into text data. NLP, on the other hand, seeks to “understand” language, often so that it can produce written text from scratch that mimics the way a human would write.
How Does Automatic Speech Recognition Work?
Most ASR voice technology begins with an acoustic model to represent the relationship between audio signals and the basic building blocks of words. Just like a digital thermometer converts an analog temperature reading into numeric data, an acoustic model transforms sound waves into bits that a computer can use. From there, language and pronunciation models take that data, apply computational linguistics, and consider each sound in sequence and in context to form words and sentences.
This is the way things have been done for a while, but the latest research is stepping away from this multi-algorithm approach in favor of using a single neural network called an end-to-end (E2E) model.
Michelle Huang, one of Rev’s Senior Speech Scientists, explains that:
“one exciting thing we’re working on right now is end-to-end speech recognition. That would enable us to quickly expand to more non-English language due to the ease of training new models.”
Other advantages include reduced decoding time and joint optimization with downstream NLP.
Another key term you may have heard of in the context of automatic speech recognition is speaker diarization. This technical process enables an automatic speech recognition system to determine which speaker is speaking at which time. Not only is this crucial for use cases like generating transcripts from conference calls with multiple speakers, but it also avoids the confusion of combining two speakers’ simultaneous speech into a single nonsensical caption.
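To make that concrete, here’s a minimal Python sketch of how diarization output might be combined with ASR output. It does not perform diarization itself. It assumes we already have word timestamps from a recognizer and speaker turns from a diarizer, and it simply assigns each word to the turn it overlaps most. The `Word` and `Turn` types, the function names, and the sample data are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the overlap between two time intervals (0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Assign each word to the speaker turn it overlaps the most.

    A word that overlaps no turn falls back to the first turn in the list.
    """
    labeled = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word.start, word.end, t.start, t.end))
        labeled.append((best.speaker, word.text))
    return labeled


def to_transcript(labeled: list[tuple[str, str]]) -> list[str]:
    """Group consecutive words from the same speaker into one transcript line."""
    lines: list[tuple[str, list[str]]] = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Hypothetical sample data: two speaker turns and five timestamped words.
turns = [Turn("Speaker 1", 0.0, 1.5), Turn("Speaker 2", 1.5, 3.0)]
words = [
    Word("How", 0.1, 0.4), Word("are", 0.5, 0.7), Word("you", 0.8, 1.2),
    Word("Fine", 1.6, 2.0), Word("thanks", 2.1, 2.6),
]
for line in to_transcript(label_words(words, turns)):
    print(line)
```

Production systems handle overlapping speech and uncertain boundaries far more carefully, but the basic idea of lining up “who spoke when” with “what was said” is the same.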
The final step in building out automatic speech recognition software is choosing a way to evaluate it. Word Error Rate (WER) is the gold standard for ASR benchmarking because it tells us how accurately a model can do its job by comparing the output to a ground-truth transcript created by a human. Here’s the formula:
Word Error Rate = (insertions + deletions + substitutions) / number of words in reference transcript
Simply put, this formula gives us the percentage of words that the ASR messed up. A lower WER, therefore, means a higher accuracy. We’ll come back to WER later when we see how different ASR providers stack up.
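For readers who want to try the formula, here’s a minimal Python sketch. It counts insertions, deletions, and substitutions using a standard word-level edit distance. The function name and the simple lowercase-and-split normalization are our assumptions for illustration; real benchmarks typically apply more careful text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (insertions + deletions + substitutions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the"), one substitution ("lights" -> "light"),
# and one insertion ("please") against a 5-word reference.
print(word_error_rate("turn on the kitchen lights",
                      "turn on kitchen light please"))  # 0.6
```

Note that because insertions count as errors, WER can exceed 1.0 when the hypothesis contains many extra words.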
How We’re Using ASR Right Now
Now it’s time for the good stuff: the brilliant applications and innovative automatic speech recognition examples that we’re seeing from a variety of industries. Since ASR is such a generally useful technology, it’s impossible to list every application, so here are some of our favorites.
Generating closed captions is the most obvious place to start. Whether it’s for movies, television, video games, or any other form of media, offline ASR accurately creates captions ahead of time to aid comprehension and make media more accessible to the deaf and hard-of-hearing. In comparison, live ASR lets us stream captions in real time with latency on the order of seconds. This makes it ideal for live TV, presentations, or video calls.
ASR is also great for creating transcripts after the fact. In addition to the standard lectures, podcasts, etc., one of the most innovative uses that we’re seeing is companies creating transcripts of Zoom calls and other virtual meetings. There are a few key benefits. First, text is much easier to search than audio, enabling us to easily reference important moments or pull out quotes. Second, a transcript takes much less time to review than a recording. Lastly, transcripts are easier to share if someone misses a meeting.
The medical community is another heavy adopter of ASR tech. According to a Wharton Customer Analytics whitepaper, “Physicians are increasingly relying on AI-assisted technologies that convert voice dictated clinical notes into machine-understandable electronic medical records and, combined with analysis of diagnostic images in such disease areas as cancer, neurology, and cardiology, relevant information is being uncovered for decision making.” Along similar lines, the Covid pandemic accelerated the rise of telehealth, where automatic speech recognition technology has been crucial for screening and triaging remote patients.
Call centers are also employing ASR to drive better customer outcomes. Uses include monitoring customer support interactions, analyzing initial contacts to more quickly resolve issues, and improving employee training. In fact, McKinsey research found that “companies have already applied advanced analytics to reduce average handle time by up to 40 percent, increase self-service containment rates by 5 to 20 percent, cut employee costs by up to $5 million, and boost the conversion rate on service-to-sales calls by nearly 50 percent—all while improving customer satisfaction and employee engagement.”
Software developers are also putting ASR to good use. For instance, a mobile app developer can leverage Rev.ai’s speech-to-text APIs to integrate ASR functionality without paying for the overhead of a data science team or hours of high-powered cloud compute to train a new model. In turn, their users can enjoy a more seamless and intuitive user experience by navigating the app with their voices.
While ASR plays a role in app categories across the board, it takes front and center in translation apps. This technology is well on the way to creating a “universal translator” that breaks down language barriers and makes both travel and cross-border communication more accessible.
Lastly, we have the Internet of Things, which includes all the physical “smart” devices that increasingly inhabit our world. These range from smart home devices like thermostats and speakers to Industrial Internet of Things (IIoT) – devices that optimize manufacturing processes and drive improved automation. Voice is quickly becoming the best way for users to interact with the IoT. By simply saying “turn on the lights,” or “turn up the temperature,” we’re able to control our environment in real time, all without ever having to look at a screen or press a button.
If we had to make a bet, we’d say the smart money is on ASR being pivotal to the implementation and adoption of the IoT at scale.
The Future of ASR: The Challenges
Before we get ahead of ourselves discussing the immense opportunities ASR creates, we need to look at the serious challenges we’ll have to overcome to get there.
The first pressing challenge that both ASR and AI face is inclusivity and equitability. Technology must serve all of us equally, but research shows that even the best speech recognition systems are biased. The impact of these biases ranges from AI-based financial services that are less likely to provide loans for minorities to search engines reinforcing racism to disparities in voice recognition software.
As one report concludes:
“Our results point to hurdles faced by African Americans in using increasingly widespread tools driven by speech recognition technology. More generally, our work illustrates the need to audit emerging machine-learning systems to ensure they are broadly inclusive.”
More specifically, they found that the top five automatic speech recognition systems “exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers.”
Remember, ML models learn from their training datasets, so when black voices are missing from that data, the ASR cannot accurately parse their speech. To counteract this, we need to bring more diverse developers into tech, but we must also employ more diverse training datasets that represent different accents, vernaculars, and speakers. You can read more about Rev’s initiatives on this front.
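As a rough illustration of what auditing for this kind of disparity can look like, here’s a small Python sketch that compares WER across speaker groups. The group names and the error and word counts are hypothetical, chosen only so that they reproduce the 0.35 and 0.19 averages quoted above; they are not data from the study, and the function names are ours.

```python
def wer_from_counts(errors: int, reference_words: int) -> float:
    """WER from a total error count and a total reference word count."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return errors / reference_words


def audit_wer(counts: dict[str, tuple[int, int]]) -> tuple[dict[str, float], float, float]:
    """Return per-group WER, the absolute gap, and the worst-to-best ratio."""
    rates = {group: wer_from_counts(e, n) for group, (e, n) in counts.items()}
    worst, best = max(rates.values()), min(rates.values())
    ratio = worst / best if best > 0 else float("inf")
    return rates, worst - best, ratio


# Hypothetical counts: (total errors, total reference words) per speaker group.
counts = {"group_a": (35, 100), "group_b": (19, 100)}
rates, gap, ratio = audit_wer(counts)
print(rates)
print(f"absolute gap: {gap:.2f}")   # absolute gap: 0.16
print(f"relative gap: {ratio:.2f}x")  # relative gap: 1.84x
```

Put that way, a system with these averages makes nearly twice as many errors for one group of speakers as for the other, which is exactly the kind of gap a routine audit should surface.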
Privacy is another major sticking point for ASR’s widespread adoption. To put it bluntly, surveillance is incompatible with democracy, and, as technologists, it’s our responsibility to innovate in a way that benefits our society. Plus, it’s just good business. The Wharton paper makes the case:
“The adoption of voice-enabled technology for common use-cases by consumers at home, in their vehicles, at work, at the store, or in any setting that offers convenience, is contingent on consumers trusting the privacy of their data…We note that going forward, as companies rely on first-party data relationships, earning consumer trust will become paramount. We foresee companies gaining competitive advantage by taking lead in engendering trust and incorporating Privacy by Design (Pbd) to ensure that personally identifiable information (PII) in systems, processes, and products is protected.”
Looking specifically at ASR, a paper titled Design Choices for X-vector Based Speaker Anonymization explains that “privacy protection methods for speech fall into four broad categories: deletion, encryption, distributed learning, and anonymization. The VoicePrivacy initiative specifically promotes the development of anonymization methods which aim to suppress personally identifiable information in speech while leaving other attributes such as linguistic content intact.”
Finally, there’s also a host of purely technical challenges to overcome.
“The reality is that speech recognition systems today still struggle on audios that humans are capable of transcribing accurately,” says Nischal Bhandari, Senior Speech Scientist at Rev.ai. “Complicating factors include overlapping speech, diversity in pronunciation, and the ever-changing nature of language.”
The Future of ASR: The Opportunities
If we can overcome these challenges, ASR will provide an incredible amount of opportunity. Many of these will come from ASR at the edge, which means we’ll run the end-to-end models on lower-powered computers that are closer to the data source rather than on high-powered computers in the cloud. There are a few key benefits: lower latency, more personalized—and therefore accurate—models, and better privacy protections since voice data doesn’t have to travel over a network.
Take, for instance, Apple’s Neural Engine, a custom chip that enables iPhones to handle certain ML tasks, such as voice recognition, at the edge. In an interview with Ars Technica, Apple’s AI chief, John Giannandrea, explains, “I understand this perception of bigger models in data centers somehow are more accurate, but it’s actually wrong…It’s better to run the model close to the data, rather than moving the data around…it’s also privacy preserving.”
Of course, Apple isn’t the only one running ML on the edge; with new hardware such as NVIDIA’s Jetson embedded computing modules hitting the market, developers can now run Rev’s ASR just about anywhere.
Another breakthrough on the horizon is improved affective computing – the art of breaking down speech patterns and other communications to detect the undercurrents of emotion and thought that run beneath the words. “When people speak, there is so much information in how things are said beyond what is said,” explains Rev’s Victor Yap.
This includes “intonations, pauses, speed of speaking, and word choice, which can convey emotions and secret meanings.” He points to the case of a woman calling 911 to “order a pizza,” when really she was a victim of domestic violence. The operator was able to understand her meaning because of how she spoke, not the words themselves. As we make more advances in ASR tech, we expect chatbots to take on more of these capabilities.
What’s Next for Automatic Speech Recognition Technology?
For all its complexities, challenges, and technicalities, automatic speech recognition technology is really just about one simple objective: helping computers listen to us.
Getting machines to listen is a big deal; it’s deceptively powerful, even if we only consider the present-day use cases and ignore the vast opportunities that it will one day bring.
At the same time, we must remember that with great power comes great responsibility. As technologists, we’re responsible for upholding our users’ privacy, for developing technologies without bias or prejudice, and for creating systems that benefit us all.
At Rev, we take these commitments as seriously as we do the quality of our ASR technology. Just as Rev leads the industry in speech recognition accuracy with a WER well below tech giants such as Google, Microsoft, and Amazon, we’re also leading the way in thinking critically about how these technologies will be applied to our daily lives.
Are you a developer who’s looking for an ASR speech-to-text API that’s fast, accurate, and easy to integrate? Are you a businessperson, lecturer, or podcaster looking for transcription services? Or do you simply need captions for your original content? Rev has you covered. Contact us today to learn more.