Exploring the Evolution of Speech Recognition with Rev’s Speech Scientists
Rev is built on a foundation of bright ideas and engineering breakthroughs. We’ve created the world’s most sophisticated speech-to-text A.I. to provide accurate automatic transcripts and bolster our human transcriptionists’ speed and accuracy. Since we’re fortunate to work alongside the brilliant minds developing this technology, we’re always looking for opportunities to get their expert perspectives on our biggest speech recognition questions.
The team is normally heads-down improving things like speaker diarization and reducing bias across gender, racial, and ethnic groups, so we’re thrilled they’ve carved out a few minutes to share their thoughts on the evolution of the tech, common misconceptions, and where the industry is headed.
How have you seen the technology or A.I. in the speech recognition space advance since you’ve been an engineer?
“In 20 years, the speech recognition world evolved from handling only short utterances from a limited vocabulary (one language at a time, with expert speech scientists needed to feed and tune the system) to today’s state, where continuous speech recognition is ubiquitous and multilingual systems are at the door. Said differently, ASR went from a very niche domain known and used only by large telecom corporations to something closer to a utility that anyone can have on their desk or phone!
Two things really changed: we now have much more computing power available for this task, and the amount of data available for training has also exploded. This conjunction allows us to rely a bit less on expert speech-science knowledge and more on the general tools the machine learning field brought to the industry. Interestingly, the efficiency of speech recognizers in terms of CPU requirements went down a bit: earlier algorithms were slightly less accurate, but they were much more economical in CPU and RAM usage. We currently see some effort in industry and academia to tackle that, because computing power needs are growing faster than what’s available. The future is bright!” — Jean-Philippe Robichaud
What’s the biggest misconception about speech recognition technology right now?
“That the problem is as simple as converting voice to a sequence of words. There are tons of other important features that make speech recognition technology usable. Punctuation, formatting, speaker diarization, and proper capitalization and spelling of proper nouns (e.g. Lyft vs lift) are all components of speech recognition output that we often take for granted but are critical to speech comprehension. Providing these features is also not trivial and each comes with its own challenges and edge cases.” — Quinn McNamara
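To make the point above concrete, here is a minimal, purely illustrative sketch of one such post-processing step: raw ASR output is typically lowercase, unpunctuated tokens, and a truecasing pass restores proper nouns like "Lyft" and sentence-initial capitals. This is not Rev's pipeline; the `PROPER_NOUNS` table and `truecase` function are hypothetical stand-ins for what is, in practice, usually a trained model.

```python
# Illustrative sketch only (not Rev's actual system): truecasing as one of
# the post-processing steps that make raw ASR output readable.

# Hypothetical lookup table for ambiguous proper nouns (e.g. Lyft vs lift).
PROPER_NOUNS = {"lyft": "Lyft", "rev": "Rev", "siri": "Siri"}

def truecase(tokens):
    """Capitalize known proper nouns and the first word of the utterance."""
    out = []
    for i, tok in enumerate(tokens):
        word = PROPER_NOUNS.get(tok, tok)
        if i == 0:
            word = word[0].upper() + word[1:]
        out.append(word)
    return out

raw = "i took a lyft to the rev office".split()
print(" ".join(truecase(raw)))  # → "I took a Lyft to the Rev office"
```

A real truecaser has to use context rather than a static table (the same token "lift" may or may not be the ride-share brand), which is part of why these "taken for granted" features are nontrivial.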
“That the problem is mostly solved. There are cases where speech recognition can perform really well right now, like short snippets of speech when someone is giving a command to their phone. The reality is that speech recognition technologies today still struggle on audio that humans are capable of transcribing accurately. Complicating factors include overlapping speech, diversity in pronunciation, and the ever-changing nature of language.” — Nishchal Bhandari
What excites you most about the future of speech recognition technology?
“I’m excited for sentiment analysis in the future of speech recognition technology. When people speak, there is so much information in how things are said beyond what is said. Intonation, pauses, speaking speed, and word choice can all convey emotions and hidden meanings. There was a story of a 911 operator who understood that a woman’s call ordering a large pizza was a secret message that domestic violence was going on.” — Victor Yap
“I’m really excited to see speech recognition become more available to people. Currently speech recognition is pretty hardware-intensive to run and requires a stable internet connection, but there are developments in edge computing to enable speech recognition on smaller devices. One exciting thing we’re working on right now is end-to-end speech recognition. That would enable us to quickly expand to more non-English languages due to the ease of training new models.” — Michelle Huang
“Human-level accuracy on ASR will enable all sorts of adjacent technologies. We’ve seen some of that already with Alexa and Siri. In the future, we will have highly accurate ASR and it will be everywhere. Wearable devices that allow those who are hard of hearing to see what people around them are saying. Immersive virtual reality where you can have a spoken conversation with computer-controlled characters. Help stations in hospitals and office buildings that tell you exactly where you need to go and how to get there. And of course, useful voice assistants connected to personal devices. It’s exciting to see the first drafts of products that ASR enables today and imagine what the future will look like.” — Joseph Palakapilly
Where are we currently in our journey toward less bias and fairer ASR models?
“As a research community we have a long way to go before we can say that ASR models are fair. ASR models are ultimately biased by the data used to train them; in general, these models are better at recognizing speech from younger speakers, women, and “typical” speech. That being said, there’s a lot of great work being done to address these issues, such as efforts to improve recognition for people with “atypical” speech, but all models still have a long way to go to achieve accuracy parity for all voices.
We’ve shared a blog post about this before, and since then we’ve taken massive strides to reduce the gap between different voices, focused primarily on accented speech. I hope that this year we are able to further close the gap not only on accented speech, but also on other “atypical” speech, such as that caused by speech disabilities.” — Miguel Del Rio
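The "gap" described above is typically quantified by computing word error rate (WER) per speaker group and comparing. The sketch below is a minimal illustration with made-up data; the group names and transcripts are hypothetical, and real evaluations use far larger labeled test sets.

```python
# Minimal sketch (hypothetical data) of measuring an ASR fairness gap:
# compute word error rate per speaker group, then compare the groups.

def wer(reference, hypothesis):
    """Word error rate: edit distance over words, divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(r)

# Hypothetical (reference, ASR hypothesis) pairs per speaker group.
groups = {
    "group_a": [("please call the doctor now", "please call the doctor now")],
    "group_b": [("please call the doctor now", "please cold the doctor")],
}
per_group = {g: sum(wer(ref, hyp) for ref, hyp in pairs) / len(pairs)
             for g, pairs in groups.items()}
gap = max(per_group.values()) - min(per_group.values())
```

Tracking this per-group gap over time, rather than only the aggregate WER, is what makes progress on fairness measurable at all.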
“We are still somewhere around the start of our journey. The first step is recognizing you have a problem; the second is working to fix it.
I’ve been heading and working on the accent effort since the beginning of Fall 2020. So I’ll focus mainly on what we’re doing to improve accents at Rev, compared to our past selves.
We began when investigations (inspired by papers like “Racial disparities in automated speech recognition” in PNAS) showed that we performed significantly worse on accented speech than on Standard American English. As a team, we investigated and tried out different approaches to reduce the biases in our models: training new language models and acoustic models, balancing data as much as we could, and modifying the pronunciations used in our lexicon. We’ve even started writing a survey paper covering all the research papers on accents in ASR we’ve read. And although we are going to test an acoustic model with slight improvements in production, we did not see much improvement overall in our accuracy on the different accents we focused on (British, Indian, and African American Vernacular English (AAVE)).
Because of this lack of progress, we are currently pausing the effort until we are in a better position to approach it. Part of this means waiting for our end-to-end migration to finish, which will give us a lot more flexibility in the types of models we use. Another part is finding other sources of data and gathering more: although we have a lot of data at Rev, we have much less labeled accent data. Accents are hard, and I believe this will be a long journey, but just because we are taking a break doesn’t mean we’ve given up.” — Arthur J. Hinsvark
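One of the approaches named above, modifying lexicon pronunciations, can be sketched very simply. The example below is illustrative only (not Rev's lexicon format): in a hybrid ASR system, adding accent-specific pronunciation variants lets the decoder match more spoken realizations of the same word. The phone strings are simplified ARPAbet.

```python
# Illustrative only: accent adaptation via lexicon pronunciation variants.
# Phone strings are simplified ARPAbet; the lexicon format is hypothetical.

lexicon = {"schedule": ["S K EH JH UW L"]}  # typical American pronunciation

def add_variant(lexicon, word, pron):
    """Add a pronunciation variant for a word if it's not already present."""
    prons = lexicon.setdefault(word, [])
    if pron not in prons:
        prons.append(pron)
    return lexicon

# Add a British-accented variant ("shed-yool") alongside the existing one.
add_variant(lexicon, "schedule", "SH EH D Y UW L")
```

The trade-off, which is part of why this alone didn't close the gap, is that every added variant also increases confusability between words in the decoder's search space.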
What’s an accomplishment or initiative at Rev that you’re most proud of?
“First and foremost, I am proud of the team we have built here. I am lucky to work with such talented and driven people. They work well, they are fast, and they have an admirable passion for solving hard problems.
I am also very proud of the ASR we have built as a team. Before I came to Rev, there was no speech recognition at the company. The team has built on top of my initial explorations, and together we have created what I believe to be the best English speech recognition models in the world. Having worked for over 15 years in speech recognition, I understand how difficult it is to build a good generic model, and I am constantly impressed by the versatility and high quality of our ASR models.” — Miguel Jette