
Speech Recognition Challenges and How to Solve Them



Cutting edge tech is always a challenge. That’s one of the reasons we love it. The breakthrough discovery, the moment when we figure out how to solve the riddle, is what every technologist lives for.

Just like any form of artificial intelligence (AI), automatic speech recognition (ASR) is exceedingly challenging to get right. These obstacles go well beyond coding the algorithms, processing data, and all the other technical challenges. Both ASR and AI in general also pose dilemmas for individual users and society at large.

As a tech company, we mostly focus on getting the technology to work and expanding its functionality. It’s also our duty, however, to lead the conversation about how we can make our products less challenging for the end user and better for all of us.

These are the top four challenges for voice recognition in 2021.

Accuracy

Let’s start with the technical. Simply put, if speech recognition isn’t accurate, then it isn’t any good. To measure this, we use word error rate (WER), which gives us a percentage value for the words that our speech recognition messed up or missed. Like golf, a lower score is better.
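To make the metric concrete, here is a minimal sketch of how WER is typically computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. This is an illustrative implementation, not Rev's production code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of about 0.17 (17%)
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the recognizer inserts more words than the reference contains, which is one reason it is usually reported alongside the test conditions.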

While of course WER varies among different products, one of the main challenges for today’s ASR is maintaining a low WER across many unique scenarios. Anyone can build a solution that works perfectly in the lab for a small number of phrases. When we start to add in the complexity that we call “the real world”, however, things can get tricky.

Background noise is one of the biggest challenges. Especially as voice recognition software leaves the confines of the personal computer to inhabit smart devices in varied environments, we need to deal with cross-talk, white noise, and other signal muddying effects.

Jargon is another common reason for inaccuracy. For industries like law, engineering, or medicine, we usually need to train our language model on recordings that come from that specific sector so that it can learn those words.

In addition to the words we say and where we say them, we also need to look at who’s doing the talking. Speech recognition for children is just beginning to catch up to the same level as for adults, while ASR systems still often struggle to understand anyone with a head cold or other illness that changes the timbre of their voice.

There’s also the problem of WER variance among speakers of different backgrounds. In general, non-white users are understood less accurately than their white counterparts, while those with regional or foreign accents face similar difficulties. If you want to learn more about how these biases arise and what we can do to fight them, check out our guide to responsible AI.

Even after accounting for all these complications, our challenges to accuracy still aren’t over. Other tough nuts to crack include automatic punctuation, correct formatting, and speaker diarization, which distinguishes speakers within a conversation.

There is work to be done in the code itself when it comes to tasks like preprocessing data to mitigate noise and identify different speakers. However, the main way that we look to solve these challenges is with more data and better data. By including more audio of foreign English speakers, for instance, our speech recognition software becomes better at understanding those speakers. Speaking of data…

Training Data

Machine learning (ML), the type of AI that we use for ASR, differs from traditional software in one key way: instead of telling the computer what to do, we feed it lots of data and it learns on its own. So, for instance, instead of coding the acoustic input that corresponds to the word “cat”, we train it on recordings that include that word and it gets better and better at recognizing it.
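A toy sketch can make the distinction vivid. Instead of hand-coding a rule for “cat”, we average labeled examples into a prototype per word and classify new input by proximity. Everything here is invented for illustration: the 2-D “acoustic feature” vectors are made up, and real ASR uses spectral features and far more sophisticated models.

```python
from statistics import mean

# Hypothetical training data: (feature_1, feature_2) vectors per word.
training = {
    "cat": [(1.0, 0.20), (1.1, 0.30), (0.9, 0.25)],
    "hat": [(0.2, 1.00), (0.3, 1.10), (0.25, 0.90)],
}

# "Training" here just averages each word's examples into one centroid.
centroids = {
    word: (mean(x for x, _ in pts), mean(y for _, y in pts))
    for word, pts in training.items()
}

def recognize(features):
    """Return the word whose centroid is closest to the input features."""
    def sq_dist(c):
        return (features[0] - c[0]) ** 2 + (features[1] - c[1]) ** 2
    return min(centroids, key=lambda w: sq_dist(centroids[w]))

print(recognize((1.05, 0.28)))  # near the "cat" examples
```

The point is that no one ever wrote a rule describing “cat”; the behavior emerged from the examples, which is why the quantity and quality of those examples matter so much.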

We encounter two main challenges in obtaining suitable training data: availability and cost. Training a speech model takes a lot of data, and while the free datasets we can find online, such as LibriSpeech and Bengali.Ai, are great launchpads, they’re far from sufficient. To give you some perspective, AI is so data-hungry that researchers now realize that China’s hacking efforts largely focus on “hoovering up” large quantities of data to train their AI systems.

So, unless your business model entails large-scale data collection like Google or you provide a transcription service like Rev, you’re probably going to need to buy your data—and that can add up quickly. It’s not uncommon to pay $10,000 per month for data, as most data brokers charge between $100 and $300 per hour for their training data.
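Running the arithmetic on those figures shows how little audio that budget actually buys. The rates come straight from the estimates above; the budget calculation itself is just a back-of-the-envelope sketch.

```python
# Back-of-the-envelope: hours of training audio a monthly budget buys
# at the broker rates cited above ($100-$300 per hour of data).
monthly_budget = 10_000  # dollars

for rate_per_hour in (100, 300):
    hours = monthly_budget / rate_per_hour
    print(f"At ${rate_per_hour}/hour, ${monthly_budget:,} buys ~{hours:.0f} hours of audio")
```

Given that competitive speech models are trained on thousands of hours of audio, 33 to 100 hours per month illustrates why data cost is a real barrier to entry.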

Implementation Costs and Time

Even if you’ve got the data, the challenges are far from over. Training and retraining ASR models take a lot of compute power for a long time. Even if we use powerful GPUs in the cloud, it can still take months and thousands of dollars to train a single language model. Deep learning models are notorious for requiring lots of computational power to train.

When we add on the opportunity cost of not being able to use the speech engine during that time, factor in the possibility that we’ll need to tune it and retrain it, and add that to the salaries for developers and data scientists, it’s not hard to see how bringing a speech recognition product to market is a massive business challenge.

It’s also not a one-and-done project. Language models, just like language itself, are ever-evolving, meaning that we’re going to need to keep working on and retraining the model. This adds another wrinkle to the business case for developing custom ASR.

Lastly, voice user interfaces present an additional challenge during the UX design phase. Unlike visual technologies, there’s no menu for users to look through and no buttons to press. They lack immediate clarity, yet a user-friendly product must still be intuitive to use and easy to learn. This aspect of implementation is often overlooked, but it remains a major challenge for the voice recognition field.

Data Control

From a larger perspective, this is the biggest challenge that we face with speech recognition and AI as a whole, both for its intractability and for what’s at stake. How do companies collect, use, store, and sell the data that they collect? Who owns it? What tools do we have, as individuals, to control data about us?

Unfortunately, especially in the United States, the combination of largely unregulated data markets, weak privacy laws, and the sheer size of the biggest tech companies involved in this trade makes it extremely difficult for us to maintain visibility into how our lived experience is rendered into data. That data, in turn, is used to create psychological profiles built to exploit and manipulate us in subtle ways through targeted advertising. It’s an incredibly profitable business model. Rather than charge for access to a search engine or a social media platform, tech companies invest in them for the same reason that a mining company invests in a new site: it’s about extracting raw materials. That mined data is then refined via AI into prediction models that they then use to sell their actual product: our behavior.

Narrowing in on voice recognition, it’s not hard to see how speech-activated devices like always-listening home assistants can be used as an apparatus to extend a company’s surveillance capabilities for even deeper data collection. When devices fade into ambience to the point where they’re no longer noticeable, when they’re so ubiquitous that they’re constantly capturing moments both small and large, then tech companies will have an opportunity to render our experience into data in ways that are unfathomable from within the confines of cyberspace.

The industry often talks about how to build trust with users, yet beneath the innuendo and marketing veneer remains something sinister. To truly be able to trust companies with our data, we need strong privacy laws and regulations for acceptable data use. We should, for instance, look to the EU’s General Data Protection Regulation (GDPR), which designates data control as a basic human right and provides concrete provisions for Europeans, such as accessing data about them and the right to be forgotten. Such regulations are far from the end of the road, but they are a valuable first step.

In the meantime, you’re probably wondering, “well, what can I do about this right now?” The truth is that it’s really, really hard to entirely avoid predatory data extraction—this author has tried. Still, there are some easy ways to minimize one’s digital footprint, such as using a privacy-enhancing extension on an open source web browser, rejecting location services and other permissions, and opting for alternative technologies like DuckDuckGo for search and ProtonMail for email.

The truth is that these measures remain insufficient, even for the privacy-obsessed and especially for the unaware. That’s why Rev urges other companies and individuals to join us in standing for responsible AI.  

Conclusion

Speech recognition technology may be facing big challenges, but that doesn’t mean we can’t overcome them. A little bit of history can give us some context here. When we invented the technology that spurred the industrial revolution, it brought a host of both technical and societal issues. Not only was the machinery itself more dangerous and less efficient than what we’ve developed since, but issues like overworked factory employees and child labor became major problems.

It took some time, but regulations emerged. Child labor laws, the 40-hour workweek, and worker safety rules were born. That doesn’t diminish the hardship of those who lived through it. In hindsight, we can now see how bringing light to these issues and advocating for workers’ rights was effective. We believe that we can achieve similar progress with AI and data control.

Taking a step back and looking specifically at the technical challenges of accuracy, training data, and implementation: these are the challenges that make up our day-to-day work at Rev. Our team of speech scientists and software engineers constantly iterates on our speech engine to improve its performance.

Starting from ground zero is tough. For companies that need access to a speech recognition platform, the question is always: do I build or do I buy? It’s all about whether the costs and the challenges outweigh the gains. In general, unless speech recognition is one of your core products or you need a highly customized solution, it’s almost always better to let the experts work on the challenges of voice recognition and benefit from their labor.

Ready to get started? Learn more today about Rev.ai, our speech to text API that’s easy to integrate into your next product.
