The Podcast Challenge: Testing Rev.ai’s Speech Recognition Accuracy
Here on the Rev Speech R&D Team, we are constantly striving to improve Rev.ai’s Automated Speech Recognition (ASR) accuracy.
As such, we spend a lot of our time creating test suites for the many different scenarios where our customers use our speech recognition technology. One of those use cases? Podcasters who want to produce transcripts of their shows. In order to assess how well our ASR works for their particular needs, we collected a few of the most popular podcasts and used them as a test to determine how accurately Rev.ai performs.
In this blog post, we first present the results we obtained on these tests, then discuss the steps required to produce the test suite. Finally, we examine the details of these particular podcasts to illustrate the difficulty of the test suite.
We hope this gives you some insight into how we think about our ASR system’s accuracy and showcases how we stack up against our competition.
How Accurate is Rev.ai’s Automated Speech Recognition?
First, some context. We, the Rev Speech R&D team, use a proprietary toolkit to calculate Word Error Rate (WER). Fundamentally, the software still calculates the same metric, but our methodology takes into account synonyms, typos, and number representations (e.g. “10” vs. “ten”). Therefore, our methodology lets us calculate the best possible WER for each provider, since no provider is penalized for an equivalent spelling or format. We hope to share this approach in a future article explaining the details of the technique.
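The toolkit itself is proprietary, so the snippet below is only a minimal sketch of the general idea, not the actual implementation: normalize both the reference and the ASR output before scoring, so that equivalent forms are not counted as errors. The lookup tables `NUMBER_WORDS` and `EQUIVALENTS` and the function `normalize` are hypothetical names with made-up entries.

```python
import re

# Hypothetical lookup tables, for illustration only.
NUMBER_WORDS = {"10": "ten", "2": "two", "3": "three"}
EQUIVALENTS = {"ok": "okay", "alright": "all right"}

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and map equivalent forms to one spelling."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    out = []
    for tok in tokens:
        tok = NUMBER_WORDS.get(tok, tok)   # "10" -> "ten"
        tok = EQUIVALENTS.get(tok, tok)    # "ok" -> "okay"
        out.extend(tok.split())            # a mapping may expand to several words
    return out
```

With these tables, "It took 10 minutes, OK?" and "it took ten minutes okay" normalize to the same word sequence, so neither transcript would be penalized for the difference.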
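Also, as a reminder to the reader, WER is calculated using the following formula:
WER = (S + D + I) / N
where S, D, and I are the numbers of substituted, deleted, and inserted words in the ASR output, and N is the number of words in the reference transcript.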
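As a rough illustration, WER can be computed as a word-level edit distance divided by the length of the reference. The sketch below uses the textbook Levenshtein recurrence. It is not the proprietary toolkit mentioned above, and the function name `word_error_rate` is our own choice for the example.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level edit distance divided by the number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)
```

Because insertions are counted, WER can exceed 1.0 when the ASR output is much longer than the reference.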
As a baseline, the graph below illustrates the accuracy results per speech recognition service, as of August 2020:
Building the Test Suite
Now, let’s take a look at how we built our test suite.
First, as an absolute golden rule, one needs to ensure that the data selected was not used to train the ASR system being tested. While randomly selecting podcasts to include in this test suite, our team made sure to choose only audio files that were not part of the training data for our ASR models, so as to prevent any unfair advantage.
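One simple way to guard against this kind of leakage is to compare content hashes of the candidate test files against hashes of the training audio. The sketch below is only an illustration of that idea: the function names and the dictionary layout are assumptions, and it says nothing about how our own data is actually tracked.

```python
import hashlib

def content_hash(audio_bytes: bytes) -> str:
    """SHA-256 digest of the raw file contents."""
    return hashlib.sha256(audio_bytes).hexdigest()

def find_leaked(test_files: dict[str, bytes], training_hashes: set[str]) -> list[str]:
    """Return the names of test files whose hash also appears in the training set."""
    return sorted(
        name for name, data in test_files.items()
        if content_hash(data) in training_hashes
    )
```

An exact hash only catches byte-identical files. A re-encoded copy of the same episode would slip through, so in practice a check on episode identifiers or titles would be needed as well.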
Secondly, the amount of audio needs to be large enough for the error rate to be significant and meaningful for any analysis. That’s why our team carefully chose 30 episodes that amount to 27.5 hours of speech. We consider this to be significant enough to assess the accuracy of our models.
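One common way to check whether a test suite is large enough is to put a confidence interval around the measured WER, for example by bootstrapping over files. The sketch below assumes per-file error and word counts are already available. The function name, its parameters, and the percentile method are illustrative choices, not a description of how the figures in this post were produced.

```python
import random

def bootstrap_wer_ci(per_file, n_resamples=2000, alpha=0.05, seed=0):
    """per_file: list of (errors, reference_words) tuples, one per audio file.

    Returns (low, high) percentile bounds on the corpus-level WER.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample whole files with replacement, then pool errors and words.
        sample = [rng.choice(per_file) for _ in per_file]
        errors = sum(e for e, _ in sample)
        words = sum(n for _, n in sample)
        estimates.append(errors / words)
    estimates.sort()
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```

A narrow interval suggests that adding more audio would not change the conclusion much, while a wide one suggests the suite is too small to separate two systems with similar WER.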
Finally, in order to test an ASR model properly, one should always consider as wide a range of acoustic conditions as possible, even within a given domain. This test suite covers a vast array of podcast genres, with many different speakers: storytelling with sound effects (Crimetown), group discussions with a lot of speaker overlap (The Read), and scripted news podcasts (The Daily).
In order to get accurate transcripts, we sent the 30 podcasts to the human-powered Rev.com service and chose the verbatim option (to include as many words and repetitions as possible). We also included a dictionary of important words to make sure proper names, like Kwame Kilpatrick, were properly transcribed.
Detailed Overview of the Test Suite
Let’s take a look at the kind of data included in this test suite.
List of Podcasts and Corresponding Episodes
Below is the list of podcast episodes included in this test suite.
- File 1: This American Life: Episode #661 🏷️ Society & Culture
- File 2: The Read: Rebel Without a Cause 🏷️ Comedy
- File 3: The Read: Spice or Sour Cream? 🏷️ Comedy
- File 4: The Daily: The Plan to Discredit the Florida Recount 🏷️ News
- File 5: The Daily: The California Wildfires 🏷️ News
- File 6: The Moth Radio Hour: Hope and Glory 🏷️ Art
- File 7: The Moth Radio Hour: Deer Meat Dance Moves and Motherhood 🏷️ Art
- File 8: Podcasts in Color, Women Creating Podcast Networks ft. Ahyiana of @SPQPodcast 🏷️ Society & Culture
- File 9: Podcasts in Color, Creating Your Own Lane in Podcasting ft @Favyfav of @latinoswholunch 🏷️ Society & Culture
- File 10: Podcasts in Color, Podcast Tips From Berry 🏷️ Society & Culture
- File 11: Heavyweight: Episode #9 Milt 🏷️ News
- File 12: Heavyweight: Episode #10 Rose 🏷️ News
- File 13: Crimetown, Coming Soon: Season 2 🏷️ True Crime
- File 14: Crimetown, Bonus Episode: Buddy Cianci…The Musical 🏷️ True Crime
- File 15: Crimetown, Chapter 18: The Prince of Providence 🏷️ True Crime
- File 16: Pod Save America, We won. 🏷️ News
- File 17: Pod Save America, The election is nigh! 🏷️ News
- File 18: The Daily Zeitgeist, The Christian Black Panther? 🏷️ News
- File 19: The New Yorker: The Writer’s Voice, Tommy Orange Reads The State 🏷️ Art
- File 20: Skidmarks Show, Episode 66 🏷️ Leisure
- File 21: Food Psych, Episode #148 🏷️ Health
- File 22: My Favorite Murder with Karen Kilgariff and Georgia Hardstark, Episode 145 🏷️ True Crime
- File 23: Sorta Awesome, Episode 169 🏷️ Society & Culture
- File 24: Drinkin’ Bros., Episode 338 🏷️ Society & Culture
- File 25: Roads From Emmaus, What We Own is Sacred Because We Are Sacred (Oct. 14 2018). 🏷️ Religion & Spirituality
- File 26: The Bill Barnwell Show, Vince Verhei & Doug Kyed 🏷️ Leisure
- File 27: The Ross Bolen Podcast, Lemurs Are Important Pollinators 🏷️ Society & Culture
- File 28: Forked Up: A Thug Kitchen Podcast, All Casper Everything with Natalie Eva Marie 🏷️ Health
- File 29: American Fiasco, Bonus Episode with Stephen Dubner of Freakonomics Radio 🏷️ Society & Culture
- File 30: Recovery Elevator, Episode 195: What Should the Bottle Say? 🏷️ Health
Podcasts come in many different genres. We selected a wide variety that covers most of the popular ones:
|Genre|Number of podcasts|
|---|---|
|Society & Culture|8|
|News|7|
|True Crime|4|
|Art|3|
|Health|3|
|Comedy|2|
|Leisure|2|
|Religion & Spirituality|1|
We chose enough podcasts to build a test suite of around 27.5 hours of audio. Most of the episodes are under 60 minutes long, but we also included some longer ones to test how our system behaves on longer files.
Figure 3: Distribution of the length of podcasts used in the test suite.
A key indicator of how difficult an audio file is for an ASR system is the number of speakers present. Again, we made sure to include a wide variety of speaker counts in the podcasts we chose. Some of these episodes have only two speakers for the whole file, and some have as many as 35 (e.g. True Crime podcasts with many characters). Any third-party ads in a given podcast were counted as new speakers as well.
Figure 4: Distribution of the number of speakers in the podcasts used in the test suite.
Another key indicator of how difficult an audio file is for an ASR system is the signal-to-noise ratio (SNR). Below we share the distribution of SNR levels for all podcasts in the test suite, shown as the average of the peak SNR measured per segment (in dB), where each segment is 1.92 seconds long.
About 6 podcasts have what we would consider slightly noisier acoustic environments (<30 dB), and the rest are very good, studio-quality recordings.
Figure 5: Distribution of the average peak SNR for all podcasts in the test suite.
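To make the idea of a per-segment peak SNR more concrete, here is a minimal sketch under stated assumptions. It cuts the audio into 1.92-second segments, splits each segment into short frames, and treats the loudest frame as the signal peak and the quietest frame as the noise floor. That energy-based estimate, the 20 ms frame size, and the function name are assumptions made for illustration. The post does not specify the estimator actually used.

```python
import math

def average_peak_snr_db(samples, sample_rate, segment_s=1.92, frame_s=0.02):
    """Mean over segments of 10*log10(max frame energy / min frame energy).

    Assumption: the quietest frame in a segment approximates the noise floor.
    """
    seg_len = round(segment_s * sample_rate)
    frame_len = round(frame_s * sample_rate)
    eps = 1e-12  # avoids log(0) and division by zero on digital silence
    snrs = []
    # Only full segments are scored; a trailing partial segment is ignored.
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        segment = samples[start:start + seg_len]
        energies = [
            sum(x * x for x in segment[i:i + frame_len]) / frame_len
            for i in range(0, seg_len - frame_len + 1, frame_len)
        ]
        snrs.append(10 * math.log10((max(energies) + eps) / (min(energies) + eps)))
    if not snrs:
        raise ValueError("audio is shorter than one segment")
    return sum(snrs) / len(snrs)
```

A segment with no quiet frame at all scores close to 0 dB under this estimate, so continuous speech or music can look noisier than it is. That is one reason a production estimator would likely be more elaborate.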
Detailed Overview of the Results
Below are the results organized by file. Against four competitors, Rev.ai has the best WER on 18 of the 30 files (60 percent). In second place came the Speechmatics v2 API, with 11 wins. Interestingly, Microsoft performed better than every other API on file 13.
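Counting wins like this is straightforward once per-file WER values are in hand: the provider with the lowest WER on a file takes that file. The sketch below shows one way to tally it. The function name, the nested-dictionary layout, and the tie-handling rule are assumptions for the example.

```python
from collections import Counter

def count_wins(wer_by_file):
    """wer_by_file: {file_id: {provider: wer}}.

    The lowest WER wins each file; ties are credited to every tied provider.
    """
    wins = Counter()
    for scores in wer_by_file.values():
        best = min(scores.values())
        wins.update(p for p, w in scores.items() if w == best)
    return wins
```

Dividing a provider's win count by the number of files gives its win share, which is how 18 wins out of 30 files becomes 60 percent.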
|File|Rev.ai|Speechmatics v2|Google Video|Microsoft|Amazon|