Today, we launched Rev’s Automated Transcription service, a new service line that allows users to easily convert audio and video to text through the use of sophisticated speech recognition algorithms.

We decided to test the Word Error Rate (WER) of our new service line alongside Google, Amazon, and Microsoft’s speech-to-text services.

The results from these tests on a collection of public podcasts reveal our new automatic speech recognition (ASR) service outperforms these major players.

Below, we walk through how we calculate WER and our methodology for this benchmarking test.

Podcasts are a popular and strong use case for transcription. Transcription helps creators during editing and helps publishers drive SEO. For both, the accuracy of the transcript makes a big difference. Podcasts are also representative of natural conversations with multiple speakers and cross-talk, which is why we chose to conduct these tests on podcast recordings.

Defining Word Error Rate

There are multiple tools for measuring the quality of an ASR service, e.g., sclite. We have developed a more robust internal testing methodology that takes into consideration synonyms, typos, and number representations (e.g., “10” as “ten”). Although we approach the problem in a slightly different way, our WER is still derived from the traditional Levenshtein distance used in the industry.

The formula is as follows: WER = (S + D + I) / N.

  • S is the number of substitutions
    • e.g., reference: “I went to the store” vs. hypothesis: “I went to the shore”
  • D is the number of deletions
    • e.g., reference: “I went to the store” vs. hypothesis: “I went to store”
  • I is the number of insertions
    • e.g., reference: “I went to the store” vs. hypothesis: “I went to the party store”
  • N is the number of words in the reference
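
As a rough illustration of the formula above, the following Python sketch computes S, D, I, and WER with a standard word-level Levenshtein alignment. It is not our internal tooling: it simply splits on whitespace and does nothing about synonyms, typos, or number formats. The names `WerResult` and `word_error_rate` are made up for this example.

```python
from typing import NamedTuple


class WerResult(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Align hypothesis to reference word by word and count S, D, and I."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Each cell holds (total_edits, S, D, I) for ref[:i] aligned with hyp[:j].
    # Row 0: an empty reference means every hypothesis word is an insertion.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Column 0: an empty hypothesis means every reference word is deleted.
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diagonal = prev[j - 1]  # words match, no new error
            else:
                t, s, d, ins = prev[j - 1]
                diagonal = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = prev[j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = curr[j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Keep the cheapest path; ties keep the first candidate listed.
            curr.append(min(diagonal, deletion, insertion, key=lambda c: c[0]))
        prev = curr

    _, s, d, ins = prev[-1]
    return WerResult(s, d, ins, len(ref))
```

Run on the three example pairs above, this reports one substitution, one deletion, and one insertion respectively, each against a five-word reference, so each pair has a WER of 0.2. When several alignments have the same total cost, the split between S, D, and I can differ from what another tool reports, but the WER itself does not change.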

Building the Test Suite

Our speech team randomly selected a collection of 20 podcast episodes from some of the most popular podcasts like “This American Life”, “The Daily”, “My Favorite Murder”, and “Pod Save America.” Those episodes amount to about 18 hours of test audio, which exhibits many different acoustic conditions. The podcasts selected represent a wide range of podcast genres with many different speakers: story-telling with sound effects (Crimetown), group discussions with a lot of speaker overlaps (The Read), and scripted news podcasts (The Daily).

Here are the steps we took to generate Rev.com, Google, Amazon, and Microsoft’s WER for each file:

  1. Create the reference transcript (we used our human-generated verbatim transcript from Rev.com).
  2. Run each audio file through Rev Automated Transcription, Google’s enhanced video model, Amazon Transcribe, and Microsoft Speech-to-Text to get the ASR transcripts.
  3. Compare each word of the ASR transcripts to the reference transcripts and calculate the Word Error Rate (WER).
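
The three steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the pipeline we ran: it assumes the reference and ASR transcripts are already available as plain strings, it averages per-file WER with an unweighted mean (an assumption of this sketch), and the service names, file names, and sentences are toy placeholders.

```python
from statistics import mean
from typing import Dict, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def benchmark(references: Dict[str, str],
              hypotheses: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Return {service: {file: WER}} for every ASR transcript provided."""
    return {
        service: {name: wer(references[name], text) for name, text in outputs.items()}
        for service, outputs in hypotheses.items()
    }


def average_wer(per_file: Dict[str, float]) -> float:
    """Unweighted mean of per-file WER (an assumption of this sketch)."""
    return mean(per_file.values())


# Toy data only; these are not transcripts or results from the benchmark.
references = {
    "episode_a": "i went to the store",
    "episode_b": "we bought ten apples",
}
hypotheses = {
    "service_x": {"episode_a": "i went to the shore", "episode_b": "we bought ten apples"},
    "service_y": {"episode_a": "i went to store", "episode_b": "we bought apples"},
}
scores = benchmark(references, hypotheses)
```

With the toy data, `average_wer(scores["service_x"])` is 0.1 and `average_wer(scores["service_y"])` is 0.225. Averaging per file treats a short episode the same as a long one; pooling all errors over all reference words is another reasonable choice and can give a different number.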

The results:

The graph below illustrates the average WER by service.

[Figure: Word Error Rate of Speech Recognition Engines: Rev, Google, Amazon, Microsoft]

The table below lists the WER for each file tested and includes each podcast WAV file and its respective transcripts for reference. You can also access all of this data in this Google Drive folder.

The Airtable above shows:

  • Rev.ai’s WER is the lowest in 15 out of the 20 podcasts.
  • Google video model’s WER is lowest in the remaining 5 podcasts.
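
If you download the per-file data, a tally like the one above can be reproduced with a few lines of Python. The sketch below assumes the table has been loaded into a nested dictionary of the form {file: {service: WER}}. The function name `count_lowest` and the numbers shown are placeholders for illustration, not values from our results.

```python
from collections import Counter
from typing import Dict


def count_lowest(wer_by_file: Dict[str, Dict[str, float]]) -> Counter:
    """Count, per service, the number of files on which it has the lowest WER."""
    wins: Counter = Counter()
    for scores in wer_by_file.values():
        # On an exact tie, min() keeps whichever service is listed first.
        best = min(scores, key=scores.get)
        wins[best] += 1
    return wins


# Placeholder values only; substitute the real per-file WER table.
toy_table = {
    "file_1": {"service_x": 0.10, "service_y": 0.12},
    "file_2": {"service_x": 0.20, "service_y": 0.15},
    "file_3": {"service_x": 0.05, "service_y": 0.09},
}
wins = count_lowest(toy_table)
```

For the placeholder table, `wins` records two files for `service_x` and one for `service_y`.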

Some considerations

  1. If you decide to re-download the podcasts from iTunes, note that the ads within the podcast may have changed, making transcripts mismatch. If you plan on using this test suite, be sure to use the audio files provided in the Google Drive (linked to above).
  2. WER is just one way to measure quality. Specifically, it only looks at the accuracy of the words. It does not take into account punctuation and speaker diarization (knowing who said what).
  3. WER weighs all errors equally, even though getting nouns and industry terminology right matters much more than getting “umm” and “ah” right. Adding custom vocabulary can dramatically improve the accuracy of important terms. This feature is coming soon for Rev Automated Transcription.
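
To get a feel for how much of a WER score comes from filler words, as the last consideration above suggests, one option is to score a transcript twice: once as is, and once with fillers removed from both sides. The Python sketch below is only an illustration. The `FILLERS` set is a hypothetical, incomplete list, and this is not a feature of any of the services tested.

```python
from typing import List

# Hypothetical filler list for illustration; extend it for real use.
FILLERS = {"um", "umm", "uh", "ah", "er"}


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def strip_fillers(words: List[str]) -> List[str]:
    """Drop tokens that are fillers once case and edge punctuation are ignored."""
    return [w for w in words if w.lower().strip(".,!?") not in FILLERS]


def wer(ref: List[str], hyp: List[str]) -> float:
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


reference = "umm I went to the store ah".split()
hypothesis = "I went to the store".split()

raw_wer = wer(reference, hypothesis)
content_wer = wer(strip_fillers(reference), strip_fillers(hypothesis))
```

Here `raw_wer` is 2/7 because the two dropped fillers count as deletions, while `content_wer` is 0. A large gap between the two numbers suggests the errors are concentrated in words that matter less to readers.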

The Power of Speaker Diarization

Not all audio or video files involve just one person narrating into a recorder. What’s more likely is that your files contain multiple speakers. Those speakers may sometimes cut each other off or talk over each other. They may even sound fairly similar.

One of the cool features of Rev Automated Transcription is speaker diarization. The speech engine recognizes the different speakers in the room and attributes text to each. Whether it’s two people having an interview or a panel of four speakers, you can see who said what and when they said it. This is particularly useful if you’re planning to quote the speakers later. Imagine attributing a statement to the incorrect person and, even worse, getting the crux of their message wrong because of a high WER.

Not all ASR services offer diarization, so keep that in mind if you often record multiple people talking at once. You’ll want to be able to quickly tell them apart.

Other Factors to Consider

WER can be an incredibly useful tool; however, it’s just one consideration when you’re choosing an ASR service.

A key thing to remember is that your WER will be inaccurate if you don’t normalize things like capitalization, punctuation, and number formats across your transcripts before comparing them. Formatting still matters to your readers, though. Rev Automated Transcription automatically transcribes spoken words into sentences and paragraphs. This is especially important if you are transcribing your audio files to increase accessibility. Transcripts formatted this way will be significantly easier for your audience to read.
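
As an example of the normalization mentioned above, the Python sketch below lowercases text, strips punctuation, and maps a few digits to words before the transcripts are compared. It is a simplified assumption of what normalization can look like, not a description of our internal rules. `NUMBER_WORDS` is a deliberately tiny, hypothetical mapping.

```python
import re
from typing import List

# Hypothetical, deliberately small mapping; real number handling is far broader.
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine", "10": "ten",
}


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and spell out small numbers."""
    text = text.lower()
    # Keep word characters, whitespace, and apostrophes (straight or curly).
    text = re.sub(r"[^\w\s'’]", " ", text)
    return [NUMBER_WORDS.get(token, token) for token in text.split()]
```

With this in place, `normalize("I bought 10 apples, OK?")` and `normalize("i bought ten apples ok")` produce the same token list, so differences in case, punctuation, or number format are not counted as word errors. Apply the same function to both the reference and the ASR transcript before computing WER.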

WER can also be influenced by a number of additional factors, such as background noise, speaker volume, and regional dialects. Think about the times you’ve recorded someone or heard an interview during an event. Were you able to find a quiet, secure room away from all the hubbub? Did the speaker have a clear, booming voice? Chances are, there were some extenuating circumstances that didn’t allow for the perfect environment – and that’s just a part of life.

Certain ASR services are unable to distinguish sounds in these situations. Others, like Rev Automated Transcription, are better at transcribing files in which the speaker is quieter or farther away from the recorder. Not everyone is going to have the lung capacity of Mick Jagger, and that’s fine. While we don’t require a minimum volume, other ASR services may. If you tend to interview quieter talkers or are in environments where you can’t make a lot of noise, be mindful of any requirements before making your selection.

Final Thoughts

We hope this helps you understand our process for calculating WER on the collection of public podcasts. If you have any questions or would like additional information on these results, please contact us.