Rev Improves Accuracy by Over 30% with Launch of New v2 ASR Model

At Rev, we believe we have the most accurate speech recognition service on the market. Today, we’re setting the bar even higher with the launch of our v2 ASR model, which improves accuracy by more than 30% (measured as a relative reduction in word error rate) compared to our existing model.

We’ve tested our v2 ASR model extensively and found that this increase in accuracy holds across a wide range of topics, industries, and accents. This improvement is the result of two years of technical research and the application of the latest deep learning techniques to our millions of hours of transcribed speech.

Technical Approach

Prior to v2, our model followed a so-called “hybrid approach,” which combines multiple separately trained components built on powerful foundational statistical models such as Hidden Markov Models and Gaussian Mixture Models. Although extremely flexible, this hybrid system was not robust to different pronunciations, different acoustic environments, or multi-speaker audio. It was also less capable of learning from large quantities of data.

Our v2 model improves on this by using a single neural network in an end-to-end (E2E) model. Under this approach, the system is trained as a single unit, ingesting audio directly and learning as it goes. This approach largely solves key problems in accuracy, training, pronunciation/accents and diarization.

At Rev, we have taken advantage of this new approach and combined it with our large database of accurate transcripts to train the model and achieve the significant improvements mentioned above.

Benchmarks

So that’s the theory… now for some data. The two most important metrics we track are Word Error Rate (WER) and Speaker Switch WER, which we define as the WER in the region around the point where a speaker switch occurs (a 5-word range around it).

Metric               V1 Model   V2 Model   Relative Gain
Overall WER          17.09%     11.63%     32%
Speaker Switch WER   30.17%     18.46%     39%

This shows that our new model yields a 32% reduction in errors overall, and performs much better around speaker switches. The latter is particularly important for real-life scenarios such as meetings, which often have multiple speakers talking out of turn or over each other.
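The "Relative Gain" column can be reproduced from the two WER columns. The snippet below is a minimal sketch that assumes relative gain means the fraction of v1 errors that v2 removes, (v1 - v2) / v1, rounded to the nearest percent. The function name relativeGain is ours, for illustration only.

```javascript
// Relative WER reduction: the share of v1 errors that v2 removes.
// Assumption: "Relative Gain" = (v1 - v2) / v1.
function relativeGain(v1Wer, v2Wer) {
  if (v1Wer <= 0) throw new RangeError('v1Wer must be positive');
  return (v1Wer - v2Wer) / v1Wer;
}

// WER figures (in percent) copied from the table above.
const overallGain = relativeGain(17.09, 11.63);
const speakerSwitchGain = relativeGain(30.17, 18.46);

console.log(`${Math.round(overallGain * 100)}%`);       // 32%
console.log(`${Math.round(speakerSwitchGain * 100)}%`); // 39%
```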

The table below goes a little deeper into the data and shows the WER and relative gain for each of the domains we cover.

Domain          V1 Model   V2 Model   Relative Gain
Overall WER     17.09%     11.63%     32%
Business        20.57%     13.19%     36%
Education       20.80%     14.22%     32%
Entertainment   16.54%     10.86%     34%
Health          18.30%     12.17%     33%
Law             23.58%     15.31%     35%
Politics        18.65%     13.41%     28%
Religion        16.62%     10.94%     34%
Science         13.71%     9.23%      33%
Sports          21.41%     14.40%     33%

These figures are based on our internal test suites.
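If you want to run a similar measurement on your own audio, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The sketch below follows that conventional definition. The text normalization in tokenize (lowercasing, stripping punctuation) is our assumption, and our internal test suites may normalize and score differently.

```javascript
// Assumed normalization: lowercase, drop punctuation, split on whitespace.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}'\s]/gu, ' ')
    .split(/\s+/)
    .filter(Boolean);
}

// WER = word-level Levenshtein distance / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new RangeError('reference must contain at least one word');
  }

  // prev[j] = edit distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution in four reference words.
console.log(wordErrorRate('the quick brown fox', 'the quick brown box')); // 0.25
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator is the reference length only.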

Get Started with v2 ASR

The v2 ASR model described above is our default production model for new users as of March 7, 2022, and you can start using it today. When no transcriber option is provided, or if the transcriber option is explicitly set to machine_v2, the audio file will be transcribed by the v2 ASR model.

Here’s an example of using the v2 model in an API call:

curl --location --request POST 'https://api.rev.ai/speechtotext/v1/jobs' \
--header 'Authorization: Bearer YOUR-ACCESS-TOKEN-HERE' \
--header 'Content-Type: application/json' \
--data-raw '{
  "media_url": "https://www.rev.ai/FTC_Sample_1.mp3"
}'

Existing users who have not yet been migrated to the v2 model as their default (see below for migration dates) should explicitly include the transcriber: machine_v2 parameter. Here’s an example:

curl --location --request POST 'https://api.rev.ai/speechtotext/v1/jobs' \
--header 'Authorization: Bearer YOUR-ACCESS-TOKEN-HERE' \
--header 'Content-Type: application/json' \
--data-raw '{
  "media_url": "https://www.rev.ai/FTC_Sample_1.mp3",
  "transcriber": "machine_v2"
}'
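If you prefer to call the REST endpoint directly from Node rather than through curl, the same request can be sketched with the built-in fetch() (Node 18 or later). The endpoint, headers, and body fields are taken from the curl examples above. The helper names buildJobRequest and submitJob are hypothetical, not part of any Rev SDK, and the error handling is only a placeholder.

```javascript
// Endpoint taken from the curl examples above.
const JOBS_URL = 'https://api.rev.ai/speechtotext/v1/jobs';

// Hypothetical helper: builds the fetch() arguments for a job submission.
// Leave `transcriber` undefined to rely on your account's default model.
function buildJobRequest(accessToken, mediaUrl, transcriber) {
  const body = { media_url: mediaUrl };
  if (transcriber !== undefined) body.transcriber = transcriber;
  return {
    url: JOBS_URL,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body), // serializes to strict JSON
    },
  };
}

// Hypothetical wrapper around fetch(); requires Node 18+ for global fetch.
async function submitJob(accessToken, mediaUrl, transcriber) {
  const { url, options } = buildJobRequest(accessToken, mediaUrl, transcriber);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Job submission failed: HTTP ${response.status}`);
  }
  return response.json();
}

const example = buildJobRequest(
  'YOUR-ACCESS-TOKEN-HERE',
  'https://www.rev.ai/FTC_Sample_1.mp3',
  'machine_v2'
);
console.log(example.options.body);
// {"media_url":"https://www.rev.ai/FTC_Sample_1.mp3","transcriber":"machine_v2"}
```

Building the body with JSON.stringify rather than by hand avoids malformed payloads, which the API would be expected to reject.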

This also applies to SDK operations, as shown below in this example for our Node SDK:

// ...
// initialize the client with your access token
const client = new RevAiApiClient(accessToken);

// set job options
const jobOptions = {
  transcriber: 'machine_v2' // optional; omit to use your account's default model
};

// submit a file by URL
const job = await client.submitJobUrl(mediaUrl, jobOptions);
// ...

For existing users, the v2 ASR model will automatically become the default on April 7, 2022 for pay-as-you-go (PAYG) users and on September 7, 2022 for enterprise users. Once your account defaults to the v2 ASR model, you no longer need to specify transcriber: machine_v2 in API and SDK operations. The v1 ASR model and the related user preference will be deprecated on September 8, 2022.
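During the migration window, client code may want to decide programmatically whether to send the transcriber option. The following is a minimal sketch based only on the dates above. The account type labels ('payg', 'enterprise'), the helper names, and the choice to treat each date as midnight UTC are all our assumptions for illustration.

```javascript
// Dates from the migration schedule above, treated as midnight UTC
// (an assumption; the schedule does not state a time zone).
const V2_DEFAULT_FROM = new Map([
  ['payg', Date.UTC(2022, 3, 7)],       // April 7, 2022
  ['enterprise', Date.UTC(2022, 8, 7)], // September 7, 2022
]);

// Hypothetical helper: true while an existing account still defaults to v1.
function needsExplicitV2(accountType, when = new Date()) {
  const cutoff = V2_DEFAULT_FROM.get(accountType);
  if (cutoff === undefined) {
    throw new Error(`Unknown account type: ${accountType}`);
  }
  return when.getTime() < cutoff;
}

// Hypothetical helper: job options to pass to the API or SDK.
function jobOptionsFor(accountType, when = new Date()) {
  return needsExplicitV2(accountType, when)
    ? { transcriber: 'machine_v2' }
    : {};
}

const june2022 = new Date(Date.UTC(2022, 5, 1));
console.log(jobOptionsFor('enterprise', june2022)); // { transcriber: 'machine_v2' }
console.log(jobOptionsFor('payg', june2022));       // {}
```

Sending transcriber: machine_v2 unconditionally is simpler and equally valid. A check like this is only useful if you want to stop sending the option once it becomes redundant.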

Learn more about our Asynchronous Speech-to-Text API and transcription options (including a summary of the v1 to v2 migration roadmap).

Additional Notes

A few important points to note:

  • For now, the v2 model supports only asynchronous mode and English-language input. Streaming support is coming soon and is currently in closed beta. Contact support@rev.ai if you would like to participate in the beta.
  • Transcription pricing for the v2 model is the same as under the previous model. For more information on pricing, please contact sales@rev.ai.
  • The confidence scores under the v2 model, although more accurate, might appear slightly lower than those calculated under the previous model. Any customer logic dependent on confidence score calculation will need to be adjusted accordingly.
  • The estimated turnaround time for v2 ASR is approximately 33% to 40% shorter than under the previous model.
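On the confidence score point above: one way to make dependent logic easier to adjust is to keep the cutoff in per-model configuration rather than hard-coding it. The sketch below is illustrative only. The threshold values are placeholders, not recommendations; the model labels 'v1' and 'v2' and the helper name are ours; and the input shape (objects with value and confidence fields) is an assumption about your own transcript handling, so adapt it to the data you actually receive.

```javascript
// Placeholder thresholds, NOT recommendations: tune them on your own data.
const REVIEW_THRESHOLDS = new Map([
  ['v1', 0.8],
  ['v2', 0.7],
]);

// Hypothetical helper: returns the words whose confidence falls below the
// threshold configured for the given model.
function wordsNeedingReview(elements, model) {
  const threshold = REVIEW_THRESHOLDS.get(model);
  if (threshold === undefined) {
    throw new Error(`No threshold configured for model: ${model}`);
  }
  return elements
    .filter((el) => typeof el.confidence === 'number' && el.confidence < threshold)
    .map((el) => el.value);
}

// Made-up sample data for illustration.
const sample = [
  { value: 'quarterly', confidence: 0.95 },
  { value: 'EBITDA', confidence: 0.74 },
  { value: 'guidance', confidence: 0.62 },
];

console.log(wordsNeedingReview(sample, 'v1')); // [ 'EBITDA', 'guidance' ]
console.log(wordsNeedingReview(sample, 'v2')); // [ 'guidance' ]
```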

We’re excited about our new v2 ASR model and would love to hear your feedback and learn more about how you are using it. Let us know by emailing us at support@rev.ai.
