The 3 Best Automatic Speech Recognition Engines & APIs

The three biggest players when it comes to ASR offerings are Rev AI, Google, and Amazon. Each of them offers an affordable speech recognition API that allows you to upload audio and video files and receive a text transcription in return.

 


Each of these services uses artificial intelligence and machine learning to convert speech to text starting from a raw audio source. However, there are some key differences among these offerings that might lead you to choose one over another. This article examines each of the services in detail so that you can make an informed decision about which API is right for you.

1. Rev AI

Language Support

Rev AI, much like Amazon, supports 31 of the most commonly used foreign languages. It also supports multiple variants of English (for example, British pronunciations and spellings).

Specialized Models

Rev AI does not have specialized domain models today. Instead, Rev has the most accurate general-purpose model, which leverages the massive amount of audio and text data it has accumulated over the years. In addition, the Rev AI model contains an extensive vocabulary that allows the engine to be accurate in most domains, such as medical, legal, and finance.

Custom Vocabularies

Allows you to specify custom vocabularies to aid in transcription. This is very useful if your audio comes from a specific context, such as the military, biology, or medicine, and you want to provide a list of uncommon words that the model can use as hints when creating the transcription. You can also provide names, places, and other terms that the model is less likely to recognize.

Service Uptime

Offers 99.9%+ guaranteed uptime.
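
To put the 99.9% figure in perspective, the short sketch below converts an uptime percentage into the maximum downtime it permits. The function name and the 30-day and 365-day periods are illustrative assumptions, not terms taken from any provider's service agreement.

```python
def max_downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Return the most downtime (in hours) that still meets the uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


# Assumed measurement periods, chosen only for illustration.
HOURS_PER_30_DAY_MONTH = 30 * 24
HOURS_PER_YEAR = 365 * 24

monthly_minutes = max_downtime_hours(99.9, HOURS_PER_30_DAY_MONTH) * 60
yearly_hours = max_downtime_hours(99.9, HOURS_PER_YEAR)

print(f"Per 30-day month: about {monthly_minutes:.1f} minutes of downtime")
print(f"Per year: about {yearly_hours:.2f} hours of downtime")
```

Under these assumptions, 99.9% uptime leaves room for roughly 43 minutes of downtime per month, or just under nine hours per year.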

Content Filtering

Allows you to request content filtering, for example removing profanity and other obscenities from the final transcript.

Punctuation

Rev AI infers and adds punctuation right out of the box, which means you spend less time manually post-processing a transcript.

Bias and Equity with Different Speakers

Rev AI has benchmarked its model against those of other leading ASR cloud providers. It found that its model consistently exhibited lower bias related to ethnicity and gender than other top providers.

Speaker Diarization

Speaker diarization refers to the ability of a model to match segments of audio to their corresponding speakers. Rev AI supports up to eight speakers in English audio or six in non-English audio.

Streaming and Asynchronous (Pre-Recorded) Capabilities

Rev AI offers both streaming and asynchronous transcription options. The streaming option is useful if you want to transcribe an event in real time, such as a live podcast.

Accuracy

Probably the most important data point to take into consideration when choosing an ASR provider is accuracy. Measuring machine learning model performance is somewhat of an inexact science. However, the gold standard metric to use for evaluating ASR systems is called the word error rate (WER). WER is defined according to the equation

WER = (substitutions + insertions + deletions) / number of words spoken

It is basically a way of quantifying all the different types of errors that can occur when transcribing audio and compressing them into a single score. A low word error rate indicates a transcript with high accuracy. Rev has been benchmarked against all major cloud ASR providers and has consistently scored the lowest word error rate. This blog post in particular shows an in-depth comparison of Rev against Google and Amazon on multiple audio files. Rev also did its own benchmarking tests here on a series of podcasts. In all instances, Rev scored better than its competitors when evaluated using both the WER and a novel text similarity metric. Rev also provides a free tool for aligning transcripts and calculating their WER so that you can assess its accuracy on your own data.
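
As a rough illustration of the WER equation above, the following minimal sketch counts substitutions, insertions, and deletions with a word-level edit distance and divides by the number of words in the reference transcript. It is not Rev's alignment tool. The function name is hypothetical, and the sketch assumes whitespace tokenization with no text normalization (casing, punctuation, and number formatting are compared as-is), which production evaluation tools typically handle more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + insertions + deletions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis takes i deletions
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion: reference word is missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return prev[-1] / len(ref_words)


reference = "the quick brown fox jumps"
hypothesis = "the quick brown box"  # one substitution (fox -> box), one deletion (jumps)
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # prints "WER: 0.40"
```

Because insertions count as errors but do not add to the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.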

How does Rev AI beat giants like Amazon and Google? Well, Rev has more data. Tons more. Rev has 60,000+ professional human transcribers who transcribe speech to text every day and train the ASR.

[Figure: Rev beats Google, Microsoft, and Amazon]

Price

Rev AI has a base rate of $0.035/min. However, Rev offers enterprise pricing for businesses that purchase transcription time in bulk. This rate is heavily discounted, coming in at $0.020/min, making Rev the most affordable option at scale. The rate drops further as your volume increases.

Rev is also committed to increasing access to its services and bolstering entrepreneurship. It offers a year of free usage and $5,000 in free credits to eligible startups.

SDKs

Rev offers SDKs for the key languages used in backend and web development, namely Java, Node.js, and Python. However, note that SDKs are not necessary for using any of these APIs in a given language. All the SDK does is bundle some preconfigured methods for calling into the API. These methods can also be created manually by the developer.

2. Google

Language Support

Google has the largest breadth of coverage with support for 125 languages.

Specialized Models

Some of the services offer specialized models that have been trained and tailored toward a particular use case. Google provides three domain-specific models in addition to its default. One is for transcription of phone calls, another is for transcription of short utterances, such as voice commands or voice search, and the third is for transcription of video files.

Custom Vocabularies

Allows you to specify custom vocabularies to aid in transcription. Google also offers a concept called a class which allows the model to pick up on specific types of speech, even when these types are somewhat vaguely defined. For example, it can be used to identify numbers in street addresses as house numbers rather than regular digits.

Service Uptime

Offers 99.9%+ guaranteed uptime.

Content Filtering

Allows you to request content filtering, for example removing profanity and other obscenities from the final transcript.

Punctuation

Punctuation is not enabled by default on Google transcriptions. However, this feature can be turned on. 

Bias and Equity with Different Speakers

Rev AI found that its model consistently exhibited lower bias related to ethnicity and gender than other top providers.

Speaker Diarization

Speaker diarization refers to the ability of a model to match segments of audio to their corresponding speakers. Google’s speaker diarization feature was in beta at the time this article was written.

Streaming and Asynchronous (Pre-Recorded) Capabilities

Google supports both asynchronous and streaming transcription, but it places a 10 MB limit on streaming requests. It also supports streaming only via the gRPC protocol.
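
To get a feel for what a 10 MB cap means in practice, the sketch below estimates how many seconds of uncompressed audio fit in a given byte budget. Everything beyond the 10 MB figure is an assumption made for illustration: 16-bit mono PCM sampled at 16 kHz, 1 MB counted as 1,000,000 bytes, and no protocol overhead. Compressed encodings would fit more audio, so check Google's current documentation for the limits that apply to your setup.

```python
def max_audio_seconds(limit_bytes: int, sample_rate_hz: int,
                      bytes_per_sample: int, channels: int) -> float:
    """Estimate how many seconds of raw PCM audio fit within limit_bytes."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    if bytes_per_second <= 0:
        raise ValueError("audio parameters must be positive")
    return limit_bytes / bytes_per_second


# Assumed values: 1 MB = 1,000,000 bytes; 16 kHz, 16-bit (2 bytes), mono audio.
LIMIT_BYTES = 10 * 1_000_000
seconds = max_audio_seconds(LIMIT_BYTES, sample_rate_hz=16_000,
                            bytes_per_sample=2, channels=1)
print(f"About {seconds:.1f} seconds ({seconds / 60:.1f} minutes) of audio")
```

With these assumed parameters, the budget covers a little over five minutes of audio, so longer live events would need to be split across multiple streaming requests.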

Accuracy

Rev AI is the most accurate solution, with Amazon and Google trailing behind, according to this third-party evaluation and this Rev.com evaluation.

[Figure: Rev beats Google, Amazon, and Microsoft in speech-to-text accuracy]

Price

Google has a base rate of $0.036/min. You can find their detailed and bulk pricing here.

SDKs

Google offers SDKs for the same languages as Rev AI (excepting Node.js), namely Java and Python, as well as JavaScript, .NET, Go, PHP, and Ruby.

3. Amazon

Language Support

Amazon supports the same number of languages as Rev AI, with support for 31 languages.

Specialized Models

Amazon provides a model that is trained for phone calls, in particular customer service and sales calls. It also offers a unique model called Transcribe Medical, which is specifically trained for use in a health care setting, for example, dictation of clinical notes.

Custom Vocabularies

Allows you to specify custom vocabularies to aid in transcription. Amazon has some nifty add-ons to the custom vocabulary capability, namely the ability to accept acronyms as well as the ability to specify pronunciations and “sounds like” data for custom vocab terms.

Service Uptime

Offers 99.9%+ guaranteed uptime.

Content Filtering

Allows you to request content filtering, for example removing profanity and other obscenities from the final transcript. Amazon Transcribe Medical also provides the ability to redact personally identifiable information (PII), which is especially useful in a health context where patient privacy is paramount.

Punctuation

Amazon infers and adds punctuation right out of the box, which means you spend less time manually post-processing a transcript.

Bias and Equity with Different Speakers

Rev AI found that its model consistently exhibited lower bias related to ethnicity and gender than other top providers.

Speaker Diarization

Speaker diarization refers to the ability of a model to match segments of audio to their corresponding speakers. Amazon states that the feature works best for differentiating between two and five different speakers.

Streaming and Asynchronous (Pre-Recorded) Capabilities

Amazon supports both options, with streaming transcription enabled via either HTTP/2 or WebSocket.

Accuracy

Rev AI is the most accurate solution, with Amazon and Google trailing behind, according to this third-party evaluation and this Rev.com evaluation.

Price

Amazon has the lowest base rate at $0.024 per audio minute.
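
The per-minute rates quoted in this article can be compared for a given volume with a few lines of arithmetic. The sketch below is only an estimate: the 10,000-minute volume is a made-up example, the dictionary and function names are hypothetical, and it ignores eligibility rules for Rev's enterprise rate as well as any bulk tiers, free tiers, or rounding rules the providers apply.

```python
# Rates in USD per audio minute, as quoted in this article.
RATES_USD_PER_MINUTE = {
    "Rev AI (base)": 0.035,
    "Rev AI (enterprise)": 0.020,
    "Google (base)": 0.036,
    "Amazon (base)": 0.024,
}


def transcription_cost(minutes: float, rate_per_minute: float) -> float:
    """Return the estimated cost in USD, rounded to the nearest cent."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * rate_per_minute, 2)


def rank_by_cost(minutes: float) -> list[tuple[str, float]]:
    """Return (plan, cost) pairs sorted from cheapest to most expensive."""
    costs = [(plan, transcription_cost(minutes, rate))
             for plan, rate in RATES_USD_PER_MINUTE.items()]
    return sorted(costs, key=lambda pair: pair[1])


# Hypothetical monthly volume, chosen only for illustration.
for plan, cost in rank_by_cost(10_000):
    print(f"{plan:<22} ${cost:,.2f}")
```

At this assumed volume, Amazon is the cheapest of the three base rates, while Rev's enterprise rate comes in lower still, which matches the pricing claims made in each section.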

SDKs

Amazon offers SDKs for the same languages as Rev AI (excepting Node.js), namely Java and Python, as well as JavaScript, .NET, Go, PHP, and Ruby.

Summary

Each service offers its own unique combination of features and strengths, and which one you elect to use should come down to your needs. If you want the widest language support, including less commonly supported languages such as Zulu, then Google is the way to go.

If you want an inexpensive service with models for niche use cases such as phone calls and medical dictation, then Amazon could be the best bet. However, if you’re looking for the best accuracy, broad language support, robust and easy-to-read documentation, fantastic customer support, and the best pricing at scale, then Rev AI is likely the service for you.