Microsoft Azure Speech Recognition vs. Rev AI Speech to Text API
Whether you have a great idea for the next Internet of Things (IoT) device, you want to add live-captioning to your media streaming service, or you’re creating a hands-free voice user interface for a mobile application, you’re going to need an automatic speech recognition (ASR) solution that’s up for the job. There’s nothing that turns users off more than the frustration of trying to talk to a device that just can’t understand them, no matter how hard they try.
If you’re picking between Azure Cognitive Services and Rev.ai for your project, you want to compare these solutions along several metrics. Depending on your unique needs and uses, some will be more important than others. No matter what, you want to get a holistic picture of each technology and how they stack up against each other.
Accuracy
Winner: Rev AI
By far the most important point of comparison is accuracy. After all, if the ASR engine messes up too many words, using it will be difficult at best and impossible at worst. The gold standard for accuracy benchmarking is word error rate (WER), which counts the words the ASR tech deletes, inserts, or substitutes and expresses that count as a percentage of the words in the correct transcript. A 20% WER, for instance, means that it got roughly one word in five wrong. Accordingly, a lower WER is better.
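To make the metric concrete, here is a minimal sketch of a WER calculation using a standard word-level edit distance. The function name `word_error_rate` and the simple lowercase-and-split tokenization are assumptions for illustration; real benchmarks, including the ones discussed in this article, typically apply more careful text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words gives a WER of 0.2 (20%).
print(word_error_rate("the quick brown fox jumps", "the quick brown dog jumps"))
```

Because insertions count as errors but do not add to the reference length, WER can exceed 100% on very poor transcripts.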
In our podcast transcription benchmarks, we compared Rev AI to Microsoft’s ASR on 30 podcasts and found that Rev’s WER, 14.22%, is about 2.3 percentage points lower than Microsoft’s, which came in at 16.51%. Rev AI outperforms others because our network of over 60,000 human transcriptionists contributes data that we use to constantly improve our models.
Speaker ID and Diarization
Winner: Rev AI
Identifying who is talking and when is a key feature for high-performance ASR systems. Microsoft says its tech supports diarization, but it does not state how many speakers it can handle. Rev AI, on the other hand, promises support for 8 English speakers or 6 non-English speakers.
Both solutions can identify speakers equally well.
Language Support
Winner: Microsoft Azure
If you want to serve an international customer base or if you’re building anything involving translation, supporting multiple languages is essential. Rev works in 31 different languages, including diverse options such as German, French, Spanish, Russian, Japanese, Chinese, Korean, Arabic, and Turkish.
Azure goes further, with support for 44 total languages for speech-to-text use cases, though that number drops to 30 languages for translating speech in real time. However, note that they charge substantially more for their translation service than for standard transcription, over twice as much.
Speed
How fast can the ASR turn words into text? We can break this question down into two general categories: synchronous and asynchronous uses. The former includes real-time speech-to-text applications like providing live captions for streaming media. Rev sees an average latency between 1 and 3 milliseconds, while Azure doesn’t specify.
Asynchronous ASR, on the other hand, deals with tasks that don’t happen in real time, such as generating transcripts from a recording. Rev AI uses batch transcription to break the recording into multiple chunks so that we can process them in parallel to achieve faster results. In practice, we see the following benchmark turnaround times:
5 minute file = ~154 seconds
30 minute file = ~9.5 minutes
300 minute / 5 hour file = ~7 minutes
Unfortunately, Azure Cognitive Services does not list turnaround times publicly.
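One way to compare turnaround figures across file lengths is to express each as a multiple of real time, that is, seconds of audio processed per second of waiting. The sketch below does this for the Rev AI figures quoted above; the names `BENCHMARKS` and `speed_factor` are illustrative only and not part of any API.

```python
# Turnaround figures quoted in this article: (audio seconds, processing seconds).
BENCHMARKS = {
    "5 minute file": (5 * 60, 154),
    "30 minute file": (30 * 60, 9.5 * 60),
    "5 hour file": (300 * 60, 7 * 60),
}


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of audio processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


for label, (audio, processing) in BENCHMARKS.items():
    print(f"{label}: {speed_factor(audio, processing):.1f}x real time")
```

A factor above 1 means the transcript is ready in less time than it would take to listen to the recording.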
Ease of Use
Winner: It Depends on Your Tech Stack
How quickly can we go from signing up for a service to making the first API call in production? Rev consistently hears from our customers that they’re able to get a proof of concept up and running within hours, a big difference from the days or even weeks that it takes with services like Azure Cognitive Services.
In general, Azure’s products are going to be far easier to use for those who are already on Azure’s infrastructure and who already leverage the Azure ecosystem. If you’re already there, then that’s the single best reason to sign up for Azure Cognitive Services ASR. Otherwise, it’s usually better to go with a more platform-agnostic product like Rev AI.
Pricing
Winner: Rev AI
Of course, price will always be one of the most crucial considerations in picking the right option. Both Azure and Rev use a pay-per-use pricing model where you only pay for what you use. Rev Enterprise AI starts at $1.20 per hour of audio (or $0.02 per minute) for speech-to-text services and includes full flexibility options like the ability to add up to 6,000 custom vocabulary words. Your price with Rev AI goes down from $0.02 per minute as you add more volume.
A similar offering from Azure, their custom speech to text, costs $1.40 per audio hour, which can add up significantly. However, their standard, non-customizable option is a bit cheaper at $1 per audio hour. They also charge a variety of other fees, such as charging each time they identify a different speaker and charging $2.50 per hour for translation.
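To see how these hourly rates translate into a bill, here is a rough sketch that multiplies each list price quoted above by a given number of audio hours. The names `HOURLY_RATES_USD` and `estimate_costs` are hypothetical, and the estimate deliberately ignores Rev AI volume discounts and Azure's additional per-feature fees, so treat it as a starting point rather than a quote.

```python
from decimal import Decimal

# List prices per audio hour as quoted in this article (USD).
HOURLY_RATES_USD = {
    "Rev AI": Decimal("1.20"),
    "Azure standard": Decimal("1.00"),
    "Azure custom": Decimal("1.40"),
    "Azure translation": Decimal("2.50"),
}


def estimate_costs(audio_hours) -> dict:
    """Return the estimated cost in USD for each offering at list price."""
    hours = Decimal(str(audio_hours))
    if hours < 0:
        raise ValueError("audio_hours must not be negative")
    # Decimal arithmetic avoids floating-point rounding surprises with money.
    return {
        name: (rate * hours).quantize(Decimal("0.01"))
        for name, rate in HOURLY_RATES_USD.items()
    }


for name, cost in estimate_costs(500).items():
    print(f"{name}: ${cost}")
```

For 500 hours of audio, this puts Rev AI at $600.00 against $700.00 for Azure's custom option, before any volume pricing.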
Overall, both Azure and Rev offer great ASR solutions, and each has its strengths and weaknesses. Looking specifically at Azure, their speech-to-text offerings integrate well with their other Azure services, so if you are already in the Azure ecosystem, it’s definitely worth considering.
However, Rev demonstrates a higher accuracy rate, is cheaper on balance, and offers other great features. While Microsoft customers have to sign up for an Azure account for a free trial of Cognitive Services, Rev offers a free trial with no strings attached. We invite you to try Rev AI for yourself and experience the difference today.