What Makes Rev’s v2 Best in Class?
A deep dive into v2: End-to-End speech recognition at Rev
The R&D team has spent the last two years researching the best approach to leveraging the large quantity of data processed by Rev. As it turns out, the latest deep learning techniques are a perfect fit. In this technical overview, we will go over the past approach (called the v1 model in the rest of this blog) and compare it with the new approach, which we are calling v2. We’ve also had the pleasure of presenting our architecture in a recent MIT Lecture on Deep Learning and Automatic Speech Recognition.
As a world leader in the content transcription space, Rev is in a unique position to leverage a massive corpus of highly accurate transcripts to create an incredibly accurate ASR model. On top of offering our model as an API, we also leverage our model internally to help our freelancers efficiently produce 99% accurate transcripts for our customers. With this virtuous cycle, we can release new models and get quick feedback on how they perform, while collecting more data and corrections to improve the models further.
Before we embark on a technical overview of the two approaches and a thorough comparison of their performance, let us take a very quick look at the overall results. The table below shows the WER (Word Error Rate) and relative gains for the domains we most commonly see in our system.
The table clearly shows that the v2 model improves accuracy by over 30% relative to the v1 model for almost all domains. We will explain this 1,000-hour test suite in more detail below.
[Table: WER and relative gains by domain, including Business & Financial and Health & Medical]
Let’s examine the differences between the two approaches and what makes this new v2 model so good overall.
The v1 model: Hybrid approach
For the past 15 years, the prevailing technique for building commercial ASR systems has been referred to as a hybrid approach, in which neural networks are used to estimate parts of the Acoustic Model (originally proposed in 1994 and further improved in 2012). The approach was made possible, and popularized, by the development of the Kaldi toolkit, which has been used by almost all commercially available ASR systems (with certain large exceptions of course).
Those systems were very powerful because they combined extensive linguistic knowledge with well-understood fundamental statistical models like Hidden Markov Models and Gaussian Mixture Models. On top of that, they could easily be expressed as mathematical expressions, which are more formally understood. For example, if we define an ASR system as a combination of models and processes that returns the most probable transcription (Ŵ) given an audio signal (A), then the problem of ASR can be expressed as:

Ŵ = argmax_W P(W) P(A|W)

In the expression above, the main components are: P(W), the language model (LM); P(A|W), the acoustic model (AM); and argmax_W, the decoding algorithm. One can further deconstruct the acoustic model by assuming that the acoustic observations (A) are in fact a string of sound observations (i.e. phones) for each word; in other words, a third model, the lexicon, is present as part of the acoustic model.
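Concretely, the lexicon factors the acoustic model through an intermediate phone sequence Q; this is the standard hybrid-ASR decomposition, written here in the same notation as above:

```latex
P(A \mid W) = \sum_{Q} P(A \mid Q)\, P(Q \mid W)
```

Here P(Q|W), the lexicon (pronunciation model), gives the phone sequences allowed for a word sequence, and P(A|Q) maps phones to acoustics.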
This deconstruction into smaller separate models has proven to be very useful in the commercialization of ASR systems, as it makes the system more explainable and flexible. In part, the explainability comes from the mathematical foundation of the model and the fact that it relies on a well-defined set of pronunciations. For example: is “Miguel” pronounced “Mig el” (M IH G EH L) or “Mig Well” (M IH G W EH L)?
On top of this, the hybrid systems are more flexible since one could experiment with the different modules more easily and train them independently. This flexibility gave rise to deep learning approaches for training AMs, and enabled researchers to assess all sorts of different LM approaches. It made ASR systems more compartmentalized, which was both good and bad.
The downsides of the v1 approach
First and foremost, the use of a lexicon as a gating mechanism for the ASR system is a really nice way to inject linguistic knowledge into the system, but it is also a hindrance, as one has to manually add words and their pronunciations. This can also be done programmatically using g2p (grapheme-to-phoneme) techniques, but then g2p becomes yet another system that needs to be properly trained and maintained.
Another major characteristic of Rev’s v1 system is that it was trained as a speaker-adapted acoustic model. During experimentation, our research team found that our hybrid model performed better if we added a layer of adaptation via i-vectors as speaker representations. Unfortunately, that required us to run a process called speaker diarization before recognition, since the model performed better under the assumption that all audio frames in a decoding pass belong to the same speaker. This also made the system slightly less performant in streaming scenarios, since the model couldn’t rely on a diarization first pass.
These characteristics meant that our hybrid system, while very flexible, was not as robust to different pronunciations, different acoustic environments, or multi-speaker audio, and was less capable of learning from large quantities of data than the more recent techniques referred to as end-to-end deep learning approaches.
The v2 model: The End-to-End Approach
At a very high level, the goal of the end-to-end approaches is to completely replace all components of traditional systems with a single neural network, most likely one that is very deep and wide and that presumably can learn everything from the data directly. These novel approaches are at the heart of the most impressive recent advances in Computer Vision (e.g. DALL·E 2) and Natural Language Processing (e.g. OpenAI’s GPT-3).
For ASR, in its purest form, these end-to-end systems even ingest audio directly and learn the representation that best suits the task at hand, which means there is no longer a concept of pronunciations, and one no longer needs to understand knowledge-based features like Mel Frequency Cepstral Coefficients (MFCCs). Again, the idea is that we let the training regimen decide what’s important in the audio and what’s not. Additionally, the removal of pronunciations has made it easier to add words to the system, and has contributed to the improved performance of our v2 model on accented speech, both regional accents/dialects and non-native speech, since the model now learns all of this from the data it is trained on.
Conformers, CTC, and Attention
At Rev, we are using a novel two-pass approach that unifies streaming and asynchronous end-to-end (E2E) speech recognition in a single model, built around a Conformer encoder:
- The first component is based on the new Conformer Encoder architecture from Gulati et al., 2020. Internally, we use a configuration that is close to their Large Conformer, which has 219.3 million parameters, 18 encoder layers with dimensions of 2,048, and 8 attention heads.
- The first decoding pass, which generates several alternative hypotheses, uses a CTC-based decoder with a language model (CTC was introduced by Alex Graves et al. in 2006). This step is about 10 times faster than an attention-based mechanism, but slightly less accurate.
- The second pass, called rescoring, uses an attention-based model to score the hypotheses generated by the first pass and pick the best one. Attention rescoring is almost as accurate as attention decoding but is fast enough for production.
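The two-pass flow above can be sketched as follows. The scoring functions here are toy stand-ins for the real CTC and attention models, and all names and scores are illustrative, not Rev’s actual implementation:

```python
def first_pass_nbest(candidates, ctc_score, n=4):
    """Fast pass: keep the n hypotheses with the highest CTC + LM scores."""
    return sorted(candidates, key=ctc_score, reverse=True)[:n]

def attention_rescore(nbest, attention_score):
    """Slower, more accurate pass: rescore the shortlist and pick the best."""
    return max(nbest, key=attention_score)

# Toy usage: pretend scores for three candidate transcripts.
ctc = {"i scream": 0.6, "ice cream": 0.55, "eyes cream": 0.2}
attn = {"i scream": 0.3, "ice cream": 0.9, "eyes cream": 0.1}

nbest = first_pass_nbest(list(ctc), ctc.get, n=2)  # ["i scream", "ice cream"]
best = attention_rescore(nbest, attn.get)          # "ice cream"
```

The design point is the division of labor: the cheap first pass prunes the search space, so the expensive attention model only ever scores a short list.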
Our unified approach means that we have trained the model to perform well under both scenarios: asynchronous and streaming. In async mode, the model sees the whole audio segment when transcribing it. In streaming mode, the model only sees the audio that has already been transcribed plus the audio currently being transcribed; it can’t see any future audio. This is all controllable at training and inference time, making the solution extremely robust in both situations.
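One simple way to picture this control is an attention mask that either exposes the whole utterance (async) or only the current and past frames (streaming). This is a simplification of the chunk-based masking used in practice, with illustrative names:

```python
def attention_mask(n_frames, streaming):
    """mask[i][j] == 1 iff frame i may attend to frame j.
    Async mode sees everything; streaming mode is causal (no future frames)."""
    return [[1 if (not streaming or j <= i) else 0 for j in range(n_frames)]
            for i in range(n_frames)]
```

Training with both mask regimes is what lets a single set of weights serve both deployment modes.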
Interestingly, our approach continues to use a hand-crafted feature extraction step called filterbank extraction. It is essentially the same technique as MFCC extraction, but without the discrete cosine transform (DCT) de-correlation step and without the deltas used with MFCCs, since the convolution layers are very good at figuring out that information on their own.
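As a rough sketch of what filterbank extraction involves (a generic textbook construction, not Rev’s exact front end; the parameter values are illustrative): triangular filters spaced evenly on the mel scale are applied to a power spectrum, and the log of the result gives log-mel filterbank features. MFCCs would additionally apply a DCT to these.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=80, n_fft=512, sample_rate=16000):
    """Matrix of triangular mel filters over FFT bins.
    log(power_spectrum @ fbank.T) yields log-mel filterbank features."""
    n_bins = n_fft // 2 + 1
    # Filter edges: evenly spaced in mel, converted back to FFT bin indices.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(left, center):          # rising edge of the triangle
            fbank[i, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):         # falling edge of the triangle
            fbank[i, b] = (right - b) / max(right - center, 1)
    return fbank
```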
End-to-end systems are trained using subword tokens, which are loosely analogous to phonemes but can be learned automatically from the training data. This makes these systems more easily scalable to other languages. Because Rev has the advantage of a large quantity of data, we have experimented with a very large subword vocabulary: a large number of units, where the units themselves can also be long. This speeds up transcription quite a bit, but is only possible because of our very large training dataset. In our case, we found that our data enabled us to train models with up to 10,000 subword tokens, with tokens up to 8 characters long. This means that a word like TRANSCRIPTION might be represented by a set of tokens like [‘▁trans’, ‘cription’]. In contrast, in academia, where data is more limited, systems use fewer and shorter tokens, which leads to more complicated tokenizations like [‘▁tra’, ‘n’, ‘scrip’, ‘tion’].
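To make the example concrete, here is a toy greedy longest-match segmenter. Real systems typically learn the vocabulary and segmentation with BPE or a unigram model (e.g. SentencePiece) rather than segmenting greedily, so treat this purely as an illustration of how vocabulary size changes the tokenization:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation into subword tokens.
    '▁' marks a word boundary, as in SentencePiece-style vocabularies."""
    pieces, text = [], "▁" + word.lower()
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                pieces.append(text[:end])
                text = text[end:]
                break
        else:
            return None  # unsegmentable with this vocabulary
    return pieces

# Illustrative vocabularies: a large one with long units vs. a small one.
large_vocab = {"▁trans", "cription"}
small_vocab = {"▁tra", "n", "scrip", "tion"}

subword_tokenize("TRANSCRIPTION", large_vocab)  # ['▁trans', 'cription']
subword_tokenize("TRANSCRIPTION", small_vocab)  # ['▁tra', 'n', 'scrip', 'tion']
```

Fewer, longer tokens per word mean fewer decoding steps, which is where the transcription speedup mentioned above comes from.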
Due to the unconstrained nature of decoding with subword units, it is possible for the ASR engine to invent new words, which is sometimes unsettling. To remedy this problem, we constrain the decoding to a large but fixed lexicon using a Weighted Finite-State Transducer (WFST).
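A heavily simplified stand-in for that constraint: reject any hypothesis containing a word outside the fixed lexicon. (The real WFST applies this constraint inside the search itself rather than as a post-filter, which is far more efficient; this sketch only conveys the effect.)

```python
def lexicon_constrained(hypotheses, lexicon):
    """Keep only hypotheses in which every word is in the fixed lexicon."""
    return [h for h in hypotheses if all(w in lexicon for w in h.split())]

# Toy usage: "cribtion" is an invented word the constraint filters out.
lexicon = {"the", "transcription", "is", "ready"}
lexicon_constrained(["the transcription is ready",
                     "the trans cribtion is ready"], lexicon)
```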
Finally, these new models have many advantages for production systems. The new model is slightly faster than the v1 model, which means that we can either process incoming data faster or increase the size of our architecture to gain more accuracy. The new architecture is also much faster to train, making it easier to research different ideas and iteratively improve the model. Last, but certainly not least, the new v2 model is better at learning from large quantities of data, which means it has been exposed to more data and is thus more accurate for a larger demographic and more acoustic environments, as shown below.
Like any other ASR research team, we continually evaluate our models on a multitude of small-to-medium-sized test suites. Each of our test suites consists of about 30 hours of audio and typically has at least 100 different audio files. Some of these internal test suites are customer-specific, and some use externally available data that can be used to benchmark the competition.
It is important to note that all results here are from our cloud, asynchronous v2 model. We plan on publishing a similar post on our v2 streaming model.
These external test suites are available for other research groups to download and test on their own models. We will continue to publish external benchmarks in order to make comparative analysis easier and help research in speech recognition progress with “real world” data.
(Using reference segmentation, done by an external party.)
Large Internal Benchmarks
Periodically, the research team performs a major analysis of our system’s performance using a significant portion of the media files coming through the Rev marketplace. Using more than 1,000 hours of perfectly transcribed audio files as an evaluation set (i.e. data that was not used in training our algorithms) gives us some unique insights into how our different models perform in the real world. For obvious reasons, we can’t share any of this data, but we find it useful to share the results externally as a way to confirm the trend we are seeing with our smaller test suites and to show the performance of our ASR under different scenarios, especially given the size of the sample used here.
The two most important metrics we track are “Word Error Rate” (WER) and “Speaker Switch WER”, which we define as the WER in the region around a speaker switch (a five-word window around it).
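For reference, WER is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below also shows one way the window around speaker switches could be identified; our actual scoring pipeline is more involved, and the function names are illustrative:

```python
def wer(ref, hyp):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row over hypothesis positions
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution/match
            prev = cur
    return d[len(h)] / len(r)

def switch_window_indices(speakers, k=5):
    """Word indices within k words of a speaker switch in the reference."""
    switches = [i for i in range(1, len(speakers))
                if speakers[i] != speakers[i - 1]]
    return sorted({j for s in switches
                   for j in range(max(0, s - k), min(len(speakers), s + k))})
```

Restricting scoring to those indices yields a Speaker Switch WER in the spirit of the definition above.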
This means that our new model yields a 32% reduction in errors overall, and performs a lot better around speaker switches (39% relative improvement!). The latter is very important since real-life scenarios involve many speakers, often speaking out of turn or slightly over each other, and an ASR that performs well there can be critical to downstream applications.
[Table: overall WER and Speaker Switch WER for the v1 and v2 models]
As part of our Rev.com transcription service, we also have a subjective process where we label accents of the different speakers in files. We acknowledge that this process can be slightly flawed, but given the amount of data in this sample, we view these results as fairly stable and representative of reality. Clearly, our system is better at accents for which we have a lot of data (American and British), but has also improved for accents and dialects across the board.
We have been working for a while on improving our systems for everyone, and we will continue to make this a focus for the team in the future.
Number of Speakers Results
It is interesting to look at what happens to the accuracy of our model given the number of speakers that are present in the audio. For example, a single speaker dictating a message is much simpler to recognize than a conversation between three or more speakers in a podcast or interview setting. The table below shows that our system has improved a lot for all cases, but performs better in simpler scenarios.
Audio Sampling Rate Results
Because our model is trained to handle a wide range of conditions, it is also interesting to track accuracy on audio files that are sent to us at 8 kHz vs. 16 kHz (and up), as a measure of how well our generic model performs in each condition. Most of our audio files are sent to us at 16 kHz or more, so 8 kHz audio is underrepresented in our training data, which makes improving accuracy at 8 kHz a modeling challenge. This is another area of focus for our team, as we are always trying to make our model excel for all use cases.
[Table: WER for 8 kHz vs. 16 kHz (and up) audio]
Another characteristic that is important for many applications is the amount of noise present in the audio files. For this, we use a power-based, Voice Activity Detection (VAD)-driven Signal-to-Noise Ratio (SNR) approximation for each file and compare the results of both systems.
Here, we report WER at four different levels of noisiness, similar to those defined in this article but with a slightly adjusted definition (see the table below).
[Table: WER by noise level]
- Quiet (15 dB+)
- Somewhat noisy (10 to 15 dB)
- Noisy (5 to 10 dB)
- Very noisy (< 5 dB)
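A minimal sketch of this bucketing, assuming speech-frame and noise-frame powers have already been separated by a VAD (the function names and power inputs are illustrative, not our production pipeline):

```python
import math

def snr_db(speech_power, noise_power):
    """Approximate SNR in dB from mean speech-frame and noise-frame power."""
    return 10.0 * math.log10(speech_power / noise_power)

def noise_bucket(snr):
    """Map an SNR estimate (in dB) to the noise levels listed above."""
    if snr >= 15.0:
        return "Quiet"
    if snr >= 10.0:
        return "Somewhat noisy"
    if snr >= 5.0:
        return "Noisy"
    return "Very noisy"
```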
Our team is always working hard to improve the accuracy of our models, and we will certainly continue to innovate to bring more accuracy gains to all of our customers. As a moonshot goal, this year we are aiming to improve the accuracy of these models once again, by a relative margin larger than 15%.
This report and all the amazing results of our v2 model would not have been possible without the countless hours the team spent improving the data processing, the training toolkit (and regimen), the testing infrastructure, and the models themselves. I’ll take this opportunity to thank everyone on the team, who continue to amaze me with their work ethic and knowledge.