Rev at Interspeech 2020: Speaker Diarization Q&A
Interspeech 2020 recently invited Rev’s Director of Speech R&D, Miguel Jette, for an upcoming tutorial discussing how Speaker Diarization is used in the industry. The tutorial session is titled “Neural Models for Speaker Diarization in the context of Speech Recognition”. In this quick Q&A session with two of the organizers, Kyu and Tae, Miguel chatted about how diarization is used at Rev and the challenges we face using diarization in the industry.
Rev Focused Questions
Rev.com’s speech API boasts quality transcription services that customers need. What industry sectors are your major target areas? How do your customers leverage your company’s products for their business?
At Rev, the major industry sectors we work with are education, tech, and media. Within those very broad sectors, the types of applications vary widely. For some customers, the final transcript itself is the business need (meeting transcripts, podcasts, and captions). Others use the transcripts to extract more value (market/user research) or to build a product on top (language training or personnel training).
One more thing that sets us apart from other speech APIs is that we use our APIs in our own internal products, like auto-tc on rev.com and our Rev Live Captions for Zoom. Using our own API lets us get feedback from our pool of Revvers and drives us to improve the product in meaningful ways.
What are the most notable features differentiating Rev.com’s technology from competitors in the market?
Accuracy and ease of use.
We don’t overcomplicate things. We don’t force our users to use a specific cloud provider (S3, Google Cloud, etc.) to send us their audio; we only need a link. Simplicity is key for rev.ai. Our API is easy to use, and very accurate.
How many different domain types are processed in the platform? How is speaker diarization used?
I’ve never counted, but there are dozens of high-level domains (the major ones are listed in the first question above). Just as with our ASR output, the diarization result is often used as part of the output itself (i.e. meeting/podcast transcripts), and sometimes as part of a downstream application where it is important to know who is speaking specific words (e.g. interview analysis or organizing unstructured data).
Does Rev.com’s speech API offer streaming ASR? How can speaker diarization be considered in that context? In the form of online diarization? If yes, is there a latency or compute cost constraint with regard to speaker diarization? If not, is there any plan to develop an online speaker diarization system?
There are scenarios in which offline diarization after the stream has ended is acceptable (e.g. meeting transcripts), but online diarization is definitely ideal.
Streaming ASR definitely adds a set of constraints for diarization. And, as we attempt to improve diarization accuracy by adding more features, like ASR output, the latency concerns increase. Because we love simplicity at Rev, we obviously want a fast and snappy product, and we want to be able to offer it at a good price. Having said that, we strongly believe that online diarization is the next frontier, and so we are indeed currently exploring it seriously.
What do you think is the biggest pain point of speaker diarization in practice? What effort does your team make to build a good production version of a speaker diarization system?
Currently, I would say the biggest pain point of Speaker Diarization is choosing the right metrics to optimize for. From our experience, the current metrics don’t perfectly correlate with customer satisfaction.
Another pain point is that we want to offer a simple solution that doesn’t require user input. That might mean finding a clustering threshold for the speakers that works well across all scenarios, or building a smarter algorithm that can adapt to a given scenario.
Lastly, we have found it quite hard to meet customer expectations. What I mean here is that speaker diarization is such a simple problem for the human ear that customers have a hard time understanding why it is difficult for a machine to do a good job. At this point, customers understand that ASR is a difficult problem (in certain scenarios, anyway), so they are more tolerant of mistakes. That is not yet the case with speaker diarization.
Traditionally, speaker diarization has been considered a pre-processing step for speech recognition, but in practice it can cut words, and thus cause many deletion errors in WER, because the uniform segmentation used to obtain a quality speaker representation does not respect word boundaries. How do you handle this problem in your production systems?
One way we handle this problem is through our choice of VAD (either no VAD, or a conservative one, to reduce the number of small segments) and the segmentation we present to the ASR. For the ASR output, we use an aggressive smoothing algorithm, and for the diarization output, we post-process with a conservative VAD solution. So, basically, we treat speaker diarization as two separate problems: ASR-facing and customer-facing.
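To make the smoothing idea concrete, here is a minimal, illustrative sketch of one common smoothing step: absorbing very short speaker turns into the preceding segment and fusing adjacent same-speaker segments. The function name and the threshold are assumptions for illustration, not Rev’s internal implementation.

```python
def smooth_segments(segments, min_dur=0.5):
    """Merge speaker segments shorter than `min_dur` seconds into the
    previous segment, then fuse contiguous segments with the same speaker.

    `segments` is a list of (start, end, speaker) tuples, sorted by start.
    """
    smoothed = []
    for start, end, speaker in segments:
        if smoothed and (end - start) < min_dur:
            # Absorb the short turn into the preceding segment.
            prev_start, _, prev_speaker = smoothed[-1]
            smoothed[-1] = (prev_start, end, prev_speaker)
        elif smoothed and smoothed[-1][2] == speaker and smoothed[-1][1] >= start:
            # Fuse contiguous segments from the same speaker.
            prev_start, prev_end, _ = smoothed[-1]
            smoothed[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            smoothed.append((start, end, speaker))
    return smoothed

# A 0.2 s turn from speaker B is absorbed, leaving one segment for A.
print(smooth_segments([(0.0, 4.0, "A"), (4.0, 4.2, "B"), (4.2, 9.0, "A")]))
# → [(0.0, 9.0, 'A')]
```

A production system would be more careful (e.g. considering word boundaries before merging), but this captures the basic trade-off: fewer spurious speaker switches at the cost of occasionally erasing a real short turn.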
Recently, a few metrics other than DER are of interest in the community to evaluate diarization systems from different viewpoints, such as word-level DER (WDER). Would you share your own perspective on this trend?
We are big fans of WDER and what we internally call DER1.
- As a diarization metric, WDER also removes the need for accurate time boundaries in a test set. In practice, however, we have found that WDER correlates strongly with DER1 and with misclassification errors in the time domain.
- We believe WDER best reflects the performance of our combined ASR and diarization systems. This ties back into the idea of “editability” for Revvers. WDER provides a rate across words, whereas DER provides a rate across time, and the former is a lot easier to reason about in terms of how far the output is from the correct transcript.
- The major downside of WDER is that it must be used in combination with ASR WER in order to account for deletions and insertions.
- Because we don’t care about false alarms at all, we have actually decided not to include them in our main metric. Internally, we call it DER1.
- DER1 = (MissedDetection + SpeakerConfusion) / TotalSpeechTime
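The formula above translates directly into code. This sketch assumes the missed-detection and speaker-confusion durations come from a standard DER scorer; the function and argument names are illustrative, not Rev’s internal code.

```python
def der1(missed_detection: float, speaker_confusion: float,
         total_speech_time: float) -> float:
    """Compute DER1: like DER, but ignoring false alarms.

    All arguments are durations in seconds. `missed_detection` and
    `speaker_confusion` are the corresponding error components reported
    by a conventional DER scorer.
    """
    return (missed_detection + speaker_confusion) / total_speech_time

# Example: 12 s missed, 30 s attributed to the wrong speaker,
# out of 600 s of reference speech.
print(der1(12.0, 30.0, 600.0))  # → 0.07
```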
General Diarization Questions
It is well known that the most representative application domains for speaker diarization are broadcast media, meetings and telephone conversations, historically. What else do you think can be considered as impactful application domains for speaker diarization technology in the present? And, what could be potential application domains in the future?
Those domains cover most of the applications I can think of today. At Rev, we process all sorts of audio without knowing the domain ahead of time, and are therefore faced with a problem I would call “speaker diarization in the wild”. The scenarios range from single-speaker podcast recordings, to multi-speaker interviews, to long-form audio/video recordings. In the latter case, background noise, long silences, and spontaneous speech can make diarization very challenging.
One large domain for speaker diarization is contact centers. In this domain, there are two primary use cases – the agent side and the client side. Separating who is speaking can help companies perform quality assurance on the agent side, promoting better customer service. On the client side, it helps the company understand why clients are calling – e.g. billing, product questions, or technical problems. This knowledge can help the company improve customer interactions with better information, reduce wait times, and lower costs by deflecting calls. There are other areas in which ASR contributes to society that we don’t often think about. Some examples include driving change in criminal justice reform by analyzing hearing outcomes (e.g. separating judges from defendants) to look for biases in results, or web platforms for contentious parents who have to co-parent (e.g. cases in which who said what can be used in court).
As for the future, I would say that the main problem with speaker diarization right now is that it is not accurate enough “in the wild”. Once we break a certain accuracy barrier for the more generic case, the scenarios are going to unfold. Think back to when speech recognition meant understanding “yes please” or “no thank you” properly in an IVR. The use cases for speech were very limited then. I think we are at that point in time for diarization right now.
One of these future applications is in the booming streaming sector. For example, folks host Twitch streams where many people are talking while they play a video game. Finding structure in this sort of unstructured audio data (either live, or offline after the fact) is one place diarization could be used if it were more accurate.
What do you think is the most challenging problem yet to be addressed in the field of speaker diarization? Either from a research perspective or in practice, or both.
First and foremost, I would say that the obvious next frontier in diarization is online/streaming diarization (along with good recipes for training such models in a supervised way). Humans are just so good at diarization and we do it live, in the moment, all the time. That’s the biggest challenge for research and the industry.
There are many other challenges that are important (and probably all related):
- Interpretability: The behaviour of speaker diarization models is difficult to explain and interpret. Diarization systems should be understandable (i.e. why did it recognize only one speaker when there are clearly two?).
- This is definitely related to generic accuracy problems, but could also be related to metrics. Humans judge “simple” diarization errors more harshly because we are so good at the task ourselves. It is possible that better metrics could be conceived that would better represent the quality of a diarization system.
- Overlapping speech is an interesting topic causing issues in real applications.
- Spontaneous interruptions or very short speaker switches/turns: very short switches are typically very hard to detect for clustering-based approaches.
- Diarization metrics: DER and its variants rely on accurate timestamps in the references, and even with forced alignment this puts a noticeable strain on their reliability.
- Learning the right characteristics: On the one hand, the model needs to be able to avoid overfitting to channel/acoustic variations, and, on the other hand, it needs to be able to tolerate intra-speaker changes (e.g. high emotions, etc).
Traditionally acoustic features have been the de-facto input features for speaker diarization systems. Also, supra-segmental features like prosody information were considered to supplement the acoustic feature based systems. As discussed in this tutorial, ASR outputs can be helpful for speaker diarization, as well. Which other features could be taken into consideration?
Because the accuracy of ASR systems has improved so much in the last few years, I agree that using the ASR output (and features derived from it) is the most important next step for speaker diarization systems. This could mean including word boundary information, and possibly analyzing the type of text to understand whether it is a dialog, etc. On top of that, one could consider:
- Pause duration
- Punctuation and POS tagging
Other types of information that seem interesting are:
- For video sources, a dual approach would be very interesting (audio and video features)
- Integrating the idea of a known speaker database (if you process similar audio over time)