Nine Must-Read Research Papers from Rev’s AI Experts

Last year, the Speech Team here at Rev began a new tradition: every two weeks, we agreed to read an academic paper and then discuss our thoughts and takeaways over lunch. Our discussions served a few different purposes — to stay up-to-date on the latest developments in speech recognition technology, to spark friendly debate among colleagues, and to help us stay in touch during an otherwise strange and unsettling year.

We read a lot of super interesting articles during these “Speech Lunch” sessions, and we thought we’d share a few of the highlights as we start a new year full of new challenges and new possibilities. Here are the team’s favorite picks. We hope you’ll find these as interesting as we did!

Miguel Jette, Director of Speech Research & Development

Racial disparities in automated speech recognition (2020) – Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel

Published in early 2020, this paper was extremely relevant, arriving at a time when everyone was reflecting on racial disparities in the United States and around the globe. In partnership with a current customer, we embarked on an analysis of bias in our own ASR. The paper spurred an internal effort to train a fairer ASR model and to be candid and open about where and how we can improve.

Miguel Del Rio, Speech Engineer

It’s Hard For Neural Networks to Learn the Game of Life (2020) – Jacob M. Springer and Garrett T. Kenyon

One of my favorite papers from our lunch sessions was “It’s Hard For Neural Networks to Learn the Game of Life.” Using the toy problem of Conway’s Game of Life, it illustrates the idea of the lottery ticket hypothesis by showing that overparameterized models are more likely to converge to a solution. While the whole paper is excellent, I especially appreciated the analysis of d-density games and the implication that the right dataset is also important for increasing the likelihood of model convergence.
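
To appreciate how striking the result is, it helps to see how small the target function actually is. Below is a minimal sketch of one Game of Life update step written as a 3×3 convolution; this is my own illustration of the rule the networks are asked to learn, not the paper’s training setup.

```python
import numpy as np
from scipy.signal import convolve2d

def life_step(board):
    """One Game of Life update: count each cell's eight neighbors with a
    3x3 convolution, then apply the standard birth/survival rule."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbors = convolve2d(board, kernel, mode="same", boundary="fill")
    born = (board == 0) & (neighbors == 3)
    survive = (board == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survive).astype(int)

# A glider on an 8x8 board, stepped forward twice.
board = np.zeros((8, 8), dtype=int)
board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
print(life_step(life_step(board)))
```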

Nishchal Bhandari, Senior Speech Engineer

Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly (2020) – Nora Kassner & Hinrich Schütze

I really enjoyed this paper. As machine learning (ML) models are applied to an ever wider variety of use cases, interpretability becomes critical. Kassner and Schütze propose using negation and mispriming to evaluate the factual knowledge stored in pre-trained language models, and demonstrate that current state-of-the-art language models are still easily fooled by these probing tasks.
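
As a rough illustration of the negation probe (my own sketch using an off-the-shelf masked language model, not the authors’ exact prompts or models), you can compare a model’s top predictions for a factual cloze statement and its negated counterpart:

```python
from transformers import pipeline

# Compare fill-in-the-blank predictions for a statement and its negation.
# If the two lists look nearly identical, the model is largely ignoring "cannot".
fill = pipeline("fill-mask", model="bert-base-uncased")

for prompt in ["Birds can [MASK].", "Birds cannot [MASK]."]:
    predictions = fill(prompt, top_k=3)
    print(prompt, "->", [p["token_str"] for p in predictions])
```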

Ryan Westerman, Speech Engineer

Class LM and Word Mapping for Contextual Biasing in End-to-End ASR (2020) – Rongqing Huang, Ossama Abdel-hamid, Xinwei Li & Gunnar Evermann

As end-to-end networks for Automated Speech Recognition (ASR) continue to improve, the industry needs to re-solve problems that hybrid models had already addressed. This paper does a great job of explaining an approach to training end-to-end models that incorporates user-specific named entities (similar to what we call Custom Vocabulary).
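
For readers unfamiliar with contextual biasing, the toy sketch below shows the general idea as a simple n-best rescoring pass. It is a generic shallow-fusion-style illustration of my own, not the class LM and word-mapping method the paper proposes, and the phrase list and boost value are made up.

```python
# Hypothetical user-supplied terms and a made-up log-score bonus per match.
CUSTOM_VOCAB = {"rev", "kaldi", "wav2vec"}
BOOST = 2.0

def rescore(nbest):
    """nbest: list of (text, log_score) pairs. Boost hypotheses that contain
    custom-vocabulary terms, then re-sort by the adjusted score."""
    rescored = [
        (text, score + BOOST * sum(w in CUSTOM_VOCAB for w in text.lower().split()))
        for text, score in nbest
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

print(rescore([("the rough model", -4.2), ("the rev model", -4.5)]))
# The hypothesis containing "rev" now outranks the acoustically preferred one.
```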

Natalie Delworth, Speech Engineer

Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates (2010) – Sharon Goldwater, Dan Jurafsky & Christopher D. Manning

I was really fascinated by this study on individual word error rates (IWER). The authors analyzed how error rate is affected at the word level by prosodic, lexical, contextual, and disfluency features. The finding that interested me the most was that individual speaker differences played a large role in determining error rates even after accounting for all of the other word-level features in their study. (There were 44 degrees of freedom in their statistical model, so their features covered a lot!) Since this paper was published in 2010, I wonder what an updated study on a model from 2020 would look like (perhaps on something quite different from the authors’ ASR system, like an end-to-end model) and whether it would yield the same strong result about IWER across speaker differences.
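
For anyone new to the metric, here is a small sketch of the first step in any per-word error analysis: aligning the hypothesis to the reference with edit distance and tagging each reference word as correct, substituted, or deleted. This is my own illustration rather than the authors’ exact IWER computation, which also handles insertions.

```python
def align_labels(ref, hyp):
    """Return one label per reference word: 'cor', 'sub', or 'del'."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    # Backtrace to label each reference word.
    labels, i, j = [], n, m
    while i > 0:
        cost = 0 if j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            labels.append("cor" if cost == 0 else "sub")
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            labels.append("del")
            i -= 1
        else:
            j -= 1  # insertion in the hypothesis; no reference word to label
    return list(reversed(labels))

print(align_labels("the cat sat on the mat".split(),
                   "the cat sit on mat".split()))
# ['cor', 'cor', 'sub', 'cor', 'del', 'cor']
```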

Joseph Palakapilly, Speech Engineer

Single Headed Attention RNN: Stop Thinking With Your Head (2019) – Stephen Merity

This is by far the most entertaining paper I’ve read in recent memory — maybe ever. I was hooked after I read the very witty abstract. (Example: “This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author’s small studio apartment far too warm in the midst of a San Franciscan summer.”) The author’s purpose is to show that the research community might have too quickly dismissed long short-term memory (LSTM) networks for language modeling. He does this well, presenting impressive results with a variation on an “outdated” model and, even more impressively, reaching them without excessive compute resources. “Take that Sesame Street.”

Jennifer Drexler, Senior Research Scientist

Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard (2020) – Zoltán Tüske, George Saon, Kartik Audhkhasi & Brian Kingsbury

The abundance of architectures and training regimen permutations makes modern ML research and engineering quite a headache. This paper offers a simple architecture while presenting a practitioner’s guide to the regularization and optimization techniques common in seq2seq ASR. The ablation results table is something I’ll frequently reference for understanding the relative contribution of each technique.

Arthur Hinsvark, Speech Engineer

Design Choices for X-vector Based Speaker Anonymization (2020) – Brij Mohan Lal Srivastava, Natalia Tomashenko, Xin Wang, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet & Marc Tommasi

With speech recognition already a part of many people’s daily lives, protecting users’ privacy is important. The method detailed in this paper randomizes the speaker’s identity while preserving the semantic content of the speech. I like the approaches the authors present and think this is a good first step toward anonymization.

Quinn McNamara, Senior Speech Engineer

High Performance Natural Language Processing (EMNLP 2020 tutorial) – Gabriel Ilharco, Cesar Ilharco, Iulia Turc, Tim Dettmers, Felipe Ferreira & Kenton Lee

This isn’t really a paper, but I have a soft spot for anything that explains state-of-the-art concepts clearly, and this thorough (albeit lengthy) tutorial and slide deck do exactly that. They provide a stellar overview of recent advances in efficient Natural Language Processing (NLP) models and in ML systems more generally. In our view, maintaining reasonable training and inference runtime is paramount as NLP trends toward deeper and slower architectures.
