The Importance of Punctuation in Speech Recognition Technology
Before you read any further, take a look at the last chapter of Ulysses by James Joyce. If you didn’t know already, now you really see the difference that punctuation—or a complete lack of it—can make.
Transcripts or captions without punctuation have an even worse case against them. Far from holding the literary accolades of stream-of-consciousness works like Joyce’s, these documents are just plain unreadable. This is largely due to the way our brains process the written word; rather than go word-by-word, we read in chunks. We anticipate the end of a sentence before we reach it because we notice the period ahead of time. Attempting to “imagine” the punctuation as we read is seriously disorienting.
On top of giving us cadence, punctuation provides clarity. It resolves ambiguities. A common example is the distinction between these two phrases: “Let’s eat, grandma!” and “Let’s eat grandma!”. Another great example is this Dear John letter, which gives different punctuation to the same words to create two very different letters.
The problem for Automatic Speech Recognition (ASR), however, is that we don’t explicitly include punctuation when speaking in the same way that we do when writing. Early ASR engines (and many of the lower-quality offerings on the market) ignored punctuation entirely, leaving us either to insert the marks manually afterwards or to painstakingly wade through a swamp of jumbled syntax.
In fact, studies have found that transcripts without punctuation are even more detrimental to understanding than a word error rate (WER) of 15% or 20%. And it’s not just humans who struggle without punctuation. Artificial intelligence systems that perform natural language processing (NLP) also lose accuracy when clausal boundaries are missing.
Something had to change.
Rev’s speech engineers never back away from a challenge. In this article, you’ll see the hurdles that we’ve faced and how we’ve overcome them to bring you a state-of-the-art speech recognition solution that automatically includes punctuation.
ASR without punctuation may be bad, but poorly implemented auto-punctuation is even worse. If you tried. To read an article. Written like. This then you probably wouldn’t make it too far.
A lot of this issue comes from trying to delineate punctuation based on what linguists call prosody, which is basically all the stuff other than words (intonation, inflection, etc.) that accompanies our speech. The logic goes something like this: if someone stops speaking for a while, insert a period and start a new sentence.
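To make that logic concrete, here is a minimal sketch of a pause-based punctuation rule. The input format (word, start time, end time) and the thresholds are illustrative assumptions, not any engine's actual implementation:

```python
def punctuate_by_pause(words, pause_threshold=0.7):
    """words: list of (token, start_sec, end_sec) tuples from an ASR engine.

    Inserts a period after a long silence and a comma after a shorter one.
    The 0.7-second threshold is an arbitrary illustrative choice.
    """
    out = []
    for i, (token, start, end) in enumerate(words):
        out.append(token)
        is_last = i == len(words) - 1
        pause = None if is_last else words[i + 1][1] - end
        if is_last or pause >= pause_threshold:
            out[-1] += "."          # long silence: guess a sentence boundary
        elif pause >= pause_threshold / 2:
            out[-1] += ","          # shorter silence: guess a clause boundary
    return " ".join(out)

# A speaker who pauses dramatically before "grandma" gets mis-punctuated:
words = [("let's", 0.0, 0.3), ("eat", 0.35, 0.6), ("grandma", 1.5, 2.0)]
print(punctuate_by_pause(words))  # → let's eat. grandma.
```

As the example shows, the rule happily splits a sentence at any dramatic pause, which is exactly why prosody alone isn't enough.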
The problem with that is obvious. People often run multiple sentences together without pausing, just as they often stop mid-sentence for thought or hesitation. While these signals do play a role in ASR punctuation, they’re far from foolproof. A rising inflection can indicate a question mark, but it can just as easily signal a subject that the speaker really, really cares about.
Applications like streaming ASR, such as captioning live TV, complicate things even further because we can’t “look ahead” at what comes next to figure out what punctuation we need. Remember, punctuation is all about structuring thoughts by putting them in relation to each other. This is an especially sticky issue for punctuation marks that come in pairs, such as quotation marks and parentheses, because the ASR would have to insert the opening mark before we even realize that we’re in a quote or parenthetical.
Some ASR systems give up entirely, instead relying on users to manually say “comma” or “period” if they want to punctuate their sentences. This may serve as a stop-gap measure for dictating text messages or emails, but it’s never going to work for use cases like generating podcast transcripts or smart home assistants.
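The dictation stop-gap amounts to simple string substitution: spoken command words are mapped to symbols and attached to the preceding word. The command vocabulary below is an illustrative assumption, not any particular product's:

```python
# Map spoken punctuation commands to the symbols they produce.
SPOKEN_MARKS = {"comma": ",", "period": ".", "question mark": "?"}

def apply_spoken_punctuation(transcript):
    tokens = transcript.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Check two-word commands ("question mark") before one-word ones.
        two = " ".join(tokens[i:i + 2])
        if two in SPOKEN_MARKS and out:
            out[-1] += SPOKEN_MARKS[two]
            i += 2
        elif tokens[i] in SPOKEN_MARKS and out:
            out[-1] += SPOKEN_MARKS[tokens[i]]
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(apply_spoken_punctuation("let's eat comma grandma"))
# → let's eat, grandma
```

The obvious drawback: nobody speaking naturally on a podcast or to a smart home assistant will narrate their own commas.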
Our hard work has paid off, and we’re proud to say that our ASR services provide accurate auto-punctuation, especially for common marks like commas, periods, and question marks. How did we do it?
The main technological breakthrough that’s enabled ASR punctuation is a type of deep learning model called a transformer neural network. Without getting too deep into the nitty-gritty, the main attraction of this machine learning (ML) architecture is that its accuracy keeps climbing as the training dataset grows. Simply put, it gets better and better as we feed it more audio and the corresponding ground-truth transcripts. Large volumes of high-quality data make all the difference.
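One common way to frame this in the literature is per-token classification: each word in a punctuated ground-truth transcript becomes an input token, and the mark that follows it becomes the label the model learns to predict. The sketch below shows that data preparation step; the label set and the framing are general assumptions, not Rev's exact recipe:

```python
# Turn a punctuated ground-truth transcript into (token, label) training
# pairs for a punctuation-restoration model. Each word's label records
# which mark (if any) immediately follows it.
LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def make_training_pairs(punctuated_text):
    tokens, labels = [], []
    for word in punctuated_text.split():
        mark = word[-1] if word[-1] in LABELS else None
        tokens.append(word.rstrip(",.?").lower())
        labels.append(LABELS[mark] if mark else "NONE")
    return tokens, labels

tokens, labels = make_training_pairs("Let's eat, grandma. Are you hungry?")
print(tokens)  # → ["let's", 'eat', 'grandma', 'are', 'you', 'hungry']
print(labels)  # → ['NONE', 'COMMA', 'PERIOD', 'NONE', 'NONE', 'QUESTION']
```

A transformer trained on millions of such pairs can weigh the whole context of a sentence, not just the pauses around a word, when deciding where punctuation belongs.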
Luckily, that’s something that we have in spades here at Rev. Our army of over 60,000 transcriptionists powers both our professional transcription services and our Rev.ai speech-to-text application programming interface (API) by providing a wealth of raw materials. The resulting models use statistics to consider both prosodic and grammatical factors that influence punctuation.
Never struggle through a document that lacks punctuation again. Contact us today.