The Only Guide You Need to Speech Recognition in Python with Speech-To-Text API
The mega tech companies (Microsoft, Google, and Amazon) all provide speech-to-text transcription services. These work well for most use cases, particularly consumer applications like home automation and search.
But the tech giants cast a wide net. Their solutions do not work for every niche application, and they are not always well suited to, or priced correctly for, all software applications and devices. The reasons are both architectural and feature-related. In short, a specialized application often needs full control.
Even though they are mega-large, with the world’s top engineers, the tech giants do not have as much quality audio training data as a company like Rev. The big companies vacuum up everything with devices like Alexa turned on all the time. But doing that sucks up noise and clutter as well as clear speech. That muddles the picture.
Rev takes a different approach. We employ over 50,000 freelance human transcriptionists to continually transcribe speech to text. Rev uses that highly curated data to train its AI models, making it the best and most accurate speech recognition solution in the world, consistently beating Google, Amazon, Microsoft, and others in accuracy tests.
Serving the Needs of Software and Hardware: Speech-to-Text API Use Cases & Examples
Some common scenarios Rev.ai handles:
Rev.ai can add captions & transcripts to real-time streaming media. For example, Rev used Rev.ai to create a live captioning integration for Zoom.
Transcripts of Videos
The video company Loom uses Rev to transcribe videos on their video hosting platform.
Video or Audio Editing/Production
Hollywood studios & production companies often use transcription for video editing. For example, they transcribe all available footage so editors can quickly find the takes or scenes they need.
Video/Audio Accessibility & Compliance
All companies need to comply with accessibility laws and make video & audio accessible to all individuals. Think about anyone who is deaf or hard of hearing. Rev AI can help with making your software, applications, video, and audio more accessible.
Transcripts of Meetings
Virtual meetings like Zoom meetings are becoming more and more common in all industries. Any recorded meeting can also be transcribed. This is a great replacement for taking meeting notes, and a way to improve the meeting experience for deaf & hard of hearing individuals.
Transcripts of Interviews
Documentary filmmakers, journalists, and media companies use speech recognition for interviews.
Converting massive amounts of audio or video to text creates a ton of data. You can use this data for analysis in a wide range of industries.
Police Body Cameras
Camera manufacturers can add the ability to transcribe video footage. This meets the legal requirements of the state and makes legal discovery easy, as the user can search for text instead of having to watch many hours of video. Axon uses Rev for this currently. Transcribing video footage has many use cases beyond police body cameras.
Podcasts are blowing up in popularity, and transcriptions of podcasts can create an entirely new asset for any podcast. Converting podcasts to text can improve accessibility and create an SEO asset for any podcast.
The legal industry is becoming more virtual all the time. Depositions, live court reporting, and more can benefit from speech recognition.
Python Speech Recognition Code Examples
Here we provide a code example, so a developer or CTO can understand the Rev.ai solution.
In this example we use Python, one of the simplest and most widely used programming languages.
Asynchronous API Python Code Example
We call our products asynchronous (pre-recorded) and streaming (realtime).
Here is a simple asynchronous example: we transcribe Dr. Martin Luther King’s famous 17-minute “I Have a Dream” speech.
Asynchronous API Python Code Explained
To get started, log into the portal and generate an API key. Download the SDK from the public Python repository, or install it with pip.
The Python code is rather simple, since Rev.ai does all the heavy lifting. For example, the API handles the complexity of working with different audio sample rates (such as 16 kHz) and media formats, uploading a file, queuing it for processing, issuing a callback, and so on. The programmer just:
- Submits the file (base64 encoded) or URL
- Checks the status
- Downloads the transcription
Basically, the procedure is to submit an audio file or URL to the rev.ai engine. Then you poll the system and retrieve the results when the transcription is complete. Or, you can use callbacks. Callbacks tell your program when transcription is complete with no time delay.
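The polling approach described above can be sketched as a small helper that gives up after a deadline instead of waiting forever. This is only an illustration under stated assumptions: `wait_for_job`, its parameters, and the status string `"TRANSCRIBED"` are hypothetical names chosen for this sketch, and the status source is a stand-in function rather than a real Rev.ai call.

```python
import time


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns something other than "IN_PROGRESS".

    get_status is any zero-argument callable that returns the job status as a
    string (hypothetical interface; wrap your real client call in it).
    Raises TimeoutError if the job is still in progress after timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status
        if clock() >= deadline:
            raise TimeoutError("job still in progress after %.0f s" % timeout_seconds)
        sleep(poll_seconds)


# Stand-in for a real status call: two "in progress" answers, then a final state.
# "TRANSCRIBED" is an illustrative value, not taken from the article.
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "TRANSCRIBED"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

In a real program, `get_status` would wrap the job-details lookup, and a callback would remove the need for the loop entirely.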
The Python code below does the following:
- Reads the API key, which we have saved as an environment variable
- Opens a client connection to rev.ai
- Submits a URL or file
- Queries the job status by job.id
- When the job is complete, downloads the results as a text file
There are other options, like submitting audio or video files for transcription. You can read about those in the documentation.
The Complete Python Code for the Rev AI Asynchronous API
Here is the complete code:
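import os
import time

from rev_ai import apiclient

# Read the API key from an environment variable
api_key = os.environ['REVAI_APIKEY']
client = apiclient.RevAiAPIClient(api_key)

# Submit the audio URL for transcription
job = client.submit_job_url("https://archive.org/download/MLKDream/MLKDream_64kb.mp3")

# Poll once a minute, refreshing the job details each time,
# until the job is no longer in progress
job_details = client.get_job_details(job.id)
while job_details.status.name == "IN_PROGRESS":
    print(job_details.status)
    time.sleep(60)
    job_details = client.get_job_details(job.id)

print("done", job_details.status)

# Download the finished transcript as plain text
transcript_text = client.get_transcript_text(job.id)
print(transcript_text)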
The output looks like this:
Speaker 0 00:00:00 I have the pleasure to present to you, Dr. Martin Luther King
Speaker 1 00:00:04 <inaudible>.
Speaker 0 00:00:13 I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation.
Speaker 1 00:00:28 <inaudible>
Speaker 0 00:00:37 Five score years ago, a great American in whose symbolic shadow.
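Once you have the transcript text, a few lines of Python can turn it into searchable records. The sketch below assumes each segment sits on its own line in the form "Speaker N HH:MM:SS text", as in the sample output; `Segment`, `parse_transcript`, and `search` are hypothetical helpers written for this illustration, not part of the Rev.ai SDK.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: int
    start_seconds: int
    text: str


# Assumed line layout: "Speaker <n> <HH:MM:SS> <text>"
LINE_PATTERN = re.compile(r"^Speaker\s+(\d+)\s+(\d{2}):(\d{2}):(\d{2})\s+(.*)$")


def parse_transcript(transcript_text: str) -> List[Segment]:
    """Split plain-text transcript output into per-speaker segments."""
    segments = []
    for line in transcript_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        speaker, hours, minutes, seconds, text = match.groups()
        start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        segments.append(Segment(int(speaker), start, text.strip()))
    return segments


def search(segments: List[Segment], phrase: str) -> List[Segment]:
    """Return the segments whose text contains phrase (case-insensitive)."""
    needle = phrase.lower()
    return [segment for segment in segments if needle in segment.text.lower()]


sample = """Speaker 0 00:00:00 I have the pleasure to present to you, Dr. Martin Luther King
Speaker 1 00:00:04 <inaudible>.
Speaker 0 00:00:13 I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation."""

segments = parse_transcript(sample)
for hit in search(segments, "freedom"):
    print(hit.speaker, hit.start_seconds, hit.text[:40])
```

Keeping the start time in seconds makes it easy to jump to the matching point in the audio or video, which is the kind of text search the use cases above rely on.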
How to Use Rev’s Streaming Speech Recognition API with Python
Working with streaming data is completely different from working with a single file.
When you work with a stream you have to work at a lower level of the network stack. This is because a stream is not like a file that you open and close. Rev.ai handles that complexity by working at the WebSocket layer. Async uses REST. Streaming uses WebSockets.
At first glance, you might think the streaming API transcribes streaming audio data that is sent to you. It works the other way around. Your program captures the audio, for example with an audio library such as PyAudio, and connects to Rev AI over a persistent connection. Rev.ai processes the audio or video and sends back the transcribed text, also as a stream.
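On the sending side, a stream usually reaches the service as a sequence of small binary chunks pushed over the persistent connection. The sketch below shows only that chunking step. The function name `audio_chunks`, the chunk size, and the silent test audio are assumptions made for illustration; it makes no Rev.ai call and is not the SDK's own streaming code.

```python
from typing import Iterator


def audio_chunks(raw_audio: bytes, chunk_size: int = 8000) -> Iterator[bytes]:
    """Yield raw_audio in consecutive pieces of at most chunk_size bytes.

    A generator like this is the usual shape of input for a streaming client:
    each yielded piece would be sent as one binary message on the connection.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(raw_audio), chunk_size):
        yield raw_audio[offset:offset + chunk_size]


# One second of silent 16-bit mono audio at 16 kHz stands in for a live source.
one_second = bytes(16000 * 2)
chunks = list(audio_chunks(one_second, chunk_size=8000))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

With a live microphone, the generator would read from the audio device instead of slicing a buffer, but the consumer sees the same thing: an iterator of byte chunks.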
The code is similarly simple. Obviously, you would have to write your own video or audio server. But that’s to be expected as Rev.ai exists to service this type of application.
You can read more about how to use Rev.ai in our detailed documentation.