Easy Way to Learn Speech Recognition in Java With a Speech-To-Text API


Here we show how to use a speech-to-text API with two Java examples.

We will be using the Rev AI API (free for your first 5 hours), which has two different speech-to-text APIs:

  1. Asynchronous API – For pre-recorded audio or video
  2. Streaming API – For live (streaming) audio or video

Asynchronous Rev AI API Java Code Example

We will use the Rev AI Java SDK located here.  We use this short audio file, on the exciting topic of HR recruiting.

First, sign up for Rev AI for free and get an access token.
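
Rather than hard-coding the token in your source, you can read it from an environment variable.  Here is a minimal sketch; the variable name REV_ACCESS_TOKEN is our own choice (it matches the curl example later in this article):

// Read the access token from an environment variable instead of hard-coding it
String accessToken = System.getenv("REV_ACCESS_TOKEN");
if (accessToken == null || accessToken.isEmpty()) {
    throw new IllegalStateException("Please set the REV_ACCESS_TOKEN environment variable");
}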

Create a Java project with whatever editor you normally use.  Then add this dependency to the Maven pom.xml manifest:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Rev2</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>ai.rev</groupId>
            <artifactId>revai-java-sdk</artifactId>
            <version>1.3.0</version>
        </dependency>
    </dependencies>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
    </properties>

</project>

The full code sample is below.  We explain it and show the output.

Submit the job from a URL:


submittedJob = apiClient.submitJobUrl(mediaUrl, revAiJobOptions);
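
If your media is a local file rather than a URL, the SDK also supports local-file submission.  A minimal sketch, assuming the SDK's submitJobLocalFile method and using a placeholder path:

// Sketch: submit a local file instead of a URL (the path is a placeholder)
RevAiJob submittedJob = apiClient.submitJobLocalFile("/path/to/audio.mp3", revAiJobOptions);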

Most of the Rev AI options are self-explanatory.  If you don't want to use the polling method we use in this example, you can set a callback URL so that another program, standing by and listening over HTTP, is notified when the transcription is ready; a minimal receiver sketch follows the options below.

RevAiJobOptions revAiJobOptions = new RevAiJobOptions();
        revAiJobOptions.setCustomVocabularies(Arrays.asList(customVocabulary));
        revAiJobOptions.setMetadata("My first submission");
        revAiJobOptions.setCallbackUrl("https://example.com");
        revAiJobOptions.setSkipPunctuation(false);
        revAiJobOptions.setSkipDiarization(false);
        revAiJobOptions.setFilterProfanity(true);
        revAiJobOptions.setRemoveDisfluencies(true);
        revAiJobOptions.setSpeakerChannelsCount(null);
        revAiJobOptions.setDeleteAfterSeconds(2592000); // 30 days in seconds
        revAiJobOptions.setLanguage("en");
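
If you go the callback route, the receiver can be anything that accepts an HTTP POST.  Here is a minimal sketch using the JDK's built-in HttpServer; the port and path are our own choices, and in practice the callback URL you register must be publicly reachable:

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class CallbackReceiver {
    public static void main(String[] args) throws Exception {
        // Listen on port 8080 for Rev AI's job-complete notification
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/revai-callback", exchange -> {
            // Rev AI POSTs the finished job details as JSON
            String body = new String(exchange.getRequestBody().readAllBytes());
            System.out.println("Callback received: " + body);
            exchange.sendResponseHeaders(200, -1); // acknowledge with 200 and no body
            exchange.close();
        });
        server.start();
    }
}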

Poll in a loop, checking the job status, and download the transcription when the job is done.

RevAiJobStatus retrievedJobStatus = retrievedJob.getJobStatus();

The SDK returns captions as well as text.

srtCaptions = apiClient.getCaptions(jobId, RevAiCaptionType.SRT);
vttCaptions = apiClient.getCaptions(jobId, RevAiCaptionType.VTT);
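
The captions come back as an InputStream.  Here is a minimal sketch of writing the SRT stream to disk; the file name is our own:

// Save the SRT captions returned by getCaptions() to a file
try (InputStream srt = apiClient.getCaptions(jobId, RevAiCaptionType.SRT)) {
    java.nio.file.Files.copy(
            srt,
            java.nio.file.Paths.get("transcript.srt"),
            java.nio.file.StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
    e.printStackTrace();
}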

Here is the complete code:

package ai.rev.speechtotext;

import ai.rev.speechtotext.models.asynchronous.RevAiCaptionType;
import ai.rev.speechtotext.models.asynchronous.RevAiJob;
import ai.rev.speechtotext.models.asynchronous.RevAiJobOptions;
import ai.rev.speechtotext.models.asynchronous.RevAiJobStatus;
import ai.rev.speechtotext.models.asynchronous.RevAiTranscript;
import ai.rev.speechtotext.models.vocabulary.CustomVocabulary;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class AsyncTranscribeMediaUrl {

    public static void main(String[] args) {
        // Assign your access token to a String
        String accessToken = "your_access_token";

        // Initialize the ApiClient with your access token
        ApiClient apiClient = new ApiClient(accessToken);

        // Create a custom vocabulary for your submission
        CustomVocabulary customVocabulary =
                new CustomVocabulary(Arrays.asList("Robert Berwick", "Noam Chomsky", "Evelina Fedorenko"));

        // Initialize the RevAiJobOptions object and assign
        RevAiJobOptions revAiJobOptions = new RevAiJobOptions();
        revAiJobOptions.setCustomVocabularies(Arrays.asList(customVocabulary));
        revAiJobOptions.setMetadata("My first submission");
        revAiJobOptions.setCallbackUrl("https://example.com");
        revAiJobOptions.setSkipPunctuation(false);
        revAiJobOptions.setSkipDiarization(false);
        revAiJobOptions.setFilterProfanity(true);
        revAiJobOptions.setRemoveDisfluencies(true);
        revAiJobOptions.setSpeakerChannelsCount(null);
        revAiJobOptions.setDeleteAfterSeconds(2592000); // 30 days in seconds
        revAiJobOptions.setLanguage("en");

        RevAiJob submittedJob;

        String mediaUrl = "https://www.rev.ai/FTC_Sample_1.mp3";

        try {
            // Submit the media URL and transcription options
            submittedJob = apiClient.submitJobUrl(mediaUrl, revAiJobOptions);
        } catch (IOException e) {
            throw new RuntimeException("Failed to submit url [" + mediaUrl + "] " + e.getMessage());
        }
        String jobId = submittedJob.getJobId();
        System.out.println("Job Id: " + jobId);
        System.out.println("Job Status: " + submittedJob.getJobStatus());
        System.out.println("Created On: " + submittedJob.getCreatedOn());

        // Waits 5 seconds between each status check to see if job is complete
        boolean isJobComplete = false;
        while (!isJobComplete) {
            RevAiJob retrievedJob;
            try {
                retrievedJob = apiClient.getJobDetails(jobId);
            } catch (IOException e) {
                throw new RuntimeException("Failed to retrieve job [" + jobId + "] " + e.getMessage());
            }

            RevAiJobStatus retrievedJobStatus = retrievedJob.getJobStatus();
            if (retrievedJobStatus == RevAiJobStatus.TRANSCRIBED
                    || retrievedJobStatus == RevAiJobStatus.FAILED) {
                isJobComplete = true;
            } else {
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

        // Get the transcript and caption outputs
        RevAiTranscript objectTranscript;
        String textTranscript;
        InputStream srtCaptions;
        InputStream vttCaptions;

        try {
            objectTranscript = apiClient.getTranscriptObject(jobId);
            textTranscript = apiClient.getTranscriptText(jobId);
            srtCaptions = apiClient.getCaptions(jobId, RevAiCaptionType.SRT);
            vttCaptions = apiClient.getCaptions(jobId, RevAiCaptionType.VTT);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The program prints:

Job Id: hzNOmTjrPfzB
Job Status: {status='in_progress'}
Created On: 2021-03-16T08:12:14.202Z

Process finished with exit code 0

You can retrieve the transcript text in Java:

textTranscript = apiClient.getTranscriptText(jobId);

Or retrieve it later with curl, using the job id printed to stdout above.

curl -X GET "https://api.rev.ai/speechtotext/v1/jobs/{id}/transcript" -H "Authorization: Bearer $REV_ACCESS_TOKEN" -H "Accept: application/vnd.rev.transcript.v1.0+json"

This returns the transcription in JSON format (truncated here):

{"monologues":[{"speaker":0,"elements":[{"type":"text","value":"Hi","ts":0.27,"end_ts":0.48,"confidence":1.0},{"type":"punct","value":","},{"type":"punct","value":" "},{"type":"text","value":"my","ts":0.51,"end_ts":0.66,"confidence":1.0},{"type":"punct","value":" "},{"type":"text","value":"name's","ts":0.66,"end_ts":0.84,"confidence":0.89},{"type":"punct","value":" "},{"type":"text","value":"Jack","ts":0.84,"end_ts":1.05,"confidence":1.0},{"type":"punct","value":" "},{"type":"text","value":"Ratzinger","ts":1.05,"end_ts":1.5,"confidence":1.0},{"type":"punct","value":"."},{"type":"punct","value":" "},{"type":"text","value":"And","ts":1.53,"end_ts":1.83,"confidence":1.0},{"type":"punct","value":" "},{"type":"text","value":"I'm","ts":1.83,"end_ts":1.92,"confidence":1.0},{"type":"punct","value":" "},{"type":"text","value":"going","ts":1.92,"end_ts":2.04,"confidence":1.0},{"type":"punct","value":" "},{"type":"text","value":"to","ts":2.04,"end_ts":2.1,"confidence":1.0},{"type":"punct","value":" "},{"type":"text","value":"be","ts":2.1,"end_ts":2.16,"confidence":1.0},{"type":"punct","value":" "}

Streaming Rev AI API Java Code Example

A stream is a WebSocket connection from your video or audio server to the Rev AI speech-to-text engine.

We can emulate this connection by streaming a .raw file from the local hard drive to Rev AI.

Download the audio, then convert it from .wav to .raw format.  On Ubuntu, the following ffmpeg command does the conversion:

ffmpeg -i Harvard_list_01.wav -f f32le -acodec pcm_f32le Harvard_list_01.raw

As it runs, ffmpeg prints key information about the audio file:

Input #0, wav, from 'Harvard_list_01.wav':
  Metadata:
    encoder         : Adobe Audition CC 2017.1 (Windows)
    date            : 2018-12-20
    creation_time   : 15:58:57
    time_reference  : 0
  Duration: 00:01:07.67, bitrate: 1536 kb/s
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 48000 Hz, 1 channels, flt, 1536 kb/s
Output #0, f32le, to 'Harvard_list_01.raw':
  Metadata:
    time_reference  : 0
    date            : 2018-12-20
    encoder         : Lavf56.40.101
    Stream #0:0: Audio: pcm_f32le, 48000 Hz, mono, flt, 1536 kb/s
    Metadata:
      encoder         : Lavc56.60.100 pcm_f32le
Stream mapping:

First, we set up a WebSocket connection and start streaming the file:

// Initialize your WebSocket listener
WebSocketListener webSocketListener = new WebSocketListener();

// Begin the streaming session
streamingClient.connect(webSocketListener, streamContentType, sessionConfig);

The important items to set here are the sampling rate (not bit rate) and the format.  We match this information from the ffmpeg output above: Audio: pcm_f32le, 48000 Hz.  (f32le means 32-bit little-endian floats; at 48,000 samples per second that works out to the 1536 kb/s bitrate ffmpeg reports.)

StreamContentType streamContentType = new StreamContentType();
        streamContentType.setContentType("audio/x-raw"); // audio content type
        streamContentType.setLayout("interleaved"); // layout
        streamContentType.setFormat("F32LE"); // format
        streamContentType.setRate(48000); // sample rate
        streamContentType.setChannels(1); // channels

After the client connects, the onConnected event delivers a message like the one shown below.  We can get the job id from it; this lets us download the transcription later if we don't want to consume it in real time (see the sketch after the message).


type='{messageType='connected'}', id='Rwnhc2vEAwYO'}
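
Here is a sketch of capturing that id inside onConnected.  We parse the message's string form with a regex so the sketch does not depend on a particular accessor name in the SDK version you have:

@Override
public void onConnected(ConnectedMessage message) {
    System.out.println(message);
    // Pull the id out of the message's string form, e.g. id='Rwnhc2vEAwYO'
    java.util.regex.Matcher m =
            java.util.regex.Pattern.compile("id='([^']+)'").matcher(message.toString());
    if (m.find()) {
        String streamJobId = m.group(1); // save this to fetch the transcript later
        System.out.println("Job id: " + streamJobId);
    }
}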

To get the transcription in real time, listen for the onHypothesis event:

@Override
public void onHypothesis(Hypothesis hypothesis) {
    if (hypothesis.getType() == MessageType.FINAL) {
        String textValue = "";
        for (Element element : hypothesis.getElements()) {
            textValue = textValue + element.getValue();
        }
        System.out.println(" text transcribed " + textValue);
    }
}

Here is the complete code:

package com.example.stream;


import ai.rev.speechtotext.models.streaming.ConnectedMessage;
import ai.rev.speechtotext.models.streaming.Hypothesis;
import ai.rev.speechtotext.models.streaming.SessionConfig;
import ai.rev.speechtotext.models.streaming.StreamContentType;
import ai.rev.speechtotext.StreamingClient;
import ai.rev.speechtotext.RevAiWebSocketListener;
import ai.rev.speechtotext.models.streaming.MessageType;
import ai.rev.speechtotext.models.asynchronous.Element;
import okhttp3.Response;
import okio.ByteString;

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class StreamingFromLocalFile {

    public static void main(String[] args) throws InterruptedException {
        // Assign your access token to a String
        String accessToken = "xxxxx";

        // Configure the streaming content type
        StreamContentType streamContentType = new StreamContentType();
        streamContentType.setContentType("audio/x-raw"); // audio content type
        streamContentType.setLayout("interleaved"); // layout
        streamContentType.setFormat("F32LE"); // format
        streamContentType.setRate(48000); // sample rate
        streamContentType.setChannels(1); // channels

        // Set up the SessionConfig with any optional parameters
        SessionConfig sessionConfig = new SessionConfig();
        sessionConfig.setMetaData("Streaming from the Java SDK");
        sessionConfig.setFilterProfanity(true);
        sessionConfig.setRemoveDisfluencies(true);
        sessionConfig.setDeleteAfterSeconds(2592000); // 30 days in seconds

        // Initialize your client with your access token
        StreamingClient streamingClient = new StreamingClient(accessToken);

        // Initialize your WebSocket listener
        WebSocketListener webSocketListener = new WebSocketListener();

        // Begin the streaming session
        streamingClient.connect(webSocketListener, streamContentType, sessionConfig);

        // Read file from disk
        File file = new File("/Users/walkerrowe/Downloads/Harvard_list_01.raw");

        // Convert file into byte array
        byte[] fileByteArray = new byte[(int) file.length()];
        try (final FileInputStream fileInputStream = new FileInputStream(file)) {
            BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream);
            try (final DataInputStream dataInputStream = new DataInputStream(bufferedInputStream)) {
                dataInputStream.readFully(fileByteArray, 0, fileByteArray.length);
            } catch (IOException e) {
                throw new RuntimeException(e.getMessage());
            }
        } catch (IOException e) {
            throw new RuntimeException(e.getMessage());
        }

        // Set the number of bytes to send in each message
        int chunk = 8000;

        // Stream the file in the configured chunk size
        for (int start = 0; start < fileByteArray.length; start += chunk) {
            streamingClient.sendAudioData(
                    ByteString.of(
                            ByteBuffer.wrap(
                                    Arrays.copyOfRange(
                                            fileByteArray, start, Math.min(fileByteArray.length, start + chunk)))));
        }

        // Wait to make sure all responses are received
        Thread.sleep(5000);

        // Close the WebSocket
        streamingClient.close();
    }

    // Your WebSocket listener for all streaming responses
    private static class WebSocketListener implements RevAiWebSocketListener {

        @Override
        public void onConnected(ConnectedMessage message) {
            System.out.println(message);
        }

        @Override
        public void onHypothesis(Hypothesis hypothesis) {
            if (hypothesis.getType() == MessageType.FINAL) {
                String textValue = "";
                for (Element element: hypothesis.getElements()) {
                    textValue = textValue + element.getValue();
                }

                System.out.println(" text transcribed " + textValue);

            }
            // Uncomment the next line to log every hypothesis message, including partials
            // System.out.println(hypothesis.toString());
        }


        @Override
        public void onError(Throwable t, Response response) {
            System.out.println(response);
        }

        @Override
        public void onClose(int code, String reason) {
            System.out.println(reason);
        }

        @Override
        public void onOpen(Response response) {
            System.out.println(response.toString());
        }
    }
}
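
Note that the loop above pushes the whole file as fast as it can.  At 48,000 samples per second of 4-byte floats, the audio plays back at 192,000 bytes per second, so each 8,000-byte chunk holds roughly 42 ms of sound.  To emulate a live microphone you could pace the loop; here is a sketch, written as a drop-in replacement for the loop in main (which already declares throws InterruptedException):

// Each chunk of 32-bit mono audio at 48 kHz represents
// chunk / (48000 * 4) seconds, about 41.7 ms for 8,000 bytes
long chunkMillis = (long) (1000.0 * chunk / (48000 * 4));
for (int start = 0; start < fileByteArray.length; start += chunk) {
    streamingClient.sendAudioData(
            ByteString.of(
                    ByteBuffer.wrap(
                            Arrays.copyOfRange(
                                    fileByteArray, start, Math.min(fileByteArray.length, start + chunk)))));
    Thread.sleep(chunkMillis); // wait roughly one chunk's duration before sending the next
}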

Here is what the output looks like:

Response{protocol=http/1.1, code=101, message=Switching Protocols, url=https://api.rev.ai/speechtotext/v1/stream?access_token=xxxxxxxxxx&filter_profanity=true&remove_disfluencies=true&delete_after_seconds=2592000&content_type=audio/x-raw;layout=interleaved;rate=48000;format=F32LE;channels=1}
{type='{messageType='connected'}', id='GJgJz1cwz6pe'}
 text transcribed Okay.
 text transcribed List number one.
 text transcribed The Birch canoe slid on the smooth planks.
 text transcribed Do the sheet to the dark blue background.
 text transcribed It's easy to tell the depth of a well.
 text transcribed These days, a chicken leg is a rare dish.
 text transcribed Rice is often served in round bowls.
 text transcribed The juice of lemons makes fine. Punch.
 text transcribed The box was thrown beside the parked truck.
 text transcribed The Hawks were fed chopped corn and garbage.
 text transcribed Four hours of steady work face. Does.
 text transcribed Large size and stockings is hard to sell.
End of input. Closing

What is the Best Speech Recognition API for Java?

Accuracy is what you want in a speech-to-text API, and Rev AI is one of a kind in that regard.

You might ask, “So what?  Siri and Alexa already do speech-to-text, and Google has a speech cloud API.”

That’s true.  But there’s one game-changing difference: 

The data that powers Rev AI is manually collected and carefully edited.  Rev pays 50,000 freelancers to transcribe audio & caption videos for its 99% accurate transcription & captioning services. Rev AI is trained with this human-sourced data, and this produces transcripts that are far more accurate than those compiled simply by collecting audio, as Siri and Alexa do.

Rev AI’s accuracy is also snowballing, in a sense: its speech recognition system and API keep improving as the dataset grows and its world-class engineers refine the product.

Labelled Data and Machine Learning

Why is human transcription important?

If you are familiar with machine learning then you know that converting audio to text is a classification problem.  

To train a computer to transcribe audio, ML programmers feed feature-label data into their model.  This data is called a training set.

Features (the sound) are the input, and labels (the corresponding letters and words) are the output that the classification algorithm learns to predict, as in the toy sketch below.
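
As a toy illustration (the class and field names are our own), a single training example pairs a window of audio samples with the text a human typed for it:

// A toy feature-label pair: the audio samples are the features,
// the human-typed transcription is the label
class TrainingExample {
    final float[] audioSamples; // features: a window of raw audio
    final String label;         // label: what a human heard in that window

    TrainingExample(float[] audioSamples, String label) {
        this.audioSamples = audioSamples;
        this.label = label;
    }
}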

Alexa and Siri vacuum up this data all day long.  So you would think they would have the largest and therefore most accurate training data.  

But that’s only half of the equation.  It takes many hours of manual work to type in the labels that correspond to the audio.  In other words, a human must listen to the audio and type the corresponding letters and words.

This is what Rev AI has done.

It’s a business model that has taken off, because it fills a very specific need.

For example, look at closed captioning on YouTube.  YouTube can automatically add captions to its audio, but they are not always accurate.  You will notice that some of what it produces is nonsense.  It’s just like Google Translate: it works most of the time, but not all of the time.

The giant tech companies use statistical analysis, like the frequency distribution of words, to help their models.

But they are consistently outperformed by models trained on manually labeled audio-to-text data.