How to use batch transcription

Batch transcription is a set of REST API operations that enables you to transcribe a large amount of audio in storage. You can point to audio files by using a typical URI or a shared access signature (SAS) URI, and asynchronously receive transcription results. With the v3.0 API, you can transcribe one or more audio files, or process a whole storage container.

You can use batch transcription REST APIs to call the following methods:

| Batch transcription operation | Method | REST API call |
|---|---|---|
| Creates a new transcription. | POST | speechtotext/v3.0/transcriptions |
| Retrieves a list of transcriptions for the authenticated subscription. | GET | speechtotext/v3.0/transcriptions |
| Gets a list of supported locales for offline transcriptions. | GET | speechtotext/v3.0/transcriptions/locales |
| Updates the mutable details of the transcription identified by its ID. | PATCH | speechtotext/v3.0/transcriptions/{id} |
| Deletes the specified transcription task. | DELETE | speechtotext/v3.0/transcriptions/{id} |
| Gets the transcription identified by the specified ID. | GET | speechtotext/v3.0/transcriptions/{id} |
| Gets the result files of the transcription identified by the specified ID. | GET | speechtotext/v3.0/transcriptions/{id}/files |

For more information, see the Speech-to-text REST API v3.0 reference documentation.
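
For example, you can call the create operation with any HTTP client. Here's a minimal sketch (not part of the official samples); the region, key, and request body values are placeholders that you replace with your own:

// A minimal sketch that calls the "create transcription" operation
// (POST speechtotext/v3.0/transcriptions) directly with HttpClient.
// The region, key, and audio URL below are placeholders.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class CreateTranscriptionExample
{
    static async Task Main()
    {
        var region = "westus";                      // your Speech resource region
        var key = "<your-speech-resource-key>";     // your Speech resource key
        var body = @"{
          ""contentUrls"": [ ""<URL to an audio file to transcribe>"" ],
          ""locale"": ""en-US"",
          ""displayName"": ""My batch transcription""
        }";

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", key);

        var response = await http.PostAsync(
            $"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions",
            new StringContent(body, Encoding.UTF8, "application/json"));

        response.EnsureSuccessStatusCode();
        // The Location header (and the self property in the response body) identifies the new transcription.
        Console.WriteLine($"Created: {response.Headers.Location}");
    }
}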

Batch transcription jobs are scheduled on a best-effort basis. You can't estimate when a job will change into the running state, but it should happen within minutes under normal system load. When the job is in the running state, the transcription occurs faster than the audio runtime playback speed.

Prerequisites

As with all features of the Speech service, you create a Speech resource from the Azure portal.

Note

To use batch transcription, you need a standard Speech resource (S0) in your subscription. Free resources (F0) aren't supported. For more information, see pricing and limits.

If you plan to customize models, follow the steps in Acoustic customization and Language customization. To use the created models in batch transcription, you need their model location. You can retrieve the model location when you inspect the details of the model (the self property). A deployed custom endpoint isn't needed for the batch transcription service.
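
If you retrieve the model location in code, the following sketch (not part of the official samples) lists the custom models for your subscription and prints each model's self URI; the region and key are placeholders, and it assumes the GET speechtotext/v3.0/models operation returns the same values collection as the other v3.0 list operations:

// A hedged sketch that lists custom models (GET speechtotext/v3.0/models) and prints the
// "self" URI of each one, which is the model location used in batch transcription requests.
// The region and key are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class ListModelsExample
{
    static async Task Main()
    {
        var region = "westus";
        var key = "<your-speech-resource-key>";

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", key);

        var json = await http.GetStringAsync(
            $"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/models");

        foreach (var model in JObject.Parse(json)["values"].Children())
        {
            Console.WriteLine($"{model["displayName"]}: {model["self"]}");
        }
    }
}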

Note

As a part of the REST API, batch transcription has a set of quotas and limits. It's a good idea to review these. To take full advantage of the ability to efficiently transcribe a large number of audio files, send multiple files per request or point to an Azure Blob Storage container with the audio files to transcribe. The service transcribes the files concurrently, which reduces the turnaround time. For more information, see the Configuration section of this article.

Batch transcription API

The batch transcription API supports the following formats:

| Format | Codec | Bits per sample | Sample rate |
|---|---|---|---|
| WAV | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo |
| MP3 | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo |
| OGG | OPUS | 16-bit | 8 kHz or 16 kHz, mono or stereo |

For stereo audio streams, the left and right channels are split during the transcription. A JSON result file is created for each channel. To create an ordered final transcript, use the timestamps that are generated per utterance.
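
Here's a minimal sketch (not part of the official samples, and assuming the result file format shown later in this article) that merges the two per-channel result files of a stereo recording into one time-ordered transcript; the file names are placeholders:

// A minimal sketch that merges per-channel result files of a stereo recording into one
// transcript ordered by the per-utterance timestamps. The file paths are placeholders.
using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

class MergeChannelsExample
{
    static void Main()
    {
        var channelFiles = new[] { "channel0.json", "channel1.json" };  // downloaded result files

        var phrases = channelFiles
            .Select(path => JObject.Parse(File.ReadAllText(path)))
            .SelectMany(result => result["recognizedPhrases"].Children())
            .OrderBy(phrase => (double)phrase["offsetInTicks"]);

        foreach (var phrase in phrases)
        {
            // Print the best hypothesis for each phrase with its channel and offset.
            Console.WriteLine($"[{phrase["offset"]}] (channel {phrase["channel"]}) {phrase["nBest"][0]["display"]}");
        }
    }
}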

Configuration

Configuration parameters are provided as JSON. You can transcribe one or more individual files, process a whole storage container, and use a custom trained model in a batch transcription.

If you have more than one file to transcribe, it's a good idea to send multiple files in one request. The following example uses three files:

{
  "contentUrls": [
    "<URL to an audio file 1 to transcribe>",
    "<URL to an audio file 2 to transcribe>",
    "<URL to an audio file 3 to transcribe>"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": true
  },
  "locale": "en-US",
  "displayName": "Transcription of file using default model for en-US"
}

To process a whole storage container, use the following configuration. The container SAS must include r (read) and l (list) permissions:

{
  "contentContainerUrl": "<SAS URL to the Azure blob container to transcribe>",
  "properties": {
    "wordLevelTimestampsEnabled": true
  },
  "locale": "en-US",
  "displayName": "Transcription of container using default model for en-US"
}
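
If you generate the container SAS in code, here's a hedged sketch using the Azure.Storage.Blobs SDK (the SDK isn't part of this article's samples); the account name, key, and container name are placeholders:

// A hedged sketch that creates a container SAS URL with the read (r) and list (l)
// permissions required by contentContainerUrl. Account, key, and container are placeholders.
using System;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Sas;

class ContainerSasExample
{
    static void Main()
    {
        var credential = new StorageSharedKeyCredential("<storage-account-name>", "<storage-account-key>");
        var container = new BlobContainerClient(
            new Uri("https://<storage-account-name>.blob.core.windows.net/<container-name>"),
            credential);

        // Read + List permissions, valid for two days.
        Uri sasUrl = container.GenerateSasUri(
            BlobContainerSasPermissions.Read | BlobContainerSasPermissions.List,
            DateTimeOffset.UtcNow.AddDays(2));

        Console.WriteLine(sasUrl);  // use this value as contentContainerUrl
    }
}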

Here's an example of using a custom trained model in a batch transcription. This example uses three files:

{
  "contentUrls": [
    "<URL to an audio file 1 to transcribe>",
    "<URL to an audio file 2 to transcribe>",
    "<URL to an audio file 3 to transcribe>"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": true
  },
  "locale": "en-US",
  "model": {
    "self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.0/models/{id}"
  },
  "displayName": "Transcription of file using default model for en-US"
}

Configuration properties

Use these optional properties to configure transcription:

| Parameter | Description |
|---|---|
| profanityFilterMode | Optional, defaults to Masked. Specifies how to handle profanity in recognition results. Accepted values are None to disable profanity filtering, Masked to replace profanity with asterisks, Removed to remove all profanity from the result, or Tags to add profanity tags. |
| punctuationMode | Optional, defaults to DictatedAndAutomatic. Specifies how to handle punctuation in recognition results. Accepted values are None to disable punctuation, Dictated to imply explicit (spoken) punctuation, Automatic to let the decoder deal with punctuation, or DictatedAndAutomatic to use dictated and automatic punctuation. |
| wordLevelTimestampsEnabled | Optional, false by default. Specifies whether word-level timestamps should be added to the output. |
| diarizationEnabled | Optional, false by default. Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains two voices. Requires wordLevelTimestampsEnabled to be set to true. |
| channels | Optional, 0 and 1 transcribed by default. An array of channel numbers to process. You can specify a subset of the available channels in the audio file to be processed (for example, 0 only). |
| timeToLive | Optional, no deletion by default. A duration after which a completed transcription is automatically deleted. The timeToLive is useful when you mass-process transcriptions, to ensure they're eventually deleted (for example, PT12H for 12 hours). |
| destinationContainerUrl | Optional URL with ad hoc SAS to a writable container in Azure. The result is stored in this container. SAS with stored access policies isn't supported. If you don't specify a container, Microsoft stores the results in a storage container managed by Microsoft. When the transcription is deleted by calling Delete transcription, the result data is also deleted. |
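
Several of these properties can be combined in a single request. The following sketch (not part of the official samples) builds such a request body in C# and prints the resulting JSON; the URLs and duration values are placeholders:

// A minimal sketch that combines several optional properties into one request body and
// prints the JSON. All URL and duration values are placeholders.
using System;
using Newtonsoft.Json;

class CombinedPropertiesExample
{
    static void Main()
    {
        var request = new
        {
            contentUrls = new[] { "<URL to an audio file to transcribe>" },
            properties = new
            {
                wordLevelTimestampsEnabled = true,
                profanityFilterMode = "Masked",
                punctuationMode = "DictatedAndAutomatic",
                channels = new[] { 0 },                 // transcribe channel 0 only
                timeToLive = "PT12H",                   // delete the transcription after 12 hours
                destinationContainerUrl = "<SAS URL to a writable container for the results>"
            },
            locale = "en-US",
            displayName = "Transcription with combined optional properties"
        };

        Console.WriteLine(JsonConvert.SerializeObject(request, Formatting.Indented));
    }
}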

Storage

Batch transcription can read audio from a publicly visible internet URI, and can read audio or write transcriptions by using a SAS URI with Blob Storage.

Batch transcription result

For each audio input, one transcription result file is created. The Get transcription files operation returns a list of result files for the transcription. The only way to confirm the audio input for a transcription is to check the source field in the transcription result file.
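
Here's a minimal sketch (not part of the official samples) that calls the files operation directly and prints the source of each result file so you can match results to audio inputs; the region, key, and transcription ID are placeholders:

// A minimal sketch that lists the result files of a finished transcription
// (GET speechtotext/v3.0/transcriptions/{id}/files) and prints the "source" field of each
// transcription result. The region, key, and transcription ID are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class TranscriptionFilesExample
{
    static async Task Main()
    {
        var region = "westus";
        var key = "<your-speech-resource-key>";
        var transcriptionId = "<transcription-id>";

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", key);

        var filesJson = await http.GetStringAsync(
            $"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions/{transcriptionId}/files");

        foreach (var file in JObject.Parse(filesJson)["values"].Children())
        {
            // Skip the transcription report; only per-audio result files have a "source" field.
            if ((string)file["kind"] != "Transcription")
            {
                continue;
            }

            var result = JObject.Parse(await http.GetStringAsync((string)file["links"]["contentUrl"]));
            Console.WriteLine($"{result["source"]} -> {file["name"]}");
        }
    }
}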

Each transcription result file has this format:

{
  "source": "...",                      // sas url of a given contentUrl or the path relative to the root of a given container
  "timestamp": "2020-06-16T09:30:21Z",  // creation time of the transcription, ISO 8601 encoded timestamp, combined date and time
  "durationInTicks": 41200000,          // total audio duration in ticks (1 tick is 100 nanoseconds)
  "duration": "PT4.12S",                // total audio duration, ISO 8601 encoded duration
  "combinedRecognizedPhrases": [        // concatenated results for simple access in single string for each channel
    {
      "channel": 0,                     // channel number of the concatenated results
      "lexical": "hello world",
      "itn": "hello world",
      "maskedITN": "hello world",
      "display": "Hello world."
    }
  ],
  "recognizedPhrases": [                // results for each phrase and each channel individually
    {
      "recognitionStatus": "Success",   // recognition state, e.g. "Success", "Failure"
      "speaker": 1,                     // if `diarizationEnabled` is `true`, this is the identified speaker (1 or 2), otherwise this property is not present
      "channel": 0,                     // channel number of the result
      "offset": "PT0.07S",              // offset in audio of this phrase, ISO 8601 encoded duration
      "duration": "PT1.59S",            // audio duration of this phrase, ISO 8601 encoded duration
      "offsetInTicks": 700000.0,        // offset in audio of this phrase in ticks (1 tick is 100 nanoseconds)
      "durationInTicks": 15900000.0,    // audio duration of this phrase in ticks (1 tick is 100 nanoseconds)

      // possible transcriptions of the current phrase with confidences
      "nBest": [
        {
          "confidence": 0.898652852,    // confidence value for the recognition of the whole phrase
          "lexical": "hello world",
          "itn": "hello world",
          "maskedITN": "hello world",
          "display": "Hello world.",

          // if wordLevelTimestampsEnabled is `true`, there will be a result for each word of the phrase, otherwise this property is not present
          "words": [
            {
              "word": "hello",
              "offset": "PT0.09S",
              "duration": "PT0.48S",
              "offsetInTicks": 900000.0,
              "durationInTicks": 4800000.0,
              "confidence": 0.987572
            },
            {
              "word": "world",
              "offset": "PT0.59S",
              "duration": "PT0.16S",
              "offsetInTicks": 5900000.0,
              "durationInTicks": 1600000.0,
              "confidence": 0.906032
            }
          ]
        }
      ]
    }
  ]
}

The result contains the following fields:

| Field | Content |
|---|---|
| lexical | The actual words recognized. |
| itn | The inverse-text-normalized (ITN) form of the recognized text. Abbreviations (for example, "doctor smith" to "dr smith"), phone numbers, and other transformations are applied. |
| maskedITN | The ITN form with profanity masking applied. |
| display | The display form of the recognized text. Added punctuation and capitalization are included. |
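
Here's a minimal sketch (not part of the official samples) that reads a downloaded result file in the format above and prints the combined display text for each channel; the file name is a placeholder:

// A minimal sketch that prints the combined display text for each channel of a result file.
// The file path is a placeholder.
using System;
using System.IO;
using Newtonsoft.Json.Linq;

class ReadResultExample
{
    static void Main()
    {
        var result = JObject.Parse(File.ReadAllText("transcription-result.json"));

        foreach (var combined in result["combinedRecognizedPhrases"].Children())
        {
            Console.WriteLine($"Channel {combined["channel"]}: {combined["display"]}");
        }
    }
}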

Speaker separation (diarization)

Diarization is the process of separating speakers in a piece of audio. The batch pipeline supports diarization and is capable of recognizing two speakers on mono channel recordings. The feature isn't available on stereo recordings.

The output of transcription with diarization enabled contains a Speaker entry for each transcribed phrase. If diarization isn't used, the Speaker property isn't present in the JSON output. For diarization, the speakers are identified as 1 or 2.

To request diarization, set the diarizationEnabled property to true. Here's an example:

{
  "contentUrls": [
    "<URL to an audio file to transcribe>",
  ],
  "properties": {
    "diarizationEnabled": true,
    "wordLevelTimestampsEnabled": true,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  },
  "locale": "en-US",
  "displayName": "Transcription of file using default model for en-US"
}

As the parameters in this request show, wordLevelTimestampsEnabled must also be set to true when you enable diarization.
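
Here's a minimal sketch (not part of the official samples, and assuming the result format shown earlier in this article) that reads a downloaded diarized result file and prints each phrase with its speaker label; the file name is a placeholder:

// A minimal sketch that prints each phrase of a diarized result with its speaker label.
// The file path is a placeholder.
using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

class DiarizationResultExample
{
    static void Main()
    {
        var result = JObject.Parse(File.ReadAllText("diarized-result.json"));

        var phrases = result["recognizedPhrases"].Children()
            .OrderBy(phrase => (double)phrase["offsetInTicks"]);

        foreach (var phrase in phrases)
        {
            // "speaker" is 1 or 2 when diarizationEnabled was true.
            Console.WriteLine($"Speaker {phrase["speaker"]}: {phrase["nBest"][0]["display"]}");
        }
    }
}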

Best practices

The batch transcription service can handle a large number of submitted transcriptions. You can query the status of your transcriptions with Get transcriptions. After you retrieve the results, regularly call Delete transcription to remove the transcriptions from the service. Alternatively, set the timeToLive property to ensure the eventual deletion of the results.
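
Here's a minimal sketch (not part of the official samples) of the cleanup call, using the Delete transcription operation directly; the region, key, and transcription ID are placeholders:

// A minimal sketch that deletes a transcription by ID
// (DELETE speechtotext/v3.0/transcriptions/{id}). Region, key, and ID are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class DeleteTranscriptionExample
{
    static async Task Main()
    {
        var region = "westus";
        var key = "<your-speech-resource-key>";
        var transcriptionId = "<transcription-id>";

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", key);

        var response = await http.DeleteAsync(
            $"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions/{transcriptionId}");

        Console.WriteLine($"Delete returned {(int)response.StatusCode}");
    }
}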

Tip

You can use the Ingestion Client tool and resulting solution to process a high volume of audio.

Sample code

Complete samples are available in the GitHub sample repository, inside the samples/batch subdirectory.

Update the sample code with your subscription information, service region, URI pointing to the audio file to transcribe, and model location if you're using a custom model.

var newTranscription = new Transcription
{
    DisplayName = DisplayName, 
    Locale = Locale, 
    ContentUrls = new[] { RecordingsBlobUri },
    //ContentContainerUrl = ContentAzureBlobContainer,
    Model = CustomModel,
    Properties = new TranscriptionProperties
    {
        IsWordLevelTimestampsEnabled = true,
        TimeToLive = TimeSpan.FromDays(1)
    }
};

newTranscription = await client.CreateTranscriptionAsync(newTranscription).ConfigureAwait(false);
Console.WriteLine($"Created transcription {newTranscription.Self}");

The sample code sets up the client and submits the transcription request. It then polls for the status information and prints details about the transcription progress.

// get the status of our transcriptions periodically and log results
int completed = 0, running = 0, notStarted = 0;
while (completed < 1)
{
    completed = 0; running = 0; notStarted = 0;

    // get all transcriptions for the user
    paginatedTranscriptions = null;
    do
    {
        // <transcriptionstatus>
        if (paginatedTranscriptions == null)
        {
            paginatedTranscriptions = await client.GetTranscriptionsAsync().ConfigureAwait(false);
        }
        else
        {
            paginatedTranscriptions = await client.GetTranscriptionsAsync(paginatedTranscriptions.NextLink).ConfigureAwait(false);
        }

        // check the state of each transcription in this page of results; transcriptions that are still running or not started are only counted
        foreach (var transcription in paginatedTranscriptions.Values)
        {
            switch (transcription.Status)
            {
                case "Failed":
                case "Succeeded":
                    // we check to see if it was one of the transcriptions we created from this client.
                    if (!createdTranscriptions.Contains(transcription.Self))
                    {
                        // not created from here, skip it
                        continue;
                    }

                    completed++;

                    // if the transcription was successful, check the results
                    if (transcription.Status == "Succeeded")
                    {
                        var paginatedfiles = await client.GetTranscriptionFilesAsync(transcription.Links.Files).ConfigureAwait(false);

                        var resultFile = paginatedfiles.Values.FirstOrDefault(f => f.Kind == ArtifactKind.Transcription);
                        var result = await client.GetTranscriptionResultAsync(new Uri(resultFile.Links.ContentUrl)).ConfigureAwait(false);
                        Console.WriteLine("Transcription succeeded. Results: ");
                        Console.WriteLine(JsonConvert.SerializeObject(result, SpeechJsonContractResolver.WriterSettings));
                    }
                    else
                    {
                        Console.WriteLine("Transcription failed. Status: {0}", transcription.Properties.Error.Message);
                    }

                    break;

                case "Running":
                    running++;
                    break;

                case "NotStarted":
                    notStarted++;
                    break;
            }
        }

        // for each transcription in the list we check the status
        Console.WriteLine(string.Format("Transcriptions status: {0} completed, {1} running, {2} not started yet", completed, running, notStarted));
    }
    while (paginatedTranscriptions.NextLink != null);

    // </transcriptionstatus>
    // check again after 1 minute
    await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);
}

For full details about the preceding calls, see the Speech-to-text REST API v3.0 reference documentation. For the full sample shown here, go to GitHub in the samples/batch subdirectory.

This sample uses an asynchronous setup to post audio and receive transcription status. The PostTranscriptions method sends the audio file details, and the GetTranscriptions method retrieves the states. PostTranscriptions returns a handle, and GetTranscriptions uses that handle to get the transcription status.

This sample code doesn't specify a custom model. The service uses the base model to transcribe the file or files. To specify a custom model, pass the model reference for the custom model to the same method.

Note

For baseline transcriptions, you don't need to declare the ID for the base model.

Next steps