Why use Batch transcription?

Batch transcription is ideal if you want to transcribe a large quantity of audio in storage, such as Azure Blobs. By using the dedicated REST API, you can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcriptions.


Subscription Key

As with all features of the Speech service, you create a subscription key from the Azure portal by following our Get started guide. If you plan to get transcriptions from our baseline models, creating a key is all you need to do.


A standard subscription (S0) for Speech service is required to use batch transcription. Free subscription keys (F0) will not work. For additional information, see pricing and limits.

Custom models

If you plan to customize acoustic or language models, follow the steps in Customize acoustic models and Customize language models. To use the created models in batch transcription, you need their model IDs. Note that this is not the endpoint ID found on the Endpoint Details view; it is the model ID that you can retrieve when you select the details of a model.
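If you want to look up the model IDs programmatically, one option is to query the models collection of the REST API. The following is a minimal sketch; the host name and path are assumptions based on the v2.0 API, so verify them against the Swagger document referenced later in this article.

// Hypothetical sketch: list your models to find their IDs.
// The host and path are assumptions for the v2.0 API; check the Swagger document.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ModelListing
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "YourSubscriptionKey");

            // Each entry in the response carries the model ID to use in batch requests.
            var response = await client.GetAsync(
                "https://YourServiceRegion.cris.ai/api/speechtotext/v2.0/models");
            response.EnsureSuccessStatusCode();
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}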

The Batch Transcription API

The Batch Transcription API offers asynchronous speech-to-text transcription, along with additional features. It is a REST API that exposes methods for:

  1. Creating batch processing requests
  2. Querying status
  3. Downloading transcriptions


The Batch Transcription API is ideal for call centers, which typically accumulate thousands of hours of audio. It makes it easy to transcribe large volumes of audio recordings.
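As a sketch of the first of these operations, the following snippet creates a batch request over raw REST. The host name and path are assumptions based on the v2.0 API; verify them against the Swagger document referenced later in this article.

// Minimal sketch: create a batch transcription request over raw REST.
// The host and path are assumptions for the v2.0 API; check the Swagger document.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class CreateTranscription
{
    static async Task Main()
    {
        const string body = @"{
            ""recordingsUrl"": ""<SAS URI of the audio blob to transcribe>"",
            ""locale"": ""en-US"",
            ""name"": ""Simple transcription"",
            ""description"": ""Simple transcription description""
        }";

        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "YourSubscriptionKey");

            var response = await client.PostAsync(
                "https://YourServiceRegion.cris.ai/api/speechtotext/v2.0/transcriptions",
                new StringContent(body, Encoding.UTF8, "application/json"));

            // On success, the Location header points at the new transcription,
            // which you then poll for status.
            Console.WriteLine(response.StatusCode);
            Console.WriteLine(response.Headers.Location);
        }
    }
}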

Supported formats

The Batch Transcription API supports the following formats:

| Format | Codec | Bitrate | Sample Rate |
|--------|-------|---------|-------------|
| WAV    | PCM   | 16-bit  | 8 or 16 kHz, mono, stereo |
| MP3    | PCM   | 16-bit  | 8 or 16 kHz, mono, stereo |
| OGG    | OPUS  | 16-bit  | 8 or 16 kHz, mono, stereo |

For stereo audio streams, the Batch Transcription API splits the left and right channels during transcription. Two JSON result files are created, one per channel. The per-utterance timestamps enable the developer to assemble an ordered final transcript, as the sketch after the following sample request shows. The sample request includes properties for profanity filtering, punctuation, and word-level timestamps.


Configuration parameters are provided as JSON:

  "recordingsUrl": "<URL to the Azure blob to transcribe>",
  "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
  "locale": "<locale to use, for example en-US>",
  "name": "<user defined name of the transcription batch>",
  "description": "<optional description of the transcription>",
  "properties": {
    "ProfanityFilterMode": "Masked",
    "PunctuationMode": "DictatedAndAutomatic",
    "AddWordLevelTimestamps" : "True",
    "AddSentiment" : "True"


The Batch Transcription API uses a REST service for requesting transcriptions, their status, and associated results. You can use the API from any language. The next section describes how the API is used.

Configuration properties

Use these optional properties to configure transcription:

| Parameter | Description |
|-----------|-------------|
| ProfanityFilterMode | Specifies how to handle profanity in recognition results. Accepted values are None (disables profanity filtering), Masked (replaces profanity with asterisks), Removed (removes all profanity from the result), and Tags (adds "profanity" tags). The default is Masked. |
| PunctuationMode | Specifies how to handle punctuation in recognition results. Accepted values are None (disables punctuation), Dictated (implies explicit, dictated punctuation), Automatic (lets the decoder handle punctuation), and DictatedAndAutomatic (uses dictated punctuation marks and automatic punctuation). |
| AddWordLevelTimestamps | Specifies whether word-level timestamps should be added to the output. Accepted values are True (enables word-level timestamps) and False (the default). |
| AddSentiment | Specifies whether sentiment analysis should be applied to each utterance. Accepted values are True (enables sentiment per utterance) and False (the default). |
| AddDiarization | Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel containing two voices. Accepted values are True (enables diarization) and False (the default). Requires AddWordLevelTimestamps to be set to True. |


Batch transcription supports Azure Blob storage for reading audio and writing transcriptions to storage.
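For example, a read-only SAS URI for an audio blob can be generated with the Azure Storage client library. This is a minimal sketch using the classic Microsoft.WindowsAzure.Storage package; the connection string, container, and blob names are placeholders.

// Sketch: create a read-only SAS URI for an audio blob, suitable for
// use as the recordingsUrl in a batch request.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class SasExample
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<your storage connection string>");
        var blob = account.CreateCloudBlobClient()
            .GetContainerReference("audio")
            .GetBlockBlobReference("recording.wav");

        // Grant read access for a limited time window.
        var sasToken = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(2)
        });

        Console.WriteLine(blob.Uri + sasToken);
    }
}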

Speaker Separation (Diarization)

Diarization is the process of separating speakers in a piece of audio. The batch pipeline supports diarization and can recognize two speakers on mono channel recordings.

To request diarization for your audio transcription, add the relevant parameters to the HTTP request, as shown below.

 "recordingsUrl": "<URL to the Azure blob to transcribe>",
 "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
 "locale": "<locale to us, for example en-US>",
 "name": "<user defined name of the transcription batch>",
 "description": "<optional description of the transcription>",
 "properties": {
   "AddWordLevelTimestamps" : "True",
   "AddDiarization" : "True"

Word-level timestamps must also be turned on, as the parameters in the above request indicate.

The transcription output identifies speakers by number (currently only two voices are supported, so the speakers are labeled 'Speaker 1' and 'Speaker 2'), followed by the transcription text.

Note that diarization is not available for stereo recordings. All JSON output contains the Speaker tag; if diarization is not used, it shows 'Speaker: Null' in the JSON output.


Diarization is available in all regions and for all locales!
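Once you have a diarized result, you can group the output by the Speaker tag. The following sketch assumes the tag is exposed as a Speaker property on each segment result of the deserialized RootObject; the exact placement of the tag may differ, so inspect your JSON output first.

// Sketch: print a per-speaker transcript from a diarized result.
// Assumes the Speaker tag is exposed on each segment result; verify
// against your actual JSON output.
foreach (var segment in resultObject0.AudioFileResults[0].SegmentResults)
{
    var speaker = segment.Speaker ?? "Unknown";
    Console.WriteLine("Speaker {0}: {1}", speaker, segment.NBest[0].Display);
}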


Sentiment

Sentiment is a new feature of the Batch Transcription API and is important in the call center domain. Customers can add the AddSentiment parameter to their requests to:

  1. Get insights on customer satisfaction
  2. Get insights on the performance of the agents (the team taking the calls)
  3. Pinpoint the exact point in time when a call took a turn in a negative direction
  4. Pinpoint what went well when turning negative calls to positive
  5. Identify what customers like and what they dislike about a product or a service

Sentiment is scored per audio segment, where an audio segment is defined as the time lapse between the start of the utterance (offset) and the detection of silence or the end of the byte stream. The entire text within that segment is used to calculate sentiment. We do not calculate any aggregate sentiment values for the entire call or for the entire speech of each channel; those aggregations are left to the domain owner to apply.

Sentiment is applied on the lexical form.

A JSON output sample looks like this:

{
  "AudioFileResults": [
    {
      "AudioFileName": "Channel.0.wav",
      "AudioFileUrl": null,
      "SegmentResults": [
        {
          "RecognitionStatus": "Success",
          "ChannelNumber": null,
          "Offset": 400000,
          "Duration": 13300000,
          "NBest": [
            {
              "Confidence": 0.976174,
              "Lexical": "what's the weather like",
              "ITN": "what's the weather like",
              "MaskedITN": "what's the weather like",
              "Display": "What's the weather like?",
              "Words": null,
              "Sentiment": {
                "Negative": 0.206194,
                "Neutral": 0.793785,
                "Positive": 0.0
              }
            }
          ]
        }
      ]
    }
  ]
}

The feature uses a Sentiment model, which is currently in Beta.
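If you need call-level numbers, one simple client-side aggregation is to average the per-segment scores of a channel. The sketch below assumes the RootObject classes from the GitHub sample and a using directive for System.Linq; weighting by segment Duration would be another reasonable choice.

// Sketch: average the per-segment sentiment scores of one channel.
// resultObject0 is the deserialized result, as in the sample code below.
var best = resultObject0.AudioFileResults[0].SegmentResults
    .Select(segment => segment.NBest[0])
    .Where(nbest => nbest.Sentiment != null)
    .ToList();

Console.WriteLine("Average negative: {0:F3}", best.Average(nbest => nbest.Sentiment.Negative));
Console.WriteLine("Average neutral:  {0:F3}", best.Average(nbest => nbest.Sentiment.Neutral));
Console.WriteLine("Average positive: {0:F3}", best.Average(nbest => nbest.Sentiment.Positive));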

Sample code

Complete samples are available in the GitHub sample repository inside the samples/batch subdirectory.

You have to customize the sample code with your subscription information, the service region, the SAS URI pointing to the audio file to transcribe, and model IDs in case you want to use a custom acoustic or language model.

// Replace with your subscription key
private const string SubscriptionKey = "YourSubscriptionKey";

// Update with your service region
private const string Region = "YourServiceRegion";
private const int Port = 443;

// recordings and locale
private const string Locale = "en-US";
private const string RecordingsBlobUri = "<SAS URI pointing to an audio file stored in Azure Blob Storage>";

// For usage of baseline models, no acoustic and language model needs to be specified.
private static Guid[] modelList = new Guid[0];

// For use of specific acoustic and language models:
// - comment the previous line
// - uncomment the next lines to create an array containing the guids of your required model(s)
// private static Guid AdaptedAcousticId = new Guid("<id of the custom acoustic model>");
// private static Guid AdaptedLanguageId = new Guid("<id of the custom language model>");
// private static Guid[] modelList = new[] { AdaptedAcousticId, AdaptedLanguageId };

// name and description
private const string Name = "Simple transcription";
private const string Description = "Simple transcription description";

The sample code will set up the client and submit the transcription request. It will then poll for status information and print details about the transcription progress.

// get all transcriptions for the user
transcriptions = await client.GetTranscriptionsAsync().ConfigureAwait(false);

completed = 0; running = 0; notStarted = 0;
// for each transcription in the list we check the status
foreach (var transcription in transcriptions)
{
    switch (transcription.Status)
    {
        case "Failed":
        case "Succeeded":
            // we check to see if it was one of the transcriptions we created from this client
            if (!createdTranscriptions.Contains(transcription.Id))
            {
                // not created from here, continue
                continue;
            }

            completed++;

            // if the transcription was successful, check the results
            if (transcription.Status == "Succeeded")
            {
                var resultsUri0 = transcription.ResultsUrls["channel_0"];

                WebClient webClient = new WebClient();

                var filename = Path.GetTempFileName();
                webClient.DownloadFile(resultsUri0, filename);
                var results0 = File.ReadAllText(filename);
                var resultObject0 = JsonConvert.DeserializeObject<RootObject>(results0);

                Console.WriteLine("Transcription succeeded. Results: ");
                Console.WriteLine(results0);
            }
            else
            {
                Console.WriteLine("Transcription failed. Status: {0}", transcription.StatusMessage);
            }

            break;

        case "Running":
            running++;
            break;

        case "NotStarted":
            notStarted++;
            break;
    }
}

For full details about the preceding calls, see our Swagger document. For the full sample shown here, go to GitHub in the samples/batch subdirectory.

Take note of the asynchronous setup for posting audio and receiving transcription status. The client that you create is a .NET HTTP client. There's a PostTranscriptions method for sending the audio file details and a GetTranscriptions method for receiving the results. PostTranscriptions returns a handle, and GetTranscriptions uses that handle to get the transcription status.
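For orientation, the submit step looks roughly like the sketch below. It is modeled on the helper client in the GitHub sample; treat the class and method names as an approximation and check the sample for the exact signatures.

// Modeled on the GitHub sample's helper client; the class and method names
// are an approximation, so check the sample for the exact signatures.
var client = BatchClient.CreateApiV2Client(SubscriptionKey, $"{Region}.cris.ai", Port);

var transcriptionLocation = await client.PostTranscriptionAsync(
    Name, Description, Locale, new Uri(RecordingsBlobUri), modelList).ConfigureAwait(false);

// The returned location is the handle later used to query the status.
Console.WriteLine("Created transcription at {0}", transcriptionLocation);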

The current sample code doesn't specify a custom model, so the service uses the baseline models for transcribing the file or files. To specify custom models, pass the model IDs for the acoustic and the language model to the same method.


For baseline transcriptions, you don't need to declare the ID for the baseline models. If you only specify a language model ID (and no acoustic model ID), a matching acoustic model is automatically selected. If you only specify an acoustic model ID, a matching language model is automatically selected.

Download the sample

You can find the sample in the samples/batch directory in the GitHub sample repository.


Batch transcription jobs are scheduled on a best-effort basis; there is no time estimate for when a job will change into the running state. Once in the running state, the actual transcription is processed faster than real time.
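In practice this means polling at a generous interval rather than in a tight loop, along the lines of the sketch below. It again uses the sample's helper client, so treat GetTranscriptionAsync as a hypothetical method name and check the sample for the exact call.

// Sketch: poll until the job leaves the queue and finishes. A 30-second
// interval is plenty, since scheduling is best effort.
while (true)
{
    var current = await client.GetTranscriptionAsync(transcriptionLocation).ConfigureAwait(false);
    if (current.Status == "Succeeded" || current.Status == "Failed")
    {
        break;
    }

    await Task.Delay(TimeSpan.FromSeconds(30)).ConfigureAwait(false);
}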

Next steps