Why use Batch transcription?

Batch transcription is ideal if you want to transcribe a large quantity of audio in storage, such as Azure Blobs. By using the dedicated REST API, you can point to audio files with a shared access signature (SAS) URI and receive transcriptions asynchronously.


Subscription Key

As with all features of the Speech service, you create a subscription key from the Azure portal by following our Get started guide. If you plan to get transcriptions from our baseline models, creating a key is all you need to do.


A standard subscription (S0) for Speech Services is required to use batch transcription. Free subscription keys (F0) will not work. For additional information, see pricing and limits.

Custom models

If you plan to customize acoustic or language models, follow the steps in Customize acoustic models and Customizing language models. To use the created models in batch transcription, you need their model IDs. This ID is not the endpoint ID that you find on the Endpoint Details view; it is the model ID that you can retrieve when you select the details of the model.

The Batch Transcription API

The Batch Transcription API offers asynchronous speech-to-text transcription, along with additional features. It is a REST API that exposes methods for:

  1. Creating batch processing requests
  2. Querying status
  3. Downloading transcriptions


The Batch Transcription API is ideal for call centers, which typically accumulate thousands of hours of audio. It makes it easy to transcribe large volumes of audio recordings.

Supported formats

The Batch Transcription API supports the following formats:

Format  Codec  Bitrate  Sample rate
WAV     PCM    16-bit   8 or 16 kHz, mono, stereo
MP3     PCM    16-bit   8 or 16 kHz, mono, stereo
OGG     OPUS   16-bit   8 or 16 kHz, mono, stereo

For stereo audio streams, the Batch Transcription API splits the left and right channels during transcription and creates two JSON result files, one per channel. The per-utterance timestamps enable the developer to create an ordered final transcript. The following sample request includes properties for profanity filtering, punctuation, word-level timestamps, and sentiment.


Configuration parameters are provided as JSON:

  {
    "recordingsUrl": "<URL to the Azure blob to transcribe>",
    "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
    "locale": "<locale to use, for example en-US>",
    "name": "<user defined name of the transcription batch>",
    "description": "<optional description of the transcription>",
    "properties": {
      "ProfanityFilterMode": "Masked",
      "PunctuationMode": "DictatedAndAutomatic",
      "AddWordLevelTimestamps" : "True",
      "AddSentiment" : "True"
    }
  }
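
For the stereo case described above, the two per-channel result files can be interleaved into one ordered transcript by sorting utterances on their start offset. The following is a minimal Python sketch, not part of the official sample. It assumes each result file follows the shape of the JSON output sample shown later in this article (AudioFileResults, SegmentResults, Offset, NBest, Display), that Offset is expressed in 100-nanosecond ticks, and that the first NBest entry is the preferred hypothesis. The channel labels and the helper names merge_channels and format_transcript are hypothetical.

```python
TICKS_PER_SECOND = 10_000_000  # assumption: Offset is in 100-nanosecond ticks


def merge_channels(channel_results):
    """Interleave utterances from per-channel result documents by start offset.

    channel_results maps a channel label (hypothetical, for example
    "channel_0") to the parsed JSON of that channel's result file.
    """
    utterances = []
    for channel, document in channel_results.items():
        for audio_file in document.get("AudioFileResults") or []:
            for segment in audio_file.get("SegmentResults") or []:
                n_best = segment.get("NBest") or []
                if segment.get("RecognitionStatus") != "Success" or not n_best:
                    continue  # skip segments without a usable hypothesis
                # Assumption: the first NBest entry is the top hypothesis.
                utterances.append((segment["Offset"], channel, n_best[0]["Display"]))
    # Python's sort is stable, so equal offsets keep their channel order.
    utterances.sort(key=lambda item: item[0])
    return utterances


def format_transcript(utterances):
    """Render (offset, channel, text) tuples as one line per utterance."""
    return "\n".join(
        f"[{offset / TICKS_PER_SECOND:7.2f}s] {channel}: {text}"
        for offset, channel, text in utterances
    )


# Made-up example data in the assumed result shape.
left = {"AudioFileResults": [{"SegmentResults": [
    {"RecognitionStatus": "Success", "Offset": 400000,
     "NBest": [{"Display": "What's the weather like?"}]},
    {"RecognitionStatus": "Success", "Offset": 52000000,
     "NBest": [{"Display": "Thanks."}]},
]}]}
right = {"AudioFileResults": [{"SegmentResults": [
    {"RecognitionStatus": "Success", "Offset": 21000000,
     "NBest": [{"Display": "It's sunny today."}]},
]}]}

merged = merge_channels({"channel_0": left, "channel_1": right})
print(format_transcript(merged))
```

Sorting on Offset alone is the simplest choice; a real pipeline might also use Duration to detect overlapping speech.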


The Batch Transcription API uses a REST service for requesting transcriptions, their status, and associated results. You can use the API from any language. The next section describes how the API is used.

Configuration properties

Use these optional properties to configure transcription:

Parameter               Description
ProfanityFilterMode     Specifies how to handle profanity in recognition results. Accepted values are None, which disables profanity filtering; masked, which replaces profanity with asterisks; removed, which removes all profanity from the result; or tags, which adds "profanity" tags. The default setting is masked.
PunctuationMode         Specifies how to handle punctuation in recognition results. Accepted values are None, which disables punctuation; dictated, which implies explicit punctuation; automatic, which lets the decoder deal with punctuation; or dictatedandautomatic, which implies dictated punctuation marks or automatic.
AddWordLevelTimestamps  Specifies whether word-level timestamps should be added to the output. Accepted values are true, which enables word-level timestamps, and false (the default value), which disables them.
AddSentiment            Specifies whether sentiment should be added to each utterance. Accepted values are true, which enables sentiment per utterance, and false (the default value), which disables it.
AddDiarization          Specifies whether diarization analysis should be carried out on the input, which is expected to be a mono channel containing two voices. Accepted values are true, which enables diarization, and false (the default value), which disables it. It also requires AddWordLevelTimestamps to be set to true.
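
The constraints in the table above can be checked on the client before a request is sent. The following Python sketch builds the "properties" object of a request. It is an illustration under stated assumptions rather than an official helper: the function name build_properties is hypothetical, mode names are compared case-insensitively because the table and the sample request use different casing, Boolean flags are written as the string "True" as in the sample request, and flags left at their default are simply omitted.

```python
import json

PROFANITY_MODES = {"none", "masked", "removed", "tags"}
PUNCTUATION_MODES = {"none", "dictated", "automatic", "dictatedandautomatic"}


def build_properties(profanity="Masked", punctuation=None,
                     word_timestamps=False, sentiment=False, diarization=False):
    """Build the optional "properties" object, validating documented constraints."""
    if profanity.lower() not in PROFANITY_MODES:
        raise ValueError(f"Unknown ProfanityFilterMode: {profanity}")
    if punctuation is not None and punctuation.lower() not in PUNCTUATION_MODES:
        raise ValueError(f"Unknown PunctuationMode: {punctuation}")
    if diarization and not word_timestamps:
        # The table states that AddDiarization requires word-level timestamps.
        raise ValueError("AddDiarization requires AddWordLevelTimestamps to be true")

    properties = {"ProfanityFilterMode": profanity}
    if punctuation is not None:
        properties["PunctuationMode"] = punctuation
    # Only emit flags that are switched on; false is the documented default.
    if word_timestamps:
        properties["AddWordLevelTimestamps"] = "True"
    if sentiment:
        properties["AddSentiment"] = "True"
    if diarization:
        properties["AddDiarization"] = "True"
    return properties


example = build_properties(punctuation="DictatedAndAutomatic",
                           word_timestamps=True, diarization=True)
print(json.dumps(example, indent=2))
```

Failing early on the diarization rule avoids submitting a long-running job that the service may reject or process without speaker separation.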


Batch transcription supports Azure Blob storage for reading audio and writing transcriptions to storage.


Polling for transcription status may not be the most performant approach or provide the best user experience. Instead of polling, you can register callbacks, which notify the client when long-running transcription tasks have completed.

For more details, see Webhooks.

Speaker Separation (Diarization)

Diarization is the process of separating speakers in a piece of audio. Our Batch pipeline supports diarization and can recognize two speakers on mono channel recordings.

To request diarization for your audio transcription request, add the relevant parameter to the HTTP request as shown below.

 {
  "recordingsUrl": "<URL to the Azure blob to transcribe>",
  "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
  "locale": "<locale to use, for example en-US>",
  "name": "<user defined name of the transcription batch>",
  "description": "<optional description of the transcription>",
  "properties": {
    "AddWordLevelTimestamps" : "True",
    "AddDiarization" : "True"
  }
 }

Word-level timestamps must also be turned on, as the parameters in the above request indicate.

In the corresponding output, each speaker is identified by a number (currently only two voices are supported, so the speakers are identified as 'Speaker 1' and 'Speaker 2'), followed by the transcription output.

Also note that diarization is not available for stereo recordings. Furthermore, all JSON output contains the Speaker tag. If diarization is not used, the JSON output shows 'Speaker: Null'.


Diarization is available in all regions and for all locales.


Sentiment is a new feature in the Batch Transcription API and is an important feature in the call center domain. Customers can add the AddSentiment parameter to their requests to:

  1. Get insights on customer satisfaction
  2. Get insights on the performance of the agents (the team taking the calls)
  3. Pinpoint the exact point in time when a call took a turn in a negative direction
  4. Pinpoint what went well when turning negative calls positive
  5. Identify what customers like and what they dislike about a product or a service

Sentiment is scored per audio segment, where an audio segment is defined as the time lapse between the start of the utterance (offset) and the detection of silence or the end of the byte stream. The entire text within that segment is used to calculate sentiment. We do not calculate any aggregate sentiment values for the entire call or for the entire speech of each channel. These aggregations are left to the domain owner to apply.

Sentiment is applied on the lexical form.

A JSON output sample looks like this:

  {
    "AudioFileResults": [
      {
        "AudioFileName": "Channel.0.wav",
        "AudioFileUrl": null,
        "SegmentResults": [
          {
            "RecognitionStatus": "Success",
            "ChannelNumber": null,
            "Offset": 400000,
            "Duration": 13300000,
            "NBest": [
              {
                "Confidence": 0.976174,
                "Lexical": "what's the weather like",
                "ITN": "what's the weather like",
                "MaskedITN": "what's the weather like",
                "Display": "What's the weather like?",
                "Words": null,
                "Sentiment": {
                  "Negative": 0.206194,
                  "Neutral": 0.793785,
                  "Positive": 0.0
                }
              }
            ]
          }
        ]
      }
    ]
  }

The feature uses a Sentiment model, which is currently in Beta.
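
Because the service reports sentiment only per segment, any call-level figure has to be computed by the consumer. The following Python sketch shows one possible aggregation: an unweighted mean of each score across all segments of one result document. It assumes the output shape of the JSON sample above and that the first NBest entry carries the relevant Sentiment object. The function name average_sentiment is hypothetical, and the unweighted mean is only one of several reasonable choices.

```python
from statistics import mean

LABELS = ("Negative", "Neutral", "Positive")


def average_sentiment(document):
    """Average per-segment sentiment scores across one result document.

    Returns a dict with one mean per label, or None for every label
    when no segment in the document carries sentiment.
    """
    scores = {label: [] for label in LABELS}
    for audio_file in document.get("AudioFileResults") or []:
        for segment in audio_file.get("SegmentResults") or []:
            # Assumption: only the first (top) hypothesis is considered.
            for hypothesis in (segment.get("NBest") or [])[:1]:
                sentiment = hypothesis.get("Sentiment")
                if sentiment:
                    for label in LABELS:
                        scores[label].append(sentiment[label])
    return {label: (mean(values) if values else None)
            for label, values in scores.items()}


# Made-up example data in the shape of the JSON output sample.
sample = {"AudioFileResults": [{"SegmentResults": [
    {"NBest": [{"Sentiment": {"Negative": 0.25, "Neutral": 0.75, "Positive": 0.0}}]},
    {"NBest": [{"Sentiment": {"Negative": 0.75, "Neutral": 0.25, "Positive": 0.0}}]},
    {"NBest": None},  # a segment without hypotheses is ignored
]}]}

print(average_sentiment(sample))
```

A duration-weighted mean, using each segment's Duration field, may be more representative when segment lengths vary widely.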

Sample code

Complete samples are available in the GitHub sample repository inside the samples/batch subdirectory.

You have to customize the sample code with your subscription information, the service region, the SAS URI pointing to the audio file to transcribe, and model IDs in case you want to use a custom acoustic or language model.

// Replace with your subscription key
private const string SubscriptionKey = "YourSubscriptionKey";

// Update with your service region
private const string Region = "YourServiceRegion";
private const int Port = 443;

// recordings and locale
private const string Locale = "en-US";
private const string RecordingsBlobUri = "<SAS URI pointing to an audio file stored in Azure Blob Storage>";

// For usage of baseline models, no acoustic and language model needs to be specified.
private static Guid[] modelList = new Guid[0];

// For use of specific acoustic and language models:
// - comment the previous line
// - uncomment the next lines to create an array containing the guids of your required model(s)
// private static Guid AdaptedAcousticId = new Guid("<id of the custom acoustic model>");
// private static Guid AdaptedLanguageId = new Guid("<id of the custom language model>");
// private static Guid[] modelList = new[] { AdaptedAcousticId, AdaptedLanguageId };

//name and description
private const string Name = "Simple transcription";
private const string Description = "Simple transcription description";

The sample code sets up the client and submits the transcription request. It then polls for status information and prints details about the transcription progress.

// get all transcriptions for the user
transcriptions = await client.GetTranscriptionsAsync().ConfigureAwait(false);

completed = 0; running = 0; notStarted = 0;
// for each transcription in the list we check the status
foreach (var transcription in transcriptions)
{
    switch (transcription.Status)
    {
        case "Failed":
        case "Succeeded":
            // we check to see if it was one of the transcriptions we created from this client.
            if (!createdTranscriptions.Contains(transcription.Id))
            {
                // not created from here, continue
                continue;
            }
            completed++;

            // if the transcription was successful, check the results
            if (transcription.Status == "Succeeded")
            {
                var resultsUri0 = transcription.ResultsUrls["channel_0"];

                WebClient webClient = new WebClient();

                var filename = Path.GetTempFileName();
                webClient.DownloadFile(resultsUri0, filename);
                var results0 = File.ReadAllText(filename);
                var resultObject0 = JsonConvert.DeserializeObject<RootObject>(results0);

                Console.WriteLine("Transcription succeeded. Results: ");
                Console.WriteLine(results0);
            }
            else
            {
                Console.WriteLine("Transcription failed. Status: {0}", transcription.StatusMessage);
            }
            break;

        case "Running":
            running++;
            break;

        case "NotStarted":
            notStarted++;
            break;
    }
}

For full details about the preceding calls, see our Swagger document. For the full sample shown here, go to GitHub in the samples/batch subdirectory.

Take note of the asynchronous setup for posting audio and receiving transcription status. The client that you create is a .NET HTTP client. There's a PostTranscriptions method for sending the audio file details and a GetTranscriptions method for receiving the results. PostTranscriptions returns a handle, and GetTranscriptions uses it to get the transcription status.

The current sample code doesn't specify a custom model. The service uses the baseline models for transcribing the file or files. To specify the models, pass the model IDs for the acoustic and language models to the same method.


For baseline transcriptions, you don't need to declare the ID for the baseline models. If you only specify a language model ID (and no acoustic model ID), a matching acoustic model is automatically selected. If you only specify an acoustic model ID, a matching language model is automatically selected.
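
The selection rules above mean the "models" array of a request can hold zero, one, or two entries. The following Python sketch builds that array from optional IDs. It assumes model IDs are GUIDs, as the Guid type in the C# sample suggests. The function name build_models and the example ID are hypothetical.

```python
import uuid


def build_models(acoustic_model_id=None, language_model_id=None):
    """Return the "models" array for a request, skipping IDs that are not set.

    An empty list requests the baseline models. A single entry lets the
    service pick the matching counterpart model, as described above.
    """
    models = []
    for model_id in (acoustic_model_id, language_model_id):
        if model_id is not None:
            # uuid.UUID raises ValueError if the string is not a valid GUID.
            models.append({"Id": str(uuid.UUID(model_id))})
    return models


# Placeholder GUID for illustration only.
EXAMPLE_ID = "00000000-0000-0000-0000-000000000001"

print(build_models())                              # baseline models
print(build_models(language_model_id=EXAMPLE_ID))  # custom language model only
```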

Download the sample

You can find the sample in the samples/batch directory in the GitHub sample repository.


Batch transcription jobs are scheduled on a best-effort basis; there is no time estimate for when a job will change into the running state. Once a job is in the running state, the transcription is processed faster than real time.

Next steps