What is text-to-speech?

Text-to-speech from the Speech service enables your applications, tools, or devices to convert text into human-like synthesized speech. Choose from standard and neural voices, or create a custom voice unique to your product or brand. More than 75 standard voices are available in more than 45 languages and locales, and 5 neural voices are available in a select number of languages and locales. For a full list of supported voices, languages, and locales, see supported languages.


Bing Speech was decommissioned on October 15, 2019. If your applications, tools, or products use the Bing Speech APIs or Custom Speech, we've created guides to help you migrate to the Speech service.

Core features

  • Speech synthesis - Use the Speech SDK or REST API to convert text to speech using standard, neural, or custom voices.

  • Asynchronous synthesis of long audio - Use the Long Audio API to asynchronously synthesize text-to-speech files longer than 10 minutes (for example, audiobooks or lectures). Unlike synthesis performed using the Speech SDK or the text-to-speech REST API, responses aren't returned in real time. Requests are sent asynchronously, responses are polled for, and the synthesized audio is downloaded once the service makes it available. Only neural voices are supported.

  • Standard voices - Created using Statistical Parametric Synthesis and/or Concatenation Synthesis techniques. These voices are highly intelligible and sound natural. You can easily enable your applications to speak in more than 45 languages, with a wide range of voice options. These voices provide high pronunciation accuracy, including support for abbreviations, acronym expansions, date/time interpretations, polyphones, and more. For a full list of standard voices, see supported languages.

  • Neural voices - Deep neural networks are used to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis are performed simultaneously, which results in more fluid and natural-sounding output. Neural voices can be used to make interactions with chatbots and voice assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems. For a full list of neural voices, see supported languages.

  • Speech Synthesis Markup Language (SSML) - An XML-based markup language used to customize text-to-speech output. With SSML, you can adjust pitch, add pauses, improve pronunciation, speed up or slow down the speaking rate, increase or decrease volume, and attribute multiple voices to a single document. See SSML.
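
To make the SSML capabilities listed above more concrete, here is a minimal sketch in Python that assembles an SSML document with two voices, a prosody adjustment, and a pause. The helper name build_ssml, the voice names "voice-a" and "voice-b", and the chosen attribute values are hypothetical placeholders, not names taken from this article; the element names (speak, voice, prosody, break) come from the W3C SSML specification. Treat it as an illustration rather than a reference implementation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(segments, lang="en-US"):
    """Build an SSML string from (voice_name, text, prosody, pause) tuples.

    voice_name: placeholder voice identifier (hypothetical in this sketch)
    prosody:    dict of prosody attributes such as pitch, rate, volume, or None
    pause:      pause to add after the text, e.g. "500ms", or None
    """
    speak = ET.Element(
        "speak", {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang}
    )
    for voice_name, text, prosody, pause in segments:
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        # Wrap the text in <prosody> only when adjustments were requested.
        target = ET.SubElement(voice, "prosody", prosody) if prosody else voice
        target.text = text
        if pause:
            ET.SubElement(voice, "break", {"time": pause})
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(
    [
        ("voice-a", "Welcome back.", {"pitch": "high", "rate": "slow"}, "500ms"),
        ("voice-b", "Here is today's summary.", {"volume": "loud"}, None),
    ]
)
print(ssml)
```

Building the document with an XML library rather than string concatenation means that characters such as & and < in the input text are escaped automatically.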

Get started

The text-to-speech service is available via the Speech SDK. Quickstarts covering several common scenarios are available in various languages and platforms.

If you prefer, the text-to-speech service is also accessible via REST.
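
For readers who prefer REST, the following is a minimal sketch of how such a request might be assembled in Python. The endpoint pattern, the header names, the output format value, the User-Agent string, and the voice name "voice-a" are all assumptions that this article does not state; check them against the REST API reference before relying on them. The sketch only builds the request and does not send it.

```python
import urllib.request


def build_tts_request(region, subscription_key, ssml):
    """Assemble (but do not send) a text-to-speech REST request.

    The URL pattern and header names below are assumptions for illustration.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,    # assumed auth header
        "Content-Type": "application/ssml+xml",           # body is an SSML document
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",  # assumed format
        "User-Agent": "tts-sketch",                       # hypothetical client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


# Placeholder SSML; "voice-a" is a hypothetical voice name.
example_ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US"><voice name="voice-a">Hello, world.</voice></speak>'
)
request = build_tts_request("westus", "YOUR_SUBSCRIPTION_KEY", example_ssml)
print(request.get_method(), request.full_url)
```

With a valid key and region, passing the request to urllib.request.urlopen and writing the response bytes to a file would be one way to save the synthesized audio.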

Sample code

Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in the most popular programming languages.


In addition to standard and neural voices, you can create and fine-tune custom voices unique to your product or brand. All it takes to get started is a handful of audio files and the associated transcriptions. For more information, see Get started with Custom Voice.

Pricing note

When using the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that are used to adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:

  • Text passed to the text-to-speech service in the SSML body of the request
  • All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
  • Letters, punctuation, spaces, tabs, markup, and all white-space characters
  • Every code point defined in Unicode

For detailed information, see Pricing.


Each Chinese, Japanese, and Korean language character is counted as two characters for billing.
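
As a rough illustration of the billing rules above, the following Python sketch estimates the billable character count of an SSML request body. It is not the service's billing formula: the function names are hypothetical, and the way Chinese, Japanese, and Korean characters are detected here (by Unicode character name) is an assumption made only for this example.

```python
import re
import unicodedata

# <speak> and <voice> tags are excluded from the count; other markup is counted.
_EXEMPT_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>")

# Assumption: characters whose Unicode names start with these prefixes are
# treated as Chinese, Japanese, or Korean for the two-character rule.
_CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(ch):
    """Return True if the character looks like a CJK character (heuristic)."""
    return unicodedata.name(ch, "").startswith(_CJK_NAME_PREFIXES)


def estimate_billable_characters(ssml):
    """Estimate billable characters: every code point counts once, CJK twice."""
    counted = _EXEMPT_TAGS.sub("", ssml)
    return sum(2 if is_cjk(ch) else 1 for ch in counted)


plain = '<speak><voice name="voice-a">Hi!</voice></speak>'
print(estimate_billable_characters(plain))  # 3: only "Hi!" is counted
```

Because markup other than the speak and voice tags is counted, wrapping the same text in a prosody element raises the estimate by the length of the opening and closing prosody tags.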

Reference docs

Next steps