What is text-to-speech?

Text-to-speech from Azure Speech Services enables your applications, tools, or devices to convert text into natural, human-like synthesized speech. Choose from standard and neural voices, or create your own custom voice unique to your product or brand. More than 75 standard voices are available in more than 45 languages and locales, and 5 neural voices are available in 4 languages and locales. For a full list, see supported languages.

Text-to-speech technology allows content creators to interact with their users in different ways. Text-to-speech can improve accessibility by giving users the option to interact with content audibly. Whether the user has a visual impairment, has a learning disability, or needs navigation information while driving, text-to-speech can improve an existing experience. Text-to-speech is also a valuable add-on for voice bots and virtual assistants.

By using Speech Synthesis Markup Language (SSML), an XML-based markup language, developers using the text-to-speech service can specify how input text is converted into synthesized speech. With SSML, you can adjust pitch, pronunciation, speaking rate, volume, and more. For more information, see SSML.

Standard voices

Standard voices are created using Statistical Parametric Synthesis and/or Concatenation Synthesis techniques. These voices are highly intelligible and sound natural. You can easily enable your applications to speak in more than 45 languages, with a wide range of voice options. These voices provide high pronunciation accuracy, including support for abbreviations, acronym expansions, date/time interpretations, polyphones, and more. Use standard voices to improve accessibility for your applications and services by allowing users to interact with your content audibly.

Neural voices

Neural voices use deep neural networks to overcome the limits of traditional text-to-speech systems in matching the patterns of stress and intonation in spoken language, and in synthesizing the units of speech into a computer voice. Standard text-to-speech breaks down prosody into separate linguistic analysis and acoustic prediction steps that are governed by independent models, which can result in muffled voice synthesis. Our neural capability does prosody prediction and voice synthesis simultaneously, which results in a more fluid and natural-sounding voice.

Neural voices can be used to make interactions with chatbots and virtual assistants more natural and engaging, to convert digital texts such as e-books into audiobooks, and to enhance in-car navigation systems. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems.

Neural voices support different styles, such as neutral and cheerful. For example, the Jessa (en-US) voice can speak cheerfully, which is optimized for warm, happy conversation. You can adjust the voice output, such as tone, pitch, and speed, using Speech Synthesis Markup Language. For a full list of available voices, see supported languages.

To learn more about the benefits of neural voices, see "Microsoft's new neural text-to-speech service helps machines speak like people."

Custom voices

Voice customization lets you create a recognizable, one-of-a-kind voice for your brand. To create your custom voice font, you make a studio recording and upload the associated scripts as the training data. The service then creates a unique voice model tuned to your recording. You can use this custom voice font to synthesize speech. For more information, see custom voices.

Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is an XML-based markup language that lets developers specify how input text is converted into synthesized speech using the text-to-speech service. Compared to plain text, SSML allows developers to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output. Normal punctuation behavior, such as pausing after a period or using the correct intonation when a sentence ends with a question mark, is handled automatically.

All text inputs sent to the text-to-speech service must be structured as SSML. For more information, see Speech Synthesis Markup Language.
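
To make the requirement above concrete, here is a minimal Python sketch that wraps plain text in an SSML document with `<speak>`, `<voice>`, and an optional `<prosody>` element. The helper names `build_ssml` and `is_well_formed`, the placeholder voice name, and the decision to apply a single `<prosody>` element to the whole text are assumptions made for illustration. The sketch only builds and checks the markup locally; it does not call the service, and the voice names and attribute values the service accepts should be taken from the SSML and supported languages references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice_name, lang="en-US", rate=None, pitch=None, volume=None):
    """Return an SSML string that speaks `text` with the given voice.

    `voice_name` is whatever voice identifier the service expects; the
    value used in the example below is a placeholder, not a real voice.
    """
    # Only emit the prosody attributes the caller actually set.
    prosody = {"rate": rate, "pitch": pitch, "volume": volume}
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in prosody.items()
        if value is not None
    )

    # Escape &, <, and > so that arbitrary input text stays valid XML.
    body = escape(text)
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"

    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice_name)}>{body}</voice>"
        "</speak>"
    )


def is_well_formed(ssml):
    """Check that the generated markup parses as XML."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # "example-voice" is a hypothetical name used only for this demo.
    print(build_ssml("Is it raining today?", "example-voice", rate="+10%"))
```

Escaping the input before embedding it matters because characters such as `&` and `<` would otherwise produce malformed XML.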

Pricing note

When using the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that are used to adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:

  • Text passed to the text-to-speech service in the SSML body of the request
  • All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
  • Letters, punctuation, spaces, tabs, markup, and all white-space characters
  • Every code point defined in Unicode

For detailed information, see Pricing.

Each Chinese, Japanese, and Korean language character is counted as two characters for billing.
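
The billing rules above can be approximated with a short Python sketch. It is an estimate under stated assumptions, not the service's actual metering logic: the function name `estimate_billable_characters` is hypothetical, `<speak>` and `<voice>` tags are removed with a simple regular expression, and "Chinese, Japanese, and Korean character" is approximated by four common Unicode blocks. Treat the result as a rough guide and rely on the Pricing page for real figures.

```python
import re

# Assumption: opening and closing <speak> and <voice> tags are the only
# markup excluded from billing, as stated in the list above.
_NON_BILLABLE_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>")

# Assumption: these blocks approximate "Chinese, Japanese, and Korean
# language characters". The service's exact definition is not given here.
_CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the code point falls in one of the assumed CJK blocks."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in _CJK_RANGES)


def estimate_billable_characters(ssml):
    """Estimate the billable character count for an SSML request body."""
    # Drop the non-billable wrapper tags; all other markup stays billable.
    billable = _NON_BILLABLE_TAGS.sub("", ssml)
    # Every remaining code point counts once, CJK characters count twice.
    return sum(2 if is_cjk(ch) else 1 for ch in billable)


if __name__ == "__main__":
    sample = '<speak version="1.0" xml:lang="en-US"><voice name="x">Hi!</voice></speak>'
    print(estimate_billable_characters(sample))
```

Note that a `<prosody>` element wrapped around the text adds its own markup to the count, so heavily annotated SSML costs more than the spoken text alone.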

Core features

This table lists the core features for text-to-speech:

Convert text to speech. | Yes | Yes
Upload datasets for voice adaptation. | No | Yes*
Create and manage voice font models. | No | Yes*
Create and manage voice font deployments. | No | Yes*
Create and manage voice font tests. | No | Yes*
Manage subscriptions. | No | Yes*

* These services are available using the cris.ai endpoint. See Swagger reference. The custom voice training and management APIs implement throttling that limits requests to 25 per 5 seconds, while the speech synthesis API itself implements throttling that allows up to 200 requests per second. When throttling occurs, you'll be notified via message headers.
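
One way a client might stay under the limits quoted in the footnote is to pace its own requests. The Python sketch below is a generic sliding-window limiter, shown here configured for 25 requests per 5 seconds. The class name `SlidingWindowLimiter` and its methods are hypothetical; nothing in this sketch is part of the Speech Services SDK or REST API, and it does not read the throttling headers the service returns.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._sent = deque()     # timestamps of recent requests, oldest first

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self._clock()
        # Forget requests that have left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # The window is full; wait until the oldest request expires.
        return self.window_seconds - (now - self._sent[0])

    def record(self):
        """Note that a request is being sent now."""
        self._sent.append(self._clock())

    def acquire(self, sleep=time.sleep):
        """Block until a request is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            sleep(delay)
        self.record()


if __name__ == "__main__":
    # Limit taken from the footnote above: 25 requests per 5 seconds.
    limiter = SlidingWindowLimiter(max_requests=25, window_seconds=5.0)
    for i in range(3):
        limiter.acquire()
        print(f"request {i} may be sent")
```

Calling `acquire()` before each management API request keeps the client within the window; a client should still handle throttling responses from the service, since other callers on the same subscription may share the limit.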

Get started with text-to-speech

We offer quickstarts designed to have you running code in less than 10 minutes. The following tables list the text-to-speech quickstarts, organized by language.

SDK quickstarts

Quickstart (SDK) | Platform | API reference
C#, .NET Core | Windows | Browse
C#, .NET Framework | Windows | Browse
C#, UWP | Windows | Browse
C#, Unity | Windows, Android | Browse
C++ | Windows | Browse
C++ | Linux | Browse

REST quickstarts

Quickstart (REST) | Platform | API reference
C#, .NET Core | Windows, macOS, Linux | Browse
Node.js | Windows, macOS, Linux | Browse
Python | Windows, macOS, Linux | Browse

Sample code

Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in the most popular programming languages.

Reference docs

Next steps