Improve synthesis with Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is an XML-based markup language that lets developers specify how input text is converted into synthesized speech using the text-to-speech service. Compared to plain text, SSML allows developers to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output. Normal punctuation, such as pausing after a period or using the correct intonation when a sentence ends with a question mark, is handled automatically.

The Speech service implementation of SSML is based on the World Wide Web Consortium's Speech Synthesis Markup Language Version 1.0.

Important

Chinese, Japanese, and Korean characters count as two characters for billing. For more information, see Pricing.

Standard, neural, and custom voices

Choose from standard and neural voices, or create your own custom voice unique to your product or brand. 75+ standard voices are available in more than 45 languages and locales, and 5 neural voices are available in four languages and locales. For a complete list of supported languages, locales, and voices (neural and standard), see Language support.

To learn more about standard, neural, and custom voices, see Text-to-speech overview.

Special characters

While using SSML, keep in mind that special characters, such as quotation marks, apostrophes, and brackets, must be escaped. For more information, see Extensible Markup Language (XML) 1.0: Appendix D.
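As a minimal sketch of this escaping step, the Python snippet below uses the standard library's xml.sax.saxutils.escape to prepare plain text before it is placed inside an SSML element. The helper name escape_for_ssml is illustrative only and is not part of any Speech SDK.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes and apostrophes are added here.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_for_ssml(text: str) -> str:
    """Escape XML special characters so the text is safe inside SSML (hypothetical helper)."""
    return escape(text, _QUOTE_ENTITIES)


if __name__ == "__main__":
    print(escape_for_ssml('Tom & Jerry say "hi" <loudly>'))
```

Escaping the text once, just before it is embedded, avoids double-escaping strings that already contain entities.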

Supported SSML elements

Each SSML document is created with SSML elements (or tags). These elements are used to adjust pitch, prosody, volume, and more. The following sections detail how each element is used, and when an element is required or optional.

Important

Don't forget to use double quotes around attribute values. Standards for well-formed, valid XML require attribute values to be enclosed in double quotation marks. For example, <prosody volume="90"> is a well-formed, valid element, but <prosody volume=90> is not. SSML may not recognize attribute values that are not in quotes.

Create an SSML document

speak is the root element, and is required for all SSML documents. The speak element contains important information, such as version, language, and the markup vocabulary definition.

Syntax

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string"></speak>

Attributes

Attribute | Description | Required / Optional
version | Indicates the version of the SSML specification used to interpret the document markup. The current version is 1.0. | Required
xml:lang | Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and uppercase country/region (for example, en-US). | Required
xmlns | Specifies the URI to the document that defines the markup vocabulary (the element types and attribute names) of the SSML document. The current URI is http://www.w3.org/2001/10/synthesis. | Required

Choose a voice for text-to-speech

The voice element is required. It is used to specify the voice that is used for text-to-speech.

Syntax

<voice name="string">
    This text will get converted into synthesized speech.
</voice>

Attributes

Attribute | Description | Required / Optional
name | Identifies the voice used for text-to-speech output. For a complete list of supported voices, see Language support. | Required

Example

Note

This example uses the en-US-AriaRUS voice. For a complete list of supported voices, see Language support.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        This is the text that is spoken.
    </voice>
</speak>
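The requirements described so far (a speak root with version, xml:lang, and xmlns, plus at least one named voice) can be checked before a document is sent. The following Python sketch does this with the standard library's ElementTree. The function name check_ssml and the wording of its messages are assumptions made for illustration, and passing these checks does not guarantee that the service accepts the document.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_ssml(document):
    """Return a list of problems found; an empty list means these basic checks passed."""
    problems = []
    root = ET.fromstring(document)  # raises ParseError if the XML is not well-formed
    if root.tag != "{%s}speak" % SSML_NS:
        problems.append("root element must be speak in the SSML namespace")
    if root.get("version") != "1.0":
        problems.append('version must be "1.0"')
    if not root.get(XML_LANG):
        problems.append("xml:lang is required")
    voices = root.findall("{%s}voice" % SSML_NS)
    if not voices:
        problems.append("at least one voice element is required")
    for voice in voices:
        if not voice.get("name"):
            problems.append("voice element is missing the name attribute")
    return problems
```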

Use multiple voices

Within the speak element, you can specify multiple voices for text-to-speech output. These voices can be in different languages. For each voice, the text must be wrapped in a voice element.

Attributes

Attribute | Description | Required / Optional
name | Identifies the voice used for text-to-speech output. For a complete list of supported voices, see Language support. | Required

Important

Multiple voices are incompatible with the word boundary feature. The word boundary feature needs to be disabled in order to use multiple voices.

Disable word boundary

Depending on the Speech SDK language, you'll set the "SpeechServiceResponse_Synthesis_WordBoundaryEnabled" property to false on an instance of the SpeechConfig object.

For more information, see SetProperty.

speechConfig.SetProperty(
    "SpeechServiceResponse_Synthesis_WordBoundaryEnabled", "false");

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        Good morning!
    </voice>
    <voice name="en-US-Guy24kRUS">
        Good morning to you too Aria!
    </voice>
</speak>
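When the dialogue comes from data rather than a hand-written file, a document like the one above can be assembled programmatically. The Python sketch below is one possible way to do that using only the standard library. The function name build_dialogue and its (voice name, text) input format are assumptions for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_dialogue(turns, lang="en-US"):
    """Build an SSML document from (voice_name, text) pairs, one voice element per pair."""
    parts = ['<speak version="1.0" xmlns="%s" xml:lang=%s>' % (SSML_NS, quoteattr(lang))]
    for voice_name, text in turns:
        # quoteattr() adds the surrounding quotes and escapes the attribute value.
        parts.append("    <voice name=%s>" % quoteattr(voice_name))
        parts.append("        " + escape(text))
        parts.append("    </voice>")
    parts.append("</speak>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_dialogue([
        ("en-US-AriaRUS", "Good morning!"),
        ("en-US-Guy24kRUS", "Good morning to you too Aria!"),
    ]))
```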

Adjust speaking styles

Important

The adjustment of speaking styles will only work with neural voices.

By default, the text-to-speech service synthesizes text using a neutral speaking style for both standard and neural voices. With neural voices, you can use the mstts:express-as element to adjust the speaking style to express different emotions, like cheerfulness, empathy, and calm, or to optimize the voice for different scenarios, like customer service, newscasting, and voice assistant. This is an optional element unique to the Speech service.

Currently, speaking style adjustments are supported for these neural voices:

  • en-US-AriaNeural
  • zh-CN-XiaoxiaoNeural
  • zh-CN-YunyangNeural

Changes are applied at the sentence level, and styles vary by voice. If a style isn't supported, the service will return speech in the default neutral speaking style.

Syntax

<mstts:express-as style="string"></mstts:express-as>

Attributes

Attribute | Description | Required / Optional
style | Specifies the speaking style. Currently, speaking styles are voice-specific. | Required if adjusting the speaking style for a neural voice. If using mstts:express-as, then style must be provided. If an invalid value is provided, this element will be ignored.

Use this table to determine which speaking styles are supported for each neural voice.

Voice | Style | Description
en-US-AriaNeural | style="newscast-formal" | A formal, confident, and authoritative tone for news delivery
en-US-AriaNeural | style="newscast-casual" | A versatile and casual tone for general news delivery
en-US-AriaNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support
en-US-AriaNeural | style="chat" | Expresses a casual and relaxed tone
en-US-AriaNeural | style="cheerful" | Expresses a positive and happy tone
en-US-AriaNeural | style="empathetic" | Expresses a sense of caring and understanding
zh-CN-XiaoxiaoNeural | style="newscast" | Expresses a formal and professional tone for narrating news
zh-CN-XiaoxiaoNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support
zh-CN-XiaoxiaoNeural | style="assistant" | Expresses a warm and relaxed tone for digital assistants
zh-CN-XiaoxiaoNeural | style="lyrical" | Expresses emotions in a melodic and sentimental way
zh-CN-YunyangNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support

Example

This SSML snippet illustrates how the <mstts:express-as> element is used to change the speaking style to cheerful.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            That'd be just amazing!
        </mstts:express-as>
    </voice>
</speak>
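Because an unsupported style falls back to the neutral style, it can be useful to check the style against the table above before building the markup. The Python sketch below does this. The dictionary is copied from that table and may be incomplete, and the helper name express_as is hypothetical. The returned fragment still has to be placed inside a speak element that declares the mstts namespace, as in the example above.

```python
from xml.sax.saxutils import escape

# Styles copied from the table in this section; the service may support others.
SUPPORTED_STYLES = {
    "en-US-AriaNeural": {
        "newscast-formal", "newscast-casual", "customerservice",
        "chat", "cheerful", "empathetic",
    },
    "zh-CN-XiaoxiaoNeural": {"newscast", "customerservice", "assistant", "lyrical"},
    "zh-CN-YunyangNeural": {"customerservice"},
}


def express_as(voice, style, text):
    """Return a voice element, adding mstts:express-as only when the style is listed."""
    body = escape(text)
    if style in SUPPORTED_STYLES.get(voice, set()):
        body = '<mstts:express-as style="%s">%s</mstts:express-as>' % (style, body)
    return '<voice name="%s">%s</voice>' % (voice, body)
```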

Add or remove a break/pause

Use the break element to insert pauses (or breaks) between words, or prevent pauses automatically added by the text-to-speech service.

Note

Use this element to override the default behavior of text-to-speech (TTS) for a word or phrase if the synthesized speech for that word or phrase sounds unnatural. Set strength to none to prevent a prosodic break, which is automatically inserted by the text-to-speech service.

Syntax

<break strength="string" />
<break time="string" />

Attributes

Attribute | Description | Required / Optional
strength | Specifies the relative duration of a pause using one of the following values:
  • none
  • x-weak
  • weak
  • medium (default)
  • strong
  • x-strong
Optional
time | Specifies the absolute duration of a pause in seconds or milliseconds. Examples of valid values are 2s and 500ms. | Optional

Strength | Description
None, or if no value provided | 0 ms
x-weak | 250 ms
weak | 500 ms
medium | 750 ms
strong | 1000 ms
x-strong | 1250 ms

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        Welcome to Microsoft Cognitive Services <break time="100ms" /> Text-to-Speech API.
    </voice>
</speak>
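The two forms of the break element (a named strength or an absolute time) can be generated with a small helper. This Python sketch is illustrative only: the function name break_tag and the time_ms parameter are assumptions, and the durations in STRENGTH_MS are copied from the strength table above.

```python
# Durations copied from the strength table in this section, in milliseconds.
STRENGTH_MS = {
    "none": 0, "x-weak": 250, "weak": 500,
    "medium": 750, "strong": 1000, "x-strong": 1250,
}


def break_tag(strength=None, time_ms=None):
    """Return a break element using either a named strength or an absolute time."""
    if (strength is None) == (time_ms is None):
        raise ValueError("pass exactly one of strength or time_ms")
    if strength is not None:
        if strength not in STRENGTH_MS:
            raise ValueError("unknown strength: %s" % strength)
        return '<break strength="%s" />' % strength
    if time_ms < 0:
        raise ValueError("time_ms must not be negative")
    return '<break time="%dms" />' % time_ms
```

For example, break_tag(time_ms=100) produces the same element used in the example above.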

Specify paragraphs and sentences

p and s elements are used to denote paragraphs and sentences, respectively. In the absence of these elements, the text-to-speech service automatically determines the structure of the SSML document.

The p element may contain text and the following elements: audio, break, phoneme, prosody, say-as, sub, mstts:express-as, and s.

The s element may contain text and the following elements: audio, break, phoneme, prosody, say-as, mstts:express-as, and sub.

Syntax

<p></p>
<s></s>

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
            <s>Introducing the sentence element.</s>
            <s>Used to mark individual sentences.</s>
        </p>
        <p>
            Another simple paragraph.
            Sentence structure in this paragraph is not explicitly marked.
        </p>
    </voice>
</speak>

Use phonemes to improve pronunciation

The phoneme element is used for phonetic pronunciation in SSML documents. The phoneme element can only contain text, no other elements. Always provide human-readable speech as a fallback.

Phonetic alphabets are composed of phones, which are made up of letters, numbers, or characters, sometimes in combination. Each phone describes a unique sound of speech. This is in contrast to the Latin alphabet, where any letter may represent multiple spoken sounds. Consider the different pronunciations of the letter "c" in the words "candy" and "cease", or the different pronunciations of the letter combination "th" in the words "thing" and "those".

Syntax

<phoneme alphabet="string" ph="string"></phoneme>

Attributes

Attribute | Description | Required / Optional
alphabet | Specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The string specifying the alphabet must be specified in lowercase letters. The examples in this section use the ipa, sapi, and ups alphabets.
The alphabet applies only to the phoneme in the element.
Optional
ph | A string containing phones that specify the pronunciation of the word in the phoneme element. If the specified string contains unrecognized phones, the text-to-speech (TTS) service rejects the entire SSML document and produces none of the speech output specified in the document. | Required if using phonemes.

Examples

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
    </voice>
</speak>

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <phoneme alphabet="sapi" ph="iy eh n y uw eh s"> en-US </phoneme>
    </voice>
</speak>

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
    </voice>
</speak>
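The IPA example above writes non-ASCII phones as XML character references such as &#x259;. The Python sketch below shows one way to produce that form from an IPA string. The helper names to_char_refs and phoneme_tag are hypothetical, and whether the service also accepts raw UTF-8 characters in the ph attribute is not covered here.

```python
from xml.sax.saxutils import escape


def to_char_refs(ipa):
    """Replace each non-ASCII character with an XML hexadecimal character reference."""
    out = []
    for ch in ipa:
        if ord(ch) < 128:
            # ASCII characters stay readable but are still escaped for attribute use.
            out.append(escape(ch, {'"': "&quot;"}))
        else:
            out.append("&#x%X;" % ord(ch))
    return "".join(out)


def phoneme_tag(ipa, fallback):
    """Build a phoneme element with an IPA pronunciation and human-readable fallback text."""
    return '<phoneme alphabet="ipa" ph="%s">%s</phoneme>' % (to_char_refs(ipa), escape(fallback))
```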

Use custom lexicon to improve pronunciation

Sometimes the text-to-speech service cannot accurately pronounce a word, for example, the name of a company or a medical term. Developers can define how single entities are read in SSML using the phoneme and sub tags. However, if you need to define how multiple entities are read, you can create a custom lexicon using the lexicon tag.

Note

Custom lexicon currently supports UTF-8 encoding.

Syntax

<lexicon uri="string"/>

Attributes

Attribute | Description | Required / Optional
uri | The address of the external PLS document. | Required

Usage

To define how multiple entities are read, you can create a custom lexicon, which is stored as an .xml or .pls file. The following is a sample .xml file.

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>BTW</grapheme> 
    <alias>By the way</alias> 
  </lexeme>
  <lexeme>
    <grapheme> Benigni </grapheme> 
    <phoneme> bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

The lexicon element contains at least one lexeme element. Each lexeme element contains at least one grapheme element and one or more grapheme, alias, and phoneme elements. The grapheme element contains text describing the orthography. The alias elements are used to indicate the pronunciation of an acronym or an abbreviated term. The phoneme element provides text describing how the lexeme is pronounced.

It's important to note that you cannot directly set the pronunciation of a word using the custom lexicon. If you need to set the pronunciation for an acronym or an abbreviated term, first provide an alias, then associate the phoneme with that alias. For example:

  <lexeme>
    <grapheme>Scotland MV</grapheme> 
    <alias>ScotlandMV</alias> 
  </lexeme>
  <lexeme>
    <grapheme>ScotlandMV</grapheme> 
    <phoneme>ˈskɒtlənd.ˈmiːdiəm.weɪv</phoneme>
  </lexeme>

Important

The phoneme element cannot contain white spaces when using IPA.

For more information about the custom lexicon file, see Pronunciation Lexicon Specification (PLS) Version 1.0.
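A lexicon file with the structure shown above can also be generated from data. The following Python sketch writes lexeme entries with either an alias or an IPA phoneme and rejects phonemes that contain white space, in line with the note above. The function name build_lexicon and its input format are assumptions for illustration, and the schema attributes from the sample file are omitted for brevity.

```python
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(entries, lang="en-US"):
    """Build a PLS lexicon from (grapheme, kind, value) tuples; kind is 'alias' or 'phoneme'."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<lexicon version="1.0" xmlns="%s" alphabet="ipa" xml:lang="%s">' % (PLS_NS, lang),
    ]
    for grapheme, kind, value in entries:
        if kind not in ("alias", "phoneme"):
            raise ValueError("kind must be 'alias' or 'phoneme'")
        if kind == "phoneme" and any(ch.isspace() for ch in value):
            raise ValueError("an IPA phoneme must not contain white space")
        lines.append("  <lexeme>")
        lines.append("    <grapheme>%s</grapheme>" % escape(grapheme))
        lines.append("    <%s>%s</%s>" % (kind, escape(value), kind))
        lines.append("  </lexeme>")
    lines.append("</lexicon>")
    return "\n".join(lines)
```

Save the returned string with UTF-8 encoding, since that is the encoding the custom lexicon supports.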

Next, publish your custom lexicon file. While we don't have restrictions on where this file can be stored, we do recommend using Azure Blob Storage.

After you've published your custom lexicon, you can reference it from your SSML.

Note

The lexicon element must be inside the voice element.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
          xmlns:mstts="http://www.w3.org/2001/mstts" 
          xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <lexicon uri="http://www.example.com/customlexicon.xml"/>
        BTW, we will be there probably at 8:00 tomorrow morning.
        Could you help leave a message to Robert Benigni for me?
    </voice>
</speak>

When using this custom lexicon, "BTW" will be read as "By the way". "Benigni" will be read with the provided IPA "bɛˈniːnji".

Limitations

  • File size: the custom lexicon file size is limited to 100 KB. If the file exceeds this size, the synthesis request fails.
  • Lexicon cache refresh: the custom lexicon is cached on the TTS service, with the URI as the key, when it's first loaded. A lexicon with the same URI won't be reloaded within 15 minutes, so a custom lexicon change can take up to 15 minutes to take effect.

Speech service phonetic sets

In the sample above, we're using the International Phonetic Alphabet, also known as the IPA phone set. We suggest developers use the IPA, because it is the international standard. Some IPA characters have both a 'precomposed' and a 'decomposed' version when represented in Unicode. Custom lexicons only support the decomposed Unicode form.
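One way to obtain a decomposed string is Unicode canonical decomposition (NFD), which Python exposes through unicodedata.normalize. The sketch below assumes that "decomposed" in the paragraph above corresponds to NFD. That mapping is an assumption, not something this document states, and the helper name decompose is illustrative.

```python
import unicodedata


def decompose(ipa):
    """Return the canonically decomposed (NFD) form of an IPA string."""
    return unicodedata.normalize("NFD", ipa)


if __name__ == "__main__":
    precomposed = "\u00e3"  # a with tilde as a single code point
    decomposed = decompose(precomposed)
    # Show the code points before and after decomposition.
    print([hex(ord(ch)) for ch in precomposed], [hex(ord(ch)) for ch in decomposed])
```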

Considering that the IPA is not easy to remember, the Speech service defines a phonetic set for seven languages (en-US, fr-FR, de-DE, es-ES, ja-JP, zh-CN, and zh-TW).

You can use sapi as the value for the alphabet attribute with custom lexicons, as demonstrated below:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="sapi" xml:lang="en-US">
  <lexeme>
    <grapheme>BTW</grapheme>
    <alias> By the way </alias>
  </lexeme>
  <lexeme>
    <grapheme> Benigni </grapheme>
    <phoneme> b eh 1 - n iy - n y iy </phoneme>
  </lexeme>
</lexicon>

For more information on the detailed Speech service phonetic alphabet, see the Speech service phonetic sets.

Adjust prosody

The prosody element is used to specify changes to pitch, contour, range, rate, duration, and volume for the text-to-speech output. The prosody element may contain text and the following elements: audio, break, p, phoneme, prosody, say-as, sub, and s.

Because prosodic attribute values can vary over a wide range, the speech synthesis engine interprets the assigned values as a suggestion of what the actual prosodic values of the selected voice should be. The text-to-speech service limits or substitutes values that are not supported. Examples of unsupported values are a pitch of 1 MHz or a volume of 120.

Syntax

<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"></prosody>

Attributes

Attribute | Description | Required / Optional
pitch | Indicates the baseline pitch for the text. You may express the pitch as:
  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600Hz.
  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example: +80Hz or -2st. The "st" indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
  • A constant value:
    • x-low
    • low
    • medium
    • high
    • x-high
    • default
Optional
contour | Contour now supports both neural and standard voices. Contour represents changes in pitch. These changes are represented as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. For example:

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch).
Optional
range | A value that represents the range of pitch for the text. You may express range using the same absolute values, relative values, or enumeration values used to describe pitch. | Optional
rate | Indicates the speaking rate of the text. You may express rate as:
  • A relative value, expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate. A value of 0.5 results in a halving of the rate. A value of 3 results in a tripling of the rate.
  • A constant value:
    • x-slow
    • slow
    • medium
    • fast
    • x-fast
    • default
Optional
duration | The period of time that should elapse while the speech synthesis (TTS) service reads the text, in seconds or milliseconds. For example, 2s or 1800ms. | Optional
volume | Indicates the volume level of the speaking voice. You may express the volume as:
  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.
  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example, +10 or -5.5.
  • A constant value:
    • silent
    • x-soft
    • soft
    • medium
    • loud
    • x-loud
    • default
Optional

Change speaking rate

Speaking rate can be applied to neural voices and standard voices at the word or sentence level.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-GuyNeural">
        <prosody rate="+30.00%">
            Welcome to Microsoft Cognitive Services Text-to-Speech API.
        </prosody>
    </voice>
</speak>

Change volume

Volume changes can be applied to standard voices at the word or sentence level, whereas volume changes can only be applied to neural voices at the sentence level.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <prosody volume="+20.00%">
            Welcome to Microsoft Cognitive Services Text-to-Speech API.
        </prosody>
    </voice>
</speak>

Change pitch

Pitch changes can be applied to standard voices at the word or sentence level, whereas pitch changes can only be applied to neural voices at the sentence level.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-Guy24kRUS">
        Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>
    </voice>
</speak>

Change pitch contour

Important

Pitch contour changes are now supported with neural voices.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <prosody contour="(60%,-60%) (100%,+80%)" >
            Were you the only person in the room? 
        </prosody>
    </voice>
</speak>
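The contour attribute is a space-separated list of (position, change) targets, which is easy to mistype by hand. The Python sketch below formats such a list from pairs of values. The helper names contour and prosody_contour are hypothetical, and the helper does not validate the pitch change strings themselves.

```python
from xml.sax.saxutils import escape


def contour(points):
    """Format (position_percent, pitch_change) pairs as a contour attribute value."""
    targets = []
    for position, change in points:
        if not 0 <= position <= 100:
            raise ValueError("position must be between 0 and 100 percent")
        targets.append("(%s%%,%s)" % (position, change))
    return " ".join(targets)


def prosody_contour(points, text):
    """Wrap text in a prosody element that carries the formatted contour."""
    return '<prosody contour="%s">%s</prosody>' % (contour(points), escape(text))
```

For example, contour([(0, "+20Hz"), (10, "-2st"), (40, "+10Hz")]) yields the value used in the contour attribute description earlier in this section.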

say-as element

say-as is an optional element that indicates the content type (such as number or date) of the element's text. This provides guidance to the speech synthesis engine about how to pronounce the text.

Syntax

<say-as interpret-as="string" format="digit string" detail="string"> </say-as>

Attributes

Attribute | Description | Required / Optional
interpret-as | Indicates the content type of the element's text. For a list of types, see the table below. | Required
format | Provides additional information about the precise formatting of the element's text for content types that may have ambiguous formats. SSML defines formats for content types that use them (see table below). | Optional
detail | Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail. | Optional

The following are the supported content types for the interpret-as and format attributes. Include the format attribute only if interpret-as is set to date or time.

interpret-as | format | Interpretation
address | | The text is spoken as an address. The speech synthesis engine pronounces:

I'm at <say-as interpret-as="address">150th CT NE, Redmond, WA</say-as>

As "I'm at 150th court north east redmond washington."
cardinal, number | | The text is spoken as a cardinal number. The speech synthesis engine pronounces:

There are <say-as interpret-as="cardinal">3</say-as> alternatives

As "There are three alternatives."
characters, spell-out | | The text is spoken as individual letters (spelled out). The speech synthesis engine pronounces:

<say-as interpret-as="characters">test</say-as>

As "T E S T."
date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). The speech synthesis engine pronounces:

Today is <say-as interpret-as="date" format="mdy">10-19-2016</say-as>

As "Today is October nineteenth two thousand sixteen."
digits, number_digit | | The text is spoken as a sequence of individual digits. The speech synthesis engine pronounces:

<say-as interpret-as="number_digit">123456789</say-as>

As "1 2 3 4 5 6 7 8 9."
fraction | | The text is spoken as a fractional number. The speech synthesis engine pronounces:

<say-as interpret-as="fraction">3/8</say-as> of an inch

As "three eighths of an inch."
ordinal | | The text is spoken as an ordinal number. The speech synthesis engine pronounces:

Select the <say-as interpret-as="ordinal">3rd</say-as> option

As "Select the third option."
telephone | | The text is spoken as a telephone number. The format attribute may contain digits that represent a country code. For example, "1" for the United States or "39" for Italy. The speech synthesis engine may use this information to guide its pronunciation of a phone number. The phone number may also include the country code, and if so, it takes precedence over the country code in the format. The speech synthesis engine pronounces:

The number is <say-as interpret-as="telephone" format="1">(888) 555-1212</say-as>

As "The number is area code eight eight eight five five five one two one two."
time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is specified using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds. The following are valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. The speech synthesis engine pronounces:

The train departs at <say-as interpret-as="time" format="hms12">4:00am</say-as>

As "The train departs at four A M."

Usage

The say-as element may contain only text.

Example

The speech synthesis engine speaks the following example as "Your first request was for one room on October nineteenth twenty ten with early arrival at twelve thirty five PM."

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
        Your <say-as interpret-as="ordinal"> 1st </say-as> request was for <say-as interpret-as="cardinal"> 1 </say-as> room
        on <say-as interpret-as="date" format="mdy"> 10/19/2010 </say-as>, with early arrival at <say-as interpret-as="time" format="hms12"> 12:35pm </say-as>.
        </p>
    </voice>
</speak>

Add recorded audio

audio is an optional element that allows you to insert MP3 audio into an SSML document. The body of the audio element may contain plain text or SSML markup that's spoken if the audio file is unavailable or unplayable. The body can contain text and the following elements: audio, break, p, s, phoneme, prosody, say-as, and sub.

Any audio included in the SSML document must meet these requirements:

  • The MP3 must be hosted on an Internet-accessible HTTPS endpoint. HTTPS is required, and the domain hosting the MP3 file must present a valid, trusted TLS/SSL certificate.
  • The MP3 must be a valid MP3 file (MPEG v2).
  • The bit rate must be 48 kbps.
  • The sample rate must be 16,000 Hz.
  • The combined total time for all text and audio files in a single response cannot exceed ninety (90) seconds.
  • The MP3 must not contain any customer-specific or other sensitive information.
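
Some of the requirements above can be checked before a request is sent. The following Python sketch is illustrative only: `ClipInfo`, `clip_problems`, and `within_time_limit` are hypothetical names, the metadata is assumed to come from whatever tool you use to inspect MP3 files, and the URLs are placeholders. It does not verify the TLS certificate, the MPEG version, or the content of the file.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

MAX_TOTAL_SECONDS = 90  # combined cap for text and audio in one response


@dataclass
class ClipInfo:
    """Hypothetical metadata record for one MP3 clip."""
    url: str
    bitrate_kbps: int
    sample_rate_hz: int
    duration_s: float


def clip_problems(clip):
    """Return a list of requirement violations for a single clip."""
    problems = []
    if urlparse(clip.url).scheme.lower() != "https":
        problems.append("src must be an HTTPS URL")
    if clip.bitrate_kbps != 48:
        problems.append("bit rate must be 48 kbps")
    if clip.sample_rate_hz != 16000:
        problems.append("sample rate must be 16,000 Hz")
    return problems


def within_time_limit(clips, speech_seconds=0.0):
    """True if clip durations plus estimated speech time fit the 90-second cap."""
    return speech_seconds + sum(c.duration_s for c in clips) <= MAX_TOTAL_SECONDS


beep = ClipInfo("https://contoso.com/beep.mp3", 48, 16000, 1.5)
print(clip_problems(beep), within_time_limit([beep], speech_seconds=20.0))
```

The `speech_seconds` argument is an estimate supplied by the caller, since the duration of the synthesized speech is not known until the service has produced it.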

Syntax

<audio src="string"></audio>

Attributes

Attribute | Description | Required / Optional
src | Specifies the location/URL of the audio file. | Required if using the audio element in your SSML document.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
            <audio src="https://contoso.com/opinionprompt.wav"/>
            Thanks for offering your opinion. Please begin speaking after the beep.
            <audio src="https://contoso.com/beep.wav">
                Could not play the beep, please voice your opinion now.
            </audio>
        </p>
    </voice>
</speak>
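
To review which audio elements carry fallback content, the document can be parsed and each src listed with its body text. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `audio_fallbacks` is hypothetical, and the sketch assumes the document declares the SSML namespace as in the example above.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def audio_fallbacks(ssml):
    """Return (src, fallback_text) for every <audio> element in the document."""
    root = ET.fromstring(ssml)
    pairs = []
    for audio in root.iter(f"{{{SSML_NS}}}audio"):
        # itertext() gathers the element's inner text, including nested markup.
        fallback = " ".join("".join(audio.itertext()).split())
        pairs.append((audio.get("src"), fallback))
    return pairs


EXAMPLE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
            <audio src="https://contoso.com/opinionprompt.wav"/>
            Thanks for offering your opinion. Please begin speaking after the beep.
            <audio src="https://contoso.com/beep.wav">
                Could not play the beep, please voice your opinion now.
            </audio>
        </p>
    </voice>
</speak>"""

for src, fallback in audio_fallbacks(EXAMPLE):
    print(src, "->", fallback or "(no fallback)")
```

An empty fallback string means nothing is spoken in place of that clip if the file cannot be played.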

Add background audio

The mstts:backgroundaudio element allows you to add background audio to your SSML documents (or mix an audio file with text-to-speech). With mstts:backgroundaudio you can loop an audio file in the background, fade in at the beginning of text-to-speech, and fade out at the end of text-to-speech.

If the background audio provided is shorter than the text-to-speech or the fade out, it loops. If it is longer than the text-to-speech, it stops when the fade out has finished.

Only one background audio file is allowed per SSML document. However, you can intersperse audio tags within the voice element to add additional audio to your SSML document.

Syntax

<mstts:backgroundaudio src="string" volume="string" fadein="string" fadeout="string"/>

Attributes

Attribute | Description | Required / Optional
src | Specifies the location/URL of the background audio file. | Required if using background audio in your SSML document.
volume | Specifies the volume of the background audio file. Accepted values: 0 to 100 inclusive. The default value is 1. | Optional
fadein | Specifies the duration of the background audio fade in, in milliseconds. The default value is 0, which is equivalent to no fade in. Accepted values: 0 to 10000 inclusive. | Optional
fadeout | Specifies the duration of the background audio fade out, in milliseconds. The default value is 0, which is equivalent to no fade out. Accepted values: 0 to 10000 inclusive. | Optional
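
The ranges in the table, together with the one-file-per-document rule, lend themselves to a quick pre-flight check. The Python sketch below is an assumption-laden illustration rather than service behavior: `background_audio_problems` is a hypothetical helper, it parses attribute values as floating-point numbers, and it only tests the limits stated in this section.

```python
import xml.etree.ElementTree as ET

MSTTS_NS = "http://www.w3.org/2001/mstts"
# Accepted ranges as stated in the attribute table above.
RANGES = {"volume": (0, 100), "fadein": (0, 10000), "fadeout": (0, 10000)}


def background_audio_problems(ssml):
    """Return a list of problems found in mstts:backgroundaudio elements."""
    root = ET.fromstring(ssml)
    elements = list(root.iter(f"{{{MSTTS_NS}}}backgroundaudio"))
    problems = []
    if len(elements) > 1:
        problems.append("only one background audio file is allowed per document")
    for el in elements:
        if not el.get("src"):
            problems.append("src is required")
        for name, (low, high) in RANGES.items():
            raw = el.get(name)
            if raw is None:
                continue  # optional attribute; the default applies
            try:
                value = float(raw)
            except ValueError:
                problems.append(f"{name} is not a number: {raw!r}")
                continue
            if not low <= value <= high:
                problems.append(f"{name}={raw} is outside {low} to {high}")
    return problems


doc = (
    '<speak version="1.0" xml:lang="en-US" '
    'xmlns:mstts="http://www.w3.org/2001/mstts">'
    '<mstts:backgroundaudio src="https://contoso.com/sample.wav" '
    'volume="0.7" fadein="3000" fadeout="4000"/>'
    '<voice name="en-US-AriaRUS">Hello.</voice></speak>'
)
print(background_audio_problems(doc))
```

An empty list means the document passed these checks; it does not guarantee that the service will accept the request.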

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" xmlns:mstts="http://www.w3.org/2001/mstts">
    <mstts:backgroundaudio src="https://contoso.com/sample.wav" volume="0.7" fadein="3000" fadeout="4000"/>
    <voice name="Microsoft Server Speech Text to Speech Voice (en-US, AriaRUS)">
        The text provided in this document will be spoken over the background audio.
    </voice>
</speak>

Next steps