
Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is an XML-based markup language that lets developers specify how input text is converted into synthesized speech using the text-to-speech service. Compared to plain text, SSML lets developers fine-tune the pitch, pronunciation, speaking rate, volume, and other attributes of the text-to-speech output. The service automatically handles normal punctuation, such as pausing after a period or using the correct intonation when a sentence ends with a question mark.

The Speech Services implementation of SSML is based on the World Wide Web Consortium's Speech Synthesis Markup Language Version 1.0.

Important

Chinese, Japanese, and Korean characters count as two characters for billing. For more information, see Pricing.

Standard, neural, and custom voices

Choose from standard and neural voices, or create a custom voice unique to your product or brand. More than 75 standard voices are available in more than 45 languages and locales, and 5 neural voices are available in 4 languages and locales. For a complete list of supported languages, locales, and voices (neural and standard), see Language support.

To learn more about standard, neural, and custom voices, see Text-to-speech overview.

Special characters

When you use SSML to convert text to synthesized speech, keep in mind that, just as in XML, special characters such as quotation marks, apostrophes, and brackets must be escaped. For more information, see Extensible Markup Language (XML) 1.0: Appendix D.
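
As a minimal sketch of the escaping rule, the following Python snippet uses the standard library's xml.sax.saxutils.escape to prepare text before it is placed inside an SSML element. The helper name escape_ssml_text is hypothetical and not part of any Speech SDK.

```python
from xml.sax.saxutils import escape


def escape_ssml_text(text: str) -> str:
    """Escape characters that have special meaning in XML (hypothetical helper)."""
    # escape() always handles &, < and >; the extra map also covers
    # quotation marks and apostrophes.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(escape_ssml_text('Tom & Jerry say "2 < 3"'))
# prints: Tom &amp; Jerry say &quot;2 &lt; 3&quot;
```

Escaping once, at the point where text is inserted into markup, avoids double-escaping text that is later wrapped in further elements.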

Supported SSML elements

Each SSML document is created with SSML elements (or tags). These elements are used to adjust pitch, prosody, volume, and more. The following sections detail how each element is used, and when an element is required or optional.

Important

Don't forget to use double quotes around attribute values. Well-formed XML requires attribute values to be enclosed in quotation marks. For example, <prosody volume="90"> is a well-formed, valid element, but <prosody volume=90> is not. SSML may not recognize attribute values that are not in quotes.
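
The quoting rule can be checked locally before a document is sent to the service. This is a small sketch that relies only on Python's built-in XML parser; the function name is_well_formed is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed('<prosody volume="90">Hello</prosody>'))  # prints: True
print(is_well_formed('<prosody volume=90>Hello</prosody>'))    # prints: False
```

A local parse only confirms well-formedness; it does not confirm that the service accepts the elements or attribute values.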

Create an SSML document

speak is the root element, and is required for all SSML documents. The speak element contains important information, such as version, language, and the markup vocabulary definition.

Syntax

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="string"></speak>

Attributes

Attribute | Description | Required / Optional
version | Indicates the version of the SSML specification used to interpret the document markup. The current version is 1.0. | Required
xml:lang | Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and uppercase country/region (for example, en-US). | Required
xmlns | Specifies the URI to the document that defines the markup vocabulary (the element types and attribute names) of the SSML document. The current URI is https://www.w3.org/2001/10/synthesis. | Required

Choose a voice for text-to-speech

The voice element is required. It specifies the voice that is used for text-to-speech.

Syntax

<voice name="string">
    This text will get converted into synthesized speech.
</voice>

Attributes

Attribute | Description | Required / Optional
name | Identifies the voice used for text-to-speech output. For a complete list of supported voices, see Language support. | Required

Example

Note

This example uses the en-US-Jessa24kRUS voice. For a complete list of supported voices, see Language support.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        This is the text that is spoken.
    </voice>
</speak>
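
A document like the one above can also be assembled in code. The sketch below is one possible approach in Python, assuming a hypothetical helper named build_ssml; the namespace URI is copied from this document, and the text is escaped so that special characters do not break the markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace URI as given in this document.
SSML_NS = "https://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice: str = "en-US-Jessa24kRUS", lang: str = "en-US") -> str:
    """Wrap text in the required speak and voice elements (hypothetical helper)."""
    return (
        f'<speak version="1.0" xmlns={quoteattr(SSML_NS)} xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


document = build_ssml("Fish & chips are ready.")
ET.fromstring(document)  # raises ParseError if the result is not well-formed
print(document)
```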

Use multiple voices

Within the speak element, you can specify multiple voices for text-to-speech output. These voices can be in different languages. For each voice, the text must be wrapped in a voice element.

Attributes

Attribute | Description | Required / Optional
name | Identifies the voice used for text-to-speech output. For a complete list of supported voices, see Language support. | Required

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        Good morning!
    </voice>
    <voice  name="en-US-Guy24kRUS">
        Good morning to you too Jessa!
    </voice>
</speak>
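
For dialogue-style output, the per-voice wrapping can be generated from a list of turns. This is a sketch only: build_dialogue is a hypothetical helper, and the namespace URI is the one stated in this document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "https://www.w3.org/2001/10/synthesis"  # as given in this document


def build_dialogue(turns, lang: str = "en-US") -> str:
    """Build one SSML document from (voice name, text) pairs (hypothetical helper)."""
    # Each turn gets its own voice element, as the section above requires.
    voices = "".join(
        f"<voice name={quoteattr(name)}>{escape(text)}</voice>" for name, text in turns
    )
    return (
        f'<speak version="1.0" xmlns={quoteattr(SSML_NS)} xml:lang={quoteattr(lang)}>'
        f"{voices}</speak>"
    )


dialogue = build_dialogue([
    ("en-US-Jessa24kRUS", "Good morning!"),
    ("en-US-Guy24kRUS", "Good morning to you too Jessa!"),
])
ET.fromstring(dialogue)  # sanity check: the document is well-formed
print(dialogue)
```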

Adjust speaking styles

Important

This feature works only with neural voices.

By default, the text-to-speech service synthesizes text using a neutral speaking style for both standard and neural voices. With neural voices, you can use the <mstts:express-as> element to adjust the speaking style to express cheerfulness, empathy, or sentiment. This is an optional element unique to Azure Speech Services.

Currently, speaking style adjustments are supported for these neural voices:

  • en-US-JessaNeural
  • zh-CN-XiaoxiaoNeural

Changes are applied at the sentence level, and styles vary by voice. If a style isn't supported, the service returns speech in the default neutral speaking style.

Syntax

<mstts:express-as type="string"></mstts:express-as>

Attributes

Attribute | Description | Required / Optional
type | Specifies the speaking style. Currently, speaking styles are voice specific. If an invalid value is provided, the element is ignored. | Required if you use mstts:express-as to adjust the speaking style of a neural voice.

Use this table to determine which speaking styles are supported for each neural voice.

Voice | Type | Description
en-US-JessaNeural | type=cheerful | Expresses an emotion that is positive and happy
en-US-JessaNeural | type=empathy | Expresses a sense of caring and understanding
en-US-JessaNeural | type=chat | Speaks in a casual, relaxed tone
zh-CN-XiaoxiaoNeural | type=newscast | Expresses a formal tone, similar to news broadcasts
zh-CN-XiaoxiaoNeural | type=sentiment | Conveys a touching message or a story

Example

This SSML snippet illustrates how the <mstts:express-as> element is used to change the speaking style to cheerful.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JessaNeural">
        <mstts:express-as type="cheerful">
            That'd be just amazing!
        </mstts:express-as>
    </voice>
</speak>
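
Because styles are voice specific, a client may want to check the style table before emitting the element. The sketch below encodes the table from this section in a dictionary; the names SUPPORTED_STYLES and express_as are hypothetical, and the fallback to plain text is a client-side mirror of the neutral-style fallback the service is described as applying.

```python
from xml.sax.saxutils import escape, quoteattr

# Styles per neural voice, copied from the table in this section.
SUPPORTED_STYLES = {
    "en-US-JessaNeural": {"cheerful", "empathy", "chat"},
    "zh-CN-XiaoxiaoNeural": {"newscast", "sentiment"},
}


def express_as(voice: str, style: str, text: str) -> str:
    """Return a voice element, adding mstts:express-as only for a supported style."""
    body = escape(text)
    if style in SUPPORTED_STYLES.get(voice, set()):
        body = f"<mstts:express-as type={quoteattr(style)}>{body}</mstts:express-as>"
    return f"<voice name={quoteattr(voice)}>{body}</voice>"


print(express_as("en-US-JessaNeural", "cheerful", "That'd be just amazing!"))
print(express_as("en-US-JessaNeural", "newscast", "Plain neutral style."))
```

The returned fragment still needs to be placed inside a speak element that declares the mstts namespace prefix, as in the example above.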

Add or remove a break/pause

Use the break element to insert pauses (or breaks) between words, or to prevent pauses that the text-to-speech service adds automatically.

Note

Use this element to override the default behavior of text-to-speech (TTS) for a word or phrase if the synthesized speech for that word or phrase sounds unnatural. Set strength to none to prevent a prosodic break that the text-to-speech service would otherwise insert automatically.

Syntax

<break strength="string" />
<break time="string" />

Attributes

Attribute | Description | Required / Optional
strength | Specifies the relative duration of a pause using one of the following values: | Optional
  • none
  • x-weak
  • weak
  • medium (default)
  • strong
  • x-strong
time | Specifies the absolute duration of a pause in seconds or milliseconds. Examples of valid values are 2s and 500ms. | Optional

Strength | Description
none, or if no value provided | 0 ms
x-weak | 250 ms
weak | 500 ms
medium | 750 ms
strong | 1000 ms
x-strong | 1250 ms

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        Welcome to Microsoft Cognitive Services <break time="100ms" /> Text-to-Speech API.
    </voice>
</speak>
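
The strength table above can be kept next to a small helper that emits break elements. This is a sketch under stated assumptions: STRENGTH_MS and break_tag are hypothetical names, and the helper requires exactly one of the two attributes rather than guessing a default.

```python
from typing import Optional

# Pause length per strength value, copied from the table in this section.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 250,
    "weak": 500,
    "medium": 750,
    "strong": 1000,
    "x-strong": 1250,
}


def break_tag(strength: Optional[str] = None, time_ms: Optional[int] = None) -> str:
    """Return a break element with either a strength or an absolute time."""
    if (strength is None) == (time_ms is None):
        raise ValueError("pass exactly one of strength or time_ms")
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must not be negative")
        return f'<break time="{time_ms}ms" />'
    if strength not in STRENGTH_MS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}" />'


print(break_tag(time_ms=100))        # prints: <break time="100ms" />
print(break_tag(strength="strong"))  # prints: <break strength="strong" />
```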

Specify paragraphs and sentences

The p and s elements are used to denote paragraphs and sentences, respectively. In the absence of these elements, the text-to-speech service automatically determines the structure of the SSML document.

The p element may contain text and the following elements: audio, break, phoneme, prosody, say-as, sub, mstts:express-as, and s.

The s element may contain text and the following elements: audio, break, phoneme, prosody, say-as, mstts:express-as, and sub.

Syntax

<p></p>
<s></s>

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        <p>
            <s>Introducing the sentence element.</s>
            <s>Used to mark individual sentences.</s>
        </p>
        <p>
            Another simple paragraph.
            Sentence structure in this paragraph is not explicitly marked.
        </p>
    </voice>
</speak>
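
When sentence boundaries are already known in the calling code, they can be marked explicitly instead of leaving detection to the service. A minimal sketch, assuming a hypothetical helper named paragraph:

```python
from xml.sax.saxutils import escape


def paragraph(sentences) -> str:
    """Wrap each sentence in an s element and the group in a p element."""
    marked = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<p>{marked}</p>"


print(paragraph(["Introducing the sentence element.", "Used to mark individual sentences."]))
```

The result is a fragment intended to sit inside a voice element.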

Use phonemes to improve pronunciation

The phoneme element is used for phonetic pronunciation in SSML documents. The phoneme element can contain only text, no other elements. Always provide human-readable speech as a fallback.

Phonetic alphabets are composed of phones, which are made up of letters, numbers, or characters, sometimes in combination. Each phone describes a unique sound of speech. This is in contrast to the Latin alphabet, where any letter may represent multiple spoken sounds. Consider the different pronunciations of the letter "c" in the words "candy" and "cease", or the different pronunciations of the letter combination "th" in the words "thing" and "those".

Syntax

<phoneme alphabet="string" ph="string"></phoneme>

Attributes

Attribute | Description | Required / Optional
alphabet | Specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The string specifying the alphabet must be in lowercase letters. You may specify one of the following alphabets: | Optional
  • ipa – International Phonetic Alphabet
  • sapi – Speech API Phone Set
  • ups – Universal Phone Set
The alphabet applies only to the phoneme in the element. For more information, see Phonetic Alphabet Reference.
ph | A string containing phones that specify the pronunciation of the word in the phoneme element. If the specified string contains unrecognized phones, the text-to-speech (TTS) service rejects the entire SSML document and produces none of the speech output specified in the document. | Required if using phonemes.

Examples

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        <s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
    </voice>
</speak>

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
    </voice>
</speak>
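
Since an unrecognized alphabet or phone string can cause the whole document to be rejected, it may help to build phoneme elements through one checked helper. The sketch below validates only the alphabet name against the three values listed above; it cannot validate the phones themselves. The helper name phoneme is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

# Alphabet names listed in the attribute table above (lowercase only).
ALPHABETS = {"ipa", "sapi", "ups"}


def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Return a phoneme element; the word itself is the readable fallback text."""
    if alphabet not in ALPHABETS:
        raise ValueError(f"alphabet must be one of {sorted(ALPHABETS)}")
    return f'<phoneme alphabet="{alphabet}" ph={quoteattr(ph)}>{escape(word)}</phoneme>'


print(phoneme("Zhou", "JH AU", alphabet="ups"))
# prints: <phoneme alphabet="ups" ph="JH AU">Zhou</phoneme>
```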

Adjust prosody

The prosody element is used to specify changes to pitch, contour, range, rate, duration, and volume for the text-to-speech output. The prosody element may contain text and the following elements: audio, break, p, phoneme, prosody, say-as, sub, and s.

Because prosodic attribute values can vary over a wide range, the text-to-speech service interprets the assigned values as a suggestion of what the actual prosodic values of the selected voice should be. The service limits or substitutes values that are not supported. Examples of unsupported values are a pitch of 1 MHz or a volume of 120.

Syntax

<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"></prosody>

Attributes

Attribute | Description | Required / Optional
pitch | Indicates the baseline pitch for the text. You may express the pitch as: | Optional
  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600Hz.
  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example: +80Hz or -2st. The "st" indicates that the change unit is a semitone, which is half of a tone (a half step) on the standard diatonic scale.
  • A constant value:
    • x-low
    • low
    • medium
    • high
    • x-high
    • default
contour | Contour isn't supported for neural voices. Contour represents changes in pitch for speech content as an array of targets at specified time positions in the speech output. Each target is defined by a pair of parameters. For example: | Optional

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each pair specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch).
range | A value that represents the range of pitch for the text. You may express range using the same absolute values, relative values, or enumeration values used to describe pitch. | Optional
rate | Indicates the speaking rate of the text. You may express rate as: | Optional
  • A relative value, expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate, a value of 0.5 halves the rate, and a value of 3 triples the rate.
  • A constant value:
    • x-slow
    • slow
    • medium
    • fast
    • x-fast
    • default
duration | The period of time that should elapse while the speech synthesis (TTS) service reads the text, in seconds or milliseconds. For example, 2s or 1800ms. | Optional
volume | Indicates the volume level of the speaking voice. You may express the volume as: | Optional
  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.
  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example, +10 or -5.5.
  • A constant value:
    • silent
    • x-soft
    • soft
    • medium
    • loud
    • x-loud
    • default
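
The pitch and volume forms listed in the table can be checked before a document is sent. The following Python sketch covers only the forms described in the table (it does not cover the percentage forms that appear in some examples in this document), and the function names are hypothetical.

```python
import re

PITCH_CONSTANTS = {"x-low", "low", "medium", "high", "x-high", "default"}
VOLUME_CONSTANTS = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}

# Absolute pitch such as 600Hz, or relative pitch such as +80Hz or -2st.
PITCH_PATTERN = re.compile(r"(?:\d+(?:\.\d+)?Hz|[+-]\d+(?:\.\d+)?(?:Hz|st))")


def is_valid_pitch(value: str) -> bool:
    """Check a pitch value against the forms listed in the attribute table."""
    return value in PITCH_CONSTANTS or PITCH_PATTERN.fullmatch(value) is not None


def is_valid_volume(value: str) -> bool:
    """Check a volume value: constant, relative (+10, -5.5), or absolute 0.0 to 100.0."""
    if value in VOLUME_CONSTANTS:
        return True
    if re.fullmatch(r"[+-]\d+(?:\.\d+)?", value):
        return True
    if re.fullmatch(r"\d+(?:\.\d+)?", value):
        return 0.0 <= float(value) <= 100.0
    return False


print(is_valid_pitch("-2st"), is_valid_pitch("1MHz"))   # prints: True False
print(is_valid_volume("75"), is_valid_volume("120"))    # prints: True False
```

Because the service is described as limiting or substituting unsupported values rather than failing, a check like this is mainly useful for catching typos early.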

Change speaking rate

Speaking rate can be applied to standard voices at the word or sentence level, but only at the sentence level for neural voices.

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Guy24kRUS">
        <prosody rate="+30.00%">
            Welcome to Microsoft Cognitive Services Text-to-Speech API.
        </prosody>
    </voice>
</speak>

Change volume

Volume changes can be applied to standard voices at the word or sentence level, but only at the sentence level for neural voices.

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        <prosody volume="+20.00%">
            Welcome to Microsoft Cognitive Services Text-to-Speech API.
        </prosody>
    </voice>
</speak>

Change pitch

Pitch changes can be applied to standard voices at the word or sentence level, but only at the sentence level for neural voices.

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Guy24kRUS">
        Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>
    </voice>
</speak>

Change pitch contour

Important

Pitch contour changes aren't supported with neural voices.

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        <prosody contour="(80%,+20%) (90%,+30%)" >
            Good morning.
        </prosody>
    </voice>
</speak>

say-as element

say-as is an optional element that indicates the content type (such as number or date) of the element's text. This provides guidance to the speech synthesis engine about how to pronounce the text.

Syntax

<say-as interpret-as="string" format="digit string" detail="string"></say-as>

Attributes

Attribute | Description | Required / Optional
interpret-as | Indicates the content type of the element's text. For a list of types, see the table below. | Required
format | Provides additional information about the precise formatting of the element's text for content types that may have ambiguous formats. SSML defines formats for content types that use them (see table below). | Optional
detail | Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail. | Optional

The following are the supported content types for the interpret-as and format attributes. Include the format attribute only for content types that use it: date, time, and, for a country code, telephone.

interpret-as | format | Interpretation
address | | The text is spoken as an address. The speech synthesis engine pronounces:

I'm at <say-as interpret-as="address">150th CT NE, Redmond, WA</say-as>

As "I'm at 150th court north east redmond washington."
cardinal, number | | The text is spoken as a cardinal number. The speech synthesis engine pronounces:

There are <say-as interpret-as="cardinal">3</say-as> alternatives

As "There are three alternatives."
characters, spell-out | | The text is spoken as individual letters (spelled out). The speech synthesis engine pronounces:

<say-as interpret-as="characters">test</say-as>

As "T E S T."
date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). The speech synthesis engine pronounces:

Today is <say-as interpret-as="date" format="mdy">10-19-2016</say-as>

As "Today is October nineteenth two thousand sixteen."
digits, number_digit | | The text is spoken as a sequence of individual digits. The speech synthesis engine pronounces:

<say-as interpret-as="number_digit">123456789</say-as>

As "1 2 3 4 5 6 7 8 9."
fraction | | The text is spoken as a fractional number. The speech synthesis engine pronounces:

<say-as interpret-as="fraction">3/8</say-as> of an inch

As "three eighths of an inch."
ordinal | | The text is spoken as an ordinal number. The speech synthesis engine pronounces:

Select the <say-as interpret-as="ordinal">3rd</say-as> option

As "Select the third option."
telephone | | The text is spoken as a telephone number. The format attribute may contain digits that represent a country code. For example, "1" for the United States or "39" for Italy. The speech synthesis engine may use this information to guide its pronunciation of a phone number. The phone number may also include the country code, and if so, it takes precedence over the country code in the format. The speech synthesis engine pronounces:

The number is <say-as interpret-as="telephone" format="1">(888) 555-1212</say-as>

As "The number is area code eight eight eight five five five one two one two."
time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is specified using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds. The following are valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. The speech synthesis engine pronounces:

The train departs at <say-as interpret-as="time" format="hms12">4:00am</say-as>

As "The train departs at four A M."

Usage

The say-as element may contain only text.

Example

The speech synthesis engine speaks the following example as "Your first request was for one room on October nineteenth twenty ten with early arrival at twelve thirty five P M."

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice  name="en-US-Jessa24kRUS">
        <p>
            Your <say-as interpret-as="ordinal"> 1st </say-as> request was for <say-as interpret-as="cardinal"> 1 </say-as> room
            on <say-as interpret-as="date" format="mdy"> 10/19/2010 </say-as>, with early arrival at <say-as interpret-as="time" format="hms12"> 12:35pm </say-as>.
        </p>
    </voice>
</speak>
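
Sentences like the one above are often assembled from data values. As a sketch, a hypothetical helper named say_as can emit the element and add the format attribute only when one is supplied:

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Return a say-as element; format is included only when it is given."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # say-as may contain only text, so the content is escaped, never nested markup.
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("10-19-2016", "date", "mdy"))
# prints: <say-as interpret-as="date" format="mdy">10-19-2016</say-as>
print(say_as("3rd", "ordinal"))
# prints: <say-as interpret-as="ordinal">3rd</say-as>
```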

Add recorded audio

audio is an optional element that allows you to insert MP3 audio into an SSML document. The body of the audio element may contain plain text or SSML markup that's spoken if the audio file is unavailable or unplayable. Additionally, the audio element can contain text and the following elements: audio, break, p, s, phoneme, prosody, say-as, and sub.

Any audio included in the SSML document must meet these requirements:

  • The MP3 must be hosted on an Internet-accessible HTTPS endpoint. HTTPS is required, and the domain hosting the MP3 file must present a valid, trusted SSL certificate.
  • The MP3 must be a valid MP3 file (MPEG v2).
  • The bit rate must be 48 kbps.
  • The sample rate must be 16000 Hz.
  • The combined total time for all text and audio files in a single response cannot exceed ninety (90) seconds.
  • The MP3 must not contain any customer-specific or other sensitive information.
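
Two of the requirements above, the HTTPS endpoint and the 90-second limit, can be pre-checked without downloading anything. This sketch uses only the Python standard library; both function names are hypothetical, and the durations are assumed to be known to the caller in seconds. It does not verify the certificate, bit rate, or sample rate.

```python
from urllib.parse import urlparse

MAX_TOTAL_SECONDS = 90  # combined limit for text and audio in a single response


def is_https_url(url: str) -> bool:
    """Return True if the URL uses the https scheme and names a host."""
    parts = urlparse(url)
    return parts.scheme == "https" and bool(parts.netloc)


def within_time_limit(durations_seconds) -> bool:
    """Return True if the summed durations do not exceed the 90-second limit."""
    return sum(durations_seconds) <= MAX_TOTAL_SECONDS


print(is_https_url("https://contoso.com/beep.mp3"))  # prints: True
print(is_https_url("http://contoso.com/beep.mp3"))   # prints: False
print(within_time_limit([30.0, 45.5, 20.0]))         # prints: False
```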

Syntax

<audio src="string"></audio>

Attributes

Attribute | Description | Required / Optional
src | Specifies the location/URL of the audio file. | Required if using the audio element in your SSML document.

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-Jessa24kRUS">
        <p>
            <audio src="https://contoso.com/opinionprompt.wav"/>
            Thanks for offering your opinion. Please begin speaking after the beep.
            <audio src="https://contoso.com/beep.wav">
            Could not play the beep, please voice your opinion now. </audio>
        </p>
    </voice>
</speak>

Add background audio

The mstts:backgroundaudio element allows you to add background audio to your SSML documents (or mix an audio file with text-to-speech). With mstts:backgroundaudio you can loop an audio file in the background, fade in at the beginning of text-to-speech, and fade out at the end of text-to-speech.

If the background audio provided is shorter than the text-to-speech or the fade out, it loops. If it is longer than the text-to-speech, it stops when the fade out has finished.

Only one background audio file is allowed per SSML document. However, you can intersperse audio tags within the voice element to add additional audio to your SSML document.

Syntax

<mstts:backgroundaudio src="string" volume="string" fadein="string" fadeout="string"/>

Attributes

Attribute | Description | Required / Optional
src | Specifies the location/URL of the background audio file. | Required if using background audio in your SSML document.
volume | Specifies the volume of the background audio file. Accepted values: 0 to 100 inclusive. The default value is 1. | Optional
fadein | Specifies the duration of the background audio fade-in in milliseconds. The default value is 0, which is equivalent to no fade-in. Accepted values: 0 to 10000 inclusive. | Optional
fadeout | Specifies the duration of the background audio fade-out in milliseconds. The default value is 0, which is equivalent to no fade-out. Accepted values: 0 to 10000 inclusive. | Optional

Example

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <mstts:backgroundaudio src="https://contoso.com/sample.wav" volume="0.7" fadein="3000" fadeout="4000"/>
    <voice name="Microsoft Server Speech Text to Speech Voice (en-US, Jessa24kRUS)">
        The text provided in this document will be spoken over the background audio.
    </voice>
</speak>
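
The accepted ranges from the attribute table can be enforced when the element is generated. The sketch below assumes a hypothetical helper named background_audio; its defaults mirror the defaults stated in the table.

```python
from xml.sax.saxutils import quoteattr


def background_audio(src: str, volume: float = 1, fadein: int = 0, fadeout: int = 0) -> str:
    """Return an mstts:backgroundaudio element after checking the documented ranges."""
    if not 0 <= volume <= 100:
        raise ValueError("volume must be between 0 and 100 inclusive")
    for name, value in (("fadein", fadein), ("fadeout", fadeout)):
        if not 0 <= value <= 10000:
            raise ValueError(f"{name} must be between 0 and 10000 milliseconds inclusive")
    return (
        f"<mstts:backgroundaudio src={quoteattr(src)} "
        f'volume="{volume}" fadein="{fadein}" fadeout="{fadeout}"/>'
    )


print(background_audio("https://contoso.com/sample.wav", volume=0.7, fadein=3000, fadeout=4000))
```

Only one such element is allowed per document, so a caller would typically invoke this once and place the result directly under speak.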

Next steps