Note: Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Microsoft Speech Platform

Use SSML to Create Prompts and Control TTS

A text-to-speech (TTS) engine generates synthesized speech from textual input, which is usually called a prompt. You can author the content for TTS prompts using the XML format that conforms to the Speech Synthesis Markup Language (SSML) Version 1.0. By using SSML tags to format the text content of a prompt, you can control many aspects of synthetic speech production. Follow these links for simple examples that use a variety of SSML elements in prompts to generate and control speech output.

  • Pass SSML to the Speak method
  • Select a speaking voice
  • Specify speech output characteristics
  • Guide the TTS engine's pronunciation of specific words
  • Control the rhythm of speech output
  • Play back recorded audio
  • Insert markers that trigger events

Note: The Microsoft Speech Platform provides support for prompts authored using SSML, but differs from the SSML specification in some areas. See Speech Synthesis Markup Language Reference for more information.

Pass SSML to the Speak method

You can pass a prompt that is formatted with SSML markup as an argument to the pwcs parameter of the ISpVoice::Speak method. The SSML markup can be a string, though more typically you will point to a file that contains SSML. In either case, the SSML must be a complete and valid SSML document that includes the opening <speak> tag with its required version, xml:lang, and xmlns attributes, and a closing </speak> tag. (The XML declaration may be omitted from the top of the prompt).

When passing a file that contains SSML markup to a Speak call, you must set the SPF_IS_FILENAME flag from the SPEAKFLAGS enumeration to inform the Speech Platform that you are passing the name of a file, rather than a string to speak. The Speech Platform automatically detects the SSML format of the XML markup in the file and parses the contents of the file as SSML.

The following is an example of passing an SSML file to a Speak method. You can use this code as the core of a console application that runs the SSML examples presented in this topic.

`

#include <atlbase.h>   // CComPtr
#include <sapi.h>      // ISpVoice, CLSID_SpVoice, SPF_IS_FILENAME

// Assumes COM has already been initialized with CoInitializeEx.
HRESULT hr = S_OK;

// Create a voice object (uses the default voice token).
CComPtr<ISpVoice> cpVoice;
if (SUCCEEDED(hr)) { hr = cpVoice.CoCreateInstance(CLSID_SpVoice); }

// Speak an SSML prompt from a file.
if (SUCCEEDED(hr))
{
    hr = cpVoice->Speak(L"C:\\Test\\FileName.xml", SPF_IS_FILENAME, 0);
}

`
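
If the prompt is held in a string rather than a file, you can pass the string directly to the Speak method. The following is a minimal sketch, assuming the cpVoice object created above; SPF_IS_XML (also from the SPEAKFLAGS enumeration) informs the Speech Platform that the string contains XML markup rather than literal text to speak, and the SSML format is then detected automatically, just as it is for a file.

`

// A minimal sketch of the string case. The prompt must still be a complete
// SSML document, including the <speak> element and its required attributes.
LPCWSTR pwszSsml =
    L"<speak version=\"1.0\""
    L" xmlns=\"http://www.w3.org/2001/10/synthesis\""
    L" xml:lang=\"en-US\">"
    L" For English, press 1."
    L"</speak>";

if (SUCCEEDED(hr))
{
    // SPF_IS_XML: the string is markup, not the literal text to speak.
    hr = cpVoice->Speak(pwszSsml, SPF_IS_XML, 0);
}

`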


Select a speaking voice

Using SSML, you can select an installed speaking voice (represented by a voice token) to speak a prompt. You can specify a speaking voice by its name or by its attributes.

Select a speaking voice by name

The following example uses SSML to select a speaking voice by specifying its name. The SSML prompt contains two phrases: one in English and one in French. In the opening <speak> tag, the required xml:lang attribute specifies US English as the language for the prompt. The Speech Platform selects a voice token that supports US English to speak the text "For English, press 1."

The opening <voice> tag then changes the language of the speaking voice: it specifies, by name, an installed voice token that speaks French, which is used to speak the second phrase in the prompt. The two phrases will be pronounced correctly only if Runtime Languages that support English and French have been installed.

`

<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

For English, press 1.

<voice name="Microsoft Server Speech Text to Speech Voice (fr-FR, Hortense)"> Pour le français, appuyez sur 2 </voice>

</speak>

`


Select a speaking voice by attributes

You can select a speaking voice by specifying one or more attributes other than Name. A voice token may have any of the following attributes: Age, Gender, Language, Name, Vendor, and VendorPreferred. The Speech Platform will select a voice that matches the attributes you specify. If no voice can be found that matches all the specified attributes, the default voice for the system will be used. The following example specifies a French, female speaking voice to speak the second phrase in the prompt.

`

<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

For English, press 1.

<voice xml:lang="fr-FR" gender="female"> Pour le français, appuyez sur 2 </voice>

</speak>

`

You can also change to a speaking voice in a different language using the xml:lang attribute of the <s> and <p> elements. The following example illustrates this, provided that both English and French speaking voices are installed.

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

one, two, three

<s xml:lang="fr-FR"> quatre, cinq, six </s>

</speak>

`


Specify speech output characteristics

You can use the SSML <prosody> element to modify the speech output of a voice. The SSML definition for the <prosody> element specifies attributes for pitch, rate, volume, contour, range, and duration. Of these, only pitch, rate, and volume are currently supported by Microsoft text-to-speech (TTS) voices. See prosody Element (Microsoft.Speech) for more information.

The following example demonstrates setting the pitch, rate, and volume of the speaking voice.

`

<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

<prosody pitch="low"> This is low pitch. </prosody>

<prosody pitch="medium"> This is medium pitch. </prosody>

<prosody pitch="high"> This is high pitch. </prosody>

<prosody rate="slow"> This is slow speech. </prosody>

<prosody rate="1"> This is medium speech. </prosody>

<prosody rate="fast"> This is fast speech. </prosody>

<prosody volume="x-soft"> This is extra soft volume. </prosody>

<prosody volume="medium"> This is medium volume. </prosody>

<prosody volume="x-loud"> This is extra loud volume. </prosody>

</speak>

`

The pitch, rate, and volume attributes all accept absolute, relative, and enumerated values to specify characteristics of speech output. See prosody Element (Microsoft.Speech) for more information.

SSML also provides an <emphasis> element to specify the stress or prominence to apply when speaking a specified word or phrase. However, currently the Microsoft TTS engines do not support this element.


Guide the pronunciation of specific words

Using the <say-as>, <phoneme>, and <sub> elements, you can specify information that the TTS engine can use to guide its pronunciation of specific words.

say-as element

You can use the <say-as> element to enter information about the type of content represented by specific words, numbers, or characters in a prompt. The TTS engine will use this information to guide its pronunciation of dates, times, numbers, and other content types. See say-as Element (Microsoft.Speech) for more information.

The following example shows some common uses for the <say-as> element.

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

Your reservation for <say-as interpret-as="cardinal"> 2 </say-as> rooms on the <say-as interpret-as="ordinal"> 4th </say-as> floor of the hotel on <say-as interpret-as="date" format="mdy"> 3/21/2012 </say-as>, with early arrival at <say-as interpret-as="time" format="hms12"> 12:35pm </say-as> has been confirmed. Please call <say-as interpret-as="telephone" format="1"> (888) 555-1212 </say-as> with any questions.

</speak>

`

The TTS engine should pronounce:

"Your reservation for two rooms on the fourth floor of the hotel on March twenty-first twenty twelve with early arrival at twelve thirty-five P M has been confirmed. Please call eight eight eight five five five one two one two with any questions."

phoneme element

Using the <phoneme> element, you can specify a phonetic pronunciation for a word or phrase. Microsoft TTS engines are adept at creating pronunciations on-the-fly for unfamiliar words. However, specifying a phonetic spelling may improve the TTS engine's pronunciation of uncommon words, particularly those with uncommon spellings. See phoneme Element (Microsoft.Speech) and About Lexicons and Phonetic Alphabets (Microsoft.Speech) for more information.

The following example specifies a pronunciation for the slang word "whatchamacallit".

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

Gimme the <phoneme alphabet="x-microsoft-ups" ph="S1 W AA T . CH AX . M AX . S2 K AA L . IH T"> whatchamacallit </phoneme>.

</speak>

`

In the example above, the TTS engine ignores the contents of the <phoneme> element, and instead pronounces the contents of the ph attribute. The ph attribute contains a phonetic spelling that consists of phones from the Universal Phone Set (UPS), a phonetic alphabet created by Microsoft and based on the International Phonetic Alphabet (IPA). A phone represents a discrete sound in a spoken language. See Phonetic Alphabet Reference (Microsoft.Speech) for more information.

The pronunciation specified inline in a <phoneme> element applies only to the single occurrence of a word. You can also create custom pronunciations in a lexicon document, and add a link to the lexicon in the prompt document. See Pronunciation Lexicon Reference (Microsoft.Speech). Pronunciations specified in custom lexicons override pronunciations in the TTS engine's internal lexicon, and apply as long as the prompt is active. Custom pronunciations specified inline override those in custom lexicons and in the TTS engine's internal lexicon.

sub element

Using the <sub> element, you can also specify alternate text for the TTS engine to speak in place of the element's contents. This is useful for speaking the expanded form of an acronym, as the following example illustrates.

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

<sub alias="Contoso Symphony Orchestra"> CSO </sub>

</speak>

`


Control the rhythm of speech output

A TTS engine automatically determines the sentence and paragraph structure of a prompt document, and inserts pauses when speaking sentences and paragraphs to approximate the rhythm of human speech. The precise length of the pause varies among TTS engines. A TTS engine may also produce pauses when it encounters an SSML <p> or <s> element, which designate a paragraph and a sentence, respectively. You can use the <p> and <s> elements to organize longer prompts, or to designate sentence and paragraph breaks that the TTS engine may not interpret as you intend.

A TTS engine also produces changes in prosody (for example, pitch, rate, volume, and silent intervals) when it encounters punctuation in prompt text. You may notice a rise in pitch approaching a question mark, or a silent interval after a comma. Therefore, it is important to notice how a TTS engine responds to punctuation in prompt text and to use punctuation to control spoken output.

You can also insert pauses of specified lengths directly, using the <break> element. You can specify an enumerated silence interval using the strength attribute, or specify an absolute interval using the time attribute. See break Element (Microsoft.Speech) for more information.

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

The rental car you reserved <break strength="medium" /> a mid-size sedan <break strength="medium" /> will be ready for you to pick up at <break time="500ms" /> <say-as interpret-as="time" format="hms12"> 4:00pm </say-as> today.

</speak>

`


Play back recorded audio

You can instruct the Speech Platform to play a pre-recorded audio file as part of a prompt. You need only specify a path to an audio file that your system can access. The following example incorporates an audio file in a prompt. Note that the contents of the <audio> element contain text that an application can display if the audio file cannot be played. See audio Element (Microsoft.Speech) for more information.

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

<audio src="C:\Test\Weather.WAV"> Here's today's weather forecast. </audio>

Today's weather will be mostly cloudy with some sun breaks.

</speak>

`

Insert markers that trigger events

The <mark> element lets you insert a named marker in an SSML prompt. See mark Element (Microsoft.Speech) for more information. The TTS engine will raise the SPEI_TTS_BOOKMARK event when it encounters this empty element in a prompt, and will return the name of the bookmark.

`

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">

<s> The TTS engine will raise an event when it encounters this element: <mark name="sentence1" /> </s>

<s> The TTS engine will raise another event at the end of this sentence. <mark name="sentence2" /> </s>

</speak>

`

You will need additional code to process the events that the TTS engine raises when it encounters <mark> elements in SSML. See Use TTS Events for an example of how to subscribe to TTS events.
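
The following is a minimal sketch of such code. It assumes the cpVoice object created earlier in this topic, that the SSML above has been saved to a hypothetical file C:\Test\Marks.xml, and that the SPFEI macro and SpClearEvent helper from the SDK's sphelper.h are available; it prints the name of each bookmark as the TTS engine reaches the corresponding <mark> element.

`

#include <sphelper.h>   // SPFEI macro, SpClearEvent helper
#include <stdio.h>

HRESULT hr = S_OK;

// Receive notifications through a Win32 event, and limit the queued events
// to bookmark and end-of-stream events.
if (SUCCEEDED(hr)) { hr = cpVoice->SetNotifyWin32Event(); }
if (SUCCEEDED(hr))
{
    ULONGLONG ullInterest = SPFEI(SPEI_TTS_BOOKMARK) | SPFEI(SPEI_END_INPUT_STREAM);
    hr = cpVoice->SetInterest(ullInterest, ullInterest);
}

// Speak asynchronously so this thread is free to drain the event queue.
// C:\Test\Marks.xml is a hypothetical file containing the SSML shown above.
if (SUCCEEDED(hr))
{
    hr = cpVoice->Speak(L"C:\\Test\\Marks.xml", SPF_IS_FILENAME | SPF_ASYNC, 0);
}

bool fDone = false;
while (SUCCEEDED(hr) && !fDone && cpVoice->WaitForNotifyEvent(INFINITE) == S_OK)
{
    SPEVENT evt;
    while (cpVoice->GetEvents(1, &evt, NULL) == S_OK)
    {
        if (evt.eEventId == SPEI_TTS_BOOKMARK)
        {
            // For bookmark events, lParam points to the mark's name attribute.
            wprintf(L"Reached bookmark: %s\n", (LPCWSTR)evt.lParam);
        }
        else if (evt.eEventId == SPEI_END_INPUT_STREAM)
        {
            fDone = true;
        }
        SpClearEvent(&evt);   // Release any memory carried by the event.
    }
}

`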


Also see Generate Speech from Text in a File and Persist TTS Output to a WAV File for tips on how to author prompts using C++ code.