Speech Synthesis API and DDI (Windows Embedded CE 6.0)

01/05/2012

1/6/2010

This topic describes the APIs and DDIs required for basic speech synthesis.

API

For basic speech synthesis, an application uses the ISpVoice interface. The application calls COM's CoCreateInstance for the component CLSID_SpVoice to get a pointer to the ISpVoice interface of a voice object. It can then call ISpVoice methods such as ISpVoice::Speak, which synthesizes speech for the text passed in as a parameter. A voice object uses the default voice unless the voice has been changed by a call to ISpVoice::SetVoice. Similarly, a voice object uses the default audio data destination unless ISpVoice::SetOutput is called with an IUnknown pointer and an appropriate SPSTREAMFORMAT format.

General management functions are provided through an ISpResourceManager object that is created by calling CoCreateInstance with CLSID_SpResourceManager.

In addition to generating speech from plain text, an application can place synthesis markup within the text passed to ISpVoice::Speak. This markup language is based on XML. An application can also play raw audio data by using ISpVoice::SpeakStream. Text files can be spoken by using ISpVoice::SpeakStream as well; the format of the text (either Unicode or ANSI) is detected automatically. Both methods take flags that control whether to purge any sounds not already played and whether the call is asynchronous.
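The flag handling described above can be sketched in portable C++. This is a minimal illustration, not the SDK's code: the constants are local stand-ins whose values are assumed to mirror SPF_ASYNC, SPF_PURGEBEFORESPEAK, and SPF_IS_XML from the desktop SAPI 5 headers, and MakeSpeakFlags is a hypothetical helper. Real code would include the platform's SAPI header and use its own enumeration.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins; the values are assumed to match the SAPI SPEAKFLAGS enum.
enum SpeakFlagSketch : std::uint32_t {
    kSpfDefault          = 0,        // synchronous call, nothing purged
    kSpfAsync            = 1u << 0,  // return at once, speak in the background
    kSpfPurgeBeforeSpeak = 1u << 1,  // drop queued output that has not played yet
    kSpfIsXml            = 1u << 3   // treat the input text as XML markup
};

// Hypothetical helper: builds the flags value for one Speak call.
std::uint32_t MakeSpeakFlags(bool async, bool purgeBeforeSpeak, bool textIsXml) {
    std::uint32_t flags = kSpfDefault;
    if (async)            flags |= kSpfAsync;
    if (purgeBeforeSpeak) flags |= kSpfPurgeBeforeSpeak;
    if (textIsXml)        flags |= kSpfIsXml;
    // Sanity check: only the three bits handled above may be set.
    assert((flags & ~(kSpfAsync | kSpfPurgeBeforeSpeak | kSpfIsXml)) == 0u);
    return flags;
}
```

The resulting value would be passed as the flags argument of ISpVoice::Speak; for example, MakeSpeakFlags(true, true, false) describes an asynchronous call that first discards any output still waiting to be played.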

Because an ISpVoice object is also an ISpEventSource object, an application can receive event notifications by calling ISpNotifySource::SetNotifySink and passing in an ISpNotifySink object. For each ISpVoice::Speak or ISpVoice::SpeakStream call, ISpVoice provides an SPEI_START_INPUT_STREAM event before the engine begins synthesizing and an SPEI_END_INPUT_STREAM event when the engine has finished synthesizing. Each call to ISpVoice::Speak or ISpVoice::SpeakStream is assigned a stream number (pulStreamNum). In each SPEVENT structure describing an event, the ulStreamNum member identifies the ISpVoice::Speak or ISpVoice::SpeakStream call that the event belongs to. ISpVoice also generates the following events of type SPEVENTENUM.

Event	Description
SPEI_VOICE_CHANGE	Voice identifier or voice type has changed during stream synthesis.
SPEI_TTS_BOOKMARK	A bookmark was encountered during synthesis.
SPEI_WORD_BOUNDARY	Start of word being spoken.
SPEI_SENTENCE_BOUNDARY	Start of sentence being spoken.
SPEI_PHONEME	Phoneme being spoken.
SPEI_VISEME	Viseme being spoken.
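The way stream numbers tie events back to individual Speak or SpeakStream calls can be illustrated with a small, self-contained sketch. EventId, EventSketch, and CountWordsPerStream are hypothetical stand-ins written for this example; they model only the event identifier and the ulStreamNum member and do not use the real SPEVENT layout.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical subset of the event identifiers listed above.
enum class EventId { StartInputStream, EndInputStream, WordBoundary, Bookmark };

// Hypothetical, trimmed-down event record.
struct EventSketch {
    EventId id;
    unsigned long streamNum;  // plays the role of SPEVENT::ulStreamNum
};

// Counts word-boundary events per stream number. A stream is tracked from
// its start event onward; word events for unknown streams are ignored.
std::map<unsigned long, int> CountWordsPerStream(const std::vector<EventSketch>& events) {
    std::map<unsigned long, int> counts;
    for (const EventSketch& ev : events) {
        if (ev.id == EventId::StartInputStream) {
            counts[ev.streamNum] = 0;
        } else if (ev.id == EventId::WordBoundary) {
            auto it = counts.find(ev.streamNum);
            if (it != counts.end()) {
                ++it->second;
            }
        }
    }
    return counts;
}
```

An application that queues several asynchronous calls could use the same idea to decide which queued text a given word-boundary or bookmark event refers to.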

Applications can call ISpVoice::GetStatus to receive the current status of speech synthesis.

DDI

A SAPI speech synthesis engine implements the ISpTTSEngine interface. The primary method called by SAPI to perform speech rendering is ISpTTSEngine::Speak. SAPI, rather than the engine, performs XML parsing of the input text stream. The Speak method is handed a linked list of text fragments with their associated XML attribute state. The Speak method also receives a pointer to the SpVoice object's ISpTTSEngineSite interface. The TTS engine uses this interface to queue events and write output data.
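As a rough illustration of the fragment list, the sketch below walks a singly linked list of text fragments in the order an engine would render them. FragSketch is a hypothetical, simplified stand-in for the fragment structure that SAPI passes to the engine, and rateAdjust is an assumed example of per-fragment attribute state; a real engine would also write audio and queue events through ISpTTSEngineSite, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified fragment node.
struct FragSketch {
    const FragSketch* next;  // next fragment, or nullptr at the end of the list
    const wchar_t* text;     // start of the fragment text (need not be null-terminated)
    std::size_t length;      // number of characters that belong to this fragment
    int rateAdjust;          // example of XML attribute state attached to the fragment
};

// Visits each fragment in list order and gathers the text to be rendered.
std::wstring CollectText(const FragSketch* head) {
    std::wstring out;
    for (const FragSketch* f = head; f != nullptr; f = f->next) {
        // A fragment with a nonzero length must point at real text.
        assert(f->text != nullptr || f->length == 0);
        if (f->length > 0) {
            out.append(f->text, f->length);
        }
    }
    return out;
}
```

Because each fragment carries its own length, the walk never relies on null termination, so a fragment can refer to a slice of a larger input buffer.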

Even though SAPI has a free-threaded architecture, SAPI always calls TTS engine instances on a single thread. Applications never access TTS engines directly. SAPI ensures that all parameter validation and thread synchronization have been performed before the TTS engine is called. For more Speech Synthesis DDI information, see Text-to-Speech Engine Manager.
