Processing Speech Recognition

The Speech Engine Services (SES) recognition engine interprets audio input from the user so that a speech application can understand it and respond. The audio may come from the telephony interface manager (TIM) software or from a device such as a Windows Mobile-based Pocket PC.

The recognition process involves the following steps:

  • The client (the TIM or mobile device) sends the audio stream to be recognized, along with an associated grammar, to SES. The grammar is part of the speech application.

    Note   For SES with the default Microsoft-provided speech recognition (SR) engine, grammars must be in W3C-compliant XML format (uncompiled) or in context-free grammar (CFG) format (compiled). Other recognition engines may take grammars in other formats.

  • SES loads the application grammar and sends it, with the corresponding audio, to the Speech Application Programming Interface (SAPI).

  • SAPI parses the grammar into the appropriate rules, properties, and phrases, and passes the parsed grammar and the audio to an available SR engine.

  • The SR engine then performs the actual recognition work, passing the resulting data back to SAPI.

  • SAPI uses the application grammar to format the recognition results as Semantic Markup Language (SML) output, and passes this output back to SES.

  • SES in turn passes the SML-formatted result back to the client.

    Figure: Single server deployment diagram
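
The hand-offs in the steps above can be sketched as a small Python model. This is only an illustration of the order in which the components pass data to one another. The class names (ToySes, ToySapi, ToySrEngine), the pipe-separated grammar string, the text-as-audio shortcut, and the simplified SML-like output are all assumptions made for this sketch. None of them is the actual SES or SAPI interface.

```python
from dataclasses import dataclass
from typing import List, Optional
from xml.sax.saxutils import quoteattr


@dataclass
class RecognitionRequest:
    """What the client (TIM or mobile device) sends: audio plus a grammar."""
    audio: bytes
    grammar: str  # hypothetical format: phrases separated by "|"


class ToySrEngine:
    """Stand-in SR engine. The 'audio' here is just UTF-8 text, so that
    recognition reduces to a lookup in the list of allowed phrases."""

    def recognize(self, audio: bytes, phrases: List[str]) -> Optional[str]:
        heard = audio.decode("utf-8").strip().lower()
        return heard if heard in phrases else None


class ToySapi:
    """Stand-in for SAPI: parses the grammar, calls the engine, and formats
    the engine's result as a simplified SML-like string."""

    def __init__(self, engine: ToySrEngine) -> None:
        self.engine = engine

    def recognize(self, audio: bytes, grammar: str) -> Optional[str]:
        # Parse the grammar into the phrases the engine may return.
        phrases = [p.strip().lower() for p in grammar.split("|") if p.strip()]
        text = self.engine.recognize(audio, phrases)
        if text is None:
            return None  # nothing in the grammar matched
        # Format the result; real SML carries more than this one attribute.
        return f"<SML text={quoteattr(text)}/>"


class ToySes:
    """Stand-in for SES: receives the request, forwards the grammar and
    audio to SAPI, and returns the formatted result to the client."""

    def __init__(self, sapi: ToySapi) -> None:
        self.sapi = sapi

    def handle(self, request: RecognitionRequest) -> Optional[str]:
        return self.sapi.recognize(request.audio, request.grammar)


if __name__ == "__main__":
    ses = ToySes(ToySapi(ToySrEngine()))
    print(ses.handle(RecognitionRequest(audio=b"Yes", grammar="yes | no")))
```

In this model the client never talks to the engine directly. Every request and every result passes through the SES and SAPI layers, which mirrors the flow described in the list.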


The SR engine provided with Microsoft Speech Server supports only W3C grammars, not SAPI grammars. Specifically, grammars written in the SAPI XML format are no longer supported.
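
One way to tell the two XML grammar styles apart before deployment is to inspect the root element. The following Python sketch assumes that a W3C grammar has a grammar root element in the W3C Speech Recognition Grammar Specification namespace, and that a SAPI-style XML grammar uses an uppercase GRAMMAR root with no namespace. The function name is_w3c_grammar and both sample grammars are hypothetical, and the check is not a full validation against the specification. It also applies only to uncompiled XML grammars, not to compiled CFG files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Speech Recognition Grammar Specification.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical W3C-format grammar accepting "yes" or "no".
W3C_SAMPLE = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="answer">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>"""

# Hypothetical SAPI-style XML grammar (assumed shape, for contrast only).
SAPI_STYLE_SAMPLE = """<GRAMMAR LANGID="409">
  <RULE NAME="answer" TOPLEVEL="ACTIVE">
    <L><P>yes</P><P>no</P></L>
  </RULE>
</GRAMMAR>"""


def is_w3c_grammar(xml_text: str) -> bool:
    """Return True if the root element is <grammar> in the SRGS namespace."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML, so not a usable XML grammar
    return root.tag == f"{{{SRGS_NS}}}grammar"


if __name__ == "__main__":
    print(is_w3c_grammar(W3C_SAMPLE))
    print(is_w3c_grammar(SAPI_STYLE_SAMPLE))
```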

See Also

Producing Speech Output