SR Engine Vendor Porting Guide SAPI 5.4

Microsoft Speech API 5.4




1         Contents

SAPI Speech Recognition Engine Guide
1         Contents
2         Summary
3         Introduction
3.1       SAPI SR Objects and Interfaces
3.2       Sample SR Engine
4         Engine Initialization and Setup
4.1       Engine Creation
4.2       Object token layout
4.3       SetObjectToken
4.4       RecoProfiles
4.5       RecoContexts
4.6       Recognizer Properties
5         Grammar handling
5.1       Grammar Creation and Deletion
5.2       CFG Grammars
5.2.1     Introduction and terminology
5.2.2     Grammar Notifications
5.2.3     Word Notifications
5.2.4     Rule Notifications
5.2.5     States
5.2.6     Transitions
5.2.7     Special Transitions
5.2.8     Semantic Properties
5.2.9     Additional topics
5.3       Dictation Grammars
5.3.1     Language model adaptation
5.4       Proprietary grammars
5.4.1     Porting other grammar formats
6         Lexicon handling
6.1       Using lexicons
6.2       Phone Converters
7         Recognition and audio
7.1       RecognizeStream
7.1.1     Active always state
7.2       Reading audio
7.2.1     How audio formats are represented in SAPI
7.2.2     Setting the audio format
7.2.3     Reading data
7.2.4     Information about the audio input
7.2.5     Setting the input gain
7.3       Threading model
7.4       Synchronization
7.4.1     Pause and auto-pause
7.5       Events and Recognitions
7.5.1     Standard events
7.5.2     Event ordering
7.5.3     Other events
7.6       Completion of processing
8         Recognition Results
8.1       Recognition Call
8.2       Dictation Phrases
8.3       CFG Phrases
8.4       Confidence Scoring and Rejection
8.4.1     Word Confidence
8.4.2     Property and Rule Confidence
8.4.3     Required Confidence and Rejection
8.4.4     Ambiguous Results
8.5       Inverse Text Normalization (ITN)
8.6       Interpreters
9         Alternates
9.1       Returning alternates in a Recognition
9.2       Alternates Analyzer
10        User-Interface
11        Engine extensions
11.1      Important notes about COM interface pointer handling by the SR extension aggregates



2         Summary


This document fully describes the Speech Recognition engine interface in SAPI 5.0. Speech Recognition engines and applications use this interface to connect to SAPI. This document is aimed at engine vendors wishing to port their SR engine to SAPI 5.0, and at general developers who are interested in understanding more about SAPI. This document explains which interfaces and objects SAPI implements, and which interfaces an SR engine should implement. It describes how engines are registered and initialized; how grammar and lexicon information is communicated to the engine; how engines read data and perform recognition; and how engines return events and results back to the application.


3         Introduction

The Microsoft Speech API (SAPI) is a software layer used by speech-enabled applications to communicate with Speech Recognition (SR) engines and Text-to-Speech (TTS) engines. SAPI includes an Application Programming Interface (API) and a Device Driver Interface (DDI). Applications communicate with SAPI using the API layer and speech engines communicate with SAPI using the DDI layer. 


A speech-enabled application and an SR engine do not directly communicate with each other - all communication is done using SAPI. SAPI controls a number of aspects of a speech system, such as:


·        Controlling audio input, whether from a microphone, files, or a custom audio source; and converting audio data to a valid engine format.

·        Loading grammar files, whether dynamically created or created from memory, URL or file; and resolving grammar imports and grammar editing.

·        Compiling the standard SAPI XML grammar format, converting custom grammar formats, and parsing semantic tags in results.

·        Sharing recognition across multiple applications using the shared engine, and handling all marshaling between engine and applications.

·        Returning results and other information back to the application and interacting with its message loop or other notification method. Using these methods, an engine can have a much simpler threading model than in SAPI 4, because SAPI 5 does much of the thread handling.

·        Storing audio and serializing results for later analysis.

·        Ensuring that applications do not cause errors - preventing applications from calling the engine with invalid parameters, and dealing with applications hanging or crashing.


The SR engine performs the following tasks:


·        Uses SAPI grammar interfaces and loads dictation.

·        Performs recognition.

·        Polls SAPI for information about grammar and state changes.

·        Generates recognitions and other events to provide information to the application.


3.1       SAPI SR Objects and Interfaces


In order for an SR engine to be a SAPI 5 engine, it must implement at least one COM object. Each instance of this object represents one SR engine instance. The main interface this object must implement is the ISpSREngine interface. SAPI calls the engine using the methods of this interface to pass details of recognition grammars. It also uses these methods to inform the engine when to start and stop recognition. SAPI itself implements the interface ISpSREngineSite. A pointer to this is passed to the engine and the engine calls SAPI using this interface to read audio, and return recognition results.


ISpSREngine is the main interface to be implemented, but there are other interfaces that an engine may implement. The SR engine can implement the ISpObjectWithToken interface. This provides a mechanism for the engine to query and edit information about the object token in the registry used to create the engine. Information about object tokens is provided in the Object Tokens and Registry Settings White Paper and in SetObjectToken.


There are two other interfaces that the engine can also implement. Each needs to be implemented in a separate COM object, because SAPI needs to create and delete them independently of the main engine. These interfaces are:


·        ISpSRAlternates, which can be used by the SR engine to generate alternates for dictation results. It is possible to generate alternates without this interface, but this interface generates alternates off-line, after the result has been serialized. (See Alternates).

·        ISpTokenUI implements UI components that can be initialized from an application. These can be used to perform user training, add and remove user words, and calibrate the microphone. (See User-Interface)


The engine can also implement another COM object enabling engine-specific calls between the application and the engine. This object can implement any interface that the application can obtain through QueryInterface. (See Engine extensions.)


3.2       Sample SR Engine


The SAPI 5.0 SDK contains working Microsoft speech recognition engines for US English, Japanese and Chinese. These engines are not shipped with source. The SDK also contains a Sample SR Engine, which is shipped with source (In directory Microsoft Speech SDK5.0\Samples\CPP\Engines\SR). This is a sample engine - it implements all the functionality of an SR engine and can be created and used in applications, but it does not actually perform any recognition - instead it generates valid, but random, results. This is very useful example code for understanding how a real SR engine might be implemented.


4         Engine Initialization and Setup


4.1       Engine Creation


When an application wants to perform recognition, it can create a recognizer in one of two ways. The application can create an in-process (InProc) ISpRecognizer object. In this case, SAPI creates the SR engine COM object from the object token representing an engine. Alternatively, an application can create the shared recognizer. In this case, SAPI will create the SR engine in a separate process (named sapisvr.exe) and all applications will share this recognizer. This process is completely invisible to the SR engine and all marshaling is handled by SAPI.


In order to create the SR engine, SAPI uses the SetRecognizer call to look at the object token of the default recognizer or the object token that the application has specified. The object token contains the class ID (CLSID) of the main SR engine and this class is created. The SR engine COM classes must register themselves with "ThreadingModel = Both" or they may not be successfully created.


SetSite on the ISpSREngine interface is then called to give the engine a reference to the ISpSREngineSite interface it will use to call back to SAPI. As with any COM interface pointer that it keeps, the engine should call AddRef on this pointer to maintain the correct reference count.
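The reference-counting rule for the site pointer can be sketched as follows. This is a minimal illustration, not SAPI code: MockSite and MockEngine are hypothetical stand-ins for the COM site object and the engine class, and a real engine would use the SAPI headers and IUnknown rather than these types.

```cpp
#include <cassert>

// Hypothetical stand-in for a COM interface such as ISpSREngineSite.
// Real code would use the SAPI headers and IUnknown instead.
struct MockSite {
    long refs = 1;                        // the creator holds the first reference
    long AddRef()  { return ++refs; }
    long Release() { return --refs; }     // a real COM object deletes itself at 0
};

// Sketch of the engine side: hold a counted reference for as long as
// the site pointer is stored.
class MockEngine {
public:
    ~MockEngine() { SetSite(nullptr); }   // give the reference back on destruction

    void SetSite(MockSite* site) {
        if (site) site->AddRef();         // take the new reference first
        if (site_) site_->Release();      // then drop the old one
        site_ = site;
    }

    bool HasSite() const { return site_ != nullptr; }

private:
    MockSite* site_ = nullptr;
};
```

Taking the new reference before releasing the old one keeps the count correct even if the same site pointer is passed twice.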


4.2       Object token layout


As part of the SR engine installation process, an object token that represents the engine is added to the user's system. Otherwise, SAPI will have no information about the SR engine. To add an object token, add a key to this point in the registry:



On a computer with SAPI 5 installed, there are several registry keys here for the Microsoft English, Chinese, Japanese and Sample SR Engine.


Inside this key there must be the value CLSID containing the CLSID of the main SR engine class. Using this key SAPI determines how to create the engine. There are values for the other CLSIDs that the engine can implement - AlternatesCLSID for the class implementing the ISpSRAlternates alternates analyzer, and RecoExtension for the class implementing any engine-specific private call interfaces. The key should also contain a {Default} value set with the name of the engine so that the Speech properties in Control Panel can display it.


In this key, there can also be a subkey Attributes, used by applications to query for engines matching certain attributes. Typically, the engine sets a value Language to indicate which languages the engine supports; Dictation to indicate the engine supports dictation; CommandAndControl to indicate the engine supports command and control, and so on. (See the Object Tokens and Registry Settings White Paper for more information on object tokens).


There can also be a key UI indicating the types of user-interface components the engine supports (See User-Interface).
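As a rough picture of the token layout and of attribute matching, the sketch below models a token in memory. The value names (CLSID, AlternatesCLSID, RecoExtension, Attributes) come from the text above; the TokenSketch struct, the MatchesAttributes helper, and the exact-match rule are assumptions made for illustration and are not SAPI types or SAPI's actual matching algorithm.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory picture of an engine object token.
// The struct itself is illustrative, not a SAPI type.
struct TokenSketch {
    std::wstring defaultName;                          // engine display name
    std::wstring clsid;                                // main SR engine class
    std::wstring alternatesClsid;                      // optional alternates analyzer
    std::wstring recoExtension;                        // optional private-call class
    std::map<std::wstring, std::wstring> attributes;   // the Attributes subkey
};

// Returns true when every required attribute is present on the token.
// An empty required value means the attribute only has to exist;
// otherwise the values must be equal (a simplification).
bool MatchesAttributes(const TokenSketch& token,
                       const std::map<std::wstring, std::wstring>& required) {
    for (const auto& want : required) {
        auto it = token.attributes.find(want.first);
        if (it == token.attributes.end()) return false;
        if (!want.second.empty() && it->second != want.second) return false;
    }
    return true;
}
```

An application-side query for "a dictation engine for a given language" then reduces to building a small map of required attributes and testing each candidate token against it.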


4.3       SetObjectToken


Once SAPI creates the SR engine COM object, it determines if the engine supports the ISpObjectWithToken interface. If it does, SetObjectToken is called. This passes a pointer to the object token this engine was created from. This is useful for two reasons:

·        The engine can use the token to store information. It can store information directly in the object token using the ISpDataKey methods, or it can store file paths to other engine data. The ISpObjectToken interface method, GetStorageFileName, provides an easy way for an engine to find a file path to store data. This path is stored in the Files subkey of the engine object token.

·        The engine can also read information from the token, such as file paths set during install, or user options set using an engine properties UI component. It is also possible for several object tokens to share the same engine CLSID. For example, they can share the CLSID for different language engines, telephony, or desktop variants of the engine. In this case, the engine needs to know from which object token it is being created.


4.4       RecoProfiles


SetRecoProfile is called next to give the engine an ISpObjectToken pointer referring to the current user profile. RecoProfiles are added or removed by the user inside Speech properties in Control Panel. The engine can create a subkey under the profile object token and use it to store any data. The engine must store the data in a subkey named after its CLSID. This prevents conflicts with other engines on the system that use the same profile.


To provide user enrollment, the engine implements the User Training UI component (See User-Interface); sapi.idl defines this as SPDUI_UserTraining. The component is instantiated using Control Panel->Speech properties->SR tab->Train Profile. Engines can also use AddEvent to request that an application display the UI (See Events and Recognitions).


The user-training UI might produce some adapted model files. These can be saved and their location stored in the RecoProfile object token. The engine can read the location of these files from the object token later.


4.5       RecoContexts


Each application using speech has at least one RecoContext object implementing ISpRecoContext. It is from this interface that the application creates and loads grammars and activates recognition. SAPI informs the SR engine of each RecoContext associated with it using OnCreateRecoContext and OnDeleteRecoContext. The SR engine returns a pointer to SAPI from the OnCreateRecoContext function, which is then passed to the engine in any future calls that need to refer to the RecoContext. It is not essential for an engine to keep track of each RecoContext unless it is using private calls or proprietary grammars.


4.6       Recognizer Properties


SAPI provides a means for applications to change certain settings and configurations on the SR engine. The application does this by calling methods on the ISpProperties interface, which is implemented on the RecoContext objects. There are four methods on this interface to get and set string and integer values. When these methods are called, SAPI calls equivalent methods on the SR engine: GetPropertyString, SetPropertyString, GetPropertyNum, SetPropertyNum.


In SAPI 5.0, each method has a SPPROPSRC parameter, which is always set to SPPROPSRC_RECO_INST, and a pointer pvSrcObj, which is always set to NULL.


A number of these properties are already defined by SAPI (See SAPI 5.0 SR Properties White Paper). Ideally, the engine should implement these if it has equivalent parameters that can be controlled.


When an application sets one of these values, SAPI calls the SR engine. The engine returns S_OK if it supports the property and has updated the value, and S_FALSE if it does not support the property.
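The S_OK / S_FALSE convention can be sketched like this. The class below is a hypothetical illustration: PropertyStoreSketch, the kS_OK and kS_FALSE constants, and the choice of "ResourceUsage" as the one supported property are all assumptions for the example; a real engine would implement the ISpSREngine property methods and take the result codes from the Windows headers.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins for the COM result codes; real code gets these from <winerror.h>.
using HResultSketch = long;
constexpr HResultSketch kS_OK = 0;
constexpr HResultSketch kS_FALSE = 1;

class PropertyStoreSketch {
public:
    PropertyStoreSketch() {
        // Assumed set of supported integer properties with default values.
        nums_[L"ResourceUsage"] = 50;
    }

    // Updates a supported property; reports S_FALSE for anything else.
    HResultSketch SetPropertyNum(const std::wstring& name, long value) {
        auto it = nums_.find(name);
        if (it == nums_.end()) return kS_FALSE;   // property not supported
        it->second = value;
        return kS_OK;
    }

    // Reads a supported property into *value; leaves it untouched otherwise.
    HResultSketch GetPropertyNum(const std::wstring& name, long* value) const {
        auto it = nums_.find(name);
        if (it == nums_.end()) return kS_FALSE;
        *value = it->second;
        return kS_OK;
    }

private:
    std::map<std::wstring, long> nums_;
};
```

Because the table lives in the engine instance, the values disappear when the instance is destroyed, which matches the run-time-only nature of these properties.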


Note that these properties exist to alter run-time settings for this instance of the engine, and are reset every time the engine is deleted. For permanent changes to engine settings, use an engine properties UI component, or include additional values in the engine object token that applications can read and set.


5         Grammar handling


Each speech application can have one or more ISpRecoGrammar objects associated with it. Within each grammar object there are several types of grammar:


·        Command and Control grammars. These are context-free grammars (CFG) created from either a SAPI XML grammar, or dynamically from the application, or from some other grammar format using the SpGramCompBackend object. In all cases, SAPI reports the contents of the grammars to the engine using the SpGramCompBackend object.

·        Dictation grammars. Here the engine loads and unloads its own dictation language model.

·        Proprietary grammars. There are various calls in SAPI to support engine-specific grammar formats.


Each grammar object can contain a dictation grammar and either a CFG or proprietary grammar.


Each application can have several grammars. In the shared recognizer case, multiple applications can be connected to one recognizer. Thus, grammars can be loaded, unloaded, modified, activated, and deactivated independently of each other. However, the SR engine controls when it is informed of these grammar state changes during recognition (See Synchronization).


5.1       Grammar Creation and Deletion


When an application creates a grammar object, this is reported to the engine using OnCreateGrammar. This passes the engine a grammar handle, as well as the pointer the SR engine returned from the call to OnCreateRecoContext. From this method the engine must also return a pointer, which is used to identify the grammar in later calls from SAPI. OnDeleteGrammar is called to delete grammars.
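The pattern of returning an engine-owned pointer from OnCreateRecoContext and OnCreateGrammar, and receiving it back in later calls, might look like the sketch below. All names here (EngineContext, EngineGrammar, EngineBookkeeping and its methods) are hypothetical; the real DDI methods have different signatures, and the grammar handle type is represented by a plain integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical engine-side records. SAPI only ever sees them as opaque pointers.
struct EngineContext { int id; };
struct EngineGrammar { std::uintptr_t sapiHandle; EngineContext* owner; };

class EngineBookkeeping {
public:
    // Plays the role of OnCreateRecoContext: returns the pointer that
    // SAPI will hand back in later calls about this context.
    void* CreateContext() {
        auto ctx = std::make_unique<EngineContext>(EngineContext{nextId_++});
        void* key = ctx.get();
        contexts_[key] = std::move(ctx);
        return key;
    }

    // Plays the role of OnCreateGrammar: receives SAPI's grammar handle and
    // the context pointer returned earlier, and returns a grammar pointer.
    void* CreateGrammar(std::uintptr_t sapiHandle, void* contextPtr) {
        auto it = contexts_.find(contextPtr);
        if (it == contexts_.end()) return nullptr;   // unknown context
        auto g = std::make_unique<EngineGrammar>(
            EngineGrammar{sapiHandle, it->second.get()});
        void* key = g.get();
        grammars_[key] = std::move(g);
        return key;
    }

    // Play the roles of OnDeleteGrammar and OnDeleteRecoContext.
    bool DeleteGrammar(void* grammarPtr) { return grammars_.erase(grammarPtr) == 1; }
    bool DeleteContext(void* contextPtr) { return contexts_.erase(contextPtr) == 1; }

    std::size_t GrammarCount() const { return grammars_.size(); }

private:
    int nextId_ = 1;
    std::unordered_map<void*, std::unique_ptr<EngineContext>> contexts_;
    std::unordered_map<void*, std::unique_ptr<EngineGrammar>> grammars_;
};
```

Keeping the records in engine-owned tables means a stale or unknown pointer is detected by a failed lookup rather than dereferenced.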


5.2       CFG Grammars


5.2.1        Introduction and terminology

Each CFG grammar contains one or more rules. Rules can be top-level, indicating that they can be activated for recognition. Each rule has an initial state and additional states, which are connected by transitions. Each transition can be one of several types:

·        A word transition indicating a word to be recognized

·        A rule transition indicating a reference to a sub-rule

·        An epsilon (null) transition

·        Some special transitions for such features as embedding dictation within a CFG.


References to sub-rules can be recursive, i.e., rules can reference themselves, either directly or indirectly. Left recursion is not supported and SAPI will reject these grammars upon loading. Inside a grammar, transitions can have semantic properties, although the engine does not normally need to recognize these.


SAPI takes full control of loading a grammar when an application requests it. SAPI can load from a file, URL, resource, or from memory, and can load either binary or XML forms of the grammar, and resolve imports. SAPI then notifies the SR engine about the contents of the grammar through various DDI methods.


5.2.2        Grammar Notifications

WordNotify and RuleNotify notify the engine about CFG grammar information. SAPI calls both methods before recognition begins, when a grammar is first loaded, and during recognition within a Synchronize call if grammars change (See Synchronization).


5.2.3        Word Notifications

The WordNotify call informs the engine about the words in the grammar. A single call is made to either add or remove words. SAPI keeps a reference count internally so that each word will be added only if it is not present in any existing grammar. Each word is represented by an SPWORDENTRY structure:


typedef struct SPWORDENTRY
{
    SPWORDHANDLE    hWord;
    LANGID          LangID;
    WCHAR          *pszDisplayText;
    WCHAR          *pszLexicalForm;
    SPPHONEID      *aPhoneId;
    void           *pvClientContext;
} SPWORDENTRY;



The hWord is a unique handle identifying the word. The pvClientContext is an arbitrary pointer that the SR engine sets with a call to SetWordClientContext. Subsequent calls to GetWordInfo will return the same structure with this field filled in. The LangID field represents the language of the word. Currently this will be the same for all words in a grammar, but in the future SAPI may support multi-lingual grammars.


The pszDisplayText and pszLexicalForm fields give the text of the word. A grammar can define a word with a textual display form that differs from the spoken lexical form used to look up the word in a lexicon. The grammar can also specify the pronunciation of the word. This is given as an array of SPPHONEIDs. See Phone Converters for more detail on phones and phone converters.
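The reference counting that SAPI is described as performing for words can be pictured with the following sketch. It models the behaviour only; WordRefCounts is a hypothetical class, keying on the lexical form is an assumption, and SAPI's internal bookkeeping is not documented here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the reference counting the text attributes to SAPI: a word is
// reported as added only on its first use and as removed only on its last.
class WordRefCounts {
public:
    // Returns true if this is the first grammar to use the word,
    // i.e. the point at which the engine would be told to add it.
    bool AddUse(const std::wstring& lexicalForm) {
        return ++counts_[lexicalForm] == 1;
    }

    // Returns true if no grammar uses the word any more,
    // i.e. the point at which the engine would be told to remove it.
    bool RemoveUse(const std::wstring& lexicalForm) {
        auto it = counts_.find(lexicalForm);
        if (it == counts_.end()) return false;   // word was never added
        if (--it->second > 0) return false;      // still used elsewhere
        counts_.erase(it);
        return true;
    }

private:
    std::map<std::wstring, int> counts_;
};
```

The practical consequence for an engine is that it should not expect one add notification per grammar for a shared word.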


5.2.4        Rule Notifications

The RuleNotify call informs the engine when rules are added, changed or removed. There are five actions that are performed on rules:

·        New rules can be added.

·        Existing rules can be removed.

·        Rules can be activated for recognition.

·        Rules can be deactivated for recognition.

·        Rules can be invalidated, which means the rule has been edited by the application and thus the engine needs to reread the contents of the rule.


Each rule is represented by an SPRULEENTRY structure:


typedef struct SPRULEENTRY
{
    SPRULEHANDLE    hRule;
    SPSTATEHANDLE   hInitialState;
    DWORD           Attributes;
    void           *pvClientRuleContext;
    void           *pvClientGrammarContext;
} SPRULEENTRY;



The hRule is a unique handle identifying the rule. The pvClientRuleContext is a pointer that the engine sets using SetRuleClientContext. Subsequent calls to GetRuleInfo return the same structure but with the pvClientRuleContext field filled in. The pvClientGrammarContext is the pointer that the engine set in OnCreateGrammar. This indicates which grammar the rule belongs to. The Attributes field, of type SPCFGRULEATTRIBUTES, contains flags with extra information about the rule:

·        SPRAF_TopLevel if the rule is top-level and thus can be activated for recognition.

·        SPRAF_Active if the rule is currently activated.

·        SPRAF_Interpreter if the rule is associated with an Interpreter object for semantic processing (See Interpreters).

·        SPRAF_AutoPause if the rule is auto-pause (See Pause and auto-pause).


The hInitialState gives the initial state of the rule.
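One way an engine might organize its response to the five rule actions is sketched below. The RuleAction enumeration, RuleHandle alias, and RuleTable class are placeholders invented for this example; the real action codes and handle types come from the SAPI DDI headers, and the real RuleNotify receives an array of SPRULEENTRY structures.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Local mirror of the five rule actions described above.
// The names are placeholders, not the SAPI enumeration.
enum class RuleAction { Add, Remove, Activate, Deactivate, Invalidate };

using RuleHandle = unsigned;   // stand-in for SPRULEHANDLE

struct RuleState {
    bool active = false;
    bool parsed = false;       // true once the engine has read the rule's states
};

class RuleTable {
public:
    // Applies one action to a batch of rules, as a single notification would.
    void Notify(RuleAction action, const std::vector<RuleHandle>& rules) {
        for (RuleHandle h : rules) {
            switch (action) {
            case RuleAction::Add:        table_[h] = RuleState{}; break;
            case RuleAction::Remove:     table_.erase(h); break;
            case RuleAction::Activate:   if (Known(h)) table_[h].active = true; break;
            case RuleAction::Deactivate: if (Known(h)) table_[h].active = false; break;
            // An invalidated rule keeps its activation but must be re-read.
            case RuleAction::Invalidate: if (Known(h)) table_[h].parsed = false; break;
            }
        }
    }

    void MarkParsed(RuleHandle h) { if (Known(h)) table_[h].parsed = true; }
    bool Known(RuleHandle h) const { return table_.count(h) != 0; }
    const RuleState& Get(RuleHandle h) const { return table_.at(h); }

private:
    std::map<RuleHandle, RuleState> table_;
};
```

Tracking a separate "parsed" flag lets the engine defer re-reading an invalidated rule until it is next needed for recognition.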


5.2.5        States

The SR engine determines the full contents of the rule (either immediately, or later during recognition), using GetStateInfo. This method passes information about all the subsequent states following from any given state. The engine passes a state handle into this method (starting with the hInitialState of the rule), and a pointer to an SPSTATEINFO structure (with all its fields initially set to zero). This structure is filled out with information on all of the transitions out of that state in the pTransitions array. SAPI uses CoTaskMemAlloc to create this array. The engine can call this method again on each of the states following the current state in order to get information about all of the states in the rule. Loop-back transitions are possible in a rule and the engine needs to check that it has not visited the current state before.


When the engine calls GetStateInfo subsequent times, it can call it with the cAllocatedEntries and pTransitions fields unchanged. SAPI re-uses the memory from the transition array, if possible, rather than re-allocating it. Alternatively, the engine can use CoTaskMemFree to free the pTransitions memory, and set these fields to NULL. SAPI will then re-allocate the memory every time.
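The traversal described in this section, including the check for states already visited, can be sketched as a simple graph walk. The callback below stands in for GetStateInfo; StateHandle, Transition, StateInfoFn, and CollectTransitions are hypothetical names, and the convention that handle 0 is the null state is an assumption of the sketch.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <vector>

using StateHandle = unsigned;             // stand-in for SPSTATEHANDLE; 0 = null state
struct Transition { unsigned id; StateHandle next; };

// Stand-in for GetStateInfo: returns the transitions out of one state.
using StateInfoFn = std::function<std::vector<Transition>(StateHandle)>;

// Visits every state reachable from `initial` exactly once and returns the
// IDs of all transitions seen. The visited set stops loop-back transitions
// from causing endless traversal.
std::vector<unsigned> CollectTransitions(StateHandle initial,
                                         const StateInfoFn& getStateInfo) {
    std::vector<unsigned> ids;
    std::set<StateHandle> visited;
    std::vector<StateHandle> pending{initial};
    while (!pending.empty()) {
        StateHandle s = pending.back();
        pending.pop_back();
        // Skip the null state (end of rule) and states seen before.
        if (s == 0 || !visited.insert(s).second) continue;
        for (const Transition& t : getStateInfo(s)) {
            ids.push_back(t.id);
            pending.push_back(t.next);
        }
    }
    return ids;
}
```

An explicit work list is used instead of recursion so that deep rules do not consume call stack.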


5.2.6        Transitions

Each transition represents a link from one state to another state and is represented by an SPTRANSITIONENTRY structure. This structure contains an ID field that uniquely identifies the transition, an hNextState handle that indicates the state the transition is connected to, and a Type field that indicates what type of transition this is.


There are three common types of transition that all engines need to support:


·        Word transitions (SPTRANSWORD). These represent single words that the recognizer recognizes before advancing to the next state. The handle to the word and the word pointer are supplied inside the SPTRANSITIONENTRY structure, which the engine uses to find the full text of the word with GetWordInfo. To produce recognition results, the engine needs to keep track of the transition IDs of word transitions as they are used in ParseFromTransitions.

·        Rule transitions (SPTRANSRULE). These represent transitions into sub-rules. This transition is only passed when a path through the sub-rule has been recognized. The rule handle, engine's rule pointer, and initial state of the sub-rule are supplied. Rules can be recursive, but not left recursive.

·        Epsilon transitions (SPTRANSEPSILON). These are null transitions that can be traversed without recognizing anything.


A state with a transition to a null state handle indicates the end of a rule. There can also be void states, which are blocking and indicate that there is no recognition path from this state. These void paths are indicated by a state having zero transitions out of it.


5.2.7        Special Transitions

There are a number of special transitions that may not be supported by all engines. Attributes in the engine object token indicate whether these are supported:


·        Wildcard transition (SPTRANSWILDCARD). This indicates a transition that matches any word or words (sometimes called a "garbage" model). The engine does not try to recognize the spoken words. The engine includes the string value WildcardInCFG as an attribute in its object token to inform the application that it is capable of supporting this.

·        Dictation transition (SPTRANSDICTATION). This is used to embed dictation within a CFG. Each transition means one word should be recognized. The attribute DictationInCFG in the engine object token indicates support for this feature.

·        Text buffer transition (SPTRANSTEXTBUF). This indicates that the engine is to recognize a sub-string of words from the text-buffer, if it has been set. (See Text-buffers).


5.2.8        Semantic Properties

Application developers are able to put properties (also known as semantic tags) within a grammar. This provides a powerful means for semantic information to be easily embedded inside a grammar.


By default, the engine does not recognize these properties. Typically, an engine simply recognizes the speech from the words in the grammar, and SAPI parses and adds the property information in the ParseFromTransitions call. However, it is possible for an engine to receive this information by calling GetTransitionProperty on any transition. If there is a property on this transition, the property name or ID, and value are returned in the SPTRANSITIONPROPERTY structure. When finished, SPTRANSITIONPROPERTY must be freed using CoTaskMemFree.


5.2.9        Additional topics

Ordering of actions

SAPI notifies engines about the contents of CFGs in a logical order. When a grammar is loaded, all new words are notified first with a WordNotify call, then all rules with a RuleNotify call. After that, rules are activated and deactivated. When a grammar is deleted, all rules are removed first and then all words.


All the rules and words for each grammar are added with single calls to RuleNotify and WordNotify. If a number of grammars are being loaded and rules are being activated, separate calls are made for each grammar. Engines that have a time-consuming internal grammar compilation to do before starting recognition should try to avoid unnecessary recompilations. Where possible, applications should try to combine separate CFG grammars into one to minimize compilations.

Grammar Weights

Grammar designers use a Weight field on each transition to change the likelihood of certain paths being taken. This Weight field is a probability - the range of values is 0.0 to 1.0, and the values of the transitions out of any state sum to 1.0. A value of 0.0 should always be interpreted as making this transition impossible to pass during recognition. Engines may or may not incorporate the other weight values into their recognition search. By default, grammars do not have weights set, so each transition weight will be 1.0 divided by the number of transitions out of the preceding state.

Required Confidence

Each transition also contains a RequiredConfidence field. Grammar designers use this field to set how easily recognitions are to be accepted or rejected. For example, this field is used to avoid false positives on critical actions, such as delete file, while less critical actions, such as scroll down, remain unaffected. See Required Confidence and Rejection for more information on how this field can be used for rejection.

Text-buffers

Using this feature, an application can define a text buffer. When a text-buffer transition is reached in a CFG, the engine attempts to recognize a sub-string of words from the text buffer.


The text buffer is set by the application using ISpRecoGrammar::SetWordSequenceData, and reported to the engine by ISpSREngine::SetWordSequenceData. The format of the buffer is a sequence of one or more null-terminated strings, with a double null-termination at the end. The engine recognizes any sub-string of words from any of the strings in the buffer. This provides a very simple way for applications to select from a set of text.
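The buffer layout described above (null-terminated strings followed by an extra null) can be split with a short helper. SplitWordSequenceBuffer is a hypothetical name, and the sketch assumes the caller knows the total character count of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits a buffer laid out as "string\0string\0\0" into its strings.
// `cch` is the total number of characters available, including terminators;
// parsing stops at the first empty string or at the end of the buffer.
std::vector<std::wstring> SplitWordSequenceBuffer(const wchar_t* buffer,
                                                  std::size_t cch) {
    std::vector<std::wstring> strings;
    std::size_t pos = 0;
    while (pos < cch && buffer[pos] != L'\0') {
        std::size_t end = pos;
        while (end < cch && buffer[end] != L'\0') ++end;
        strings.emplace_back(buffer + pos, end - pos);
        pos = end + 1;                 // skip this string's terminator
    }
    return strings;
}
```

Bounding every read by `cch` keeps the helper safe even if the final double terminator is missing from the buffer.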


It is also possible for the application to alter the areas of the buffer to be used for recognition. This is done using SetTextSelection with the structure SPTEXTSELECTIONINFO. The ulStartActiveOffset and cchActiveChars indicate which area of the buffer should be active for recognition.


The other two fields of the SPTEXTSELECTIONINFO, ulStartSelection and cchSelection, are used with dictation to indicate which area of the buffer is currently selected on screen. If cchSelection is zero, this indicates where the insertion point currently is. The engine could use this insertion point to get extra language model context from the preceding words in the dictated text.
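How an engine might interpret the four SPTEXTSELECTIONINFO fields is sketched below. TextSelectionSketch is a local mirror that reuses the field names from the text; the helper functions, the clamping behaviour, and the decision to return no context when a non-empty range is selected are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Local mirror of the four fields named in the text; the real structure is
// SPTEXTSELECTIONINFO from the SAPI headers.
struct TextSelectionSketch {
    unsigned long ulStartActiveOffset;
    unsigned long cchActiveChars;
    unsigned long ulStartSelection;
    unsigned long cchSelection;
};

// Returns the part of `text` that is active for recognition,
// clamped to the bounds of the buffer.
std::wstring ActiveRegion(const std::wstring& text, const TextSelectionSketch& sel) {
    std::size_t start = std::min<std::size_t>(sel.ulStartActiveOffset, text.size());
    return text.substr(start, sel.cchActiveChars);   // substr clamps the length
}

// Returns the text before the insertion point, which an engine could use as
// language-model context. Returns an empty string when a non-empty range is
// selected (an assumption of this sketch).
std::wstring ContextBeforeInsertionPoint(const std::wstring& text,
                                         const TextSelectionSketch& sel) {
    if (sel.cchSelection != 0) return std::wstring();
    std::size_t end = std::min<std::size_t>(sel.ulStartSelection, text.size());
    return text.substr(0, end);
}
```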


This text buffer feature is optional for engines, and support for it is defined in the WordSequences attribute. See Object Tokens and Registry Settings for more information.

Rule invalidations

When a rule is invalidated with a RuleNotify call with SPCFGN_INVALIDATE action, the engine needs to discard any cached information it has regarding this rule and parse the rule again from the new initial state. If a sub-rule is edited in a grammar, only that rule is invalidated, and not other rules referring to this rule.

Grammar Resources

It is possible for resource data to be included in a grammar. Each rule can contain one or more named strings containing arbitrary data. The engine recovers this data using GetResource, and passing in the rule handle and resource name.


5.3       Dictation Grammars


The mechanism used for handling dictation grammars is considerably simpler than for CFG grammars. SAPI instructs the engine to load a dictation grammar (or Statistical Language Model) with the LoadSLM method. LoadSLM supplies the engine's pointer to the grammar and a topic name. The topic name is by default NULL, and uses the standard dictation language model, but an application can request a specific topic. Currently, the only topic SAPI defines is Spelling for a spelling-mode grammar, although engines are free to define others. Dictation is unloaded with UnloadSLM and activated or deactivated for recognition with SetDictationState.


5.3.1        Language model adaptation

An application supplies text data to the engine for language model adaptation with SetAdaptationData. The engine is free to do nothing or anything with this data, and to either persist the adaptation in the current RecoProfile or reset every session. Because some engines take considerable processing to do adaptation, it is recommended that applications submit adaptation data in chunks. When the engine is ready to receive more data, it fires an event SPEI_ADAPTATION. If an engine does not have this performance issue, it can send the event immediately.
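On the application side, submitting adaptation data in chunks could start from a helper like the one below. ChunkAdaptationText is a hypothetical function and the chunk size is left to the caller; the sketch only splits the text. Passing each chunk to SetAdaptationData and waiting for SPEI_ADAPTATION before sending the next one is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits `text` into pieces of at most `maxChars` characters, breaking at
// spaces where possible so that words are not cut in half.
std::vector<std::wstring> ChunkAdaptationText(const std::wstring& text,
                                              std::size_t maxChars) {
    std::vector<std::wstring> chunks;
    if (maxChars == 0) return chunks;          // nothing sensible to do
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(maxChars, text.size() - pos);
        if (pos + len < text.size()) {
            // Back up to the last space inside the window, if there is one.
            std::size_t space = text.rfind(L' ', pos + len);
            if (space != std::wstring::npos && space > pos) len = space - pos;
        }
        chunks.push_back(text.substr(pos, len));
        pos += len;
        while (pos < text.size() && text[pos] == L' ') ++pos;  // skip separators
    }
    return chunks;
}
```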


5.4       Proprietary grammars


An engine and application can use a grammar format different from the standard SAPI binary or XML formats. LoadProprietaryGrammar and UnloadProprietaryGrammar are used to load and unload such grammars. LoadProprietaryGrammar is called when the application calls ISpRecoGrammar::LoadProprietaryGrammar. The application can supply the engine with string data, binary data, or a GUID, or some combination of these. SAPI does not touch this data in any way apart from marshaling it between the application and shared engine.


To activate rules in a proprietary grammar, call SetProprietaryRuleState or SetProprietaryRuleIdState. The first of these methods takes an optional string pointer, where NULL is interpreted to mean activate all top-level rules. These methods are used so that ISpRecoGrammar::SetRuleState works consistently on SAPI CFGs and Proprietary grammars. The engine must set the pcRulesChanged value to inform SAPI how many rules have been activated or deactivated.
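The NULL-means-all-top-level-rules convention and the changed-rule count can be sketched as follows. ProprietaryGrammarSketch and its methods are hypothetical; the real SetProprietaryRuleState has a different signature, and counting only rules whose state actually changed is an assumption about what pcRulesChanged should report.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical engine-side table of rules in a proprietary grammar.
struct ProprietaryRule { bool topLevel; bool active; };

class ProprietaryGrammarSketch {
public:
    void AddRule(const std::wstring& name, bool topLevel) {
        rules_[name] = ProprietaryRule{topLevel, false};
    }

    // A null name addresses every top-level rule; otherwise only the named
    // rule is affected. Returns the number of rules whose state changed,
    // which is the value the engine would report through pcRulesChanged.
    unsigned long SetRuleState(const wchar_t* name, bool activate) {
        unsigned long changed = 0;
        for (auto& entry : rules_) {
            ProprietaryRule& rule = entry.second;
            bool selected = name ? (entry.first == name) : rule.topLevel;
            if (selected && rule.active != activate) {
                rule.active = activate;
                ++changed;
            }
        }
        return changed;
    }

private:
    std::map<std::wstring, ProprietaryRule> rules_;
};
```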


There are some extra methods that an SR engine needs to implement if it is using proprietary grammars. When a grammar is activated or deactivated, SAPI calls SetGrammarState. SetContextState activates or deactivates a context. (With CFG grammars, SAPI automatically deactivates or activates all relevant rules, so these methods are not needed.)


5.4.1        Porting other grammar formats

A disadvantage of proprietary grammar formats is that they are not engine or application independent. It is possible to write code that will convert any finite-state or CFG-based grammar format into the compiled SAPI binary format. For example, SAPI XML grammars are compiled using the object CLSID_SpGrammarCompiler. This object uses the CLSID_SpGramCompBackend object to add the correct rules, states, and transitions into the grammar, and then saves the grammar to an IStream using SetSaveObjects and Commit.


Exactly the same approach can be used for other formats. A new COM object could be created using the compiler back-end methods to add the correct information and save the grammar (See the ISpGrammarBuilder interface documentation for more info). The grammar produced is a standard SAPI compiled grammar and can be used by any SAPI engine or application, without having to use the proprietary grammar support.


6         Lexicon handling


SAPI provides a means for users or applications to specify new words and their pronunciations in lexicons. An engine should look at these words and pronunciations and use them during recognition.


There are two types of lexicons in SAPI:

·        User Lexicon. There is a User Lexicon for the current user logged onto the computer. It is initially empty but the user can add to it, either programmatically, or using an engine's add/remove words UI component. For example, running the Dictation Pad sample application and choosing Add/Remove Words will access the Microsoft engine UI where the user can add new words to the user lexicon.

·        Application lexicons. Applications can create and ship their own lexicons of specialized words. These are read-only.


Both types of lexicon are represented by object tokens. (Application Lexicons are stored in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\AppLexicons\ and the current user lexicon in HKEY_CURRENT_USER\SOFTWARE\Microsoft\Speech\CurrentUserLexicon).


Each lexicon object implements the ISpLexicon interface. Lexicons can be individually created from their object tokens. However, SAPI provides a Container Lexicon that combines the user and application lexicons into one single entity, making manipulating the lexicon information simpler. This class is SpLexicon (CLSID_SpLexicon).


6.1       Using lexicons


Typically, when an engine runs dictation, it will use CoCreateInstance to create the SpLexicon object and determine which words and pronunciations are in it. The engine adds these to its dictation vocabulary and language model if it supports this feature. If the lexicon contains words with no pronunciation, the engine should try to generate pronunciations for them if it supports this feature. Similarly, if a CFG grammar is loaded containing words that are in the lexicon, the engine should use the pronunciations from the lexicon.


Applications can call IsPronounceable on the RecoGrammar object to determine if the engine can generate a pronunciation for a word. Engines set the BOOL pointer passed in to TRUE if a pronunciation was found, and to FALSE otherwise.


An engine can also add words into the user lexicon using AddPronunciation on the SpLexicon. (See Lexicon interfaces for more information.)


As an option, an engine can also use the basic SAPI lexicon classes (CLSID_SpCompressedLexicon for read-only lexicons and CLSID_SpUncompressedLexicon for read-write lexicons) for its own internal lexicons.


6.2       Phone Converters


For each language, SAPI defines a phone set describing which phones can be used in defining the pronunciation of a word. Currently, phone sets are defined for US English, Japanese and Chinese, and more will be added in later updates. Phone converters are represented as object tokens in the registry. Each phone converter object implements the interface ISpPhoneConverter. This has two methods, PhoneToId and IdToPhone, to convert between phone strings and phone IDs.


The pronunciations returned from a lexicon are returned as null-terminated arrays of SPPHONEID. All the methods the engine sees which use pronunciations also use these arrays of phone IDs, so the engine may never need to use a phone converter directly. However, to create a phone converter for a specific language, an engine uses SpCreatePhoneConverter in sphelper.h:


      ISpPhoneConverter *pPhoneCon = NULL;

      // 0x409 is the LANGID for US English.
      HRESULT hr = SpCreatePhoneConverter(0x409, NULL, NULL, &pPhoneCon);


If a vendor wishes to implement an engine in a language for which Microsoft does not currently define a phone converter, it will not be possible to use lexicons or grammars with pronunciations in them. For more information, see Universal Phone Set White Paper.


7         Recognition and audio


SAPI indicates that the engine should start recognition by calling RecognizeStream on the SR engine. From that point on, the engine can read data, perform recognition, and send results and events back to SAPI. When all the data has been recognized or the application has deactivated recognition, the engine finishes processing and returns from the RecognizeStream call.


Thus the basic actions that take place are as follows:

·        SAPI calls RecognizeStream on the engine.

·        The engine starts reading data and doing recognition.

·        The engine calls Synchronize and UpdateRecoPos to be informed of grammar changes.

·        The engine returns events, hypotheses, and recognitions to SAPI.

·        Recognition continues until the stream is terminated and the engine returns from RecognizeStream.


7.1       RecognizeStream


RecognizeStream is normally called after grammars have been loaded and activated. The engine recognizes from the rules that are active. If multiple rules or dictations are active, the engine recognizes from all of them in parallel, i.e., the user is able to say a word from any rule or dictation that is active.


7.1.1         Active always state

There is only one case in which RecognizeStream can be called when there are no active rules: when the application has set the RecoState to SPRST_ACTIVE_ALWAYS (with ISpRecognizer::SetRecoState). An application does this when it wants audio running in order to display a VU-meter (by listening to SPEI_SR_AUDIO_LEVEL events). In this case, RecognizeStream is called regardless of whether there are any active rules. The engine is free to throw the data away, although engines can use this data to perform environmental adaptation or noise level estimation.


7.2       Reading audio


7.2.1        How audio formats are represented in SAPI

Two fields are used to define audio formats in SAPI: A GUID defining the class of format; and for wav format types, a WAVEFORMATEX structure that contains type, sample rate, bits per sample etc. All wav format types have the GUID SPDFID_WaveFormatEx. Engines can use other GUIDs for any engine-specific formats they have. There is a helper class CSpStreamFormat in sphelper.h that converts to and from this format. Also, the SPSTREAMFORMAT enumeration lists commonly used formats.


The WAVEFORMATEX structure has the following definition:


typedef struct WAVEFORMATEX

{

    WORD    wFormatTag;

    WORD    nChannels;

    DWORD   nSamplesPerSec;

    DWORD   nAvgBytesPerSec;

    WORD    nBlockAlign;

    WORD    wBitsPerSample;

    WORD    cbSize;

} WAVEFORMATEX;



For example, the format for mono, PCM linear, 16-bit, 16kHz audio would be indicated by:

GUID guid = SPDFID_WaveFormatEx and

WAVEFORMATEX wfx = { WAVE_FORMAT_PCM, 1, 16000, 32000, 2, 16, 0 }
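
For linear PCM, the values 32000 and 2 in the example above follow from the other fields: nBlockAlign is nChannels × wBitsPerSample / 8, and nAvgBytesPerSec is nSamplesPerSec × nBlockAlign. The following is a minimal sketch of that arithmetic. PcmFormat, kWaveFormatPcm and MakePcmFormat are hypothetical names introduced for illustration (they are not part of SAPI), and a local stand-in struct is used so that the sketch compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in with the same fields as WAVEFORMATEX (hypothetical,
// used here only so the sketch builds without the Windows headers).
struct PcmFormat {
    std::uint16_t wFormatTag;
    std::uint16_t nChannels;
    std::uint32_t nSamplesPerSec;
    std::uint32_t nAvgBytesPerSec;
    std::uint16_t nBlockAlign;
    std::uint16_t wBitsPerSample;
    std::uint16_t cbSize;
};

// WAVE_FORMAT_PCM is defined as 1 in the Windows multimedia headers.
constexpr std::uint16_t kWaveFormatPcm = 1;

// Builds a linear PCM format, deriving nBlockAlign and nAvgBytesPerSec
// from the channel count, sample rate, and sample width.
inline PcmFormat MakePcmFormat(std::uint16_t channels,
                               std::uint32_t samplesPerSec,
                               std::uint16_t bitsPerSample) {
    assert(bitsPerSample % 8 == 0);  // this sketch handles whole-byte samples only
    PcmFormat f{};
    f.wFormatTag = kWaveFormatPcm;
    f.nChannels = channels;
    f.nSamplesPerSec = samplesPerSec;
    f.wBitsPerSample = bitsPerSample;
    // Bytes per sample frame (one sample for every channel).
    f.nBlockAlign = static_cast<std::uint16_t>(channels * bitsPerSample / 8);
    // Bytes per second of audio.
    f.nAvgBytesPerSec = samplesPerSec * f.nBlockAlign;
    f.cbSize = 0;  // no extra format bytes follow for plain PCM
    return f;
}
```

Calling MakePcmFormat(1, 16000, 16) yields the same field values as the mono, 16-bit, 16kHz initializer shown above.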


7.2.2        Setting the audio format

To determine what audio formats the engine supports, SAPI calls GetInputAudioFormat on the SR engine. The first pair of parameters of this method indicates the format that SAPI is asking about; the engine fills in the second pair of parameters to indicate which format it supports. Alternatively, SAPI calls this method with the first format set to NULL, in which case the engine sets the second pair of parameters to its preferred format.


Once the format has been agreed through GetInputAudioFormat, the RecognizeStream call specifies the current audio format in the rguidFmtId and pWavFormatEx parameters. The format does not change for the duration of that RecognizeStream call.


7.2.3        Reading data

Read is used to read data from the audio source. The SR engine requests the amount of data to be read. SAPI returns this data immediately if it is available, or will block until that amount of data is available. If the Read call returns a failure code or it reports that less data has been read than requested, the stream has ended and the engine can return from the RecognizeStream call once it has processed any data it has buffered.
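
The end-of-stream rule described above (a failure code, or fewer bytes than requested, means the stream has ended) can be sketched as a simple loop. This is an illustration under stated assumptions, not engine code: FakeAudioSite is a hypothetical stand-in for the SAPI site object, its Read method only mimics the general shape of the real call (buffer, byte count, bytes-read output, and a result code where negative means failure), and ReadUntilEndOfStream is an invented helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the SAPI site; holds a fixed amount of audio.
class FakeAudioSite {
public:
    explicit FakeAudioSite(unsigned long totalBytes) : remaining_(totalBytes) {}

    // Returns 0 on success and reports how many bytes were delivered.
    long Read(void* pv, unsigned long cb, unsigned long* pcbRead) {
        const unsigned long n = std::min(cb, remaining_);
        if (n > 0) std::memset(pv, 0, n);  // deliver silence
        remaining_ -= n;
        *pcbRead = n;
        ++readCalls;
        return 0;
    }

    int readCalls = 0;

private:
    unsigned long remaining_;
};

// Reads fixed-size chunks until Read fails or returns a short read,
// and returns the total number of bytes received.
template <typename Site>
std::size_t ReadUntilEndOfStream(Site& site, std::size_t chunkBytes) {
    assert(chunkBytes > 0);
    std::vector<unsigned char> buffer(chunkBytes);
    std::size_t total = 0;
    for (;;) {
        unsigned long got = 0;
        const long hr = site.Read(buffer.data(),
                                  static_cast<unsigned long>(buffer.size()),
                                  &got);
        if (hr >= 0) {
            total += got;
            // A real engine would hand buffer[0..got) to recognition here.
        }
        // Failure or a short read both mean the stream has ended.
        if (hr < 0 || got < buffer.size()) break;
    }
    return total;
}
```

After the loop exits, a real engine would finish processing any buffered data before returning from RecognizeStream.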


It is possible to see how much data is available for reading immediately using the DataAvailable call. An engine can use this call to read only the data that is already available, without blocking. A Win32 event, passed as the hDataAvailable parameter to RecognizeStream, is set when a certain amount of data is available. SetBufferNotifySize sets the amount of data that triggers this event.
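
A non-blocking read built on DataAvailable might look like the sketch below. FakePollingSite is a hypothetical stand-in whose DataAvailable and Read methods only mimic the general shape of the site calls, and ReadWithoutBlocking is an invented helper name; the real interface, error handling, and event waiting are not shown.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the SAPI site; all queued data is readable at once.
class FakePollingSite {
public:
    explicit FakePollingSite(unsigned long queuedBytes) : remaining_(queuedBytes) {}

    // Reports how many bytes could be read right now without blocking.
    long DataAvailable(unsigned long* pcb) {
        *pcb = remaining_;
        return 0;
    }

    long Read(void* pv, unsigned long cb, unsigned long* pcbRead) {
        const unsigned long n = std::min(cb, remaining_);
        if (n > 0) std::memset(pv, 0, n);
        remaining_ -= n;
        *pcbRead = n;
        ++readCalls;
        return 0;
    }

    int readCalls = 0;

private:
    unsigned long remaining_;
};

// Reads at most buffer.size() bytes, and only as much as is already
// available, so the call to Read is not expected to block.
// Returns the number of bytes read (0 if nothing was available).
template <typename Site>
unsigned long ReadWithoutBlocking(Site& site, std::vector<unsigned char>& buffer) {
    unsigned long available = 0;
    if (site.DataAvailable(&available) < 0 || available == 0) return 0;
    const unsigned long want =
        std::min(available, static_cast<unsigned long>(buffer.size()));
    if (want == 0) return 0;  // empty buffer: nothing to do
    unsigned long got = 0;
    if (site.Read(buffer.data(), want, &got) < 0) return 0;
    return got;
}
```

An engine running on a single thread could call a helper like this between slices of recognition work, so that it is never stalled inside Read while there is processing to do.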


The engine should try to read data as close to real-time as possible. SAPI has only a finite buffer of audio data, and if the engine's reading lags by more than a certain amount of time (approx. 30 seconds currently), SAPI terminates the stream. The engine has the option of keeping its own buffer of data that it has read but not yet processed.


The amount of data requested in the Read call is in bytes. All stream positions used in sending events, recognition information etc. back to SAPI are in bytes also.
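
Because positions are expressed in bytes, an engine that works internally in time units has to convert using the stream format. The sketch below shows one way to do this for PCM data, using the nAvgBytesPerSec and nBlockAlign values from the WAVEFORMATEX. BytesToMilliseconds and AlignDownToBlock are hypothetical helper names, not SAPI functions.

```cpp
#include <cassert>
#include <cstdint>

// Converts a byte offset in the stream to milliseconds of audio.
// Assumes constant-rate data (e.g. PCM) and an offset small enough
// that byteOffset * 1000 does not overflow 64 bits.
inline std::uint64_t BytesToMilliseconds(std::uint64_t byteOffset,
                                         std::uint32_t avgBytesPerSec) {
    assert(avgBytesPerSec != 0);
    return byteOffset * 1000 / avgBytesPerSec;
}

// Rounds a byte offset down to the start of a sample frame, so that a
// reported position never falls in the middle of a sample.
inline std::uint64_t AlignDownToBlock(std::uint64_t byteOffset,
                                      std::uint16_t blockAlign) {
    assert(blockAlign != 0);
    return byteOffset - (byteOffset % blockAlign);
}
```

For the 16kHz, 16-bit mono format (32000 bytes per second), a stream position of 16000 bytes corresponds to 500 milliseconds.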


7.2.4        Information about the audio input

The SR engine does not directly have access to the audio input device. SAPI handles this so that the SR engine is consistent regardless of the type of input. However, there is some information about the stream that the engine receives from parameters in the RecognizeStream call:


·        The rguidFmtId and pWavFormatEx parameters indicate the format of the stream.

·        pAudioObjectToken points to the object token that represents the audio input device.

·        The fRealTimeAudio Boolean indicates whether the input is real-time. Real-time inputs in SAPI are those that implement the ISpAudio interface.  An example of this is the standard multi-media microphone input. Non-real time streams are those that only implement ISpStreamFormat. An example of this is inputting wav files using the ISpStream object. With non real-time streams, all the data is available for reading immediately. The hDataAvailable event is always set and DataAvailable always returns INFINITE.

·        fNewAudioStream specifies whether this call to RecognizeStream is a restarting of an existing stream or a new stream. For example, suppose an application deactivates the active rules and RecognizeStream returns. If the application later activates some rules, the next RecognizeStream call will have this parameter set to FALSE. The parameter is TRUE only if the application has set a new input (with SetInput). Some engines might find this useful because they can preserve environmental information between calls to RecognizeStream and reset it only when the input has really changed.


7.2.5        Setting the input gain

The engine can alter the gain of the input, if the input is from an audio device passing through the Windows Mixer. The engine can store a value in the RecoProfile for each device indicating what the gain should be, and SAPI sets the gain on this device in the mixer every time the audio is opened. The engine can set this value in the RecoProfile either when calibrating a microphone within its Microphone Training UI component, or at any other point.


The value stored in the profile must be in a subkey with the same name as the CLSID of the main engine object. The value name should have the token ID of the audio input object currently being used (found by calling GetId on the audio input object token). The value should be a DWORD value between 0 and 10000 indicating the mixer level to set.


7.3       Threading model


With SAPI, engines use a very simple threading model. When recognition is not occurring (i.e., when RecognizeStream is not executing on the engine), SAPI calls the engine on only one thread. During RecognizeStream, SAPI only calls the engine when the engine itself calls Synchronize. Thus, the engine can control when it is called back. These callbacks also occur on only one thread.


Because the engine does not return from RecognizeStream until all recognition is complete, SAPI has effectively given the engine one thread on which to operate. It is possible to write an engine to do all its work on this one thread and thus require no additional threads, critical sections or other thread-locking. This one-thread system works best if the engine is not blocked unnecessarily during Read calls when it could be performing recognition. Use DataAvailable to achieve this.


An alternative would be to have one additional thread. In this case, one thread could read the data, and possibly perform feature extraction, while another thread could do the actual recognition processing. Other threading arrangements are possible, and SAPI makes no restrictions about which threads call which methods or whether the methods are called simultaneously.


7.4       Synchronization


Synchronize informs SAPI that the engine is ready to receive any pending actions, such as grammar changes, rule activations/deactivations, or private calls. When the engine calls Synchronize, the engine is called from SAPI to perform these actions. For example, if the application has changed a grammar and called Commit, the next time the engine calls Synchronize, SAPI will call the engine back on RuleNotify and WordNotify with details of the modified grammar. When the engine returns from these methods, SAPI returns from Synchronize as long as the engine is not in a paused state. Synchronize is also called internally by SAPI when the engine sends a final recognition (See Events and Recognitions).


The engine has complete control over when Synchronize is called. For example, engines may want to handle grammar changes only when the user is not speaking, and not while they are actually performing recognition. Also, a Win32 event, hRequestSync, is passed as a parameter in the RecognizeStream call. This event is set when an action is queued for the engine to respond to when it calls Synchronize. An engine can call Synchronize regularly, or only when this event is set.


The more quickly the engine responds to queued tasks, the more responsive it will seem to a user. An engine should not wait too long before calling Synchronize because in some cases an application will hang until the engine does so. For example, the sample SR engine does not normally call Synchronize when it has detected speech. However, if speech has been detected for a long time and the hRequestSync event is set, the engine will always call Synchronize to prevent hanging.


The stream position given as a parameter to Synchronize indicates the point before which the engine will not fire any events. SAPI discards its stored audio before this point, and thus the engine cannot fire any events or report recognitions before this point. This position does not need to be exactly where the engine is currently recognizing; it can be any earlier point in the stream, provided the engine fires no events before it.


UpdateRecoPos is another method that an engine should call regularly during recognition. It informs SAPI of the engine's position in recognizing the stream. This is currently used to ensure that Bookmarks are fired correctly, and to keep the application informed of the recognizer position (using the ullRecognitionStreamPos field returned from ISpRecognizer::GetStatus).


7.4.1        Pause and auto-pause

It is possible to put an engine into a paused state. This happens for one of three reasons:

·        The application called ISpRecoContext::Pause.

·        A rule, which the application activated as SPRS_ACTIVE_WITH_AUTO_PAUSE, was recognized.

·        A bookmark event of type SPBO_PAUSE has been reached.


When in the paused state, SAPI does not return to the engine from a call to Synchronize or a final recognition. Instead, SAPI keeps control and calls back into the engine to inform it of any grammar changes that may occur. In effect, this is the same as a normal call to Synchronize or a final recognition, except that SAPI waits inside the call until the application resumes the engine. While in the paused state, the engine is still able to Read data if it has another thread running. If the engine also performs sound start/end detection in this thread, it could fire those events to the application.


In the paused state, an application can make grammar and state changes at specific points. Normally grammar changes only occur the next time an engine calls Synchronize. Thus, if an application wanted to make a number of changes, some might occur during one Synchronize call and some in another, which might not produce the best results. Using pause, the application makes changes: after each recognition using Auto-Pause; at a specific point in the stream using Bookmarks; or just as soon as possible using Pause. The application can make as many grammar changes as needed. When the application calls Resume, the engine continues recognizing, without having lost any data, but with all the grammar changes having been reported.


7.5       Events and Recognitions


7.5.1        Standard events

In order to report to the application information about what is being recognized, there are several events the engine can report. These indicate for example, that the engine has detected the start or end of speech, or that it has a hypothesis or a completed recognition result. The main events used to report the progress of recognition are as follows:


·        Sound Start. Used to indicate that the start of some speech-like sound has been detected. Reported by calling AddEvent with event type SPEI_SOUND_START.

·        Sound End. Used to indicate that the end of some speech-like sound has been detected. Reported by calling AddEvent with event type SPEI_SOUND_END.

·        Phrase Start. Used to report the start of some speech that the engine recognizes as an utterance matching the currently active grammar. Reported by calling AddEvent with event type SPEI_PHRASE_START.

·        Final recognition. Used to return results of the recognition of an utterance. Reported by calling Recognition.

·        False recognition. Used to indicate that the engine attempted recognition of the utterance but rejected it on the basis of low confidence scores, inability to find a valid path, etc. Indicated by calling Recognition with the eResultType having the SPRT_FALSE_RECOGNITION flag set.

·        Hypothesis. Used to report a partial recognition of the utterance. Indicated by calling Recognition with the fHypothesis flag set.


AddEvent takes as parameters an SPEVENT structure, and an SPRECOCONTEXTHANDLE which should be set to NULL. The SPEVENT has the following fields:


SPEVENTENUM        eEventId;

SPEVENTLPARAMTYPE  elParamType;

ULONG       ulStreamNum;

ULONGLONG   ullAudioStreamOffset;

WPARAM      wParam;

LPARAM      lParam;


The elParamType should be set to SPET_LPARAM_IS_UNDEFINED, and the lParam and wParam are set to NULL to indicate that no extra information is returned with these events. The ulStreamNum can also be set to 0, as SAPI fills this field in before returning the event to the application. The eEventId indicates the type of event (SPEI_SOUND_START, SPEI_SOUND_END, or SPEI_PHRASE_START). The ullAudioStreamOffset indicates the position in the audio stream where the engine decides this event has happened.


Recognition is used to send hypotheses and final or false recognitions.


7.5.2        Event ordering

There are various requirements for the chronological ordering and stream positions of how these events are reported:


·        Sound start and sound end events form a pair. Every sound start call must later have a sound end call with a later stream position.

·        Each phrase start event forms a pair with either a final or false recognition. The recognition must be fired later and have stream positions later than the phrase start event. Between the phrase start and recognition some hypotheses can optionally be located.

·        Each phrase start/recognition pair must have stream positions inside a sound start/sound end pair, i.e., if part of the stream has been determined to be a phrase it must also be speech. Zero, one, or more than one phrase start/recognition pairs can sit between a sound start/end. Different phrase start/recognition pairs cannot overlap. If part of the stream is determined to be in one utterance, it cannot belong to other utterances.

·        Although the phrase starts and recognitions must have stream positions within sound start/end events, the actual time sequence of the firing of these events can vary. For example, if an engine has an independent speech detector, which can determine the end of speech before the recognition has completed, it can fire the sound end event before a recognition event.
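
The ordering rules above can be expressed as a small self-check that an engine developer might run over a log of fired events. The sketch below is illustrative only: EventKind, FiredEvent and CheckEventOrdering are hypothetical names, a recognition is represented by a single stream position (its end position), and the sketch additionally assumes that sound start events are not nested and that phrase starts are not nested, which the text implies but does not state outright.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class EventKind { SoundStart, SoundEnd, PhraseStart, Recognition };

struct FiredEvent {
    EventKind kind;
    std::uint64_t pos;  // stream position in bytes
};

struct Span {
    std::uint64_t begin;
    std::uint64_t end;
};

// Events are given in the order they were fired. Pairing follows firing
// order; containment and overlap are checked on stream positions only,
// so a sound end may be fired before the recognition it encloses.
inline bool CheckEventOrdering(const std::vector<FiredEvent>& fired) {
    std::vector<Span> sounds, phrases;
    bool soundOpen = false, phraseOpen = false;
    std::uint64_t soundBegin = 0, phraseBegin = 0;

    for (const FiredEvent& e : fired) {
        switch (e.kind) {
        case EventKind::SoundStart:
            if (soundOpen) return false;  // assumption: no nesting
            soundOpen = true;
            soundBegin = e.pos;
            break;
        case EventKind::SoundEnd:
            if (!soundOpen || e.pos <= soundBegin) return false;
            sounds.push_back({soundBegin, e.pos});
            soundOpen = false;
            break;
        case EventKind::PhraseStart:
            if (phraseOpen) return false;  // assumption: no nesting
            phraseOpen = true;
            phraseBegin = e.pos;
            break;
        case EventKind::Recognition:
            if (!phraseOpen || e.pos <= phraseBegin) return false;
            phrases.push_back({phraseBegin, e.pos});
            phraseOpen = false;
            break;
        }
    }
    if (soundOpen || phraseOpen) return false;  // unpaired start event

    for (std::size_t i = 0; i < phrases.size(); ++i) {
        const Span& p = phrases[i];
        // Each phrase must lie inside some sound start/end pair.
        bool inside = false;
        for (const Span& s : sounds) {
            if (s.begin <= p.begin && p.end <= s.end) inside = true;
        }
        if (!inside) return false;
        // Phrases must not overlap one another.
        for (std::size_t j = 0; j < i; ++j) {
            if (p.begin < phrases[j].end && phrases[j].begin < p.end) return false;
        }
    }
    return true;
}
```

A check like this is only a debugging aid; SAPI's own validation of event ordering may differ in detail.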


7.5.3        Other events

There are several other event types the engine can fire to indicate other information to SAPI and applications:

  • SPEI_ADAPTATION is used to indicate that the engine is ready to receive more text adaptation data (See Language model adaptation).
  • SPEI_REQUEST_UI is used to request a display of one of the engine's UI components. Both these events are called with a stream position that is either valid or zero (See User-Interface).
  • SPEI_INTERFERENCE is used to indicate to applications that there is a problem with the audio stream or user speech. This event is called with the LPARAM set to one of the SPINTERFERENCE values to indicate the nature of the problem (no signal, clipping, user speaking too fast etc). Applications can choose whether to respond to these events.
  • SPEI_SR_PRIVATE is an event type that engines use for engine-specific communication with applications. Engines should include some unique identifier in the data sent with the event to distinguish the use of this event from another engine's. The SPRECOCONTEXTHANDLE parameter in the AddEvent call can also be used to send this event to a particular RecoContext, rather than to all contexts.


7.6       Completion of processing


An engine should continue recognizing as long as data is available and it is not signaled to stop. Typically, after an engine reports a recognition, it checks for grammar changes and continues reading data and recognizing.


The Read call indicates that the engine should finish recognition when there is no more data to read from the stream. The engine should finish processing the data it has, sending events and recognitions as necessary, and return from RecognizeStream. A Win32 event, passed as the hExit parameter to RecognizeStream, is set to indicate that the recognizer should exit immediately, without necessarily reading all the data. This condition is also indicated by Recognition or Synchronize returning S_FALSE rather than S_OK.


Recognition is complete for one of several reasons:

  • All active rules in the applications connected to the engine were deactivated.
  • An application set the recognition state (with SetRecoState) to inactive or inactive with purge.
  • No data left in the stream (e.g., for wav file streams).
  • An error or buffer overflow in the stream reading.


Note that the default model in SAPI is for an engine to continue recognizing unless it is explicitly informed to stop. An example of this would be a desktop application where everything the user says is being listened to and acted upon. For some systems an utterance-by-utterance method is required, where recognition stops after each thing the user says. This is best implemented by the application using the auto-pause feature when it activates rules. Then the engine pauses after each recognition and the application can process information and terminate recognition by deactivating all active rules.


8         Recognition Results


In this chapter, the process of returning results is explained in more detail.


8.1       Recognition Call


The Recognition call itself takes a pointer to the following structure as a parameter:




typedef struct SPRECORESULTINFO

{

    ULONG             cbSize;

    SPRESULTTYPE      eResultType;

    BOOL              fHypothesis;

    BOOL              fProprietaryAutoPause;

    ULONGLONG         ullStreamPosStart;

    ULONGLONG         ullStreamPosEnd;

    SPGRAMMARHANDLE   hGrammar;

    ULONG             ulSizeEngineData;

    void             *pvEngineData;

    ISpPhraseBuilder *pPhrase;

    SPPHRASEALT      *aPhraseAlts;

    ULONG             ulNumAlts;

} SPRECORESULTINFO;



These fields are set as follows:


  • cbSize is simply sizeof(SPRECORESULTINFO).
  • eResultType is either SPRT_CFG, SPRT_SLM, or SPRT_PROPRIETARY. The SPRT_FALSE_RECOGNITION flag is set to indicate a false recognition.
  • fHypothesis indicates whether the result is a hypothesis or final recognition.
  • fProprietaryAutoPause is set to FALSE unless using auto-pause with proprietary grammars.
  • ullStreamPosStart/End provides the start and end positions of the result. This informs the application of the position of the result, and controls what audio data is retained in the result. The start position must be equal to or later than the stream position reported for the phrase start corresponding to this recognition, and later than the stream positions reported in previous Synchronize calls.
  • hGrammar is set to the grammar handle for dictation or proprietary results, and NULL for CFG results.
  • pvEngineData returns arbitrary data of size ulSizeEngineData with the result. This data is used for generating alternates if an alternates analyzer object is being used.
  • pPhrase contains the main information about the result. This is created differently for CFG, dictation, or proprietary results. It can also be NULL for false recognitions if the engine has no phrase to report. See below for details on how to create it.
  • aPhraseAlts is an array containing ulNumAlts alternates for this recognition. These are optional (See Returning alternates in a Recognition).


8.2       Dictation Phrases


The standard method used to construct a phrase builder object to hold a dictation result is as follows:

  1. The engine creates an SPPHRASE structure using the needed information.
  2. Then the engine uses CoCreateInstance to create an SpPhraseBuilder object (CLSID_SpPhraseBuilder).
  3. The SPPHRASE information is added with ISpPhraseBuilder::InitFromPhrase.


The SPPHRASE has the following fields:


  • cbSize. This is sizeof(SPPHRASE).
  • LangID. The LANGID of the result.
  • ullGrammarID, wReserved, ftStartTime, ulRetainedSizeBytes, ulAudioSizeTime. Set by SAPI and can be left as 0.
  • ullAudioStreamPosition and ulAudioSizeBytes. Indicate the position of the result in the audio stream. The position is in bytes, relative to the start of the stream. The part of the stream spanned by this phrase must be the same as or less than the range in the ullStreamPosStart/End fields in the SPRECORESULTINFO.
  • Rule. Set to zero for dictation results apart from the ulCountOfElements, which must be filled in with the number of words in the result.
  • pProperties. The array of semantic properties, which will be NULL for a dictation result.
  • pElements. The array of actual words in the result.
  • pReplacements. This holds an array of size cReplacements of ITN text replacements that the engine may fill in (See Inverse Text Normalization (ITN)).
  • SREngineID. An arbitrary GUID that could be the CLSID of the engine, for example.
  • pSREnginePrivateData. This enables arbitrary engine-specific data to be returned with the phrase object, of size ulSREnginePrivateDataSize.


The words of the result are represented by an array of SPPHRASEELEMENT structures in the pElements field. The ulCountOfElements field of Rule indicates the number of words. The elements are filled in either directly before the InitFromPhrase call, or afterward with an AddElements call.


Each SPPHRASEELEMENT contains the following information:

  • Result times, ulAudioTimeOffset and ulAudioSizeTime, and retained audio format details, ulRetainedStreamOffset and ulRetainedSizeBytes. These are all filled in by SAPI and can be set to 0.
  • Audio stream start position and size for this word: These are relative to the start stream position of the parent phrase. These can be left 0, although the application then cannot obtain word position information.
  • Display text, lexical form, and pronunciation information for the word. The display text must be set so that the application displays the result. Optionally, the lexical form and pronunciation can also be set.
  • Display attributes information. This is of type SPDISPLAYATTRIBUTES and provides information to SAPI about how to format the result. For European languages, this would usually be set to SPAF_ONE_TRAILING_SPACE so that each word is followed by a single space.
  • RequiredConfidence. This is filled in by SAPI.
  • Actual and SR confidence. (See Confidence Scoring and Rejection)


Once the PhraseBuilder has been filled in, it is passed to SAPI as the pPhrase field in the Recognition call.


Results for proprietary grammars are constructed in the same way as dictation results.


8.3       CFG Phrases


The engine does not directly create a phrase object for a CFG result, but instead calls ParseFromTransitions. The engine provides to this method information about the words in the result, and SAPI parses the active rules to fill in semantic property and other information correctly. The engine then passes the returned phrase builder object to SAPI as the pPhrase pointer in Recognition.

ParseFromTransitions takes an SPPARSEINFO structure as a parameter, containing the following information:

  • cbSize. Set to sizeof(SPPARSEINFO).
  • hRule. The handle of the top-level rule this result refers to.
  • ullAudioStreamPosition and ulAudioSize indicate the position of the result in the audio stream. The position is in bytes, relative to the start of the stream. The part of the stream spanned by this phrase must be the same as or less than the range given in the ullStreamPosStart/End fields in the SPRECORESULTINFO.
  • pPath. An array, of size cTransitions, of SPPATHENTRY structures.
  • An SREngineID GUID that can be used by the application to identify the engine.
  • Optional private engine data pSREnginePrivateData of size ulSREnginePrivateDataSize.
  • A flag fHypothesis indicating whether the result is to be used for a hypothesis or final recognition.


The SPPATHENTRY array contains information about the words in the result. Each word transition needs an entry in the result. The engine does not include rule or epsilon transitions; ParseFromTransitions is able to process the result without these. Each SPPATHENTRY contains a transition ID for the word transition, and an SPPHRASEELEMENT structure, which is filled in the same way as for dictation results. The display text, lexical form, and pronunciation do not need to be filled in for word transitions; ParseFromTransitions will do this automatically.


There should also be entries for any special transitions:

  • For a wildcard transition, the engine sets the transition ID to the value it received for the wildcard transition, and fills in the phrase element information as it would for a normal word.
  • For a dictation or text-buffer transition the engine includes an SPPATHENTRY for each word recognized in the dictation or text-buffer. The display text needs to be set to the text for each word. Optionally, the lexical form and pronunciation can be set also.


8.4       Confidence Scoring and Rejection


8.4.1        Word Confidence

It is possible for confidence score information to be included in recognition results. On each phrase element there are two confidence fields that the engine can set: ActualConfidence (three-level) and SREngineConfidence (floating-point). Rules and semantic properties carry a corresponding pair of fields, Confidence and SREngineConfidence. If the engine does not explicitly set any of these values, SAPI tries to produce reasonable defaults. It produces the Confidence values by averaging the levels for each of the words in the phrase or property, and it sets the SREngineConfidence values to -1.0.

The first, ActualConfidence, is a three-level value indicating low, medium or high confidence (SP_LOW_CONFIDENCE, SP_NORMAL_CONFIDENCE, SP_HIGH_CONFIDENCE for C/C++, or of type SpeechEngineConfidence for OLE automation). This is designed to give applications a simple, engine-independent confidence value.

The second value, SREngineConfidence, is a floating-point value. Engines can use it to give more detailed confidence information, but it is not necessarily engine-independent. SAPI defines that this value should be positive, with zero indicating the lowest confidence. An application can use it to tune its performance with a specific engine, but thresholds tuned for one engine will more than likely make the application worse with other engines, so the value should be used with care. It is most useful with speaker-independent engines, because a large corpus of recorded usage can then be used to optimize the overall accuracy of the application. See Confidence Scoring and Rejection in SAPI Speech Recognition Engine Guide for additional details. If this field is not being used, the engine sets this confidence to -1.0.
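
One possible way for an engine to derive the three-level value from its own floating-point score is a pair of thresholds, sketched below. The constants repeat the numeric values of the SAPI confidence macros so that the example is self-contained; the Thresholds structure and its values are hypothetical and would be tuned per engine.

```cpp
#include <cassert>

// Same numeric values as SP_LOW_CONFIDENCE, SP_NORMAL_CONFIDENCE and
// SP_HIGH_CONFIDENCE, repeated here so the sketch compiles without SAPI.
constexpr signed char kLowConfidence = -1;
constexpr signed char kNormalConfidence = 0;
constexpr signed char kHighConfidence = 1;

// Value an engine reports when it does not use SREngineConfidence.
constexpr float kEngineConfidenceUnused = -1.0f;

// Hypothetical, engine-tuned cut-off points on the engine's own scale.
struct Thresholds {
    float normal;  // scores at or above this are at least "normal"
    float high;    // scores at or above this are "high"
};

signed char ToActualConfidence(float engineScore, Thresholds t) {
    assert(engineScore >= 0.0f);  // zero is the lowest valid confidence
    if (engineScore >= t.high) return kHighConfidence;
    if (engineScore >= t.normal) return kNormalConfidence;
    return kLowConfidence;
}
```

For example, with thresholds of 0.4 and 0.8, a score of 0.5 maps to the normal level.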


8.4.2        Property and Rule Confidence

It is also possible for confidences to be associated with rules and semantic properties in CFG results. This allows an application to look at confidence on groups of words rather than on individual words. The confidence for a rule is the overall confidence for all the words in the phrase contained within that rule. Thus, for the top-level rule, this gives an overall confidence for the whole phrase. The confidence for a semantic property is the confidence for all the words within the rule, if the property is on a rule or rule reference; or the confidence for the word, if the property is on a word transition.


It is possible for the engine to override these settings if it has an alternative method of estimating phrase confidences. Since these fields cannot be directly manipulated on the ISpPhraseBuilder interface, it is necessary to convert them back to an SPPHRASE structure first. This is done using the following sequence of actions:


·        Calling GetPhrase on the SpPhraseBuilder object to get an SPPHRASE.

·        Modifying the chosen fields in the SPPHRASE (note that this may require casting away const on those fields).

·        Calling InitFromPhrase on the original SpPhraseBuilder to set the modified phrase information in the object.

·        The engine can then use the SpPhraseBuilder in Recognition calls.


This method can also be used to override other information in a CFG result phrase after ParseFromTransitions has been called.
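
The get, modify and re-initialize sequence above can be sketched as follows. PhraseBuilder, Phrase and Rule are small hypothetical stand-ins for ISpPhraseBuilder, SPPHRASE and SPPHRASERULE, used only so the example compiles on its own. In real SAPI code, GetPhrase returns a CoTaskMemAlloc-allocated SPPHRASE that the caller frees with CoTaskMemFree after calling InitFromPhrase.

```cpp
#include <cassert>
#include <vector>

// Hypothetical subsets of SPPHRASERULE and SPPHRASE.
struct Rule {
    signed char confidence;   // three-level value
    float engineConfidence;   // engine-specific score
};

struct Phrase {
    Rule topRule;
    std::vector<signed char> wordConfidence;
};

// Stand-in for ISpPhraseBuilder: the phrase is only reachable through
// GetPhrase and InitFromPhrase, never through direct field access.
class PhraseBuilder {
public:
    Phrase GetPhrase() const { return phrase_; }
    void InitFromPhrase(const Phrase& p) { phrase_ = p; }
private:
    Phrase phrase_{};
};

void OverrideRuleConfidence(PhraseBuilder& builder,
                            signed char level, float score) {
    Phrase phrase = builder.GetPhrase();      // 1. get a copy of the phrase
    phrase.topRule.confidence = level;        // 2. modify the chosen fields
    phrase.topRule.engineConfidence = score;
    builder.InitFromPhrase(phrase);           // 3. write the phrase back
    // 4. the builder can now be used in a Recognition call
}
```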


8.4.3        Required Confidence and Rejection

Each transition on a CFG contains a RequiredConfidence field. This is set to one of three values like the ActualConfidence field. This field is set to SP_NORMAL_CONFIDENCE by default, and is changed in the XML grammar by preceding words with "+" or "-". The purpose of this field is to indicate how much confidence in the recognition an application requires for the result to be returned. In principle, if the engine has a result and the confidence of any of the words in the phrase is lower than the required confidence, the engine should reject the result. Note that the engine may have a different mechanism for rejecting phrases so SAPI does not enforce this particular confidence. The engine should, however, acknowledge the required confidence fields if possible.


Most applications will listen for and act only upon final recognitions (SPEI_RECOGNITION events), and probably do not analyze the confidence scores of these results. Engines send these events only after they have determined that the confidence is high enough to accept the result.


When a result is rejected, the engine sends a false recognition rather than a final recognition. A false recognition can include exactly the same phrase information as a final recognition, or it can have a NULL phrase. Returning a phrase with a false recognition could be useful to applications so that they can analyze why the phrase was rejected, and recover the audio from the phrase. Advanced applications can perform their own rejection, and thus could look at both final and false recognition events and analyze the confidence scores returned in each result.
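
A minimal sketch of the rejection rule described in this section, assuming the engine has a three-level actual confidence and a required confidence for every word in the phrase. The WordConfidence and ResultKind names are hypothetical; as noted above, an engine may use a different rejection mechanism.

```cpp
#include <cassert>
#include <vector>

// Three-level values: -1 (low), 0 (normal), 1 (high).
struct WordConfidence {
    signed char actual;    // what the engine measured for the word
    signed char required;  // RequiredConfidence from the grammar transition
};

enum class ResultKind { FinalRecognition, FalseRecognition };

// Reject the phrase if any word falls below its required confidence.
ResultKind ClassifyResult(const std::vector<WordConfidence>& words) {
    for (const WordConfidence& w : words) {
        if (w.actual < w.required)
            return ResultKind::FalseRecognition;
    }
    return ResultKind::FinalRecognition;
}
```

A phrase classified as a false recognition can still carry its full phrase information, so that applications can inspect the scores.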


8.4.4        Ambiguous Results

Sometimes the words that have been recognized may match more than one possible path, either with dictation or CFG grammars. SAPI does not currently provide a means to resolve ambiguity or send the result to multiple contexts, so the engine must select a single path to return as the result. The guidelines for this are as follows:

·        If several CFG paths match the recognized words, return the result for the most recently activated rule. This should mean that applications with focus receive the result in preference to those in the background.

·        If the words match both a dictation grammar and a CFG, pick the CFG (unless the path has a very low CFG weight score indicating that the dictation path should be chosen).

·        If several dictations are active, pick the most recently activated dictation.


Neither engines nor SAPI can determine what the user meant to say. Applications are encouraged to avoid potentially ambiguous grammars.
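
The guidelines above can be expressed as a small ranking function. This is only one possible reading: the Candidate structure, the activationOrder counter and the minCfgWeight cut-off are hypothetical engine bookkeeping, not SAPI types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Candidate {
    bool isCfg;                     // true for a CFG path, false for dictation
    unsigned long activationOrder;  // larger means more recently activated
    float cfgWeight;                // path weight; ignored for dictation
};

// Returns the index of the path to report. Preference order:
// a CFG path with acceptable weight, then dictation, then a CFG path
// with very low weight. Ties go to the most recently activated entry.
std::size_t ChoosePath(const std::vector<Candidate>& candidates,
                       float minCfgWeight) {
    assert(!candidates.empty());
    auto rank = [minCfgWeight](const Candidate& c) {
        if (!c.isCfg) return 1;
        return c.cfgWeight >= minCfgWeight ? 2 : 0;
    };
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const int ri = rank(candidates[i]);
        const int rb = rank(candidates[best]);
        const bool newer =
            candidates[i].activationOrder > candidates[best].activationOrder;
        if (ri > rb || (ri == rb && newer)) best = i;
    }
    return best;
}
```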


8.5       Inverse Text Normalization (ITN)


For dictation results, it is possible for the SR engine to specify a normalized form as well as the raw text of the recognized words. To accomplish this, add one or more SPPHRASEREPLACEMENT structures to the result phrase, either by calling AddReplacements or by directly setting the pReplacements and cReplacements fields in the SPPHRASE. It is possible to have more than one replacement because each can refer to a subset of the full text.


SAPI does not provide an automatic inverse-normalization (ITN) facility. However, it does provide a mechanism for engine vendors to write a grammar describing their ITN rules which can be automatically parsed.


SAPI provides an object SpITNProcessor that implements ISpITNProcessor. The engine can create this object with CoCreateInstance (CLSID_SpITNProcessor), and then call LoadITNGrammar on it. The engine must pass in the CLSID of an object that implements the ISpCFGInterpreter interface. Engine vendors must implement this object, which has two methods, InitGrammar and Interpret. InitGrammar is called by SAPI when LoadITNGrammar is called. This method should load a SAPI binary grammar containing the ITN information and return it as serialized data.


The ITN grammar that the engine implements should have rules for each of the phrase fragments that need to be normalized. These rules need to have the attributes "TOPLEVEL=ACTIVE" so that SAPI will activate them for parsing, and "INTERPRETER=1" so that SAPI calls the engine's ISpCFGInterpreter object when the rule is fired.


When the engine has a result phrase to normalize, it calls ITNPhrase on the SpITNProcessor object. This will parse the grammar, and if a fragment of text matches any of the rules in the grammar, ISpCFGInterpreter::Interpret will be called. This method will be passed a phrase containing the result text and matching rule information, and an ISpCFGInterpreterSite pointer. The engine's implementation of this should look at the reported text and call ISpCFGInterpreterSite::AddTextReplacement to add the normalized text.


Because the ITN grammar is a standard SAPI grammar, it is possible to use all the features of such a grammar. For example, it may be useful to include the normalized text as a semantic property. This will be included in the phrase passed to Interpret so that this method could just set the replacement text to be the property string. For example, the following is a simple grammar fragment that converts the words "dollar" and "pound" to their respective symbols:




                  <P VALSTR="$">dollar</P>

                  <P VALSTR="$">dollars</P>

                  <P VALSTR="£">pound</P>

                  <P VALSTR="£">pounds</P>




When trying to reorder symbols (e.g., so that the dollar sign comes before the currency amount), more sophisticated processing would have to be done in Interpret.
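
As an illustration of such reordering, the sketch below builds the replacement for a matched fragment of the form "amount, currency word", placing the symbol first. The Replacement structure is a simplified stand-in for SPPHRASEREPLACEMENT, and ReorderCurrency is a hypothetical helper that an Interpret implementation might call before adding the replacement through AddTextReplacement. It assumes the amount words are already numerals (for example "5").

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for SPPHRASEREPLACEMENT (not the SAPI declaration).
struct Replacement {
    std::string text;             // normalized text to display
    std::size_t firstElement;     // index of the first word replaced
    std::size_t countOfElements;  // number of words replaced
};

// `symbol` is the VALSTR property of the matched currency word. The last
// word of the fragment is the currency word; the words before it are the
// amount.
Replacement ReorderCurrency(const std::vector<std::string>& words,
                            std::size_t first, std::size_t count,
                            const std::string& symbol) {
    assert(count >= 2 && first + count <= words.size());
    std::string text = symbol;
    for (std::size_t i = first; i + 1 < first + count; ++i)
        text += words[i];  // append the amount, skipping the currency word
    return Replacement{text, first, count};
}
```

With the words "pay 5 dollars now" and a fragment covering "5 dollars", the replacement text becomes "$5".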


8.6       Interpreters


Sometimes an application will want to extract semantic information from CFG results by processing them after recognition. To support this, a grammar can be associated with an object. The object implements ISpCFGInterpreter, and is called each time a result matching a rule in the grammar is passed into ParseFromTransitions. Then, rather than the property information being filled in statically from the grammar, this object supplies it with ISpCFGInterpreterSite::AddProperty.


This process is basically invisible to the SR engine. It recognizes the words in the grammar and calls ParseFromTransitions and then Recognition. The one difference is that GetTransitionProperty may report that there are no properties on a transition, which are later added into the result in the ParseFromTransitions calls. Rules with an associated interpreter have the SPRAF_Interpreter flag set in the SPRULEENTRY Attributes field.


9         Alternates


There are two ways the engine can supply alternates back to the application. It either supplies the alternates directly in the Recognition call, or produces alternates using an alternates analyzer object.


9.1       Returning alternates in a Recognition


The field aPhraseAlts in the SPRECORESULTINFO structure can contain an array of alternates. The size of the array is given by ulNumAlts. Each entry in the array is an SPPHRASEALT structure. This contains an ISpPhraseBuilder pointer, pPhrase, for a phrase generated directly or from ParseFromTransitions, depending on whether the alternate is for a dictation or CFG result. This phrase contains the full text of the alternate, not just the words that are different from the main result.

  • ulStartElementInParent and cElementsInParent indicate which words in the main phrase are different in this alternate.
  • cElementsInAlternate indicates how many words in this alternate replace them.
  • ulStartElementInAlternate is not needed because it would be the same as ulStartElementInParent.
  • pvAltExtra and cbAltExtra fields can be used to add engine-specific extra data to the alternate.
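
One way to derive these three values is to trim the words that the main phrase and the alternate share at the start and at the end, as in the sketch below. The AltSpan structure is a hypothetical holder for the SPPHRASEALT fields ulStartElementInParent, cElementsInParent and cElementsInAlternate, and the word-by-word string comparison is an assumption; an engine would more likely compare its own word identifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct AltSpan {
    std::size_t startInParent;     // ulStartElementInParent
    std::size_t countInParent;     // cElementsInParent
    std::size_t countInAlternate;  // cElementsInAlternate
};

AltSpan ComputeAltSpan(const std::vector<std::string>& parent,
                       const std::vector<std::string>& alt) {
    // Length of the common prefix.
    std::size_t prefix = 0;
    while (prefix < parent.size() && prefix < alt.size() &&
           parent[prefix] == alt[prefix]) {
        ++prefix;
    }
    // Length of the common suffix, not overlapping the prefix.
    std::size_t suffix = 0;
    while (suffix < parent.size() - prefix && suffix < alt.size() - prefix &&
           parent[parent.size() - 1 - suffix] == alt[alt.size() - 1 - suffix]) {
        ++suffix;
    }
    return AltSpan{prefix,
                   parent.size() - prefix - suffix,
                   alt.size() - prefix - suffix};
}
```

The alternate phrase itself still contains every word; the span only tells SAPI which part differs.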


The number of alternates that an application requires is obtained with either GetMaxAlternates or GetContextMaxAlternates. The first gives the maximum number of alternates for a given rule, and the second gives the maximum for a given RecoContext.


Sometimes when multiple applications are using the shared recognizer and all have rules active, the engine may generate alternates from different applications. Currently, SAPI cannot process such alternate lists, and it is not possible for the engine to send alternates that refer to a different RecoContext than the main result. To simplify detecting which context an alternate belongs to, the engine can call IsAlternate, passing the handles of the top-level rules of the main result and of the alternate. SAPI returns S_OK if the alternate is valid and S_FALSE otherwise.


9.2       Alternates Analyzer


The engine can also implement a separate alternates analyzer COM object. This must implement the interface ISpSRAlternates. The CLSID for this object is stored in the engine's object token in the AlternatesCLSID string value. This type of analyzer can only be used for dictation alternates, not for CFG alternates.


When an application asks for alternates with ISpRecoResult::GetAlternates, SAPI supplies any alternates that the engine already provided within the Recognition call. Otherwise, if the engine has an alternates analyzer object, SAPI creates this object and requests the alternates from it, using GetAlternates.


As well as the main ISpPhrase result and the number of alternates requested, the alternates analyzer also receives any extra data the engine supplied in the pvEngineData field of the SPRECORESULTINFO structure in the Recognition call. The engine can store serialized lattice information or similar information here, which the alternates analyzer uses to generate the alternates.


The analyzer is also passed a pointer to the ISpRecoContext requesting the alternates. The analyzer can use this to query for an engine-specific extension interface (see Engine extensions) to make a private call to the main engine object. These objects may be in different processes. Because the result object can be serialized and then re-created later, it may not be possible to provide a RecoContext referring to the SR engine. In this case NULL is passed in.


If a user makes a correction after looking at the alternates, the application can commit that alternate with ISpPhraseAlt::Commit. SAPI then calls Commit on the analyzer, giving the engine the SPPHRASEALT that the application selected. The engine can use this to perform adaptation to improve subsequent recognition performance.


10   User-Interface


There are a variety of user-interface components an engine supplies along with the main engine object itself. Examples of these are user enrollment, microphone set-up, adding and removing user exception words and engine properties. Engines should not create UIs directly. Instead, SAPI provides mechanisms for engines to describe what UI components they have and to request the display of these components.


The engine's object token contains the details of the UI that an engine supports. The token can contain a UI subkey. Within this key can be subkeys for each component type the engine implements. For example the Sample SR engine (Token: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Recognizers\Tokens\SAPI5SampleEngine) has a key UI\AddRemoveWord. This contains a value CLSID, which holds the CLSID of the class to be created when this UI component is displayed. The engine setup installs and registers this class. This class must implement the interface ISpTokenUI.


An SR engine can request that a particular UI be displayed by calling AddEvent with an eEventId of SPEI_REQUEST_UI. The name of the UI component requested is in the lParam field, with the elParamType set to SPET_LPARAM_IS_STRING. To cancel the UI request, the engine calls this method again with a NULL lParam.


This event is sent to all applications connected to the engine that are waiting for the event. An application can either ignore the engine's request, or respond to it. An application can ask to create an engine UI at any time, not just after it has received a request.


An application determines if an engine object token supports a particular UI component by calling IsUISupported, either on the ISpObjectToken itself, or on the ISpRecognizer. If the object token has a CLSID for this type of UI, SAPI will create the object and call ISpTokenUI::IsUISupported on the engine's UI class. To actually display the UI, the application calls DisplayUI, which leads to a call to ISpTokenUI::DisplayUI on the engine's class.


Both DisplayUI and IsUISupported on ISpTokenUI take an IUnknown parameter punkObject. If an application calls these methods on the ISpRecognizer, this parameter will point to that ISpRecognizer object. The UI component can use it to find the RecoContext and make a private call to the main SR engine object. If the methods are called by the application directly on the object token, this parameter may be NULL or point to a different object. In that case, a UI component that requires its parent SR engine to be created in order to run may not be able to display the UI.


11   Engine extensions


SAPI 5 aims to provide interfaces for all the main functions of the SR engine. However, there are often additional features that SR engines can implement when connected to certain applications. There are a number of places where engine-specific data can be passed to an application: engine-specific data fields on result phrases, the SPEI_SR_PRIVATE event, methods to support proprietary grammars, new object token attributes and properties, and so on.

There is also a mechanism for an engine to implement additional interfaces for an application to use if it wants to be specifically connected to a particular engine. To do this, the engine implements a new COM object. The CLSID for this object is stored in the engine object token, in the string value RecoExtension. This object can implement any interfaces that the engine vendor wants to implement. To use this object, an application calls QueryInterface on the RecoContext for an interface that the extension supports. SAPI then creates the engine extension object as an aggregate, and queries it for the interface. The application is then free to call any methods on this interface, and the calls are passed to the extension object. For example, the Sample SR Engine implements the interface ISampleSREngine, which an application can obtain by calling QueryInterface on the RecoContext.


The extension object can make calls to its main SR engine class. It does this by querying on the RecoContext for the _ISpPrivateEngineCall interface. The RecoContext is the outer unknown of the extension object. The sample engine does this (in srengext.h) by:

        hr = OuterQueryInterface(IID__ISpPrivateEngineCall, (void **)&m_pEngineCall);


This query for _ISpPrivateEngineCall must be done in the constructor or ATL FinalConstruct of the engine extension object. It cannot be done later.


This gives a pointer to an _ISpPrivateEngineCall interface (see section 11.1 for important notes about handling this pointer). The interface has two methods, CallEngine and CallEngineEx. Both methods send a serialized chunk of data to the engine. The difference between them is that CallEngineEx can return a chunk of data from the engine that is larger than the data passed in.


It is possible for applications to know what engine they are connected to. If the engine is not one that supports the extension interface, the QueryInterface will fail. Alternatively, the application can look at the CLSID of the engine using ISpRecognizer::GetStatus. Thus, applications that plan to use special features of a particular engine will be able to detect if they do not have this interface and either perform with limited functionality or fail gracefully. This provides maximum interoperability between engines and still enables applications to take advantage of engine-specific information.


11.1  Important notes about COM interface pointer handling by the SR extension aggregates


An SR engine extension can only query for IID__ISpPrivateEngineCall during creation of the object (for example, in the FinalConstruct call of an ATL object). RecoContext makes the interface visible only during this time. The QueryInterface call does not AddRef this interface, so the extension should never call Release on it.


Because the SR engine extension is created as a COM aggregate, if it were to hold a reference to its outer IUnknown interface (the IUnknown of the ISpRecoContext interface), it would prevent the context from ever being released. If the extension object calls QueryInterface on the ISpRecoContext interface for any interface except _ISpPrivateEngineCall, it must call Release() immediately on the outer unknown to prevent a self-reference. The extension can continue to use the interface even though Release has already been called, because the lifetimes of the ISpRecoContext interface and the extension object are guaranteed to be the same.


The sample SR engine extension shows an example of how to handle these cases in FinalConstruct.
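
The reference-counting rule can be illustrated without COM. In the sketch below, OuterContext is a hypothetical stand-in for the aggregating RecoContext, with a QueryInterface-like method that adds a reference, and Extension plays the part of the aggregated extension object. It is a model of the rule only, not the sample engine's code.

```cpp
#include <cassert>

// Stand-in for the aggregating RecoContext (the outer unknown).
class OuterContext {
public:
    unsigned long AddRef() { return ++refs_; }
    unsigned long Release() { return --refs_; }
    // Like QueryInterface: hands out a pointer and adds a reference.
    OuterContext* QueryInterface() {
        AddRef();
        return this;
    }
    unsigned long RefCount() const { return refs_; }
private:
    unsigned long refs_ = 1;  // the reference held by the application
};

// Stand-in for the aggregated engine extension object.
class Extension {
public:
    explicit Extension(OuterContext& outer) {
        context_ = outer.QueryInterface();
        // Drop the reference at once. Keeping it would make the context
        // hold itself alive through its own aggregate.
        context_->Release();
    }
    // Still safe to use: the extension never outlives the outer object.
    OuterContext* Context() const { return context_; }
private:
    OuterContext* context_ = nullptr;  // deliberately non-owning
};
```

After the extension is constructed, the outer object's count is back at the value it had before, so releasing the application's reference can still destroy the context.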