Grammar Basics

Article
11/13/2009

Grammars list the words, phrases, or Dual-Tone Multi-Frequency (DTMF) tones that are expected as "utterances" from a speaker in response to a voice prompt. This topic includes the following sections:

Where grammars are used in a VoiceXML application
Inline and stand-alone grammars
Voice and DTMF grammars
Grammar contents

Where grammars are used in a VoiceXML application

A <form> element can be used in a VoiceXML application to elicit speaker responses, just as a <form> element in an HTML page is used to obtain user input. Each <field> element is a child of the <form> element and contains a "dialog" between the speaker and the application. In turn, <prompt>, <grammar>, and <filled> elements are children of the <field> element. The <prompt> element contains the audio that the VoiceXML application delivers to the speaker; the <grammar> element contains a list of acceptable speaker utterances (responses) that the application expects; and the <filled> element contains instructions for what to do when the speech recognition engine finds a match between the speaker’s utterance and the grammar’s choices.

The name attribute of the <field> element names a variable that can be assigned values; for example, <field name="returnVar">. Matches that the speech recognition engine finds by using the <grammar> element are placed in the returnVar variable, where they are available to the <filled> element. The general structure looks like this:

<form>
...
   <field name="returnVar">
      <prompt>
          <!-- speech to be delivered to speaker. -->
      </prompt>

      <grammar>
          <!-- Words or DTMF tones for matching the
               speaker’s utterance. When a match occurs,
               it is placed in the returnVar variable. -->
      </grammar>

      <filled>
          <!-- What to do when a grammar match is found.
               This element obtains the match from the
               returnVar variable. -->
      </filled>
   </field>
...
</form>

A working example

Here is an example of a small but complete VoiceXML application that includes a simple grammar. The caller is prompted to say something and should say "yes" or "no." The caller's utterance, if it matches one of the grammar choices, is placed in the yesOrNo variable. In the <filled> section, the VoiceXML application handles the return from the grammar match. In this case, the application tells the caller what their answer was, using the value of the yesOrNo variable.

<?xml version="1.0"?>
<vxml version="2.1" revision="4"
      xmlns="http://www.w3.org/2001/vxml"
      xml:lang="en-US">
<form id="mainDialog">
  <field name="yesOrNo">
    <prompt>
       Please say something
    </prompt>

    <!-- Inline voice grammar: the caller may say "yes" or "no". -->
    <grammar version="1.0" mode="voice"
             type="application/srgs+xml"
             tag-format="semantics/1.0"
             xml:lang="en-US" root="top">
      <rule id="top">
        <one-of>
         <item>yes</item>
         <item>no</item>
        </one-of>
      </rule>
    </grammar>

    <!-- Runs after a match; the match is available in yesOrNo. -->
    <filled>
      <prompt> thank you </prompt>
      <prompt>you said <value expr="yesOrNo"/> </prompt>
    </filled>
    <noinput> </noinput>
    <nomatch> </nomatch>
  </field>
</form>
</vxml>

Note

The <noinput> and <nomatch> elements are used in the VoiceXML application to handle events that are thrown if the caller says nothing at all or says something that is not in the grammar.
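
As a minimal sketch of how these handlers might be filled in, the fragment below replaces the empty <noinput> and <nomatch> elements of the example with a short message followed by <reprompt/>, which asks the interpreter to play the field's prompt again. The prompt wording is illustrative and is not taken from the original application.

```xml
<!-- Drop-in replacements for the empty handlers in the yesOrNo field. -->
<noinput>
  <!-- Handles the event thrown when the caller says nothing at all. -->
  <prompt>Sorry, I did not hear anything.</prompt>
  <reprompt/>
</noinput>
<nomatch>
  <!-- Handles the event thrown when the utterance is not in the grammar. -->
  <prompt>Sorry, I did not understand.</prompt>
  <reprompt/>
</nomatch>
```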

Inline and stand-alone grammars

Grammars can be inline, which means that they are embedded in a VoiceXML application (as in the preceding example). They can also be in stand-alone files with the recommended suffix of .grxml. Inline and stand-alone grammars are identical except for their headers (see Grammar Headers).

All grammars should conform to the W3C Speech Recognition Grammar Specification (SRGS). Conformance is more easily demonstrated for stand-alone external grammars: because they are XML files, they can be validated against the W3C DTD or schema for grammars.

Most of the examples in this document are inline grammars. Any inline grammar can become a stand-alone grammar by adding the correct headers.
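
For illustration, the yes/no grammar from the working example might look like the following as a stand-alone file. This is a sketch that assumes the standard SRGS XML header (an XML declaration plus the SRGS namespace on the <grammar> element); the file name yesno.grxml is hypothetical, and the exact header requirements are described in Grammar Headers.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- yesno.grxml (hypothetical file name): stand-alone form of the inline yes/no grammar. -->
<grammar version="1.0" mode="voice"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0"
         xml:lang="en-US" root="top">
  <rule id="top">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```

A field could then reference the file instead of embedding the rules, for example with <grammar src="yesno.grxml" type="application/srgs+xml"/>.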

Warning

You can use more than one grammar in a dialog (<field>). However, such grammars must either be all inline or all stand-alone. If you mix inline and stand-alone grammars in the same dialog, an error is thrown.

Voice and DTMF grammars

Grammars can be written either for voice input or for DTMF input. Voice and DTMF grammars must be distinct from one another; they cannot be mixed in the same grammar. If your application accepts either kind of input, you must include separate grammars for voice and for DTMF.
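
A minimal sketch of a field that accepts either kind of input by supplying two separate inline grammars follows. The mapping of key 1 to "yes" and key 2 to "no", the prompt wording, and the rule names are assumptions made for this illustration.

```xml
<field name="yesOrNo">
  <prompt>Say yes or press 1. Say no or press 2.</prompt>

  <!-- Voice grammar: matches spoken words only. -->
  <grammar version="1.0" mode="voice"
           type="application/srgs+xml"
           xml:lang="en-US" root="voiceTop">
    <rule id="voiceTop">
      <one-of>
        <item>yes</item>
        <item>no</item>
      </one-of>
    </rule>
  </grammar>

  <!-- DTMF grammar: matches key presses only. -->
  <grammar version="1.0" mode="dtmf"
           type="application/srgs+xml"
           root="dtmfTop">
    <rule id="dtmfTop">
      <one-of>
        <item>1</item>
        <item>2</item>
      </one-of>
    </rule>
  </grammar>
</field>
```

Both grammars are inline, which keeps the field consistent with the warning in the previous section about not mixing inline and stand-alone grammars in one dialog.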

Note

Most of the discussion and examples in this document describe voice grammars. DTMF grammars are covered in DTMF Grammars.

Grammar contents

The words and phrases in voice grammars and the digits in DTMF grammars all appear as text content in <rule> or <item> elements (described in the next section). This text must meet the requirements listed below.

Requirements for DTMF grammar contents

Like voice grammars, DTMF grammars contain text. In DTMF, though, the text can only be the digits 0 through 9 (as digits, not as words), and the special symbols "*" and "#".

1. All unmarked digits and symbols in a <rule> or <item> element are content that the speech recognition engine attempts to match.
2. All such digits and symbols must be recognized in the sequence in which they are presented, and in their entirety. This requirement is exactly analogous to requirement 2 for voice, immediately below.
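
To make the two requirements concrete, here is a sketch of a DTMF grammar whose items contain more than one key. The key sequences and the rule name are invented for illustration.

```xml
<grammar version="1.0" mode="dtmf"
         type="application/srgs+xml"
         root="menuKeys">
  <rule id="menuKeys">
    <one-of>
      <!-- Matches only if the caller presses star and then 9, in that order. -->
      <item>* 9</item>
      <!-- Matches only if all four keys are pressed: 1, 2, 3, then pound. -->
      <item>1 2 3 #</item>
    </one-of>
  </rule>
</grammar>
```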

Requirements for voice grammar contents

1. All unmarked text in a <rule> or <item> element is content that the speech recognition engine attempts to match.
2. All such text must be recognized in the sequence in which it is presented, and in its entirety. As an example, for <item>one two three</item>, the speaker must say all three words in the phrase without saying any other words before or within the phrase.
3. Acronyms should be avoided. For example, replace "USA" by "u s a" and replace "W3C" by "w three c".
4. Abbreviations should be replaced by the unabbreviated form. For example, replace "Dr." by "doctor" or "drive".
5. Punctuation, if it is expected to be spoken, should be replaced by words. When dashes are spoken in Social Security or phone numbers, for example, replace "-" by "dash."
6. Numbers should be replaced by spelled forms, such as "zero," "four," "ten," or "four thousand."
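
The requirements above can be illustrated with a short sketch. Each item shows the spelled-out form that would replace a written form; the phrases and the rule name are invented for illustration.

```xml
<rule id="spokenForms">
  <one-of>
    <!-- Acronym "USA" written as separate letters. -->
    <item>u s a</item>
    <!-- Abbreviation "Dr. Smith" expanded to the intended word. -->
    <item>doctor smith</item>
    <!-- Spoken punctuation: "555-1212" with the dash said aloud. -->
    <item>five five five dash one two one two</item>
    <!-- Number "4000" spelled out. -->
    <item>four thousand</item>
  </one-of>
</rule>
```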