VoiceXML Tutorial

This tutorial leads you through the initial development of a VoiceXML application that runs on the Tellme platform, using the Microsoft speech recognition engine. The tutorial consists of this introduction and the following 10 lessons, a summary, and three appendices:

Lesson 1 -- Building the First Document

Lesson 2 -- Adding DTMF Input

Lesson 3 -- Using Menus Instead of Forms

Lesson 4 -- Improving Prompts and Grammars

Lesson 5 -- Alternatives to Inline Grammars

Lesson 6 -- Form Items, Properties, and Getting Caller Confirmation

Lesson 7 -- VoiceXML Scripting Elements and Subdialogs

Lesson 8 -- Events and Links

Lesson 9 -- Text and Audio in Prompts

Lesson 10 -- Mixed Initiative Forms

Lesson 11 -- Summary

Appendix A -- Testing VoiceXML Applications on the Tellme Platform

Appendix B -- Processing Grammar Returns

Appendix C -- The Form Interpretation Algorithm

In order to follow the tutorial lessons, no prior experience with VoiceXML is required. Familiarity with XML and HTML would be helpful.

The lessons assume that you will be testing the tutorial's VoiceXML code on the Tellme Studio Web site: https://studio.tellme.com/ (see Appendix A -- Testing VoiceXML Applications on the Tellme Platform).

The sample application developed in this tutorial runs on the Tellme platform using the Microsoft speech recognition engine. Because the speech products of both Microsoft and Tellme (a Microsoft subsidiary) adhere to industry standards, the application should run on a variety of other platforms as well. The relevant W3C standards are:


The Tellme platform requires the XML form of SRGS grammars, as opposed to the ABNF form.

After you have completed all lessons, you should be able to work your way through these W3C specifications to gain an even greater understanding of VoiceXML.

What is VoiceXML?

An interactive voice response (IVR) application is a Web application that allows a caller to use voice and touch-tone inputs (DTMF) to obtain access to information or services over a telephone connection.

IVR applications frequently are alternatives to visiting a Web site. In many situations, an individual can choose to do the same thing over the phone that he or she might do online. Examples are: checking a bank balance, getting the current weather forecast, paying a credit card bill, making a plane or hotel reservation, or checking a flight arrival time.

IVR applications are developed using a W3C-standard voice browser markup language called VoiceXML. They consist of a series of "dialogs" between the caller and the VoiceXML application. A dialog is a conversation between the IVR application and a caller in which the caller responds to a series of prompts. The caller's response in one dialog, which is interpreted by a speech recognition engine, usually leads to another dialog, and then another, until the caller has completed his or her task. Sometimes, the VoiceXML application logic will transfer the caller to a live operator if the right input is not detected or if the caller requests it.

Callers interact with a Web site by using a Web browser when connected to the Internet. Callers interact with an IVR application by using a "VoiceXML browser" over a telephone connection. VoiceXML developers can use their existing Web infrastructure to run their IVR applications.

Introduction to the VoiceXML Tutorial

This tutorial describes steps in the development of an IVR application for Contoso Travel, Inc., a fictitious travel agency. Contoso can make airline, hotel, or rental car reservations. It also recommends restaurants in cities that its clients visit.

A client can interact with Contoso either by browsing its Web site or by telephoning an 800 number.

Clients who telephone Contoso interact with the IVR application we are developing in this tutorial.

The Contoso IVR application provides, by phone, the same essential services that are available online:

  • Making new reservations for plane flights, hotel rooms, or rental cars.

  • Changing an existing reservation.

  • Getting restaurant recommendations in destination cities.

Callers who use the phone are expected to know in advance what they want to accomplish. The Web site, by comparison, also allows the client access to non-essential services, such as looking for package tours or viewing articles and video clips that highlight various travel destinations. Visitors to the Web site do not need to know in advance exactly what they want to accomplish.

VoiceXML compared to HTML

Since VoiceXML applications are often alternatives to Web sites, and most software developers are familiar with Web technology, it is valuable to examine the similarities and differences between a VoiceXML document and a typical HTML Web page.

VoiceXML is the markup language used to develop IVR applications, just as the familiar HTML is the markup language used to develop Web sites. The tags for the two languages are quite different, but the underlying technology is similar. Here is a very basic application that shows you what the VoiceXML tags look like:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <form>
      <field name="greeting">
         <prompt>Please say hello</prompt>
         <grammar mode="voice" xml:lang="en-US" version="1.0" root="top">
            <rule id="top"><item>hello</item></rule>
         </grammar>
         <filled>
            <prompt>Thank you. You said <value expr="greeting"/></prompt>
         </filled>
      </field>
   </form>
</vxml>


A VoiceXML application is highly analogous to a Web site, and, in fact, they share a lot of technology. For example,

  • VoiceXML browsers are based on the same software code as Internet browsers, such as Internet Explorer or Firefox.

  • VoiceXML applications are delivered by Web servers, using the HTTP protocol.

  • VoiceXML applications can connect using Secure Socket connections (the same HTTPS that is used for Web connections).

  • VoiceXML applications can use cookies.

  • Both VoiceXML and HTML can embed ECMAScript (JavaScript) in their code by using <script>....</script> tags.

  • VoiceXML references supporting documents such as grammars, scripts, and audio files, just as HTML references style sheets, scripts, and images.
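
As a rough illustration of the scripting point above, the following sketch embeds a small ECMAScript function in a VoiceXML document and uses its result in a prompt. The function name greetingFor and the greeting text are made up for this example; they are not part of the Contoso application developed in the lessons.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <!-- ECMAScript embedded in the document, much as in an HTML page.
        greetingFor is a hypothetical helper written for this sketch. -->
   <script>
      <![CDATA[
         function greetingFor(hour) {
            return (hour < 12) ? "Good morning" : "Good afternoon";
         }
      ]]>
   </script>
   <form>
      <block>
         <!-- The value element speaks the result of the script expression. -->
         <prompt>
            <value expr="greetingFor(new Date().getHours())"/>,
            and welcome to Contoso Travel.
         </prompt>
      </block>
   </form>
</vxml>
```

The CDATA section keeps characters such as the less-than sign in the script from being read as XML markup.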

User Interface

While the underlying technology is similar, the VoiceXML user interface is very different from the Web browser interface:

  • Web browser users see Web pages and are prompted by the pages' visual content, while VoiceXML browser users see nothing; they are prompted by speech they hear over the phone line.

  • Web browser users interact with the Web pages using a keyboard and mouse, while VoiceXML browser users interact with the application by speaking or pressing the keys of a touch-tone phone.

  • When Web users request information, it is primarily returned to them visually; when phone users request information, it is returned to them by speech.

  • Web pages can deliver enormously varied and complex content, while VoiceXML typically delivers simple, terse content.

When using a Web browser, a user can get a great deal of information simply by logging on and looking around. For some transactions, the user must input information by typing, choosing a radio button, checking a box, and so forth.

In a typical VoiceXML application, apart from a welcome statement, a caller gets little information just by calling up. The caller gets information by making voice requests. Furthermore, a VoiceXML browser user does not even know what to request until the VoiceXML application prompts him or her with speech, asking for input.

Getting User Input

Once again, it is instructive to compare VoiceXML with HTML.

Getting user input with HTML <form> elements

In the HTML code for a Web page, <form> elements (sections enclosed by <form> and </form> tags) can contain visual input devices (drop down lists, text boxes, radio buttons, and check boxes, for example). A user can type in the text boxes, choose radio buttons, check boxes, and make selections from drop down lists. Then, when the user is finished and presses a Submit button, the input is sent to the HTML application. Here is a simple example, where the Submit button is labeled "Add your card."


Getting user input with VoiceXML <form> elements

As with HTML applications, <form> elements in VoiceXML code are used to obtain user input. The mechanisms are quite different, however. Rather than presenting the user with visual elements, the VoiceXML application uses a <prompt> element. The <prompt> element delivers speech to the caller, posing a directed question or a request for information. The content of the prompt might be something like "Do you want to book a flight, a hotel, or a car?" or "Please say or key in your credit card number." Without prompts, a caller would not know what to input. The caller's response is detected and analyzed, also in the <form> element. The VoiceXML application's logic then determines what to do next (for example, present the caller with information or direct the caller to another dialog).
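
The prompt-and-response cycle described above might be sketched as follows. This is only an illustration built around the first sample prompt; the form id chooseService, the field name service, and the three grammar items are assumptions made for the sketch, and the real Contoso entry document is built in Lesson 1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <form id="chooseService">
      <field name="service">
         <!-- The prompt poses the directed question. -->
         <prompt>Do you want to book a flight, a hotel, or a car?</prompt>
         <!-- The grammar lists the answers the application expects. -->
         <grammar mode="voice" xml:lang="en-US" version="1.0" root="top">
            <rule id="top">
               <one-of>
                  <item>flight</item>
                  <item>hotel</item>
                  <item>car</item>
               </one-of>
            </rule>
         </grammar>
         <!-- The application logic decides what to do with the answer. -->
         <filled>
            <prompt>You chose <value expr="service"/>.</prompt>
         </filled>
      </field>
   </form>
</vxml>
```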

These question and answer interchanges, called dialogs, are the essence of VoiceXML applications. A VoiceXML application is essentially a container for a number of logically interconnected dialogs.

Multi-page documents

Users are accustomed to navigating from one page to another on Web sites. They begin on the entry page, which is called the "home page," and reach other pages by clicking hypertext links. It is, of course, possible to create a one-page Web site, even with a large number of topics. Such one-page sites are rare, however, because the user experience is much better if discrete topics are placed on their own Web page. Multi-page Web sites are also much easier to maintain because of their modularity. In a multi-page Web site, each page is an individual HTML file. These pages form a tree structure under the home page.

VoiceXML also supports multiple-file applications. In VoiceXML, each file is called a "document." A document which is shared by all the other documents is called the "root document." Documents other than the root document are called "leaf documents." A root document, which may or may not be the entry document, is not required in a VoiceXML application.

Transitioning from one document to another in a VoiceXML application works differently from page-to-page transitioning on Web sites. On a Web site, the user chooses to transition to a new page by clicking a link. In VoiceXML applications, the application chooses when to make transitions, based on responses obtained from the caller.

The VoiceXML application we are developing in this tutorial uses multiple documents, including a root document. Furthermore, we will use the root document as the application's entry point.
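
To make the root and leaf relationship concrete, here is a minimal sketch of a leaf document. A leaf document names its root document in the application attribute of the vxml element, and the goto element performs an application-driven transition to another document. The file names contoso_root.vxml and hotels.vxml are hypothetical and are not the names used in the lessons.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Leaf document. The application attribute points at the root document.
     Both file names below are placeholders for this sketch. -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1"
      application="contoso_root.vxml">
   <form id="flights">
      <block>
         <prompt>Let's book your flight.</prompt>
         <!-- The application, not the caller, decides when to move on. -->
         <goto next="hotels.vxml"/>
      </block>
   </form>
</vxml>
```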

VoiceXML dialogs

As already noted, VoiceXML applications are composed of a sequence of dialogs, short conversations between a caller and the application. The application speaks first. For example,

   Application prompt: "Please enter your policy number."
   Caller: "P 2 3 8 7 Z X 5 0 0 4"

<form> elements (dialogs) in the VoiceXML code are used to obtain caller input. A <form> element can contain one or more <field> elements. A <field> element is an input field, analogous to a "field" in HTML code that needs to be filled in by a user.


The term "dialog" is used when discussing VoiceXML applications because it expressively captures the concept of an interchange between a caller and an application. However, there is no VoiceXML element by that name. There are two types of dialogs: forms and menus, each of which is the name of a VoiceXML element. We will not use menus in the application that we develop in this tutorial, so "dialog" and "form" are synonymous for us. Since "form" is the name of a VoiceXML element, we will tend to use it in the remainder of this tutorial, rather than "dialog."

Directed forms

Throughout this tutorial, we will primarily use "directed forms." Each form can consist of one or more fields. Each field is a single question and answer. The caller must answer the question in each field in order of its appearance before proceeding to the next field.


There is also a type of structure called a "mixed initiative form." In such a form, the caller does not have to fill the fields (answer the questions) in order. We will not use a mixed initiative form in the application that we develop in our tutorial, but Lesson 10 shows you how such forms work.

In summary, a directed form with more than one field might look like this:

First field:
   Application prompt: "What is your policy number?"
   Caller: "P 2 3 8 7 Z X 5 0 0 4"
Second field:
   Application prompt: "What are the last four digits of your social security number?"
   Caller: "2 2 3 3"
Third field:
   Application prompt: "What is your date of birth?"
   Caller: "April 15th, 1985"

The caller must answer the question in the first field successfully before being transitioned to the second field, and so forth. The form (which is contained in a <form> element) is made up of all three fields.
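
In VoiceXML code, the three-field directed form above could be sketched roughly like this. The sketch assumes a hypothetical external grammar file, policy_number.grxml, for the alphanumeric policy number, and uses the builtin digits and date grammars for the other two fields; the form id and field names are also invented for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <form id="identifyCaller">
      <!-- First field. policy_number.grxml is a hypothetical grammar file. -->
      <field name="policyNumber">
         <prompt>What is your policy number?</prompt>
         <grammar src="policy_number.grxml" type="application/srgs+xml"/>
      </field>
      <!-- Second field. Builtin digits grammar, limited to four digits. -->
      <field name="ssnLastFour" type="digits?length=4">
         <prompt>What are the last four digits of your social security number?</prompt>
      </field>
      <!-- Third field. Builtin date grammar. -->
      <field name="birthDate" type="date">
         <prompt>What is your date of birth?</prompt>
      </field>
      <!-- A form-level filled element runs once all three fields are filled. -->
      <filled>
         <prompt>Thank you.</prompt>
      </filled>
   </form>
</vxml>
```

The fields are visited in document order, which is what makes this a directed form.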

Anatomy of a <field> element

In each field, the application prompts the caller with a question or request for information. The caller responds with an answer. The field contains an associated "grammar" (discussed in detail below), which is a list of responses that the application expects. A speech recognition engine compares what the caller said to the expected responses in the grammar. If what the caller said is interpreted as being one of the expected responses, there is said to be a "match." Another section of the field, the <filled> element, contains logic to determine what happens when the caller's input has been matched in the grammar. For example, this could be a transition to the next field in the form, or to another form, or to another document.

Here is the basic organization of a <field> element in VoiceXML code:

<field name="myQA">
   <prompt>
      <!-- speech to be delivered to caller
      in the form of a question or a request for information. -->
   </prompt>
   <grammar>
      <!-- words or phrases expected in the
      caller's response. -->
   </grammar>
   <filled>
      <!-- logic that determines what to do
      when a grammar match is found. -->
   </filled>
</field>

The principal components of a <field> element are the <prompt>, <grammar>, and <filled> elements, each of which is a child of the <field> element.

The <prompt> element

The <prompt> element contains the audio that is delivered to the caller by the VoiceXML application. It may be either a pre-recorded audio file or text-to-speech (TTS).
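
The two kinds of prompt content can be sketched as follows. The first prompt plays a recorded file and falls back to TTS for the enclosed text if the recording cannot be played; the second prompt is plain TTS. The file name welcome.wav and the wording are assumptions made for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <form>
      <block>
         <prompt>
            <!-- Pre-recorded audio. If welcome.wav (a placeholder name)
                 cannot be played, the enclosed text is spoken with TTS. -->
            <audio src="welcome.wav">Welcome to Contoso Travel.</audio>
         </prompt>
         <!-- Plain text is rendered with text-to-speech. -->
         <prompt>How can we help you today?</prompt>
      </block>
   </form>
</vxml>
```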

The <grammar> element

The <grammar> element contains a list of acceptable caller responses that the application expects and can understand. A grammar may be embedded inline as code in the application or may simply be a pointer to an external file. Grammars are used by the speech engine to help interpret the caller's utterances. The speech engine tries to match what the caller says to the words and phrases in the grammar and return the correct interpretation as specified by the grammar. For example, in a yes-or-no grammar, the phrases "yes," "you bet," "sure," "yup," "of course," and others like them could all be mapped to the same "YES" interpretation so that an application could handle any one of a number of positive responses in the same way.
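
A yes-or-no grammar of the kind just described might look like the following sketch, written as a standalone SRGS file in XML form. It assumes the W3C semantic interpretation tag format (semantics/1.0), in which a tag element assigns the interpretation to out; the tag syntax that a particular platform accepts may differ, so treat the tag lines as an assumption and see Appendix B -- Processing Grammar Returns.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="top"
         tag-format="semantics/1.0">
   <rule id="top" scope="public">
      <one-of>
         <!-- Every positive phrase maps to the same "YES" interpretation. -->
         <item>
            <one-of>
               <item>yes</item>
               <item>you bet</item>
               <item>sure</item>
               <item>yup</item>
               <item>of course</item>
            </one-of>
            <tag>out = "YES";</tag>
         </item>
         <!-- Every negative phrase maps to "NO". -->
         <item>
            <one-of>
               <item>no</item>
               <item>nope</item>
            </one-of>
            <tag>out = "NO";</tag>
         </item>
      </one-of>
   </rule>
</grammar>
```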

The <filled> element

The <filled> element contains instructions for what to do if the speech recognition engine finds a match between the caller's response and the grammar's choices. Possible actions include transitioning to another field in the form or to another form in the call flow, obtaining information from a database to relay to the caller, booking a ticket, charging a credit card, or any number of other actions.
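
As a small sketch of such logic, the following document branches on a yes-or-no answer and transitions to one of two other forms. It assumes the builtin boolean grammar, which yields an ECMAScript true or false value; the form ids bookFlight and askDestination are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <form id="confirmDestination">
      <field name="callerResponse" type="boolean">
         <prompt>You want to fly to New York. Is that correct?</prompt>
         <filled>
            <!-- The builtin boolean grammar fills the field with true or false. -->
            <if cond="callerResponse">
               <goto next="#bookFlight"/>
            <else/>
               <goto next="#askDestination"/>
            </if>
         </filled>
      </field>
   </form>
   <!-- Placeholder forms so that both transitions have a target. -->
   <form id="bookFlight">
      <block><prompt>Booking your flight to New York.</prompt></block>
   </form>
   <form id="askDestination">
      <block><prompt>Let's choose a different destination.</prompt></block>
   </form>
</vxml>
```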

A working example

Here is an example of a small but complete VoiceXML application that includes a single form with one field that contains an inline grammar. The caller is prompted (using TTS) for an answer that should be "yes" or "no." The caller's response, if it matches one of the grammar choices, is placed in the callerResponse variable. In the <filled> section, the VoiceXML application handles the return from the grammar match. In this case, the application just tells the caller what their answer was, using the value of the callerResponse variable.

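<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
   <form id="mainDialog">
      <field name="callerResponse">
         <prompt>You want to fly to New York. Is that correct?</prompt>
         <grammar version="1.0" mode="voice" root="top">
            <rule id="top">
               <one-of><item>yes</item><item>no</item></one-of>
            </rule>
         </grammar>
         <filled>
            <prompt> thank you </prompt>
            <prompt>you said <value expr="callerResponse"/> </prompt>
         </filled>
      </field>
   </form>
</vxml>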


The value of the name attribute of the <field> element, callerResponse, is also the name of a variable. If there is a match between the caller's utterance and one of the grammar's choices, the VoiceXML interpreter puts the matched phrase into that variable.

You can and should run this simple application on the Tellme platform to see what happens. This is a good time to learn how to run applications in Tellme Studio (see Appendix A -- Testing VoiceXML Applications on the Tellme Platform). It will help your comprehension a great deal if you run all of the examples in the tutorial.

The importance of grammars

Accurate voice recognition is essential to successful IVR applications.

As noted, the application prompts the caller with a question or request in each field. The caller responds with an answer (sometimes called an "utterance"). A key element in the field, the grammar, lists the possible responses that the form expects. The speech recognition engine compares what the caller said to the expected responses described in the grammar.

A typical VoiceXML application uses one or more grammars in every field. The syntax of grammars used with the Microsoft speech recognition engine should conform to the XML form of the W3C’s Speech Recognition Grammar Specification (SRGS) Version 1.0 standard (see http://www.w3.org/TR/2004/REC-speech-grammar-20040316/).

Speech recognition engines can be trained to accurately recognize the general speech of particular individuals. With a large number of anonymous and varied callers (with differing sex, age, accent, and phone connection quality), however, such a training approach is impossible. In a VoiceXML application, grammars are used to severely limit the number of words that the speech recognition engine must consider.

As an example, suppose a field in a dialog expects a yes or no answer. The field's grammar (every field has one or more grammars) tells the speech recognition engine to expect "yes" or "no," along with a number of alternate ways a caller might say them ("yeah," "yup," "right," "nah," "nope," "wrong," and so forth). As a result, the speech recognition engine does not have to determine which word in a vocabulary of tens of thousands was spoken; it simply has to decide which, if any, among a handful of words was spoken.

In summary, grammars are the keys to successful voice recognition, and therefore are the keys to a successful IVR application.

Using grammars in the tutorial

In this tutorial, you will see several different types of grammars:

  • Inline grammars. These grammars are embedded in the VoiceXML code, and written by the person who writes the code. Such grammars are used regularly from Lesson 1 onwards.

  • Grammars in external files. Tellme has written a number of grammar files that can be referenced by the public. These include grammars for general numbers, phone numbers, Social Security numbers, times, dates, street addresses, zip codes, and others (such grammars are covered in Lesson 5). You can also write your own grammars and put them in external files.

  • Builtin grammars. The Tellme platform provides the builtin grammars specified by the VoiceXML 2.0 and 2.1 specifications. These are: boolean, date, digits, currency, number, phone, and time. Such grammars are covered in Lesson 5.
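
The three kinds of grammar listed above can be contrasted in a single form, as in the sketch below. The external file name cities.grxml, the form id, the field names, and the prompts are all assumptions made for this illustration; Lesson 5 covers the grammar files and builtin grammars that the Tellme platform actually provides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
   <form id="grammarStyles">
      <!-- Inline grammar: the rules are written directly in the document. -->
      <field name="tripType">
         <prompt>Is this trip for business or pleasure?</prompt>
         <grammar mode="voice" xml:lang="en-US" version="1.0" root="top">
            <rule id="top">
               <one-of><item>business</item><item>pleasure</item></one-of>
            </rule>
         </grammar>
      </field>
      <!-- External grammar: src points at a separate SRGS file
           (cities.grxml is a hypothetical name). -->
      <field name="city">
         <prompt>Which city are you visiting?</prompt>
         <grammar src="cities.grxml" type="application/srgs+xml"/>
      </field>
      <!-- Builtin grammar: selected with the type attribute of the field. -->
      <field name="travelers" type="number">
         <prompt>How many people are traveling?</prompt>
      </field>
   </form>
</vxml>
```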


Inline grammars are included in many of the tutorial examples, largely without explanation. You should read the Tellme Speech Service Developer's Guide (http://msdn.microsoft.com/en-us/library/ee800148.aspx) in parallel with working through this tutorial, as much as is necessary for you to understand how the grammars in the tutorial examples work. You should have mastered the material in the Tellme Speech Service Developer's Guide by the time you complete this tutorial.

Reference materials on MSDN

Something that you will want to refer to frequently is the VoiceXML 2.x Element Catalog, found at http://msdn.microsoft.com/en-us/library/ff934626.aspx. For each element that can be used in VoiceXML, you can learn what its parent elements can be, what its child elements can be, and what its attributes are.

You should bookmark the Element Catalog Summary, a table of all the elements with brief descriptions of each element and links to them. You will refer to this summary very frequently when writing VoiceXML code. The element summary's URL is http://msdn.microsoft.com/en-us/library/ff928995.aspx.


The VoiceXML 2.x Element Catalog includes information that is specific to the Tellme platform. For example, it includes references to Version 2 and Version 3 (older versions of the Tellme platform) and to special Tellme element attributes ("Tellme extensions") and properties (tellme.xxxx). In addition, the defaults listed for attributes and properties are for the Tellme platform; these are often, but not always, the same as the defaults for other platforms. Nonetheless, the element reference is a valuable resource for developing code for any platform, and it should be easy for you to filter out the information that is Tellme specific.

Reference materials on the Tellme Studio Web site

The Tellme Studio Web site provides information that you will find valuable when working through this tutorial. In fact, the tutorial will reference particular items on the Studio site from time to time.

To access the Studio information, you must be a developer member of the Tellme Studio: Go to http://studio.tellme.com and click the Join Studio button. Fill out the required information in the Become a Tellme Studio Developer form and click the Submit button. Credentials for using Studio online and by phone are then e-mailed to you.


Keep these credentials in a safe place.

Here is a screenshot of the Tellme Studio home page:


You should explore the content of this Web site, which provides insight into many aspects of building a successful application on the Tellme platform using the Microsoft speech engine.


The VoiceXML 2.x Elements link, under Documentation in the left-hand navigation panel, opens a VoiceXML 2.x Element Catalog that is the same as the Element Catalog on MSDN, mentioned in the preceding section.

What's next?

In Lesson 1, we will plan the structure of our travel agency application, work out a strategy for navigation, and build the basic entry-point document.

Lesson 1 will introduce all the requirements and concepts needed to create the basic entry-point document for the Contoso Travel IVR application. By the end of this first lesson we will have created a VoiceXML document that prompts the caller to make, change, or cancel a reservation and decides what to do next, based on the caller's response.