December 2012

Volume 27 Number 12

Windows Phone - Speech-Enabling a Windows Phone 8 App, Part 2: In-App Dialog

By F Avery Bishop | December 2012

Last month, in part 1 (msdn.microsoft.com/magazine/jj721592) of this two-part series, I discussed enabling voice commands in a Windows Phone 8 app. Here, I’ll discuss dialog with the user in a running app using speech input and output.

Once an app has been launched, many scenarios can benefit from interaction between the user and the phone using speech input and output. A natural one is in-app dialog. For example, the user can launch the Magic Memo app (see previous article) to go to the main page and then use speech recognition to enter a new memo, receive audio feedback and confirm the changes. Assuming no misrecognitions, the user can completely enter and save several memos without touching the phone (other than the first long push on the Start button).

You can imagine many other scenarios using speech dialog starting out in the app. For example, once the user has navigated to a page showing a list of saved favorites such as memos, movies or memorabilia, she could use recognition to choose one and take an action: edit, play, order, remove and so on. Speech output would then read back the selection and ask for confirmation.

In the following sections I’ll lay out examples using speech for input and output, starting with simple cases and working up to more complex ones. I’ll show how easy the simple cases are to implement and demonstrate some of the richer functionality available for advanced scenarios.

Communicating to the User: Speech Synthesis API

Computer-generated speech output is variously called text to speech (TTS) or speech synthesis (though strictly speaking, TTS encompasses more than speech synthesis). Common uses include notification and confirmation, as mentioned earlier, but it’s also essential to other use cases such as book readers or screen readers.

A Simple Example of Speech Synthesis

In its simplest form, your app can translate a text string to spoken audio in just two lines of code. Here’s an example using code extracted from the Magic Memo sample:

// Instantiate a speech synthesizer
private SpeechSynthesizer speechOutput = 
  new SpeechSynthesizer();
// ...
// Private method to get a new memo
private async void GetNewMemoByVoice()
{
  await speechOutput.SpeakTextAsync("Say your memo");
  // Other code for capturing a new memo
}

When the user taps the mic button, she’ll hear “Say your memo” spoken from the current audio device. In the following sections I’ll expand on this example by adding code that accepts the user’s input using speech recognition.

TTS Features for Advanced Scenarios

Apps that rely heavily on speech output might have use cases that require changing volume, pitch or speaking rate in the course of speech output. To cover these advanced cases, there are two additional methods: SpeakSsmlAsync and SpeakSsmlFromUriAsync. These methods assume the input is in Speech Synthesis Markup Language (SSML) format, a World Wide Web Consortium (W3C) XML standard for embedding properties of the audio and the synthesizer engine into the text to be spoken. The Magic Memo code download doesn’t include SSML sample code, but you can find out more about SSML in the MSDN Library reference article at bit.ly/QwWLsu (or the W3C specification at bit.ly/V4DlgG).

The synthesizer class also has events for SpeakStarted and BookmarkReached, and there are overloads for each Speak method that take a generic state object as a second parameter to help you keep track of which instance of the Speak method generated a particular event. Using SSML and handling the events, your code can provide features such as highlighting spoken text or restarting a Speak call in the middle of a paragraph.
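As a rough illustration of the SSML path, the sketch below speaks a short confirmation at a slower rate and places a bookmark before the closing question. It isn’t taken from the Magic Memo sample: the class name, method name and prompt text are hypothetical, and it assumes the BookmarkReached event arguments expose the mark name through a Bookmark property.

```csharp
using System;
using System.Diagnostics;
using Windows.Phone.Speech.Synthesis;

// Hypothetical helper class; not part of the Magic Memo sample
public class SsmlPromptSketch
{
  private readonly SpeechSynthesizer speechOutput = new SpeechSynthesizer();

  public SsmlPromptSketch()
  {
    // Assumption: the event args carry the <mark> name in a Bookmark property
    speechOutput.BookmarkReached += (sender, args) =>
      Debug.WriteLine("Reached bookmark: " + args.Bookmark);
  }

  public async void ConfirmMemo()
  {
    // SSML 1.0 document: <prosody> changes rate and volume,
    // <break> inserts a pause, <mark> sets a bookmark
    string ssml =
      "<speak version=\"1.0\" " +
      "xmlns=\"http://www.w3.org/2001/10/synthesis\" " +
      "xml:lang=\"en-US\">" +
      "You said: <prosody rate=\"slow\">Buy roses</prosody>" +
      "<break time=\"500ms\"/>" +
      "<mark name=\"beforeQuestion\"/>" +
      "<prosody volume=\"soft\">Is that correct?</prosody>" +
      "</speak>";

    await speechOutput.SpeakSsmlAsync(ssml);
  }
}
```

If you build the SSML string from user-supplied text instead of a literal, the text would need to be XML-escaped first so that characters such as & and < don’t break the markup.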

Speech Input: Speech Recognition API

The two broad classes of use cases for speech recognition in an app are text input and command and control. In the first use case, text input, the app simply captures the user’s utterance as text; this is useful when the user could say almost anything, as in the “new memo” feature of the sample code.

In the second use case, command and control, the user manipulates the app by spoken utterance rather than by tapping buttons or sliding a finger across the face of the phone. This use case is especially useful in hands-free scenarios such as driving or cooking.

A Simple Example of Speech Recognition

Before going into detail about the features of speech recognition in an app, let’s take a look at the simplest case: text input in a few lines of code.

Figure 1 shows the GetNewMemoByVoice method shown earlier, but with lines added to initialize a recognizer object, start a recognition session and handle the result.

Figure 1 Initializing a Recognizer Object, Starting a Recognition Session and Handling the Result

private SpeechRecognizerUI speechInput =
  new SpeechRecognizerUI();
// ...
// In the page constructor (or another initialization method),
// set the text to display to the user when recognizing
speechInput.Settings.ExampleText =
  "Example: \"Buy roses\"";
speechInput.Settings.ListenText = "Say your memo";
// ...
// Private method to get a new memo
private async void GetNewMemoByVoice()
{
  await speechOutput.SpeakTextAsync("Say your memo"); // TTS prompt
  // Uses the default Dictation grammar
  var recoResult =
    await speechInput.RecognizeWithUIAsync();
  Memo_TB.Text =
    recoResult.RecognitionResult.Text; // Do something with the result
}

Of course, in real code it’s never as simple as this, and if you look in the Magic Memo sample, you’ll see a try/catch block and a check for successful recognition.
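A minimal sketch of that more defensive version might look like the following. It reuses the speechOutput, speechInput and Memo_TB members from Figure 1. The catch block is deliberately generic, and the message text is hypothetical rather than copied from the sample.

```csharp
// Fragment of the page class; assumes "using System;",
// "using System.Windows;" and "using Windows.Phone.Speech.Recognition;"
private async void GetNewMemoByVoice()
{
  try
  {
    await speechOutput.SpeakTextAsync("Say your memo");
    var recoResult = await speechInput.RecognizeWithUIAsync();

    // Only use the text if the session actually succeeded;
    // the user may have cancelled the recognition UI
    if (recoResult.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
    {
      Memo_TB.Text = recoResult.RecognitionResult.Text;
    }
  }
  catch (Exception ex)
  {
    // Hypothetical handling: report the failure and leave the memo unchanged
    MessageBox.Show("Speech input failed: " + ex.Message);
  }
}
```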

If you try this in the sample app by tapping the mic icon, you’ll notice that after you’ve spoken your memo, a “thinking” screen appears, followed by a confirmation UI, after which the result is inserted in the memo text box. Behind the scenes a lot is going on, not the least of which is the use of a “grammar” on a remote server to recognize your speech. A grammar is essentially a set of rules specifying what lexical entries (“words”) the engine needs to recognize and in what order. In the next sections I’ll explore the speech recognition API and how it’s used with recognition grammars.

Overview of the Speech Recognition API

Before I get into the details of coding for speech recognition, let’s take a high-level look at the classes in the API and their roles. Figure 2 shows the basic layout of the API. The first thing you’ll notice is that two boxes have SpeechRecognizer in the name.

Figure 2 Speech Recognition API Design Overview

If your app doesn’t need to display a UI with speech recognition, or if you want to display your own custom UI, you should instantiate the SpeechRecognizer class shown in the middle-left of Figure 2. Think of this object as the base operational unit of speech recognition within this API. This is where the app adds any grammars it requires. After initialization, you call RecognizeAsync to do the actual recognition. Because RecognizeAsync returns an IAsyncOperation<SpeechRecognitionResult>, status and a result object are available in the Completed callback function. Thus, there are no separate events for recognition completed or rejected as in other managed speech APIs.

As the name implies, the top-level SpeechRecognizerUI class provides speech recognition with a default GUI that’s consistent with the phone’s global speech UI for feedback, disambiguation and confirmation. To maintain compatibility with the global speech UI and simplify coding, most apps should use this class rather than the non-UI class mentioned earlier. When you instantiate a SpeechRecognizerUI object, it comes with two important objects: a Settings object, where you set the UI text to display to the user; and a SpeechRecognizer object, where you can specify grammars as described in the following sections. Once initialized, you should call RecognizeWithUIAsync on the parent SpeechRecognizerUI object to launch a recognition session. If you use RecognizeAsync on the child SpeechRecognizer object, it will recognize as if the SpeechRecognizer object were being used standalone, that is, without a UI. Hereafter, the terms SpeechRecognizer and RecognizeAsync are understood to be generic references for the objects and methods with and without a UI, as appropriate.
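For the case without a UI, a minimal sketch could look like this. The class name, grammar name and phrases are hypothetical and chosen only for illustration; list grammars such as this one are covered in the sections on grammars.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Phone.Speech.Recognition;

// Hypothetical wrapper around the non-UI recognizer
public class QuietListener
{
  private readonly SpeechRecognizer recognizer = new SpeechRecognizer();

  public QuietListener()
  {
    // Add the grammar once, when the recognizer is set up
    recognizer.Grammars.AddGrammarFromList(
      "answers", new string[] { "yes", "no" });
  }

  // Runs one recognition session with no system UI
  // and returns the recognized text
  public async Task<string> ListenAsync()
  {
    SpeechRecognitionResult result = await recognizer.RecognizeAsync();
    return result.Text;
  }
}
```

Because no system UI is shown, the app itself is responsible for telling the user when to speak, for example with a TTS prompt or a custom visual cue.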

Steps for Using Speech Recognition

There are four basic steps for using speech recognition in a Windows Phone 8 app:

  1. Create grammars to be used in the recognition process (not needed if using a predefined grammar).
  2. Initialize the SpeechRecognizer object by setting properties and adding grammars as needed.
  3. Start the recognition session by calling SpeechRecognizer.RecognizeAsync or SpeechRecognizerUI.RecognizeWithUIAsync.
  4. Process the recognition result and take the appropriate action.

Figure 1 shows all of these steps except No. 1, Create grammars. The predefined Dictation grammar is the default grammar, so there’s no need to create or add it to the Grammars collection.

The code to implement these steps largely depends on the type of grammar used in speech recognition. The next section describes the concept and use of speech recognition grammars in Windows Phone 8.

Introduction to Speech Recognition Grammars

Modern speech recognition engines all use grammars to constrain the set of phrases through which the recognition engine must search (hereafter called the “search space”) to find a match to the user’s utterance, and thus improve recognition accuracy. Grammar rules may allow recognition of phrases as simple as a list of numbers or as complex as general conversational text.

In the Windows Phone 8 speech API you can specify a grammar in three ways, as described in the following sections. For each case, you add the grammar to a collection of grammars on the SpeechRecognizer object.

Simple List Grammar

The easiest way to specify a custom grammar for an app is to provide a list of all the phrases for which the recognizer should listen in a simple string array. These list grammars are handled by the on-device speech recognition engine. The code to create and add a list grammar can be as simple as the following for a static list of button names to recognize against:

commandReco.Recognizer.Grammars.AddGrammarFromList(
  "mainPageCommands", new string[] { "cancel", "save", "quit" });

The Magic Memo sample does something a little more sophisticated: It builds up the list of phrases by finding the Content attribute of all the button controls on the page and adding the content text to a string list. See the InitCommandGrammar method in MainPage.xaml.cs for details.
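The general idea can be sketched as follows. This is not the sample’s InitCommandGrammar method: the helper name and its parameters are hypothetical, and it only looks at buttons that are direct children of the panel passed in.

```csharp
// Fragment of the page class; assumes "using System.Linq;",
// "using System.Windows.Controls;" and
// "using Windows.Phone.Speech.Recognition;"
// Hypothetical helper, not the sample's InitCommandGrammar
private void AddButtonGrammar(Panel container, SpeechRecognizerUI commandReco)
{
  // Collect the text content of every Button directly inside the panel,
  // skipping buttons whose Content is not a non-empty string
  string[] phrases = container.Children
    .OfType<Button>()
    .Select(button => button.Content as string)
    .Where(text => !string.IsNullOrEmpty(text))
    .ToArray();

  commandReco.Recognizer.Grammars.AddGrammarFromList(
    "mainPageCommands", phrases);
}
```

Building the list from the controls keeps the spoken commands in step with the button captions, so renaming a button in XAML doesn’t require a matching change in the grammar code.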

To process the result of a recognition session using a list grammar, you read the Text property on SpeechRecognitionUIResult.RecognitionResult (or directly on SpeechRecognitionResult if using the version without a UI). You could do this, for example, in a switch statement, as shown in Figure 3.

Figure 3 Processing the Result of a Recognition Session

switch (result.RecognitionResult.Text.ToLower())
{
  case "cancel":
    // Cancel code
    break;
  case "save":
    // Save memo code
    break;
  case "quit":
    break;
  default:
    break;
}

A more detailed example is found in the CommandCompleted callback in MainPage.xaml.cs.

Predefined Grammar

The Speech API on Windows Phone 8 provides two predefined grammars: Dictation and WebSearch. Dictation is also called Short Message Dictation and employs the same grammar as used in the built-in Texting app. In contrast, WebSearch is optimized for the phrases used to search online. The built-in Find/Search command uses the same WebSearch grammar.

The search space for both predefined grammars is vast, requiring the processing power available through remote speech recognition using the Microsoft speech Web service. In general these grammars aren’t well suited to command and control because of the possibility of misrecognition and the wide range of possible results.

A major advantage of predefined grammars is that they’re easy to implement in an app. For example, to use the WebSearch grammar rather than the default Dictation grammar in Figure 1, you simply add this line before the call to RecognizeWithUIAsync:

speechInput.Recognizer.Grammars.AddGrammarFromPredefinedType(
  "webSearch", SpeechPredefinedGrammar.WebSearch);

You process the recognition result from a predefined grammar by accessing the result Text property, as shown in Figure 1.

Grammars in Speech Recognition Grammar Specification Format

The Speech Recognition Grammar Specification (SRGS) is a W3C standard in XML format. For details about the format and usage, see the MSDN Library article, “SRGS Grammar XML Reference,” at bit.ly/SYnAu5; the W3C specification at bit.ly/V4DNeS; or any number of tutorial Web pages that you’ll find by searching online for “SRGS grammar.” SRGS grammars offer rich functionality such as the ability to specify optional items and to repeat items, rules, rule references, special rules and semantics, at the expense of extra effort to author, test and debug the grammar. In Windows Phone 8, SRGS grammars are used only in the local recognition engine on the phone, that is, not in the remote service.

To add an SRGS grammar, you reference the URI of the grammar file in the app’s install path, as follows:

commandReco.Recognizer.Grammars.AddGrammarFromUri(
  "srgsCommands", new Uri("ms-appx:///ViewMemos.grxml"));

One major advantage of SRGS grammars is that they allow you to specify semantic values to simplify the processing of a wide range of user responses without accessing the recognized utterance (which is available on the RecognitionResult.Text property, as always).

SRGS semantics are objects (which in practice are often strings) that you assign to variables in your SRGS grammar using a <tag> element and a subset of ECMAScript. They have two advantages over using the recognized text directly:

  1. Simplified processing: You can determine the user’s intent without parsing the recognized text, which might take on multiple forms for the same meaning. For example, using semantics, you can map all utterances that mean affirmative—“yes,” “yup,” “affirmative,” “OK” or “ya”—to the single semantic value “yes.”
  2. Ease of localization: You can use the same codebehind to process utterances in any supported spoken language if you use a uniform set of semantic values across all languages.
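To see what the first advantage saves you, consider the kind of lookup you would otherwise have to maintain in codebehind when working from the recognized text alone. The sketch below is purely illustrative: the class name and the "no" and "nope" entries are hypothetical, and with SRGS semantics this mapping would live in the grammar instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical text-based normalizer; SRGS semantics make this unnecessary
public static class AnswerNormalizer
{
  // Maps each accepted phrase to one canonical value, ignoring case
  private static readonly Dictionary<string, string> Canonical =
    new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    {
      { "yes", "yes" }, { "yup", "yes" }, { "affirmative", "yes" },
      { "OK", "yes" }, { "ya", "yes" },
      { "no", "no" }, { "nope", "no" }
    };

  // Returns "yes" or "no", or null if the text is not a known answer
  public static string Normalize(string recognizedText)
  {
    if (recognizedText == null)
    {
      return null;
    }
    string value;
    if (Canonical.TryGetValue(recognizedText.Trim(), out value))
    {
      return value;
    }
    return null;
  }
}
```

A table like this has to be duplicated and translated for every spoken language the app supports, which is exactly the work that a uniform set of semantic values moves into the grammar files.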

To illustrate these concepts, the Magic Memo sample uses a simple grammar ViewMemos.grxml for controlling the ViewMemos.xaml page; excerpts from that grammar file with the semantic tags are shown in Figure 4. The function micImage_Tap in ViewMemos.xaml.cs (excerpted in Figure 5) demonstrates the use of semantic values in mapping the user’s utterance to an action.

Figure 4 Excerpts from ViewMemos.grxml SRGS Grammar

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar
           http://www.w3.org/TR/speech-grammar/grammar.xsd"
         xml:lang="en-US" version="1.0" tag-format="semantics/1.0"
           root="buttons">
  <rule id="buttons" scope="public">
    <one-of>
      <!--The 'process' semantic can be one of 'clear',
        'save', 'new', or 'select'-->
      <item>
        <!--Reference to the internal rule "scope" below-->
        Clear <ruleref uri="#scope" type="application/srgs+xml"/>
        <tag>out.process="clear";out.num = rules.latest();</tag>
      </item>
      <item>
        Save
        <item repeat="0-1">changes</item>
        <tag>out.process="save";</tag>
      </item>
      <item>
        Enter new
        <tag>out.process="new";</tag>
      </item>
      <item>       
        Select
        <item repeat="0-1">memo</item> <!-- Optional words -->
        <item repeat="0-1">number</item>
        <!--Reference to the internal rule "number" below -->
        <ruleref uri="#number" type="application/srgs+xml"/>
        <tag>out.process="select";out.num =
          rules.latest();</tag>
      </item>
    </one-of>
  </rule>
  <rule id="scope" scope="private">
    <one-of> <!-- Can be "all", "selected" or a number from the
      'number' rule -->
      <item>
        all <tag>out.scope="all";</tag>
      </item>
      <item>
        selected <tag>out.scope="selected";</tag>
      </item>
      <item>
        <item repeat="0-1">memo</item> <!-- Optional words -->
        <item repeat="0-1">number</item>
        <ruleref uri="#number" type="application/srgs+xml"/>
      </item>
    </one-of>
  </rule>
  <rule id="number" scope="public">
    <item>
      1
    </item>
    <!-- See ViewMemos.grxml for the remainder
      of the items in this block -->
  </rule>
</grammar>

Figure 5 Handling a Recognition Result Using Semantic Properties

// micImage Tap handler, excerpted from ViewMemos.xaml.cs
private async void micImage_Tap(object sender, GestureEventArgs e)
{
  var commandResult = await commandReco.RecognizeWithUIAsync();
  if (commandResult.ResultStatus ==
    SpeechRecognitionUIStatus.Succeeded)
  {
    var commandSemantics = commandResult.RecognitionResult.Semantics;
    SemanticProperty process = null;
    if (commandSemantics.TryGetValue("process", out process))
    {
      // In general a semantic property can be any object,
      // but in this case it's a string
      switch (process.Value as string)
      {
        // For this grammar the "process" semantic more or less
        // corresponds to a button on the page
        case "select":
          // Command was "Select memo number 'num'"
          break;
        case "clear":
          // Command was "Clear memo number 'num'", "Clear all"
          // or "Clear selected"
          break;
        case "save":
          // Command was "Save" or "Save changes"
          break;
        case "new":
          // Command was "Enter new"
          break;
        default:
          break;
      }
    }
  }
}

This sample just scratches the surface of what’s possible with semantics. To explore more, start with the MSDN Library article, “Using the tag Element,” at bit.ly/PA80Wp. The W3C standard for semantics is at bit.ly/RyqJxc.

You can try out this grammar in the Magic Memo sample by navigating to the ViewMemos page and tapping the mic icon. The file ViewMemos.xaml.cs has the codebehind, including code under a #define section that you can activate (using #define SemanticsDebug) to display and debug the semantic values returned on the recognition result.
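The "num" semantic set by the grammar in Figure 4 can be read in the same way as "process" in Figure 5. The helper below is a hypothetical sketch rather than code from the sample, and it assumes that for a spoken number the semantic value converts to a digit string such as "1".

```csharp
// Fragment for ViewMemos.xaml.cs; assumes
// "using Windows.Phone.Speech.Recognition;"
// Hypothetical helper, not part of the sample
private static bool TryGetMemoNumber(
  SpeechRecognitionResult result, out int memoNumber)
{
  memoNumber = 0;
  SemanticProperty num = null;
  if (!result.Semantics.TryGetValue("num", out num) || num.Value == null)
  {
    return false; // The command carried no "num" semantic
  }
  // Assumption: a spoken number yields digits such as "1"; for commands
  // like "Clear all" the value is not numeric and the parse fails
  return int.TryParse(num.Value.ToString(), out memoNumber);
}
```

In the "select" case of Figure 5, a call such as TryGetMemoNumber(commandResult.RecognitionResult, out n) would then tell you which memo the user asked for, or that no usable number was spoken.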

Using Multiple Grammars on the Same Recognizer Object

A natural question to ask at this point is whether you can use more than one grammar on a SpeechRecognizer object. The answer is yes, with some restrictions. Here are some guidelines and coding techniques for using multiple grammars:

  1. If you add a predefined grammar, you can’t add any other grammars. Also, you can’t disable a predefined grammar; it’s the one and only grammar associated with that recognizer object for its lifetime.

  2. You can add multiple custom grammars (list grammars and SRGS grammars) to a single recognizer object and enable or disable the grammars as needed for different scenarios in your app:

    1. To access a specific grammar, use the grammar name (the string parameter passed in the call to the AddGrammar method) as a key on the Grammars collection.
    2. To enable or disable a particular grammar, set its Enabled Boolean to true or false. For example, the following will disable the grammar named “buttonNames”:
       myRecognizer.Grammars["buttonNames"].Enabled = false;

  3. When you call any of the AddGrammar methods, the grammar is put in a queue to await processing but isn’t parsed or loaded. The grammar is compiled and loaded on the first call to RecognizeAsync or on an optional call to PreloadGrammarsAsync. Calling this latter method before actual use can reduce the latency in returning a result from RecognizeAsync and is therefore recommended for most use cases.
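Put together, these guidelines might be applied as in the following sketch. The method names are hypothetical; the grammar names and the ViewMemos.grxml file are reused from the earlier examples.

```csharp
// Fragment of the page class; assumes "using System;",
// "using System.Threading.Tasks;" and
// "using Windows.Phone.Speech.Recognition;"
private SpeechRecognizerUI commandReco = new SpeechRecognizerUI();

// Hypothetical initialization method: queue two custom grammars,
// then compile and load them before the first recognition
private async Task InitGrammarsAsync()
{
  var grammars = commandReco.Recognizer.Grammars;
  grammars.AddGrammarFromList(
    "buttonNames", new string[] { "cancel", "save", "quit" });
  grammars.AddGrammarFromUri(
    "srgsCommands", new Uri("ms-appx:///ViewMemos.grxml"));

  // Optional, but avoids paying the compile cost on the first RecognizeAsync
  await commandReco.Recognizer.PreloadGrammarsAsync();
}

// Hypothetical helper: listen only for the SRGS commands
private void UseSrgsCommandsOnly()
{
  commandReco.Recognizer.Grammars["buttonNames"].Enabled = false;
  commandReco.Recognizer.Grammars["srgsCommands"].Enabled = true;
}
```

Neither grammar here is a predefined one, which is what allows them to share a recognizer and be toggled independently.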

The Next ‘Killer App’

The speech features for apps on Windows Phone 8 represent, among all smartphone offerings, the first fully functional developer platform for speech featuring both on-device and remote recognition services. Using voice commands and in-app dialog, you can open up your app to many compelling scenarios that will delight your users. With these speech features, your app could catch the buzz and be the next “killer app” in the marketplace.


F Avery Bishop has been working in software development for more than 20 years, with 12 years spent at Microsoft, where he’s a program manager for the speech platform. He has published numerous articles on natural language support in applications including topics such as complex script support, multilingual applications and speech recognition.

Thanks to the following technical experts for reviewing this article: Eduardo Billo, Rob Chambers, Gabriel Ghizila, Michael Kim and Brian Mouncer