Speech design guidelines for Windows Phone

11/23/2015

[ This article is for Windows Phone 8 developers. If you’re developing for Windows 10, see the latest documentation. ]

In Windows Phone 8, users can interact with your app using speech. There are three speech components that you can integrate with your app: voice commands, speech recognition, and text-to-speech (TTS).

Designed thoughtfully and implemented effectively, speech can be a robust and enjoyable way for people to interact with your Windows Phone app, complementing or even replacing interaction by touch, tap, and gestures.

Speech interaction design

Before you start coding a speech-enabled app, it's a good idea to envision and define the user experience and flow.

Using speech to enter your app

You can integrate voice commands into your app so users can deep link into it from outside the app. For example, you could add voice commands that access the most frequently used sections of your app or perform important tasks.

Speech interaction inside your app

From inside your app, users can speak to give input or accomplish tasks by using speech recognition. Your app can also use text-to-speech (TTS), also known as speech synthesis, to speak text to the user through the phone's speaker.

Consider the following questions to help define speech interaction after users have opened your app (perhaps by using a voice command):

What actions or app behavior can a user initiate using her voice, for example: navigate among pages, initiate commands, or enter data such as notes or messages?
What phrases will users expect to say to initiate each action or behavior?
How will users know when they can speak to the app and what they can say?
What processes or tasks may be quicker using speech rather than touch? For example, browsing large lists of options or navigating several menu levels or pages.
Will your app be used for speech recognition in the absence of network connectivity?
Does your app target specific user groups that may have custom vocabulary requirements, for example, specific business disciplines such as medicine or science, gamers, or specific geographic regions?

When you have made some decisions about the speech interaction experience, whether through voice commands, in-app speech, or both, you’ll be able to:

List the actions a user can take with your app.
Map each action to a command.
Assign one or more phrases that a user can speak to activate each command.
Write out the dialogue that the user and your app will engage in, if your app will also speak to users.

Implementing speech design

Voice commands

To enable voice commands, you'll need to define a list of recognizable phrases and map them to commands in a Voice Command Definition (VCD) file. For text-to-speech readout that accompanies voice commands, you specify the string in the VCD file that the speech synthesizer will speak to confirm the action being taken.
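
As a rough illustration of the phrase-to-command mapping described above, a VCD file might look like the following sketch. The app name (Contoso Notes), command names, phrases, and page targets are all hypothetical; only the overall element structure is meant to follow the VCD schema.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical VCD sketch: app name, commands, phrases, and targets are illustrative. -->
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.0">
  <CommandSet xml:lang="en-US">
    <!-- What the user says before a command, for example "Contoso Notes, new note" -->
    <CommandPrefix>Contoso Notes</CommandPrefix>
    <Example> new note </Example>

    <Command Name="NewNote">
      <Example> new note </Example>
      <!-- Square brackets mark optional words -->
      <ListenFor> [create] [a] new note </ListenFor>
      <ListenFor> take a note </ListenFor>
      <!-- Spoken by the speech synthesizer to confirm the action -->
      <Feedback> Creating a new note </Feedback>
      <Navigate Target="/NewNotePage.xaml" />
    </Command>

    <Command Name="ShowList">
      <Example> show my shopping list </Example>
      <!-- Curly braces reference a PhraseList by its Label -->
      <ListenFor> show [my] {listName} list </ListenFor>
      <Feedback> Showing your {listName} list </Feedback>
      <Navigate Target="/ListPage.xaml" />
    </Command>

    <PhraseList Label="listName">
      <Item> shopping </Item>
      <Item> to do </Item>
    </PhraseList>
  </CommandSet>
</VoiceCommands>
```

Each Command maps one or more ListenFor phrases to a page in the app, and the Feedback string is the text that is read out to confirm the action being taken.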

The built-in speech recognition experience in Windows Phone provides a UI that does the following:

Prompts a user for input.
Displays a readout of the audio level of speech input.
Confirms the phrase that was matched to a user's speech.
Informs a user that recognition was not successful and lets the user try again (repeatedly if necessary).
Helps the user choose from among multiple recognition possibilities (if they exist).

Because the built-in recognition experience in Windows Phone leverages the same interactive model used in global speech contexts on the phone, users are more likely to know when to start speaking, to have familiarity with the built-in sounds, to know when processing is complete, and to receive feedback on errors and help with disambiguation when there are multiple match possibilities. See Presenting prompts, confirmations, and disambiguation choices for Windows Phone 8 for more info.

Prompts and confirmations

Let users know what they can say to your app based on the current app context, and give users an example of an expected input phrase. Unless you want a user to be able to say anything at all, for example, when inputting short-message dictation, strive to make your prompt elicit as specific a response as possible. For example, if you prompt the user with "What do you want to do today?", the range of responses could vary widely, and it could take quite a large grammar to match the possible responses. However, if the prompt says "Would you like to play a game or listen to music?", then the prompt specifically requests one of two responses: "play a game" or "listen to music". A grammar that needs to match only those two responses would be simpler to author and would likely produce more accurate recognition than a much larger grammar.
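
To make the two-response case concrete, here is a minimal SRGS grammar sketch that matches only the two phrases the prompt asks for. The rule name and the semantic values ("game", "music") are hypothetical and chosen for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: rule id and semantic values are illustrative assumptions. -->
<grammar version="1.0" xml:lang="en-US" root="activity"
         tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="activity" scope="public">
    <one-of>
      <!-- Each alternative sets a short semantic value the app can branch on -->
      <item>play a game <tag>out = "game";</tag></item>
      <item>listen to music <tag>out = "music";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Because the grammar contains only two alternatives, the recognition engine has very little to search, which is the property the paragraph above recommends.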

Request confirmation from the user when speech recognition confidence is low. If the user's intent is unclear, it's usually better to prompt the user for clarification than for your app to initiate an action that the user didn't intend.

The built-in recognition experience in Windows Phone includes screens that you can customize with prompt text and an example of expected speech input, as well as screens that confirm speech input.

Plan for recognition failures

Plan for what to do if recognition fails. For example, if there is no input, the recognition quality is poor, or only part of a phrase is recognized, your app logic should handle those cases. Consider informing the user that your app didn't understand her, and that she can try again. Give the user another example of an expected input phrase and restart recognition when necessary to allow additional input. If there are multiple successive failed recognition attempts, consider letting the user either key in text or exit the recognition operation. The built-in UI recognition experience in Windows Phone includes screens that let a user know that recognition was not successful and allow the user to speak again to make another recognition attempt.

Listen for and try to correct issues in the audio input. The speech recognizer generates the AudioProblemOccurred event when it detects an issue in the audio input that may adversely affect speech recognition accuracy. When you register for the AudioProblemOccurred event, you can use the information returned with the event to inform the user about the issue, so she can take corrective action if possible. For example, if speech input is too quiet, you can prompt the user to speak louder. The speech recognition engine generates this event continuously whether or not you use the built-in speech recognition experience. For more info, see Handling issues with audio input for Windows Phone 8.

Grammars in Windows Phone

A grammar defines the set of phrases that a speech recognition engine can use to match speech input. You can either provide the speech recognition engine with the predefined grammars that are included with Windows Phone, or with custom grammars that you create. This section gives an overview of the types of grammars you can use in Windows Phone and provides tips for authoring SRGS grammars. Also see Grammars for Windows Phone 8 for more info about different grammar types and when to use them with Windows Phone.

Predefined grammars

Windows Phone supports two predefined grammars. To match a large number of phrases that a user might speak in a given language, consider using the predefined short-message dictation grammar. To match input that is in the context of a web query, consider using the predefined web search grammar. These predefined grammars can be used as-is to recognize up to 10 seconds of speech audio and require no authoring effort on your part, but because they are online grammars, they require a network connection at run time.

Creating custom grammars

If you author your own grammars, the list grammar format in Windows Phone works well for recognizing short, distinct phrases. These phrases can be programmatically updated and used for speech recognition when the app has no network connectivity.

For the greatest control over the speech recognition experience, author your own SRGS grammars, which are particularly powerful if you want to capture multiple semantic meanings in a single recognition. SRGS grammars can also be used for offline speech recognition.

Tips for authoring SRGS grammars

Keep grammars small. Grammars that contain fewer phrases to be matched tend to provide better recognition accuracy than larger grammars containing many phrases. It's generally preferable to have several smaller grammars for specific scenarios in your app than to have one grammar for your entire app. Prepare users for what to say in each app context and enable and disable grammars as needed so the speech recognition engine needs to search only a small body of phrases to match speech input for each recognition scenario.

Design your grammars to allow users to speak a command in a variety of ways, and to account for variations in the way people think and speak. For example in SRGS grammars, you can use the GARBAGE rule as follows:

Match speech input that your grammar does not define. This will allow users to speak additional words that have no meaning for your app, for example "give me", "and", "uh", "maybe", and so forth, yet still be successfully recognized for the words that matter to your app, which you have explicitly defined in your grammars.
Add the GARBAGE rule as an item in a list of alternatives to reduce the likelihood that words not defined in your grammar are recognized by mistake. Also, if the speech recognition engine matches unexpected speech input to a GARBAGE rule in a list of alternatives, you can detect the ellipsis (…) returned by the match to GARBAGE in the recognition result and prompt the user to speak again. However, using the GARBAGE rule in a list of alternatives may also increase the likelihood that speech input which matches phrases defined in your grammar is falsely rejected.

Use the GARBAGE rule with care and test to make sure that your grammar performs as you intend. See ruleref Element for more info.
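
The two uses of the GARBAGE rule described above can be sketched in SRGS as follows. The rule names and phrases are hypothetical; the special ruleref itself is part of the SRGS specification. This is an illustration to test and adapt, not a recommended grammar.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: rule ids and phrases are illustrative assumptions. -->
<grammar version="1.0" xml:lang="en-US" root="activityWithFiller"
         xmlns="http://www.w3.org/2001/06/grammar">

  <!-- Use 1: tolerate filler words before and after the phrase that matters,
       so "uh, maybe play a game please" can still match "play a game". -->
  <rule id="activityWithFiller" scope="public">
    <item repeat="0-1"><ruleref special="GARBAGE"/></item>
    <one-of>
      <item>play a game</item>
      <item>listen to music</item>
    </one-of>
    <item repeat="0-1"><ruleref special="GARBAGE"/></item>
  </rule>

  <!-- Use 2: GARBAGE as one alternative, so unexpected input matches it
       instead of being mistaken for a defined phrase. -->
  <rule id="activityOrReject" scope="public">
    <one-of>
      <item>play a game</item>
      <item>listen to music</item>
      <item><ruleref special="GARBAGE"/></item>
    </one-of>
  </rule>
</grammar>
```

When the second rule matches GARBAGE, the app can detect that in the recognition result and prompt the user to speak again, at the cost of a possibly higher false rejection rate for the defined phrases.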

Try using the sapi:subset element to help match speech input. The sapi:subset element is a Microsoft extension to the SRGS specification that can help match users' speech input to enabled grammars. Phrases that you define using sapi:subset can be matched by the speech recognition engine even if only a part of the phrase is given in the speech input. You can define the subset of the phrase that can be used for matching in one of four ways.
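
A minimal sketch of partial-phrase matching with sapi:subset might look like the following. The playlist names are hypothetical, and the namespace URI and the sapi:match value shown here are assumptions that should be verified against the subset Element reference before use.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: phrases are illustrative; the sapi namespace URI and the
     sapi:match value are assumptions to check against the reference. -->
<grammar version="1.0" xml:lang="en-US" root="playlist"
         xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  <rule id="playlist" scope="public">
    <item>play</item>
    <one-of>
      <!-- Intended to let a user say only part of the name,
           for example "play road trip" -->
      <item>
        <sapi:subset sapi:match="subsequence">summer road trip favorites</sapi:subset>
      </item>
      <item>
        <sapi:subset sapi:match="subsequence">quiet evening piano</sapi:subset>
      </item>
    </one-of>
  </rule>
</grammar>
```

This kind of matching is most useful for long names that users are unlikely to speak in full.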

Try to avoid defining phrases in your grammar that contain only one syllable. Recognition tends to be more accurate for phrases containing two or more syllables, although you should avoid defining phrases that are longer than they need to be.

When you define a list of alternative phrases, avoid using phrases that sound similar, which the speech recognition engine may confuse. For example, including similar sounding phrases such as "hello", "bellow", and "fellow" in a list of alternatives may result in poor recognition accuracy.

Preload large grammars to avoid a perceived lag in speech recognition when loading a large grammar.

Custom pronunciations

Consider providing custom pronunciations for specialized vocabulary. If your app contains unusual or fictional words, or words with uncommon pronunciations, you may be able to improve recognition performance for those words by defining custom pronunciations. Although the speech recognition engine is designed to generate pronunciations on the fly for words that are not defined in its dictionary, custom pronunciations may improve the accuracy of both speech recognition and text-to-speech (TTS). For a small number of words, or for words that are used infrequently, you can create custom pronunciations inline in SRGS grammars. See token Element for more info. For a larger number of words, or for words that are used frequently, you may want to create separate pronunciation lexicon documents. See About Lexicons and Phonetic Alphabets for more info.

Test speech recognition accuracy

Test speech recognition accuracy and the effectiveness of any custom GUI you provide to support speech recognition with your app, preferably with a target user group for your app. Testing the speech recognition accuracy of your app with target users is the best way to determine the effectiveness of the design and implementation of speech in your app. For example, if users are getting poor recognition results, are they trying to say something that your implementation isn’t listening for? One solution would be to modify the grammar to support what users expect to say, but another solution would be to change your app to let users know what they can say prior to the speech interaction. Test results can help you to discover ways to improve your grammars or the speech recognition flow of your app to enhance its effectiveness.

Text-to-speech

Also known as speech synthesis, text-to-speech (TTS) generates speech output from text or Speech Synthesis Markup Language (SSML) XML markup that you provide. The following are some suggestions for implementing text-to-speech (TTS) in your app.

Design prompts to be polite and encouraging.
Think about whether you want to use text-to-speech (TTS) to read back to the user large bodies of text. For example, users may be inclined to wait while TTS reads back a text message, but may get impatient or disoriented listening to a long list of search results that will be difficult to memorize.
Give users the option to stop text-to-speech (TTS) readout, particularly for longer readouts.
Consider giving users a choice between a male and a female voice for text-to-speech (TTS). Windows Phone provides both a male and a female voice for each supported locale.
Listen to the text-to-speech (TTS) readout before you ship an app. The speech synthesizer attempts to read out phrases in intelligible and natural ways; however, there may sometimes be an issue with either intelligibility or naturalness.
- Intelligibility is most important and reflects whether a native speaker can understand the word or phrase being spoken by text-to-speech (TTS). An intelligibility issue can arise when an infrequent pattern of language is strung together, or where part numbers or punctuation are a factor.
- Naturalness is desirable. A naturalness issue can arise when the prosody or cadence of a readout differs from how a native speaker would say the phrase.

Both kinds of issues can be addressed by using SSML instead of plain text as input to the synthesizer. For more info about SSML, see Use SSML to Control Synthesized Speech and Speech Synthesis Markup Language Reference.
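
As a hedged illustration of using SSML rather than plain text, the following sketch uses standard SSML 1.0 elements to substitute a spoken form for an abbreviation, insert a pause, and slow down part of a readout. The sentence content is hypothetical, and how each element is rendered depends on the installed voice.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: the text is illustrative; rendering depends on the voice. -->
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- sub: speak the alias instead of the written abbreviation (intelligibility) -->
  Your appointment with <sub alias="Doctor">Dr.</sub> Lee is confirmed.
  <!-- break: add a short pause between sentences (naturalness) -->
  <break time="300ms"/>
  <!-- prosody: slow the part the user needs to remember -->
  <prosody rate="slow">Please arrive ten minutes early.</prosody>
</speak>
```

Listening to the plain-text and SSML versions side by side is a quick way to judge whether the markup actually improves the readout.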