Best Practices in Designing Speech User Interfaces


Kathy Frostad
Voice Web Consulting

July 2003

Applies to:
    Microsoft® Speech Application Software Development Kit (SDK)
    Microsoft Visual Studio® .NET

Summary: Kathy Frostad, principal adviser for Voice Web Consulting, discusses best practices for designing good speech user interfaces and shows Web developers the dos and don'ts of speech UI development. (5 printed pages)


Lesson One: You Can Never Gather Enough Data
Lesson Two: Develop a Conceptual Design Blueprint of the Chosen Application and Validate It with Target Users
Lesson Three: Realize that Speech User Interface Design Is Both a Science and an Art
Lesson Four: Familiarize Yourself with Speech Terminology Relative to VUI Design
Lesson Five: Learn More About Speech


Since the late 90s, interactive voice response (IVR) developers have designed telephony speech recognition interfaces for enterprise call centers to replace agent interaction and layered touch-tone interfaces. Research on these deployed systems supports continued growth of voice user interface (VUI)—callers like it, accuracy is good, and ROIs are excellent.

What about voice enabling Web sites? How can Web developers take advantage of speech to easily build telephony-based voice site complements to Web sites and support multimodal interactions on any device? The wealth of information in Web sites hasn't reached the billions of telephone users, but not because the speech recognition technology won't support it. Until now, tools didn't exist to help Web developers do what IVR developers do today for call centers. Microsoft hopes to solve this dilemma by rolling out a Speech Application Language Tags (SALT)-based, enterprise speech application platform that includes speech and telephony servers as well as a Microsoft® Speech Application Software Development Kit (SDK) that integrates directly into Microsoft Visual Studio® .NET. If all goes well, this will be the trigger event for speech adoption. Web developers will have tools at their fingertips to let callers access information housed in Web sites. These Microsoft enablers promise the flexibility to create telephony-only voice sites, multimodal sites, or both by extending the existing languages, standards, and developer programming paradigms that define the Web.

This article builds upon the functionality of these enabling SALT tools and others to share best practices in designing speech user interfaces. Web design principles were created by trial and error, but today the dos and don'ts are more or less known and embodied in Web development tools. The same should be the goal for speech development tools, especially ones such as the Microsoft Speech Application SDK that address both telephony and multimodal environments, because many design principles used in speech are the same as those for graphical user interfaces.

Lesson One: You Can Never Gather Enough Data

The following information should be documented and owned by marketing/product management.

  • Business Goals/Implications
  • Organizational Goals/Implications
  • Caller Profile/User Research
  • Task Completion Goals/Implications

Voice Web Consulting helps companies gather this information and develop a corporate and/or departmental road map for speech. Some companies go on to create positions unique to speech—a "voice master" role similar to a Webmaster role.

Lesson Two: Develop a Conceptual Design Blueprint of the Chosen Application and Validate It with Target Users

User research and usability testing are critical to the design and development process and should be used at several stages.

Lesson Three: Realize that Speech User Interface Design Is Both a Science and an Art

The "science" ensures good speech recognition, and the "art" ensures compelling, engaging conversation with the caller.

Lesson Four: Familiarize Yourself with Speech Terminology Relative to VUI Design

(The following descriptions were summarized from various teachings of notable linguists including Jennifer Balogh, James Giangola and Blade Kotelly.)


Persona

The personality of the system is the character of the speech application, defined by voice talent, audio, prompt wording, and prosody. If an explicit character is not provided, industry research shows that callers develop perceptions of the system's personality (for example, helpful or bossy) regardless. Some callers even form a mental image of the person they hear. The dialogue should take its cues from human conversation rather than written prose, with a conscious effort to define the desired characteristics through text and prosody. Service and/or company branding should be incorporated into the personality as well. Test the personality with various caller groups and refine it as necessary. Also consider developing a fictional biography and a fictional face to go along with the persona.

Prompting Style (Open, Directed, Mixed Initiative)

Speech recognition technologies now make it possible to choose a prompting style that best suits the goals of the application and the needs and expectations of the callers. With each prompting style there are factors to consider: efficiency vs. clarity, recognition accuracy, expected caller profile, and usage patterns. Open-ended prompts allow callers to provide a complete request, as in the following example:

S: "Tell me about your travel plans."

U: "I'd like to fly from Seattle to Boston next Tuesday, in the evening."

If some callers find open-ended prompts daunting, then error recovery can prompt in a more directed fashion:

S: "Tell me about your travel plans."

U: "Uh, what?"

S: "I'd like to help you with your travel plans. Tell me what city you would like to depart from."

Mixed initiative is where the system allows (but doesn't mandate) the caller to decide how much information to provide in response to an open question such as "What are your travel plans?" Based on the information the caller provides, the system will take the initiative to ask for the missing information that is needed to complete the task at hand.
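The mixed-initiative turn-taking described above can be sketched as simple slot filling: the system asks an open question, keeps whatever slots the caller volunteers, then takes the initiative for whatever is still missing. This is an illustrative sketch, not SDK code; the slot names and question wording are assumptions for the travel example.

```python
# Mixed-initiative slot filling: the system only asks for what's missing.
REQUIRED_SLOTS = ["origin", "destination", "date"]

QUESTIONS = {
    "origin": "What city are you departing from?",
    "destination": "Where would you like to fly to?",
    "date": "What day would you like to travel?",
}

def next_prompt(filled):
    """Return the system's next question, or None when the task can proceed."""
    missing = [s for s in REQUIRED_SLOTS if s not in filled]
    if not missing:
        return None  # all information collected; complete the task
    return QUESTIONS[missing[0]]

# Caller volunteers two of the three slots in a single utterance...
filled = {"origin": "Seattle", "destination": "Boston"}
print(next_prompt(filled))  # system takes the initiative for the missing slot
```

In a real application the recognizer's semantic results would populate `filled`; the point is that the caller decides how much to say, and the dialog logic adapts.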

Voice Talent

Voice talent is a critical component in developing the personality of the speech application. Cultural beliefs influence callers' reactions to system voices because of the social values they attribute to them. The voice talent's ability to emulate the colloquialisms of the caller base is important to engaging the caller and establishing rapport (for example, "directory enquiries" in the United Kingdom versus "directory assistance" in the United States). Audio production involves either choosing a predefined persona (Bill, Mary, Logan) or coaching chosen voice talent (usually actors) to reflect desired character traits. Radio personalities are not recommended unless the designer likes them just as they are; they are not used to being coached into different roles.

UI Metaphors

Try to find a conceptual picture that readily comes to mind for the application. We know that on the Web, UI metaphors help users learn to navigate and use a system because they make its capabilities easier to conceptualize and recall. UI metaphors can be explicit or implicit, and VUI examples include department stores, shopping malls, books, and TV channels. Research done by the Center for Communication Interface Research (CCIR) found that the more familiar callers were with the type of person the voice represented, the more successful they were in completing the task.

The general atmosphere or mood of the system is part of the metaphor and is created by the use of audio sounds/music and voice.

Design the application with universal commands and a consistent approach to tasks that supports the UI metaphor. Research conducted by CCIR found implicit help (where the system offered help upon hearing silence or user error) was favored over explicit help (where the caller must ask for assistance) for a U.K. sample group. Take care not to front load directions at the beginning because callers won't remember them. If there are a large amount of directions for a custom application (in this instance, field service automation), consider incorporating an audio tutorial that all callers must listen to the first time and then afterward can reference as needed. Provide shortcuts (explicit and implicit) to allow frequent users to bypass new user instructions and get to the steps they want. Allow barge-in support except during confirmation, error recovery and agent handoff. Prompting should include landmarks to aid callers' understanding of where they are in the dialogue and to process lists (in this case, 3 of 10).
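Two of the ideas above, universal commands and positional landmarks, can be sketched in a few lines. This is an illustrative sketch only; the command set and prompt wording are assumptions, not SDK behavior.

```python
# Universal commands are checked on every turn, before the task grammar;
# landmark phrasing ("3 of 10") tells callers where they are in a list.
UNIVERSAL = {"help", "repeat", "main menu", "operator"}

def route(utterance, handle_task):
    """Universal commands work anywhere in the dialog; all else goes to the task."""
    text = utterance.strip().lower()
    if text in UNIVERSAL:
        return f"universal:{text}"
    return handle_task(text)

def read_item(index, total, item):
    """Prepend a positional landmark so the caller can track list progress."""
    return f"Item {index} of {total}: {item}"

print(route("Help", lambda t: f"task:{t}"))
print(read_item(3, 10, "Seattle to Boston, 6 P.M."))
```

Keeping the universal check in one place is what makes the commands feel consistent across the whole application.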

Recorded Audio and Synthesized Speech

Recorded audio is still the most desired output, but synthesized speech has advanced recently (concatenation, prosody) to the point that it is an effective substitute where recording speech would be cost prohibitive. Both can and should be used together, but the designer should take care to provide a transition marker (for example, an explicit reference to a database lookup) to let the caller know the system is switching from recorded voice to synthesized speech.
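The recorded-plus-synthesized pattern amounts to queuing segments in order, with the transition marker recorded along with the other static prompts. A minimal sketch, in which the filenames and marker wording are assumptions for illustration:

```python
# Assemble an output queue that mixes recorded prompts with dynamic TTS,
# inserting a recorded transition marker before the synthesized segment.
def build_output(balance_dollars):
    """Static audio is prerecorded; only the dynamic value is synthesized."""
    return [
        ("recorded", "your_balance_is.wav"),
        ("recorded", "from_our_records.wav"),  # transition marker: cues the voice change
        ("tts", f"{balance_dollars} dollars"),
    ]

for kind, content in build_output(142):
    print(kind, content)
```

The marker segment gives the caller a moment to adjust before the voice quality changes, which is exactly the cue the paragraph above recommends.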


Personalization

As with the Web, speech users are more apt to return to a system if it can be personalized. Callers perceive a personalized system to be faster and easier to use, especially if they use it frequently in a particular manner. Personalization can also simply offer choice (for example, two system personalities to choose from). User profiles can be established and updated via the Web and/or the VUI.


Recognition Accuracy

Good accuracy (low to mid-90s) is attainable even in mobile environments and has become an implicit expectation from the caller's perspective. Investigate poor accuracy when you see it: do the grammars and prompts need updating, or the confidence thresholds adjusting? If you can't isolate the problem quickly, get professional help.

Error Handling

Errors will occur in system conversations, just as they do in human conversations. Design prompts to be polite, and never blame the caller for saying something that isn't in the grammar. Part of the design process should be devoted to uncovering all possible error scenarios so that a proactive strategy can be put in place. Just as the way a department store handles product returns can influence your perception of the store, the way a speech system handles errors influences callers' perceptions of the system. Set boundaries and expectations in error handling so the caller can help recover from errors by understanding what is allowed. If a caller continually has the same problem in an open-ended dialogue, design the error-handling prompts with more hand-holding. Use N-best lists to prevent the system from repeating the same error.
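Two of these tactics, escalating ("tapered") error prompts and using the N-best list to avoid re-offering a rejected hypothesis, can be sketched as follows. The prompt wording and city names are assumptions for illustration.

```python
# Escalating error prompts: each failed attempt gets more hand-holding,
# ending in an agent handoff rather than an endless loop.
ERROR_PROMPTS = [
    "Sorry, I didn't catch that. Which city?",
    "Let's try again. Please say just the city name, like 'Boston'.",
    "I'm still having trouble. Transferring you to an agent.",
]

def error_prompt(attempt):
    """Pick the prompt for this attempt, capped at the final handoff."""
    return ERROR_PROMPTS[min(attempt, len(ERROR_PROMPTS) - 1)]

def next_hypothesis(n_best, rejected):
    """Offer the best recognition hypothesis the caller hasn't already rejected."""
    for candidate in n_best:
        if candidate not in rejected:
            return candidate
    return None  # nothing left; fall back to reprompting

print(error_prompt(0))
print(next_hypothesis(["Austin", "Boston"], rejected={"Austin"}))
```

Walking the N-best list is what keeps the system from confirming "Austin" twice after the caller has already said no to it.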


Confirmation

Confirmation techniques help the system verify what the caller wants when it is not sure what was said or when the caller says something that seems out of context. Confidence thresholds can be used to determine whether explicit or implicit confirmation is required.
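A common way to apply confidence thresholds is a three-band policy: accept with implicit confirmation at high confidence, confirm explicitly in the middle band, and reject and reprompt at low confidence. The threshold values and prompt wording below are assumptions for illustration, not SDK defaults.

```python
# Map a recognizer confidence score to a confirmation strategy.
HIGH, LOW = 0.85, 0.45  # illustrative thresholds; tune against real call data

def confirmation(value, confidence):
    """Choose implicit, explicit, or no confirmation based on confidence."""
    if confidence >= HIGH:
        # Implicit: embed the value in the next prompt and move on.
        return f"Implicit: 'Okay, {value}. And what day?'"
    if confidence >= LOW:
        # Explicit: stop and ask a yes/no question before proceeding.
        return f"Explicit: 'Did you say {value}?'"
    # Below the floor: don't guess; reprompt instead.
    return "Reject: 'Sorry, which city was that?'"

print(confirmation("Boston", 0.92))
print(confirmation("Boston", 0.60))
print(confirmation("Boston", 0.30))
```

Implicit confirmation keeps the dialogue moving for confident results, while the explicit band catches the out-of-context cases the paragraph above describes.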


Advertising

It may be appropriate to produce audio ads for future application capabilities or other related products the company has for sale. Extra income can be earned by selling sponsorships for the entire application or for specific subfeatures (for instance, the weather report of a voice portal). Audio ads can also cover for system latency when databases are accessed or information is being processed.

Multimodal Dialogs

Multimodal holds great promise for the end user because it combines the power of multiple UIs to deliver richer, fuller functionality. Universal desktop paradigms are extending into the mobile world with devices such as the Pocket PC, and voice will support this trend by allowing expressive input. Multimodal capability also allows callers to use discretion in situations where speaking is inappropriate (such as a meeting or a hotel lobby) and to have privacy when providing or accessing personal information (such as credit card numbers or dating prospects). When designing applications with multimodal capability, map out which inputs and outputs in the application are best suited to which interface (voice vs. text/graphics) given likely usage scenarios. Determine the degree of modality coordination (serial or simultaneous) needed for each application function or step. Define help and error-handling strategies that are consistent across modalities. For more information on multimodal interfaces, read Flash Design for Mobile Devices, Chapter 13: Multimodal Interfaces by Lewis Leiboh and Ilya Kaplun.
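The input/output mapping exercise described above can be captured as a simple table that the application consults per field. Which field gets which mode here is an assumption for a hypothetical flight-search form, not guidance from the SDK.

```python
# A per-field modality map for a hypothetical multimodal flight-search form.
MODALITY_MAP = {
    "destination": {"input": "voice",  "output": "graphics"},  # easy to say, best shown as a list
    "travel_date": {"input": "either", "output": "graphics"},  # spoken date or calendar widget
    "credit_card": {"input": "keypad", "output": "graphics"},  # privacy: avoid speaking it aloud
}

def preferred_input(field, in_meeting=False):
    """Fall back to a non-voice mode when speaking aloud is inappropriate."""
    mode = MODALITY_MAP[field]["input"]
    if in_meeting and mode == "voice":
        return "graphics"
    return mode

print(preferred_input("destination"))
print(preferred_input("destination", in_meeting=True))
```

Centralizing the map makes it easy to keep help and error handling consistent: every part of the application consults the same table instead of hard-coding a mode.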

Lesson Five: Learn More About Speech

The following tradeshows and conferences focus on speech user interfaces. Most include developer workshops as part of their programs.

  • SpeechTEK
  • VOX
  • Telephony Voice User Interface Conference
  • Avios
  • InterSpeech

Speech applications are a versatile communication medium capable of fantastic things, but only if they are designed correctly. Web developers should leverage best practices and standards-based development environments to ensure that applications developed today are future-proof as we head into the converged, multimodal world.