Voice Design Best Practices

When designing your skill, consider the following recommendations:

  • Design for the most common scenario. Keep the Natural Language (NL) task simple and serve the most common use case.
  • Tasks should feel quick and easy. Ask only for the necessary information to complete the task. Keep questions simple and direct.
  • Keep tasks glanceable. Cortana must allow users to multi-task with minimal visual attention. Consider every piece of information on the display and eliminate anything that is not required.
  • Help guide the user's attention to the current focus of the conversation. The visual UX should be peripheral. Highlight the active facet to guide the user's eye.
  • Clearly communicate forward progress during task completion. Let the user see what has been understood and make corrections as necessary.

If you are creating a Bot Framework-based skill, it is common to use Microsoft's Language Understanding Intelligent Service (LUIS.ai). LUIS is designed to enable developers to build smart applications that can understand human language and react to user requests accordingly.

Defining your Skill's Target Scenarios

Write down your skill's usage scenarios. Avoid complexity as much as possible. Use default values when the user is not specific. For example, if the user says, "Make my room warmer," Cortana says, "I’ve raised your room temperature to 72 degrees" instead of "Sure, to what temperature?"
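The default-value idea can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical thermostat skill: the function name `warm_room`, the 2-degree default step, and the way the current temperature is supplied are all assumptions, not part of any Cortana or Bot Framework API.

```python
from typing import Optional

# Assumed default: how far to raise the temperature when the user gives no target.
DEFAULT_STEP_DEGREES = 2


def warm_room(current_temp: int, requested_temp: Optional[int] = None) -> str:
    """Raise the temperature, using a default step instead of asking a follow-up question."""
    if requested_temp is None:
        target = current_temp + DEFAULT_STEP_DEGREES
    else:
        target = requested_temp
    return f"I've raised your room temperature to {target} degrees."
```

Calling `warm_room(70)` answers directly with the default applied, while `warm_room(70, 75)` honors an explicit request.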

Consider the following questions:

  • What customer problems do you want to solve?
  • What are the user queries for this skill?
  • What action is triggered (by your backend service)?
  • What should the skill reply to the user?
  • What kind of confirmation will the skill use?
  • Does the skill need directed dialog? See Designing a Prompt.

Designing a dialog

After the user invokes a skill, Cortana establishes a dialog with the user. The dialog consists of:

  • A prompt from Cortana or an utterance from the user.
  • Then a response, possibly with a confirmation if you designed one.
  • And so on, until the conversation concludes.

Before you design the parts of a dialog, let's see how they interact. A skill consists of an intent and one or more entities. Once the intent and entities have been created, you design the interaction of the user and the skill.

  1. To initiate the skill, the user gives an invocation phrase, which you design.
  2. Then a dialog between the user and Cortana begins.

This dialog can take several forms, which differ slightly between headless (speaker-only) and screen (canvas) devices. In general, the dialog can vary as follows:

  • The user can include an utterance as part of the invocation phrase. For example, a user might invoke a skill called "My Sports Genie" using the invocation phrase "Hey Cortana, ask My Sports Genie who won the game last night?" Cortana would then respond with an answer or prompt for more information. For example, "Pittsburgh." or "Which game?"
  • The user launches or opens a skill using just its invocation name: "Hey Cortana, open My Sports Genie." Cortana prompts the user with "How can I help you?" The user would then give an utterance like, "Who won the game?"
  • The user could combine the invocation with an utterance like, "Hey Cortana, Ask My Travel Agent to book me a flight to New York." Cortana would combine a confirmation with a prompt like, "What departure time would you like for your flight to New York?"
  • The user says an utterance as a query like, "Hey Cortana, Tell My Travel Agent I want to fly to New York at 6:00 pm." Cortana could reply with "You want a flight to New York at 6:00 pm, right?" The user says "Yes." Cortana then prompts by saying, "What airlines?"
  • The user makes a request: "Hey Cortana, tell My Travel Agent I want a flight to Seattle." Cortana uses a short time-out confirmation like, "Seattle." The user does not respond. Cortana then prompts, "What time?"

Have a good follow up

The nature of voice requires more attention to detail around potential dead-ends and failures. Consider what you want to happen when:

  • The task gets it right. What should happen next?
  • The task doesn't quite do it all. What can the user do?
  • The task gets it wrong. How does the user recover?
  • Something goes wrong. Have you planned a default or recovery response? Always have a backup plan.

Voice User Interface Design

When creating a voice-driven application, you need to consider a set of factors that contribute to a successful experience. These factors are discussed below.

Voice Driven Application page

First, think about what you want to accomplish with a voice-driven application and what the customer's experience will be. Define the mapping between what the user says and the action that is triggered. We recommend basing the design on a headless environment and treating any pictures or display text as an embellishment.

Depending on the environment to which you will deploy, you may have to think about how potential haptic interaction with the feature will affect your design. Keep in mind that if you voice-enable an existing text-based environment, the voice path should not simply follow the text path, because the point of using voice is to reach a given result faster.

VUI Flow

When you have decided on the feature you want to create, generating a Voice User Interface (VUI) flow is the next crucial step. See VUI flow example below.

A VUI flow should be designed first with the end-user interaction in mind. However, it also needs to take into consideration the technical capability of the voice engine.

[Figure: VUI flow example]

In this example, we used an approach where the user interactions are located on top and bottom of the Cortana responses and flow logic.

Wherever you have a user interaction, this is considered an Intent (see the entries marked in red in the drawing above). Intents will be used throughout the tools used to create the skill.

Wherever you are passing parameters to the intent, this is considered an Entity (see the entries marked in black in the drawing above). Entities will be used throughout the tools used to create the skill.

In the Cortana canvas, you can display pictures and text. This is supported by several of the Bot Framework cards in the code logic. See Card Design Best Practices for more information.

VUI Flow Considerations

  • Anything viewable by the user, whether it be buttons or other objects displayed elsewhere should be recognized by voice, e.g., volume button, channel selection, etc.
  • Anything the user can do with a button should also be possible by voice.
  • Display Text and other attributes of a card should be a summary of what Cortana is speaking aloud.

While working on the flow, try to determine what all the user's intents and associated entities will be.

Designing an Intent

There are three types of intents:

  • Full intent
  • Partial intent
  • No intent

The intent type affects the way you design the user's utterance.

Full Intent

A full intent is when a user fully expresses what she/he wants to do in a single spoken utterance. As there are many ways to say the same thing, you need to provide a set of example utterances for the main intent that covers as many command variations as possible.

For example:

User: Hey Cortana, ask Mileage Wizard if I have miles to travel.
Mileage Wizard: You currently have 30,000 miles. This is enough to travel domestically but not internationally. Would you like to purchase additional miles or to book a domestic trip?

This is the most common type of interaction as the user wants the request to be fulfilled as soon as possible. When the user provides a complete request in their first utterance, you should respond directly to their request and either propose further interaction if required or end the conversation.

If you provide interaction examples in a card (displayed in the Cortana canvas) try to show full intent examples. This will help train the users on the most effective way to communicate with your application.

For more information about cards, see Card Design Best Practices.

Partial Intent

A partial intent is when a user expresses partially what she/he wants and the utterance is missing a required entity. Your skill should detect the missing element and automatically provide a follow-up conversation focused on getting the missing entity.

For example:

User: Hey Cortana, ask Mileage Wizard if I have miles.
Mileage Wizard: Miles to travel?

No Intent

A no intent occurs when a user tries a skill for the first time and gives only minimal information, which is not sufficient to engage in the conversation. When this happens, you need to tell the user what options she/he has for interacting with your skill.

It is critical that the skill consider the first-time user interaction and help her/him to get started. Our recommendation is to present a list of 3 potential options to choose from. If you present more than 3 options, the user might be overwhelmed and frustrated which will lead to a rejection of the voice experience.

For example:

User: Hey Cortana, ask Mileage Wizard.
Mileage Wizard: Do you want available miles, used miles, or discounts?

In LUIS, all language models have a predefined intent, "None". You should teach it to recognize user statements that are irrelevant to the app, for example if a user says, "Get me a great cookie recipe" in a TravelAgent app.
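The three intent types described above can be told apart by checking which required entities arrived with the request. The sketch below is illustrative only: it assumes the language-understanding step has already produced an intent name (or nothing) and a dictionary of entities, and the `CheckMiles` intent and `purpose` entity are hypothetical names, not LUIS output.

```python
from typing import Dict, Optional

# Hypothetical required entities for a hypothetical "CheckMiles" intent.
REQUIRED_ENTITIES = ("purpose",)


def classify_request(intent: Optional[str], entities: Dict[str, str]) -> str:
    """Return 'none', 'partial', or 'full' for a recognized request."""
    if intent is None:
        return "none"
    missing = [name for name in REQUIRED_ENTITIES if name not in entities]
    return "partial" if missing else "full"


def respond(intent: Optional[str], entities: Dict[str, str]) -> str:
    """Pick a reply that matches the intent type."""
    kind = classify_request(intent, entities)
    if kind == "none":
        # First-time user: offer no more than three options.
        return "Do you want available miles, used miles, or discounts?"
    if kind == "partial":
        # Ask only for the missing entity.
        return "Miles to travel?"
    return "You currently have 30,000 miles."
```

For example, `respond("CheckMiles", {})` asks for the missing entity, while `respond("CheckMiles", {"purpose": "travel"})` answers the request directly.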

Designing an Entity

Since entities describe information relevant to the intent, and sometimes are essential for the app to perform its task, design with care. Some labeled data contains interesting items to pass to targets at runtime. Since entities are specified in utterances when training your intent model, consider the following suggestions:

  • What will the task or app do to complete the utterance?
  • Are permissions required?
  • Will the tasks available to you be able to support the entities?
  • Be careful about revealing private information.

If you use LUIS.ai for your skill's language understanding, several built-in entities are available.

Designing a Prompt

When designing prompts, use these recommendations:

  • Be conversational. Use conversational writing. Write how people speak. Don’t emphasize grammatical accuracy over sounding natural. For example, ear-friendly verbal shortcuts like "wanna" or "gotta" are fine for text-to-speech (TTS) read out.
    • Use the implied first person where possible and natural. For example, "Looking for your next Adventure Works trip" implies that someone is doing the looking, but does not use the word "I" to specify.
    • Use contractions in responses for more natural interactions and additional space saving on the Cortana canvas. For example, "I can’t find that movie" instead of "I was unable to find that movie". Write for the ear, not the eye.
  • Use variations. Use some variation to help make the app sound more natural. Provide different versions of TTS and GUI (graphical user interface i.e. Cortana's canvas) strings to effectively say the same thing. For example, "What movie do you wanna see?" could have an alternative like "What movie would you like to watch?". People don’t say the same thing the exact same way every time. Just make sure to keep TTS and GUI versions in sync.
  • Use phrases like "OK" and "Alright" in responses with good judgement. While they can provide acknowledgment and a sense of progress, they can also get repetitive if used too often and without variation.
    • Use acknowledgment phrases in TTS only. Due to the limited space on the Cortana canvas, don't repeat them in the corresponding GUI strings.
  • Use language that the system understands. Users tend to repeat the terms they are presented with. Know what you display.
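One way to vary prompts while keeping the spoken and displayed strings in sync is to store each variation as a TTS/GUI pair and always pick the pair as a unit. The following is a minimal sketch under that assumption; the GUI wording for the first variant and the helper names are illustrative, not taken from any Cortana API.

```python
import random
from typing import List, Optional, Tuple

# Each variant pairs a spoken (TTS) string with its on-screen (GUI) string.
PromptPair = Tuple[str, str]

MOVIE_PROMPTS: List[PromptPair] = [
    ("What movie do you wanna see?", "What movie do you want to see?"),
    ("What movie would you like to watch?", "What movie would you like to watch?"),
]


def pick_prompt(variants: List[PromptPair], rng: Optional[random.Random] = None) -> PromptPair:
    """Choose one variant; returning the pair keeps TTS and GUI in sync."""
    chooser = rng if rng is not None else random
    return chooser.choice(variants)


def add_acknowledgment(pair: PromptPair, ack: str = "OK.") -> PromptPair:
    """Prefix the acknowledgment to the TTS string only, leaving the GUI string unchanged."""
    tts, gui = pair
    return f"{ack} {tts}", gui
```

Passing a seeded `random.Random` makes the choice reproducible in tests, while production code can rely on the default.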

Successful Cortana interactions require you to follow some fundamental principles when crafting TTS and GUI strings. Here are some examples:

Principle: Use as few words as possible and put the most important information up front.
Bad: Sure can do, what movie would you like to search for today? We have a large collection.
Good: Sure, what movie are you looking for?

Principle: Provide information pertinent only to the task, content, and context.
Bad: I’ve added this to your playlist. Just so you know, your battery is getting low.
Good: I’ve added this to your playlist.

Principle: Avoid ambiguity. Use everyday language instead of technical jargon.
Bad: No results for query "Trips to Las Vegas".
Good: I couldn’t find any trips to Las Vegas.

Principle: Be as accurate as possible. Be transparent about what’s going on in the background. If a task hasn’t finished yet, don’t say that it has. Respect privacy, don’t read private information aloud.
Bad: I couldn’t find that movie, it must not have been released yet.
Good: I couldn’t find that movie in our catalogue.

Prompt Design Questions

  • What are the most essential questions to complete the task?
  • What information is absolutely required and what can be optional?
  • Is there anything we can infer? Remember from last time?
  • Is there an inherent or typical order?
  • Is there anything ambiguous to clarify?
  • Should we confirm before taking the action?
    • If so, what level of confirmation should we use?

Creating a Prompt

There are two contrasting types of prompts:

  • Directed prompts
  • Open prompts

A directed prompt lists specific choices for users, such as, "Please select cheese, pepperoni, or sausage."

An open prompt allows users to speak their own answers, such as, "Which movie would you like?". If users are familiar with the choices, perhaps through frequent use of the application, open prompts are fine.

When choosing between directed and open prompts, also consider the number of options presented to the user and how likely it is that the options will change.

Good uses of a directed prompt are:

  • A wide variety of users use the application, or they use it on an infrequent basis. For example, a call center application is best suited to use directed prompts.
  • There will never be more than three options. A directed prompt minimizes user confusion.

For directed prompts, choose a prompt that follows the form "Please select X, Y, or Z" rather than one that follows the form "Would you like X or do you want Y or Z." The second prompt invites a Yes or No response from the user after each option. Encourage the user to select one of the options by beginning the prompt with the phrase, "Please select."
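The "Please select X, Y, or Z" form can be generated from a list of options. The sketch below is one possible helper, not a Cortana API: `build_prompt` and its fallback argument are hypothetical, and the three-option limit follows the recommendation above. When the list is too long for a directed prompt, it returns the caller's open prompt instead.

```python
from typing import Sequence

MAX_DIRECTED_OPTIONS = 3  # recommended upper bound for a directed prompt


def build_prompt(options: Sequence[str], open_prompt: str) -> str:
    """Return a directed prompt for two or three options, otherwise the open prompt."""
    if not 2 <= len(options) <= MAX_DIRECTED_OPTIONS:
        return open_prompt
    if len(options) == 2:
        choices = f"{options[0]} or {options[1]}"
    else:
        choices = f"{', '.join(options[:-1])}, or {options[-1]}"
    # Starting with "Please select" avoids inviting a yes/no answer per option.
    return f"Please select {choices}."
```

With `["cheese", "pepperoni", "sausage"]` this yields the directed form; with a long or variable list it falls back to an open prompt.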

If the list of options is either long (a list of stock investments, for example) or variable (movie titles, for example), adding a list of choices to the prompt may be impractical. In this case, use an open prompt. For example:

Cortana: Please select a stock name.
User: Help.
Cortana: Please select a stock name. For example, Microsoft please.

For help messages that provide example input, use a different voice to speak the portion of the example that the system expects the user to say (in this example, "Microsoft please"). This technique reinforces the expected form of the answer to the user.

Designing a confirmation

A confirmation is an acknowledgement that the system has heard a user's response. For example:

Cortana: Where do you want to fly to?
User: Paris
Cortana: On which day would you like to leave for Paris?

Think about where in the dialog flow users need confirmations. Recognizing speech from a telephone is not perfect, particularly under noisy conditions. In addition, when skills are used in a headless scenario (i.e. speaker) you only have one channel of communication with the user. An effective confirmation and correction strategy alleviates these issues.

A good voice based skill uses a variety of techniques for confirmation and correction. The techniques depend on the style of the skill, the importance of the action being performed, the cost of misunderstanding, and the need for a natural dialog.

For example, a dialog that follows each question with a confirmation of the form "Did you say X?" is slow and potentially very frustrating. Conversely, a dialog that employs no confirmation and, based on a misrecognized command, deletes data without first checking with the user, is equally frustrating. A developer must strike a balance between efficient interaction with the application and protection from wasted time or lost data.

In many cases, the cost of misrecognition is so low as to warrant no confirmation and correction at all. In other cases, explicit confirmation is always required, regardless of the application's confidence in the user's utterance.
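The trade-off between efficiency and protection can be expressed as a small decision function. This sketch is only an illustration: the cost labels, the 0.8 confidence threshold, and the rule for medium-cost actions are assumptions chosen for the example, and a real skill would tune them to its own scenarios.

```python
from enum import Enum


class Confirmation(Enum):
    NONE = "none"
    IMPLICIT = "implicit"
    EXPLICIT = "explicit"


# Assumed threshold for trusting the recognizer on medium-cost actions.
CONFIDENCE_THRESHOLD = 0.8


def choose_confirmation(cost: str, confidence: float) -> Confirmation:
    """Pick a confirmation style from the cost of a misunderstanding ('low', 'medium', 'high')."""
    if cost == "high":
        # Always confirm explicitly, regardless of recognizer confidence.
        return Confirmation.EXPLICIT
    if cost == "low":
        # A mistake is cheap to undo, so skip confirmation entirely.
        return Confirmation.NONE
    # Medium cost (assumed rule): confirm lightly when confident, explicitly otherwise.
    if confidence >= CONFIDENCE_THRESHOLD:
        return Confirmation.IMPLICIT
    return Confirmation.EXPLICIT
```

Deleting data would be a "high" cost action here, so it is confirmed explicitly even at very high confidence.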

There are three distinctly different styles of confirmation.

Explicit Confirmation

Explicit confirmation is the most basic form of confirmation. Of the three styles of confirmation, explicit confirmation takes the most user time, because it introduces an extra prompt to explicitly confirm information that the user has previously provided. Use explicit confirmation for situations in which the cost of a misunderstanding is high. For example, in a flight booking application, the application must understand the cities between which the user wishes to fly. Explicit confirmation results in a dialog interaction of the following form.

Cortana: Where are you flying from?
User: Seattle
Cortana: Did you say Seattle?
User: Yes
Cortana: On what date are you flying?

Implicit Confirmation

Implicit confirmation combines the confirmation question with the next information retrieval question into a single prompt. This method uses fewer prompts than explicit confirmation. Consider a flight booking scenario where the application obtains the city that the user is flying from, followed by the date. Implicit confirmation results in a dialog interaction of the following form.

Cortana: Where are you flying from?
User: Seattle
Cortana: Flying from Seattle. On what date?

If the user answers this question with a date, then the answer implies that Seattle is correct, thereby confirming selection of Seattle as the departure city. The grammar for implicit confirmation interaction is subtly different from the grammar for explicit confirmation. The grammar for implicit confirmation combines acceptance or denial of the previous prompt (in this case, the city) with supply of information for the next prompt.

Cortana: Flying from Seattle. On what date?
User: No
Cortana: Where are you flying from?
User: Vancouver
Cortana: Flying from Boston. On what date?
User: No, Vancouver
Cortana: Flying from Vancouver. On what date?

Answering with a simple yes does not answer this kind of question completely.
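The combined grammar can be approximated with simple string handling: a reply either denies the previous value (optionally supplying a correction) or implicitly accepts it by answering the new question. The function below is a simplified sketch with a hypothetical name; a real skill would rely on its speech grammar or language-understanding model rather than string prefixes.

```python
from typing import Optional, Tuple


def parse_implicit_reply(reply: str) -> Tuple[bool, Optional[str]]:
    """Return (previous_value_accepted, payload).

    payload is the correction after a denial, the answer to the new question
    after an implicit acceptance, or None when neither was supplied.
    """
    text = reply.strip().rstrip(".!?")
    lowered = text.lower()
    if lowered == "yes":
        # Accepts the previous value but does not answer the new question.
        return True, None
    if lowered == "no" or lowered.startswith(("no,", "no ")):
        correction = text[2:].lstrip(" ,")
        return False, correction or None
    # Any other reply is treated as the answer to the new question.
    return True, text
```

A reply of "No, Vancouver" returns the correction so the skill can re-confirm it, while a bare "No" signals that the skill should ask the original question again.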

Short Time-out Confirmation

For the short time-out confirmation method, the confirmation question is an echo of the informational item, either as a statement or a question. Aside from the length of the prompt, this scenario also differs from the explicit confirmation scenario in two ways.

  • The short time-out confirmation method interprets silence as acceptance.
  • The silence time-out is shorter.

The silence time-out is the period of time that the skill waits for the user to speak. In the short time-out confirmation method, it is shorter than the typical time-out used in the explicit confirmation method.

The skill does not expect a response. Instead, the skill makes a statement of its understanding to the user and invites a correction. Assuming that the system is correct most of the time, the dialog flow moves quickly and smoothly in the short time-out confirmation method.

Cortana: Which city do you want to fly to?
User: Seattle.
Cortana: Seattle.
User: (silence)
Cortana: At what time do you want to fly?

Because the user did not correct the system when it repeated the value, the application accepts the value Seattle.

If the user triggers a response after a confirmation for any reason (mumbling, asking for Help, or saying Repeat), then it is a good idea to revert to an explicit confirmation state, as in this example.

Cortana: Which city do you want to fly to?
User: Seattle.
Cortana: Seattle.
User: ~Mumble~
Cortana: Am I right with Seattle?

In this example, the application does not accept silence as a Yes, and the silence time-out returns to its original value. From this point forward, the skill reverts to an explicit confirmation state.
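The short time-out behavior can be sketched as a single decision taken after the skill echoes a value. The time-out values, state names, and function name below are assumptions for illustration; the sketch also treats every non-silent reply the same way, which a real skill may want to refine.

```python
from typing import Optional, Tuple

# Assumed time-outs in seconds; the short one must be the smaller of the two.
EXPLICIT_TIMEOUT_S = 5.0
SHORT_TIMEOUT_S = 1.5


def after_echo(value: str, user_reply: Optional[str]) -> Tuple[str, Optional[str]]:
    """Decide what happens after the skill echoes `value`.

    user_reply is None when the short time-out elapsed in silence.
    Returns (next_state, next_prompt).
    """
    if user_reply is None:
        # Silence counts as acceptance, so the dialog moves on.
        return "accepted", None
    # Any response (mumble, Help, Repeat) reverts to explicit confirmation.
    return "explicit", f"Am I right with {value}?"


def timeout_for(state: str) -> float:
    """Silence no longer means Yes once the skill is in the explicit state."""
    return EXPLICIT_TIMEOUT_S if state == "explicit" else SHORT_TIMEOUT_S
```

After `after_echo("Seattle", "~Mumble~")` the skill is in the explicit state and waits for the longer time-out before acting.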

Getting information from the User

What to ask?

Before asking users any question, make sure that you understand perfectly how your skill works and the context you are in. Then take the position of a user who has no knowledge of your skill. You will have to make smart assumptions about the environment of your skill at any given time and take into consideration its prerequisites and constraints, if any, in order to avoid unnecessary questions. Asking non-essential questions adds friction to the user's experience and makes the experience seem less thoughtful.

As voice design is mostly focused on headless environments, you will need to ask one question at a time if the information cannot be assumed from the previous dialogs.

Cortana: Where would you like to go?
User: Seattle
Cortana: When would you like to go?

When creating a voice conversation, your design needs to make sure that the user clearly understands what you are asking and that a response is expected. Just presenting the options is not sufficient. An option list always needs to be followed by asking the users a question so they know that they are expected to say something.

Cortana: The shirt comes in three colors: red, blue, and green. Which color would you like?

Presenting options

When presenting options to the user, utterances should be written in a way that makes it clear to the user that it is an either/or question. The list of options presented should be limited to three entries, and all options should be clearly stated and not contain repetitive words.

Cortana: What type of directions would you like, walking or driving?

Simplicity is key

It is important to keep your utterances simple. When listening to speech output, a user isn't able to quickly scan and skip content as they can in a visual interface. Responses to questions need to be quick for efficiency. Utterances should be short and concise, while still being clear.

Cortana: The shirt comes in three colors: red, blue, and green. Which color would you like?
User: Red
Cortana: What size would you like: small, medium, or large?
User: Medium

Think of Help responses

When users first start using your skill, they may not know what types of utterances or queries your skill supports. They may make it past the initial step of your skill, but may not know what to say at a given time. Your design should take into consideration the generation of help prompts for critical areas that are specific to that step. If the user asks for help, you should list all the skill capabilities for this specific area.

User: Book me a hotel in Seattle
Cortana: On what date?
User: April 1st
Cortana: What amenities would you like?
User: Help
Cortana: You can say 'pool', 'parking', or 'breakfast'.
User: pool and parking
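Step-specific help can be kept in a lookup keyed by the current dialog step. The sketch below assumes the skill tracks a step name for each turn; the step names, the date help text, and the fallback message are hypothetical.

```python
from typing import Optional

# Help text for each dialog step (hypothetical step names).
HELP_BY_STEP = {
    "date": "You can say a date, like 'April 1st'.",
    "amenities": "You can say 'pool', 'parking', or 'breakfast'.",
}
FALLBACK_HELP = "You can ask me to book a hotel."


def help_for(step: str, utterance: str) -> Optional[str]:
    """Return help for the current step if the user asked for it, otherwise None."""
    if utterance.strip().lower() != "help":
        return None
    return HELP_BY_STEP.get(step, FALLBACK_HELP)
```

Returning `None` for anything other than a help request lets the normal dialog handling continue unchanged.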

Presenting information to the User

Set the stage

It is important to let users know that they are in the right place within the interaction of your skill because, unlike in a visual interface, they do not have icons or text to orient themselves.

Good: Welcome to My Travel Agent. What can I do for you?
Bad: What can I do for you?

Speak to be heard

In a voice-driven application, you are creating an environment where voice is the main vector of information, so special attention should be given to how utterances and prompts are created in order to have them sound as close as possible to natural language. It is critical that you listen to your prompts on your test device to ensure that the TTS sounds natural. Avoid technical jargon, acronyms, and legal language as much as possible. Fragmented sentences and sentences that end with a preposition are acceptable if they sound natural in spoken dialog.

Good: The dimensions are 20 inches by 12 inches.
Bad: The dimensions are 20" x 12".
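Symbols such as the inch mark can be expanded before the string is sent to TTS. The helper below is a narrow sketch with a hypothetical name: it only handles whole-number inch measurements joined by "x" and always uses the plural "inches".

```python
import re


def spoken_dimensions(text: str) -> str:
    """Rewrite measurements like 20" x 12" so they read naturally aloud."""
    # Expand the inch mark after a whole number (always plural in this sketch).
    text = re.sub(r'(\d+)"', r"\1 inches", text)
    # Replace the "x" between two measurements with the spoken "by".
    return re.sub(r"(?<=inches) x (?=\d)", " by ", text)
```

Text without inch marks passes through unchanged, so the helper can be applied to every outgoing prompt.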

Piecemeal information

The human brain can only remember a small quantity of information when listening to instructions or requests so the voice interaction should be limited to what is absolutely required for the active interaction.

Good: I've found 10 concerts near you. Number 1. Spring Fling. Number 2. Hot Summer Night. Number 3. Falls a coming. Would you like to hear more? 
Bad: I've found 10 concerts near you. Number 1. Spring Fling. Number 2. Hot Summer Night. Number 3. Falls a coming. Number 4. ....
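Reading results a few at a time can be sketched as simple paging. The page size of three and the function name are assumptions for illustration; the wording mirrors the "Good" example above.

```python
from typing import List

PAGE_SIZE = 3  # assumed number of items read per turn


def read_results(titles: List[str], start: int = 0) -> str:
    """Read one page of results and offer more only if any remain."""
    page = titles[start:start + PAGE_SIZE]
    parts = []
    if start == 0:
        # Announce the total once, on the first page.
        parts.append(f"I've found {len(titles)} concerts near you.")
    parts += [f"Number {start + i + 1}. {title}." for i, title in enumerate(page)]
    if start + PAGE_SIZE < len(titles):
        parts.append("Would you like to hear more?")
    return " ".join(parts)
```

If the user says yes, the skill calls `read_results(titles, start=3)` to continue from the fourth item.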

Abbreviations and Symbols

Cortana's text-to-speech conversion handles most text such as abbreviations and special characters automatically.

For example:

  • "Dr. Smith" is spoken as "Doctor Smith"
  • "Microsoft.com" is spoken as "microsoft dot com"
  • "Shopping Ctr." is spoken as "shopping center"
  • "Lake Shore Dr." is spoken as "Lake Shore Drive"

Next Up

Card Design Best Practices