Principles of Cortana Skills design
Cortana Skills Kit enables developers to create compelling experiences that provide productivity solutions for home or work. But does it make sense to create a skill in all cases? Probably not.
The following are the cases when it makes sense to create a skill.
When the user's hands are busy
If the user is cooking, they can ask, "Was that 1 tablespoon or 1 teaspoon of salt?"
When using voice is an easier and more natural way to interact
If the user needs to share content or set a reminder, it's easier to say, "Share that with John" or "Remind me to pick up Jessie".
If the user is lying down, it's easier to use their voice than a keyboard.
When using voice is a more efficient experience
Using voice can reduce the number of steps required to do something. For example, saying "Play the latest House of Cards" is much easier than opening an app, searching for "House of Cards", finding the latest episode, and pressing play.
When using voice assists with multitasking
The user can use their voice to do things like read and reply to incoming messages or control music while writing a document.
When using voice is easier than typing
If the user is on a device that's hard to type on, such as a phone, tablet, or Xbox, it can be easier to use voice.
When the user is driving, walking, or is otherwise distracted
If the user is driving or walking, using voice is easier and safer than handling a device while navigating the world.
If you decide that building a skill makes sense, consider the following throughout the design process.
- Will the design solve the user’s problem with the minimum number of steps?
- Will the design solve the user’s problem better, easier, or faster than any of the alternative experiences?
- Is the design intuitive; will users naturally know what to do when using it?
- Does the design avoid complexity?
- Does the design use default values when the user is not specific?
A great user experience does not require users to talk too much, repeat themselves, or explain things that the skill should automatically know.
Design for the most common scenario
Define the key scenarios you want your skill to target.
- What are the high-value scenarios and metrics?
- Of these scenarios, which work well with voice?
  - Which scenarios don't rely heavily on visual elements?
  - Which scenarios are relatively quick to complete with minimal steps?
Select the scenarios that meet the above criteria.
Design the conversation
Next, design a conversation that sounds natural and is intuitive. Start out thinking about:
- What are the most essential questions that the skill must ask to complete the task?
- What questions might the user ask?
- How will the skill reply to the user's questions?
- What information is absolutely required and which is optional?
- Is there anything we can infer or remember from prior interactions?
- Which answers should you confirm before taking the action, and what type of confirmation should you use?
- What actions might the conversation trigger on your backend service?
- Does the skill need a directed dialog?
- How will the skill handle help questions and errors?
- How will the skill react if it reaches a dead end? For example, the task doesn't complete or the skill gets it wrong.
- What environments could the skill be used in, and how might potential touch interactions impact your design?
Role-play the conversation to make sure it's natural and intuitive. Just like in real life, conversations vary depending on the user. Your skill should be adept at handling conversations with different users.
Identify the intents, entities, and utterances
When building your skill, you are encouraged to use Microsoft Language Understanding Intelligent Service (LUIS.ai) to model your intents and entities.
For a list of built-in entities in LUIS, visit the Prebuilt entities page.
The conversational design process should identify intents, entities, and utterances. Intents are the actions that the user wants to perform and entities are the data required to perform the action. For example, if the user says, "Hey Cortana, tell My Travel Agent I want to fly to New York at 6:00 PM," the intent is to book a flight, and New York and 6:00 PM are the entities.
Intents fall into the following categories.
Full intent is when the user fully expresses what they want to do in a single utterance. When the user provides a complete request in their first utterance, you should respond directly to their request and either propose further interaction, if required, or end the conversation.
User: Hey Cortana, ask Mileage Wizard if I have miles to travel.
Mileage Wizard: You currently have 30,000 miles. This is enough to travel domestically but not internationally. Would you like to purchase additional miles or book a domestic trip?
Because there are many ways to express the same intent, you need to develop as many variations of the intent as possible when you model your intents.
If you display interaction examples in a card on Cortana's canvas (for example, when the user asks for help), try to show full intent examples. This helps train users on the most effective way to communicate with your skill.
Partial intent is when a user partially expresses what they want but the utterance is missing a required entity. Your skill should detect the missing element and automatically provide a follow-up prompt that asks for the missing entity.
User: Hey Cortana, ask Mileage Wizard if I have miles.
Mileage Wizard: Miles to travel?
No intent is when a user, often one using the skill for the first time, gives only minimal information that is not sufficient to engage in the conversation. When this happens, you need to tell the user how to interact with your skill.
It is critical that the skill consider the first-time user interaction and help the user get started. You should present a list of up to three potential options to choose from. If you present more than three options, the user might be overwhelmed and frustrated, which leads to a rejection of the voice experience.
User: Hey Cortana, ask Mileage Wizard.
Mileage Wizard: Do you want available miles, used miles, or discounts?
In LUIS, every language model includes the predefined "None" intent, which covers utterances that match none of your other intents.
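The three categories above can be sketched as a small routing function. This is a minimal illustration, not a LUIS or Bot Framework API: the intent name `CheckMiles`, the entity name `purpose`, and the lookup tables are hypothetical, and the recognized intent and entities are assumed to arrive from your language model as a string and a dictionary.

```python
# Hypothetical schema: the entities each intent needs before it can run.
REQUIRED_ENTITIES = {
    "CheckMiles": ["purpose"],
}

# Hypothetical follow-up prompts, keyed by (intent, missing entity).
FOLLOW_UP_PROMPTS = {
    ("CheckMiles", "purpose"): "Miles to travel?",
}

# Shown when no usable intent was recognized (at most three options).
GETTING_STARTED = "Do you want available miles, used miles, or discounts?"


def next_prompt(intent, entities):
    """Return the prompt to speak next, or None if the request is complete."""
    # No intent: tell the user how to interact with the skill.
    if intent is None or intent == "None":
        return GETTING_STARTED

    # Partial intent: ask for the first required entity that is missing.
    missing = [name for name in REQUIRED_ENTITIES.get(intent, [])
               if name not in entities]
    if missing:
        return FOLLOW_UP_PROMPTS[(intent, missing[0])]

    # Full intent: nothing more to ask; the skill can answer directly.
    return None
```

For example, `next_prompt("CheckMiles", {})` returns the follow-up question, while `next_prompt("CheckMiles", {"purpose": "travel"})` returns `None` so the skill can respond to the request directly.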
Asking users questions
All interactions with the user should use a conversational tone whether spoken or written.
Be conversational. Interact how people speak. Don’t emphasize grammatical accuracy over sounding natural. For example, ear-friendly verbal shortcuts like "wanna" or "gotta" are fine for text-to-speech (TTS).
Use the implied first person where possible and natural. For example, "Looking for your next Adventure Works trip" implies that someone is doing the looking, but does not use the word "I" to specify who.
Use contractions for more natural interactions and to save space on Cortana's canvas. For example, "I can’t find that movie" instead of "I was unable to find that movie." Write for the ear, not the eye.
Use variations to help make the skill sound more natural. When repeating a question, ask it differently the second time. For example, a variation of "What movie do you wanna see?" might be "What movie would you like to watch?"
Use phrases like "OK" and "Alright" in responses with restraint. While they provide acknowledgment and a sense of progress, they can also get repetitive if used too often and without variation. Use acknowledgment phrases in TTS only. Given the limited space on Cortana's canvas, don't write acknowledgments to the canvas.
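One simple way to vary a repeated question is to cycle through a list of phrasings based on how many times the question has been asked. The sketch below assumes the skill tracks an attempt counter itself; the function name and the counter are illustrative, not part of any Cortana API.

```python
QUESTION_VARIATIONS = [
    "What movie do you wanna see?",
    "What movie would you like to watch?",
]


def phrase_for_attempt(variations, attempt):
    """Pick a phrasing so that consecutive attempts are worded differently."""
    # attempt is 0 for the first ask, 1 for the first repeat, and so on.
    return variations[attempt % len(variations)]
```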
Your interactions should be:
|Principle||Good example||Bad example|
|Use as few words as possible and put the most important information up front.||Sure, what movie do you want to watch?||Sure can do, what movie would you like to search for today? We have a large collection.|
|Provide information pertinent to the task, content, and context.||I’ve added it to your playlist.||I’ve added it to your playlist. Just so you know, your battery is getting low.|
|Avoid ambiguity. Use everyday language instead of technical jargon.||I couldn’t find any trips to Las Vegas.||No results for query "Trips to Las Vegas".|
|Be as accurate as possible. Be transparent about what’s going on in the background. If a task hasn’t finished yet, don’t say that it has. Respect privacy, don’t read private information aloud.||I couldn’t find that movie in our catalogue.||I couldn’t find that movie, it must not have been released yet.|
Your design needs to make sure that the user clearly understands what you are asking and that a response is expected. Just presenting the options is not sufficient. Follow a list of options with a question so the user knows that they are expected to say something.
Cortana: The shirt comes in three colors: red, blue, and green. Which color would you like?
When presenting options to the user, ask in a way that makes it clear to the user that it is an either/or question. Limit the list of options to three entries, and state all options clearly.
Cortana: What type of directions would you like, walking or driving?
Because users can't quickly scan and skip content like they can in a visual interface, it is important to keep your questions simple and concise.
Asking questions using a directed prompt or open prompt
You can ask users questions using either a directed prompt or an open prompt. A directed prompt lists specific choices for the user. For example, "Please select cheese, pepperoni, or sausage." Directed prompts often minimize user confusion. An open prompt lets the user provide their own choice. For example, "Which movie would you like?" If users are familiar with the choices, open prompts are fine.
When choosing between directed and open prompts, also consider the number of options that you present to the user and how likely it is that the options will change.
Good uses of a directed prompt are when:
A wide variety of users use the skill, or they use it on an infrequent basis. For example, a call center application is best suited to use directed prompts.
There are never more than three options.
For directed prompts, use the form, "Please select X, Y, or Z." Don't use the form, "Would you like X or do you want Y or Z," because it may lead to a Yes or No response instead of an X, Y, or Z response.
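The directed-prompt form can be produced by a small helper that also enforces the three-option limit. This is a sketch under the assumption that the options are short spoken labels; the function name and the choice to raise an error for out-of-range lists are illustrative.

```python
MAX_OPTIONS = 3  # the guidance above: never more than three options


def directed_prompt(options):
    """Build a prompt of the form 'Please select X, Y, or Z.'"""
    if not 2 <= len(options) <= MAX_OPTIONS:
        raise ValueError("a directed prompt needs two or three options")
    if len(options) == 2:
        listed = f"{options[0]} or {options[1]}"
    else:
        listed = f"{', '.join(options[:-1])}, or {options[-1]}"
    return f"Please select {listed}."
```

When the list is longer than three entries or changes often, the error is a signal to switch to an open prompt instead.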
If the list of options is long (for example, a list of stock investments) or variable (for example, movie titles), using a directed prompt is impractical. In this case, use an open prompt. For example:
Cortana: Please say a stock's name.
User: Help.
Cortana: Please say a stock's name. For example, say Microsoft.
Headless devices (like speakers) reprompt with the last message sent if there is no input. If there is still no input after the reprompt, the conversation ends. Good interactive dialogs should wait on a prompt.
Confirming a user's answer
A confirmation is an acknowledgement that your skill heard the user's response. For example:
Cortana: Where do you want to fly to?
User: Paris
Cortana: Which date do you want to leave for Paris?
Think about where in the conversation flow users need confirmations. Recognizing speech from a telephone is not perfect, particularly under noisy conditions. In addition, when skills are used on a standalone speaker, you have only one channel of communication with the user. An effective confirmation and correction strategy alleviates these issues.
A good voice-based skill uses a variety of techniques for confirmation and correction. The techniques depend on the style of the skill, the importance of the action being performed, the cost of misunderstanding, and the need for a natural dialog.
For example, a dialog that follows each question with a confirmation, such as "Did you say X?", is slow and potentially very frustrating. Conversely, a dialog that employs no confirmation and, based on a misrecognized command, deletes data without first checking with the user, is equally frustrating. A developer must strike a balance between efficient interaction with the skill and protection from wasted time or lost data.
In many cases, the cost of misrecognition is so low that confirmation is not warranted. In other cases, explicit confirmation is always required, regardless of the skill's confidence in the user's utterance.
The following are different confirmation strategies you can use.
Explicit confirmation is the most basic form of confirmation. It also slows the conversation flow because it introduces an extra prompt to explicitly confirm information that the user provided. Use explicit confirmation for situations where the cost of a misunderstanding is high. For example, in a flight booking application, the application must understand the cities that the user wishes to fly between. The following shows an explicit confirmation interaction.
Cortana: Where are you flying from?
User: Seattle
Cortana: Did you say Seattle?
User: Yes
Cortana: Which date do you want to leave?
Implicit confirmation combines the confirmation with your next question. This method uses fewer prompts than explicit confirmation. Consider a flight booking scenario where the skill obtains the city that the user is flying from, followed by the date. The following shows an implicit confirmation interaction.
Cortana: Where are you flying from?
User: Seattle
Cortana: Flying from Seattle. Which date?
If the user answers this question with a date, then the answer implies that Seattle is correct, thereby confirming Seattle as the departure city. The grammar for implicit confirmation interaction is subtly different from the grammar for explicit confirmation. The grammar for implicit confirmation combines acceptance or denial of the previous prompt (in this case, the city) with supplying information for the next prompt.
Cortana: Flying from Seattle. Which date?
User: No
Cortana: Where are you flying from?
User: Vancouver
Cortana: Flying from Boston. Which date?
User: No, Vancouver
Cortana: Flying from Vancouver. Which date?
Answering with a simple yes does not answer this kind of question.
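The choice between the strategies above can be sketched as a lookup on the cost of a misunderstanding. The three cost levels and their mapping to strategies are assumptions made for illustration; your skill may weigh other factors, such as recognition confidence, as well.

```python
from enum import Enum


class Confirmation(Enum):
    NONE = "none"
    IMPLICIT = "implicit"
    EXPLICIT = "explicit"


# Assumed mapping from the cost of a misunderstanding to a strategy.
STRATEGY_BY_COST = {
    "low": Confirmation.NONE,
    "medium": Confirmation.IMPLICIT,
    "high": Confirmation.EXPLICIT,
}


def next_turn(cost, heard_city, next_question):
    """Return what the skill says after hearing the departure city."""
    strategy = STRATEGY_BY_COST[cost]
    if strategy is Confirmation.EXPLICIT:
        # Extra prompt: confirm before moving on.
        return f"Did you say {heard_city}?"
    if strategy is Confirmation.IMPLICIT:
        # Echo what was heard and ask the next question in one turn.
        return f"Flying from {heard_city}. {next_question}"
    # Low cost: skip confirmation entirely.
    return next_question
```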
Other design considerations
Your design should include help prompts, especially for critical areas of your skill. If the user asks for help, you should list the skill's capabilities and options for that specific area.
User: Book me a hotel in Seattle
Cortana: Which date do you want to check in to your Seattle hotel?
User: April 1st
Cortana: What amenities would you like?
User: Help
Cortana: You can say things like 'pool', 'parking', or 'breakfast'
User: pool and parking
Using default values
Use default values when the user is not specific. For example, if the user says, "Make my room warmer," Cortana should say, "I’ve raised your room temperature to 72 degrees" instead of "Sure, what temperature?"
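As a minimal sketch of this behavior, the handler below falls back to a default when the user gives no temperature. The function name and the choice of 72 degrees as the default are assumptions taken from the example sentence, not a platform setting.

```python
DEFAULT_WARMER_TARGET = 72  # assumed default, in degrees


def handle_make_warmer(requested_temperature=None):
    """Act immediately, using a default when the user was not specific."""
    if requested_temperature is None:
        target = DEFAULT_WARMER_TARGET
    else:
        target = requested_temperature
    return f"I've raised your room temperature to {target} degrees."
```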
Identifying the skill when invoked
If the user invokes your skill without including an utterance, you should identify your skill and display your help content or ask them what they want to do with leading questions.
Good: Welcome to My Travel Agent. To book a trip say, "Book a trip," or to get the status of your miles say, "Available miles" or "Used miles."
Bad: What can I do for you?
Break lists into manageable pieces
Most people can remember only a small amount of information when listening to instructions. Limit voice interactions to only what is absolutely required. For example, present only three items of a list at a time.
Good: I've found 10 concerts near you. Number 1. Spring Fling. Number 2. Hot Summer Night. Number 3. Falls a coming. Would you like to hear more?
Bad: I've found 10 concerts near you. Number 1. Spring Fling. Number 2. Hot Summer Night. Number 3. Falls a coming. Number 4. ....
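Reading a list three items at a time can be sketched as a paging helper. The function name, the `start` offset, and the exact wording of the follow-up question are illustrative; the skill is assumed to remember the offset between turns.

```python
def spoken_page(items, start=0, page_size=3):
    """Speak one page of a list and offer more only if items remain."""
    page = items[start:start + page_size]
    parts = [f"Number {start + i + 1}. {title}."
             for i, title in enumerate(page)]
    if start + page_size < len(items):
        parts.append("Would you like to hear more?")
    return " ".join(parts)
```

If the user answers yes, call the helper again with `start` advanced by `page_size`.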
Abbreviations and symbols
Cortana's text-to-speech translator handles most text such as abbreviations and special characters automatically.
- "Dr. Smith" is spoken as "Doctor Smith"
- "Microsoft.com" is spoken as "microsoft dot com"
- "Shopping Ctr." is spoken as "shopping center"
- "Lake Shore Dr." is spoken as "Lake Shore Drive"
Design your skill's visual elements
Although the goal is to design your skill for voice only, for some skills that's not practical. There are times when displaying visual elements to support the information that your skill speaks is required. For example, if the user asks for recipes, displaying pictures of the prepared food with links to the recipe can be a better experience than trying to describe each recipe with voice only. Visual elements should not be required to communicate the intent; they should only support it.
Things to consider when designing the visual elements:
- How can the visual elements enhance the voice experience and assist the user?
- Don’t overload the card experience with unnecessary information, such as too many images, unnecessary text, or too many cards.
- Keep tasks glanceable. The skill must allow users to multitask with minimal visual attention. Consider every piece of information on the canvas and eliminate anything that is not required.
Adding visual elements to your skill
Cortana supports Bot Framework cards, which are rich, graphical controls that contain text, images, and interactive buttons. Skills may include the following cards:
|Card Type||Description||Supported layout|
|Adaptive Card||A user-designed card that contains the elements you specify||Single or Carousel|
|Hero Card||A card with one big image||Single or Carousel|
|Thumbnail Card||A card with a single small image||Single or Carousel|
|Receipt Card||A card that lets the skill provide an invoice or receipt to the user||Single|
|Sign-In Card||A card that lets the skill initiate a sign-in procedure||Single|
The following image shows a card displayed on Cortana's canvas.
In addition to cards, Node.js users can use a set of built-in prompts to simplify collecting inputs from a user. For example, you can use the choice prompt to present a list of choices that the user can pick from, or you can use the confirm prompt to confirm an action. For a list of prompts, see Prompt types.
Card design tips
Limit the card's title to 84 characters or fewer
Limiting the title to 84 characters keeps it to two lines or fewer. Longer titles don't look good and push the rest of the content down in the card.
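If titles come from dynamic data, a skill can enforce the limit before building the card. The sketch below is one possible approach, not platform behavior: shortening at a word boundary and appending an ellipsis are assumptions, and you may prefer to rewrite long titles by hand.

```python
MAX_TITLE_CHARS = 84


def fit_title(title, limit=MAX_TITLE_CHARS):
    """Return the title unchanged if it fits, else shorten it with '...'."""
    if len(title) <= limit:
        return title
    # Leave room for the ellipsis, then drop any partially cut last word.
    cut = title[:limit - 3].rsplit(" ", 1)[0]
    return cut + "..."
```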
Create brief but meaningful responses
When possible, create brief but meaningful responses with a bias towards text-based answers that are glanceable. Make cards crisp, clear, and actionable.
Try to fit your content within the height of Cortana's canvas
Limiting the content to the size of Cortana's canvas makes it easy for the user to see the content without scrolling.
Use cards to provide details
Cards are meant to provide additional information beyond what the skill speaks. If a user has to listen to Cortana read all the details that are on the card, it often won't be the best user experience. Provide a summary of the card using voice and use the card for the details. For example, if the card shows shirt choices, the skill might say, "Select the shirt's color."
Direct users to a screen only when needed
Some users use Cortana on speaker-only devices and can't see a card. If the user needs to see a card, direct them to Cortana's companion app. However, try to do this sparingly. Ideally, a user should be able to use your skill on any supported device, with or without a screen, without having to use a secondary device. The exception is when the user has to sign in to your skill or provide private information. Note that other than the sign-in card, cards are read-only on speaker devices.
The following scenario shows a cooking skill that provides a list of ingredients.
Good: This recipe has 5 ingredients. Here are the first 3: 2 eggs, a cup of flour, and half a cup of water. Say "next" for the rest of the ingredients.
Bad: Open the companion app to see the list of ingredients.
Note that the time between user utterances is limited. On a speaker-only device, if a timeout occurs, the skill ends. On Windows, the skill is still active but the microphone turns off. For cases like this, it may also be a good idea to ask the user if they would like the instructions emailed to them.
Tailor the experience
Tailor the experience based on the device the user is using. If they are using a standalone speaker device, rely on speech to convey the message to the user. If they have a screen, share a quick summary using voice and add additional information in the card. For example, if a user is shopping for a gift, the following are ideas on how to present the information to the user.
Standalone speaker:
- Voice: The Contoso shirt is a custom-made shirt available in three colors: red, blue, and orange. Sizes include small, medium, and large. It retails for $30.
Device with screen:
- Voice: The Contoso shirt is a custom-made shirt that retails for $30.
- Card: Show an image and additional details such as sizes/dimensions and color options. The Bot Framework's Hero card is a good option for this case. If presenting several items to the user, a carousel of Hero cards works well.
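The two presentations above can be sketched as a single function that branches on whether the device has a screen. How the skill learns this is outside the sketch: `has_screen` is a hypothetical flag, and the returned dictionary is an illustrative shape rather than a Bot Framework type.

```python
SUMMARY = "The Contoso shirt is a custom-made shirt that retails for $30."
FULL_DESCRIPTION = (
    "The Contoso shirt is a custom-made shirt available in three colors: "
    "red, blue, and orange. Sizes include small, medium, and large. "
    "It retails for $30."
)


def build_response(has_screen):
    """Speak everything on a speaker; summarize and show a card on a screen."""
    if has_screen:
        card = {
            "title": "Contoso shirt",
            "colors": ["red", "blue", "orange"],
            "sizes": ["small", "medium", "large"],
        }
        return {"speak": SUMMARY, "card": card}
    # No screen: speech has to carry all of the details.
    return {"speak": FULL_DESCRIPTION, "card": None}
```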
Use horizontal lists
A card's attachment layout specifies how to display multiple card attachments. The framework supports vertical list layout and horizontal carousel layout. Use carousels, if possible.
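As an illustration, the sketch below builds a message activity as a plain dictionary with the `attachmentLayout` property set to `carousel` and two Hero cards as attachments. The helper names, card titles, and example.com image URLs are hypothetical; in a real skill you would typically build the activity with the Bot Framework SDK rather than by hand.

```python
HERO_CARD_TYPE = "application/vnd.microsoft.card.hero"


def hero_card(title, image_url):
    """Wrap a title and one image as a Hero card attachment."""
    return {
        "contentType": HERO_CARD_TYPE,
        "content": {"title": title, "images": [{"url": image_url}]},
    }


def carousel_message(speak, cards):
    """Build a message that shows the cards side by side."""
    return {
        "type": "message",
        "speak": speak,
        "attachmentLayout": "carousel",  # the alternative is "list"
        "attachments": cards,
    }


message = carousel_message(
    "Here are two shirts you might like.",
    [
        hero_card("Contoso shirt, red", "https://example.com/red.png"),
        hero_card("Contoso shirt, blue", "https://example.com/blue.png"),
    ],
)
```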
Cortana's persona has been well crafted and it's very important that you maintain her character in all interactions with users. For information about maintaining her persona in your skill, see Cortana's persona.
For performance design considerations, including Azure services, see Performance guidelines.
For invocation name do's and don'ts, see Invocation name guidelines.
When you publish your skill to the world, the Cortana team reviews your skill to make sure it complies with the design principles in addition to a few other requirements. As part of your design process, be sure to read the list of review requirements that your skill must comply with before you can publish your skill (see Cortana skills certification requirements).