Voice input in Unity


Instead of the information below, consider using the Unity plug-in for the Cognitive Speech Services SDK, which has much better speech accuracy and provides easy access to speech-to-text decoding and advanced speech features such as dialog, intent-based interaction, translation, text-to-speech synthesis, and natural language speech recognition. Find the sample and documentation here: https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-csharp-unity

Unity exposes three ways to add Voice input to your Unity application.

With the KeywordRecognizer (one of two types of PhraseRecognizers), your app can be given an array of string commands to listen for. With the GrammarRecognizer (the other type of PhraseRecognizer), your app can be given an SRGS file defining a specific grammar to listen for. With the DictationRecognizer, your app can listen for any word and provide the user with a note or other display of their speech.


Dictation and phrase recognition can't run at the same time. If a GrammarRecognizer or KeywordRecognizer is active, a DictationRecognizer can't be active, and vice versa.

Enabling the capability for Voice

The Microphone capability must be declared for an app to leverage Voice input.

  1. In the Unity Editor, go to the player settings by navigating to "Edit > Project Settings > Player"
  2. Click on the "Windows Store" tab
  3. In the "Publishing Settings > Capabilities" section, check the Microphone capability
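If you prefer to set the capability from script rather than through the settings UI, the steps above can be sketched as a small editor utility. This is a minimal sketch, not part of the original walkthrough: the class name and menu path are hypothetical, and the file is assumed to live in an Editor folder so that it is only compiled for the Unity Editor.

```csharp
// Editor-only sketch: place this file under a folder named "Editor".
using UnityEditor;

public static class VoiceCapabilitySetup // hypothetical helper name
{
    // Adds a (hypothetical) menu entry that declares the Microphone capability
    // in the Universal Windows Platform player settings.
    [MenuItem("Tools/Voice/Enable Microphone Capability")]
    private static void EnableMicrophoneCapability()
    {
        PlayerSettings.WSA.SetCapability(PlayerSettings.WSACapability.Microphone, true);
    }
}
```

Running the menu item should have the same effect as checking the Microphone box by hand, which can be convenient in automated build setups.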

Phrase Recognition

To enable your app to listen for specific phrases spoken by the user and then take some action, you need to:

  1. Specify which phrases to listen for using a KeywordRecognizer or GrammarRecognizer
  2. Handle the OnPhraseRecognized event and take action corresponding to the phrase recognized


KeywordRecognizer

Namespace: UnityEngine.Windows.Speech
Types: KeywordRecognizer, PhraseRecognizedEventArgs, SpeechError, SpeechSystemStatus

We'll need a few using statements to save some keystrokes:

using UnityEngine.Windows.Speech;
using System.Collections.Generic;
using System.Linq;

Then let's add a few fields to your class to store the recognizer and keyword->action dictionary:

KeywordRecognizer keywordRecognizer;
Dictionary<string, System.Action> keywords = new Dictionary<string, System.Action>();

Now add a keyword to the dictionary (e.g. inside of a Start() method). We're adding the "activate" keyword in this example:

// Create keywords for the keyword recognizer
keywords.Add("activate", () =>
{
    // action to be performed when this keyword is spoken
});

Create the keyword recognizer and tell it what we want to recognize:

keywordRecognizer = new KeywordRecognizer(keywords.Keys.ToArray());

Now register for the OnPhraseRecognized event:

keywordRecognizer.OnPhraseRecognized += KeywordRecognizer_OnPhraseRecognized;

An example handler is:

private void KeywordRecognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
{
    System.Action keywordAction;
    // If the recognized keyword is in our dictionary, call that Action.
    if (keywords.TryGetValue(args.text, out keywordAction))
        keywordAction.Invoke();
}

Finally, start recognizing!

keywordRecognizer.Start();
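Putting the keyword steps above together, a complete component might look like the following. This is a minimal sketch under a few assumptions: the class name VoiceCommands is hypothetical, the "activate" action only writes a log message, and the cleanup in OnDestroy is an addition that the walkthrough does not cover.

```csharp
using System.Collections.Generic;
using System.Linq;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class VoiceCommands : MonoBehaviour // hypothetical class name
{
    private KeywordRecognizer keywordRecognizer;
    private readonly Dictionary<string, System.Action> keywords =
        new Dictionary<string, System.Action>();

    private void Start()
    {
        // Map each spoken keyword to the action it should trigger.
        keywords.Add("activate", () => Debug.Log("Heard: activate"));

        // Tell the recognizer which keywords to listen for, then start it.
        keywordRecognizer = new KeywordRecognizer(keywords.Keys.ToArray());
        keywordRecognizer.OnPhraseRecognized += KeywordRecognizer_OnPhraseRecognized;
        keywordRecognizer.Start();
    }

    private void KeywordRecognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        System.Action keywordAction;
        // Only act on phrases that are present in the dictionary.
        if (keywords.TryGetValue(args.text, out keywordAction))
        {
            keywordAction.Invoke();
        }
    }

    private void OnDestroy()
    {
        // Assumption: stop and release the recognizer with this component.
        if (keywordRecognizer == null) return;
        if (keywordRecognizer.IsRunning) keywordRecognizer.Stop();
        keywordRecognizer.OnPhraseRecognized -= KeywordRecognizer_OnPhraseRecognized;
        keywordRecognizer.Dispose();
    }
}
```

Attach the component to any GameObject in the scene. Keeping the keyword-to-action mapping in one dictionary means that adding a command is a single keywords.Add call before the recognizer is created.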



GrammarRecognizer

Namespace: UnityEngine.Windows.Speech
Types: GrammarRecognizer, PhraseRecognizedEventArgs, SpeechError, SpeechSystemStatus

The GrammarRecognizer is used if you're specifying your recognition grammar using SRGS. This can be useful if your app has more than just a few keywords, if you want to recognize more complex phrases, or if you want to easily turn on and off sets of commands. See: Create Grammars Using SRGS XML for file format information.

Once you have your SRGS grammar, place it in your project's StreamingAssets folder (the code below expects it at StreamingAssets/SRGS/myGrammar.xml).

Then create a GrammarRecognizer and pass it the path to your SRGS file:

private GrammarRecognizer grammarRecognizer;
grammarRecognizer = new GrammarRecognizer(Application.streamingAssetsPath + "/SRGS/myGrammar.xml");

Now register for the OnPhraseRecognized event:

grammarRecognizer.OnPhraseRecognized += Grammar_OnPhraseRecognized;

You will get a callback containing information specified in your SRGS grammar which you can handle appropriately. Most of the important information will be provided in the semanticMeanings array.

private void Grammar_OnPhraseRecognized(PhraseRecognizedEventArgs args)
{
    SemanticMeaning[] meanings = args.semanticMeanings;
    // do something
}

Finally, start recognizing!

grammarRecognizer.Start();
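The grammar steps above can be combined into one component, as in the sketch below. It makes several assumptions: the class name GrammarCommands is hypothetical, the grammar file is at StreamingAssets/SRGS/myGrammar.xml, and the handler only logs each semantic key with its values. What those keys and values contain depends entirely on the tags defined in your own SRGS file.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

public class GrammarCommands : MonoBehaviour // hypothetical class name
{
    private GrammarRecognizer grammarRecognizer;

    private void Start()
    {
        // Assumed location of the grammar file inside StreamingAssets.
        string grammarPath = Application.streamingAssetsPath + "/SRGS/myGrammar.xml";

        grammarRecognizer = new GrammarRecognizer(grammarPath);
        grammarRecognizer.OnPhraseRecognized += Grammar_OnPhraseRecognized;
        grammarRecognizer.Start();
    }

    private void Grammar_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        SemanticMeaning[] meanings = args.semanticMeanings;
        if (meanings == null) return;

        // Each SemanticMeaning pairs a key from the grammar with its values.
        foreach (SemanticMeaning meaning in meanings)
        {
            Debug.Log(meaning.key + ": " + string.Join(", ", meaning.values));
        }
    }

    private void OnDestroy()
    {
        // Assumption: stop and release the recognizer with this component.
        if (grammarRecognizer == null) return;
        if (grammarRecognizer.IsRunning) grammarRecognizer.Stop();
        grammarRecognizer.OnPhraseRecognized -= Grammar_OnPhraseRecognized;
        grammarRecognizer.Dispose();
    }
}
```

Logging the semantic meanings first is a quick way to confirm that the grammar produces the keys you expect before you wire them to real actions.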



DictationRecognizer

Namespace: UnityEngine.Windows.Speech
Types: DictationRecognizer, SpeechError, SpeechSystemStatus

Use the DictationRecognizer to convert the user's speech to text. The DictationRecognizer exposes dictation functionality and supports registering and listening for hypothesis and phrase completed events, so you can give feedback to your user both while they speak and afterwards. The Start() and Stop() methods respectively enable and disable dictation recognition. Once you're done with the recognizer, dispose of it using the Dispose() method to release the resources it uses. If you don't, those resources are released automatically during garbage collection, at an additional performance cost.

There are only a few steps needed to get started with dictation:

  1. Create a new DictationRecognizer
  2. Handle Dictation events
  3. Start the DictationRecognizer

Enabling the capability for dictation

The "Internet Client" capability, in addition to the "Microphone" capability mentioned above, must be declared for an app to leverage dictation.

  1. In the Unity Editor, go to the player settings by navigating to "Edit > Project Settings > Player"
  2. Click on the "Windows Store" tab
  3. In the "Publishing Settings > Capabilities" section, check the InternetClient capability


Create a DictationRecognizer like so:

dictationRecognizer = new DictationRecognizer();

There are four dictation events that can be subscribed to and handled to implement dictation behavior.

  1. DictationResult
  2. DictationComplete
  3. DictationHypothesis
  4. DictationError


The DictationResult event is fired after the user pauses, typically at the end of a sentence. The full recognized string is returned here.

First, subscribe to the DictationResult event:

dictationRecognizer.DictationResult += DictationRecognizer_DictationResult;

Then handle the DictationResult callback:

private void DictationRecognizer_DictationResult(string text, ConfidenceLevel confidence)
{
    // do something
}


The DictationHypothesis event is fired continuously while the user is talking. As the recognizer listens, it provides the text of what it has heard so far.

First, subscribe to the DictationHypothesis event:

dictationRecognizer.DictationHypothesis += DictationRecognizer_DictationHypothesis;

Then handle the DictationHypothesis callback:

private void DictationRecognizer_DictationHypothesis(string text)
{
    // do something
}


The DictationComplete event is fired when the recognizer stops, whether from Stop() being called, a timeout occurring, or some other error.

First, subscribe to the DictationComplete event:

dictationRecognizer.DictationComplete += DictationRecognizer_DictationComplete;

Then handle the DictationComplete callback:

private void DictationRecognizer_DictationComplete(DictationCompletionCause cause)
{
    // do something
}


The DictationError event is fired when an error occurs.

First, subscribe to the DictationError event:

dictationRecognizer.DictationError += DictationRecognizer_DictationError;

Then handle the DictationError callback:

private void DictationRecognizer_DictationError(string error, int hresult)
{
    // do something
}

Once you have subscribed to and handled the dictation events that you care about, start the dictation recognizer to begin receiving events.

dictationRecognizer.Start();
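As a recap of the dictation steps above, the sketch below creates the recognizer, subscribes to all four events, and starts listening. The class name DictationExample is hypothetical, and the handlers only log what they receive. The cleanup in OnDestroy is an assumption about when your app is done with the recognizer.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationExample : MonoBehaviour // hypothetical class name
{
    private DictationRecognizer dictationRecognizer;

    private void Start()
    {
        dictationRecognizer = new DictationRecognizer();

        dictationRecognizer.DictationResult += DictationRecognizer_DictationResult;
        dictationRecognizer.DictationHypothesis += DictationRecognizer_DictationHypothesis;
        dictationRecognizer.DictationComplete += DictationRecognizer_DictationComplete;
        dictationRecognizer.DictationError += DictationRecognizer_DictationError;

        dictationRecognizer.Start();
    }

    private void DictationRecognizer_DictationResult(string text, ConfidenceLevel confidence)
    {
        // Final text for the phrase that was just completed.
        Debug.Log("Result (" + confidence + "): " + text);
    }

    private void DictationRecognizer_DictationHypothesis(string text)
    {
        // Partial text; useful for live feedback while the user speaks.
        Debug.Log("Hypothesis: " + text);
    }

    private void DictationRecognizer_DictationComplete(DictationCompletionCause cause)
    {
        Debug.Log("Dictation stopped: " + cause);
    }

    private void DictationRecognizer_DictationError(string error, int hresult)
    {
        Debug.LogError("Dictation error: " + error + " (HResult " + hresult + ")");
    }

    private void OnDestroy()
    {
        // Assumption: release the recognizer together with this component.
        if (dictationRecognizer == null) return;
        dictationRecognizer.DictationResult -= DictationRecognizer_DictationResult;
        dictationRecognizer.DictationHypothesis -= DictationRecognizer_DictationHypothesis;
        dictationRecognizer.DictationComplete -= DictationRecognizer_DictationComplete;
        dictationRecognizer.DictationError -= DictationRecognizer_DictationError;
        dictationRecognizer.Dispose();
    }
}
```

In a real app, the hypothesis handler would typically update on-screen text, and the result handler would commit it.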


If you no longer want to keep the DictationRecognizer around, you need to unsubscribe from the events and dispose of the DictationRecognizer.

dictationRecognizer.DictationResult -= DictationRecognizer_DictationResult;
dictationRecognizer.DictationComplete -= DictationRecognizer_DictationComplete;
dictationRecognizer.DictationHypothesis -= DictationRecognizer_DictationHypothesis;
dictationRecognizer.DictationError -= DictationRecognizer_DictationError;
dictationRecognizer.Dispose();


  • The Start() and Stop() methods respectively enable and disable dictation recognition.
  • Once you're done with the recognizer, dispose of it using the Dispose() method to release the resources it uses. If you don't, those resources are released automatically during garbage collection, at an additional performance cost.
  • Timeouts occur after a set period of time. You can check for these timeouts in the DictationComplete event. There are two timeouts to be aware of:
    1. If the recognizer starts and doesn't hear any audio for the first five seconds, it will time out.
    2. If the recognizer has given a result but then hears silence for twenty seconds, it will time out.
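One way to react to the timeouts described above is to inspect the completion cause and restart dictation when it is DictationCompletionCause.TimeoutExceeded. The sketch below is one possible approach, not a prescribed pattern. The class name is hypothetical, and restarting from inside the DictationComplete handler is an assumption that you should verify on your target device. The two timeout properties are set to the five- and twenty-second values mentioned above purely for illustration.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationTimeoutHandling : MonoBehaviour // hypothetical class name
{
    private DictationRecognizer dictationRecognizer;

    private void Start()
    {
        dictationRecognizer = new DictationRecognizer();

        // Illustrative values that mirror the timeouts described in the text.
        dictationRecognizer.InitialSilenceTimeoutSeconds = 5f;
        dictationRecognizer.AutoSilenceTimeoutSeconds = 20f;

        dictationRecognizer.DictationComplete += OnDictationComplete;
        dictationRecognizer.Start();
    }

    private void OnDictationComplete(DictationCompletionCause cause)
    {
        if (cause == DictationCompletionCause.TimeoutExceeded)
        {
            // Assumption: the app wants to keep listening after a silence timeout.
            Debug.Log("Dictation timed out; restarting.");
            dictationRecognizer.Start();
        }
        else if (cause != DictationCompletionCause.Complete)
        {
            Debug.LogWarning("Dictation stopped unexpectedly: " + cause);
        }
    }

    private void OnDestroy()
    {
        if (dictationRecognizer == null) return;
        dictationRecognizer.DictationComplete -= OnDictationComplete;
        dictationRecognizer.Dispose();
    }
}
```

Whether to restart automatically is a design choice. An app that listens only on demand would more likely surface the timeout to the user instead.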

Using both Phrase Recognition and Dictation

If you want to use both phrase recognition and dictation in your app, you'll need to fully shut one down before you can start the other. If you have multiple KeywordRecognizers running, you can shut them all down at once with:

PhraseRecognitionSystem.Shutdown();


To restore all recognizers to their previous state after the DictationRecognizer has stopped, you can call:

PhraseRecognitionSystem.Restart();


You could also just start a KeywordRecognizer, which will restart the PhraseRecognitionSystem as well.
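The hand-off between the two systems can be sketched with a pair of coroutines that wait for one system to stop before starting the other. This is a sketch under assumptions: the class and method names are hypothetical, and polling the Status property each frame is one possible way to wait for a full shutdown, not a documented requirement.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class RecognizerSwitcher : MonoBehaviour // hypothetical class name
{
    private DictationRecognizer dictationRecognizer;

    private void Awake()
    {
        dictationRecognizer = new DictationRecognizer();
    }

    // Usage: StartCoroutine(SwitchToDictation());
    public IEnumerator SwitchToDictation()
    {
        // Shut down every running phrase recognizer at once.
        if (PhraseRecognitionSystem.Status == SpeechSystemStatus.Running)
        {
            PhraseRecognitionSystem.Shutdown();
        }

        // Assumption: wait until phrase recognition has actually stopped.
        while (PhraseRecognitionSystem.Status == SpeechSystemStatus.Running)
        {
            yield return null;
        }

        dictationRecognizer.Start();
    }

    // Usage: StartCoroutine(SwitchToPhraseRecognition());
    public IEnumerator SwitchToPhraseRecognition()
    {
        if (dictationRecognizer.Status == SpeechSystemStatus.Running)
        {
            dictationRecognizer.Stop();
        }

        // Assumption: wait until dictation has actually stopped.
        while (dictationRecognizer.Status == SpeechSystemStatus.Running)
        {
            yield return null;
        }

        // Restores the phrase recognizers to their previous state.
        PhraseRecognitionSystem.Restart();
    }

    private void OnDestroy()
    {
        if (dictationRecognizer != null) dictationRecognizer.Dispose();
    }
}
```

Waiting on the status keeps the two systems from overlapping, which matches the requirement that only one of them is active at a time.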

Using the microphone helper

The Mixed Reality Toolkit on GitHub contains a microphone helper class that tells developers whether there's a usable microphone on the system. One use for it is to check whether a microphone is present before showing any speech interaction hints in the application.

The microphone helper script can be found in the Input/Scripts/Utilities folder. The GitHub repo also contains a small sample demonstrating how to use the helper.
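If you are not using the Mixed Reality Toolkit, a rough approximation of this check is possible with Unity's built-in Microphone class. The sketch below is not the toolkit's helper, and its class and method names are hypothetical. It only reports whether Unity lists at least one recording device, which does not guarantee that the device is usable for speech recognition.

```csharp
using UnityEngine;

public class MicrophoneCheck : MonoBehaviour // hypothetical class name
{
    // Returns true when Unity reports at least one recording device.
    public static bool HasMicrophone()
    {
        string[] devices = Microphone.devices;
        return devices != null && devices.Length > 0;
    }

    private void Start()
    {
        if (!HasMicrophone())
        {
            // Assumption: the app hides its speech hints when no device is found.
            Debug.LogWarning("No microphone detected; hiding speech interaction hints.");
        }
    }
}
```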

Voice input in Mixed Reality Toolkit

You can find examples of voice input in this scene.