Voice input in DirectX

Article
01/19/2021

Note

This article relates to the legacy WinRT native APIs. For new native app projects, we recommend using the OpenXR API.

This article explains how to implement voice commands plus small-phrase and sentence recognition in a DirectX app for Windows Mixed Reality.

Note

The code snippets in this article use C++/CX rather than C++17-compliant C++/WinRT, which is used in the C++ holographic project template. The concepts are equivalent for a C++/WinRT project, but you need to translate the code.

Use SpeechRecognizer for continuous speech recognition

This section describes how to use continuous speech recognition to enable voice commands in your app. This walk-through uses code from the HolographicVoiceInput sample. When the sample is running, speak the name of one of the registered color commands to change the color of the spinning cube.

First, create a new Windows::Media::SpeechRecognition::SpeechRecognizer instance.

From HolographicVoiceInputSampleMain::CreateSpeechConstraintsForCurrentState:

m_speechRecognizer = ref new SpeechRecognizer();

Create a list of speech commands for the recognizer to listen for. Here, we construct a set of commands to change the color of a hologram. For convenience, we also create the data that we'll use for the commands later.

m_speechCommandList = ref new Platform::Collections::Vector<String^>();
   m_speechCommandData.clear();
   m_speechCommandList->Append(StringReference(L"white"));
   m_speechCommandData.push_back(float4(1.f, 1.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"grey"));
   m_speechCommandData.push_back(float4(0.5f, 0.5f, 0.5f, 1.f));
   m_speechCommandList->Append(StringReference(L"green"));
   m_speechCommandData.push_back(float4(0.f, 1.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"black"));
   m_speechCommandData.push_back(float4(0.1f, 0.1f, 0.1f, 1.f));
   m_speechCommandList->Append(StringReference(L"red"));
   m_speechCommandData.push_back(float4(1.f, 0.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"yellow"));
   m_speechCommandData.push_back(float4(1.f, 1.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"aquamarine"));
   m_speechCommandData.push_back(float4(0.f, 1.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"blue"));
   m_speechCommandData.push_back(float4(0.f, 0.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"purple"));
   m_speechCommandData.push_back(float4(1.f, 0.f, 1.f, 1.f));

You can use phonetic words that might not be in a dictionary to specify commands.

m_speechCommandList->Append(StringReference(L"SpeechRecognizer"));
   m_speechCommandData.push_back(float4(0.5f, 0.1f, 1.f, 1.f));

To load the commands list into the list of constraints for the speech recognizer, use a SpeechRecognitionListConstraint object.

SpeechRecognitionListConstraint^ spConstraint = ref new SpeechRecognitionListConstraint(m_speechCommandList);
   m_speechRecognizer->Constraints->Clear();
   m_speechRecognizer->Constraints->Append(spConstraint);
   create_task(m_speechRecognizer->CompileConstraintsAsync()).then([this](SpeechRecognitionCompilationResult^ compilationResult)
   {
       if (compilationResult->Status == SpeechRecognitionResultStatus::Success)
       {
           m_speechRecognizer->ContinuousRecognitionSession->StartAsync();
       }
       else
       {
           // Handle errors here.
       }
   });

Subscribe to the ResultGenerated event on the speech recognizer's SpeechContinuousRecognitionSession. This event notifies your app when one of your commands has been recognized.

m_speechRecognizer->ContinuousRecognitionSession->ResultGenerated +=
       ref new TypedEventHandler<SpeechContinuousRecognitionSession^, SpeechContinuousRecognitionResultGeneratedEventArgs^>(
           std::bind(&HolographicVoiceInputSampleMain::OnResultGenerated, this, _1, _2)
           );

Your OnResultGenerated event handler receives event data in a SpeechContinuousRecognitionResultGeneratedEventArgs instance. If the confidence is greater than the threshold you defined, your app should note that the event happened. Save the event data so that you can use it in a later update loop.

From HolographicVoiceInputSampleMain.cpp:

// Change the cube color, if we get a valid result.
   void HolographicVoiceInputSampleMain::OnResultGenerated(SpeechContinuousRecognitionSession ^sender, SpeechContinuousRecognitionResultGeneratedEventArgs ^args)
   {
       if (args->Result->RawConfidence > 0.5f)
       {
           m_lastCommand = args->Result->Text;
       }
   }

In our example code, we change the color of the spinning hologram cube according to the user's command.

From HolographicVoiceInputSampleMain::Update:

// Check for new speech input since the last frame.
   if (m_lastCommand != nullptr)
   {
       auto command = m_lastCommand;
       m_lastCommand = nullptr;

       int i = 0;
       for each (auto& iter in m_speechCommandList)
       {
           if (iter == command)
           {
               m_spinningCubeRenderer->SetColor(m_speechCommandData[i]);
               break;
           }

           ++i;
       }
   }

Use "one-shot" recognition

You can configure a speech recognizer to listen for phrases or sentences that the user speaks. In this case, we apply a SpeechRecognitionTopicConstraint that tells the speech recognizer what type of input to expect. Here's an app workflow for this scenario:

Your app creates the SpeechRecognizer, provides UI prompts, and starts listening for a spoken command.
The user speaks a phrase or sentence.
Recognition of the user's speech occurs, and a result is returned to the app. At this point, your app should provide a UI prompt to indicate that recognition has occurred.
Depending on the confidence level that you want to respond to and the confidence level of the speech recognition result, your app can process the result and respond as appropriate.

This section describes how to create a SpeechRecognizer, compile the constraint, and listen for speech input.

The following code compiles the topic constraint, which in this case is optimized for web search.

auto constraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::WebSearch, L"webSearch");
   m_speechRecognizer->Constraints->Clear();
   m_speechRecognizer->Constraints->Append(constraint);
   return create_task(m_speechRecognizer->CompileConstraintsAsync())
       .then([this](task<SpeechRecognitionCompilationResult^> previousTask)
   {

If compilation succeeds, we can continue with speech recognition.

try
       {
           SpeechRecognitionCompilationResult^ compilationResult = previousTask.get();

           // Check to make sure that the constraints were in a proper format and the recognizer was able to compile it.
           if (compilationResult->Status == SpeechRecognitionResultStatus::Success)
           {
               // If the compilation succeeded, we can start listening for the user's spoken phrase or sentence.
               create_task(m_speechRecognizer->RecognizeAsync()).then([this](task<SpeechRecognitionResult^>& previousTask)
               {

The result is then returned to the app. If we're confident enough in the result, we can process the command. This code example processes results with at least medium confidence.

try
                   {
                       auto result = previousTask.get();

                       if (result->Status != SpeechRecognitionResultStatus::Success)
                       {
                           PrintWstringToDebugConsole(
                               std::wstring(L"Speech recognition was not successful: ") +
                               result->Status.ToString()->Data() +
                               L"\n"
                               );
                       }

                       // In this example, we look for at least medium confidence in the speech result.
                       if ((result->Confidence == SpeechRecognitionConfidence::High) ||
                           (result->Confidence == SpeechRecognitionConfidence::Medium))
                       {
                           // If the user said a color name anywhere in their phrase, it will be recognized in the
                           // Update loop; then, the cube will change color.
                           m_lastCommand = result->Text;

                           PrintWstringToDebugConsole(
                               std::wstring(L"Speech phrase was: ") +
                               m_lastCommand->Data() +
                               L"\n"
                               );
                       }
                       else
                       {
                           PrintWstringToDebugConsole(
                               std::wstring(L"Recognition confidence not high enough: ") +
                               result->Confidence.ToString()->Data() +
                               L"\n"
                               );
                       }
                   }

Whenever you use speech recognition, watch for exceptions that could indicate that the user has turned off the microphone in the system privacy settings. This can happen during initialization or recognition.

catch (Exception^ exception)
                   {
                       // Note that if you get an "Access is denied" exception, you might need to enable the microphone
                       // privacy setting on the device and/or add the microphone capability to your app manifest.

                       PrintWstringToDebugConsole(
                           std::wstring(L"Speech recognizer error: ") +
                           exception->ToString()->Data() +
                           L"\n"
                           );
                   }
               });

               return true;
           }
           else
           {
               OutputDebugStringW(L"Could not initialize predefined grammar speech engine!\n");

               // Handle errors here.
               return false;
           }
       }
       catch (Exception^ exception)
       {
           // Note that if you get an "Access is denied" exception, you might need to enable the microphone
           // privacy setting on the device and/or add the microphone capability to your app manifest.

           PrintWstringToDebugConsole(
               std::wstring(L"Exception while trying to initialize predefined grammar speech engine:") +
               exception->Message->Data() +
               L"\n"
               );

           // Handle exceptions here.
           return false;
       }
   });

Note

There are several predefined SpeechRecognitionScenarios that you can use to optimize speech recognition.

To optimize for dictation, use the dictation scenario.

// Compile the dictation topic constraint, which optimizes for speech dictation.
auto dictationConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::Dictation, "dictation");
m_speechRecognizer->Constraints->Append(dictationConstraint);

For speech web searches, use the following web-specific scenario constraint.

// Add a web search topic constraint to the recognizer.
auto webSearchConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::WebSearch, "webSearch");
speechRecognizer->Constraints->Append(webSearchConstraint);

Use the form constraint to fill out forms. In this case, it's best to apply your own grammar that's optimized for filling out the form.

// Add a form constraint to the recognizer.
auto formConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::FormFilling, "formFilling");
speechRecognizer->Constraints->Append(formConstraint );

You can provide your own grammar in the SRGS format.

Use continuous recognition

For the continuous-dictation scenario, see the Windows 10 UWP speech code sample.

Handle quality degradation

Environmental conditions sometimes interfere with speech recognition. For example, the room might be too noisy, or the user might speak too loudly. Whenever possible, the speech recognition API provides information about the conditions that caused the quality degradation. This information is pushed to your app through a WinRT event. The following example shows how to subscribe to this event.

m_speechRecognizer->RecognitionQualityDegrading +=
       ref new TypedEventHandler<SpeechRecognizer^, SpeechRecognitionQualityDegradingEventArgs^>(
           std::bind(&HolographicVoiceInputSampleMain::OnSpeechQualityDegraded, this, _1, _2)
           );

In our code sample, we write the conditions information to the debug console. An app might want to provide feedback to the user through the UI, speech synthesis, and another method. Or it might need to behave differently when speech is interrupted by a temporary reduction in quality.

void HolographicSpeechPromptSampleMain::OnSpeechQualityDegraded(SpeechRecognizer^ recognizer, SpeechRecognitionQualityDegradingEventArgs^ args)
   {
       switch (args->Problem)
       {
       case SpeechRecognitionAudioProblem::TooFast:
           OutputDebugStringW(L"The user spoke too quickly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooSlow:
           OutputDebugStringW(L"The user spoke too slowly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooQuiet:
           OutputDebugStringW(L"The user spoke too softly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooLoud:
           OutputDebugStringW(L"The user spoke too loudly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooNoisy:
           OutputDebugStringW(L"There is too much noise in the signal.\n");
           break;

       case SpeechRecognitionAudioProblem::NoSignal:
           OutputDebugStringW(L"There is no signal.\n");
           break;

       case SpeechRecognitionAudioProblem::None:
       default:
           OutputDebugStringW(L"An error was reported with no information.\n");
           break;
       }
   }

If you're not using ref classes to create your DirectX app, you must unsubscribe from the event before you release or recreate your speech recognizer. The HolographicSpeechPromptSample has a routine to stop recognition and unsubscribe from events.

Concurrency::task<void> HolographicSpeechPromptSampleMain::StopCurrentRecognizerIfExists()
   {
       return create_task([this]()
       {
           if (m_speechRecognizer != nullptr)
           {
               return create_task(m_speechRecognizer->StopRecognitionAsync()).then([this]()
               {
                   m_speechRecognizer->RecognitionQualityDegrading -= m_speechRecognitionQualityDegradedToken;

                   if (m_speechRecognizer->ContinuousRecognitionSession != nullptr)
                   {
                       m_speechRecognizer->ContinuousRecognitionSession->ResultGenerated -= m_speechRecognizerResultEventToken;
                   }
               });
           }
           else
           {
               return create_task([this]() { m_speechRecognizer = nullptr; });
           }
       });
   }

Use speech synthesis to provide audible prompts

The holographic speech samples use speech synthesis to provide audible instructions to the user. This section shows how to create a synthesized voice sample and then play it back through the HRTF audio APIs.

We recommend providing your own speech prompts when you request phrase input. Prompts can also help indicate when speech commands can be spoken for a continuous-recognition scenario. The following example demonstrates how to use a speech synthesizer to do this. You could also use a pre-recorded voice clip, a visual UI, or another indicator of what to say, for example in scenarios where the prompt isn't dynamic.

First, create the SpeechSynthesizer object.

auto speechSynthesizer = ref new Windows::Media::SpeechSynthesis::SpeechSynthesizer();

You also need a string that includes the text to synthesize.

// Phrase recognition works best when requesting a phrase or sentence.
   StringReference voicePrompt = L"At the prompt: Say a phrase, asking me to change the cube to a specific color.";

Speech is synthesized asynchronously through SynthesizeTextToStreamAsync. Here, we start an async task to synthesize the speech.

create_task(speechSynthesizer->SynthesizeTextToStreamAsync(voicePrompt), task_continuation_context::use_current())
       .then([this, speechSynthesizer](task<Windows::Media::SpeechSynthesis::SpeechSynthesisStream^> synthesisStreamTask)
   {
       try
       {

The speech synthesis is sent as a byte stream. We can use that byte stream to initialize an XAudio2 voice. For our holographic code samples, we play it back as an HRTF audio effect.

Windows::Media::SpeechSynthesis::SpeechSynthesisStream^ stream = synthesisStreamTask.get();

           auto hr = m_speechSynthesisSound.Initialize(stream, 0);
           if (SUCCEEDED(hr))
           {
               m_speechSynthesisSound.SetEnvironment(HrtfEnvironment::Small);
               m_speechSynthesisSound.Start();

               // Amount of time to pause after the audio prompt is complete, before listening
               // for speech input.
               static const float bufferTime = 0.15f;

               // Wait until the prompt is done before listening.
               m_secondsUntilSoundIsComplete = m_speechSynthesisSound.GetDuration() + bufferTime;
               m_waitingForSpeechPrompt = true;
           }
       }

As with speech recognition, speech synthesis throws an exception if something goes wrong.

catch (Exception^ exception)
       {
           PrintWstringToDebugConsole(
               std::wstring(L"Exception while trying to synthesize speech: ") +
               exception->Message->Data() +
               L"\n"
               );

           // Handle exceptions here.
       }
   });

Voice input in DirectX

Use SpeechRecognizer for continuous speech recognition

Use "one-shot" recognition

Use continuous recognition

Handle quality degradation

Use speech synthesis to provide audible prompts

See also

Feedback

Additional resources