Voice input in DirectX

03/21/2023

Note

This article relates to the legacy WinRT native APIs. For new native app projects, we recommend using the OpenXR API.

This article explains how to implement voice commands, plus recognition of short phrases and sentences, in a DirectX app for Windows Mixed Reality.

Note

The code snippets in this article use C++/CX rather than the C++17-compliant C++/WinRT that the C++ holographic project template uses. The concepts are the same for a C++/WinRT project, but you'll need to translate the code.

Use SpeechRecognizer for continuous speech recognition

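This section describes how to use continuous speech recognition to enable voice commands in your app. The walkthrough uses code from the HolographicVoiceInput sample. While the sample is running, speak the name of one of the registered color commands to change the color of the spinning cube.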

First, create a new Windows::Media::SpeechRecognition::SpeechRecognizer instance.

From HolographicVoiceInputSampleMain::CreateSpeechConstraintsForCurrentState:

m_speechRecognizer = ref new SpeechRecognizer();

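Create a list of speech commands for the recognizer to listen for. Here, we construct a set of commands that change the color of a hologram. For convenience, we also create the data that each command will use later. The two lists are parallel: the color at a given index belongs to the command at the same index.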

m_speechCommandList = ref new Platform::Collections::Vector<String^>();
   m_speechCommandData.clear();
   m_speechCommandList->Append(StringReference(L"white"));
   m_speechCommandData.push_back(float4(1.f, 1.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"grey"));
   m_speechCommandData.push_back(float4(0.5f, 0.5f, 0.5f, 1.f));
   m_speechCommandList->Append(StringReference(L"green"));
   m_speechCommandData.push_back(float4(0.f, 1.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"black"));
   m_speechCommandData.push_back(float4(0.1f, 0.1f, 0.1f, 1.f));
   m_speechCommandList->Append(StringReference(L"red"));
   m_speechCommandData.push_back(float4(1.f, 0.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"yellow"));
   m_speechCommandData.push_back(float4(1.f, 1.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"aquamarine"));
   m_speechCommandData.push_back(float4(0.f, 1.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"blue"));
   m_speechCommandData.push_back(float4(0.f, 0.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"purple"));
   m_speechCommandData.push_back(float4(1.f, 0.f, 1.f, 1.f));

You can also specify commands using phonetic words that might not be in a dictionary.

m_speechCommandList->Append(StringReference(L"SpeechRecognizer"));
   m_speechCommandData.push_back(float4(0.5f, 0.1f, 1.f, 1.f));

To load the list of commands into the speech recognizer's list of constraints, use a SpeechRecognitionListConstraint object.

SpeechRecognitionListConstraint^ spConstraint = ref new SpeechRecognitionListConstraint(m_speechCommandList);
   m_speechRecognizer->Constraints->Clear();
   m_speechRecognizer->Constraints->Append(spConstraint);
   create_task(m_speechRecognizer->CompileConstraintsAsync()).then([this](SpeechRecognitionCompilationResult^ compilationResult)
   {
       if (compilationResult->Status == SpeechRecognitionResultStatus::Success)
       {
           m_speechRecognizer->ContinuousRecognitionSession->StartAsync();
       }
       else
       {
           // Handle errors here.
       }
   });

Subscribe to the ResultGenerated event on the speech recognizer's SpeechContinuousRecognitionSession. This event notifies your app when one of your commands has been recognized.

m_speechRecognizer->ContinuousRecognitionSession->ResultGenerated +=
       ref new TypedEventHandler<SpeechContinuousRecognitionSession^, SpeechContinuousRecognitionResultGeneratedEventArgs^>(
           std::bind(&HolographicVoiceInputSampleMain::OnResultGenerated, this, _1, _2)
           );

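The OnResultGenerated event handler receives event data in a SpeechContinuousRecognitionResultGeneratedEventArgs instance. If the confidence is greater than the threshold you defined, your app should record that the event happened. Save the event data so that you can use it in a later update loop.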

From HolographicVoiceInputSampleMain.cpp:

// Change the cube color, if we get a valid result.
   void HolographicVoiceInputSampleMain::OnResultGenerated(SpeechContinuousRecognitionSession ^sender, SpeechContinuousRecognitionResultGeneratedEventArgs ^args)
   {
       if (args->Result->RawConfidence > 0.5f)
       {
           m_lastCommand = args->Result->Text;
       }
   }
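The handler above writes m_lastCommand and the update loop reads it. The article does not say which thread raises ResultGenerated, so the following is only a sketch of one way to hand the command over safely, assuming the handler may run on a different thread than the update loop. CommandMailbox and its members are hypothetical names that are not part of the sample, and the sketch uses std::wstring in place of Platform::String so that it compiles as standard C++.

```cpp
#include <mutex>
#include <string>
#include <utility>

// Hypothetical helper (not part of the sample): holds the most recent
// command so that one thread can publish it and another can consume it.
class CommandMailbox
{
public:
    // Called from the recognition callback. Results at or below the
    // threshold are dropped, mirroring the RawConfidence > 0.5f check.
    void Post(std::wstring text, double rawConfidence)
    {
        if (rawConfidence <= m_threshold)
        {
            return;
        }
        std::lock_guard<std::mutex> lock(m_mutex);
        m_lastCommand = std::move(text);
        m_hasCommand = true;
    }

    // Called once per frame from the update loop. Returns true and writes
    // the pending command to 'out' if there is one, then clears it so the
    // same command is handled only once.
    bool Take(std::wstring& out)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_hasCommand)
        {
            return false;
        }
        out = std::move(m_lastCommand);
        m_lastCommand.clear();
        m_hasCommand = false;
        return true;
    }

private:
    std::mutex m_mutex;
    std::wstring m_lastCommand;
    bool m_hasCommand = false;
    double m_threshold = 0.5; // Same threshold as the sample code above.
};
```

If a second command arrives before the update loop runs, this sketch keeps only the newest one, which matches the behavior of overwriting m_lastCommand.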

In this code example, we change the color of the spinning hologram cube according to the user's command.

From HolographicVoiceInputSampleMain::Update:

// Check for new speech input since the last frame.
   if (m_lastCommand != nullptr)
   {
       auto command = m_lastCommand;
       m_lastCommand = nullptr;

       int i = 0;
       for each (auto& iter in m_speechCommandList)
       {
           if (iter == command)
           {
               m_spinningCubeRenderer->SetColor(m_speechCommandData[i]);
               break;
           }

           ++i;
       }
   }
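The loop above walks the command list and uses the matching index to pick a color from a second, parallel list. As an alternative sketch (not what the sample does), the command and its color can be stored together in one table, which removes the manual index. Color, CommandColors, and TryGetColor are hypothetical names, and std::array stands in for the sample's float4 type so that the code compiles as standard C++.

```cpp
#include <array>
#include <string>
#include <unordered_map>

// Stand-in for the float4 color type used by the sample.
using Color = std::array<float, 4>;

// Hypothetical table that pairs each command with its color, so the two
// cannot drift out of step the way two parallel lists can.
inline const std::unordered_map<std::wstring, Color>& CommandColors()
{
    static const std::unordered_map<std::wstring, Color> table = {
        { L"white",  Color{{1.f, 1.f, 1.f, 1.f}} },
        { L"grey",   Color{{0.5f, 0.5f, 0.5f, 1.f}} },
        { L"green",  Color{{0.f, 1.f, 0.f, 1.f}} },
        { L"red",    Color{{1.f, 0.f, 0.f, 1.f}} },
        { L"yellow", Color{{1.f, 1.f, 0.f, 1.f}} },
        { L"blue",   Color{{0.f, 0.f, 1.f, 1.f}} },
        { L"purple", Color{{1.f, 0.f, 1.f, 1.f}} },
    };
    return table;
}

// Returns true and writes the color if the command is registered;
// otherwise returns false and leaves 'color' untouched.
bool TryGetColor(const std::wstring& command, Color& color)
{
    const auto& table = CommandColors();
    const auto it = table.find(command);
    if (it == table.end())
    {
        return false;
    }
    color = it->second;
    return true;
}
```

The recognizer still needs the command strings as a list constraint, so an app that uses a table like this would build that list from the table's keys.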

Use "one-shot" recognition

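You can configure a speech recognizer to listen for phrases or sentences that the user speaks. In this case, apply a SpeechRecognitionTopicConstraint, which tells the speech recognizer what type of input to expect. The app workflow for this scenario is as follows: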

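1. Your app creates the SpeechRecognizer, provides UI prompts, and starts listening for a spoken command.
2. The user speaks a phrase or sentence.
3. The user's speech is recognized, and a result is returned to the app. At this point, your app should provide a UI prompt to indicate that recognition has occurred.
4. Depending on the confidence levels that you want to respond to and the confidence level of the speech recognition result, your app can process the result and respond as appropriate.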

This section describes how to create a SpeechRecognizer, compile the constraint, and listen for speech input.

The following code compiles the topic constraint, which in this case is optimized for web search.

auto constraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::WebSearch, L"webSearch");
   m_speechRecognizer->Constraints->Clear();
   m_speechRecognizer->Constraints->Append(constraint);
   return create_task(m_speechRecognizer->CompileConstraintsAsync())
       .then([this](task<SpeechRecognitionCompilationResult^> previousTask)
   {

If compilation succeeds, we can continue with speech recognition.

try
       {
           SpeechRecognitionCompilationResult^ compilationResult = previousTask.get();

           // Check to make sure that the constraints were in a proper format and the recognizer was able to compile it.
           if (compilationResult->Status == SpeechRecognitionResultStatus::Success)
           {
               // If the compilation succeeded, we can start listening for the user's spoken phrase or sentence.
               create_task(m_speechRecognizer->RecognizeAsync()).then([this](task<SpeechRecognitionResult^>& previousTask)
               {

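Then, the result is returned to the app. If we're confident enough in the result, we can process the command. This code example processes results with at least medium confidence.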

try
                   {
                       auto result = previousTask.get();

                       if (result->Status != SpeechRecognitionResultStatus::Success)
                       {
                           PrintWstringToDebugConsole(
                               std::wstring(L"Speech recognition was not successful: ") +
                               result->Status.ToString()->Data() +
                               L"\n"
                               );
                       }

                       // In this example, we look for at least medium confidence in the speech result.
                       if ((result->Confidence == SpeechRecognitionConfidence::High) ||
                           (result->Confidence == SpeechRecognitionConfidence::Medium))
                       {
                           // If the user said a color name anywhere in their phrase, it will be recognized in the
                           // Update loop; then, the cube will change color.
                           m_lastCommand = result->Text;

                           PrintWstringToDebugConsole(
                               std::wstring(L"Speech phrase was: ") +
                               m_lastCommand->Data() +
                               L"\n"
                               );
                       }
                       else
                       {
                           PrintWstringToDebugConsole(
                               std::wstring(L"Recognition confidence not high enough: ") +
                               result->Confidence.ToString()->Data() +
                               L"\n"
                               );
                       }
                   }

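Whenever you use speech recognition, watch for exceptions that could indicate that the user has turned off the microphone in the system privacy settings. This can happen during initialization or during recognition.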

catch (Exception^ exception)
                   {
                       // Note that if you get an "Access is denied" exception, you might need to enable the microphone
                       // privacy setting on the device and/or add the microphone capability to your app manifest.

                       PrintWstringToDebugConsole(
                           std::wstring(L"Speech recognizer error: ") +
                           exception->ToString()->Data() +
                           L"\n"
                           );
                   }
               });

               return true;
           }
           else
           {
               OutputDebugStringW(L"Could not initialize predefined grammar speech engine!\n");

               // Handle errors here.
               return false;
           }
       }
       catch (Exception^ exception)
       {
           // Note that if you get an "Access is denied" exception, you might need to enable the microphone
           // privacy setting on the device and/or add the microphone capability to your app manifest.

           PrintWstringToDebugConsole(
               std::wstring(L"Exception while trying to initialize predefined grammar speech engine: ") +
               exception->Message->Data() +
               L"\n"
               );

           // Handle exceptions here.
           return false;
       }
   });
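A comment in the recognition code above says that a color name spoken anywhere in the phrase will be recognized in the update loop. The update loop shown earlier in this article compares whole strings, so matching inside a longer phrase needs a substring search. The following is a minimal sketch of such a search in standard C++, not the sample's actual implementation. ToLower and FindCommandInPhrase are hypothetical names, and the case-insensitive comparison is an assumption about the desired behavior.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>
#include <vector>

// Returns a lowercase copy of the text so that matching ignores case.
std::wstring ToLower(std::wstring text)
{
    std::transform(text.begin(), text.end(), text.begin(),
        [](wchar_t ch) { return static_cast<wchar_t>(std::towlower(ch)); });
    return text;
}

// Returns the index of the first registered command (in list order) that
// appears anywhere in the phrase, or -1 if none of them appears.
int FindCommandInPhrase(const std::wstring& phrase,
                        const std::vector<std::wstring>& commands)
{
    const std::wstring haystack = ToLower(phrase);
    for (std::size_t i = 0; i < commands.size(); ++i)
    {
        if (haystack.find(ToLower(commands[i])) != std::wstring::npos)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

A plain substring search also matches inside other words (for example, "red" inside "bored"), so an app that needs stricter behavior would split the phrase into words first.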

Note

There are several predefined SpeechRecognitionScenario values that you can use to optimize speech recognition.

To optimize for dictation, use the dictation scenario.

// Add the dictation topic constraint, which optimizes for speech dictation.
auto dictationConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::Dictation, L"dictation");
m_speechRecognizer->Constraints->Append(dictationConstraint);

For speech web searches, use the following web-specific scenario constraint.

// Add a web search topic constraint to the recognizer.
auto webSearchConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::WebSearch, L"webSearch");
m_speechRecognizer->Constraints->Append(webSearchConstraint);

To fill out forms, use the form constraint. In this case, it's best to apply your own grammar that's optimized for filling out the form.

// Add a form constraint to the recognizer.
auto formConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::FormFilling, L"formFilling");
m_speechRecognizer->Constraints->Append(formConstraint);

You can provide your own grammar in the SRGS format.

Use continuous recognition

For the continuous-dictation scenario, see the Windows 10 UWP speech code sample.

Handle degradation in quality

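Environmental conditions sometimes interfere with speech recognition. For example, the room might be too noisy, or the user might speak too loudly. Whenever possible, the speech recognition API provides information about the conditions that caused the degradation in quality. This information is pushed to your app through a WinRT event. The following example shows how to subscribe to this event.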

m_speechRecognizer->RecognitionQualityDegrading +=
       ref new TypedEventHandler<SpeechRecognizer^, SpeechRecognitionQualityDegradingEventArgs^>(
           std::bind(&HolographicVoiceInputSampleMain::OnSpeechQualityDegraded, this, _1, _2)
           );

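In this code sample, the condition information is written to the debug console. An app might instead provide feedback to the user through the UI, speech synthesis, or another method. It might also need to behave differently when speech is interrupted by a temporary reduction in quality.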

void HolographicSpeechPromptSampleMain::OnSpeechQualityDegraded(SpeechRecognizer^ recognizer, SpeechRecognitionQualityDegradingEventArgs^ args)
   {
       switch (args->Problem)
       {
       case SpeechRecognitionAudioProblem::TooFast:
           OutputDebugStringW(L"The user spoke too quickly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooSlow:
           OutputDebugStringW(L"The user spoke too slowly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooQuiet:
           OutputDebugStringW(L"The user spoke too softly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooLoud:
           OutputDebugStringW(L"The user spoke too loudly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooNoisy:
           OutputDebugStringW(L"There is too much noise in the signal.\n");
           break;

       case SpeechRecognitionAudioProblem::NoSignal:
           OutputDebugStringW(L"There is no signal.\n");
           break;

       case SpeechRecognitionAudioProblem::None:
       default:
           OutputDebugStringW(L"An error was reported with no information.\n");
           break;
       }
   }
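The handler above only logs the problem. As a sketch of the user-feedback option, the same switch can map each problem to a short hint that the app displays or speaks. AudioProblem is a local stand-in for the problem values handled above so that the code compiles without WinRT, and HintForProblem and its message strings are hypothetical.

```cpp
#include <string>

// Local stand-in for the problem values handled in the handler above.
enum class AudioProblem
{
    None, TooNoisy, NoSignal, TooLoud, TooQuiet, TooFast, TooSlow
};

// Hypothetical helper: returns a short hint that the app could show or
// speak, or an empty string when there is nothing useful to tell the user.
std::wstring HintForProblem(AudioProblem problem)
{
    switch (problem)
    {
    case AudioProblem::TooFast:
        return L"Try speaking a little more slowly.";
    case AudioProblem::TooSlow:
        return L"Try speaking a little faster.";
    case AudioProblem::TooQuiet:
        return L"Try speaking a little louder.";
    case AudioProblem::TooLoud:
        return L"Try speaking a little more softly.";
    case AudioProblem::TooNoisy:
        return L"It is noisy here. Try moving somewhere quieter.";
    case AudioProblem::NoSignal:
        return L"No audio detected. Check the microphone.";
    case AudioProblem::None:
    default:
        return L"";
    }
}
```

Returning an empty string for the no-information case lets the caller skip the prompt entirely instead of showing a generic error.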

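If you're not using ref classes to create your DirectX app, you must unsubscribe from the event before you release or recreate your speech recognizer. The HolographicSpeechPromptSample has a routine that stops recognition and unsubscribes from the events.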

Concurrency::task<void> HolographicSpeechPromptSampleMain::StopCurrentRecognizerIfExists()
   {
       return create_task([this]()
       {
           if (m_speechRecognizer != nullptr)
           {
               return create_task(m_speechRecognizer->StopRecognitionAsync()).then([this]()
               {
                   m_speechRecognizer->RecognitionQualityDegrading -= m_speechRecognitionQualityDegradedToken;

                   if (m_speechRecognizer->ContinuousRecognitionSession != nullptr)
                   {
                       m_speechRecognizer->ContinuousRecognitionSession->ResultGenerated -= m_speechRecognizerResultEventToken;
                   }
               });
           }
           else
           {
               return create_task([this]() { m_speechRecognizer = nullptr; });
           }
       });
   }
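The routine above removes each handler by passing back a token that was stored when the handler was added. The following sketch models that token pattern in standard C++ to show why unsubscribing matters: once a handler is removed, raising the event no longer reaches it. The Event class is a hypothetical stand-in and is not how WinRT events are implemented.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for an event source that hands out tokens.
class Event
{
public:
    using Token = std::int64_t;

    // Registers a handler and returns the token needed to remove it later.
    Token Add(std::function<void(int)> handler)
    {
        const Token token = m_nextToken++;
        m_handlers[token] = std::move(handler);
        return token;
    }

    // Removes the handler registered under the token, if it still exists.
    void Remove(Token token)
    {
        m_handlers.erase(token);
    }

    // Calls every handler that is currently registered.
    void Raise(int value) const
    {
        for (const auto& entry : m_handlers)
        {
            entry.second(value);
        }
    }

private:
    Token m_nextToken = 1;
    std::map<Token, std::function<void(int)>> m_handlers;
};
```

A handler that captures a raw pointer to its owner must be removed before the owner is destroyed; otherwise a later Raise would call into freed memory.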

Use speech synthesis to provide audible prompts

The holographic speech samples use speech synthesis to give the user audible instructions. This section shows how to create a synthesized voice sample and play it back through the HRTF audio APIs.

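We recommend providing your own speech prompts when you request phrase input. Prompts can also help indicate when speech commands can be spoken in a continuous-recognition scenario. The following example demonstrates how to use a speech synthesizer to do this. You could also use a prerecorded voice clip, a visual UI, or another indicator of what to say, for example in scenarios where the prompt isn't dynamic.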

First, create the SpeechSynthesizer object.

auto speechSynthesizer = ref new Windows::Media::SpeechSynthesis::SpeechSynthesizer();

You also need a string that contains the text to synthesize.

// Phrase recognition works best when requesting a phrase or sentence.
   StringReference voicePrompt = L"At the prompt: Say a phrase, asking me to change the cube to a specific color.";

Speech is synthesized asynchronously through SynthesizeTextToStreamAsync. Here, we start an async task to synthesize the speech.

create_task(speechSynthesizer->SynthesizeTextToStreamAsync(voicePrompt), task_continuation_context::use_current())
       .then([this, speechSynthesizer](task<Windows::Media::SpeechSynthesis::SpeechSynthesisStream^> synthesisStreamTask)
   {
       try
       {

The synthesized speech is delivered as a byte stream. We can use that byte stream to initialize an XAudio2 voice. In our holographic code samples, we play it back as an HRTF audio effect.

Windows::Media::SpeechSynthesis::SpeechSynthesisStream^ stream = synthesisStreamTask.get();

           auto hr = m_speechSynthesisSound.Initialize(stream, 0);
           if (SUCCEEDED(hr))
           {
               m_speechSynthesisSound.SetEnvironment(HrtfEnvironment::Small);
               m_speechSynthesisSound.Start();

               // Amount of time to pause after the audio prompt is complete, before listening
               // for speech input.
               static const float bufferTime = 0.15f;

               // Wait until the prompt is done before listening.
               m_secondsUntilSoundIsComplete = m_speechSynthesisSound.GetDuration() + bufferTime;
               m_waitingForSpeechPrompt = true;
           }
       }

As with speech recognition, speech synthesis throws an exception if something goes wrong.

catch (Exception^ exception)
       {
           PrintWstringToDebugConsole(
               std::wstring(L"Exception while trying to synthesize speech: ") +
               exception->Message->Data() +
               L"\n"
               );

           // Handle exceptions here.
       }
   });
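The code above records how long the prompt will take (m_secondsUntilSoundIsComplete) and sets m_waitingForSpeechPrompt, but this article does not show the part of the update loop that consumes those values. The following is a minimal sketch of that countdown, assuming the update loop receives the elapsed time per frame in seconds. PromptTimer and its members are hypothetical names modeled on the sample's member variables.

```cpp
// Hypothetical helper modeled on m_secondsUntilSoundIsComplete and
// m_waitingForSpeechPrompt from the code above.
struct PromptTimer
{
    float secondsUntilSoundIsComplete = 0.f;
    bool waitingForSpeechPrompt = false;

    // Starts the wait: the prompt duration plus a short pause before
    // the app begins listening for speech input.
    void Start(float promptDurationSeconds, float bufferSeconds)
    {
        secondsUntilSoundIsComplete = promptDurationSeconds + bufferSeconds;
        waitingForSpeechPrompt = true;
    }

    // Call once per frame with the elapsed time in seconds. Returns true
    // on the single frame where the wait ends, which is the point where
    // the app would start listening.
    bool Update(float deltaSeconds)
    {
        if (!waitingForSpeechPrompt)
        {
            return false;
        }
        secondsUntilSoundIsComplete -= deltaSeconds;
        if (secondsUntilSoundIsComplete > 0.f)
        {
            return false;
        }
        waitingForSpeechPrompt = false;
        return true;
    }
};
```

Waiting for the prompt to finish keeps the recognizer from hearing the app's own synthesized speech as user input.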
