Voice input in DirectX

Article
03/21/2023

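Note

This article relates to the legacy WinRT native APIs. For new native app projects, we recommend using the OpenXR API.

This article explains how to implement voice commands, plus small-phrase and sentence recognition, in a DirectX app for Windows Mixed Reality.

Note

The code snippets in this article use C++/CX rather than the C++17-compliant C++/WinRT that the C++ holographic project template uses. The concepts are equivalent for a C++/WinRT project, but you need to translate the code.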

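Use SpeechRecognizer for continuous speech recognition

This section describes how to use continuous speech recognition to enable voice commands in your app. The walkthrough uses code from the HolographicVoiceInput sample. While the sample is running, say the name of one of the registered color commands to change the color of the spinning cube.

First, create a new Windows::Media::SpeechRecognition::SpeechRecognizer instance.

From HolographicVoiceInputSampleMain::CreateSpeechConstraintsForCurrentState: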

m_speechRecognizer = ref new SpeechRecognizer();

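Create a list of speech commands for the recognizer to listen for. Here, we build a set of commands that change the color of a hologram. For convenience, we also create the data that each command will use later; the entry at a given index in the data list belongs to the command at the same index in the command list.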

m_speechCommandList = ref new Platform::Collections::Vector<String^>();
   m_speechCommandData.clear();
   m_speechCommandList->Append(StringReference(L"white"));
   m_speechCommandData.push_back(float4(1.f, 1.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"grey"));
   m_speechCommandData.push_back(float4(0.5f, 0.5f, 0.5f, 1.f));
   m_speechCommandList->Append(StringReference(L"green"));
   m_speechCommandData.push_back(float4(0.f, 1.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"black"));
   m_speechCommandData.push_back(float4(0.1f, 0.1f, 0.1f, 1.f));
   m_speechCommandList->Append(StringReference(L"red"));
   m_speechCommandData.push_back(float4(1.f, 0.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"yellow"));
   m_speechCommandData.push_back(float4(1.f, 1.f, 0.f, 1.f));
   m_speechCommandList->Append(StringReference(L"aquamarine"));
   m_speechCommandData.push_back(float4(0.f, 1.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"blue"));
   m_speechCommandData.push_back(float4(0.f, 0.f, 1.f, 1.f));
   m_speechCommandList->Append(StringReference(L"purple"));
   m_speechCommandData.push_back(float4(1.f, 0.f, 1.f, 1.f));

You can also specify commands by using phonetic words that might not be in a dictionary.

m_speechCommandList->Append(StringReference(L"SpeechRecognizer"));
   m_speechCommandData.push_back(float4(0.5f, 0.1f, 1.f, 1.f));

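To load the list of commands into the speech recognizer's list of constraints, use a SpeechRecognitionListConstraint object. After the constraints compile successfully, start the continuous recognition session.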

SpeechRecognitionListConstraint^ spConstraint = ref new SpeechRecognitionListConstraint(m_speechCommandList);
   m_speechRecognizer->Constraints->Clear();
   m_speechRecognizer->Constraints->Append(spConstraint);
   create_task(m_speechRecognizer->CompileConstraintsAsync()).then([this](SpeechRecognitionCompilationResult^ compilationResult)
   {
       if (compilationResult->Status == SpeechRecognitionResultStatus::Success)
       {
           m_speechRecognizer->ContinuousRecognitionSession->StartAsync();
       }
       else
       {
           // Handle errors here.
       }
   });

Subscribe to the ResultGenerated event on the speech recognizer's SpeechContinuousRecognitionSession. This event notifies your app when one of your commands has been recognized.

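// _1 and _2 are the placeholders from the std::placeholders namespace.
   // Keep the returned token so that the handler can be unsubscribed later.
   m_speechRecognizerResultEventToken = m_speechRecognizer->ContinuousRecognitionSession->ResultGenerated +=
       ref new TypedEventHandler<SpeechContinuousRecognitionSession^, SpeechContinuousRecognitionResultGeneratedEventArgs^>(
           std::bind(&HolographicVoiceInputSampleMain::OnResultGenerated, this, _1, _2)
           );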

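Your OnResultGenerated event handler receives the event data in a SpeechContinuousRecognitionResultGeneratedEventArgs instance. If the confidence is greater than the threshold you defined (0.5 in this sample), your app should note that the event happened. Save the event data so that you can use it in a later update loop.

From HolographicVoiceInputSampleMain.cpp: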

// Change the cube color, if we get a valid result.
   void HolographicVoiceInputSampleMain::OnResultGenerated(SpeechContinuousRecognitionSession ^sender, SpeechContinuousRecognitionResultGeneratedEventArgs ^args)
   {
       if (args->Result->RawConfidence > 0.5f)
       {
           m_lastCommand = args->Result->Text;
       }
   }
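The handler above writes m_lastCommand and the update loop reads it later. If those two run on different threads in your app (an assumption worth checking), the hand-off may need a lock. The following is a minimal sketch in standard C++ rather than C++/CX, so it builds without the Windows SDK; CommandMailbox, Post, TryTake, and kMinConfidence are hypothetical names that are not part of the sample.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical threshold, mirroring the RawConfidence > 0.5f check in the sample.
constexpr double kMinConfidence = 0.5;

// Hypothetical helper (not part of the sample): a one-slot mailbox that the
// recognition callback writes to and the update loop drains.
class CommandMailbox
{
public:
    // Called from the recognition callback. Results at or below the
    // threshold are dropped. Assumes a score normalized to [0, 1].
    bool Post(std::wstring text, double rawConfidence)
    {
        assert(rawConfidence >= 0.0 && rawConfidence <= 1.0);
        if (rawConfidence <= kMinConfidence)
        {
            return false;
        }
        std::lock_guard<std::mutex> lock(m_mutex);
        m_text = std::move(text);
        m_hasCommand = true;
        return true;
    }

    // Called once per frame from the update loop. Copies out the most recent
    // command, if any, and clears the slot so it is handled only once.
    bool TryTake(std::wstring& out)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_hasCommand)
        {
            return false;
        }
        out = std::move(m_text);
        m_text.clear();
        m_hasCommand = false;
        return true;
    }

private:
    std::mutex m_mutex;
    std::wstring m_text;
    bool m_hasCommand = false;
};
```

In this sketch, the event handler would call Post with the recognized text and its raw confidence, and the update loop would call TryTake once per frame. A newer command overwrites an older one that has not been consumed yet, which matches the single m_lastCommand slot in the sample.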

In our example code, we change the color of the spinning hologram cube according to the user's command.

From HolographicVoiceInputSampleMain::Update:

// Check for new speech input since the last frame.
   if (m_lastCommand != nullptr)
   {
       auto command = m_lastCommand;
       m_lastCommand = nullptr;

       int i = 0;
       for each (auto& iter in m_speechCommandList)
       {
           if (iter == command)
           {
               m_spinningCubeRenderer->SetColor(m_speechCommandData[i]);
               break;
           }

           ++i;
       }
   }
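The sample keeps command names and command data in two parallel containers and matches them by index. As an alternative, the sketch below stores each phrase next to its color so the two cannot drift out of step. It uses standard C++ rather than C++/CX so that it builds without the Windows SDK; Color, SpeechCommand, MakeColorCommands, and FindCommand are hypothetical names, and only a few of the sample's colors are listed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the float4 color used by the sample.
struct Color
{
    float r, g, b, a;
};

// One voice command: the phrase to listen for and the data it carries.
struct SpeechCommand
{
    std::wstring phrase;
    Color color;
};

// Builds a small command table. The values are taken from the sample.
std::vector<SpeechCommand> MakeColorCommands()
{
    return {
        { L"white", { 1.f, 1.f, 1.f, 1.f } },
        { L"grey",  { 0.5f, 0.5f, 0.5f, 1.f } },
        { L"green", { 0.f, 1.f, 0.f, 1.f } },
        { L"red",   { 1.f, 0.f, 0.f, 1.f } },
        { L"blue",  { 0.f, 0.f, 1.f, 1.f } },
    };
}

// Returns the command whose phrase equals the recognized text exactly,
// or nullptr if there is no such command.
const SpeechCommand* FindCommand(const std::vector<SpeechCommand>& commands,
                                 const std::wstring& recognizedText)
{
    for (const SpeechCommand& command : commands)
    {
        if (command.phrase == recognizedText)
        {
            return &command;
        }
    }
    return nullptr;
}
```

The phrases would still have to be copied into the list that is passed to SpeechRecognitionListConstraint. The returned pointer stays valid only while the table is alive and unchanged.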

Use "one-shot" recognition

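You can configure a speech recognizer to listen for phrases or sentences that the user speaks. In this case, apply a SpeechRecognitionTopicConstraint, which tells the speech recognizer what type of input to expect. The app workflow for this scenario is as follows:

1. Your app creates the SpeechRecognizer, provides UI prompts, and starts listening for a spoken command.
2. The user speaks a phrase or sentence.
3. The user's speech is recognized, and a result is returned to the app. At this point, your app should provide a UI prompt to indicate that recognition has occurred.
4. Depending on the confidence level that you want to respond to and the confidence level of the speech recognition result, your app can process the result and respond as appropriate.

This section describes how to create a SpeechRecognizer, compile the constraint, and listen for speech input.

The following code compiles the topic constraint, which in this case is optimized for web search.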

auto constraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::WebSearch, L"webSearch");
   m_speechRecognizer->Constraints->Clear();
   m_speechRecognizer->Constraints->Append(constraint);
   return create_task(m_speechRecognizer->CompileConstraintsAsync())
       .then([this](task<SpeechRecognitionCompilationResult^> previousTask)
   {

If compilation succeeds, we can continue with speech recognition.

try
       {
           SpeechRecognitionCompilationResult^ compilationResult = previousTask.get();

           // Check to make sure that the constraints were in a proper format and the recognizer was able to compile it.
           if (compilationResult->Status == SpeechRecognitionResultStatus::Success)
           {
               // If the compilation succeeded, we can start listening for the user's spoken phrase or sentence.
               create_task(m_speechRecognizer->RecognizeAsync()).then([this](task<SpeechRecognitionResult^>& previousTask)
               {

The result is then returned to the app. If we're confident enough in the result, we can process the command. This code example processes results that have at least medium confidence.

try
                   {
                       auto result = previousTask.get();

                       if (result->Status != SpeechRecognitionResultStatus::Success)
                       {
                           PrintWstringToDebugConsole(
                               std::wstring(L"Speech recognition was not successful: ") +
                               result->Status.ToString()->Data() +
                               L"\n"
                               );
                       }

                       // In this example, we look for at least medium confidence in the speech result.
                       if ((result->Confidence == SpeechRecognitionConfidence::High) ||
                           (result->Confidence == SpeechRecognitionConfidence::Medium))
                       {
                           // If the user said a color name anywhere in their phrase, it will be recognized in the
                           // Update loop; then, the cube will change color.
                           m_lastCommand = result->Text;

                           PrintWstringToDebugConsole(
                               std::wstring(L"Speech phrase was: ") +
                               m_lastCommand->Data() +
                               L"\n"
                               );
                       }
                       else
                       {
                           PrintWstringToDebugConsole(
                               std::wstring(L"Recognition confidence not high enough: ") +
                               result->Confidence.ToString()->Data() +
                               L"\n"
                               );
                       }
                   }

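Whenever you use speech recognition, watch for exceptions that could indicate that the user has turned off the microphone in the system privacy settings. These exceptions can occur during initialization or during recognition.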

catch (Exception^ exception)
                   {
                       // Note that if you get an "Access is denied" exception, you might need to enable the microphone
                       // privacy setting on the device and/or add the microphone capability to your app manifest.

                       PrintWstringToDebugConsole(
                           std::wstring(L"Speech recognizer error: ") +
                           exception->ToString()->Data() +
                           L"\n"
                           );
                   }
               });

               return true;
           }
           else
           {
               OutputDebugStringW(L"Could not initialize predefined grammar speech engine!\n");

               // Handle errors here.
               return false;
           }
       }
       catch (Exception^ exception)
       {
           // Note that if you get an "Access is denied" exception, you might need to enable the microphone
           // privacy setting on the device and/or add the microphone capability to your app manifest.

           PrintWstringToDebugConsole(
               std::wstring(L"Exception while trying to initialize predefined grammar speech engine:") +
               exception->Message->Data() +
               L"\n"
               );

           // Handle exceptions here.
           return false;
       }
   });
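The code comment above says that a color name spoken anywhere in the phrase will be picked up later. An exact string comparison would not do that for a full sentence, so the following sketch shows one way to gate on confidence and then search the phrase for a command word. It is standard C++ rather than C++/CX so that it builds without the Windows SDK; the Confidence enum is a hypothetical stand-in for SpeechRecognitionConfidence, and IsAtLeastMedium, ToLower, and FindCommandInPhrase are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical stand-in for SpeechRecognitionConfidence.
enum class Confidence { High, Medium, Low, Rejected };

// Mirrors the "at least medium confidence" check in the sample.
bool IsAtLeastMedium(Confidence confidence)
{
    return confidence == Confidence::High || confidence == Confidence::Medium;
}

// Lowercases a copy of the text so that matching ignores capitalization.
std::wstring ToLower(std::wstring text)
{
    for (wchar_t& ch : text)
    {
        ch = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(ch)));
    }
    return text;
}

// Returns the index of the first entry in `commands` (in list order) that
// occurs anywhere in `phrase`, or -1 if none does.
int FindCommandInPhrase(const std::vector<std::wstring>& commands,
                        const std::wstring& phrase)
{
    const std::wstring lowered = ToLower(phrase);
    for (std::size_t i = 0; i < commands.size(); ++i)
    {
        // An empty command would match every phrase, so it is not allowed.
        assert(!commands[i].empty());
        if (lowered.find(ToLower(commands[i])) != std::wstring::npos)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

This is plain substring matching, so a command such as "red" would also match inside a longer word like "bored". Splitting the phrase into words first would avoid that, at the cost of a little more code.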

Note

There are several predefined SpeechRecognitionScenarios that you can use to optimize speech recognition.

To optimize for dictation, use the dictation scenario.

// Compile the dictation topic constraint, which optimizes for speech dictation.
auto dictationConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::Dictation, L"dictation");
m_speechRecognizer->Constraints->Append(dictationConstraint);

For speech web searches, use the following web-specific scenario constraint.

// Add a web search topic constraint to the recognizer.
auto webSearchConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::WebSearch, L"webSearch");
m_speechRecognizer->Constraints->Append(webSearchConstraint);

Use the form constraint to fill out forms. In this case, it's best to apply your own grammar that's optimized for filling out the form.

// Add a form constraint to the recognizer.
auto formConstraint = ref new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario::FormFilling, L"formFilling");
m_speechRecognizer->Constraints->Append(formConstraint);

You can provide your own grammar in the SRGS format.

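Use continuous recognition

For the continuous-dictation scenario, see the Windows 10 UWP speech code sample.

Handle quality degradation

Environmental conditions sometimes interfere with speech recognition. For example, the room might be too noisy, or the user might speak too loudly. Whenever possible, the speech recognition API provides information about the conditions that caused the quality degradation. This information is pushed to your app through a WinRT event. The following example shows how to subscribe to this event.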

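// Keep the returned token so that the handler can be unsubscribed later.
   m_speechRecognitionQualityDegradedToken = m_speechRecognizer->RecognitionQualityDegrading +=
       ref new TypedEventHandler<SpeechRecognizer^, SpeechRecognitionQualityDegradingEventArgs^>(
           std::bind(&HolographicVoiceInputSampleMain::OnSpeechQualityDegraded, this, _1, _2)
           );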

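In our code sample, we write the condition information to the debug console. An app might instead give feedback to the user through the UI, speech synthesis, or another method. It might also need to behave differently when speech is interrupted by a temporary drop in quality.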

void HolographicSpeechPromptSampleMain::OnSpeechQualityDegraded(SpeechRecognizer^ recognizer, SpeechRecognitionQualityDegradingEventArgs^ args)
   {
       switch (args->Problem)
       {
       case SpeechRecognitionAudioProblem::TooFast:
           OutputDebugStringW(L"The user spoke too quickly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooSlow:
           OutputDebugStringW(L"The user spoke too slowly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooQuiet:
           OutputDebugStringW(L"The user spoke too softly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooLoud:
           OutputDebugStringW(L"The user spoke too loudly.\n");
           break;

       case SpeechRecognitionAudioProblem::TooNoisy:
           OutputDebugStringW(L"There is too much noise in the signal.\n");
           break;

       case SpeechRecognitionAudioProblem::NoSignal:
           OutputDebugStringW(L"There is no signal.\n");
           break;

       case SpeechRecognitionAudioProblem::None:
       default:
           OutputDebugStringW(L"An error was reported with no information.\n");
           break;
       }
   }
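The handler above only logs the problem. To give feedback to the user instead, an app could translate each condition into a short hint for the UI or for speech synthesis. The following is a minimal sketch in standard C++ rather than C++/CX, so it builds without the Windows SDK; AudioProblem is a hypothetical stand-in for SpeechRecognitionAudioProblem, UserHintFor is a hypothetical function, and the hint texts are only examples.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for SpeechRecognitionAudioProblem.
enum class AudioProblem { None, TooNoisy, NoSignal, TooLoud, TooQuiet, TooFast, TooSlow };

// Returns a short, user-facing hint for the given problem, or an empty
// string when there is nothing useful to tell the user.
std::wstring UserHintFor(AudioProblem problem)
{
    switch (problem)
    {
    case AudioProblem::TooFast:
        return L"Try speaking a little more slowly.";
    case AudioProblem::TooSlow:
        return L"Try speaking a little faster.";
    case AudioProblem::TooQuiet:
        return L"Try speaking a little louder.";
    case AudioProblem::TooLoud:
        return L"Try speaking a little more softly.";
    case AudioProblem::TooNoisy:
        return L"It's noisy here. Try moving somewhere quieter.";
    case AudioProblem::NoSignal:
        return L"I can't hear anything. Check the microphone.";
    case AudioProblem::None:
    default:
        return std::wstring();
    }
}
```

A handler could pass the result to whatever prompt mechanism the app already uses, and skip the prompt when the returned string is empty.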

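If you're not using ref classes to create your DirectX app, you must unsubscribe from the event before you release or recreate your speech recognizer. The HolographicSpeechPromptSample has a routine that stops recognition and unsubscribes from the events.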

Concurrency::task<void> HolographicSpeechPromptSampleMain::StopCurrentRecognizerIfExists()
   {
       return create_task([this]()
       {
           if (m_speechRecognizer != nullptr)
           {
               return create_task(m_speechRecognizer->StopRecognitionAsync()).then([this]()
               {
                   m_speechRecognizer->RecognitionQualityDegrading -= m_speechRecognitionQualityDegradedToken;

                   if (m_speechRecognizer->ContinuousRecognitionSession != nullptr)
                   {
                       m_speechRecognizer->ContinuousRecognitionSession->ResultGenerated -= m_speechRecognizerResultEventToken;
                   }
               });
           }
           else
           {
               return create_task([this]() { m_speechRecognizer = nullptr; });
           }
       });
   }
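The routine above removes each handler by hand, using the token that the subscription returned. The same idea can be wrapped in a small scope guard so that a handler is always removed before its owner goes away. The following sketch illustrates the pattern in standard C++ with a toy event type; it is not WinRT code, and Event, ScopedSubscription, and their members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Toy event source that hands out a token for each registered handler.
class Event
{
public:
    using Token = int;

    Token Add(std::function<void(int)> handler)
    {
        const Token token = m_nextToken++;
        m_handlers[token] = std::move(handler);
        return token;
    }

    void Remove(Token token) { m_handlers.erase(token); }

    // Calls every registered handler with the given value.
    void Raise(int value) const
    {
        for (const auto& entry : m_handlers)
        {
            entry.second(value);
        }
    }

    std::size_t HandlerCount() const { return m_handlers.size(); }

private:
    Token m_nextToken = 1;
    std::map<Token, std::function<void(int)>> m_handlers;
};

// Subscribes in the constructor and unsubscribes in the destructor.
// The event must outlive the subscription.
class ScopedSubscription
{
public:
    ScopedSubscription(Event& event, std::function<void(int)> handler)
        : m_event(&event), m_token(event.Add(std::move(handler)))
    {
    }

    ~ScopedSubscription() { m_event->Remove(m_token); }

    ScopedSubscription(const ScopedSubscription&) = delete;
    ScopedSubscription& operator=(const ScopedSubscription&) = delete;

private:
    Event* m_event;
    Event::Token m_token;
};
```

With this pattern, a handler that captures a raw pointer to its owner cannot be called after the owner has been destroyed, as long as the guard is a member of that owner and the event source outlives it.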

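Use speech synthesis to provide audible prompts

The holographic speech samples use speech synthesis to provide audible instructions to the user. This section shows how to create a synthesized voice sample and then play it back through the HRTF audio APIs.

We recommend that you provide your own speech prompts when you request phrase input. Prompts can also help indicate when speech commands can be spoken in a continuous-recognition scenario. The following example demonstrates how to use a speech synthesizer to do this. You could also use a prerecorded voice clip, a visual UI, or another indicator of what to say, for example in scenarios where the prompt isn't dynamic.

First, create the SpeechSynthesizer object.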

auto speechSynthesizer = ref new Windows::Media::SpeechSynthesis::SpeechSynthesizer();

You also need a string that contains the text to synthesize.

// Phrase recognition works best when requesting a phrase or sentence.
   StringReference voicePrompt = L"At the prompt: Say a phrase, asking me to change the cube to a specific color.";

Speech is synthesized asynchronously through SynthesizeTextToStreamAsync. Here, we start an async task to synthesize the speech.

create_task(speechSynthesizer->SynthesizeTextToStreamAsync(voicePrompt), task_continuation_context::use_current())
       .then([this, speechSynthesizer](task<Windows::Media::SpeechSynthesis::SpeechSynthesisStream^> synthesisStreamTask)
   {
       try
       {

The synthesized speech is delivered as a byte stream. We can use that byte stream to initialize an XAudio2 voice. For our holographic code samples, we play it back as an HRTF audio effect.

Windows::Media::SpeechSynthesis::SpeechSynthesisStream^ stream = synthesisStreamTask.get();

           auto hr = m_speechSynthesisSound.Initialize(stream, 0);
           if (SUCCEEDED(hr))
           {
               m_speechSynthesisSound.SetEnvironment(HrtfEnvironment::Small);
               m_speechSynthesisSound.Start();

               // Amount of time to pause after the audio prompt is complete, before listening
               // for speech input.
               static const float bufferTime = 0.15f;

               // Wait until the prompt is done before listening.
               m_secondsUntilSoundIsComplete = m_speechSynthesisSound.GetDuration() + bufferTime;
               m_waitingForSpeechPrompt = true;
           }
       }

As with speech recognition, speech synthesis throws an exception if something goes wrong.

catch (Exception^ exception)
       {
           PrintWstringToDebugConsole(
               std::wstring(L"Exception while trying to synthesize speech: ") +
               exception->Message->Data() +
               L"\n"
               );

           // Handle exceptions here.
       }
   });
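The snippet above records how long to wait (the prompt duration plus a short buffer) and sets a flag, but it doesn't show how that wait is consumed. One plausible approach, sketched below in standard C++, is to count the time down in the per-frame update and start listening on the frame where it runs out. PromptTimer, Start, Tick, and IsWaiting are hypothetical names that are not part of the sample.

```cpp
#include <cassert>

// Hypothetical helper: counts down the time until an audio prompt, plus a
// short buffer, has finished playing.
class PromptTimer
{
public:
    // Begins waiting for promptDurationSeconds + bufferSeconds.
    void Start(float promptDurationSeconds, float bufferSeconds = 0.15f)
    {
        assert(promptDurationSeconds >= 0.f && bufferSeconds >= 0.f);
        m_secondsRemaining = promptDurationSeconds + bufferSeconds;
        m_waiting = true;
    }

    // Call once per frame with the elapsed frame time. Returns true exactly
    // once: on the frame in which the wait runs out.
    bool Tick(float deltaSeconds)
    {
        if (!m_waiting)
        {
            return false;
        }
        m_secondsRemaining -= deltaSeconds;
        if (m_secondsRemaining > 0.f)
        {
            return false;
        }
        m_waiting = false;
        return true;
    }

    bool IsWaiting() const { return m_waiting; }

private:
    float m_secondsRemaining = 0.f;
    bool m_waiting = false;
};
```

In this sketch, the update loop would call Tick with the frame's elapsed time and begin recognition when it returns true. That keeps the recognizer from picking up the app's own synthesized prompt.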