How to: Add the Bing Speech Recognition Control to an application with a custom UI

This document describes how to implement speech recognition with a custom UI. To use the SpeechRecognizerUx control in your application, which implements some of the UI functionality automatically, see How to: Add the Bing Speech Recognition Control to an application with the SpeechRecognizerUx class.

Prerequisites for Speech Recognition in Windows Store applications

Before creating speech-enabled applications, you must install the speech control from Visual Studio Gallery or from the Visual Studio Extension Manager, as described in How to: Register and install the Bing Speech Recognition Control. Then, for each project that will use the Speech control, you must complete the preparatory steps described in How to: Enable a project for the Bing Speech Recognition Control.

Adding Speech Recognition to an application with a custom UI

Speech recognition functionality depends on the SpeechRecognizer class and its methods and events. The methods let you start and stop speech recognition, and the events report volume levels and recognition results and mark the stages of the speech recognition process, so that you can adjust the UI accordingly.

For a complete code example using speech recognition with a custom UI, see the SpeechRecognizer documentation. For the list of currently supported languages, see The Bing Speech Recognition Control.

Creating the SpeechRecognizer object

Before calling the SpeechRecognizer.SpeechRecognizer(string, SpeechAuthorizationParameters) constructor, you must create a SpeechAuthorizationParameters object and populate it with your Azure Data Marketplace credentials. These credentials enable the SpeechRecognizer to contact the web service that analyzes the audio data and converts it to text.

The following code examples create a SpeechRecognizer and add its event handlers.

C#

public MainPage()
{
    this.InitializeComponent();
    this.Loaded += MainPage_Loaded;
}

SpeechRecognizer SR;
private void MainPage_Loaded(object sender, RoutedEventArgs e)
{
    // Apply credentials from the Windows Azure Data Marketplace.
    var credentials = new SpeechAuthorizationParameters();
    credentials.ClientId = "<YOUR CLIENT ID>";
    credentials.ClientSecret = "<YOUR CLIENT SECRET>";

    // Initialize the speech recognizer.
    SR = new SpeechRecognizer("en-US", credentials);

    // Add speech recognition event handlers.
    SR.AudioCaptureStateChanged += SR_AudioCaptureStateChanged;
    SR.AudioLevelChanged += SR_AudioLevelChanged;
    SR.RecognizerResultReceived += SR_RecognizerResultReceived;
}

JavaScript

var SR;
function body_OnLoad() {
    // Apply credentials from the Windows Azure Data Marketplace.
    var credentials = new Bing.Speech.SpeechAuthorizationParameters();
    credentials.clientId = "<YOUR CLIENT ID>";
    credentials.clientSecret = "<YOUR CLIENT SECRET>";

    // Initialize the speech recognizer.
    SR = new Bing.Speech.SpeechRecognizer("en-US", credentials);

    // Add speech recognition event handlers.
    SR.onaudiocapturestatechanged = SR_AudioCaptureStateChanged;
    SR.onaudiolevelchanged = SR_AudioLevelChanged;
    SR.onrecognizerresultreceived = SR_RecognizerResultReceived;
}

Exposing the SpeechRecognizer methods in the UI

The most important speech recognition function is the SpeechRecognizer.RecognizeSpeechToTextAsync() method, which starts the speech recognition session, raises the events, and returns the results. You can expose this method directly through a UI element, such as a Button, or call it in response to another application event, such as loading a page or detecting a microphone.
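
For example, to start listening as soon as the page has loaded, you could start a session at the end of the MainPage_Loaded handler shown earlier. A minimal sketch (the handler must be declared async void for the await to compile; error handling is omitted here and shown in the button handlers below):

private async void MainPage_Loaded(object sender, RoutedEventArgs e)
{
    // ... create the credentials and the SpeechRecognizer as shown above ...

    // Start a recognition session immediately instead of waiting for a button.
    var result = await SR.RecognizeSpeechToTextAsync();
    ResultText.Text = result.Text;
}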

Once a speech recognition session has started, you can end it at any time with the RequestCancelOperation() method. This stops the session and discards any data that may have accumulated. A Cancel button is useful for users who want to start their speech over, and you can also call this method as part of your error handling.

When the user stops speaking, the SpeechRecognizer.RecognizeSpeechToTextAsync() method detects the drop in audio input levels, stops recording, and finishes interpreting the audio data. Sometimes, especially if there is background noise, there can be a delay between the end of speech and the end of recording. Including a button to call the StopListeningAndProcessAudio() method gives your users a way to end the recording without waiting for the speech recognizer to decide they are done.

The following markup creates buttons for each of the SpeechRecognizer methods.

XAML

<!-- If your app targets Windows 8.1 or higher, use this markup. -->
<AppBarButton x:Name="SpeakButton" Icon="Microphone" 
    Click="SpeakButton_Click"></AppBarButton>
<AppBarButton x:Name="StopButton" Icon="Stop" 
    Click="StopButton_Click"></AppBarButton>
<AppBarButton x:Name="CancelButton" Icon="Cancel" 
    Click="CancelButton_Click"></AppBarButton>

<!-- If your app targets Windows 8, use this markup. -->
<Button x:Name="SpeakButton" Click="SpeakButton_Click" 
        Style="{StaticResource MicrophoneAppBarButtonStyle}"
        AutomationProperties.Name="Speak" />
<Button x:Name="StopButton" Click="StopButton_Click"
        Style="{StaticResource StopAppBarButtonStyle}"  
        AutomationProperties.Name="Done" />
<Button x:Name="CancelButton" Click="CancelButton_Click"
        Style="{StaticResource ClosePaneAppBarButtonStyle}" 
        AutomationProperties.Name="Cancel" Content="&#xE10A;" />
<TextBlock x:Name="ResultText" />
<div id="SpeakButton" onclick="SpeakButton_Click();" 
    style="font-size:30">&#xe1d6;</div> 
<div>Speak</div>
<div id="StopButton" onclick="StopButton_Click();" 
    style="font-size:30">&#xe15b;</div> 
<div>Done</div>
<div id="CancelButton" onclick="CancelButton_Click();" 
    style="font-size:30">&#xe10a;</div> 
<div>Cancel</div>
<div id="ResultText"></div>

Note

The XAML Style attributes in the Windows 8 markup refer to standard UI element styles for Windows Store applications. These styles are defined in Common\StandardStyles.xaml in Solution Explorer but are commented out by default, so you must uncomment them before use. Setting the AutomationProperties.Name attribute assigns a caption below the element, and the Content attribute specifies a character code for the glyph that appears on the face of the button. If you do not specify a caption or character code, the default values from the style definition are used.

The following code handles the click events for the buttons.

C#

private async void SpeakButton_Click(object sender, RoutedEventArgs e)
{
    // Always call RecognizeSpeechToTextAsync from inside
    // a try block because it calls a web service.
    try
    {
        // Start speech recognition.
        var result = await SR.RecognizeSpeechToTextAsync();

        // Write the result to the TextBlock.
        ResultText.Text = result.Text;
    }
    catch (Exception ex)
    {
        // If there's an exception, show the Type and Message
        // in TextBlock ResultText.
        ResultText.Text = string.Format("{0}: {1}",
            ex.GetType().ToString(), ex.Message);
    }
}

private void CancelButton_Click(object sender, RoutedEventArgs e)
{
    // Cancel the current speech session.
    SR.RequestCancelOperation();
}

private void StopButton_Click(object sender, RoutedEventArgs e)
{
    // Stop listening and move to Thinking state.
    SR.StopListeningAndProcessAudio();
}

JavaScript

function SpeakButton_Click() {
    // Declare a string to hold the result text.
    var s = "";

    // Start speech recognition.
    SR.recognizeSpeechToTextAsync()
        .then(
            // Write the result to the string.
            function (result) {
                if (typeof (result.text) == "string") {
                    s = result.text;
                } else {
                    // Catch the error from speech that is too quiet or unintelligible.
                    s = "I'm sorry. I couldn't understand you.";
                }
            },
            // If there's another error, write the Type and Message to the string.
            function (error) {
                s = "Error: (" + error.number + ") " + error.message;
            }
        )
        .done(
            // Write the string to ResultText.
            function () {
                document.getElementById('ResultText').innerHTML = window.toStaticHTML(s);
            }
        );
}

function CancelButton_Click() {
    // Cancel the current speech session.
    SR.requestCancelOperation();
}

function StopButton_Click() {
    // Stop listening and move to Thinking state.
    SR.stopListeningAndProcessAudio();
}

Note that the JavaScript code in the example above checks that result.text is a string before assigning it to another variable. This is because quiet or unclear speech can cause the recognizeSpeechToTextAsync() method to return an error object with error number -2147467261 in place of result.text. The same error object is also passed to the intermediate results through the SpeechRecognitionResult.Text property in the SpeechRecognizer.RecognizerResultReceived event. Validating the type of the result in this way maintains program flow and gives you the opportunity to bypass or respond to a known error.
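
In C#, the same condition surfaces inside the try block of SpeakButton_Click as an exception, so an equivalent check can compare Exception.HResult against the error number above. A minimal sketch of the pattern (the assumption that the JavaScript error number arrives unchanged as the managed HResult should be verified):

try
{
    var result = await SR.RecognizeSpeechToTextAsync();
    ResultText.Text = result.Text;
}
catch (Exception ex)
{
    if (ex.HResult == -2147467261)
    {
        // The known error for speech that is too quiet or unintelligible.
        ResultText.Text = "I'm sorry. I couldn't understand you.";
    }
    else
    {
        ResultText.Text = string.Format("{0}: {1}",
            ex.GetType().ToString(), ex.Message);
    }
}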

Handling the SpeechRecognizer events

The most important SpeechRecognizer event is the AudioCaptureStateChanged event, because it tells you where you are in the speech recognition process. Use the SpeechRecognitionAudioCaptureStateChangedEventArgs.State property to access capture state information.

The following AudioCaptureStateChanged event handler and associated helper function show and hide a series of StackPanel elements (XAML) or Div elements (HTML) that correspond to different UI states.

C#

void SR_AudioCaptureStateChanged(SpeechRecognizer sender,
    SpeechRecognitionAudioCaptureStateChangedEventArgs args)
{
    // Show the panel that corresponds to the current state.
    switch (args.State)
    {
        case SpeechRecognizerAudioCaptureState.Complete:
            if (uiState == "ListenPanel" || uiState == "ThinkPanel")
            {
                SetPanel(CompletePanel);  
            }
            break;
        case SpeechRecognizerAudioCaptureState.Initializing:
            SetPanel(InitPanel);
            break;
        case SpeechRecognizerAudioCaptureState.Listening:
            SetPanel(ListenPanel);
            break;
        case SpeechRecognizerAudioCaptureState.Thinking:
            SetPanel(ThinkPanel);
            break;
        default:
            break;
    }
}

string uiState = "";
private void SetPanel(StackPanel panel)
{
    // Hide all the panels.
    InitPanel.Visibility = Visibility.Collapsed;
    ListenPanel.Visibility = Visibility.Collapsed;
    ThinkPanel.Visibility = Visibility.Collapsed;
    CompletePanel.Visibility = Visibility.Collapsed;
    StartPanel.Visibility = Visibility.Collapsed;

    // Show the selected panel and flag the UI state.
    panel.Visibility = Visibility.Visible;
    uiState = panel.Name;
}

JavaScript

function SR_AudioCaptureStateChanged(args) {
    // Show the div that corresponds to the current state.
    switch (args.state) {
        case Bing.Speech.SpeechRecognizerAudioCaptureState.complete:
            if (uiState == "ListenDiv" || uiState == "ThinkDiv") {
                SetDiv("CompleteDiv");
            }
            break;
        case Bing.Speech.SpeechRecognizerAudioCaptureState.initializing:
            SetDiv("InitDiv");
            break;
        case Bing.Speech.SpeechRecognizerAudioCaptureState.listening:
            SetDiv("ListenDiv");
            break;
        case Bing.Speech.SpeechRecognizerAudioCaptureState.thinking:
            SetDiv("ThinkDiv");
            break;
        default:
            break;
    }
}

var uiState = "";
function SetDiv(div)
{
    // Hide all the Divs.
    document.getElementById("InitDiv").style.display = 'none';
    document.getElementById("ListenDiv").style.display = 'none';
    document.getElementById("ThinkDiv").style.display = 'none';
    document.getElementById("CompleteDiv").style.display = 'none';
    document.getElementById("CancellingDiv").style.display = 'none';
    document.getElementById("StartDiv").style.display = 'none';

    // Show the selected Div and flag the UI state.
    document.getElementById(div).style.display = 'block';
    uiState = div;
}

You can use the SpeechRecognizer.AudioLevelChanged event to show real-time audio levels in a variety of ways, or to advise users when they should adjust their speaking volume. The following example represents current audio levels by changing the opacity of a UI element named VolumeMeter.

C#

void SR_AudioLevelChanged(SpeechRecognizer sender,
    SpeechRecognitionAudioLevelChangedEventArgs args)
{
    // Map the current audio level to an opacity value for the volume meter.
    var v = args.AudioLevel;
    if (v > 0) VolumeMeter.Opacity = v / 50;
    else VolumeMeter.Opacity = Math.Abs((v - 50) / 100);
}

JavaScript

function SR_AudioLevelChanged(args) {
    var volumeMeter = document.getElementById("VolumeMeter");
    var v = args.audioLevel;
    if (v > 0) volumeMeter.style.opacity = v / 50;
    else volumeMeter.style.opacity = Math.abs((v - 50) / 100);
}

While the speech recognizer is receiving an audio stream, it makes multiple attempts to interpret the audio data collected so far and convert it to text. At the end of each attempt, it raises the RecognizerResultReceived event. You can use this event to capture intermediate results while the speech recognizer is still processing. You can also identify the final result text by checking the SpeechRecognitionResultReceivedEventArgs.IsHypothesis property, but the SpeechRecognizer.RecognizeSpeechToTextAsync() method returns a more complete result in the form of a SpeechRecognitionResult object.

The following example writes intermediate results to a TextBlock named IntermediateResults.

C#

void SR_RecognizerResultReceived(SpeechRecognizer sender,
    SpeechRecognitionResultReceivedEventArgs args)
{
    IntermediateResults.Text = args.Text;
}

JavaScript

function SR_RecognizerResultReceived(args)
{
    if (typeof (args.text) == "string") {
        document.getElementById("IntermediateResults").innerText = args.text;
    }
}
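
To distinguish hypotheses from the final result inside the handler itself, you can check the IsHypothesis property described above. A minimal C# sketch (assuming IsHypothesis is exposed as a simple Boolean on the event arguments):

void SR_RecognizerResultReceived(SpeechRecognizer sender,
    SpeechRecognitionResultReceivedEventArgs args)
{
    if (args.IsHypothesis)
    {
        // An intermediate guess that later results may revise.
        IntermediateResults.Text = args.Text;
    }
    else
    {
        // The final text for this utterance.
        ResultText.Text = args.Text;
    }
}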

Next steps

You now have all of the pieces to put together a custom speech recognition UI. For more information on processing the results, including assessing TextConfidence and making the Alternates list available, see Handling Bing Speech Recognition data.
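
As a preview, a confidence check might look like the following minimal C# sketch. The TextConfidence property is described in that topic; the enumeration and member names used here are assumptions to verify against it.

var result = await SR.RecognizeSpeechToTextAsync();

// Hypothetical confidence check; confirm the actual enumeration
// and its values in Handling Bing Speech Recognition data.
if (result.TextConfidence == SpeechRecognitionConfidence.Rejected)
{
    ResultText.Text = "Sorry, I didn't catch that.";
}
else
{
    ResultText.Text = result.Text;
}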
