Handling Bing Speech Recognition data

07/05/2016

This document describes how to work with the text and other information returned by the Bing Speech Recognition Control.

Prerequisites

Before creating speech-enabled applications, you must install the speech control from Visual Studio Gallery or from the Visual Studio Extension Manager, as described in How to: Register and install the Bing Speech Recognition Control. Then, for each project that will use the Speech control, you must complete the preparatory steps described in How to: Enable a project for the Bing Speech Recognition Control.

This document assumes you have created a SpeechRecognizer object and UI elements to support it, as described in How to: Add the Bing Speech Recognition Control to an application with the SpeechRecognizerUx class and How to: Add the Bing Speech Recognition Control to an application with a custom UI.

Sources and types of information returned by speech recognition

When you call the SpeechRecognizer.RecognizeSpeechToTextAsync() method, it asynchronously returns a SpeechRecognitionResult object. This object includes the result text, a TextConfidence property that gives the estimated accuracy of the result, and a list of alternate results available through the GetAlternates(int) method. Three SpeechRecognizer events provide additional information: AudioCaptureStateChanged, which identifies the different stages of the speech recognition process; AudioLevelChanged, which tracks the current audio input volume; and RecognizerResultReceived, which reports possible results as the speech recognition web service identifies them. For more information about using the SpeechRecognizer events, see How to: Add the Bing Speech Recognition Control to an application with a custom UI.

Result text

The final result text from a speech recognition session is stored in the SpeechRecognitionResult.Text property. This is the result that the SpeechRecognizer considers most likely to be accurate. In addition, the RecognizerResultReceived event provides intermediate results through the SpeechRecognitionResultReceivedEventArgs.Text property.
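Because intermediate results and the final result both write to the same display, it can help to track which one is currently shown. The following is a minimal sketch of a hypothetical helper, createResultDisplay (not part of the control's API). It assumes you call showIntermediate from a RecognizerResultReceived handler and showFinal when recognizeSpeechToTextAsync() completes. As a design choice, it ignores any intermediate result that arrives after the final result.

```javascript
// Hypothetical helper, not part of the Bing Speech Recognition Control:
// tracks the text that ResultText should show during one session.
function createResultDisplay() {
    var state = { text: "", isFinal: false };
    return {
        // Call with args.text from a RecognizerResultReceived handler.
        showIntermediate: function (text) {
            // Skip non-string values (error objects) and late arrivals.
            if (!state.isFinal && typeof text === "string") {
                state.text = text;
            }
            return state.text;
        },
        // Call with the final result text, or with an error message.
        showFinal: function (text) {
            state.text = text;
            state.isFinal = true;
            return state.text;
        },
        // Returns whatever is currently displayed.
        current: function () {
            return state.text;
        }
    };
}
```

Create a new display object for each recognition session so that the final-result flag starts cleared.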

Confidence

Text confidence indicates the estimated accuracy of the result text. Confidence is returned as a SpeechRecognitionConfidence enumeration value.
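An application can compare a result's confidence against a threshold, for example to accept high-confidence results automatically and ask the user to confirm others. The following minimal sketch uses a hypothetical helper, meetsConfidence. It assumes the enumeration names High, Medium, Low, and Rejected, which are what TextConfidence.ToString() would produce in C#. Check the SpeechRecognitionConfidence reference for the values your SDK version defines. In JavaScript the property may be a numeric value that you would first map to a name.

```javascript
// Assumed confidence names, ordered from lowest to highest.
// Verify these against the SpeechRecognitionConfidence reference.
var CONFIDENCE_ORDER = ["Rejected", "Low", "Medium", "High"];

// Hypothetical helper: true when confidenceName is at least minimumName.
function meetsConfidence(confidenceName, minimumName) {
    var actual = CONFIDENCE_ORDER.indexOf(confidenceName);
    var minimum = CONFIDENCE_ORDER.indexOf(minimumName);
    if (actual < 0 || minimum < 0) {
        throw new Error("Unknown confidence value");
    }
    return actual >= minimum;
}
```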

Alternates

The list of alternates is an array of SpeechRecognitionResult objects, arranged in order of confidence, with the final result as item [0]. Calling GetAlternates(int) on any result in the array returns the same array.
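Because item [0] is the final result, a chooser list often starts at index 1 to avoid repeating the text already displayed. The following minimal sketch uses a hypothetical helper, formatAlternates. It assumes each element has text and textConfidence properties, as in the JavaScript projection of SpeechRecognitionResult. Plain objects stand in for real results so that the logic can run on its own.

```javascript
// Hypothetical helper: builds "text -- confidence" strings for a list UI.
// alternates is array-like and ordered by confidence, with the final result at [0].
function formatAlternates(alternates, includeFinal) {
    // Start at 1 to skip the final result unless the caller wants it listed.
    var start = includeFinal ? 0 : 1;
    var items = [];
    for (var i = start; i < alternates.length; i++) {
        items.push(alternates[i].text + " -- " + alternates[i].textConfidence);
    }
    return items;
}
```

If the list skips the final result, users can only switch away from it. Passing true for includeFinal lets them return to the original result after choosing an alternate.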

The following example starts a speech recognition session and displays the result text and confidence in a TextBox named ResultText. It then lists alternate results and their confidence in a ListBox named ResultChooser. When a user selects an item from the list, the selected item is then displayed in ResultText. Intermediate results are shown in ResultText until they are overwritten by the final result or by an error message.

private async void SpeakButton_Click(object sender, RoutedEventArgs e)
{
    try
    {
        // Start speech recognition.
        var result = await SR.RecognizeSpeechToTextAsync();

        // Populate ResultText.
        ResultText.Text = string.Format("{0} -- {1}",
            result.Text,
            result.TextConfidence.ToString());

        // Read the alternates into a string array, skipping item [0],
        // which is the final result already shown in ResultText.
        var alternates = result.GetAlternates(5);
        if (alternates.Count > 1)
        {
            string[] s = new string[alternates.Count - 1];
            for (int i = 1; i < alternates.Count; i++)
            {
                s[i - 1] = string.Format("{0} -- {1}",
                    alternates[i].Text,
                    alternates[i].TextConfidence.ToString());
            }

            // Set ResultChooser to display the alternates list.
            this.ResultChooser.ItemsSource = s;
        }
    }
    catch (System.Exception ex)
    {
        // Display the error message in place of a result.
        ResultText.Text = ex.Message;
    }
}

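private void ResultChooser_SelectionChanged(object sender, SelectionChangedEventArgs e)
{
    // Set ResultText to display the selected alternate. SelectedItem is
    // null when the selection is cleared, for example after ItemsSource changes.
    var item = ResultChooser.SelectedItem;
    if (item != null)
    {
        ResultText.Text = item.ToString();
    }
}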

void SR_RecognizerResultReceived(SpeechRecognizer sender,
    SpeechRecognitionResultReceivedEventArgs args)
{
    ResultText.Text = args.Text;
}

function SpeakButton_Click() {
    var resultText = "";

    // Re-bind to the control and start speech recognition. Skip this
    // step if you are not using a SpeechRecognizerUx control.
    document.getElementById('SpeechControl').winControl.speechRecognizer = SR;
    SR.recognizeSpeechToTextAsync()
        .then(
            function (result) {
                // Make sure result.text is a string and not an error object.
                if (typeof (result.text) == "string") {

                    // Get the result text.
                    resultText = result.text;

                    // Clear alternates from any previous session.
                    var chooser = document.getElementById('ResultChooser');
                    while (chooser.firstChild) {
                        chooser.removeChild(chooser.firstChild);
                    }

                    // Load the alternates into ResultChooser as <option> elements.
                    var alternates = result.getAlternates(5);
                    if (alternates.length > 1) {
                        for (var i = 0; i < alternates.length; i++) {
                            var opt = document.createElement("option");
                            opt.textContent = alternates[i].text;
                            chooser.appendChild(opt);
                        }
                    }
                }
                else {
                    // Handle the error from speech that is too quiet or unclear.
                    resultText = "I'm sorry. I couldn't understand you.";
                }
            },
            function (error) {
                resultText = "Error: (" + error.number + ") " + error.message;
            }
        )
        .done(
            function () {
                // Load the result text into ResultText.
                document.getElementById('ResultText').innerHTML = window.toStaticHTML(resultText);
            }
        );
}

function ResultChooser_SelectionChanged() {
    // Display the selected alternate in ResultText.
    var alts = document.getElementById('ResultChooser');
    if (alts.selectedIndex >= 0) {
        var item = alts.options[alts.selectedIndex];
        document.getElementById('ResultText').innerText = item.textContent;
    }
}

function SR_RecognizerResultReceived(args) {
    // Intermediate results can be error objects, so display only strings.
    if (typeof (args.text) == "string") {
        document.getElementById("ResultText").innerText = args.text;
    }
}

Caution

When collecting speech results or intermediate results in a JavaScript application, quiet or unclear speech may cause the recognizeSpeechToTextAsync() method to return an error object in place of result text. To maintain smooth program flow, verify that the result text is a string before attempting to read it. For more information, see How to: Add the Bing Speech Recognition Control to an application with a custom UI.
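You can put the string check in one place so that the final result and intermediate results are handled the same way. The following minimal sketch uses a hypothetical helper, getDisplayText. It accepts any object with a text property, such as a SpeechRecognitionResult or the event arguments passed to a RecognizerResultReceived handler. It returns a fallback message whenever the text is not a string.

```javascript
// Hypothetical helper: returns result.text when it is a string,
// otherwise the fallback, so the UI never displays an error object.
function getDisplayText(result, fallback) {
    if (result && typeof result.text === "string") {
        return result.text;
    }
    return fallback;
}
```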

Strategies for handling speech recognition data

Because speaking styles vary, it is usually a good idea to provide options for users when the result text is incorrect, such as choosing from a list of alternates or accepting/rejecting a given result. You can also provide guidance in the SpeechRecognizerUx.Tips property or elsewhere in the UI to help users phrase their speech in ways that are more likely to be understood. This is particularly important if your application will respond to particular keywords, phrases, or syntax in user speech.
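When an application responds to keywords, checking the alternates as well as the final result can recover commands that were slightly misrecognized. The following minimal sketch uses a hypothetical helper, findCommand. It scans results in order of confidence and returns the first keyword that appears as a whole word, together with the index of the alternate where it was found. It assumes English keywords, because the word-splitting pattern keeps only ASCII letters, digits, and apostrophes.

```javascript
// Hypothetical helper: returns { keyword, alternateIndex } for the first
// keyword found as a whole word in any result, or null if none match.
function findCommand(alternates, keywords) {
    for (var i = 0; i < alternates.length; i++) {
        var text = alternates[i].text;
        if (typeof text !== "string") {
            continue; // Skip error objects.
        }
        // Lowercase and split on punctuation so "Stop." matches "stop".
        var words = text.toLowerCase().split(/[^a-z0-9']+/);
        for (var k = 0; k < keywords.length; k++) {
            if (words.indexOf(keywords[k].toLowerCase()) >= 0) {
                return { keyword: keywords[k], alternateIndex: i };
            }
        }
    }
    return null;
}
```

If the match comes from an alternate rather than item [0], you might ask the user to confirm the command before acting on it.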

The Bing Speech Recognition control is optimized for short sessions, usually two sentences or fewer. If your application will be used to dictate longer messages or documents, consider designing your UI to encourage users to speak in segments of about this length, with short pauses between sessions so that each segment can be processed.
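For longer dictation, one approach is to run one recognition session per segment and combine the results into a single message. The following minimal sketch uses a hypothetical helper, createDictationBuffer, which is not part of the control. It assumes you pass each session's final text to add(). It skips empty or non-string results and offers an undo for re-dictating the last segment.

```javascript
// Hypothetical dictation buffer: one segment per short recognition session.
function createDictationBuffer() {
    var segments = [];
    return {
        // Adds a session's final text; ignores empty or non-string values.
        add: function (text) {
            if (typeof text === "string" && text.trim().length > 0) {
                segments.push(text.trim());
            }
            return segments.length;
        },
        // Removes the last segment so the user can dictate it again.
        undo: function () {
            segments.pop();
            return segments.length;
        },
        // Joins all segments into one message.
        toString: function () {
            return segments.join(" ");
        }
    };
}
```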

Depending on the intended context of your application, you may want to use speech synthesis, also known as text-to-speech (TTS), to communicate with your users instead of onscreen text. For information about speech synthesis in Windows 8.1, see the Windows.Media.SpeechSynthesis documentation.