Get started with speech-to-text

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the C# quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

If you just want the package name to get started, run Install-Package Microsoft.CognitiveServices.Speech in the NuGet console.
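If you use the .NET CLI instead of the NuGet console, the equivalent command should be:

dotnet add package Microsoft.CognitiveServices.Speech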

For platform-specific installation instructions, see the following links:

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token. Create a SpeechConfig by using your key and location/region. See the Find keys and location/region page to find your key-location/region pair.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    }
}

There are a few other ways that you can initialize a SpeechConfig (a sketch of each follows this list):

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region/location.
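Here's a minimal sketch of those alternatives. All of the values shown are placeholders, and overloads that omit the key also exist if you authenticate another way.

// Sketch only: replace the placeholder endpoint, host, token, and region values with your own.
var configFromEndpoint = SpeechConfig.FromEndpoint(new Uri("<your-endpoint>"), "<paste-your-speech-key-here>");
var configFromHost = SpeechConfig.FromHost(new Uri("<your-host>"), "<paste-your-speech-key-here>");
var configFromToken = SpeechConfig.FromAuthorizationToken("<authorization-token>", "<paste-your-speech-location/region-here>");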

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize from microphone

To recognize speech using your device microphone, create an AudioConfig using FromDefaultMicrophoneInput(). Then initialize a SpeechRecognizer, passing your audioConfig and speechConfig.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromMic(SpeechConfig speechConfig)
    {
        using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        Console.WriteLine("Speak into your microphone.");
        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
        await FromMic(speechConfig);
    }
}

If you want to use a specific audio input device, you need to specify the device ID in the AudioConfig. Learn how to get the device ID for your audio input device.
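For example, here's a minimal sketch of selecting a specific device, where "<device-id>" is a placeholder for the ID you retrieve for your input device:

using var audioConfig = AudioConfig.FromMicrophoneInput("<device-id>");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);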

Recognize from file

If you want to recognize speech from an audio file instead of a microphone, you still need to create an AudioConfig. However, when you create the AudioConfig, instead of calling FromDefaultMicrophoneInput(), you call FromWavFileInput() and pass the file path.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromFile(SpeechConfig speechConfig)
    {
        using var audioConfig = AudioConfig.FromWavFileInput("PathToFile.wav");
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
        await FromFile(speechConfig);
    }
}

Recognize from in-memory stream

In many use cases, your audio data will likely come from Blob Storage, or it will already be in memory as a byte[] or similar raw data structure. The following example uses a PushAudioInputStream to recognize speech, which is essentially an abstracted memory stream. The sample code does the following:

  • Writes raw audio data (PCM) to the PushAudioInputStream using the Write() function, which accepts a byte[].
  • Reads a .wav file using a BinaryReader for demonstration purposes, but if you already have audio data in a byte[], you can skip directly to writing the content to the input stream.
  • The default format is 16-bit, 16-kHz, mono PCM. To customize the format, you can pass an AudioStreamFormat object to CreatePushStream() using the static function AudioStreamFormat.GetWaveFormatPCM(sampleRate, (byte)bitRate, (byte)channels) (see the sketch after the sample).
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromStream(SpeechConfig speechConfig)
    {
        using var reader = new BinaryReader(File.OpenRead("PathToFile.wav"));
        using var audioInputStream = AudioInputStream.CreatePushStream();
        using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        byte[] readBytes;
        do
        {
            readBytes = reader.ReadBytes(1024);
            audioInputStream.Write(readBytes, readBytes.Length);
        } while (readBytes.Length > 0);

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
        await FromStream(speechConfig);
    }
}

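As a sketch of the custom format mentioned in the list above, you could replace the CreatePushStream() call in the sample with something like the following; the 8-kHz, 16-bit, mono values are illustrative only, not requirements.

// Example only: describe 8-kHz, 16-bit, mono PCM, and create the push stream with that format.
var format = AudioStreamFormat.GetWaveFormatPCM(8000, 16, 1);
using var audioInputStream = AudioInputStream.CreatePushStream(format);
using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
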
Using a push stream as input assumes that the audio data is raw PCM and skips any headers. The API still works in certain cases if the header hasn't been skipped, but for the best results, consider implementing logic to read off the headers so the byte[] starts at the beginning of the audio data, as sketched below.
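Here's a minimal sketch of that approach, replacing the read/write loop in the sample above. It assumes a canonical 44-byte WAV header; real-world files can contain additional chunks, so production code should parse the chunk structure instead.

using var reader = new BinaryReader(File.OpenRead("PathToFile.wav"));
reader.ReadBytes(44); // Discard the assumed 44-byte RIFF/WAV header so only raw PCM is written.

byte[] readBytes;
do
{
    readBytes = reader.ReadBytes(1024);
    audioInputStream.Write(readBytes, readBytes.Length);
} while (readBytes.Length > 0);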

Error handling

The previous examples simply get the recognized text from result.Text, but to handle errors and other responses, you'll need to write some code to handle the result. The following code evaluates the result.Reason property and:

  • Prints the recognition result: ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason.NoMatch
  • If an error is encountered, print the error message: ResultReason.Canceled
switch (result.Reason)
{
    case ResultReason.RecognizedSpeech:
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
        break;
    case ResultReason.NoMatch:
        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
        break;
    case ResultReason.Canceled:
        var cancellation = CancellationDetails.FromResult(result);
        Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

        if (cancellation.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={cancellation.ErrorDetails}");
            Console.WriteLine($"CANCELED: Did you update the speech key and location/region info?");
        }
        break;
}

Continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, continuous recognition is used when you want to control when to stop recognizing. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call StopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing a SpeechRecognizer:

using var audioConfig = AudioConfig.FromWavFileInput("YourAudioFile.wav");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

Then create a TaskCompletionSource<int> to manage the state of speech recognition.

var stopRecognition = new TaskCompletionSource<int>();

Next, subscribe to the events sent from the SpeechRecognizer.

  • Recognizing: Signal for events containing intermediate recognition results.
  • Recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • SessionStopped: Signal for events indicating the end of a recognition session (operation).
  • Canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
recognizer.Recognizing += (s, e) =>
{
    Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");
};

recognizer.Recognized += (s, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
    }
    else if (e.Result.Reason == ResultReason.NoMatch)
    {
        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
    }
};

recognizer.Canceled += (s, e) =>
{
    Console.WriteLine($"CANCELED: Reason={e.Reason}");

    if (e.Reason == CancellationReason.Error)
    {
        Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
        Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
        Console.WriteLine($"CANCELED: Did you update the speech key and location/region info?");
    }

    stopRecognition.TrySetResult(0);
};

recognizer.SessionStopped += (s, e) =>
{
    Console.WriteLine("\n    Session stopped event.");
    stopRecognition.TrySetResult(0);
};

With everything set up, call StartContinuousRecognitionAsync to start recognizing.

await recognizer.StartContinuousRecognitionAsync();

// Waits for completion. Use Task.WaitAny to keep the task rooted.
Task.WaitAny(new[] { stopRecognition.Task });

// make the following call at some point to stop recognition.
// await recognizer.StopContinuousRecognitionAsync();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. In this mode, the speech config instance interprets spoken descriptions of sentence structures, such as punctuation. For example, the utterance "Do you live in town question mark" is interpreted as the text "Do you live in town?".

To enable dictation mode, use the EnableDictation method on your SpeechConfig.

speechConfig.EnableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to Italian. In your code, find your SpeechConfig, then add this line directly below it.

speechConfig.SpeechRecognitionLanguage = "it-IT";

The SpeechRecognitionLanguage property expects a language-locale format string. You can provide any value in the Locale column in the list of supported locales/languages.

Improve recognition accuracy

Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. By providing a list of phrases, you improve the accuracy of speech recognition.

As an example, if you have a command "Move to" and a possible destination "Ward" that might be spoken, you can add an entry of "Move to Ward". Adding the phrase increases the probability that the audio is recognized as "Move to Ward" instead of "Move toward".

Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used to boost recognition of the words and phrases in the list even when the entries appear in the middle of the utterance.

Important

The Phrase List feature is available in the following languages: en-US, de-DE, en-AU, en-CA, en-GB, en-IN, es-ES, fr-FR, it-IT, ja-JP, pt-BR, zh-CN

The Phrase List feature should be used with no more than a few hundred phrases. If you have a larger list, or if you need a language that isn't currently supported, training a custom model is likely the better choice for improving accuracy.

The Phrase List feature isn't supported with custom endpoints. Instead, train a custom model that includes the phrases.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with AddPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
phraseList.AddPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseList.Clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the C++ quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token. Create a SpeechConfig by using your key and region. See the Find keys and location/region page to find your key-location/region pair.

using namespace std;
using namespace Microsoft::CognitiveServices::Speech;

auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

There are a few other ways that you can initialize a SpeechConfig:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize from microphone

To recognize speech using your device microphone, create an AudioConfig using FromDefaultMicrophoneInput(). Then initialize a SpeechRecognizer, passing your audioConfig and config.

using namespace Microsoft::CognitiveServices::Speech::Audio;

auto audioConfig = AudioConfig::FromDefaultMicrophoneInput();
auto recognizer = SpeechRecognizer::FromConfig(config, audioConfig);

cout << "Speak into your microphone." << std::endl;
auto result = recognizer->RecognizeOnceAsync().get();
cout << "RECOGNIZED: Text=" << result->Text << std::endl;

If you want to use a specific audio input device, you need to specify the device ID in the AudioConfig. Learn how to get the device ID for your audio input device.

Recognize from file

If you want to recognize speech from an audio file instead of using a microphone, you still need to create an AudioConfig. However, when you create the AudioConfig, instead of calling FromDefaultMicrophoneInput(), you call FromWavFileInput() and pass the file path.

using namespace Microsoft::CognitiveServices::Speech::Audio;

auto audioInput = AudioConfig::FromWavFileInput("YourAudioFile.wav");
auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

auto result = recognizer->RecognizeOnceAsync().get();
cout << "RECOGNIZED: Text=" << result->Text << std::endl;

Recognize speech

The Recognizer class for the Speech SDK for C++ exposes a few methods that you can use for speech recognition.

Single-shot recognition

Single-shot recognition asynchronously recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed. Here's an example of asynchronous single-shot recognition using RecognizeOnceAsync:

auto result = recognizer->RecognizeOnceAsync().get();

You'll need to write some code to handle the result. This sample evaluates the result->Reason:

  • Prints the recognition result: ResultReason::RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason::NoMatch
  • If an error is encountered, print the error message: ResultReason::Canceled
switch (result->Reason)
{
    case ResultReason::RecognizedSpeech:
        cout << "We recognized: " << result->Text << std::endl;
        break;
    case ResultReason::NoMatch:
        cout << "NOMATCH: Speech could not be recognized." << std::endl;
        break;
    case ResultReason::Canceled:
        {
            auto cancellation = CancellationDetails::FromResult(result);
            cout << "CANCELED: Reason=" << (int)cancellation->Reason << std::endl;
    
            if (cancellation->Reason == CancellationReason::Error) {
                cout << "CANCELED: ErrorCode= " << (int)cancellation->ErrorCode << std::endl;
                cout << "CANCELED: ErrorDetails=" << cancellation->ErrorDetails << std::endl;
                cout << "CANCELED: Did you update the speech key and location/region info?" << std::endl;
            }
        }
        break;
    default:
        break;
}

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call StopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

auto audioInput = AudioConfig::FromWavFileInput("YourAudioFile.wav");
auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

Next, let's create a variable to manage the state of speech recognition. To start, we'll declare a promise<void>, since at the start of recognition we can safely assume that it's not finished.

promise<void> recognitionEnd;

We'll subscribe to the events sent from the SpeechRecognizer.

  • Recognizing: Signal for events containing intermediate recognition results.
  • Recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • SessionStopped: Signal for events indicating the end of a recognition session (operation).
  • Canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl;
    });

recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        cout << "CANCELED: Reason=" << (int)e.Reason << std::endl;
        if (e.Reason == CancellationReason::Error)
        {
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << "\n"
                 << "CANCELED: ErrorDetails=" << e.ErrorDetails << "\n"
                 << "CANCELED: Did you update the speech key and location/region info?" << std::endl;

            recognitionEnd.set_value(); // Notify to stop recognition.
        }
    });

recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

With everything set up, we can call StartContinuousRecognitionAsync to start recognizing.

// Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
recognizer->StartContinuousRecognitionAsync().get();

// Waits for recognition end.
recognitionEnd.get_future().get();

// Stops recognition.
recognizer->StopContinuousRecognitionAsync().get();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. In this mode, the speech config instance interprets spoken descriptions of sentence structures, such as punctuation. For example, the utterance "Do you live in town question mark" is interpreted as the text "Do you live in town?".

To enable dictation mode, use the EnableDictation method on your SpeechConfig.

config->EnableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to German. In your code, find your SpeechConfig, then add this line directly below it.

config->SetSpeechRecognitionLanguage("de-DE");

SetSpeechRecognitionLanguage is a method that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Improve recognition accuracy

Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. By providing a list of phrases, you improve the accuracy of speech recognition.

As an example, if you have a command "Move to" and a possible destination "Ward" that might be spoken, you can add an entry of "Move to Ward". Adding the phrase increases the probability that the audio is recognized as "Move to Ward" instead of "Move toward".

Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used to boost recognition of the words and phrases in the list even when the entries appear in the middle of the utterance.

Important

The Phrase List feature is available in the following languages: en-US, de-DE, en-AU, en-CA, en-GB, en-IN, es-ES, fr-FR, it-IT, ja-JP, pt-BR, zh-CN

The Phrase List feature should be used with no more than a few hundred phrases. If you have a larger list, or if you need a language that isn't currently supported, training a custom model is likely the better choice for improving accuracy.

The Phrase List feature isn't supported with custom endpoints. Instead, train a custom model that includes the phrases.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with AddPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

auto phraseListGrammar = PhraseListGrammar::FromRecognizer(recognizer);
phraseListGrammar->AddPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseListGrammar->Clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the Go quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK for Go.

Speech-to-text from microphone

Use the following code sample to run speech recognition from your default device microphone. Replace the variables subscription and region with your speech key and location/region. See the Find keys and location/region page to find your key-location/region pair. Running the script will start a recognition session on your default microphone and output text.

package main

import (
	"bufio"
	"fmt"
	"os"

	"github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
	"github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)

func sessionStartedHandler(event speech.SessionEventArgs) {
	defer event.Close()
	fmt.Println("Session Started (ID=", event.SessionID, ")")
}

func sessionStoppedHandler(event speech.SessionEventArgs) {
	defer event.Close()
	fmt.Println("Session Stopped (ID=", event.SessionID, ")")
}

func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognizing:", event.Result.Text)
}

func recognizedHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognized:", event.Result.Text)
}

func cancelledHandler(event speech.SpeechRecognitionCanceledEventArgs) {
	defer event.Close()
	fmt.Println("Received a cancellation: ", event.ErrorDetails)
}

func main() {
	subscription := "<paste-your-speech-key-here>"
	region := "<paste-your-speech-location/region-here>"

	audioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer audioConfig.Close()
	config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer config.Close()
	speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer speechRecognizer.Close()
	speechRecognizer.SessionStarted(sessionStartedHandler)
	speechRecognizer.SessionStopped(sessionStoppedHandler)
	speechRecognizer.Recognizing(recognizingHandler)
	speechRecognizer.Recognized(recognizedHandler)
	speechRecognizer.Canceled(cancelledHandler)
	speechRecognizer.StartContinuousRecognitionAsync()
	defer speechRecognizer.StopContinuousRecognitionAsync()
	bufio.NewReader(os.Stdin).ReadBytes('\n')
}

Run the following commands to create a go.mod file that links to components hosted on GitHub.

go mod init quickstart
go get github.com/Microsoft/cognitive-services-speech-sdk-go

Now build and run the code.

go build
go run quickstart

See the reference docs for detailed information on the SpeechConfig and SpeechRecognizer classes.

Speech-to-text from audio file

Use the following sample to run speech recognition from an audio file. Replace the variables subscription and region with your speech key and location/region. See the Find keys and location/region page to find your key-location/region pair. Additionally, replace the variable file with a path to a .wav file. Running the script will recognize speech from the file, and output the text result.

package main

import (
	"fmt"
	"time"

	"github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
	"github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)

func main() {
	subscription := "<paste-your-speech-key-here>"
	region := "<paste-your-speech-location/region-here>"
	file := "path/to/file.wav"

	audioConfig, err := audio.NewAudioConfigFromWavFileInput(file)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer audioConfig.Close()
	config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer config.Close()
	speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer speechRecognizer.Close()
	speechRecognizer.SessionStarted(func(event speech.SessionEventArgs) {
		defer event.Close()
		fmt.Println("Session Started (ID=", event.SessionID, ")")
	})
	speechRecognizer.SessionStopped(func(event speech.SessionEventArgs) {
		defer event.Close()
		fmt.Println("Session Stopped (ID=", event.SessionID, ")")
	})

	task := speechRecognizer.RecognizeOnceAsync()
	var outcome speech.SpeechRecognitionOutcome
	select {
	case outcome = <-task:
	case <-time.After(5 * time.Second):
		fmt.Println("Timed out")
		return
	}
	defer outcome.Close()
	if outcome.Error != nil {
		fmt.Println("Got an error: ", outcome.Error)
		return
	}
	fmt.Println("Got a recognition!")
	fmt.Println(outcome.Result.Text)
}

Run the following commands to create a go.mod file that links to components hosted on GitHub.

go mod init quickstart
go get github.com/Microsoft/cognitive-services-speech-sdk-go

Now build and run the code.

go build
go run quickstart

See the reference docs for detailed information on the SpeechConfig and SpeechRecognizer classes.

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the Java quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token. Create a SpeechConfig by using your key and location/region. See the Find keys and location/region page to find your key-location/region pair.

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Program {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
    }
}

There are a few other ways that you can initialize a SpeechConfig:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize from microphone

To recognize speech using your device microphone, create an AudioConfig using fromDefaultMicrophoneInput(). Then initialize a SpeechRecognizer, passing your audioConfig and config.

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Program {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
        fromMic(speechConfig);
    }

    public static void fromMic(SpeechConfig speechConfig) throws InterruptedException, ExecutionException {
        AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput();
        SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        System.out.println("Speak into your microphone.");
        Future<SpeechRecognitionResult> task = recognizer.recognizeOnceAsync();
        SpeechRecognitionResult result = task.get();
        System.out.println("RECOGNIZED: Text=" + result.getText());
    }
}

If you want to use a specific audio input device, you need to specify the device ID in the AudioConfig. Learn how to get the device ID for your audio input device.

Recognize from file

If you want to recognize speech from an audio file instead of using a microphone, you still need to create an AudioConfig. However, when you create the AudioConfig, instead of calling fromDefaultMicrophoneInput(), call fromWavFileInput() and pass the file path.

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Program {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
        fromFile(speechConfig);
    }

    public static void fromFile(SpeechConfig speechConfig) throws InterruptedException, ExecutionException {
        AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
        SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        
        Future<SpeechRecognitionResult> task = recognizer.recognizeOnceAsync();
        SpeechRecognitionResult result = task.get();
        System.out.println("RECOGNIZED: Text=" + result.getText());
    }
}

Error handling

The previous examples simply get the recognized text using result.getText(), but to handle errors and other responses, you'll need to write some code to handle the result. The following example evaluates result.getReason() and:

  • Prints the recognition result: ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason.NoMatch
  • If an error is encountered, print the error message: ResultReason.Canceled
switch (result.getReason()) {
    case RecognizedSpeech:
        System.out.println("We recognized: " + result.getText());
        exitCode = 0;
        break;
    case NoMatch:
        System.out.println("NOMATCH: Speech could not be recognized.");
        break;
    case Canceled: {
            CancellationDetails cancellation = CancellationDetails.fromResult(result);
            System.out.println("CANCELED: Reason=" + cancellation.getReason());

            if (cancellation.getReason() == CancellationReason.Error) {
                System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
                System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
                System.out.println("CANCELED: Did you update the subscription info?");
            }
        }
        break;
}

Continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, continuous recognition is used when you want to control when to stop recognizing. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
SpeechRecognizer recognizer = new SpeechRecognizer(config, audioConfig);

Next, let's create a variable to manage the state of speech recognition. To start, we'll declare a Semaphore at the class scope.

private static Semaphore stopTranslationWithFileSemaphore;

We'll subscribe to the events sent from the SpeechRecognizer.

  • recognizing: Signal for events containing intermediate recognition results.
  • recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • sessionStopped: Signal for events indicating the end of a recognition session (operation).
  • canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
// First initialize the semaphore.
stopTranslationWithFileSemaphore = new Semaphore(0);

recognizer.recognizing.addEventListener((s, e) -> {
    System.out.println("RECOGNIZING: Text=" + e.getResult().getText());
});

recognizer.recognized.addEventListener((s, e) -> {
    if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
        System.out.println("RECOGNIZED: Text=" + e.getResult().getText());
    }
    else if (e.getResult().getReason() == ResultReason.NoMatch) {
        System.out.println("NOMATCH: Speech could not be recognized.");
    }
});

recognizer.canceled.addEventListener((s, e) -> {
    System.out.println("CANCELED: Reason=" + e.getReason());

    if (e.getReason() == CancellationReason.Error) {
        System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
        System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
        System.out.println("CANCELED: Did you update the subscription info?");
    }

    stopTranslationWithFileSemaphore.release();
});

recognizer.sessionStopped.addEventListener((s, e) -> {
    System.out.println("\n    Session stopped event.");
    stopTranslationWithFileSemaphore.release();
});

With everything set up, we can call startContinuousRecognitionAsync.

// Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
recognizer.startContinuousRecognitionAsync().get();

// Waits for completion.
stopTranslationWithFileSemaphore.acquire();

// Stops recognition.
recognizer.stopContinuousRecognitionAsync().get();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. In this mode, the speech config instance interprets spoken descriptions of sentence structures, such as punctuation. For example, the utterance "Do you live in town question mark" is interpreted as the text "Do you live in town?".

To enable dictation mode, use the enableDictation method on your SpeechConfig.

config.enableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to French. In your code, find your SpeechConfig, then add this line directly below it.

config.setSpeechRecognitionLanguage("fr-FR");

setSpeechRecognitionLanguage is a method that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Improve recognition accuracy

Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. By providing a list of phrases, you improve the accuracy of speech recognition.

As an example, if you have a command "Move to" and a possible destination "Ward" that might be spoken, you can add an entry of "Move to Ward". Adding the phrase increases the probability that the audio is recognized as "Move to Ward" instead of "Move toward".

Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used to boost recognition of the words and phrases in the list even when the entries appear in the middle of the utterance.

Important

The Phrase List feature is available in the following languages: en-US, de-DE, en-AU, en-CA, en-GB, en-IN, es-ES, fr-FR, it-IT, ja-JP, pt-BR, zh-CN

The Phrase List feature should be used with no more than a few hundred phrases. If you have a larger list, or if you need a language that isn't currently supported, training a custom model is likely the better choice for improving accuracy.

The Phrase List feature isn't supported with custom endpoints. Instead, train a custom model that includes the phrases.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with AddPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

PhraseListGrammar phraseList = PhraseListGrammar.fromRecognizer(recognizer);
phraseList.addPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseList.clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the JavaScript quickstart samples on GitHub.

Alternatively, see the React sample to learn how to use the Speech SDK in a browser-based environment.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you need to install the Speech SDK for Node.js. If you just want the package name to install, run npm install microsoft-cognitiveservices-speech-sdk. For guided installation instructions, see the get started article.

Use the following require statement to import the SDK.

const sdk = require("microsoft-cognitiveservices-speech-sdk");

For more information on require, see the require documentation.

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token. Create a SpeechConfig using your key and location/region. See the Find keys and location/region page to find your key-location/region pair.

const speechConfig = sdk.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

There are a few other ways that you can initialize a SpeechConfig:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated location/region.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize from microphone (Browser only)

Recognizing speech from a microphone is not supported in Node.js, and is only supported in a browser-based JavaScript environment. See the React sample on GitHub to see the speech-to-text from microphone implementation.

Note

If you want to use a specific audio input device, you need to specify the device ID in the AudioConfig. Learn how to get the device ID for your audio input device.

Recognize from file

To recognize speech from an audio file, create an AudioConfig using fromWavFileInput(), which accepts a Buffer object. Then initialize a SpeechRecognizer, passing your audioConfig and speechConfig.

const fs = require('fs');
const sdk = require("microsoft-cognitiveservices-speech-sdk");
const speechConfig = sdk.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

function fromFile() {
    let audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("YourAudioFile.wav"));
    let recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

    recognizer.recognizeOnceAsync(result => {
        console.log(`RECOGNIZED: Text=${result.text}`);
        recognizer.close();
    });
}
fromFile();

Recognize from in-memory stream

In many use cases, your audio data will likely come from Blob Storage, or it will already be in memory as an ArrayBuffer or similar raw data structure. The following code:

  • Creates a push stream using createPushStream().
  • Reads a .wav file using fs.createReadStream for demonstration purposes, but if you already have audio data in an ArrayBuffer, you can skip directly to writing the content to the input stream.
  • Creates an audio config using the push stream.
const fs = require('fs');
const sdk = require("microsoft-cognitiveservices-speech-sdk");
const speechConfig = sdk.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

function fromStream() {
    let pushStream = sdk.AudioInputStream.createPushStream();

    fs.createReadStream("YourAudioFile.wav").on('data', function(arrayBuffer) {
        pushStream.write(arrayBuffer.slice());
    }).on('end', function() {
        pushStream.close();
    });
 
    let audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
    let recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);
    recognizer.recognizeOnceAsync(result => {
        console.log(`RECOGNIZED: Text=${result.text}`);
        recognizer.close();
    });
}
fromStream();

Using a push stream as input assumes that the audio data is raw PCM and skips any headers. The API still works in certain cases if the header hasn't been skipped, but for the best results, consider implementing logic to read off the headers so the stream starts at the beginning of the audio data.

Error handling

The previous examples simply get the recognized text from result.text, but to handle errors and other responses, you'll need to write some code to handle the result. The following code evaluates the result.reason property and:

  • Prints the recognition result: ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason.NoMatch
  • If an error is encountered, print the error message: ResultReason.Canceled
switch (result.reason) {
    case sdk.ResultReason.RecognizedSpeech:
        console.log(`RECOGNIZED: Text=${result.text}`);
        break;
    case sdk.ResultReason.NoMatch:
        console.log("NOMATCH: Speech could not be recognized.");
        break;
    case sdk.ResultReason.Canceled:
        const cancellation = sdk.CancellationDetails.fromResult(result);
        console.log(`CANCELED: Reason=${cancellation.reason}`);

        if (cancellation.reason == sdk.CancellationReason.Error) {
            console.log(`CANCELED: ErrorCode=${cancellation.errorCode}`);
            console.log(`CANCELED: ErrorDetails=${cancellation.errorDetails}`);
            console.log("CANCELED: Did you update the key and location/region info?");
        }
        break;
}

Continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, continuous recognition is used when you want to control when to stop recognizing. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing a SpeechRecognizer:

const audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("YourAudioFile.wav"));
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

Next, subscribe to the events sent from the SpeechRecognizer.

  • recognizing: Signal for events containing intermediate recognition results.
  • recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • sessionStopped: Signal for events indicating the end of a recognition session (operation).
  • canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
recognizer.recognizing = (s, e) => {
    console.log(`RECOGNIZING: Text=${e.result.text}`);
};

recognizer.recognized = (s, e) => {
    if (e.result.reason == sdk.ResultReason.RecognizedSpeech) {
        console.log(`RECOGNIZED: Text=${e.result.text}`);
    }
    else if (e.result.reason == sdk.ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be recognized.");
    }
};

recognizer.canceled = (s, e) => {
    console.log(`CANCELED: Reason=${e.reason}`);

    if (e.reason == sdk.CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`);
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
        console.log("CANCELED: Did you update the key and location/region info?");
    }

    recognizer.stopContinuousRecognitionAsync();
};

recognizer.sessionStopped = (s, e) => {
    console.log("\n    Session stopped event.");
    recognizer.stopContinuousRecognitionAsync();
};

With everything set up, call startContinuousRecognitionAsync to start recognizing.

recognizer.startContinuousRecognitionAsync();

// make the following call at some point to stop recognition.
// recognizer.stopContinuousRecognitionAsync();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. In this mode, the speech config instance interprets spoken descriptions of sentence structures, such as punctuation. For example, the utterance "Do you live in town question mark" is interpreted as the text "Do you live in town?".

To enable dictation mode, use the enableDictation method on your SpeechConfig.

speechConfig.enableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to Italian. In your code, find your SpeechConfig, then add this line directly below it.

speechConfig.speechRecognitionLanguage = "it-IT";

The speechRecognitionLanguage property expects a language-locale format string. You can provide any value in the Locale column in the list of supported locales/languages.

Improve recognition accuracy

Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. By providing a list of phrases, you improve the accuracy of speech recognition.

As an example, if you have a command "Move to" and a possible destination "Ward" that might be spoken, you can add an entry of "Move to Ward". Adding the phrase increases the probability that the audio is recognized as "Move to Ward" instead of "Move toward".

Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used to boost recognition of the words and phrases in the list even when the entries appear in the middle of the utterance.

Important

The Phrase List feature is available in the following languages: en-US, de-DE, en-AU, en-CA, en-GB, en-IN, es-ES, fr-FR, it-IT, ja-JP, pt-BR, zh-CN

The Phrase List feature should be used with no more than a few hundred phrases. If you have a larger list, or if you need a language that isn't currently supported, training a custom model is likely the better choice for improving accuracy.

The Phrase List feature isn't supported with custom endpoints. Instead, train a custom model that includes the phrases.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with addPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

const phraseList = sdk.PhraseListGrammar.fromRecognizer(recognizer);
phraseList.addPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseList.clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this sample, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

React sample on GitHub

Go to the React sample on GitHub to learn how to use the Speech SDK in a browser-based JavaScript environment. This sample shows design pattern examples for authentication token exchange and management, and capturing audio from a microphone or file for speech-to-text conversions.

Additionally, the design patterns used in the Node.js quickstart can also be used in a browser environment.

You can transcribe speech into text using the Speech SDK for Swift and Objective-C.

Prerequisites

The following samples assume that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install Speech SDK and samples

The Cognitive Services Speech SDK contains samples written in Swift and Objective-C for iOS and Mac. Click a link to see installation instructions for each sample:

We also provide an online Speech SDK for Objective-C Reference.

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the Python quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install and import the Speech SDK

Before you can do anything, you'll need to install the Speech SDK.

pip install azure-cognitiveservices-speech

If you're on macOS and run into install issues, you may need to run this command first.

python3 -m pip install --upgrade pip

After the Speech SDK is installed, import it into your Python project.

import azure.cognitiveservices.speech as speechsdk

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token. Create a SpeechConfig using your key and location/region. See the Find keys and location/region page to find your key-location/region pair.

speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")

There are a few other ways that you can initialize a SpeechConfig:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize from microphone

To recognize speech using your device microphone, simply create a SpeechRecognizer without passing an AudioConfig, and pass your speech_config.

import azure.cognitiveservices.speech as speechsdk

def from_mic():
    speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    
    print("Speak into your microphone.")
    result = speech_recognizer.recognize_once_async().get()
    print(result.text)

from_mic()

If you want to use a specific audio input device, you need to specify the device ID in an AudioConfig, and pass it to the SpeechRecognizer constructor's audio_config param. Learn how to get the device ID for your audio input device.
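
For example, a minimal sketch might look like this; the device_name value is a placeholder, and the ID format varies by platform.

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")

# Select a specific audio input device instead of the default microphone.
audio_config = speechsdk.audio.AudioConfig(device_name="<your-device-id>")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = speech_recognizer.recognize_once_async().get()
print(result.text)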

Recognize from file

If you want to recognize speech from an audio file instead of using a microphone, create an AudioConfig and use the filename parameter.

import azure.cognitiveservices.speech as speechsdk

def from_file():
    speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
    audio_input = speechsdk.AudioConfig(filename="your_file_name.wav")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
    
    result = speech_recognizer.recognize_once_async().get()
    print(result.text)

from_file()

Error handling

The previous examples simply get the recognized text from result.text, but to handle errors and other responses, you'll need to write some code to handle the result. The following code evaluates the result.reason property and:

  • Prints the recognition result: speechsdk.ResultReason.RecognizedSpeech
  • If there is no recognition match, informs the user: speechsdk.ResultReason.NoMatch
  • If an error is encountered, prints the error message: speechsdk.ResultReason.Canceled

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))

Continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, continuous recognition is used when you want to control when to stop recognizing. It requires you to connect to the EventSignal to get the recognition results, and to stop recognition, you must call stop_continuous_recognition() or stop_continuous_recognition_async(). Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

audio_config = speechsdk.audio.AudioConfig(filename="your_file_name.wav")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

Next, let's create a variable to manage the state of speech recognition. To start, we'll set this to False, since at the start of recognition we can safely assume that it's not finished.

done = False

Now, we're going to create a callback to stop continuous recognition when an evt is received. There are a few things to keep in mind.

  • When an evt is received, the evt message is printed.
  • After an evt is received, stop_continuous_recognition() is called to stop recognition.
  • The recognition state is changed to True.

def stop_cb(evt):
    print('CLOSING on {}'.format(evt))
    speech_recognizer.stop_continuous_recognition()
    nonlocal done
    done = True

The nonlocal statement assumes that stop_cb and done are defined inside an enclosing function; the assembled example after the start_continuous_recognition() call below shows one way to arrange this.

This code sample shows how to connect callbacks to events sent from the SpeechRecognizer.

  • recognizing: Signal for events containing intermediate recognition results.
  • recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • session_started: Signal for events indicating the start of a recognition session (operation).
  • session_stopped: Signal for events indicating the end of a recognition session (operation).
  • canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).

speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

speech_recognizer.session_stopped.connect(stop_cb)
speech_recognizer.canceled.connect(stop_cb)

With everything set up, we can call start_continuous_recognition().

import time  # used by the polling loop below

speech_recognizer.start_continuous_recognition()
while not done:
    time.sleep(.5)
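
Because the snippets above are excerpts, here's one way to assemble them into a single, self-contained script (a sketch, assuming a local WAV file named your_file_name.wav). Wrapping everything in a function is what lets the nonlocal statement in stop_cb bind to done.

import time
import azure.cognitiveservices.speech as speechsdk

def recognize_continuous_from_file():
    speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
    audio_config = speechsdk.audio.AudioConfig(filename="your_file_name.wav")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    done = False

    def stop_cb(evt):
        # Called on session_stopped or canceled: stop recognition and end the loop below.
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)

recognize_continuous_from_file()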

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. In this mode, the recognizer interprets spoken descriptions of sentence structure, such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enable_dictation() method on your SpeechConfig.

speech_config.enable_dictation()
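
For example, enable dictation on the speech_config you created earlier, before constructing the recognizer (a minimal sketch; pair it with the continuous recognition pattern shown above):

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
speech_config.enable_dictation()  # spoken punctuation such as "question mark" becomes "?"

# Use this recognizer with start_continuous_recognition(), as in the earlier example.
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)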

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to German. In your code, find your SpeechConfig, then add this line directly below it.

speech_config.speech_recognition_language="de-DE"

speech_recognition_language is a property that takes a string as its value. You can provide any value in the list of supported locales/languages.

Improve recognition accuracy

Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. By providing a list of phrases, you improve the accuracy of speech recognition.

As an example, if you have a command "Move to" and a possible destination "Ward" that may be spoken, you can add the entry "Move to Ward". Adding the phrase increases the probability that the audio is recognized as "Move to Ward" rather than "Move toward".

Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used to boost recognition of the words and phrases in the list even when the entries appear in the middle of the utterance.

Important

The Phrase List feature is available in the following languages: en-US, de-DE, en-AU, en-CA, en-GB, en-IN, es-ES, fr-FR, it-IT, ja-JP, pt-BR, zh-CN

The Phrase List feature should be used with no more than a few hundred phrases. If you have a larger list or for languages that are not currently supported, training a custom model will likely be the better choice to improve accuracy.

The Phrase List feature is not supported with custom endpoints. Instead, train a custom model that includes the phrases.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with addPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

phrase_list_grammar = speechsdk.PhraseListGrammar.from_recognizer(speech_recognizer)
phrase_list_grammar.addPhrase("Supercalifragilisticexpialidocious")

If you need to clear your phrase list:

phrase_list_grammar.clear()
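
Putting it together, here's a minimal sketch that reuses the single-shot, default-microphone setup from earlier; the phrase is only an example.

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Boost recognition of domain-specific terms before recognizing.
phrase_list_grammar = speechsdk.PhraseListGrammar.from_recognizer(speech_recognizer)
phrase_list_grammar.addPhrase("Move to Ward")

result = speech_recognizer.recognize_once_async().get()
print(result.text)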

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

In this quickstart, you learn how to convert speech to text using the Speech service and cURL.

For a high-level look at Speech-to-Text concepts, see the overview article.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Convert speech to text

At a command prompt, run the following command. You will need to insert the following values into the command.

  • Your Speech service subscription key.
  • Your Speech service region.
  • The input audio file path. You can generate audio files using text-to-speech.

curl --location --request POST 'https://INSERT_REGION_HERE.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_AUDIO_FILE_PATH_HERE'

You should receive a response like the following one.

{
    "RecognitionStatus": "Success",
    "DisplayText": "My voice is my passport, verify me.",
    "Offset": 6600000,
    "Duration": 32100000
}
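
If you'd rather script the same request, here's a rough Python equivalent using the requests package (an extra dependency, not part of this quickstart). The endpoint, headers, and query string mirror the cURL command above.

import requests

region = "INSERT_REGION_HERE"
subscription_key = "INSERT_SUBSCRIPTION_KEY_HERE"
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

# Send the raw WAV bytes, just as curl --data-binary @file does.
with open("INSERT_AUDIO_FILE_PATH_HERE", "rb") as audio_file:
    response = requests.post(
        url,
        params={"language": "en-US"},
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "audio/wav",
        },
        data=audio_file,
    )

print(response.json())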

For more information, see the speech-to-text REST API reference.

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech CLI in your apps and products to perform high-quality speech-to-text conversion.

Download and install

Follow these steps to install the Speech CLI on Windows:

  1. On Windows, you need the Microsoft Visual C++ Redistributable for Visual Studio 2019 for your platform. Installing this for the first time may require a restart.

  2. Install .NET Core 3.1 SDK.

  3. Install the Speech CLI using NuGet by entering this command:

    dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI
    

Type spx to see help for the Speech CLI.

Note

As an alternative to NuGet, you can download and extract the Speech CLI for Windows as a zip file.

Font limitations

On Windows, the Speech CLI can only show fonts available to the command prompt on the local computer. Windows Terminal supports all fonts produced interactively by the Speech CLI.

If you output to a file, a text editor like Notepad or a web browser like Microsoft Edge can also show all fonts.

Create subscription config

To start using the Speech CLI, you need to enter your Speech subscription key and region identifier. Get these credentials by following the steps in Try the Speech service for free. Once you have your subscription key and region identifier (for example, eastus or westus), run the following commands.

spx config @key --set SUBSCRIPTION-KEY
spx config @region --set REGION

Your subscription authentication is now stored for future SPX requests. If you need to remove either of these stored values, run spx config @region --clear or spx config @key --clear.

Speech-to-text from microphone

Plug in and turn on your PC microphone, and turn off any apps that might also use the microphone. Some computers have a built-in microphone, while others require configuration of a Bluetooth device.

Now you're ready to run the Speech CLI to recognize speech from your microphone. From the command line, change to the directory that contains the Speech CLI binary file, and run the following command.

spx recognize --microphone

Note

The Speech CLI defaults to English. You can choose a different language from the Speech-to-text table. For example, add --source de-DE to recognize German speech.

Speak into the microphone, and you see your words transcribed to text in real time. The Speech CLI stops after a period of silence, or when you press Ctrl+C.

Speech-to-text from audio file

The Speech CLI can recognize speech in many file formats and natural languages. In this example, you can use any WAV file (16kHz or 8kHz, 16-bit, and mono PCM) that contains English speech. Or if you want a quick sample, download the whatstheweatherlike.wav file and copy it to the same directory as the Speech CLI binary file.

Now you're ready to run the Speech CLI to recognize speech found in the audio file by running the following command.

spx recognize --file whatstheweatherlike.wav

Note

The Speech CLI defaults to English. You can choose a different language from the Speech-to-text table. For example, add --source de-DE to recognize German speech.

The Speech CLI will show a text transcription of the speech on the screen.

Next steps