Get started with speech-to-text

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the C# quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's take a look at how a SpeechConfig is created using a key and region. See the region support page to find your region identifier.

var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

Initialize a recognizer

After you've created a SpeechConfig, the next step is to initialize a SpeechRecognizer. When you initialize a SpeechRecognizer, you pass it your speechConfig. This provides the credentials that the speech service requires to validate your request.

using var recognizer = new SpeechRecognizer(speechConfig);

Recognize from microphone or file

If you want to specify the audio input device, you need to create an AudioConfig and pass it as a parameter when initializing your SpeechRecognizer.

To recognize speech using your device microphone, create an AudioConfig using FromDefaultMicrophoneInput(), then pass the audio config when creating your SpeechRecognizer object.

using Microsoft.CognitiveServices.Speech.Audio;

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

If you want to recognize speech from an audio file instead of a microphone, you still need to create an AudioConfig. However, when you create the AudioConfig, instead of calling FromDefaultMicrophoneInput(), you call FromWavFileInput() and pass the filename parameter.

using var audioConfig = AudioConfig.FromWavFileInput("YourAudioFile.wav");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

Recognize speech

The Recognizer class for the Speech SDK for C# exposes a few methods that you can use for speech recognition.

  • Single-shot recognition (async) - Performs recognition in a non-blocking (asynchronous) mode. This will recognize a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
  • Continuous recognition (async) - Asynchronously initiates a continuous recognition operation. You subscribe to events and handle recognition results as they arrive. To stop asynchronous continuous recognition, call StopContinuousRecognitionAsync.

Note

Learn more about how to choose a speech recognition mode.

Single-shot recognition

Here's an example of asynchronous single-shot recognition using RecognizeOnceAsync:

var result = await recognizer.RecognizeOnceAsync();

You'll need to write some code to handle the result. This sample evaluates the result.Reason:

  • Prints the recognition result: ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason.NoMatch
  • If an error is encountered, print the error message: ResultReason.Canceled
switch (result.Reason)
{
    case ResultReason.RecognizedSpeech:
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
        break;
    case ResultReason.NoMatch:
        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
        break;
    case ResultReason.Canceled:
        var cancellation = CancellationDetails.FromResult(result);
        Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

        if (cancellation.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={cancellation.ErrorDetails}");
            Console.WriteLine($"CANCELED: Did you update the subscription info?");
        }
        break;
}

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call StopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

using var audioConfig = AudioConfig.FromWavFileInput("YourAudioFile.wav");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

Next, let's create a variable to manage the state of speech recognition. To start, we'll declare a TaskCompletionSource<int> after the previous declarations.

var stopRecognition = new TaskCompletionSource<int>();

We'll subscribe to the events sent from the SpeechRecognizer.

  • Recognizing: Signal for events containing intermediate recognition results.
  • Recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • SessionStopped: Signal for events indicating the end of a recognition session (operation).
  • Canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
recognizer.Recognizing += (s, e) =>
{
    Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");
};

recognizer.Recognized += (s, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
    }
    else if (e.Result.Reason == ResultReason.NoMatch)
    {
        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
    }
};

recognizer.Canceled += (s, e) =>
{
    Console.WriteLine($"CANCELED: Reason={e.Reason}");

    if (e.Reason == CancellationReason.Error)
    {
        Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
        Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
        Console.WriteLine($"CANCELED: Did you update the subscription info?");
    }

    stopRecognition.TrySetResult(0);
};

recognizer.SessionStopped += (s, e) =>
{
    Console.WriteLine("\n    Session stopped event.");
    stopRecognition.TrySetResult(0);
};

With everything set up, we can call StartContinuousRecognitionAsync.

// Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
await recognizer.StartContinuousRecognitionAsync();

// Waits for completion. Use Task.WaitAny to keep the task rooted.
Task.WaitAny(new[] { stopRecognition.Task });

// Stops recognition.
await recognizer.StopContinuousRecognitionAsync();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. This mode will cause the speech config instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the EnableDictation method on your SpeechConfig.

speechConfig.EnableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to Italian. In your code, find your SpeechConfig, then add this line directly below it.

speechConfig.SpeechRecognitionLanguage = "it-IT";

The SpeechRecognitionLanguage property expects a language-locale format string. You can provide any value in the Locale column in the list of supported locales/languages.

Improve recognition accuracy

There are a few ways to improve recognition accuracy with the Speech SDK. Let's take a look at Phrase Lists. Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used if an exact match for the entire phrase is included in the audio. If an exact match to the phrase is not found, recognition is not assisted.

Important

The Phrase List feature is only available in English.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with AddPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
phraseList.AddPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseList.Clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the C++ quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's take a look at how a SpeechConfig is created using a key and region. See the region support page to find your region identifier.

auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");

Initialize a recognizer

After you've created a SpeechConfig, the next step is to initialize a SpeechRecognizer. When you initialize a SpeechRecognizer, you pass it your config. This provides the credentials that the speech service requires to validate your request.

auto recognizer = SpeechRecognizer::FromConfig(config);

Recognize from microphone or file

If you want to specify the audio input device, you need to create an AudioConfig and pass it as a parameter when initializing your SpeechRecognizer.

To recognize speech using your device microphone, create an AudioConfig using FromDefaultMicrophoneInput(), then pass the audio config when creating your SpeechRecognizer object.

using namespace Microsoft::CognitiveServices::Speech::Audio;

auto audioConfig = AudioConfig::FromDefaultMicrophoneInput();
auto recognizer = SpeechRecognizer::FromConfig(config, audioConfig);

If you want to recognize speech from an audio file instead of using a microphone, you still need to create an AudioConfig. However, when you create the AudioConfig, instead of calling FromDefaultMicrophoneInput(), you call FromWavFileInput() and pass the filename parameter.

auto audioInput = AudioConfig::FromWavFileInput("YourAudioFile.wav");
auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

Recognize speech

The Recognizer class for the Speech SDK for C++ exposes a few methods that you can use for speech recognition.

  • Single-shot recognition (async) - Performs recognition in a non-blocking (asynchronous) mode. This will recognize a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
  • Continuous recognition (async) - Asynchronously initiates a continuous recognition operation. You connect to events to receive recognition results. To stop asynchronous continuous recognition, call StopContinuousRecognitionAsync.

Note

Learn more about how to choose a speech recognition mode.

Single-shot recognition

Here's an example of asynchronous single-shot recognition using RecognizeOnceAsync:

auto result = recognizer->RecognizeOnceAsync().get();

You'll need to write some code to handle the result. This sample evaluates the result->Reason:

  • Prints the recognition result: ResultReason::RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason::NoMatch
  • If an error is encountered, print the error message: ResultReason::Canceled
switch (result->Reason)
{
    case ResultReason::RecognizedSpeech:
        cout << "We recognized: " << result->Text << std::endl;
        break;
    case ResultReason::NoMatch:
        cout << "NOMATCH: Speech could not be recognized." << std::endl;
        break;
    case ResultReason::Canceled:
        {
            auto cancellation = CancellationDetails::FromResult(result);
            cout << "CANCELED: Reason=" << (int)cancellation->Reason << std::endl;
    
            if (cancellation->Reason == CancellationReason::Error) {
                cout << "CANCELED: ErrorCode= " << (int)cancellation->ErrorCode << std::endl;
                cout << "CANCELED: ErrorDetails=" << cancellation->ErrorDetails << std::endl;
                cout << "CANCELED: Did you update the subscription info?" << std::endl;
            }
        }
        break;
    default:
        break;
}

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call StopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

auto audioInput = AudioConfig::FromWavFileInput("YourAudioFile.wav");
auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

Next, let's create a variable to manage the state of speech recognition. To start, we'll declare a promise<void>, since at the start of recognition we can safely assume that it's not finished.

promise<void> recognitionEnd;

We'll subscribe to the events sent from the SpeechRecognizer.

  • Recognizing: Signal for events containing intermediate recognition results.
  • Recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • SessionStopped: Signal for events indicating the end of a recognition session (operation).
  • Canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl;
    });

recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        cout << "CANCELED: Reason=" << (int)e.Reason << std::endl;
        if (e.Reason == CancellationReason::Error)
        {
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << "\n"
                 << "CANCELED: ErrorDetails=" << e.ErrorDetails << "\n"
                 << "CANCELED: Did you update the subscription info?" << std::endl;

            recognitionEnd.set_value(); // Notify to stop recognition.
        }
    });

recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

With everything set up, we can call StartContinuousRecognitionAsync.

// Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
recognizer->StartContinuousRecognitionAsync().get();

// Waits for recognition end.
recognitionEnd.get_future().get();

// Stops recognition.
recognizer->StopContinuousRecognitionAsync().get();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. This mode will cause the speech config instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the EnableDictation method on your SpeechConfig.

config->EnableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to German. In your code, find your SpeechConfig, then add this line directly below it.

config->SetSpeechRecognitionLanguage("de-DE");

SetSpeechRecognitionLanguage is a method that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Improve recognition accuracy

There are a few ways to improve recognition accuracy with the Speech SDK. Let's take a look at Phrase Lists. Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used if an exact match for the entire phrase is included in the audio. If an exact match to the phrase is not found, recognition is not assisted.

Important

The Phrase List feature is only available in English.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with AddPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

auto phraseListGrammar = PhraseListGrammar::FromRecognizer(recognizer);
phraseListGrammar->AddPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseListGrammar->Clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the Go quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK for Go.

Speech-to-text from microphone

Use the following code sample to run speech recognition from your default device microphone. Replace the variables subscription and region with your subscription key and region. Running the script starts a recognition session on your default microphone and outputs text.

import (
	"bufio"
	"fmt"
	"os"

	"github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
	"github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)

func sessionStartedHandler(event speech.SessionEventArgs) {
	defer event.Close()
	fmt.Println("Session Started (ID=", event.SessionID, ")")
}

func sessionStoppedHandler(event speech.SessionEventArgs) {
	defer event.Close()
	fmt.Println("Session Stopped (ID=", event.SessionID, ")")
}

func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognizing:", event.Result.Text)
}

func recognizedHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognized:", event.Result.Text)
}

func cancelledHandler(event speech.SpeechRecognitionCanceledEventArgs) {
	defer event.Close()
	fmt.Println("Received a cancellation: ", event.ErrorDetails)
}

func main() {
	subscription := "YOUR_SUBSCRIPTION_KEY"
	region := "YOUR_SUBSCRIPTIONKEY_REGION"

	audioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer audioConfig.Close()
	config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer config.Close()
	speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer speechRecognizer.Close()
	speechRecognizer.SessionStarted(sessionStartedHandler)
	speechRecognizer.SessionStopped(sessionStoppedHandler)
	speechRecognizer.Recognizing(recognizingHandler)
	speechRecognizer.Recognized(recognizedHandler)
	speechRecognizer.Canceled(cancelledHandler)
	speechRecognizer.StartContinuousRecognitionAsync()
	defer speechRecognizer.StopContinuousRecognitionAsync()
	bufio.NewReader(os.Stdin).ReadBytes('\n')
}

See the reference docs for detailed information on the SpeechConfig and SpeechRecognizer classes.

Speech-to-text from audio file

Use the following sample to run speech recognition from an audio file. Replace the variables subscription and region with your subscription key and region. Additionally, replace the variable file with a path to a .wav file. Running the script recognizes speech from the file and outputs the text result.

import (
	"fmt"
	"time"

	"github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
	"github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)

func main() {
	subscription := "YOUR_SUBSCRIPTION_KEY"
	region := "YOUR_SUBSCRIPTIONKEY_REGION"
	file := "path/to/file.wav"

	audioConfig, err := audio.NewAudioConfigFromWavFileInput(file)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer audioConfig.Close()
	config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer config.Close()
	speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer speechRecognizer.Close()
	speechRecognizer.SessionStarted(func(event speech.SessionEventArgs) {
		defer event.Close()
		fmt.Println("Session Started (ID=", event.SessionID, ")")
	})
	speechRecognizer.SessionStopped(func(event speech.SessionEventArgs) {
		defer event.Close()
		fmt.Println("Session Stopped (ID=", event.SessionID, ")")
	})

	task := speechRecognizer.RecognizeOnceAsync()
	var outcome speech.SpeechRecognitionOutcome
	select {
	case outcome = <-task:
	case <-time.After(5 * time.Second):
		fmt.Println("Timed out")
		return
	}
	defer outcome.Close()
	if outcome.Error != nil {
		fmt.Println("Got an error: ", outcome.Error)
		return
	}
	fmt.Println("Got a recognition!")
	fmt.Println(outcome.Result.Text)
}

See the reference docs for detailed information on the SpeechConfig and SpeechRecognizer classes.

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the Java quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's take a look at how a SpeechConfig is created using a key and region. See the region support page to find your region identifier.

SpeechConfig config = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

Initialize a recognizer

After you've created a SpeechConfig, the next step is to initialize a SpeechRecognizer. When you initialize a SpeechRecognizer, you pass it your SpeechConfig. This provides the credentials that the speech service requires to validate your request.

SpeechRecognizer recognizer = new SpeechRecognizer(config);

Recognize from microphone or file

If you want to specify the audio input device, you need to create an AudioConfig and pass it as a parameter when initializing your SpeechRecognizer.

To recognize speech using your device microphone, create an AudioConfig using fromDefaultMicrophoneInput(), then pass the audio config when creating your SpeechRecognizer object.

import java.util.concurrent.Future;
import com.microsoft.cognitiveservices.speech.*;

AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput();
SpeechRecognizer recognizer = new SpeechRecognizer(config, audioConfig);

If you want to recognize speech from an audio file instead of using a microphone, you still need to create an AudioConfig. However, when you create the AudioConfig, instead of calling fromDefaultMicrophoneInput(), you'll call fromWavFileInput() and pass the filename parameter.

AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
SpeechRecognizer recognizer = new SpeechRecognizer(config, audioConfig);

Recognize speech

The Recognizer class for the Speech SDK for Java exposes a few methods that you can use for speech recognition.

  • Single-shot recognition (async) - Performs recognition in a non-blocking (asynchronous) mode. This will recognize a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
  • Continuous recognition (async) - Asynchronously initiates a continuous recognition operation. You subscribe to events and handle recognition results as they arrive. To stop asynchronous continuous recognition, call stopContinuousRecognitionAsync.

Note

Learn more about how to choose a speech recognition mode.

Single-shot recognition

Here's an example of asynchronous single-shot recognition using recognizeOnceAsync:

Future<SpeechRecognitionResult> task = recognizer.recognizeOnceAsync();
SpeechRecognitionResult result = task.get();

You'll need to write some code to handle the result. This sample evaluates the result.getReason():

  • Prints the recognition result: ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason.NoMatch
  • If an error is encountered, print the error message: ResultReason.Canceled
switch (result.getReason()) {
    case RecognizedSpeech:
        System.out.println("We recognized: " + result.getText());
        exitCode = 0;
        break;
    case NoMatch:
        System.out.println("NOMATCH: Speech could not be recognized.");
        break;
    case Canceled: {
            CancellationDetails cancellation = CancellationDetails.fromResult(result);
            System.out.println("CANCELED: Reason=" + cancellation.getReason());

            if (cancellation.getReason() == CancellationReason.Error) {
                System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
                System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
                System.out.println("CANCELED: Did you update the subscription info?");
            }
        }
        break;
}

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
SpeechRecognizer recognizer = new SpeechRecognizer(config, audioConfig);

Next, let's create a variable to manage the state of speech recognition. To start, we'll declare a Semaphore at the class scope.

private static Semaphore stopTranslationWithFileSemaphore;

We'll subscribe to the events sent from the SpeechRecognizer.

  • recognizing: Signal for events containing intermediate recognition results.
  • recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • sessionStopped: Signal for events indicating the end of a recognition session (operation).
  • canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
// First initialize the semaphore.
stopTranslationWithFileSemaphore = new Semaphore(0);

recognizer.recognizing.addEventListener((s, e) -> {
    System.out.println("RECOGNIZING: Text=" + e.getResult().getText());
});

recognizer.recognized.addEventListener((s, e) -> {
    if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
        System.out.println("RECOGNIZED: Text=" + e.getResult().getText());
    }
    else if (e.getResult().getReason() == ResultReason.NoMatch) {
        System.out.println("NOMATCH: Speech could not be recognized.");
    }
});

recognizer.canceled.addEventListener((s, e) -> {
    System.out.println("CANCELED: Reason=" + e.getReason());

    if (e.getReason() == CancellationReason.Error) {
        System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
        System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
        System.out.println("CANCELED: Did you update the subscription info?");
    }

    stopTranslationWithFileSemaphore.release();
});

recognizer.sessionStopped.addEventListener((s, e) -> {
    System.out.println("\n    Session stopped event.");
    stopTranslationWithFileSemaphore.release();
});

With everything set up, we can call startContinuousRecognitionAsync.

// Starts continuous recognition. Uses stopContinuousRecognitionAsync() to stop recognition.
recognizer.startContinuousRecognitionAsync().get();

// Waits for completion.
stopTranslationWithFileSemaphore.acquire();

// Stops recognition.
recognizer.stopContinuousRecognitionAsync().get();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. This mode will cause the speech config instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enableDictation method on your SpeechConfig.

config.enableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to French. In your code, find your SpeechConfig, then add this line directly below it.

config.setSpeechRecognitionLanguage("fr-FR");

setSpeechRecognitionLanguage is a method that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Improve recognition accuracy

There are a few ways to improve recognition accuracy with the Speech SDK. Let's take a look at Phrase Lists. Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used if an exact match for the entire phrase is included in the audio. If an exact match to the phrase is not found, recognition is not assisted.

Important

The Phrase List feature is only available in English.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with addPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

PhraseListGrammar phraseList = PhraseListGrammar.fromRecognizer(recognizer);
phraseList.addPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseList.clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the JavaScript quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK for JavaScript. Depending on your platform, use the following instructions:

Additionally, depending on the target environment, use one of the following:

Download and extract the Speech SDK for JavaScript microsoft.cognitiveservices.speech.sdk.bundle.js file, and place it in a folder accessible to your HTML file.

<script src="microsoft.cognitiveservices.speech.sdk.bundle.js"></script>

Tip

If you're targeting a web browser and using the <script> tag, the sdk prefix is not needed. The sdk prefix is an alias used to name the require module.

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's take a look at how a SpeechConfig is created using a key and region. See the region support page to find your region identifier.

const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

Initialize a recognizer

After you've created a SpeechConfig, the next step is to initialize a SpeechRecognizer. When you initialize a SpeechRecognizer, you pass it your speechConfig. This provides the credentials that the speech service requires to validate your request.

const recognizer = new SpeechRecognizer(speechConfig);

Recognize from microphone or file

If you want to specify the audio input device, you need to create an AudioConfig and pass it as a parameter when initializing your SpeechRecognizer.

To recognize speech using your device microphone, create an AudioConfig using fromDefaultMicrophoneInput(), then pass the audio config when creating your SpeechRecognizer object.

const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new SpeechRecognizer(speechConfig, audioConfig);

If you want to recognize speech from an audio file instead of using a microphone, you still need to provide an AudioConfig. However, when you create the AudioConfig, instead of calling fromDefaultMicrophoneInput(), you call fromWavFileInput() and pass the filename parameter.

Important

Recognizing speech from a file is only supported in the Node.js SDK.

const audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
const recognizer = new SpeechRecognizer(speechConfig, audioConfig);

Recognize speech

The Recognizer class for the Speech SDK for JavaScript exposes a few methods that you can use for speech recognition.

  • Single-shot recognition (async) - Performs recognition in a non-blocking (asynchronous) mode. This will recognize a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
  • Continuous recognition (async) - Asynchronously initiates a continuous recognition operation. You subscribe to events and handle recognition results as they arrive. To stop asynchronous continuous recognition, call stopContinuousRecognitionAsync.

Note

Learn more about how to choose a speech recognition mode.

Single-shot recognition

Here's an example of asynchronous single-shot recognition using recognizeOnceAsync:

recognizer.recognizeOnceAsync(result => {
    // Interact with result
});

You'll need to write some code to handle the result. This sample evaluates the result.reason:

  • Prints the recognition result: ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: ResultReason.NoMatch
  • If an error is encountered, print the error message: ResultReason.Canceled
switch (result.reason) {
    case ResultReason.RecognizedSpeech:
        console.log(`RECOGNIZED: Text=${result.text}`);
        break;
    case ResultReason.NoMatch:
        console.log("NOMATCH: Speech could not be recognized.");
        break;
    case ResultReason.Canceled:
        const cancellation = CancellationDetails.fromResult(result);
        console.log(`CANCELED: Reason=${cancellation.reason}`);

        if (cancellation.reason == CancellationReason.Error) {
            console.log(`CANCELED: ErrorCode=${cancellation.ErrorCode}`);
            console.log(`CANCELED: ErrorDetails=${cancellation.errorDetails}`);
            console.log("CANCELED: Did you update the subscription info?");
        }
        break;
}

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed.

Let's start by initializing a SpeechRecognizer:

const recognizer = new SpeechRecognizer(speechConfig);

We'll subscribe to the events sent from the SpeechRecognizer.

  • recognizing: Signal for events containing intermediate recognition results.
  • recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • sessionStopped: Signal for events indicating the end of a recognition session (operation).
  • canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
recognizer.recognizing = (s, e) => {
    console.log(`RECOGNIZING: Text=${e.result.text}`);
};

recognizer.recognized = (s, e) => {
    if (e.result.reason == ResultReason.RecognizedSpeech) {
        console.log(`RECOGNIZED: Text=${e.result.text}`);
    }
    else if (e.result.reason == ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be recognized.");
    }
};

recognizer.canceled = (s, e) => {
    console.log(`CANCELED: Reason=${e.reason}`);

    if (e.reason == CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`);
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
        console.log("CANCELED: Did you update the subscription info?");
    }

    recognizer.stopContinuousRecognitionAsync();
};

recognizer.sessionStopped = (s, e) => {
    console.log("\n    Session stopped event.");
    recognizer.stopContinuousRecognitionAsync();
};

With everything set up, we can call startContinuousRecognitionAsync.

// Starts continuous recognition. Uses stopContinuousRecognitionAsync() to stop recognition.
recognizer.startContinuousRecognitionAsync();

// Something later can call this to stop recognition:
// recognizer.stopContinuousRecognitionAsync();

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. This mode will cause the speech config instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enableDictation method on your SpeechConfig.

speechConfig.enableDictation();

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to Italian. In your code, find your SpeechConfig, then add this line directly below it.

speechConfig.speechRecognitionLanguage = "it-IT";

The speechRecognitionLanguage property expects a language-locale format string. You can provide any value in the Locale column in the list of supported locales/languages.

Improve recognition accuracy

There are a few ways to improve recognition accuracy with the Speech SDK. Let's take a look at Phrase Lists. Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used if an exact match for the entire phrase is included in the audio. If an exact match to the phrase is not found, recognition is not assisted.

Important

The Phrase List feature is only available in English.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with addPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

const phraseList = PhraseListGrammar.fromRecognizer(recognizer);
phraseList.addPhrase("Supercalifragilisticexpialidocious");

If you need to clear your phrase list:

phraseList.clear();

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

You can transcribe speech into text using the Speech SDK for Swift and Objective-C.

Prerequisites

The following samples assume that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install Speech SDK and samples

The Cognitive Services Speech SDK contains samples written in Swift and Objective-C for iOS and Mac. Click a link to see installation instructions for each sample:

We also provide an online Speech SDK for Objective-C Reference.

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech SDK in your apps and products to perform high-quality speech-to-text conversion.

Skip to samples on GitHub

If you want to skip straight to sample code, see the Python quickstart samples on GitHub.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install and import the Speech SDK

Before you can do anything, you'll need to install the Speech SDK.

pip install azure-cognitiveservices-speech

If you're on macOS and run into install issues, you may need to run this command first.

python3 -m pip install --upgrade pip

After the Speech SDK is installed, import it into your Python project with this statement.

import azure.cognitiveservices.speech as speechsdk

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's take a look at how a SpeechConfig is created using a key and region. See the region support page to find your region identifier.

speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
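
If you're authenticating with an authorization token instead of a key, SpeechConfig accepts it directly. Here's a minimal sketch, assuming you've already obtained a token for your Speech resource (both values below are placeholders):

# Assumes you've already retrieved an authorization token for your resource.
speech_config = speechsdk.SpeechConfig(auth_token="YourAuthorizationToken", region="YourServiceRegion")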

Initialize a recognizer

After you've created a SpeechConfig, the next step is to initialize a SpeechRecognizer. When you initialize a SpeechRecognizer, you pass it your speech_config. This provides the credentials that the speech service requires to validate your request.

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

Recognize from microphone or file

If you want to specify the audio input device, you need to create an AudioConfig and pass it as a parameter when initializing your SpeechRecognizer.

To recognize speech using your device microphone, simply create a SpeechRecognizer without passing an AudioConfig.

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

Tip

If you want to reference a device by ID, create an AudioConfig using AudioConfig(device_name="<device id>"). Learn how to get the device ID for your audio input device.
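
For example, a minimal sketch that targets a specific input device might look like the following; the device ID string is a placeholder, and speech_config comes from the earlier configuration step:

# "<device id>" is a placeholder; substitute the ID of your audio input device.
audio_config = speechsdk.audio.AudioConfig(device_name="<device id>")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)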

If you want to recognize speech from an audio file instead of using a microphone, create an AudioConfig and use the filename parameter.

audio_input = speechsdk.AudioConfig(filename="your_file_name.wav")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

Recognize speech

The Recognizer class for the Speech SDK for Python exposes a few methods that you can use for speech recognition.

  • Single-shot recognition (sync) - Performs recognition in a blocking (synchronous) mode. Returns after a single utterance is recognized. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed. The task returns the recognition text as the result.
  • Single-shot recognition (async) - Performs recognition in a non-blocking (asynchronous) mode. This will recognize a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
  • Continuous recognition (sync) - Synchronously initiates continuous recognition. The client must connect to EventSignal to receive recognition results. To stop recognition, call stop_continuous_recognition().
  • Continuous recognition (async) - Asynchronously initiates continuous recognition operation. User has to connect to EventSignal to receive recognition results. To stop asynchronous continuous recognition, call stop_continuous_recognition().

Note

Learn more about how to choose a speech recognition mode.

Single-shot recognition

Here's an example of synchronous single-shot recognition using recognize_once():

result = speech_recognizer.recognize_once()

Here's an example of asynchronous single-shot recognition using recognize_once_async(). This method returns a future object; call get() on it to wait for and retrieve the recognition result:

result = speech_recognizer.recognize_once_async().get()

Regardless of whether you've used the synchronous or asynchronous method, you'll need to write some code to handle the result. This sample evaluates the result.reason:

  • Prints the recognition result: speechsdk.ResultReason.RecognizedSpeech
  • If there is no recognition match, inform the user: speechsdk.ResultReason.NoMatch
  • If an error is encountered, print the error message: speechsdk.ResultReason.Canceled
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to connect to the EventSignal to get the recognition results, and to stop recognition, you must call stop_continuous_recognition() or stop_continuous_recognition_async(). Here's an example of how continuous recognition is performed on an audio input file.

Let's start by defining the input and initializing a SpeechRecognizer:

audio_config = speechsdk.audio.AudioConfig(filename="YourAudioFile.wav")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

Next, let's create a variable to manage the state of speech recognition. To start, we'll set this to False, since at the start of recognition we can safely assume that it's not finished.

done = False

Now, we're going to create a callback to stop continuous recognition when an evt is received. There are a few things to keep in mind.

  • When an evt is received, the evt message is printed.
  • After an evt is received, stop_continuous_recognition() is called to stop recognition.
  • The recognition state is changed to True.
def stop_cb(evt):
    print('CLOSING on {}'.format(evt))
    speech_recognizer.stop_continuous_recognition()
    nonlocal done
    done = True

This code sample shows how to connect callbacks to events sent from the SpeechRecognizer.

  • recognizing: Signal for events containing intermediate recognition results.
  • recognized: Signal for events containing final recognition results (indicating a successful recognition attempt).
  • session_started: Signal for events indicating the start of a recognition session (operation).
  • session_stopped: Signal for events indicating the end of a recognition session (operation).
  • canceled: Signal for events containing canceled recognition results (indicating a recognition attempt that was canceled as a result of a direct cancellation request or, alternatively, a transport or protocol failure).
speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

speech_recognizer.session_stopped.connect(stop_cb)
speech_recognizer.canceled.connect(stop_cb)

With everything set up, we can call start_continuous_recognition().

import time

speech_recognizer.start_continuous_recognition()
while not done:
    time.sleep(.5)

Dictation mode

When using continuous recognition, you can enable dictation processing by using the corresponding "enable dictation" function. This mode will cause the speech config instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enable_dictation() method on your SpeechConfig.

speech_config.enable_dictation()

Change source language

A common task for speech recognition is specifying the input (or source) language. Let's take a look at how you would change the input language to German. In your code, find your SpeechConfig, then add this line directly below it.

speech_config.speech_recognition_language="de-DE"

speech_recognition_language is a property that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Improve recognition accuracy

There are a few ways to improve recognition accuracy with the Speech SDK. Let's take a look at Phrase Lists. Phrase Lists are used to identify known phrases in audio data, like a person's name or a specific location. Single words or complete phrases can be added to a Phrase List. During recognition, an entry in a phrase list is used if an exact match for the entire phrase is included in the audio. If an exact match to the phrase is not found, recognition is not assisted.

Important

The Phrase List feature is only available in English.

To use a phrase list, first create a PhraseListGrammar object, then add specific words and phrases with addPhrase.

Any changes to PhraseListGrammar take effect on the next recognition or after a reconnection to the Speech service.

phrase_list_grammar = speechsdk.PhraseListGrammar.from_recognizer(speech_recognizer)
phrase_list_grammar.addPhrase("Supercalifragilisticexpialidocious")

If you need to clear your phrase list:

phrase_list_grammar.clear()

Other options to improve recognition accuracy

Phrase lists are only one option to improve recognition accuracy. You can also:

In this quickstart, you learn how to convert speech to text using the Speech service and cURL.

For a high-level look at Speech-to-Text concepts, see the overview article.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Convert speech to text

At a command prompt, run the following command. You will need to insert the following values into the command.

  • Your Speech service subscription key.
  • Your Speech service region.
  • The input audio file path. You can generate audio files using text-to-speech.
curl --location --request POST 'https://INSERT_REGION_HERE.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_AUDIO_FILE_PATH_HERE'

You should receive a response like the following one.

{
    "RecognitionStatus": "Success",
    "DisplayText": "My voice is my passport, verify me.",
    "Offset": 6600000,
    "Duration": 32100000
}

For more information, see the speech-to-text REST API reference.

One of the core features of the Speech service is the ability to recognize and transcribe human speech (often referred to as speech-to-text). In this quickstart, you learn how to use the Speech CLI in your apps and products to perform high-quality speech-to-text conversion.

Download and install

Note

On Windows, you need the Microsoft Visual C++ Redistributable for Visual Studio 2019 for your platform. Installing this for the first time may require you to restart Windows.

Follow these steps to install the Speech CLI on Windows:

  1. Download the Speech CLI zip archive, then extract it.
  2. Go to the root directory spx-zips that you extracted from the download, and extract the subdirectory that you need (spx-net471 for .NET Framework 4.7, or spx-netcore-win-x64 for .NET Core 3.0 on an x64 CPU).

In the command prompt, change directory to this location, and then type spx to see help for the Speech CLI.

Note

On Windows, the Speech CLI can only show fonts available to the command prompt on the local computer. Windows Terminal supports all fonts produced interactively by the Speech CLI. If you output to a file, a text editor like Notepad or a web browser like Microsoft Edge can also show all fonts.

Note

PowerShell does not check the local directory when looking for a command. In PowerShell, change directory to the location of spx and call the tool by entering .\spx. If you add this directory to your path, PowerShell and the Windows command prompt will find spx from any directory without including the .\ prefix.

Create subscription config

To start using the Speech CLI, you first need to enter your Speech subscription key and region information. See the region support page to find your region identifier. Once you have your subscription key and region identifier (for example, eastus or westus), run the following commands.

spx config @key --set SUBSCRIPTION-KEY
spx config @region --set REGION

Your subscription authentication is now stored for future SPX requests. If you need to remove either of these stored values, run spx config @region --clear or spx config @key --clear.

Speech-to-text from microphone

Plug in and turn on your PC microphone, and turn off any apps that might also use the microphone. Some computers have a built-in microphone, while others require configuration of a Bluetooth device.

Now you're ready to run the Speech CLI to recognize speech from your microphone. From the command line, change to the directory that contains the Speech CLI binary file, and run the following command.

spx recognize --microphone

Note

The Speech CLI defaults to English. You can choose a different language from the Speech-to-text table. For example, add --source de-DE to recognize German speech.
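
For example, to recognize German speech from your microphone, the command above becomes:

spx recognize --microphone --source de-DE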

Speak into the microphone, and you'll see your words transcribed into text in real time. The Speech CLI stops after a period of silence, or when you press Ctrl+C.

Speech-to-text from audio file

The Speech CLI can recognize speech in many file formats and natural languages. In this example, you can use any WAV file (16kHz or 8kHz, 16-bit, and mono PCM) that contains English speech. Or if you want a quick sample, download the whatstheweatherlike.wav file and copy it to the same directory as the Speech CLI binary file.

Now you're ready to run the Speech CLI to recognize speech found in the audio file by running the following command.

spx recognize --file whatstheweatherlike.wav

Note

The Speech CLI defaults to English. You can choose a different language from the Speech-to-text table. For example, add --source de-DE to recognize German speech.

The Speech CLI will show a text transcription of the speech on the screen.

Next steps