How to recognize speech

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Create a speech configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechConfig instance. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token.

Create a SpeechConfig instance by using your key and location/region. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
    }
}

You can initialize SpeechConfig in a few other ways, as shown in the sketch after this list:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region/location.
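
Here's a minimal sketch of those alternatives. The endpoint and host URIs are example formats, and the token is a placeholder, so substitute your own values:

using System;
using Microsoft.CognitiveServices.Speech;

class Program 
{
    static void Main(string[] args)
    {
        // From an endpoint (example speech-to-text WebSocket endpoint format; replace westus with your region).
        var configFromEndpoint = SpeechConfig.FromEndpoint(
            new Uri("wss://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"),
            "<paste-your-speech-key-here>");

        // From a host address, for example a self-hosted Speech container.
        var configFromHost = SpeechConfig.FromHost(new Uri("ws://localhost:5000"));

        // From an authorization token and the associated location/region.
        var configFromToken = SpeechConfig.FromAuthorizationToken(
            "<paste-your-authorization-token-here>",
            "<paste-your-speech-location/region-here>");
    }
}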

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize speech from a microphone

To recognize speech by using your device microphone, create an AudioConfig instance by using FromDefaultMicrophoneInput(). Then initialize SpeechRecognizer by passing audioConfig and speechConfig.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromMic(SpeechConfig speechConfig)
    {
        using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        Console.WriteLine("Speak into your microphone.");
        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
        await FromMic(speechConfig);
    }
}

If you want to use a specific audio input device, you need to specify the device ID in AudioConfig. Learn how to get the device ID for your audio input device.
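
For example, here's a minimal sketch that selects a device by ID:

using Microsoft.CognitiveServices.Speech.Audio;

// "<device-id>" is a placeholder for your audio input device's ID.
using var audioConfig = AudioConfig.FromMicrophoneInput("<device-id>");

Pass this audioConfig to the SpeechRecognizer constructor exactly as in the previous example.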

Recognize speech from a file

If you want to recognize speech from an audio file instead of a microphone, you still need to create an AudioConfig instance. But instead of calling FromDefaultMicrophoneInput(), you call FromWavFileInput() and pass the file path:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromFile(SpeechConfig speechConfig)
    {
        using var audioConfig = AudioConfig.FromWavFileInput("PathToFile.wav");
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
        await FromFile(speechConfig);
    }
}

Recognize speech from an in-memory stream

For many use cases, it's likely that your audio data will come from blob storage or will otherwise already be in memory as a byte[] instance or a similar raw data structure. The following example uses PushAudioInputStream to recognize speech, which is essentially an abstracted memory stream. The sample code does the following:

  • Writes raw audio data (PCM) to PushAudioInputStream by using the Write() function, which accepts a byte[] instance.
  • Reads a .wav file by using BinaryReader for demonstration purposes. If you already have audio data in a byte[] instance, you can skip directly to writing the content to the input stream.
  • The default format is 16-bit, 16-kHz mono PCM. To customize the format, you can pass an AudioStreamFormat object to CreatePushStream() by using the static function AudioStreamFormat.GetWaveFormatPCM(sampleRate, (byte)bitRate, (byte)channels).
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromStream(SpeechConfig speechConfig)
    {
        var reader = new BinaryReader(File.OpenRead("PathToFile.wav"));
        using var audioInputStream = AudioInputStream.CreatePushStream();
        using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        byte[] readBytes;
        do
        {
            readBytes = reader.ReadBytes(1024);
            audioInputStream.Write(readBytes, readBytes.Length);
        } while (readBytes.Length > 0);

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
        await FromStream(speechConfig);
    }
}

Using a push stream as input assumes that the audio data is raw PCM and skips any headers. The API will still work in certain cases if the header has not been skipped. But for the best results, consider implementing logic to read off the headers so that byte[] starts at the start of the audio data, as in the sketch that follows.
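
This is a rough sketch only: it assumes a standard 44-byte RIFF/WAV header (real files can differ), uses an 8-kHz, 16-bit, mono source to show a non-default format, and reuses the same using directives, Main wiring, and speechConfig as the previous example.

async static Task FromStreamSkippingHeader(SpeechConfig speechConfig)
{
    // Assumption: 8-kHz, 16-bit, mono PCM audio behind a standard 44-byte WAV header.
    var format = AudioStreamFormat.GetWaveFormatPCM(8000, 16, 1);
    using var audioInputStream = AudioInputStream.CreatePushStream(format);
    using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
    using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

    using var reader = new BinaryReader(File.OpenRead("PathToFile.wav"));
    reader.ReadBytes(44); // Skip the header so that only raw PCM data is pushed.

    byte[] readBytes;
    do
    {
        readBytes = reader.ReadBytes(1024);
        audioInputStream.Write(readBytes, readBytes.Length);
    } while (readBytes.Length > 0);

    var result = await recognizer.RecognizeOnceAsync();
    Console.WriteLine($"RECOGNIZED: Text={result.Text}");
}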

Handle errors

The previous examples simply get the recognized text from result.Text. To handle errors and other responses, you need to write some code to handle the result. The following code evaluates the result.Reason property and:

  • Prints the recognition result: ResultReason.RecognizedSpeech.
  • If there is no recognition match, informs the user: ResultReason.NoMatch.
  • If an error is encountered, prints the error message: ResultReason.Canceled.
switch (result.Reason)
{
    case ResultReason.RecognizedSpeech:
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
        break;
    case ResultReason.NoMatch:
        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
        break;
    case ResultReason.Canceled:
        var cancellation = CancellationDetails.FromResult(result);
        Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

        if (cancellation.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={cancellation.ErrorDetails}");
            Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
        }
        break;
}

Use continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, you use continuous recognition when you want to control when to stop recognizing. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call StopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing SpeechRecognizer:

using var audioConfig = AudioConfig.FromWavFileInput("YourAudioFile.wav");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

Then create a TaskCompletionSource<int> instance to manage the state of speech recognition:

var stopRecognition = new TaskCompletionSource<int>();

Next, subscribe to the events that SpeechRecognizer sends:

  • Recognizing: Signal for events that contain intermediate recognition results.
  • Recognized: Signal for events that contain final recognition results, which indicate a successful recognition attempt.
  • SessionStopped: Signal for events that indicate the end of a recognition session (operation).
  • Canceled: Signal for events that contain canceled recognition results. These results indicate a recognition attempt that was canceled as a result of a direct cancellation request. Alternatively, they indicate a transport or protocol failure.
recognizer.Recognizing += (s, e) =>
{
    Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");
};

recognizer.Recognized += (s, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
    }
    else if (e.Result.Reason == ResultReason.NoMatch)
    {
        Console.WriteLine($"NOMATCH: Speech could not be recognized.");
    }
};

recognizer.Canceled += (s, e) =>
{
    Console.WriteLine($"CANCELED: Reason={e.Reason}");

    if (e.Reason == CancellationReason.Error)
    {
        Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
        Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
        Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
    }

    stopRecognition.TrySetResult(0);
};

recognizer.SessionStopped += (s, e) =>
{
    Console.WriteLine("\n    Session stopped event.");
    stopRecognition.TrySetResult(0);
};

With everything set up, call StartContinuousRecognitionAsync to start recognizing:

await recognizer.StartContinuousRecognitionAsync();

// Waits for completion. Use Task.WaitAny to keep the task rooted.
Task.WaitAny(new[] { stopRecognition.Task });

// Make the following call at some point to stop recognition:
// await recognizer.StopContinuousRecognitionAsync();

Dictation mode

When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the EnableDictation method on SpeechConfig:

speechConfig.EnableDictation();

Change the source language

A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, find your SpeechConfig instance and add this line directly below it:

speechConfig.SpeechRecognitionLanguage = "it-IT";

The SpeechRecognitionLanguage property expects a language-locale format string. You can provide any value in the Locale column in the list of supported locales/languages.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
speechConfig.EndpointId = "YourEndpointId";
var speechRecognizer = new SpeechRecognizer(speechConfig);

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig instance. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token.

Create a SpeechConfig instance by using your key and region. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource.

using namespace std;
using namespace Microsoft::CognitiveServices::Speech;

auto config = SpeechConfig::FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

You can initialize SpeechConfig in a few other ways:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize speech from a microphone

To recognize speech by using your device microphone, create an AudioConfig instance by using FromDefaultMicrophoneInput(). Then initialize SpeechRecognizer by passing audioConfig and config.

using namespace Microsoft::CognitiveServices::Speech::Audio;

auto audioConfig = AudioConfig::FromDefaultMicrophoneInput();
auto recognizer = SpeechRecognizer::FromConfig(config, audioConfig);

cout << "Speak into your microphone." << std::endl;
auto result = recognizer->RecognizeOnceAsync().get();
cout << "RECOGNIZED: Text=" << result->Text << std::endl;

If you want to use a specific audio input device, you need to specify the device ID in AudioConfig. Learn how to get the device ID for your audio input device.
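
As a minimal sketch, assuming you already have the device ID string and the config instance from earlier:

// "<device-id>" is a placeholder for your audio input device's ID.
auto audioConfig = AudioConfig::FromMicrophoneInput("<device-id>");
auto recognizer = SpeechRecognizer::FromConfig(config, audioConfig);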

Recognize speech from a file

If you want to recognize speech from an audio file instead of using a microphone, you still need to create an AudioConfig instance. But instead of calling FromDefaultMicrophoneInput(), you call FromWavFileInput() and pass the file path:

using namespace Microsoft::CognitiveServices::Speech::Audio;

auto audioInput = AudioConfig::FromWavFileInput("YourAudioFile.wav");
auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

auto result = recognizer->RecognizeOnceAsync().get();
cout << "RECOGNIZED: Text=" << result->Text << std::endl;

Recognize speech by using the Recognizer class

The Recognizer class for the Speech SDK for C++ exposes a few methods that you can use for speech recognition.

Single-shot recognition

Single-shot recognition asynchronously recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed. Here's an example of asynchronous single-shot recognition via RecognizeOnceAsync:

auto result = recognizer->RecognizeOnceAsync().get();

You need to write some code to handle the result. This sample evaluates result->Reason and:

  • Prints the recognition result: ResultReason::RecognizedSpeech.
  • If there is no recognition match, informs the user: ResultReason::NoMatch.
  • If an error is encountered, prints the error message: ResultReason::Canceled.
switch (result->Reason)
{
    case ResultReason::RecognizedSpeech:
        cout << "We recognized: " << result->Text << std::endl;
        break;
    case ResultReason::NoMatch:
        cout << "NOMATCH: Speech could not be recognized." << std::endl;
        break;
    case ResultReason::Canceled:
        {
            auto cancellation = CancellationDetails::FromResult(result);
            cout << "CANCELED: Reason=" << (int)cancellation->Reason << std::endl;
    
            if (cancellation->Reason == CancellationReason::Error) {
                cout << "CANCELED: ErrorCode=" << (int)cancellation->ErrorCode << std::endl;
                cout << "CANCELED: ErrorDetails=" << cancellation->ErrorDetails << std::endl;
                cout << "CANCELED: Did you set the speech resource key and region values?" << std::endl;
            }
        }
        break;
    default:
        break;
}

Continuous recognition

Continuous recognition is a bit more involved than single-shot recognition. It requires you to subscribe to the Recognizing, Recognized, and Canceled events to get the recognition results. To stop recognition, you must call StopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing SpeechRecognizer:

auto audioInput = AudioConfig::FromWavFileInput("YourAudioFile.wav");
auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);

Next, create a variable to manage the state of speech recognition. Declare promise<void> because at the start of recognition, you can safely assume that it's not finished:

promise<void> recognitionEnd;

Next, subscribe to the events that SpeechRecognizer sends:

  • Recognizing: Signal for events that contain intermediate recognition results.
  • Recognized: Signal for events that contain final recognition results, which indicate a successful recognition attempt.
  • SessionStopped: Signal for events that indicate the end of a recognition session (operation).
  • Canceled: Signal for events that contain canceled recognition results. These results indicate a recognition attempt that was canceled as a result of a direct cancellation request. Alternatively, they indicate a transport or protocol failure.
recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl;
    });

recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        cout << "CANCELED: Reason=" << (int)e.Reason << std::endl;
        if (e.Reason == CancellationReason::Error)
        {
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << "\n"
                 << "CANCELED: ErrorDetails=" << e.ErrorDetails << "\n"
                 << "CANCELED: Did you set the speech resource key and region values?" << std::endl;

            recognitionEnd.set_value(); // Notify to stop recognition.
        }
    });

recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

With everything set up, call StartContinuousRecognitionAsync to start recognizing:

// Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
recognizer->StartContinuousRecognitionAsync().get();

// Waits for recognition end.
recognitionEnd.get_future().get();

// Stops recognition.
recognizer->StopContinuousRecognitionAsync().get();

Dictation mode

When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the EnableDictation method on SpeechConfig:

config->EnableDictation();

Change the source language

A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to German. In your code, find your SpeechConfig instance and add this line directly below it:

config->SetSpeechRecognitionLanguage("de-DE");

SetSpeechRecognitionLanguage is a method that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

auto speechConfig = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
speechConfig->SetEndpointId("YourEndpointId");
auto speechRecognizer = SpeechRecognizer::FromConfig(speechConfig);

Reference documentation | Package (Go) | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Recognize speech-to-text from a microphone

Use the following code sample to run speech recognition from your default device microphone. Replace the variables subscription and region with your speech key and location/region, respectively. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource. Running the script will start a recognition session on your default microphone and output text.

package main

import (
	"bufio"
	"fmt"
	"os"

	"github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
	"github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)

func sessionStartedHandler(event speech.SessionEventArgs) {
	defer event.Close()
	fmt.Println("Session Started (ID=", event.SessionID, ")")
}

func sessionStoppedHandler(event speech.SessionEventArgs) {
	defer event.Close()
	fmt.Println("Session Stopped (ID=", event.SessionID, ")")
}

func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognizing:", event.Result.Text)
}

func recognizedHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognized:", event.Result.Text)
}

func cancelledHandler(event speech.SpeechRecognitionCanceledEventArgs) {
	defer event.Close()
	fmt.Println("Received a cancellation: ", event.ErrorDetails)
	fmt.Println("Did you set the speech resource key and region values?")
}

func main() {
	subscription := "<paste-your-speech-key-here>"
	region := "<paste-your-speech-location/region-here>"

	audioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer audioConfig.Close()
	config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer config.Close()
	speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer speechRecognizer.Close()
	speechRecognizer.SessionStarted(sessionStartedHandler)
	speechRecognizer.SessionStopped(sessionStoppedHandler)
	speechRecognizer.Recognizing(recognizingHandler)
	speechRecognizer.Recognized(recognizedHandler)
	speechRecognizer.Canceled(cancelledHandler)
	speechRecognizer.StartContinuousRecognitionAsync()
	defer speechRecognizer.StopContinuousRecognitionAsync()
	bufio.NewReader(os.Stdin).ReadBytes('\n')
}

Run the following commands to create a go.mod file that links to components hosted on GitHub:

go mod init quickstart
go get github.com/Microsoft/cognitive-services-speech-sdk-go

Now build and run the code:

go build
go run quickstart

For detailed information, see the reference content for the SpeechConfig class and the reference content for the SpeechRecognizer class.

Recognize speech-to-text from an audio file

Use the following sample to run speech recognition from an audio file. Replace the variables subscription and region with your speech key and location/region, respectively. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource. Additionally, replace the variable file with a path to a .wav file. Running the script will recognize speech from the file and output the text result.

package main

import (
	"fmt"
	"time"

	"github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
	"github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)

func main() {
	subscription := "<paste-your-speech-key-here>"
	region := "<paste-your-speech-location/region-here>"
	file := "path/to/file.wav"

	audioConfig, err := audio.NewAudioConfigFromWavFileInput(file)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer audioConfig.Close()
	config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer config.Close()
	speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
	if err != nil {
		fmt.Println("Got an error: ", err)
		return
	}
	defer speechRecognizer.Close()
	speechRecognizer.SessionStarted(func(event speech.SessionEventArgs) {
		defer event.Close()
		fmt.Println("Session Started (ID=", event.SessionID, ")")
	})
	speechRecognizer.SessionStopped(func(event speech.SessionEventArgs) {
		defer event.Close()
		fmt.Println("Session Stopped (ID=", event.SessionID, ")")
	})

	task := speechRecognizer.RecognizeOnceAsync()
	var outcome speech.SpeechRecognitionOutcome
	select {
	case outcome = <-task:
	case <-time.After(5 * time.Second):
		fmt.Println("Timed out")
		return
	}
	defer outcome.Close()
	if outcome.Error != nil {
		fmt.Println("Got an error: ", outcome.Error)
		return
	}
	fmt.Println("Got a recognition!")
	fmt.Println(outcome.Result.Text)
}

Run the following commands to create a go.mod file that links to components hosted on GitHub:

go mod init quickstart
go get github.com/Microsoft/cognitive-services-speech-sdk-go

Now build and run the code:

go build
go run quickstart

For detailed information, see the reference content for the SpeechConfig class and the reference content for the SpeechRecognizer class.

Reference documentation | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Create a speech configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechConfig instance. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token.

Create a SpeechConfig instance by using your key and location/region. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource.

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Program {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
    }
}

You can initialize SpeechConfig in a few other ways, as shown in the sketch after this list:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.
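
Here's a hedged sketch of those alternatives; the endpoint and host URIs are example formats, and the token is a placeholder. It assumes the imports from the earlier examples plus java.net.URI:

// From an endpoint (example speech-to-text WebSocket endpoint format; replace westus with your region).
SpeechConfig configFromEndpoint = SpeechConfig.fromEndpoint(
    URI.create("wss://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"),
    "<paste-your-subscription-key>");

// From a host address, for example a self-hosted Speech container.
SpeechConfig configFromHost = SpeechConfig.fromHost(URI.create("ws://localhost:5000"));

// From an authorization token and the associated region.
SpeechConfig configFromToken = SpeechConfig.fromAuthorizationToken("<paste-your-authorization-token>", "<paste-your-region>");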

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize speech from a microphone

To recognize speech by using your device microphone, create an AudioConfig instance by using fromDefaultMicrophoneInput(). Then initialize SpeechRecognizer by passing audioConfig and config.

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Program {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
        fromMic(speechConfig);
    }

    public static void fromMic(SpeechConfig speechConfig) throws InterruptedException, ExecutionException {
        AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput();
        SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        System.out.println("Speak into your microphone.");
        Future<SpeechRecognitionResult> task = recognizer.recognizeOnceAsync();
        SpeechRecognitionResult result = task.get();
        System.out.println("RECOGNIZED: Text=" + result.getText());
    }
}

If you want to use a specific audio input device, you need to specify the device ID in AudioConfig. Learn how to get the device ID for your audio input device.

Recognize speech from a file

If you want to recognize speech from an audio file instead of using a microphone, you still need to create an AudioConfig instance. But instead of calling fromDefaultMicrophoneInput(), you call fromWavFileInput() and pass the file path:

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Program {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
        fromFile(speechConfig);
    }

    public static void fromFile(SpeechConfig speechConfig) throws InterruptedException, ExecutionException {
        AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
        SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        
        Future<SpeechRecognitionResult> task = recognizer.recognizeOnceAsync();
        SpeechRecognitionResult result = task.get();
        System.out.println("RECOGNIZED: Text=" + result.getText());
    }
}

Handle errors

The previous examples simply get the recognized text by using result.getText(). To handle errors and other responses, you need to write some code to handle the result. The following example evaluates result.getReason() and:

  • Prints the recognition result: ResultReason.RecognizedSpeech.
  • If there is no recognition match, informs the user: ResultReason.NoMatch.
  • If an error is encountered, prints the error message: ResultReason.Canceled.
switch (result.getReason()) {
    case ResultReason.RecognizedSpeech:
        System.out.println("We recognized: " + result.getText());
        exitCode = 0;
        break;
    case ResultReason.NoMatch:
        System.out.println("NOMATCH: Speech could not be recognized.");
        break;
    case ResultReason.Canceled: {
            CancellationDetails cancellation = CancellationDetails.fromResult(result);
            System.out.println("CANCELED: Reason=" + cancellation.getReason());

            if (cancellation.getReason() == CancellationReason.Error) {
                System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
                System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
                System.out.println("CANCELED: Did you set the speech resource key and region values?");
            }
        }
        break;
}

Use continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, you use continuous recognition when you want to control when to stop recognizing. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing SpeechRecognizer:

AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
SpeechRecognizer recognizer = new SpeechRecognizer(config, audioConfig);

Next, create a variable to manage the state of speech recognition. Declare a Semaphore instance at the class scope:

private static Semaphore stopTranslationWithFileSemaphore;

Next, subscribe to the events that SpeechRecognizer sends:

  • recognizing: Signal for events that contain intermediate recognition results.
  • recognized: Signal for events that contain final recognition results, which indicate a successful recognition attempt.
  • sessionStopped: Signal for events that indicate the end of a recognition session (operation).
  • canceled: Signal for events that contain canceled recognition results. These results indicate a recognition attempt that was canceled as a result of a direct cancellation request. Alternatively, they indicate a transport or protocol failure.
// First initialize the semaphore.
stopTranslationWithFileSemaphore = new Semaphore(0);

recognizer.recognizing.addEventListener((s, e) -> {
    System.out.println("RECOGNIZING: Text=" + e.getResult().getText());
});

recognizer.recognized.addEventListener((s, e) -> {
    if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
        System.out.println("RECOGNIZED: Text=" + e.getResult().getText());
    }
    else if (e.getResult().getReason() == ResultReason.NoMatch) {
        System.out.println("NOMATCH: Speech could not be recognized.");
    }
});

recognizer.canceled.addEventListener((s, e) -> {
    System.out.println("CANCELED: Reason=" + e.getReason());

    if (e.getReason() == CancellationReason.Error) {
        System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
        System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
        System.out.println("CANCELED: Did you set the speech resource key and region values?");
    }

    stopTranslationWithFileSemaphore.release();
});

recognizer.sessionStopped.addEventListener((s, e) -> {
    System.out.println("\n    Session stopped event.");
    stopTranslationWithFileSemaphore.release();
});

With everything set up, call startContinuousRecognitionAsync to start recognizing:

// Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
recognizer.startContinuousRecognitionAsync().get();

// Waits for completion.
stopTranslationWithFileSemaphore.acquire();

// Stops recognition.
recognizer.stopContinuousRecognitionAsync().get();

Dictation mode

When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enableDictation method on SpeechConfig:

config.enableDictation();

Change the source language

A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to French. In your code, find your SpeechConfig instance and add this line directly below it:

config.setSpeechRecognitionLanguage("fr-FR");

setSpeechRecognitionLanguage is a method that takes a string as an argument. You can provide any value in the list of supported locales/languages.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
speechConfig.setEndpointId("YourEndpointId");
SpeechRecognizer speechRecognizer = new SpeechRecognizer(speechConfig);

Reference documentation | Package (npm) | Additional Samples on GitHub | Library source code

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Create a speech configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechConfig instance. This class includes information about your subscription, like your key and associated location/region, endpoint, host, or authorization token.

Create a SpeechConfig instance by using your key and location/region. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource.

const speechConfig = sdk.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

You can initialize SpeechConfig in a few other ways:

  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated location/region.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize speech from a microphone

Recognizing speech from a microphone is not supported in Node.js. It's supported only in a browser-based JavaScript environment. For more information, see the React sample and the implementation of speech-to-text from a microphone on GitHub. The React sample shows design patterns for the exchange and management of authentication tokens. It also shows the capture of audio from a microphone or file for speech-to-text conversions.
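
For orientation, here's a minimal browser-only sketch (it won't run in Node.js). It assumes the Speech SDK browser bundle is loaded as the global SpeechSDK and that the page has microphone permission:

// Browser only: microphone capture is not supported in Node.js.
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
const audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

recognizer.recognizeOnceAsync(result => {
    console.log(`RECOGNIZED: Text=${result.text}`);
    recognizer.close();
});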

Note

If you want to use a specific audio input device, you need to specify the device ID in AudioConfig. Learn how to get the device ID for your audio input device.

Recognize speech from a file

To recognize speech from an audio file, create an AudioConfig instance by using fromWavFileInput(), which accepts a Buffer object. Then initialize SpeechRecognizer by passing audioConfig and speechConfig.

const fs = require('fs');
const sdk = require("microsoft-cognitiveservices-speech-sdk");
const speechConfig = sdk.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

function fromFile() {
    let audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("YourAudioFile.wav"));
    let recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

    recognizer.recognizeOnceAsync(result => {
        console.log(`RECOGNIZED: Text=${result.text}`);
        recognizer.close();
    });
}
fromFile();

Recognize speech from an in-memory stream

For many use cases, your audio data will likely come from blob storage, or it will already be in memory as an ArrayBuffer or a similar raw data structure. The following code:

  • Creates a push stream by using createPushStream().
  • Reads a .wav file by using fs.createReadStream for demonstration purposes. If you already have audio data in ArrayBuffer, you can skip directly to writing the content to the input stream.
  • Creates an audio configuration by using the push stream.
const fs = require('fs');
const sdk = require("microsoft-cognitiveservices-speech-sdk");
const speechConfig = sdk.SpeechConfig.fromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");

function fromStream() {
    let pushStream = sdk.AudioInputStream.createPushStream();

    fs.createReadStream("YourAudioFile.wav").on('data', function(arrayBuffer) {
        pushStream.write(arrayBuffer.slice());
    }).on('end', function() {
        pushStream.close();
    });
 
    let audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
    let recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);
    recognizer.recognizeOnceAsync(result => {
        console.log(`RECOGNIZED: Text=${result.text}`);
        recognizer.close();
    });
}
fromStream();

Using a push stream as input assumes that the audio data is raw PCM and skips any headers. The API will still work in certain cases if the header has not been skipped. But for the best results, consider implementing logic to read off the headers so that the stream starts at the start of the audio data, as in the sketch that follows.
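
One hedged way to do that, assuming a standard 44-byte RIFF/WAV header (real files can differ) and reusing pushStream from the previous sample, is to start reading past the header:

// Assumption: "YourAudioFile.wav" has a standard 44-byte RIFF/WAV header.
fs.createReadStream("YourAudioFile.wav", { start: 44 }).on('data', function(arrayBuffer) {
    pushStream.write(arrayBuffer.slice());
}).on('end', function() {
    pushStream.close();
});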

Handle errors

The previous examples simply get the recognized text from result.text. To handle errors and other responses, you need to write some code to handle the result. The following code evaluates the result.reason property and:

  • Prints the recognition result: ResultReason.RecognizedSpeech.
  • If there is no recognition match, informs the user: ResultReason.NoMatch.
  • If an error is encountered, prints the error message: ResultReason.Canceled.
switch (result.reason) {
    case sdk.ResultReason.RecognizedSpeech:
        console.log(`RECOGNIZED: Text=${result.text}`);
        break;
    case sdk.ResultReason.NoMatch:
        console.log("NOMATCH: Speech could not be recognized.");
        break;
    case sdk.ResultReason.Canceled:
        const cancellation = sdk.CancellationDetails.fromResult(result);
        console.log(`CANCELED: Reason=${cancellation.reason}`);

        if (cancellation.reason == sdk.CancellationReason.Error) {
            console.log(`CANCELED: ErrorCode=${cancellation.ErrorCode}`);
            console.log(`CANCELED: ErrorDetails=${cancellation.errorDetails}`);
            console.log("CANCELED: Did you set the speech resource key and region values?");
        }
        break;
}

Use continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, you use continuous recognition when you want to control when to stop recognizing. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop recognition, you must call stopContinuousRecognitionAsync. Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing SpeechRecognizer:

const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

Next, subscribe to the events sent from SpeechRecognizer:

  • recognizing: Signal for events that contain intermediate recognition results.
  • recognized: Signal for events that contain final recognition results, which indicate a successful recognition attempt.
  • sessionStopped: Signal for events that indicate the end of a recognition session (operation).
  • canceled: Signal for events that contain canceled recognition results. These results indicate a recognition attempt that was canceled as a result of a direct cancellation request. Alternatively, they indicate a transport or protocol failure.
recognizer.recognizing = (s, e) => {
    console.log(`RECOGNIZING: Text=${e.result.text}`);
};

recognizer.recognized = (s, e) => {
    if (e.result.reason == sdk.ResultReason.RecognizedSpeech) {
        console.log(`RECOGNIZED: Text=${e.result.text}`);
    }
    else if (e.result.reason == sdk.ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be recognized.");
    }
};

recognizer.canceled = (s, e) => {
    console.log(`CANCELED: Reason=${e.reason}`);

    if (e.reason == sdk.CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`);
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
        console.log("CANCELED: Did you set the speech resource key and region values?");
    }

    recognizer.stopContinuousRecognitionAsync();
};

recognizer.sessionStopped = (s, e) => {
    console.log("\n    Session stopped event.");
    recognizer.stopContinuousRecognitionAsync();
};

With everything set up, call startContinuousRecognitionAsync to start recognizing:

recognizer.startContinuousRecognitionAsync();

// Make the following call at some point to stop recognition:
// recognizer.stopContinuousRecognitionAsync();

Dictation mode

When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enableDictation method on SpeechConfig:

speechConfig.enableDictation();

Change the source language

A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, find your SpeechConfig instance and add this line directly below it:

speechConfig.speechRecognitionLanguage = "it-IT";

The speechRecognitionLanguage property expects a language-locale format string. You can provide any value in the Locale column in the list of supported locales/languages.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

var speechConfig = SpeechSDK.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
speechConfig.endpointId = "YourEndpointId";
var speechRecognizer = new SpeechSDK.SpeechRecognizer(speechConfig);

Reference documentation | Package (Download) | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Install Speech SDK and samples

The Azure-Samples/cognitive-services-speech-sdk repository contains samples written in Objective-C for iOS and Mac. Select a link to see installation instructions for each sample:

For more information, see the Speech SDK for Objective-C reference.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

SPXSpeechConfiguration *speechConfig = [[SPXSpeechConfiguration alloc] initWithSubscription:@"YourSubscriptionKey" region:@"YourServiceRegion"];
speechConfig.endpointId = @"YourEndpointId";
SPXSpeechRecognizer* speechRecognizer = [[SPXSpeechRecognizer alloc] init:speechConfig];

Reference documentation | Package (Download) | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Install Speech SDK and samples

The Azure-Samples/cognitive-services-speech-sdk repository contains samples written in Swift for iOS and Mac. Select a link to see installation instructions for each sample:

For more information, see the Speech SDK for Swift reference.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

let speechConfig = SPXSpeechConfiguration(subscription: "YourSubscriptionKey", region: "YourServiceRegion");
speechConfig.endpointId = "YourEndpointId";
let speechRecognizer = SPXSpeechRecognizer(speechConfiguration: speechConfig);

Reference documentation | Package (PyPi) | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Create a speech configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechConfig instance. This class includes information about your subscription, like your speech key and associated location/region, endpoint, host, or authorization token.

Create a SpeechConfig instance by using your speech key and location/region. Create a Speech resource on the Azure portal. For more information, see Create a new Azure Cognitive Services resource.

speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")

You can initialize SpeechConfig in a few other ways, as shown in the sketch after this list:

  • With an endpoint: pass in a Speech service endpoint. A speech key or authorization token is optional.
  • With a host: pass in a host address. A speech key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.
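
Here's a hedged sketch of those alternatives; the endpoint and host URIs are example formats, and the token is a placeholder:

import azure.cognitiveservices.speech as speechsdk

# From an endpoint (example speech-to-text WebSocket endpoint format; replace westus with your region).
config_from_endpoint = speechsdk.SpeechConfig(
    subscription="<paste-your-speech-key-here>",
    endpoint="wss://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1")

# From a host address, for example a self-hosted Speech container.
config_from_host = speechsdk.SpeechConfig(host="ws://localhost:5000")

# From an authorization token and the associated region.
config_from_token = speechsdk.SpeechConfig(
    auth_token="<paste-your-authorization-token-here>",
    region="<paste-your-speech-location/region-here>")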

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

Recognize speech from a microphone

To recognize speech by using your device microphone, create a SpeechRecognizer instance without passing AudioConfig, and then pass speech_config:

import azure.cognitiveservices.speech as speechsdk

def from_mic():
    speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    print("Speak into your microphone.")
    result = speech_recognizer.recognize_once_async().get()
    print(result.text)

from_mic()

If you want to use a specific audio input device, you need to specify the device ID in AudioConfig, and then pass it to the SpeechRecognizer constructor's audio_config parameter. Learn how to get the device ID for your audio input device.
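
Here's a minimal sketch, assuming you already have the device ID string and a speech_config instance:

import azure.cognitiveservices.speech as speechsdk

# "<device-id>" is a placeholder for your audio input device's ID.
audio_config = speechsdk.audio.AudioConfig(device_name="<device-id>")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)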

Recognize speech from a file

If you want to recognize speech from an audio file instead of using a microphone, create an AudioConfig instance and use the filename parameter:

import azure.cognitiveservices.speech as speechsdk

def from_file():
    speech_config = speechsdk.SpeechConfig(subscription="<paste-your-speech-key-here>", region="<paste-your-speech-location/region-here>")
    audio_input = speechsdk.AudioConfig(filename="your_file_name.wav")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

    result = speech_recognizer.recognize_once_async().get()
    print(result.text)

from_file()

Handle errors

The previous examples simply get the recognized text from result.text. To handle errors and other responses, you need to write some code to handle the result. The following code evaluates the result.reason property and:

  • Prints the recognition result: speechsdk.ResultReason.RecognizedSpeech.
  • If there is no recognition match, informs the user: speechsdk.ResultReason.NoMatch.
  • If an error is encountered, prints the error message: speechsdk.ResultReason.Canceled.
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
        print("Did you set the speech resource key and region values?")

Use continuous recognition

The previous examples use single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.

In contrast, you use continuous recognition when you want to control when to stop recognizing. It requires you to connect to EventSignal to get the recognition results. To stop recognition, you must call stop_continuous_recognition() or stop_continuous_recognition_async(). Here's an example of how continuous recognition is performed on an audio input file.

Start by defining the input and initializing SpeechRecognizer:

audio_config = speechsdk.audio.AudioConfig(filename=weatherfilename)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

Next, create a variable to manage the state of speech recognition. Set the variable to False because at the start of recognition, you can safely assume that it's not finished.

done = False

Now, create a callback to stop continuous recognition when evt is received. Keep these points in mind:

  • When evt is received, the evt message is printed.
  • After evt is received, stop_continuous_recognition() is called to stop recognition.
  • The recognition state is changed to True.
def stop_cb(evt):
    print('CLOSING on {}'.format(evt))
    speech_recognizer.stop_continuous_recognition()
    nonlocal done
    done = True

The following code sample shows how to connect callbacks to events sent from SpeechRecognizer. The events are:

  • recognizing: Signal for events that contain intermediate recognition results.
  • recognized: Signal for events that contain final recognition results, which indicate a successful recognition attempt.
  • session_started: Signal for events that indicate the start of a recognition session (operation).
  • session_stopped: Signal for events that indicate the end of a recognition session (operation).
  • canceled: Signal for events that contain canceled recognition results. These results indicate a recognition attempt that was canceled as a result of a direct cancellation request. Alternatively, they indicate a transport or protocol failure.
speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

speech_recognizer.session_stopped.connect(stop_cb)
speech_recognizer.canceled.connect(stop_cb)

With everything set up, you can call start_continuous_recognition():

speech_recognizer.start_continuous_recognition()
while not done:
    time.sleep(.5)

Dictation mode

When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".

To enable dictation mode, use the enable_dictation() method on SpeechConfig:

speech_config.enable_dictation()

Change the source language

A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to German. In your code, find your SpeechConfig instance and add this line directly below it:

speech_config.speech_recognition_language="de-DE"

speech_recognition_language is a property that expects a language-locale format string. You can provide any value in the list of supported locales/languages.

Use a custom endpoint

With Custom Speech, you can upload your own data, test and train a custom model, compare accuracy between models, and deploy a model to a custom endpoint. The following example shows how to set a custom endpoint.

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
speech_config.endpoint_id = "YourEndpointId"
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

Speech-to-text REST API v3.0 reference | Speech-to-text REST API for short audio reference | Additional Samples on GitHub

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Convert speech to text

At a command prompt, run the following command. Insert the following values into the command:

  • Your subscription key for the Speech service.
  • Your Speech service region.
  • The path for input audio files. You can generate audio files by using text-to-speech.
curl --location --request POST 'https://INSERT_REGION_HERE.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US' \
--header 'Ocp-Apim-Subscription-Key: INSERT_SUBSCRIPTION_KEY_HERE' \
--header 'Content-Type: audio/wav' \
--data-binary @'INSERT_AUDIO_FILE_PATH_HERE'

You should receive a response with a JSON body like the following one:

{
    "RecognitionStatus": "Success",
    "DisplayText": "My voice is my passport, verify me.",
    "Offset": 6600000,
    "Duration": 32100000
}

For more information, see the speech-to-text REST API reference.

In this how-to guide, you learn how to recognize and transcribe human speech (often called speech-to-text).

Speech-to-text from a microphone

Plug in and turn on your PC microphone. Turn off any apps that might also use the microphone. Some computers have a built-in microphone, whereas others require configuration of a Bluetooth device.

Now you're ready to run the Speech CLI to recognize speech from your microphone. From the command line, change to the directory that contains the Speech CLI binary file. Then run the following command:

spx recognize --microphone

Note

The Speech CLI defaults to English. You can choose a different language from the speech-to-text table. For example, add --source de-DE to recognize German speech.
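
For example, the full command to recognize German speech from the microphone looks like this:

spx recognize --microphone --source de-DE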

Speak into the microphone, and you see transcription of your words into text in real time. The Speech CLI stops after a period of silence, or when you select Ctrl+C.

Speech-to-text from an audio file

The Speech CLI can recognize speech in many file formats and natural languages. In this example, you can use any WAV file (16 kHz or 8 kHz, 16-bit, and mono PCM) that contains English speech. Or if you want a quick sample, download the whatstheweatherlike.wav file and copy it to the same directory as the Speech CLI binary file.

Use the following command to run the Speech CLI to recognize speech found in the audio file:

spx recognize --file whatstheweatherlike.wav

Note

The Speech CLI defaults to English. You can choose a different language from the speech-to-text table. For example, add --source de-DE to recognize German speech.

The Speech CLI shows a text transcription of the speech on the screen.

Next steps