What is speech-to-text?

Speech-to-text from the Speech service, also known as speech recognition, enables real-time transcription of audio streams into text that your applications, tools, or devices can consume, display, and act on as command input. This service is powered by the same recognition technology that Microsoft uses for Cortana and Office products, and it works seamlessly with the translation and text-to-speech service offerings. For a full list of available speech-to-text languages, see Supported languages.

By default, the speech-to-text service uses the Universal language model. This model was trained using Microsoft-owned data and is deployed in the cloud. It's optimal for conversational and dictation scenarios. If you are using speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic, language, and pronunciation models to address ambient noise or industry-specific vocabulary.

You can easily capture audio from a microphone, read from a stream, or access audio files from storage with the Speech SDK and REST APIs. The Speech SDK supports WAV/PCM 16-bit, 16 kHz/8 kHz, single-channel audio for speech recognition. Additional audio formats are supported using the speech-to-text REST endpoint or the batch transcription service.
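As a quick pre-flight check before sending a file to the SDK, the following minimal Python sketch uses only the standard-library `wave` module to test whether a WAV file matches the format described above (16-bit PCM, 16 kHz or 8 kHz, single channel). The names `is_sdk_compatible_wav` and `make_wav` are hypothetical helpers for illustration; they are not part of the Speech SDK.

```python
import io
import wave

# Format the Speech SDK accepts for recognition, per the paragraph above:
# 16-bit PCM, 16 kHz or 8 kHz, single channel.
SUPPORTED_SAMPLE_RATES = {8000, 16000}
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1            # mono


def is_sdk_compatible_wav(data: bytes) -> bool:
    """Return True if `data` is a PCM WAV file in the SDK-supported format."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            return (
                wav.getcomptype() == "NONE"  # uncompressed PCM
                and wav.getsampwidth() == SAMPLE_WIDTH_BYTES
                and wav.getnchannels() == CHANNELS
                and wav.getframerate() in SUPPORTED_SAMPLE_RATES
            )
    except (wave.Error, EOFError):
        # Not a readable RIFF/WAVE PCM file (compressed, truncated, or not WAV at all).
        return False


def make_wav(rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit WAV in memory, for trying out the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    return buf.getvalue()
```

Files that fail this check are not necessarily unusable; as noted above, other formats may be accepted through the REST endpoint or batch transcription.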

Core features

Here are the features available through the Speech SDK and REST APIs:

| Use case | SDK | REST |
|---|---|---|
| Transcribe short utterances (<15 seconds). Only supports one final transcription result. | Yes | Yes* |
| Continuous transcription of long utterances and streaming audio (>15 seconds). Supports interim and final transcription results. | Yes | No |
| Derive intents from recognition results with LUIS. | Yes | No** |
| Batch transcription of audio files asynchronously. | No | Yes*** |
| Create and manage speech models. | No | Yes*** |
| Create and manage custom model deployments. | No | Yes*** |
| Create accuracy tests to measure the accuracy of the baseline model versus custom models. | No | Yes*** |
| Manage subscriptions. | No | Yes*** |

*Using the REST functionality, you can transfer up to 60 seconds of audio and receive one final transcription result.

**LUIS intents and entities can be derived using a separate LUIS subscription. With this subscription, the SDK calls LUIS for you and provides entity and intent results. With the REST API, you call LUIS yourself to derive intents and entities with your LUIS subscription.

***These services are available using the cris.ai endpoint. See the Swagger reference.
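The limits in the table and footnotes can be read as a simple decision rule. The sketch below encodes them in a hypothetical helper, `choose_approach`. The thresholds (under 15 seconds for single-shot SDK recognition, up to 60 seconds for the REST path) come from the table and its first footnote; routing longer REST workloads to batch transcription is an assumption based on the table marking continuous transcription as SDK-only.

```python
from enum import Enum

# Thresholds taken from the feature table and its first footnote.
SHORT_UTTERANCE_SECONDS = 15  # SDK single-shot: short utterances, one final result
REST_MAX_SECONDS = 60         # REST: up to 60 seconds, one final result


class Approach(Enum):
    SDK_SINGLE_SHOT = "SDK single-shot recognition"
    SDK_CONTINUOUS = "SDK continuous recognition"
    REST_SHORT_AUDIO = "REST, one final result"
    BATCH_TRANSCRIPTION = "batch transcription (REST, asynchronous)"


def choose_approach(duration_seconds: float, use_sdk: bool) -> Approach:
    """Hypothetical helper mapping audio length and client type to an approach."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    if use_sdk:
        # The SDK handles both short utterances and long or streaming audio.
        if duration_seconds < SHORT_UTTERANCE_SECONDS:
            return Approach.SDK_SINGLE_SHOT
        return Approach.SDK_CONTINUOUS
    # REST has no continuous mode, so longer files go to batch transcription (assumption).
    if duration_seconds <= REST_MAX_SECONDS:
        return Approach.REST_SHORT_AUDIO
    return Approach.BATCH_TRANSCRIPTION
```

For example, a 45-second recording sent over REST maps to `Approach.REST_SHORT_AUDIO`, while the same recording handled by the SDK maps to `Approach.SDK_CONTINUOUS`.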

Get started with speech-to-text

We offer quickstarts in most popular programming languages, each designed to have you running code in less than 10 minutes. The Speech SDK quickstarts are organized by platform and language, and API reference documentation is also available.

If you prefer to use the speech-to-text REST service, see REST APIs.
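To give a feel for what a REST call involves, the sketch below builds, but does not send, an HTTP request for short-audio recognition using only Python's standard library. The endpoint pattern (`https://<region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1`), the `language` query parameter, and the `Ocp-Apim-Subscription-Key` and `Content-Type` headers reflect the short-audio REST API as commonly documented, but treat them as assumptions and confirm them against the REST API reference for your region and API version. The function name `build_short_audio_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_short_audio_request(region: str, key: str, wav_bytes: bytes,
                              language: str = "en-US",
                              sample_rate: int = 16000) -> Request:
    """Build (without sending) a POST request for short-audio recognition."""
    # Assumption: endpoint pattern for the short-audio REST API; verify in the REST reference.
    base = (f"https://{region}.stt.speech.microsoft.com"
            "/speech/recognition/conversation/cognitiveservices/v1")
    url = f"{base}?{urlencode({'language': language})}"
    headers = {
        # Assumption: subscription-key auth header and PCM WAV content type.
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": f"audio/wav; codecs=audio/pcm; samplerate={sample_rate}",
        "Accept": "application/json",
    }
    return Request(url, data=wav_bytes, headers=headers, method="POST")
```

To actually send it, pass the request to `urllib.request.urlopen(req)`; the response body is JSON describing the recognition result. Note that `urllib` stores header names via `str.capitalize()`, so inspect them with, for example, `req.get_header("Content-type")`.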

Tutorials and sample code

After you've had a chance to use the Speech service, try our tutorial that teaches you how to recognize intents from speech using the Speech SDK and LUIS.

Sample code for the Speech SDK is available on GitHub. These samples cover common scenarios like reading audio from a file or stream, continuous and single-shot recognition, and working with custom models.
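When audio comes from a stream rather than a single file, continuous recognition consumes it incrementally. The following minimal sketch, which is not taken from the GitHub samples, slices raw PCM into frame-aligned chunks of a fixed duration. The function name `pcm_chunks` and the 100 ms default chunk size are hypothetical choices for illustration; the defaults match the 16 kHz, 16-bit, single-channel format the SDK supports.

```python
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, sample_width: int = 2,
               channels: int = 1, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield successive frame-aligned chunks of raw PCM, each `chunk_ms` long."""
    # Compute whole frames first so no chunk splits a sample in half.
    frames_per_chunk = sample_rate * chunk_ms // 1000
    chunk_size = frames_per_chunk * sample_width * channels
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_size):
        # The final chunk may be shorter if the audio does not divide evenly.
        yield pcm[start:start + chunk_size]
```

With the Speech SDK for Python, each chunk could then be written to a `PushAudioInputStream` with its `write` method; check the SDK reference and samples for the full setup.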

Customization

In addition to the standard baseline model used by the Speech service, you can customize models to your needs with available data to overcome speech recognition barriers such as speaking style, vocabulary, and background noise. For more information, see Custom Speech.

Note

Customization options vary by language/locale (see Supported languages).

Migration guides

Warning

Bing Speech was decommissioned on October 15, 2019.

If your applications, tools, or products are using the Bing Speech APIs or Custom Speech, we've created guides to help you migrate to the Speech service.

Reference docs

The Speech service provides two SDKs. The first is the primary Speech SDK, which provides most of the functionality needed to interact with the Speech service. The second is specific to devices and is appropriately named the Speech Devices SDK. Both SDKs are available in many languages.

Speech SDK reference docs

Use the following list to find the appropriate Speech SDK reference docs:

Tip

The Speech service SDK is actively maintained and updated. To track changes, updates, and feature additions, refer to the Speech SDK release notes.

Speech Devices SDK reference docs

The Speech Devices SDK is a superset of the Speech SDK, with extended functionality for specific devices. To download the Speech Devices SDK, you must first choose a development kit.

REST API references

For reference documentation on the various Speech service REST APIs, refer to the following list:

Next steps