What is speech-to-text?
Speech-to-text from Azure Speech Services, also known as speech recognition, enables real-time transcription of audio streams into text that your applications, tools, or devices can consume, display, and act on as command input. The service is powered by the same recognition technology that Microsoft uses for Cortana and Office products, and it works with the translation and text-to-speech services. For a full list of available speech-to-text languages, see supported languages.
By default, the speech-to-text service uses the Universal language model. This model was trained using Microsoft-owned data and is deployed in the cloud. It's optimal for conversational and dictation scenarios. If you are using speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic, language, and pronunciation models to address ambient noise or industry-specific vocabulary.
You can easily capture audio from a microphone, read from a stream, or access audio files from storage with the Speech SDK and REST APIs. The Speech SDK supports WAV/PCM 16-bit, 16 kHz/8 kHz, single-channel audio for speech recognition. Additional audio formats are supported using the speech-to-text REST endpoint or the batch transcription service.
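Before sending a file to the Speech SDK, it can help to confirm that its header matches the format listed above. The following is a minimal sketch that uses only Python's standard `wave` module. The function name `check_wav_for_sdk` is hypothetical and isn't part of the Speech SDK. It inspects only the WAV header, and `wave.open` raises `wave.Error` for files it can't parse.

```python
import wave

# Format the text above lists for the Speech SDK:
# 16-bit PCM, single channel, sampled at 16 kHz or 8 kHz.
SUPPORTED_SAMPLE_RATES = {16000, 8000}


def check_wav_for_sdk(path):
    """Return a list of format problems; an empty list means the header matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected 1 channel, found {wav.getnchannels()}")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"expected 16-bit samples, found {wav.getsampwidth() * 8}-bit")
        if wav.getframerate() not in SUPPORTED_SAMPLE_RATES:
            problems.append(f"expected 16000 or 8000 Hz, found {wav.getframerate()} Hz")
    return problems
```

If the list isn't empty, resample or downmix the audio first, or use the REST endpoint or batch transcription service, which accept additional formats.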
Here are the features available via the Speech SDK and REST APIs:
|Use case||SDK||REST|
|Transcribe short utterances (<15 seconds). Only supports final transcription result.||Yes||Yes|
|Continuous transcription of long utterances and streaming audio (>15 seconds). Supports interim and final transcription results.||Yes||No|
|Derive intents from recognition results with LUIS.||Yes||No*|
|Batch transcription of audio files asynchronously.||No||Yes**|
|Create and manage speech models.||No||Yes**|
|Create and manage custom model deployments.||No||Yes**|
|Create accuracy tests to measure the accuracy of the baseline model versus custom models.||No||Yes**|
* LUIS intents and entities can be derived using a separate LUIS subscription. With this subscription, the SDK can call LUIS for you and provide entity and intent results. With the REST API, you can call LUIS yourself to derive intents and entities with your LUIS subscription.
** These services are available using the cris.ai endpoint. See Swagger reference.
Get started with speech-to-text
We offer quickstarts in most popular programming languages, each designed to have you running code in less than 10 minutes. This table includes a complete list of Speech SDK quickstarts organized by language.
|Quickstart (language)||Platform||API reference|
|C#, .NET Core||Windows||Browse|
|C#, .NET Framework||Windows||Browse|
|Java||Windows, Linux, macOS||Browse|
|Python||Windows, Linux, macOS||Browse|
If you prefer to use the speech-to-text REST service, see REST APIs.
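As a rough illustration of what a REST call for a short utterance involves, here is a sketch that builds the HTTP request with Python's standard library without sending it. The endpoint path, query parameters, and header names are assumptions recalled from the REST reference and aren't stated on this page, so verify them against the REST APIs documentation. The function name `build_short_audio_request` is hypothetical.

```python
import urllib.parse
import urllib.request


def build_short_audio_request(region, subscription_key, wav_bytes, language="en-US"):
    """Build (but don't send) a POST request for short-utterance recognition."""
    # Assumed URL pattern for the regional speech-to-text REST endpoint.
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        # Assumed header names; the key comes from your Speech Services subscription.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")
```

To send the request, pass the returned object to `urllib.request.urlopen` and parse the JSON response body. Because the REST API only supports final results for short utterances, use the Speech SDK when you need interim results or continuous transcription.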
Tutorials and sample code
After you've had a chance to use the Speech Services, try our tutorial that teaches you how to recognize intents from speech using the Speech SDK and LUIS.
Sample code for the Speech SDK is available on GitHub. These samples cover common scenarios like reading audio from a file or stream, continuous and single-shot recognition, and working with custom models.
In addition to the Universal model used by the Speech Services, you can create custom acoustic, language, and pronunciation models specific to your experience. Here's a list of customization options:
|Model||Description|
|Acoustic model||Creating a custom acoustic model is helpful if your application, tools, or devices are used in a particular environment, like in a car or factory with specific recording conditions. Examples involve accented speech, specific background noises, or using a specific microphone for recording.|
|Language model||Create a custom language model to improve transcription of industry-specific vocabulary and grammar, such as medical terminology, or IT jargon.|
|Pronunciation model||With a custom pronunciation model, you can define the phonetic form and display of a word or term. It's useful for handling customized terms, such as product names or acronyms. All you need to get started is a pronunciation file -- a simple .txt file.|
Customization options vary by language/locale (see Supported languages).
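To make the pronunciation file concrete, the sketch below writes one with Python. It assumes one entry per line, with the display form and the spoken form separated by a tab, and UTF-8 encoding. This page doesn't specify that layout, so confirm it in the customization documentation before uploading. The function name `write_pronunciation_file` and the sample entries are illustrative only.

```python
def write_pronunciation_file(path, entries):
    """Write (display_form, spoken_form) pairs as tab-separated lines."""
    lines = []
    for display_form, spoken_form in entries:
        # Assumed layout: a tab separates the two columns, so neither may contain one.
        if "\t" in display_form or "\t" in spoken_form:
            raise ValueError("entries must not contain tab characters")
        lines.append(f"{display_form}\t{spoken_form}")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
    return lines


# Illustrative entries: an acronym and a product-style name.
sample_entries = [("CNTK", "c n t k"), ("IEEE", "i triple e")]
```

Calling `write_pronunciation_file("pronunciation.txt", sample_entries)` produces a two-line .txt file that you could then upload as the data for a custom pronunciation model.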
Bing Speech will be decommissioned on October 15, 2019.
If your applications, tools, or products are using the Bing Speech APIs or Custom Speech, we've created guides to help you migrate to Speech Services.
- Speech SDK
- Speech Devices SDK
- REST API: Speech-to-text
- REST API: Text-to-speech
- REST API: Batch transcription and customization