What is speech-to-text?
In this overview, you learn about the benefits and capabilities of the speech-to-text service. Speech-to-text, also known as speech recognition, enables real-time transcription of audio streams into text. Your applications, tools, or devices can consume, display, and take action on this text as command input. This service is powered by the same recognition technology that Microsoft uses for Cortana and Office products. It seamlessly works with the translation and text-to-speech service offerings. For a full list of available speech-to-text languages, see supported languages.
The speech-to-text service defaults to using the Universal language model. This model was trained using Microsoft-owned data and is deployed in the cloud. It's optimal for conversational and dictation scenarios. When using speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic, language, and pronunciation models. Customization is helpful for addressing ambient noise or industry-specific vocabulary.
When you supply reference text as additional input, the speech-to-text service can also perform pronunciation assessment: it evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence. Educators can use the capability to evaluate the pronunciation of multiple speakers in real time. The feature currently supports US English, and its scores correlate highly with speech assessments conducted by experts.
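As a rough illustration of how reference text can accompany a recognition request, the sketch below builds a base64-encoded JSON value of the kind the REST API accepts in a `Pronunciation-Assessment` request header. The parameter names and values (`ReferenceText`, `GradingSystem`, `Granularity`, `Dimension`) are assumptions based on the pronunciation assessment REST reference and should be checked against it; the helper function name is hypothetical.

```python
import base64
import json


def build_pronunciation_assessment_header(reference_text: str) -> str:
    """Return a base64-encoded JSON string for the Pronunciation-Assessment header.

    The parameter names below are assumptions; confirm them in the REST reference.
    """
    params = {
        "ReferenceText": reference_text,   # the text the speaker is expected to read
        "GradingSystem": "HundredMark",    # assumed: scores on a 0-100 scale
        "Granularity": "Phoneme",          # assumed: most detailed scoring level
        "Dimension": "Comprehensive",      # assumed: accuracy plus fluency scores
    }
    raw = json.dumps(params).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


header_value = build_pronunciation_assessment_header("Good morning.")

# Decode again to confirm the header round-trips to the original parameters.
decoded = json.loads(base64.b64decode(header_value))
print(decoded["ReferenceText"])
```

The resulting string would be sent as the header value alongside the audio in a speech-to-text request.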
Bing Speech was decommissioned on October 15, 2019. If your applications, tools, or products are using the Bing Speech APIs, we've created guides to help you migrate to the Speech service.
TLS 1.2 is now enforced for all HTTP requests to this service. For more information, see Azure Cognitive Services security.
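If you call the service from your own HTTP client, one way to make sure the client never negotiates an older protocol version is to set a minimum TLS version on the client side. The following is a minimal sketch using Python's standard `ssl` module; it is not taken from the service documentation, and recent Python versions may already apply this minimum by default.

```python
import ssl

# Start from the default client context, which verifies server certificates.
context = ssl.create_default_context()

# Refuse to negotiate anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.minimum_version.name)
```

A context configured this way can be passed to `urllib.request.urlopen(..., context=context)` or to `http.client.HTTPSConnection`.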
Sample code for the Speech SDK is available on GitHub. These samples cover common scenarios like reading audio from a file or stream, continuous and single-shot recognition, and working with custom models.
- Speech-to-text samples (SDK)
- Batch transcription samples (REST)
- Pronunciation assessment samples (REST)
In addition to the standard Speech service model, you can create custom models. Customization helps to overcome speech recognition barriers such as speaking style, vocabulary, and background noise; see Custom Speech. Customization options vary by language/locale; see supported languages to verify support.
Batch transcription is a set of REST API operations that enable you to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results. See the how-to for more information on how to use the batch transcription API.
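The request that starts a batch transcription might look roughly like the sketch below, which builds (but does not send) an HTTP POST using only the Python standard library. The endpoint path, the `v3.0` API version, the JSON field names, and the `Ocp-Apim-Subscription-Key` header are assumptions to verify against the batch transcription reference; the region, key, and SAS URI are placeholders.

```python
import json
import urllib.request


def build_batch_transcription_request(region, key, sas_uris, locale="en-US"):
    """Build a POST request that would create a batch transcription job.

    Endpoint, API version, and field names are assumptions; check the REST reference.
    """
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.0/transcriptions"
    )
    body = {
        "contentUrls": list(sas_uris),   # SAS URIs pointing at audio files in storage
        "locale": locale,
        "displayName": "Example batch transcription",
    }
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# Placeholder values; replace them with your own region, key, and SAS URI.
batch_request = build_batch_transcription_request(
    "westus",
    "YOUR_SUBSCRIPTION_KEY",
    ["https://example.blob.core.windows.net/audio/sample.wav?sv=PLACEHOLDER"],
)
print(batch_request.get_method(), batch_request.full_url)
```

Because transcription is asynchronous, a real client would send this request with `urllib.request.urlopen` and then poll the job that the service returns until results are ready.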
The Speech service provides two SDKs. The first is the primary Speech SDK, which provides most of the functionality needed to interact with the Speech service. The second, the Speech Devices SDK, is specific to devices. Both SDKs are available in many languages.
Speech SDK reference docs
Use the following list to find the appropriate Speech SDK reference docs:
The Speech service SDK is actively maintained and updated. To track changes, updates, and feature additions, refer to the Speech SDK release notes.
Speech Devices SDK reference docs
REST API references
For reference documentation on the Speech service REST APIs, see the following list:
- REST API: Speech-to-text
- REST API: Pronunciation assessment
- REST API: Text-to-speech
- REST API: Batch transcription and customization
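To give a feel for the speech-to-text REST API listed above, the sketch below builds (without sending) a recognition request for a short WAV clip using only the Python standard library. The host name pattern, URL path, `language` query parameter, and header values are assumptions to confirm in the speech-to-text REST reference; the region, key, and audio bytes are placeholders.

```python
import urllib.parse
import urllib.request


def build_recognition_request(region, key, audio_bytes, language="en-US"):
    """Build a POST request that would submit short audio for recognition.

    URL layout and headers are assumptions; check the REST reference.
    """
    query = urllib.parse.urlencode({"language": language})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        # Assumed audio format: 16 kHz PCM in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=audio_bytes, headers=headers, method="POST")


# Placeholder audio; in practice, read the bytes of a real WAV file.
recognition_request = build_recognition_request(
    "westus", "YOUR_SUBSCRIPTION_KEY", b"RIFF"
)
print(recognition_request.get_method(), recognition_request.full_url)
```

Sending the request with `urllib.request.urlopen` would return a JSON document containing the recognition result. For streaming or continuous recognition, the Speech SDK is usually the better fit.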