What is speech-to-text?

Speech-to-text from Azure Speech Services, also known as speech recognition, enables real-time transcription of audio streams into text that your applications, tools, or devices can consume, display, and take action on as command input. This service is powered by the same recognition technology that Microsoft uses for Cortana and Office products, and it works seamlessly with the translation and text-to-speech services. For a full list of available speech-to-text languages, see supported languages.

By default, the speech-to-text service uses the Universal language model. This model was trained using Microsoft-owned data and is deployed in the cloud. It's optimal for conversational and dictation scenarios. If you're using speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic, language, and pronunciation models to address ambient noise or industry-specific vocabulary.

You can easily capture audio from a microphone, read from a stream, or access audio files from storage with the Speech SDK and REST APIs. The Speech SDK supports WAV/PCM 16-bit, 16 kHz, single-channel audio for speech recognition. Additional audio formats are supported using the speech-to-text REST endpoint or the batch transcription service.
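For example, here's a minimal sketch of single-shot recognition with the Speech SDK for Python, reading from a WAV file in the format described above. The subscription key, region, and file name are placeholders to replace with your own:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder key and region -- replace with your own subscription values.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="westus")

# Read from a 16 kHz, 16-bit, mono WAV file; omit audio_config to use the default microphone instead.
audio_config = speechsdk.audio.AudioConfig(filename="whatstheweatherlike.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# recognize_once() returns after the first recognized utterance (up to ~15 seconds of audio).
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized.")
elif result.reason == speechsdk.ResultReason.Canceled:
    print("Canceled: {}".format(result.cancellation_details.reason))
```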

Core features

Here are the features available via the Speech SDK and REST APIs:

| Use case | SDK | REST |
|----------|-----|------|
| Transcribe short utterances (<15 seconds). Only supports final transcription result. | Yes | Yes |
| Continuous transcription of long utterances and streaming audio (>15 seconds). Supports interim and final transcription results (see the sketch after this table). | Yes | No |
| Derive intents from recognition results with LUIS. | Yes | No* |
| Batch transcription of audio files asynchronously. | No | Yes** |
| Create and manage speech models. | No | Yes** |
| Create and manage custom model deployments. | No | Yes** |
| Create accuracy tests to measure the accuracy of the baseline model versus custom models. | No | Yes** |
| Manage subscriptions. | No | Yes** |

* LUIS intents and entities can be derived using a separate LUIS subscription. With this subscription, the SDK can call LUIS for you and provide entity and intent results. With the REST API, you can call LUIS yourself to derive intents and entities with your LUIS subscription.

** These services are available using the cris.ai endpoint. See Swagger reference.
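To illustrate the continuous-transcription row in the table above, here's a sketch using the Speech SDK for Python. Interim hypotheses arrive through the `recognizing` event and final results through `recognized`; the key, region, and file name are placeholders:

```python
import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="westus")
audio_config = speechsdk.audio.AudioConfig(filename="long-utterance.wav")  # placeholder file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

done = False

def stop_cb(evt):
    # Fires when the session stops or recognition is canceled.
    global done
    done = True

# Interim (partial) results stream in on `recognizing`; finals arrive on `recognized`.
recognizer.recognizing.connect(lambda evt: print("Interim: {}".format(evt.result.text)))
recognizer.recognized.connect(lambda evt: print("Final:   {}".format(evt.result.text)))
recognizer.session_stopped.connect(stop_cb)
recognizer.canceled.connect(stop_cb)

recognizer.start_continuous_recognition()
while not done:
    time.sleep(0.5)
recognizer.stop_continuous_recognition()
```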

Get started with speech-to-text

We offer quickstarts in most popular programming languages, each designed to have you running code in less than 10 minutes. This table includes a complete list of Speech SDK quickstarts organized by language.

| Quickstart | Platform | API reference |
|------------|----------|---------------|
| C#, .NET Core | Windows | Browse |
| C#, .NET Framework | Windows | Browse |
| C#, UWP | Windows | Browse |
| C++ | Windows | Browse |
| C++ | Linux | Browse |
| Java | Android | Browse |
| Java | Windows, Linux | Browse |
| JavaScript, Browser | Browser, Windows, Linux, macOS | Browse |
| JavaScript, Node.js | Windows, Linux, macOS | Browse |
| Objective-C | iOS | Browse |
| Python | Windows, Linux, macOS | Browse |

If you prefer to use the speech-to-text REST service, see REST APIs.
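As a rough illustration, a short-audio request to the REST endpoint looks something like the following. The regional hostname, path, and headers follow the documented pattern for the short-audio API, but treat the exact URL as an assumption to verify against the REST reference:

```python
import requests

# Placeholder region and key -- replace with your own.
url = "https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
    "Ocp-Apim-Subscription-Key": "YourSubscriptionKey",
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json",
}

# Send a short (<15 second) WAV file; the response contains only the final transcription.
with open("whatstheweatherlike.wav", "rb") as f:
    response = requests.post(url, params={"language": "en-US"}, headers=headers, data=f)

print(response.json().get("DisplayText"))
```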

Tutorials and sample code

After you've had a chance to use the Speech Services, try our tutorial that teaches you how to recognize intents from speech using the Speech SDK and LUIS.
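As a preview of what the tutorial covers, here's a minimal sketch of SDK-based intent recognition in Python. The LUIS app ID and intent names below are hypothetical placeholders, and the key and region come from your LUIS subscription:

```python
import azure.cognitiveservices.speech as speechsdk

# Intent recognition authenticates with your LUIS subscription key and region.
intent_config = speechsdk.SpeechConfig(subscription="YourLuisSubscriptionKey", region="westus")
recognizer = speechsdk.intent.IntentRecognizer(speech_config=intent_config)

# Placeholder LUIS app ID and intent names -- replace with your own.
model = speechsdk.intent.LanguageUnderstandingModel(app_id="YourLuisAppId")
recognizer.add_intents([
    (model, "HomeAutomation.TurnOn"),
    (model, "HomeAutomation.TurnOff"),
])

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedIntent:
    print("Text: {}  Intent: {}".format(result.text, result.intent_id))
```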

Sample code for the Speech SDK is available on GitHub. These samples cover common scenarios like reading audio from a file or stream, continuous and single-shot recognition, and working with custom models.
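For instance, the stream-input scenario looks roughly like this with a push stream, which lets you feed audio bytes to the recognizer yourself. The file name is a placeholder, and the default stream format (16 kHz, 16-bit, mono PCM) is assumed:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="westus")

# A push stream accepts audio you supply yourself, e.g. from a socket or an in-memory buffer.
stream = speechsdk.audio.PushAudioInputStream()
audio_config = speechsdk.audio.AudioConfig(stream=stream)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Feed the audio into the stream in chunks, then close it to signal end of audio.
with open("whatstheweatherlike.wav", "rb") as f:  # placeholder file
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        stream.write(chunk)
stream.close()

result = recognizer.recognize_once()
print(result.text)
```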

Customization

In addition to the Universal model used by the Speech Services, you can create custom acoustic, language, and pronunciation models specific to your experience. Here's a list of customization options:

| Model | Description |
|-------|-------------|
| Acoustic model | Creating a custom acoustic model is helpful if your application, tools, or devices are used in a particular environment, such as in a car or factory with specific recording conditions. Examples include accented speech, specific background noises, or using a specific microphone for recording. |
| Language model | Create a custom language model to improve transcription of industry-specific vocabulary and grammar, such as medical terminology or IT jargon. |
| Pronunciation model | With a custom pronunciation model, you can define the phonetic form and display of a word or term. It's useful for handling customized terms, such as product names or acronyms. All you need to get started is a pronunciation file: a simple .txt file. |
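As an illustration, each line of a pronunciation file pairs a display form with its spoken form, separated by a tab in the documented format (the entries below are examples, not requirements):

```
3CPO	three see pea o
CNTK	c n t k
IEEE	i triple e
```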

Note

Customization options vary by language/locale (see Supported languages).

Migration guides

Warning

Bing Speech will be decommissioned on October 15, 2019.

If your applications, tools, or products are using the Bing Speech APIs or Custom Speech, we've created guides to help you migrate to Speech Services.

Reference docs

Next steps