Use cases for Speech-to-Text

What is a Transparency Note?

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.

Microsoft's Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see Microsoft's AI principles.

Introduction to Speech-to-Text

Speech-to-Text, also known as automatic speech recognition (ASR), is a feature of Speech Services (see What is Speech-to-Text?) that converts spoken audio into text. Speech-to-Text supports more than 140 locales for input. See the latest list of supported locales in Language and voice support for the Speech service.

Terms and definitions

Audio input: The streamed audio data or audio file that is used as an input for the speech-to-text feature. Audio input may contain not only voice, but also silence and non-speech noise. Speech-to-Text generates text for the voice parts of audio input.
Utterance: A component of audio input that contains human voice. One utterance may consist of a single word or multiple words, such as a phrase.
Transcription: The text output of the Speech-to-Text feature. This automatically generated text output leverages speech models (defined below) and is sometimes referred to as machine transcription or automated speech recognition (ASR). Transcription in this context is fully automated and therefore different from human transcription, which is text generated by human transcribers.
Speech model: An automatically generated, machine-learned numerical representation of an utterance used to infer a transcription from an audio input. Speech models are trained on voice data that includes various speech styles, languages, accents, dialects and intonations, as well as acoustic variations generated by different types of recording devices. A speech model numerically represents both acoustic and linguistic features, which are used to predict what text should be associated with the utterance.
Real-time API: An API that accepts requests with audio input, and returns a response in real time with transcription within the same network connection.
Language Detection API: A type of real-time API that detects what language is spoken in a given audio input. A language is inferred based on voice sound in the audio input.
Speech Translation API: Another type of real-time API that generates transcriptions of a given audio input, then translates them into a language specified by the user. This is a cascaded service of Speech Services and Text Translator.
Batch API: A service to send audio input to be transcribed at a later time. Customers specify the location(s) of audio file(s) along with other parameters, such as a language name. The service loads the audio input asynchronously and transcribes it. Once transcription is complete, text file(s) are loaded back to a location specified by the customer.
Speaker Separation (diarization): Speaker Separation (also called diarization) answers the question of who spoke when. It differentiates speakers in an audio input based on their voice characteristics. The Batch API supports diarization and is capable of differentiating two speakers' voices on mono channel recordings. Diarization is combined with speech-to-text functionality in order to provide transcription outputs that contain a speaker entry for each transcribed phrase. This diarization feature only supports the differentiation of two voices, and the transcription output is tagged as Speaker 1 or Speaker 2. If more than two speakers are present in the audio, this will result in inaccurate tagging.

Speech-to-Text features

There are two ways to use the Speech-to-Text service: Real-time and Batch.

Real-time Speech-to-Text API

This is a common API call via the Speech SDK/REST API to send an audio input and receive a text transcription in real time. The speech system uses a "speech model" to recognize what is spoken in a given input audio. A speech model consists of an acoustic model and a language model, which combine to calculate likely spoken content. The acoustic model maps a small linguistic unit called a "phoneme" to voice sound in an audio input, which leads to a probability of a specific word or phrase. The language model calculates probabilities of word combinations in a specified language. The Speech-to-Text technology then infers a final phrase or text based on the combined probabilities. This feature is often used, for example, for voice-enabled queries or dictation within a customer's service or application.
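The combination of acoustic and language model scores described above can be illustrated with a toy sketch. This is not the actual service internals; the candidate phrases and probabilities are made up solely to show why a language model helps disambiguate acoustically similar utterances.

```python
import math

# Hypothetical candidates for one utterance, with made-up scores:
# "acoustic" ~ how well the phonemes match the audio,
# "language" ~ how likely the word sequence is in the target language.
candidates = {
    "recognize speech":   {"acoustic": 0.60, "language": 0.30},
    "wreck a nice beach": {"acoustic": 0.55, "language": 0.01},
}

def combined_score(scores, lm_weight=1.0):
    # Scores are combined in log space: log P(audio|phrase) + w * log P(phrase).
    return math.log(scores["acoustic"]) + lm_weight * math.log(scores["language"])

best = max(candidates, key=lambda phrase: combined_score(candidates[phrase]))
print(best)  # the language model tips the balance toward the likelier phrase
```

Both candidates sound almost alike (similar acoustic scores), so the language model's estimate of which word sequence is plausible decides the final transcription.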

Sub-features and options of the Real-time Speech-to-Text API

  • Language Detection: Unlike in a default API call, where a language (or a locale) for an audio input must be specified in advance, with language detection, customers can specify multiple locales and let the service detect a language.

  • Speech Translation: This API converts audio input to text, and then translates and transcribes it into another language. The translated transcription output can be returned in text format or the customer may choose to have the text synthesized into audible speech by Text-to-Speech (TTS). See What is the Translator service for more information about the text translation service.

Batch transcription API

This is another type of API call, typically used to send lengthy audio inputs and to receive transcribed text asynchronously (i.e., at a later time). To use this API, users can specify locations of multiple audio files. The Speech-to-Text technology reads the audio input from the file streams and generates transcription text files which are returned to the customer's specified storage location. This feature is used to support larger transcription jobs (e.g., an audio broadcast for later text transcription) where it is not necessary to provide end users the transcription content in real time.
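The asynchronous submit/poll/fetch pattern described above can be sketched as follows. The in-memory service class below is a stand-in, not the real Batch API: a real client would POST the audio-file locations to the service and poll a status endpoint until the job completes.

```python
import time

class FakeBatchService:
    """In-memory stand-in for a batch transcription service."""

    def __init__(self):
        self._jobs = {}

    def submit(self, audio_urls, locale="en-US"):
        # A real job runs asynchronously; here it "completes" immediately.
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {
            "status": "Succeeded",
            "files": [f"{url}.transcript.json" for url in audio_urls],
        }
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]["status"]

    def results(self, job_id):
        return self._jobs[job_id]["files"]

svc = FakeBatchService()
job = svc.submit(["https://example.com/call1.wav"])
while svc.status(job) not in ("Succeeded", "Failed"):
    time.sleep(1)  # poll until the job finishes
print(svc.results(job))  # locations of the generated transcription files
```

The key point is that the caller never holds a connection open for the duration of the transcription; it submits, checks back later, and downloads the result files from the specified storage location.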

Sub-features and options of the Batch transcription API
  • Speaker Separation (diarization): This feature is disabled by default. If a customer chooses to enable this feature, the service will differentiate between up to two speakers' utterances. The resulting transcription text contains a "speaker property" that indicates either Speaker 1 or Speaker 2, denoting which of the two speakers is speaking in an audio file. Speaker separation only works with two speakers. If audio input sent via the Batch API contains more than two speakers, the service will still attribute all voices to only two speakers, leading to errors in the speaker tags tied to the transcription output.
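A diarized transcription can be consumed as sketched below. The dictionary is a simplified, hypothetical version of a batch transcription result; the exact field names in real output files may differ, but the idea is the same: each transcribed phrase carries a speaker property of 1 or 2.

```python
# Simplified, hypothetical shape of a diarized batch transcription result.
result = {
    "recognizedPhrases": [
        {"speaker": 1, "display": "How can I help you today?"},
        {"speaker": 2, "display": "I'd like to check my order status."},
        {"speaker": 1, "display": "Sure, one moment."},
    ]
}

# Render a readable transcript with one speaker-tagged line per phrase.
lines = [f"Speaker {p['speaker']}: {p['display']}" for p in result["recognizedPhrases"]]
print("\n".join(lines))
```

Note that if the recording actually contained a third voice, its phrases would still be tagged Speaker 1 or Speaker 2, which is the mis-tagging limitation described above.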

Example use cases

Speech-to-Text can offer more convenient or alternative ways for end users to interact with applications and devices. Instead of typing words on a keyboard or using their hands for touch screen interactions, Speech-to-Text technology allows users to operate applications and devices by voice and through dictation.

  • Smart Assistants: Companies developing smart assistants for appliances, cars, and homes can use Speech-to-Text to enable natural interface search queries or command-and-control features by voice.

  • Chat Bots: Companies can build chat bot applications, in which users can use voice-enabled queries or commands to interact with bots.

  • Voice Typing: Apps may allow users to use voice to dictate long-form text. Voice typing can be used to enter text for texting, emails, and documents.

  • Voice Commanding: Users can trigger certain actions by voice. This is called "command and control", and typical examples are entering query text by voice and selecting a menu item by voice.

  • Voice Translation: Customers can use the speech translation features of Speech-to-Text technology to communicate by voice with other users who speak different languages. Speech translation enables voice-to-voice communication across multiple languages. See the latest list of supported locales in Language and voice support for the Speech service.

  • Call Center Transcriptions: Customers often record conversations with their end users in scenarios such as customer support calls. Audio recordings can be sent to the Batch API for transcription.

  • Mixed-language Dictation: Customers can use Speech-to-Text technology to dictate in multiple languages. Using Language Detection, a dictation application may automatically detect spoken languages and transcribe appropriately without forcing users to specify which language they speak.

Considerations when choosing other use cases

Speech-to-Text API can offer users convenient options for enabling voice enabled applications, but it is very important to consider the context within which you will integrate the API and ensure that you comply with all laws and regulations that apply to your application. This includes understanding your obligations under privacy and communication laws, including national and regional eavesdropping and wiretap laws, that apply to your jurisdiction. Only collect and process audio that is within the reasonable expectations of your users; this includes ensuring you have all necessary and appropriate permissions from users for your collection, processing and storage of audio data.

Many applications are designed and intended to be used by a specific individual user for voice-enabled queries, commands, or dictation. However, the microphone for your application may pick up sound or voice from non-primary users. To avoid unintentionally capturing the voices of non-primary users, you should take the following into consideration.

Microphone considerations: Often you cannot control who may speak around the input device that sends audio input to the Speech-to-Text cloud service. You should therefore encourage your end users to take extra care when using voice-enabled features and applications in a public or open environment, where other people's voices can easily be captured, and when allowing non-primary users to use the voice-enabled application.

Only use Speech-to-Text in experiences and features that are within the reasonable expectations of your users: Audio data of a person speaking is personal information. Speech-to-Text is not intended to be used for covert audio surveillance purposes, in a manner that violates legal requirements, or in applications and devices in public spaces or locations where users may have a reasonable expectation of privacy. Use speech services only to collect and process audio in ways that are within the reasonable expectations of your users; this includes ensuring you have all necessary and appropriate permissions from users for your collection, processing, and storage of audio data.

Next steps