Characteristics and limitations of Speech-to-Text

Speech-to-Text recognizes what is spoken in an audio input and generates transcription output. This requires proper setup for the language and speaking styles expected in the audio input. Non-optimal settings may lead to lower accuracy.

Measuring accuracy

The industry standard for measuring Speech-to-Text accuracy is Word Error Rate (WER). WER counts the number of word errors (substitutions, deletions, and insertions) in the recognition output and divides it by the total number of words in the correct reference transcript (often created by human labeling).

For details of how WER is calculated, see the documentation on Speech-to-Text evaluation and improvement.
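
As an illustration, the following is a minimal sketch of a WER calculation in Python using a word-level edit distance. It is not the implementation used by the service, and the example sentences are hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one substitution in a five-word reference -> WER = 0.2
print(wer("send the report by friday", "send a report by friday"))
```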

Transcription Accuracy and System Limitations

Speech-to-Text uses a unified speech recognition machine learning model to understand what is spoken in a wide range of contexts and topic domains, such as command and control, dictation, and conversation. Therefore, you don't need to use different models for different application or feature scenarios.

However, you need to specify a language (or locale) for each audio input, and it must match the actual language spoken in the input audio. Please check the list of supported locales for more details.
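
For example, with the Speech SDK for Python the recognition locale is set on the configuration object before the recognizer is created. This is a minimal sketch; the subscription key, region, and audio file name are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials and audio file.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"  # must match the language spoken in the audio

audio_config = speechsdk.audio.AudioConfig(filename="input.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```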

There are many factors that can lead to lower transcription accuracy:

  • Acoustic quality: Speech-to-Text enabled applications and devices may use a wide variety of microphone types and specifications. The unified speech models have been created based on various voice audio device scenarios, such as telephones, mobile phones, and speaker devices. However, voice quality can be degraded by the way users speak into a microphone, even a high-quality one. For example, if a user is too far from the microphone, the input level may be too low, while speaking too close to it can cause the audio to distort. Both cases may adversely impact Speech-to-Text accuracy.

  • Non-speech noise: If the input audio contains a certain level of noise, accuracy suffers. Noise may be introduced by the devices used to make a recording, and the audio itself may contain background or environmental noise.

  • Overlapped speech: There may be multiple speakers within range of the audio input device, and they may speak at the same time. Other speakers may also talk in the background while the main user is speaking.

  • Vocabularies: The Speech-to-Text model knows a wide variety of words across many domains. However, users may speak company-specific terms and jargon that are "out of vocabulary". If the audio contains a word that the model doesn't know, it is likely to be mis-transcribed.

  • Accents: Even within one locale (such as "English -- United States"), people speak with different accents, and less common accents may also lead to mis-transcriptions.

  • Mismatched locales: Users may not speak the language you expect. If you specified English -- United States (en-US) for an audio input but a user spoke Swedish, for instance, accuracy would be lower.

Due to these acoustic and linguistic variations, customers should expect a certain level of inaccuracy in the output text when designing an application.

Best practices to improve system accuracy

As discussed above, acoustic conditions such as background noise, side speech, distance to the microphone, and speaking styles and characteristics can adversely affect the accuracy of what is being recognized.

Please consider the following application/service design principles for better speech experiences.

  • Design UIs to match input locales: Mismatched locales degrade accuracy. The Speech SDK supports automatic language detection, but it only selects one language from the (up to four) candidate locales specified at runtime, so you still need to understand which locales your users will speak. Your user interface should clearly indicate which languages users may speak, for example through a drop-down list of the allowed languages. Please see supported locales.

  • Allow users to try again: Misrecognition may occur due to temporary issues, such as unclear or fast speech or a long pause. If your application expects specific transcriptions, such as predefined action commands ("Yes" and "No"), and did not get any of them, users should be able to try again (see the sketch after this list). A typical method is to tell users, "Sorry, I didn't get that. Please try again."

  • Confirm before taking an action by voice: Just as with keyboard-, click-, or tap-based user interfaces, if an audio input can trigger an action, users should be given an opportunity to confirm the action, especially by displaying or playing back what was recognized. A typical example is sending a text message by voice: the app repeats what was recognized and asks for confirmation, "You said, 'Thank you'. Send it or change it?"

  • Add custom vocabularies: The general speech recognition model provided by Speech-to-Text covers a broad vocabulary. However, scenario-specific jargon and named entities (people's names, product names, and so on) may be underrepresented, and which words and phrases are likely to be spoken can vary significantly depending on the scenario. If you can anticipate which words and phrases will be spoken (for instance, when a user selects an item from a list), you may want to use the phrase list grammar shown in the sketch after this list; see "Improving recognition accuracy" in Get started with Speech-to-Text.

  • Use Custom Speech: If Speech-to-Text accuracy remains low in your application scenarios, consider customizing the model for your acoustic and linguistic variations. You can create your own models by training them with your own voice audio data or text data. See more details about Custom Speech.
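
The following is a minimal sketch of the "try again" and "custom vocabularies" practices above, assuming the Speech SDK for Python; the credentials, product names, retry count, and prompt are placeholder assumptions, not recommendations from the service.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)  # default microphone

# Bias recognition toward scenario-specific terms (hypothetical product names).
phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
phrase_list.addPhrase("Contoso Meridian")
phrase_list.addPhrase("Fabrikam X200")

# Let the user try again if nothing usable was recognized.
for attempt in range(3):
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech and result.text:
        print(f"You said: {result.text}")  # confirm before acting on the transcription
        break
    print("Sorry, I didn't get that. Please try again.")
```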

Fairness

At Microsoft, we strive to empower every person on the planet to achieve more. An essential part of this goal is working to create technologies and products that are fair and inclusive. Fairness is a multi-dimensional, sociotechnical topic and impacts many different aspects of our product development. You can learn more about Microsoft's approach to fairness here.

One dimension we need to consider is how well the system performs for different groups of people. Research has shown that without conscious effort focused on improving performance for all groups, it is often possible for the performance of an AI system to vary across groups based on factors such as race, ethnicity, region, gender and age.

Speech-to-Text models are trained and tuned with voice audio that varies across:

  • Microphones and device specifications
  • Speech environments
  • Speech scenarios
  • Languages and accents
  • Ages and genders
  • Ethnic backgrounds

Each version of the Speech-to-Text model is tested and evaluated against various test sets to make sure the model performs without large gaps across the evaluation criteria.

Evaluating Speech-to-Text in your applications

The performance of Speech-to-Text will vary depending on the real-world uses and conditions in which customers implement it. To ensure optimal performance in their scenarios, customers should conduct their own evaluations of the solutions they implement using Speech-to-Text.

A test voice dataset should consist of actual voice inputs collected from your application in production. Customers should randomly sample data so that the test set reflects real user variation over a certain period of time. Additionally, the test dataset should be refreshed periodically to reflect changes in that variation.
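
For illustration only, the sketch below scores a random sample of production utterances against their reference transcripts, reusing the wer function from the earlier sketch; the CSV layout, column names, and sample size are hypothetical assumptions.

```python
import csv
import random
import statistics

# Hypothetical CSV with one row per utterance; columns assumed: "reference", "hypothesis".
with open("production_transcripts.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Randomly sample utterances so the test set reflects real user variation.
sample = random.sample(rows, k=min(500, len(rows)))

# wer() is the word error rate function from the earlier sketch.
scores = [wer(row["reference"], row["hypothesis"]) for row in sample]
print(f"Average WER over {len(sample)} utterances: {statistics.mean(scores):.2%}")
```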

Next steps