Speech Service Documentation (Preview)
The Microsoft Speech Service and the Speech SDK give developers an easy way to add powerful speech-enabled features to their applications, such as voice command control, transcription, and dictation. Cortana, Microsoft Office, and other products use the same technology. A key differentiator of the Microsoft Speech Service is the ability to customize Speech-to-Text acoustic and language models to accommodate specialized vocabulary, noisy environments, and different ways of speaking. Customized Text-to-Speech and Speech-Translation are available as well.
Our Speech SDK is available on multiple platforms and in several programming languages, so you can handle live microphone management, real-time streaming, or batch-based file communication, depending on your application's needs.
10-Minute Quickstarts - New SDKs
Learn how to install and use the Speech SDK with our quickstarts:
- Devices SDK (Android)
Samples and Reference
Our Speech SDK supports Speech-to-Text (STT), or speech recognition. The Speech SDK transcribes audio streams into text that your application can accept as input. Your application can then, for example, enter the text into a document or act upon it as a command.
The following are common use cases for Speech-to-Text via the Speech SDK:
- Recognize a brief utterance, such as a command, without interim results.
- Transcribe a long, previously recorded utterance, such as a voicemail message.
- Transcribe streaming speech in real time, with partial results, for dictation.
- Determine what users want to do based on a spoken natural-language request.
The Speech SDK supports interactive speech transcription with real-time continuous Speech-to-Text and interim results. It also supports end-of-speech detection, optional automatic capitalization and punctuation, profanity masking, and text normalization.
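To illustrate how an application might treat interim and final results differently, here is a minimal Python sketch. It does not use the Speech SDK: the `TranscriptView` class and its `on_interim` and `on_final` methods are hypothetical names invented for this example, standing in for whatever callbacks or events your SDK language binding exposes.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptView:
    """Hypothetical display model; not part of the Speech SDK."""

    finalized: list = field(default_factory=list)  # utterances that will not change
    interim: str = ""                              # latest partial hypothesis

    def on_interim(self, text: str) -> None:
        # Each partial result replaces the previous one; it may still be revised.
        self.interim = text

    def on_final(self, text: str) -> None:
        # Once the end of speech is detected, commit the utterance
        # and clear the partial hypothesis.
        self.finalized.append(text)
        self.interim = ""

    def render(self) -> str:
        # Show committed text, followed by the pending partial (if any).
        parts = list(self.finalized)
        if self.interim:
            parts.append(self.interim + "...")
        return " ".join(parts)


view = TranscriptView()
view.on_interim("send an")
view.on_interim("send an email to")
print(view.render())   # partial text, marked as still changing
view.on_final("Send an email to Dana.")
print(view.render())   # capitalized, punctuated final text
```

Keeping the partial hypothesis separate from the committed text means a revised interim result never overwrites text the user has already seen as final.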
A key differentiator of Microsoft Speech Service is the ability to customize Speech-to-Text acoustic and language models to accommodate specialized vocabulary, noisy environments, and different ways of speaking.
The Speech SDK can be used for Speech-Translation as well. With streaming Speech-Translation via the Speech SDK, the service returns interim results that can be displayed to the user to indicate translation progress. The results may be returned either as text or as voice.
Use cases for Speech-Translation via the Speech SDK include the following:
- Implement a conversational translation mobile app or device for travelers.
- Provide automatic translations for subtitling of audio and video recordings.
Speech-Translation models can also be customized.
Currently, Text-to-Speech (TTS), or speech synthesis, is supported via a REST API that converts plain text to natural-sounding speech that's delivered to your application in an audio file. Multiple voices, varying in gender or accent, are available for many languages.
Our Speech SDK provides access to Speech-to-Text and Speech-Translation. Text-to-Speech employs REST POST calls over HTTP.
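As a rough sketch of what such a POST call might look like, the following Python snippet builds (but does not send) a request using only the standard library. The endpoint URL, the access token, and the header values are placeholders and assumptions for illustration; consult the REST API reference for the actual endpoint, authentication scheme, and required headers.

```python
import urllib.request

# Hypothetical placeholders: substitute the values for your own subscription.
ENDPOINT = "https://example.invalid/text-to-speech"
TOKEN = "YOUR_ACCESS_TOKEN"


def build_tts_request(text: str,
                      endpoint: str = ENDPOINT,
                      token: str = TOKEN) -> urllib.request.Request:
    """Build an HTTP POST request that carries the text to synthesize."""
    return urllib.request.Request(
        endpoint,
        data=text.encode("utf-8"),   # request body: the text to speak
        method="POST",
        headers={
            # Assumed header values; the real service may require others.
            "Authorization": f"Bearer {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


request = build_tts_request("Turn left in 200 meters.")
print(request.get_method(), request.full_url)

# Sending the request would return the synthesized audio as bytes:
# with urllib.request.urlopen(request) as response:
#     audio = response.read()
```

The network call is left commented out so that the sketch runs without credentials.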
The Text-to-Speech REST API supports Speech-Synthesis-Markup-Language (SSML) tags, so you can specify the exact phonetic pronunciation for troublesome words. SSML can also indicate speech characteristics (including emphasis, rate, volume, gender, and pitch) right in the text.
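For example, an SSML document that slows the speaking rate and spells out a pronunciation could be assembled as follows. This is a sketch based on the W3C SSML elements `speak`, `prosody`, and `phoneme`; the helper name `build_ssml`, the sample sentence, and the IPA string are illustrative, and whether the service accepts each element and attribute shown here is an assumption to verify against the API reference.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def build_ssml(lead_in: str, word: str, ipa: str,
               lang: str = "en-US", rate: str = "slow") -> str:
    """Return an SSML string that speaks `lead_in`, then `word` pronounced as `ipa`."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": SSML_NS,
        "xml:lang": lang,
    })
    # <prosody> sets speech characteristics such as rate, volume, and pitch.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = lead_in
    # <phoneme> gives the exact phonetic pronunciation of the enclosed word.
    phoneme = ET.SubElement(prosody, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("I say ", "tomato", "təˈmeɪtoʊ")
print(ssml)
```

Building the markup with an XML library rather than string concatenation ensures that characters such as `&` and `<` in the input text are escaped correctly.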
If you want an unsupported dialect or unique voice for your application, you can create custom voice fonts from your own speech samples.
The following are common use cases for the Text-to-Speech REST API:
- Speech output as an alternative to screen output for visually impaired users.
- Voice prompting for in-car applications, such as navigation.
- Conversational user interfaces in concert with the Speech-to-Text API.
Speech Devices SDK
With the introduction of the unified Speech Service, Microsoft and its partners now offer an integrated hardware/software platform that's optimized for developing speech-enabled devices: the Speech Devices SDK. This SDK is suitable for developing smart speech devices for all types of applications.
The Speech Devices SDK allows you to build your own ambient devices with a customized wake word unique to your brand that triggers audio capture. It also provides superior audio processing from multi-channel sources for more accurate speech recognition, including noise suppression, far-field voice, and beamforming.
SDKs in Consideration