What is Speech Services?

Like other Azure speech services, the Speech Services are powered by the speech technologies used in products such as Cortana and Microsoft Office.

The Speech Services unite Azure speech features previously available via the Bing Speech API, Translator Speech, Custom Speech, and Custom Voice services. Now, one subscription provides access to all of these capabilities.

Main Speech Services functions

The primary functions of the Speech Services are speech-to-text (also called speech recognition or transcription), text-to-speech (speech synthesis), and speech translation.

Speech-to-Text
  • Transcribes continuous real-time speech into text.
  • Can batch-transcribe speech from audio recordings.
  • Supports intermediate results, end-of-speech detection, automatic text formatting, and profanity masking.
  • Can call on Language Understanding (LUIS) to derive user intent from transcribed speech.

Text-to-Speech
  • NEW: Provides neural text-to-speech voices nearly indistinguishable from human speech (English).
  • Converts text to natural-sounding speech.
  • Offers multiple genders and/or dialects for many supported languages.
  • Supports plain text input or Speech Synthesis Markup Language (SSML).

Speech translation
  • Translates streaming audio in near-real-time.
  • Can also process recorded speech.
  • Provides results as text or synthesized speech.
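
As an illustration of the SSML input mentioned under Text-to-Speech, the following minimal Python sketch wraps plain text in a small SSML document using only the standard library. The helper name build_ssml and the voice name "example-voice" are hypothetical placeholders, not names taken from the Speech Services; consult the service documentation for the voices and SSML elements it actually accepts.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the W3C SSML and XML specifications.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text, voice_name, lang="en-US"):
    """Wrap plain text in a minimal SSML document (hypothetical helper)."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, f"{{{XML_NS}}}lang": lang},
    )
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    # ElementTree escapes characters such as & and < when serializing.
    voice.text = text
    return ET.tostring(speak, encoding="unicode")


# "example-voice" is a placeholder, not a real voice name.
print(build_ssml("Tom & Jerry say hello.", voice_name="example-voice"))
```

Building the markup with an XML library rather than string concatenation keeps the document well-formed even when the input text contains reserved characters.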
    Customize speech features

    You can use your own data to train the models that underlie the Speech service's Speech-to-Text and Text-to-Speech features.

    Feature          Model                 Purpose
    Speech-to-text   Acoustic model        Helps transcribe particular speakers and environments, such as cars or factories.
                     Language model        Helps transcribe field-specific vocabulary and grammar, such as medical or IT jargon.
                     Pronunciation model   Helps transcribe abbreviations and acronyms, such as "IOU" for "I owe you."
    Text-to-speech   Voice font            Gives your app a voice of its own by training the model on samples of human speech.

    You can use your custom models anywhere you use the standard models in your app's Speech-to-Text or Text-to-Speech functionality.

    Use the Speech service

    To simplify the development of speech-enabled applications, Microsoft provides the Speech SDK for use with the Speech service. The Speech SDK provides consistent native Speech-to-Text and Speech Translation APIs for C#, C++, and Java. If you develop with one of these languages, the Speech SDK makes development easier by handling the network details for you.

    The Speech Services also have a REST API that works with any programming language that can make HTTP requests. The REST interface does not offer the streaming, real-time functionality of the SDK.
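
To make the REST option more concrete, here is a hedged Python sketch that builds (but does not send) an HTTP request for speech-to-text using only the standard library. The endpoint URL pattern, the Ocp-Apim-Subscription-Key header, and the audio content type are assumptions that do not come from this article; verify them against the REST API reference for your region before use. The function name build_recognition_request is hypothetical.

```python
import urllib.request


def build_recognition_request(region, subscription_key, wav_bytes, language="en-US"):
    """Build a POST request carrying WAV audio for transcription.

    The URL pattern and header names below are assumptions; check them
    against the REST API reference.
    """
    url = (
        f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
        f"conversation/cognitiveservices/v1?language={language}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


# Placeholder region, key, and audio; nothing is sent over the network here.
request = build_recognition_request("westus", "YOUR_SUBSCRIPTION_KEY", b"")
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it. Because the whole recording travels in one request body, this style suits short, prerecorded audio rather than the streaming scenarios that the SDK handles.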

    Method       Speech to Text   Text to Speech   Speech Translation   Description
    Speech SDK   Yes              No               Yes                  Native APIs for C#, C++, and Java to simplify development.
    REST APIs    Yes              Yes              No                   A simple HTTP-based API that makes it easy to add speech to your applications.


    The Speech Services also support WebSocket protocols for streaming speech-to-text and speech translation. The Speech SDKs use these protocols to communicate with the Speech service. Use the Speech SDK instead of trying to implement your own WebSocket communication with the Speech service.

    If you already have code that uses Bing Speech or Translator Speech via WebSockets, you can update it to use the Speech Services. The WebSocket protocols are compatible, but the endpoints are different.

    Speech Devices SDK

    The Speech Devices SDK is an integrated hardware and software platform for developers of speech-enabled devices. Our hardware partner provides reference designs and development units. Microsoft provides a device-optimized SDK that takes full advantage of the hardware's capabilities.

    Speech scenarios

    Use cases for Speech Services include:

    • Create voice-triggered apps
    • Transcribe call center recordings
    • Implement voice bots

    Voice user interface

    Voice input is a great way to make your app flexible, hands-free, and quick to use. With a voice-enabled app, users can just ask for the information they want.

    If your app is intended for use by the general public, you can use the default speech recognition models. They recognize a wide variety of speakers in common environments.

    If your app is used in a specific domain, for example, medicine or IT, you can create a language model. You can use this model to teach the Speech Services about the special terminology used by your app.

    If your app is used in a noisy environment, such as a factory, you can create a custom acoustic model. This model helps the Speech Services to distinguish speech from noise.

    Call center transcription

    Often, call center recordings are consulted only if an issue arises with a call. With the Speech service, it's easy to transcribe every recording to text. You can easily index the text for full-text search or apply Text Analytics to detect sentiment, language, and key phrases.
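
As a rough sketch of the full-text search idea, the following Python example builds a small inverted index over transcripts that are assumed to have been produced already. The call IDs and transcript text are made-up sample data, and a production system would more likely use a dedicated search service than this in-memory structure.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def build_index(transcripts):
    """Map each lowercase word to the set of call IDs whose transcript contains it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(call_id)
    return index


def search(index, query):
    """Return the call IDs whose transcripts contain every word in the query."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up sample transcripts keyed by hypothetical call IDs.
transcripts = {
    "call-001": "My router keeps dropping the connection.",
    "call-002": "I would like a refund for the router.",
}
index = build_index(transcripts)
print(sorted(search(index, "router refund")))  # prints ['call-002']
```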

    If your call center recordings involve specialized terminology, such as product names or IT jargon, you can create a language model to teach the Speech Services the vocabulary. A custom acoustic model can help the Speech Services understand less-than-optimal phone connections.

    For more information about this scenario, see batch transcription with the Speech service.

    Voice bots

    Bots are a popular way to connect users with the information they want and customers with businesses they like. When you add a conversational user interface to your website or app, the functionality is easier to find and quicker to access. With the Speech service, your bot can respond to spoken queries in kind, which gives the conversation a new dimension of fluency.

    To add a unique personality to your voice-enabled bot, you can give it a voice of its own. Creating a custom voice is a two-step process. First, make recordings of the voice you want to use. Then submit those recordings along with a text transcript to the Speech service's voice customization portal, which does the rest. After you create your custom voice, the steps to use it in your app are straightforward.

    Next steps

    Get a subscription key for the Speech Services.