What is the Speech service?

The Speech service unites the Azure speech features previously available via the Bing Speech API, Translator Speech, Custom Speech, and Custom Voice services. Now, one subscription provides access to all of these capabilities.

Like the other Azure speech services, the Speech service is powered by the speech technologies used in products like Cortana and Microsoft Office. You can count on the quality of the results and the reliability of the cloud platform.

Note

The Speech service is currently in public preview. Return here for documentation updates, new code samples, and more.

Main Speech service functions

The primary functions of the Speech service are Speech to Text (also called speech recognition or transcription), Text to Speech (speech synthesis), and Speech Translation.

Function Features
Speech to Text
  • Transcribes continuous real-time speech into text.
  • Can batch-transcribe speech from audio recordings.
  • Offers recognition modes for interactive, conversation, and dictation use cases.
  • Supports intermediate results, end-of-speech detection, automatic text formatting, and profanity masking.
  • Can call on Language Understanding (LUIS) to derive user intent from transcribed speech.*
Text to Speech
  • Converts text to natural-sounding speech.
  • Offers multiple genders and/or dialects for many supported languages.
  • Supports plain text input or Speech Synthesis Markup Language (SSML).
Speech Translation
  • Translates streaming audio in near-real-time.
  • Can also process recorded speech.
  • Provides results as text or synthesized speech.

* Intent recognition requires a LUIS subscription.
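
Because Text to Speech accepts Speech Synthesis Markup Language (SSML) as well as plain text, it can help to see what a small SSML document looks like. The following Python sketch builds one and checks that it is well-formed XML. The helper name build_ssml and the voice name are hypothetical placeholders, not values taken from the Speech service. Only the speak and voice elements and the synthesis namespace come from the W3C SSML specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def build_ssml(text, voice_name, lang="en-US"):
    """Wrap plain text in a minimal SSML document (hypothetical helper)."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


# "ExampleVoice" is a placeholder, not a real voice name.
ssml = build_ssml("Fish & chips cost < $10.", "ExampleVoice")

# Parsing the result confirms that it is well-formed XML.
root = ET.fromstring(ssml)
print(root.tag)
```

Escaping the text matters because characters such as & and < would otherwise break the markup.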

Customize speech features

You can use your own data to train the models that underlie the Speech service's Speech-to-Text and Text-to-Speech features.

Feature | Model | Purpose
Speech to Text | Acoustic model | Helps transcribe particular speakers and environments, such as cars or factories.
Speech to Text | Language model | Helps transcribe field-specific vocabulary and grammar, such as medical or IT jargon.
Speech to Text | Pronunciation model | Helps transcribe abbreviations and acronyms, such as "IOU" for "I owe you."
Text to Speech | Voice font | Gives your app a voice of its own by training the model on samples of human speech.

You can use your custom models anywhere you use the standard models in your app's Speech-to-Text or Text-to-Speech functionality.

Use the Speech service

To simplify the development of speech-enabled applications, Microsoft provides the Speech SDK for use with the new Speech service. The Speech SDK provides consistent native Speech-to-Text and Speech Translation APIs for C#, C++, and Java. If you develop with one of these languages, the Speech SDK makes development easier by handling the network details for you.

The Speech service also has a REST API that works with any programming language that can make HTTP requests. The REST interface does not offer the streaming, real-time functionality of the SDK.

Method | Speech to Text | Text to Speech | Speech Translation | Description
Speech SDK | Yes | No | Yes | Native APIs for C#, C++, and Java to simplify development.
REST | Yes | Yes | No | A simple HTTP-based API that makes it easy to add speech to your applications.
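
Since the REST API works with any language that can make HTTP requests, a request can be assembled with nothing more than a standard library. The Python sketch below builds (but does not send) a Speech to Text request. The endpoint URL, the helper name, and the header values are assumptions for illustration. Check the REST reference for the real endpoint, headers, and audio formats before using it.

```python
import urllib.request

# Hypothetical placeholder endpoint; replace it with the one from the REST reference.
ENDPOINT = "https://example.invalid/speech-to-text"


def build_recognition_request(audio_bytes, subscription_key):
    """Build a POST request that carries recorded audio (illustrative only)."""
    return urllib.request.Request(
        ENDPOINT,
        data=audio_bytes,
        method="POST",
        headers={
            # Header name assumed from other Azure Cognitive Services APIs.
            "Ocp-Apim-Subscription-Key": subscription_key,
            # Assumed content type for WAV audio.
            "Content-Type": "audio/wav",
        },
    )


request = build_recognition_request(b"RIFF....WAVE", "YOUR_SUBSCRIPTION_KEY")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it once ENDPOINT points at a real service.
```

Because the whole recording travels in a single request body, this style suits recorded audio rather than the streaming, real-time scenarios that the SDK handles.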

WebSockets

The Speech service also has WebSocket protocols for streaming Speech to Text and Speech Translation. The Speech SDKs use these protocols to communicate with the Speech service. Use the Speech SDK instead of trying to implement your own WebSocket communication with the Speech service.

If you already have code that uses Bing Speech or Translator Speech via WebSockets, you can update it to use the Speech service. The WebSocket protocols are compatible; only the endpoints are different.

Speech Devices SDK

The Speech Devices SDK is an integrated hardware and software platform for developers of speech-enabled devices. Our hardware partner provides reference designs and development units. Microsoft provides a device-optimized SDK that takes full advantage of the hardware's capabilities.

Speech scenarios

Use cases for the Speech service include:

  • Create voice-triggered apps
  • Transcribe call center recordings
  • Implement voice bots

Voice user interface

Voice input is a great way to make your app flexible, hands-free, and quick to use. With a voice-enabled app, users can just ask for the information they want.

If your app is intended for use by the general public, you can use the default speech recognition models. They recognize a wide variety of speakers in common environments.

If your app is used in a specific domain, such as medicine or IT, you can create a custom language model to teach the Speech service the special terminology that your app uses.

If your app is used in a noisy environment, such as a factory, you can create a custom acoustic model. This model helps the Speech service to distinguish speech from noise.

Getting started is easy. Just download the Speech SDK and follow the relevant Quickstart article.

Call center transcription

Often, call center recordings are consulted only if an issue arises with a call. With the Speech service, you can transcribe every recording to text. You can then index the text for full-text search or apply Text Analytics to detect sentiment, language, and key phrases.
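
To make the full-text search idea concrete, here is a minimal Python sketch of an inverted index over transcripts. The transcripts, call IDs, and function names are made up for illustration, and a production system would more likely rely on a dedicated search service than on an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(transcripts):
    """Map each word to the set of call IDs whose transcript contains it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(call_id)
    return index


def search(index, query):
    """Return the call IDs whose transcripts contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


# Made-up transcripts standing in for Speech to Text output.
sample_transcripts = {
    "call-001": "I want a refund for my order",
    "call-002": "My order arrived late",
    "call-003": "Thanks for the quick refund",
}
sample_index = build_index(sample_transcripts)
print(sorted(search(sample_index, "refund")))
```

Queries with several words return only the calls that contain all of them, which is usually the behavior people expect from a basic search box.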

If your call center recordings involve specialized terminology, such as product names or IT jargon, you can create a language model to teach the Speech service the vocabulary. A custom acoustic model can help the Speech service understand less-than-optimal phone connections.

For more information about this scenario, see the article about batch transcription with the Speech service.

Voice bots

Bots are a popular way to connect users with the information they want and customers with businesses they like. When you add a conversational user interface to your website or app, the functionality is easier to find and quicker to access. With the Speech service, your bot can respond to spoken queries in kind, which gives the conversation a new dimension of fluency.

To add a unique personality to your voice-enabled bot, you can give it a voice of its own. Creating a custom voice is a two-step process. First, make recordings of the voice you want to use. Then submit those recordings along with a text transcript to the Speech service's voice customization portal, which does the rest. After you create your custom voice, the steps to use it in your app are straightforward.

Next steps

Get a subscription key for the Speech service.