What is the Speech service?

Like the other Azure speech services, the Speech service is powered by the speech technologies used in products like Cortana and Microsoft Office.

The Speech service unites the Azure speech features previously available via the Bing Speech API, Translator Speech, Custom Speech, and Custom Voice services. Now, one subscription provides access to all of these capabilities.

Main Speech service functions

The primary functions of the Speech service are Speech to Text (also called speech recognition or transcription), Text to Speech (speech synthesis), and Speech Translation.

Function Features
Speech to Text
  • Transcribes continuous real-time speech into text.
  • Can batch-transcribe speech from audio recordings.
  • Supports intermediate results, end-of-speech detection, automatic text formatting, and profanity masking.
  • Can call on Language Understanding (LUIS) to derive user intent from transcribed speech.*
Text to Speech
  • Converts text to natural-sounding speech.
  • Offers multiple genders and/or dialects for many supported languages.
  • Supports plain text input or Speech Synthesis Markup Language (SSML).
Speech Translation
  • Translates streaming audio in near-real-time.
  • Can also process recorded speech.
  • Provides results as text or synthesized speech.

* Intent recognition requires a LUIS subscription.
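
As an illustration of the SSML input mentioned under Text to Speech, the following Python sketch wraps plain text in a minimal SSML document. It is not taken from the Speech service documentation. The function name build_ssml and the voice name "example-voice" are placeholders, and a real service may require additional attributes (for example, a namespace declaration) on the speak element.

```python
import xml.etree.ElementTree as ET

# Namespace URI that stands behind the reserved "xml:" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text, voice_name, lang="en-US"):
    """Wrap plain text in a minimal SSML document.

    voice_name is whatever voice identifier your service expects;
    the value used below is a placeholder, not a real voice.
    """
    speak = ET.Element("speak", {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    voice.text = text  # ElementTree escapes &, <, and > when serializing
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Tom & Jerry say hello.", "example-voice")
print(ssml)
```

Building the document with an XML library rather than string concatenation means that characters such as "&" in the input text are escaped automatically.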

Customize speech features

You can use your own data to train the models that underlie the Speech service's Speech-to-Text and Text-to-Speech features.

Feature / Model: Purpose
Speech to Text
  • Acoustic model: Helps transcribe particular speakers and environments, such as cars or factories.
  • Language model: Helps transcribe field-specific vocabulary and grammar, such as medical or IT jargon.
  • Pronunciation model: Helps transcribe abbreviations and acronyms, such as "IOU" for "I owe you."
Text to Speech
  • Voice font: Gives your app a voice of its own by training the model on samples of human speech.

You can use your custom models anywhere you use the standard models in your app's Speech-to-Text or Text-to-Speech functionality.

Use the Speech service

To simplify the development of speech-enabled applications, Microsoft provides the Speech SDK for use with the Speech service. The Speech SDK provides consistent native Speech-to-Text and Speech Translation APIs for C#, C++, and Java. If you develop with one of these languages, the Speech SDK makes development easier by handling the network details for you.

The Speech service also has a REST API that works with any programming language that can make HTTP requests. The REST interface does not offer the streaming, real-time functionality of the SDK.
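
Because the REST interface only needs plain HTTP, a request can be assembled with a standard library. The Python sketch below builds, but does not send, a POST request that carries a complete audio file in its body. The endpoint URL, the header names, and the "language" query parameter are assumptions made for illustration. Check the REST reference for the values your subscription and region actually require.

```python
import urllib.parse
import urllib.request


def build_recognition_request(endpoint, key, audio_bytes, language="en-US"):
    """Build (but do not send) a POST request carrying WAV audio."""
    url = endpoint + "?" + urllib.parse.urlencode({"language": language})
    headers = {
        # Assumed header names; confirm them against the REST reference.
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav",
        "Accept": "application/json",
    }
    return urllib.request.Request(
        url, data=audio_bytes, headers=headers, method="POST"
    )


request = build_recognition_request(
    "https://speech.example.invalid/recognize",  # hypothetical endpoint
    "YOUR_SUBSCRIPTION_KEY",                     # placeholder key
    b"RIFF....WAVE",                             # stand-in for real audio bytes
)
print(request.get_method(), request.full_url)
# To send it, pass the object to urllib.request.urlopen() and read the response.
```

The whole recording travels in a single request body, which is consistent with the REST interface not offering the SDK's streaming, real-time behavior.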


Method
Speech SDK
  • Speech to Text: Yes
  • Text to Speech: No
  • Speech Translation: Yes
  • Description: Native APIs for C#, C++, and Java to simplify development.
REST
  • Speech to Text: Yes
  • Text to Speech: Yes
  • Speech Translation: No
  • Description: A simple HTTP-based API that makes it easy to add speech to your applications.

WebSockets

The Speech service also has WebSocket protocols for streaming Speech to Text and Speech Translation. The Speech SDKs use these protocols to communicate with the Speech service. Use the Speech SDK instead of trying to implement your own WebSocket communication with the Speech service.

If you already have code that uses Bing Speech or Translator Speech via WebSockets, you can update it to use the Speech service. The WebSocket protocols are compatible; only the endpoints are different.
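
If only the endpoint changes, the migration can be as small as swapping the host in a configured URL. The Python sketch below shows that idea with placeholder host names. Both URLs are hypothetical, and whether the path and query string carry over unchanged is an assumption that you should verify against the endpoint list for the Speech service.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget_endpoint(old_url, new_host):
    """Return old_url with only its host replaced; path and query are kept."""
    parts = urlsplit(old_url)
    return urlunsplit(parts._replace(netloc=new_host))


# Hypothetical endpoints; substitute the real ones for your services.
old = "wss://old-speech.example.invalid/speech/recognition?language=en-US"
new = retarget_endpoint(old, "new-speech.example.invalid")
print(new)
```

Keeping the endpoint in configuration rather than in code means that a change like this does not touch the WebSocket message handling itself.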

Speech Devices SDK

The Speech Devices SDK is an integrated hardware and software platform for developers of speech-enabled devices. Our hardware partner provides reference designs and development units. Microsoft provides a device-optimized SDK that takes full advantage of the hardware's capabilities.

Why move to the Speech service?

The Speech service provides all the functionality of the Bing Speech API and three other Azure speech services (Custom Speech, Custom Voice, and Translator Speech), and more. We encourage users of these services to migrate to the Speech service.

The Speech service improves on these other services in several ways, including:

  • Higher speech recognition accuracy. We regularly improve the models used in the service.

  • Better scalability. The service handles multiple simultaneous requests more effectively, which reduces latency.

  • The Speech service uses a time-based pricing model. See Speech service pricing for details.

  • The Speech service is available in multiple regions to suit the needs of customers worldwide. You need an Azure subscription for each region used by your application.

  • A single Speech service subscription key grants access to all of the service's features. Each feature is metered separately, so you're charged only for the features you use.

  • The Speech service's speech-to-text function integrates with the Language Understanding service (LUIS) to recognize speaker intent. A LUIS endpoint key can also be used with the Speech service. See the intent recognition tutorial for details.

  • Speech-to-text no longer requires that you specify a recognition mode.

  • The Speech service supports 24-kHz voices for text-to-speech, improving audio quality. At this writing, there are two such voices (US English only): Jessa24kRUS and Guy24kRUS.

  • The Speech service's batch transcription allows high volumes of recorded speech, such as call center recordings, to be transcribed to text efficiently, so they can be easily analyzed and searched.

  • When using the Speech SDK, there is no time limit on streaming speech-to-text transcription.

  • The Speech SDK provides a consistent API to the Speech service across several programming languages and execution environments (including Windows 10, UWP, and .NET Core), making development easier, especially on multiple platforms.

  • The Speech service is compatible with the REST APIs and WebSocket protocols used by other Azure speech services, making it easy to migrate existing client applications to the Speech service.

Speech scenarios

Use cases for the Speech service include:

  • Creating voice-triggered apps
  • Transcribing call center recordings
  • Implementing voice bots

Voice user interface

Voice input is a great way to make your app flexible, hands-free, and quick to use. With a voice-enabled app, users can just ask for the information they want.

If your app is intended for use by the general public, you can use the default speech recognition models. They recognize a wide variety of speakers in common environments.

If your app is used in a specific domain, for example, medicine or IT, you can create a language model. You can use this model to teach the Speech service about the special terminology used by your app.

If your app is used in a noisy environment, such as a factory, you can create a custom acoustic model. This model helps the Speech service to distinguish speech from noise.

Getting started is easy. Just download the Speech SDK and follow the relevant Quickstart article.

Call center transcription

Often, call center recordings are consulted only if an issue arises with a call. With the Speech service, it's easy to transcribe every recording to text. You can then index the text for full-text search or apply Text Analytics to detect sentiment, language, and key phrases.
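
To make the full-text search idea concrete, here is a minimal Python sketch of an inverted index over transcripts. It does not call the Speech service. The transcripts, call IDs, and function names are hypothetical and stand in for whatever your transcription step produces. A production system would more likely use a dedicated search engine.

```python
import re
from collections import defaultdict


def build_index(transcripts):
    """Map each lowercase word to the set of call IDs whose transcript has it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(call_id)
    return index


def search(index, query):
    """Return the call IDs whose transcripts contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Hypothetical transcripts, standing in for transcription output.
transcripts = {
    "call-001": "My router keeps dropping the connection.",
    "call-002": "I would like a refund for the router.",
    "call-003": "The connection is fine now, thank you.",
}
index = build_index(transcripts)
print(sorted(search(index, "router connection")))
```

Here the query matches only the calls that mention both words, so a support team could quickly find every call about a specific combination of topics.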

If your call center recordings involve specialized terminology, such as product names or IT jargon, you can create a language model to teach the Speech service the vocabulary. A custom acoustic model can help the Speech service understand less-than-optimal phone connections.

For more information about this scenario, read about batch transcription with the Speech service.

Voice bots

Bots are a popular way to connect users with the information they want and customers with businesses they like. When you add a conversational user interface to your website or app, the functionality is easier to find and quicker to access. With the Speech service, your bot can respond to spoken queries in kind, which gives the conversation a new dimension of fluency.

To add a unique personality to your voice-enabled bot, you can give it a voice of its own. Creating a custom voice is a two-step process. First, make recordings of the voice you want to use. Then submit those recordings along with a text transcript to the Speech service's voice customization portal, which does the rest. After you create your custom voice, the steps to use it in your app are straightforward.

Next steps

Get a subscription key for the Speech service.