What is the Speech service?
The Speech service unites the Azure speech features previously available via the Bing Speech API, Translator Speech, Custom Speech, and Custom Voice services. Now, one subscription provides access to all of these capabilities.
Like the other Azure speech services, the Speech service is powered by the speech technologies used in products like Cortana and Microsoft Office. You can count on the quality of the results and the reliability of the cloud platform.
The Speech service is currently in public preview. Return here for documentation updates, new code samples, and more.
Main Speech service functions
The primary functions of the Speech service are Speech to Text (also called speech recognition or transcription), Text to Speech (speech synthesis), and Speech Translation.
|Speech to Text||
|Text to Speech||
* Intent recognition requires a LUIS subscription.
Customize speech features
You can use your own data to train the models that underlie the Speech service's Speech-to-Text and Text-to-Speech features.
|Feature||Model||Purpose|
|Speech to Text||Acoustic model||Helps transcribe particular speakers and environments, such as cars or factories.|
|Speech to Text||Language model||Helps transcribe field-specific vocabulary and grammar, such as medical or IT jargon.|
|Speech to Text||Pronunciation model||Helps transcribe abbreviations and acronyms, such as "IOU" for "I owe you."|
|Text to Speech||Voice font||Gives your app a voice of its own by training the model on samples of human speech.|
You can use your custom models anywhere you use the standard models in your app's Speech-to-Text or Text-to-Speech functionality.
Use the Speech service
To simplify the development of speech-enabled applications, Microsoft provides the Speech SDK for use with the new Speech service. The Speech SDK provides consistent native Speech-to-Text and Speech Translation APIs for C#, C++, and Java. If you develop with one of these languages, the Speech SDK makes development easier by handling the network details for you.
The Speech service also has a REST API that works with any programming language that can make HTTP requests. The REST interface does not offer the streaming, real-time functionality of the SDK.
|Method||Speech to Text||Text to Speech||Speech Translation||Description|
|Speech SDK||Yes||No||Yes||Native APIs for C#, C++, and Java to simplify development.|
|REST||Yes||Yes||No||A simple HTTP-based API that makes it easy to add speech to your applications.|
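Because the REST interface needs only an HTTP client, a request can be assembled with a standard library. The following is a minimal sketch, not the definitive call: the endpoint host pattern, path, and header names are assumptions that you should confirm against the REST reference for your region, and `YOUR_SUBSCRIPTION_KEY` and the audio bytes are placeholders. The request is built but not sent.

```python
import urllib.request


def build_recognition_request(region, subscription_key, wav_bytes, language="en-US"):
    """Build (but do not send) a Speech to Text REST request."""
    # Assumed endpoint shape; verify the host and path for your subscription.
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1"
        f"?language={language}"
    )
    # Assumed header names; the key identifies your Speech service subscription.
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


# Placeholder values for illustration only.
request = build_recognition_request("westus", "YOUR_SUBSCRIPTION_KEY", b"RIFF....")
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen` and read the response body. The whole audio clip travels in a single request, which is why this interface does not provide the streaming behavior of the SDK.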
The Speech service also has WebSocket protocols for streaming Speech to Text and Speech Translation. The Speech SDK uses these protocols to communicate with the Speech service. Use the Speech SDK rather than implementing your own WebSocket communication with the Speech service.
If you already have code that uses Bing Speech or Translator Speech via WebSockets, you can update it to use the Speech service. The WebSocket protocols are compatible; only the endpoints are different.
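As a rough illustration of such an update, the sketch below swaps the host of an existing WebSocket URL and leaves the rest untouched. Both host names are illustrative assumptions rather than values from this article, and the sketch also assumes the path and query string carry over unchanged, so check the actual endpoints for your region before relying on it. The helper name `migrate_websocket_endpoint` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def migrate_websocket_endpoint(old_url, new_host):
    """Return old_url with only its host replaced; path and query are kept."""
    parts = urlsplit(old_url)
    if parts.scheme != "wss":
        raise ValueError(f"expected a wss:// URL, got {old_url!r}")
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))


# Both hosts below are assumptions used for illustration only.
old = (
    "wss://speech.platform.bing.com"
    "/speech/recognition/interactive/cognitiveservices/v1?language=en-US"
)
new = migrate_websocket_endpoint(old, "westus.stt.speech.microsoft.com")
print(new)
```

Keeping the endpoint in one configuration value, instead of scattering it through the code, makes this kind of change a one-line edit.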
Speech Devices SDK
The Speech Devices SDK is an integrated hardware and software platform for developers of speech-enabled devices. Our hardware partner provides reference designs and development units. Microsoft provides a device-optimized SDK that takes full advantage of the hardware's capabilities.
Use cases for the Speech service include:
- Create voice-triggered apps
- Transcribe call center recordings
- Implement voice bots
Voice user interface
Voice input is a great way to make your app flexible, hands-free, and quick to use. With a voice-enabled app, users can just ask for the information they want.
If your app is intended for use by the general public, you can use the default speech recognition models. They recognize a wide variety of speakers in common environments.
If your app is used in a specific domain, such as medicine or IT, you can create a custom language model. This model teaches the Speech service the special terminology used by your app.
If your app is used in a noisy environment, such as a factory, you can create a custom acoustic model. This model helps the Speech service to distinguish speech from noise.
Call center transcription
Often, call center recordings are consulted only if an issue arises with a call. With the Speech service, you can transcribe every recording to text. You can then index the text for full-text search or apply Text Analytics to detect sentiment, language, and key phrases.
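To make the indexing step concrete, here is a minimal sketch of a full-text index over transcripts that already exist as text. It is an in-memory inverted index, not a feature of the Speech service. The call IDs and transcript sentences are made-up sample data, and `build_index` and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(transcripts):
    """Map each lowercase word to the set of call IDs whose transcript contains it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(call_id)
    return index


def search(index, query):
    """Return the IDs of calls whose transcript contains every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up transcripts standing in for Speech to Text output.
transcripts = {
    "call-001": "I would like a refund for my router.",
    "call-002": "My router keeps dropping the connection.",
    "call-003": "Please update my billing address.",
}
index = build_index(transcripts)
print(sorted(search(index, "router refund")))  # ['call-001']
print(sorted(search(index, "my router")))      # ['call-001', 'call-002']
```

A production system would more likely hand the transcripts to a dedicated search engine, but the principle is the same: once calls are text, they can be queried like any other document.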
If your call center recordings involve specialized terminology, such as product names or IT jargon, you can create a language model to teach the Speech service the vocabulary. A custom acoustic model can help the Speech service understand less-than-optimal phone connections.
For more information about this scenario, see batch transcription with the Speech service.
Voice bots
Bots are a popular way to connect users with the information they want and customers with businesses they like. When you add a conversational user interface to your website or app, the functionality is easier to find and quicker to access. With the Speech service, your bot can respond to spoken queries in kind, which makes the conversation more fluent.
To add a unique personality to your voice-enabled bot, you can give it a voice of its own. Creating a custom voice is a two-step process. First, make recordings of the voice you want to use. Then submit those recordings along with a text transcript to the Speech service's voice customization portal, which does the rest. After you create your custom voice, the steps to use it in your app are straightforward.
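Before you submit recordings and a transcript, it can help to confirm that the two line up. The sketch below assumes a hypothetical layout in which each recording is named by an utterance ID (for example, `0001.wav`) and each transcript line has the form `ID<TAB>text`. This layout is an assumption for illustration, so check the format that the voice customization portal actually requires. The helper name `find_unmatched` is also hypothetical.

```python
def find_unmatched(recording_names, transcript_text):
    """Return (IDs with audio but no text, IDs with text but no audio)."""
    # Strip the file extension to get the utterance ID of each recording.
    recording_ids = {name.rsplit(".", 1)[0] for name in recording_names}
    transcript_ids = set()
    for line in transcript_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        utterance_id, _, _text = line.partition("\t")
        transcript_ids.add(utterance_id.strip())
    missing_text = sorted(recording_ids - transcript_ids)
    missing_audio = sorted(transcript_ids - recording_ids)
    return missing_text, missing_audio


# Made-up sample data in the assumed layout.
recordings = ["0001.wav", "0002.wav", "0003.wav"]
transcript = (
    "0001\tHello, how can I help you?\n"
    "0002\tYour order has shipped.\n"
    "0004\tGoodbye.\n"
)
missing_text, missing_audio = find_unmatched(recordings, transcript)
print(missing_text)   # ['0003']
print(missing_audio)  # ['0004']
```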
Get a subscription key for the Speech service.