What is the Speech SDK?
The Speech software development kit (SDK) exposes many of the Speech service capabilities that you can use to develop speech-enabled applications. The Speech SDK is available in several programming languages and on a range of platforms.
Programming language | Platform | SDK reference |
---|---|---|
C# 1 | Windows, Linux, macOS, Mono, Xamarin.iOS, Xamarin.Mac, Xamarin.Android, UWP, Unity | .NET SDK |
C++ | Windows, Linux, macOS | C++ SDK |
Go | Linux | Go SDK |
Java | Android, Windows, Linux, macOS | Java SDK |
JavaScript | Browser, Node.js | JavaScript SDK |
Objective-C / Swift | iOS, macOS | Objective-C SDK |
Python | Windows, Linux, macOS | Python SDK |
1 The .NET Speech SDK is based on .NET Standard 2.0, so it supports many platforms. For more information, see .NET implementation support.
Important
C isn't a supported programming language for the Speech SDK. Several supported programming languages, for example, C++, include C headers that are part of a common Application Binary Interface (ABI) layer. These ABI headers are not intended for direct use and are subject to change across versions.
Scenario capabilities
The Speech SDK exposes many features from the Speech service, but not all of them. The capabilities of the Speech SDK are often associated with scenarios. The Speech SDK is ideal for both real-time and non-real-time scenarios, by using local devices, files, Azure Blob Storage, and even input and output streams. When a scenario can't be achieved with the Speech SDK, look for a REST API alternative.
Speech-to-text
Speech-to-text transcribes audio streams to text that your applications, tools, or devices can consume or display. Speech-to-text is also known as speech recognition. Use speech-to-text with Language Understanding (LUIS) to derive user intents from transcribed speech and act on voice commands. Use speech translation to translate speech input to a different language with a single call. For more information, see Speech-to-text basics.
Speech recognition, phrase list, intent, translation, and on-premises containers are available on the following platforms:
- C++/Windows, Linux, and macOS
- C# (.NET Framework and .NET Core)/Windows, UWP, Unity, Xamarin, Linux, and macOS
- Java (JRE and Android)
- JavaScript (browser and Node.js)
- Python
- Swift
- Objective-C
- Go (speech recognition only)
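As a rough illustration of single-shot speech recognition, the following Python sketch recognizes one utterance from the default microphone. The function name `recognize_once_from_microphone` is hypothetical, and the key, region, and language values are placeholders you would supply yourself. The SDK import sits inside the function so the file can be loaded even where the SDK isn't installed.

```python
def recognize_once_from_microphone(key, region, language="en-US"):
    """Recognize a single utterance from the default microphone.

    Returns the recognized text, or None if no speech was recognized.
    `key` and `region` are placeholders for your own Speech resource values.
    """
    # Imported lazily so this module loads without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = language

    # With no audio configuration given, the default microphone is used.
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    # Blocks until one utterance is recognized or the attempt ends.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return None
```

Calling the function requires a Speech resource key, its region, and a working microphone. Long-running audio is usually better served by continuous recognition than by this single-shot call.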
Text-to-speech
Text-to-speech converts text into humanlike synthesized speech. Text-to-speech is also known as speech synthesis. The input text is either string literals or uses the Speech Synthesis Markup Language (SSML). For more information on standard or neural voices, see Text-to-speech language and voice support.
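To make the difference between plain-string input and SSML input concrete, here is a minimal Python sketch that wraps plain text in an SSML document. The helper name `build_ssml` is hypothetical, and the default voice name is only an assumption; pick a voice from the language and voice support list.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document.

    The default voice name is an assumption for illustration only.
    """
    # Escape &, <, and > so arbitrary input text cannot break the XML.
    body = escape(text)
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )
```

Escaping matters because an unescaped `&` or `<` in the input text produces malformed XML, which the service would reject rather than speak.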
Text-to-speech is available on the following platforms:
- C++/Windows, Linux, and macOS
- C# (.NET Framework and .NET Core)/Windows, UWP, Unity, Xamarin, Linux, and macOS
- Java (JRE and Android)
- JavaScript (browser and Node.js)
- Python
- Swift
- Objective-C
- Go
- Text-to-speech REST API can be used in every other situation
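Where no SDK is available, a text-to-speech REST call might look roughly like the following Python sketch. It only builds the request and doesn't send it. The function name `build_tts_request` and the `User-Agent` value are hypothetical, and the endpoint pattern, header names, and output format reflect the text-to-speech REST API as I understand it, so verify them against the current REST reference.

```python
import urllib.request


def build_tts_request(region, key, ssml,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Build (but do not send) a text-to-speech REST request.

    `region` and `key` are placeholders for your Speech resource values.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "speech-rest-sketch",  # hypothetical client name
    }
    # The request body is the SSML document, encoded as UTF-8.
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )
```

To synthesize audio, you would pass the returned request to `urllib.request.urlopen` and write the response bytes to a file.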
Voice assistants
Voice assistants using the Speech SDK enable you to create natural, humanlike conversational interfaces for your applications and experiences. The Speech SDK provides fast, reliable interaction that includes speech-to-text, text-to-speech, and conversational data on a single connection. Your implementation can use the Bot Framework's Direct Line Speech channel or the integrated Custom Commands service for task completion. Also, voice assistants can use custom voices created in the Custom Voice portal to add a unique voice output experience.
Voice assistant support is available on the following platforms:
- C++/Windows, Linux, and macOS
- C#/Windows
- Java/Windows, Linux, macOS, and Android (Speech Devices SDK)
- Go
Keyword recognition
The Speech SDK supports keyword recognition: identifying a keyword in speech and then performing an action when the keyword is heard. For example, "Hey Cortana" would activate the Cortana assistant.
Keyword recognition is available on the following platforms:
- C++/Windows and Linux
- C#/Windows and Linux
- Python/Windows and Linux
- Java/Windows, Linux, and Android
Meeting scenarios
The Speech SDK is perfect for transcribing meeting scenarios, whether from a single device or multidevice conversation.
Conversation transcription
Conversation transcription enables real-time and asynchronous speech recognition, speaker identification, and attribution of each sentence to its speaker. Attributing speech to individual speakers is also known as diarization. Conversation transcription is well suited to transcribing in-person meetings where you need to distinguish speakers.
Conversation transcription is available on the following platforms:
- C++/Windows and Linux
- C# (.NET Framework and .NET Core)/Windows, UWP, and Linux
- Java/Windows, Linux, and Android
Multidevice conversation
With multidevice conversation, you can connect multiple devices or clients in a conversation to send speech-based or text-based messages, with easy support for transcription and translation.
Multidevice conversation is available on the following platforms:
- C++/Windows
- C# (Framework and .NET Core)/Windows
Custom/agent scenarios
The Speech SDK can be used for transcribing call center scenarios, where telephony data is generated.
Call center transcription
Call center transcription is a common scenario for speech-to-text for transcribing large volumes of telephony data that might come from various systems, such as interactive voice response. The latest speech recognition models from the Speech service excel at transcribing this telephony data, even in cases when the data is difficult for a human to understand.
Call center transcription is available through the batch Speech service via its REST API and can be used in any situation.
Codec-compressed audio input
Several of the Speech SDK programming languages support codec-compressed audio input streams. For more information, see Use compressed audio input formats.
Codec-compressed audio input is available on the following platforms:
- C++/Linux
- C#/Linux
- Java/Linux, Android, and iOS
REST API
The Speech SDK covers many feature capabilities of the Speech service, but for some scenarios you might want to use the REST API.
Batch transcription
Batch transcription enables asynchronous speech-to-text transcription of large volumes of data. Batch transcription is only possible from the REST API. In addition to converting speech audio to text, batch speech-to-text also allows for diarization and sentiment analysis.
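As a hedged sketch of what creating a batch transcription through the REST API could look like, the Python snippet below builds the request without sending it. It assumes version 3.0 of the speech-to-text REST API, including the endpoint path and the `contentUrls`, `locale`, `displayName`, and `diarizationEnabled` fields. The function name `build_batch_transcription_request` and the default display name are hypothetical.

```python
import json
import urllib.request


def build_batch_transcription_request(region, key, content_urls,
                                      locale="en-US",
                                      display_name="Batch transcription sketch",
                                      diarization=False):
    """Build (but do not send) a request that creates a batch transcription.

    Assumes the v3.0 speech-to-text REST API; `region` and `key` are
    placeholders for your Speech resource values.
    """
    url = (f"https://{region}.api.cognitive.microsoft.com"
           "/speechtotext/v3.0/transcriptions")
    body = {
        "contentUrls": list(content_urls),  # audio files reachable by URL
        "locale": locale,
        "displayName": display_name,
        "properties": {"diarizationEnabled": diarization},
    }
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"),
        headers=headers, method="POST"
    )
```

Because batch transcription is asynchronous, sending this request only creates the job. You would then poll the transcription resource returned by the service until it completes and download the result files.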
Customization
The Speech service delivers great functionality with its default models across speech-to-text, text-to-speech, and speech translation. Sometimes you might want to increase the baseline performance to work even better with your unique use case. The Speech service has various no-code customization tools that make it easy. You can use them to create a competitive advantage with custom models based on your own data. These models will only be available to you and your organization.
Custom speech-to-text
When you use speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic, language, and pronunciation models to address ambient noise or industry-specific vocabulary. The creation and management of no-code Custom Speech models is available in the Speech Studio. After the Custom Speech model is published, it can be consumed by the Speech SDK.
Custom text-to-speech
Custom text-to-speech, also known as Custom Voice, is a set of online tools that allow you to create a recognizable, one-of-a-kind voice for your brand. The creation and management of no-code Custom Voice models is available through the Custom Voice portal. After the Custom Voice model is published, it can be consumed by the Speech SDK.
Get the Speech SDK
The Speech SDK supports Windows 10 and Windows Server 2016, or later versions. Earlier versions are not officially supported. It's possible to use parts of the Speech SDK with earlier versions of Windows, although it's not advised.
System requirements
The Speech SDK on Windows requires the Microsoft Visual C++ Redistributable for Visual Studio 2019 on the system.
C#
The .NET Speech SDK is available as a NuGet package and implements .NET Standard 2.0. For more information, see Microsoft.CognitiveServices.Speech.
C# NuGet package
The .NET Speech SDK can be installed from the .NET Core CLI with the following dotnet add command:
dotnet add package Microsoft.CognitiveServices.Speech
The .NET Speech SDK can be installed from the Package Manager with the following Install-Package command:
Install-Package Microsoft.CognitiveServices.Speech
C++
The C++ Speech SDK is available as a NuGet package on Windows, Linux, and macOS. For more information, see Microsoft.CognitiveServices.Speech. The C++ Speech SDK is also available as a tar package from https://aka.ms/csspeech/linuxbinary.
C++ NuGet package
The C++ Speech SDK can be installed from the Package Manager with the following Install-Package command:
Install-Package Microsoft.CognitiveServices.Speech
Additional resources
Windows, Linux, and macOS quickstart C++ source code
Python
The Python Speech SDK is available as a Python Package Index (PyPI) module. For more information, see azure-cognitiveservices-speech. The Python Speech SDK is compatible with Windows, Linux, and macOS. Install a version of Python from 3.7 to 3.10.
Before you install the Python Speech SDK, make sure to satisfy the system requirements and prerequisites.
To install the Speech SDK, run this command in a terminal.
pip install azure-cognitiveservices-speech
If you're on macOS and run into install issues, you may need to run this command first.
python3 -m pip install --upgrade pip
Now you can import the Speech SDK into your Python project.
import azure.cognitiveservices.speech as speechsdk
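Because the Python Speech SDK targets a specific range of Python versions, a small guard can catch an unsupported interpreter before installation problems surface. The sketch below uses the 3.7 to 3.10 range stated in this section. The helper name `is_supported_python` is hypothetical and isn't part of the SDK.

```python
import sys

# Supported range taken from this section: Python 3.7 through 3.10.
MIN_SUPPORTED = (3, 7)
MAX_SUPPORTED = (3, 10)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is in the supported range.

    With no argument, the running interpreter's version is checked.
    """
    if version is None:
        version = sys.version_info
    major, minor = version[0], version[1]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED
```

For example, `is_supported_python((3, 11))` returns `False`, which signals that a different interpreter is needed for the SDK version described here.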
Java
The Java SDK for Android is packaged as an AAR (Android Library), which includes the necessary libraries and required Android permissions. It's hosted in a Maven repository at https://azureai.azureedge.net/maven/ as package com.microsoft.cognitiveservices.speech:client-sdk:1.19.0. Check our GitHub repo for the latest version, and use that version number in place of the one shown here.
To consume the package from your Android Studio project, make the following changes:
In the project-level build.gradle file, add the following to the repositories section:
maven { url 'https://azureai.azureedge.net/maven/' }
In the module-level build.gradle file, add the following to the dependencies section:
implementation 'com.microsoft.cognitiveservices.speech:client-sdk:1.21.0'
Important
By downloading any of the Azure Cognitive Services Speech SDKs, you acknowledge its license. For more information, see:
Sample source code
The Speech SDK team actively maintains a large set of examples in an open-source repository. For the sample source code repository, see the Microsoft Cognitive Services Speech SDK on GitHub. There are samples for C#, C++, Java, Python, Objective-C, Swift, JavaScript, UWP, Unity, and Xamarin.