About the Speech SDK

The Speech software development kit (SDK) exposes many of the Speech service capabilities so that you can develop speech-enabled applications. The Speech SDK is available in many programming languages and across the platforms listed below.

| Programming language | Platform | SDK reference |
|---|---|---|
| C# (1) | Windows, Linux, macOS, Mono, Xamarin.iOS, Xamarin.Mac, Xamarin.Android, UWP, Unity | .NET SDK |
| C++ | Windows, Linux, macOS | C++ SDK |
| Java (2) | Android, Windows, Linux, macOS | Java SDK |
| JavaScript | Browser, Node.js | JavaScript SDK |
| Objective-C / Swift | iOS, macOS | Objective-C SDK |
| Python | Windows, Linux, macOS | Python SDK |

(1) The .NET Speech SDK is based on .NET Standard 2.0, so it supports many platforms. For more information, see .NET implementation support.

(2) The Java Speech SDK is also available as part of the Speech Devices SDK.

Scenario capabilities

The Speech SDK exposes many features from the Speech service, but not all of them. The capabilities of the Speech SDK are often associated with scenarios. The Speech SDK is ideal for both real-time and non-real-time scenarios, using local devices, files, Azure blob storage, and even input and output streams. When a scenario is not achievable with the Speech SDK, look for a REST API alternative.

Speech-to-text

Speech-to-text (also known as speech recognition) transcribes audio streams to text that your applications, tools, or devices can consume or display. Use speech-to-text with Language Understanding (LUIS) to derive user intents from transcribed speech and act on voice commands. Use Speech Translation to translate speech input to a different language with a single call. For more information, see Speech-to-text basics.

Text-to-speech

Text-to-speech (also known as speech synthesis) converts text into human-like synthesized speech. The input text is either a plain string or a document written in the Speech Synthesis Markup Language (SSML). For more information on standard or neural voices, see Text-to-speech language and voice support.
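To illustrate the SSML input form, here is a minimal Python sketch that wraps plain text in a speak/voice document using only the standard library. The helper name build_ssml and the default voice name are assumptions chosen for illustration; check the voice support list for the names available to your resource.

```python
import xml.etree.ElementTree as ET

# Namespace used by SSML documents (W3C Speech Synthesis).
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document.

    build_ssml is a hypothetical helper; the default voice name is an
    assumption and may not be available for every Speech resource.
    """
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    # ElementTree escapes characters such as & and < when serializing.
    voice_el.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Hello & welcome to the Speech SDK."))
```

Building the document with an XML library rather than string concatenation keeps special characters in the input text escaped correctly.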

Voice assistants

Voice assistants using the Speech SDK enable developers to create natural, human-like conversational interfaces for their applications and experiences. The voice assistant service provides fast, reliable interaction between a device and an assistant. The implementation uses the Bot Framework's Direct Line Speech channel or the integrated Custom Commands (Preview) service for task completion. Additionally, voice assistants can use custom voices created in the Custom Voice Portal to add a unique voice output experience.

Keyword spotting

The Speech SDK supports keyword spotting: identifying a keyword in speech and then taking an action when the keyword is heard. For example, "Hey Cortana" would activate the Cortana assistant.

Meeting scenarios

The Speech SDK is well suited to transcribing meetings, whether from a single device or a multi-device conversation.

Conversation Transcription

Conversation Transcription enables real-time (and asynchronous) speech recognition, speaker identification, and sentence attribution to each speaker (also known as diarization). It's perfect for transcribing in-person meetings with the ability to distinguish speakers.

Multi-device Conversation

With Multi-device Conversation, connect multiple devices or clients in a conversation to send speech-based or text-based messages, with easy support for transcription and translation.

Customer / agent scenarios

The Speech SDK can be used for transcribing call center scenarios, where telephony data is generated.

Call Center Transcription

Call Center Transcription is a common speech-to-text scenario for transcribing large volumes of telephony data that may come from various systems, such as Interactive Voice Response (IVR). The latest speech recognition models from the Speech service excel at transcribing this telephony data, even in cases when the data is difficult for a human to understand.

Codec compressed audio input

Several of the Speech SDK programming languages support codec-compressed audio input streams. For more information, see Use compressed audio input formats.

REST API

While the Speech SDK covers many capabilities of the Speech service, for some scenarios you might want to use the REST API.

Batch transcription

Batch transcription enables asynchronous speech-to-text transcription of large volumes of data. Batch transcription is available only through the REST API. In addition to converting speech audio to text, batch speech-to-text also allows for diarization and sentiment analysis.
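As a rough sketch of what a batch transcription request might look like, the Python snippet below builds (but does not send) an HTTP POST request with the standard library. The endpoint path, the JSON field names, and the subscription-key header are assumptions based on version 3.0 of the Speech-to-text REST API; confirm them against the REST API reference for the version you use. The function name build_batch_request is hypothetical.

```python
import json
import urllib.request


def build_batch_request(region, key, audio_urls, locale="en-US", diarization=False):
    """Build a POST request that would create a batch transcription.

    Assumption: the v3.0 endpoint path and field names below match the
    REST API version you target. Nothing is sent over the network here.
    """
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.0/transcriptions"
    )
    body = {
        "contentUrls": list(audio_urls),   # URLs of the audio files to transcribe
        "locale": locale,
        "displayName": "Example batch transcription",
        "properties": {"diarizationEnabled": diarization},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_batch_request(
        "westus", "YOUR_SUBSCRIPTION_KEY", ["https://example.com/audio.wav"]
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would submit the job; because transcription is asynchronous, the result would then be retrieved with later requests rather than from this call.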

Customization

The Speech service delivers great functionality with its default models across speech-to-text, text-to-speech, and speech translation. Sometimes you may want to improve on the baseline performance for your unique use case. The Speech service has a variety of no-code customization tools that make this easy and let you create a competitive advantage with custom models based on your own data. These models are available only to you and your organization.

Custom Speech-to-text

When using speech-to-text for recognition and transcription in a unique environment, you can create and train custom acoustic, language, and pronunciation models to address ambient noise or industry-specific vocabulary. The creation and management of no-code Custom Speech models is available through the Custom Speech Portal. Once the Custom Speech model is published, it can be consumed by the Speech SDK.

Custom Text-to-speech

Custom text-to-speech, also known as Custom Voice, is a set of online tools that allow you to create a recognizable, one-of-a-kind voice for your brand. The creation and management of no-code Custom Voice models is available through the Custom Voice Portal. Once the Custom Voice model is published, it can be consumed by the Speech SDK.

Get the Speech SDK

The Speech SDK supports Windows 10 and Windows Server 2016, or later versions. Earlier versions are not officially supported. It is possible to use parts of the Speech SDK with earlier versions of Windows, although it's not advised.


Windows

System requirements

The Speech SDK on Windows requires the Microsoft Visual C++ Redistributable for Visual Studio 2019 on the system.

C#

The .NET Speech SDK is available as a NuGet package and implements .NET Standard 2.0. For more information, see Microsoft.CognitiveServices.Speech.



C# NuGet Package

The .NET Speech SDK can be installed from the .NET Core CLI with the following dotnet add command.

dotnet add package Microsoft.CognitiveServices.Speech

The .NET Speech SDK can be installed from the Package Manager with the following Install-Package command.

Install-Package Microsoft.CognitiveServices.Speech

Additional resources

For microphone input, the Media Foundation libraries must be installed. These libraries are part of Windows 10 and Windows Server 2016. It's possible to use the Speech SDK without these libraries, as long as a microphone isn't used as the audio input device.

The required Speech SDK files can be deployed in the same directory as your application. This way your application can directly access the libraries. Make sure you select the correct version (x86/x64) that matches your application.

| Name | Function |
|---|---|
| Microsoft.CognitiveServices.Speech.core.dll | Core SDK, required for native and managed deployment |
| Microsoft.CognitiveServices.Speech.csharp.dll | Required for managed deployment |

Note

Starting with release 1.3.0, the file Microsoft.CognitiveServices.Speech.csharp.bindings.dll (shipped in previous releases) is no longer needed. The functionality is now integrated in the core SDK.

Important

For a Windows Forms App (.NET Framework) C# project, make sure the libraries are included in your project's deployment settings. You can check this under Properties -> Publish section. Select the Application Files button and find the corresponding libraries in the list. Make sure the value is set to Included. Visual Studio then includes the files when the project is published or deployed.

C++

The C++ Speech SDK is available on Windows, Linux, and macOS. For more information, see Microsoft.CognitiveServices.Speech.



C++ NuGet package

The C++ Speech SDK can be installed from the Package Manager with the following Install-Package command.

Install-Package Microsoft.CognitiveServices.Speech

C++ binaries and header files

Alternatively, the C++ Speech SDK can be installed from binaries. Download the SDK as a .tar package and unpack the files in a directory of your choice. The contents of this package (which include header files for both x86 and x64 target architectures) are structured as follows:

| Path | Description |
|---|---|
| license.md | License |
| ThirdPartyNotices.md | Third-party notices |
| include | Header files for C++ |
| lib/x64 | Native x64 library for linking with your application |
| lib/x86 | Native x86 library for linking with your application |

To create an application, copy or move the required binaries (and libraries) into your development environment. Include them as required in your build process.
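One way to script the copy step is sketched below in Python. It copies the native libraries for a single target architecture from the unpacked package layout described above (lib/x64 or lib/x86) into an application directory. The function name copy_native_libs and the directory arguments are hypothetical; adapt them to your own build process.

```python
import shutil
from pathlib import Path

# Architectures that have a lib/<arch> folder in the package layout above.
SUPPORTED_ARCHS = ("x64", "x86")


def copy_native_libs(sdk_root, app_dir, arch):
    """Copy the native libraries for one architecture next to the application.

    copy_native_libs is a hypothetical helper, not part of the Speech SDK.
    Returns the sorted list of copied file names.
    """
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(f"arch must be one of {SUPPORTED_ARCHS}, got {arch!r}")
    source = Path(sdk_root) / "lib" / arch
    if not source.is_dir():
        raise FileNotFoundError(f"expected library folder not found: {source}")
    target = Path(app_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(source.iterdir()):
        if item.is_file():
            shutil.copy2(item, target / item.name)  # copy2 keeps timestamps
            copied.append(item.name)
    return copied
```

Rejecting an unknown architecture up front avoids silently mixing x86 and x64 binaries, which must match the architecture of your application.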


Python

The Python Speech SDK is available as a Python Package Index (PyPI) module. For more information, see azure-cognitiveservices-speech. The Python Speech SDK is compatible with Windows, Linux, and macOS.


pip install azure-cognitiveservices-speech

Tip

If you are on macOS, you may need to run the following command to get the pip command above to work:

python3 -m pip install --upgrade pip
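After installing, you can confirm that the package is visible to your Python environment. The following sketch uses the standard library's importlib.metadata (Python 3.8 or later) to look up the installed version; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package="azure-cognitiveservices-speech"):
    """Return the installed version of a distribution, or None if it is absent.

    installed_version is a hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("azure-cognitiveservices-speech is not installed in this environment")
    else:
        print(f"azure-cognitiveservices-speech {version} is installed")
```

Run the script with the same interpreter you used for pip, since each Python environment has its own set of installed packages.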


Java

The Java SDK for Android is packaged as an AAR (Android Library), which includes the necessary libraries and required Android permissions. It's hosted in a Maven repository at https://csspeechstorage.blob.core.windows.net/maven/ as package com.microsoft.cognitiveservices.speech:client-sdk:1.13.0.



To consume the package from your Android Studio project, make the following changes:

  1. In the project-level build.gradle file, add the following to the repositories section:
     maven { url 'https://csspeechstorage.blob.core.windows.net/maven/' }
  2. In the module-level build.gradle file, add the following to the dependencies section:
     implementation 'com.microsoft.cognitiveservices.speech:client-sdk:1.13.0'

The Java SDK is also part of the Speech Devices SDK.


Important

By downloading any of the Azure Cognitive Services Speech SDKs, you acknowledge its license. For more information, see:

Sample source code

The Speech SDK team actively maintains a large set of examples in an open-source repository. For the sample source code repository, visit the Microsoft Cognitive Services Speech SDK on GitHub. There are samples for C#, C++, Java, Python, Objective-C, Swift, JavaScript, UWP, Unity, and Xamarin.



Next steps