Use codec compressed audio input
The Speech SDK and Speech CLI can accept compressed audio formats using GStreamer. GStreamer decompresses the audio before it is sent over the wire to the Speech service as raw PCM.
| Platform | Languages | Supported GStreamer version |
|---|---|---|
| Linux | C++, C#, Java, Python, Go | Supported Linux distributions and target architectures |
| Windows (excluding UWP) | C++, C#, Java, Python | 1.18.3 |
| Android | Java | 1.18.3 |
Installing GStreamer on Linux
For more information, see Linux installation instructions.
sudo apt install libgstreamer1.0-0 \
gstreamer1.0-plugins-base \
gstreamer1.0-plugins-good \
gstreamer1.0-plugins-bad \
gstreamer1.0-plugins-ugly
Installing GStreamer on Windows
For more information, see Windows installation instructions.
- Create the folder c:\gstreamer
- Download the installer
- Copy the installer to c:\gstreamer
- Open PowerShell as an administrator.
- Run the following commands in PowerShell:
cd c:\gstreamer
msiexec /passive INSTALLLEVEL=1000 INSTALLDIR=C:\gstreamer /i gstreamer-1.0-msvc-x86_64-1.18.3.msi
- Add the system variable GST_PLUGIN_PATH with the value C:\gstreamer\1.0\msvc_x86_64\lib\gstreamer-1.0
- Add the system variable GSTREAMER_ROOT_X86_64 with the value C:\gstreamer\1.0\msvc_x86_64
- Add another entry to the Path variable with the value C:\gstreamer\1.0\msvc_x86_64\bin
- Reboot the machine
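After the reboot, it can help to confirm that the three settings from the steps above are in place. The following is a minimal sketch rather than an official tool: the function name `check_gstreamer_env` is hypothetical, and the expected values are simply the paths used in this article, so adjust them if you installed GStreamer elsewhere.

```python
import os
from pathlib import PureWindowsPath

# Expected values, copied from the installation steps in this article.
EXPECTED = {
    "GST_PLUGIN_PATH": r"C:\gstreamer\1.0\msvc_x86_64\lib\gstreamer-1.0",
    "GSTREAMER_ROOT_X86_64": r"C:\gstreamer\1.0\msvc_x86_64",
}
EXPECTED_PATH_ENTRY = r"C:\gstreamer\1.0\msvc_x86_64\bin"


def _same_path(a, b):
    """Compare two Windows paths, ignoring case and trailing separators."""
    return PureWindowsPath(a) == PureWindowsPath(b)


def check_gstreamer_env(env=None):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name, expected in EXPECTED.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set")
        elif not _same_path(actual, expected):
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    # Windows separates PATH entries with semicolons.
    entries = [entry for entry in env.get("PATH", "").split(";") if entry]
    if not any(_same_path(entry, EXPECTED_PATH_ENTRY) for entry in entries):
        problems.append(f"PATH does not contain {EXPECTED_PATH_ENTRY}")
    return problems
```

Calling `check_gstreamer_env()` with no arguments inspects the current process environment; passing a dictionary makes the check easy to try without changing system settings.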
Using GStreamer in Android
For details about building libgstreamer_android.so, see the Android build instructions later in this article.
For more information, see Android installation instructions.
Speech SDK version required for compressed audio input
- Speech SDK version 1.10.0 or later is required for RHEL 8 and CentOS 8.
- Speech SDK version 1.11.0 or later is required for Windows.
- Speech SDK version 1.16.0 or later is required for the latest GStreamer on Windows and Android.
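The minimum versions above can be checked programmatically. The sketch below is illustrative only: the target labels (`"rhel8"`, `"windows-latest-gstreamer"`, and so on) and the function names are hypothetical, and the table contains just the three requirements listed here. Versions are compared as integer tuples because plain string comparison would rank "1.9.0" above "1.10.0".

```python
# Minimum Speech SDK versions, taken from the list above.
# The dictionary keys are hypothetical labels chosen for this sketch.
MIN_SDK_VERSION = {
    "rhel8": (1, 10, 0),
    "centos8": (1, 10, 0),
    "windows": (1, 11, 0),
    "windows-latest-gstreamer": (1, 16, 0),
    "android-latest-gstreamer": (1, 16, 0),
}


def parse_version(text):
    """Turn '1.16' or '1.16.0' into a three-part integer tuple."""
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '1.16'
    return tuple(parts)


def supports_compressed_input(target, sdk_version):
    """Return True if sdk_version meets the minimum listed for target."""
    minimum = MIN_SDK_VERSION[target]  # raises KeyError for unlisted targets
    return parse_version(sdk_version) >= minimum
```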
The default audio streaming format is WAV (16 kHz or 8 kHz, 16-bit, mono PCM). In addition to WAV/PCM, the following compressed input formats are supported through GStreamer:
- MP3
- OPUS/OGG
- FLAC
- ALAW in WAV container
- MULAW in WAV container
- ANY (for scenarios where the media format is not known)
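When the format has to be chosen from a file name, one option is a small lookup that falls back to ANY. This is a sketch under assumptions: the function name `guess_container_format` is hypothetical, the extensions mirror the ones used by the Go sample later in this article, and the returned strings are meant to match the member names of the SDK's `AudioStreamContainerFormat` enumeration, which you should confirm against the SDK version you use.

```python
from pathlib import PurePath

# Extension-to-format table. The extensions are assumptions borrowed from
# the Go sample in this article; the values are format names as strings.
EXTENSION_TO_FORMAT = {
    ".mp3": "MP3",
    ".opus": "OGG_OPUS",
    ".flac": "FLAC",
    ".alaw": "ALAW",
    ".mulaw": "MULAW",
}


def guess_container_format(filename):
    """Return a container format name for filename, or 'ANY' if unknown."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix, "ANY")
```

Matching on the final suffix avoids false positives that a substring search would produce for a name such as `my.mp3.backup`.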
GStreamer required to handle compressed audio
Handling compressed audio is implemented using GStreamer. For licensing reasons, GStreamer binaries are not compiled and linked with the Speech SDK. Developers need to install several dependencies and plugins; see Installing on Windows or Installing on Linux. GStreamer binaries need to be in the system path so that the Speech SDK can load them at runtime. For example, on Windows, if the Speech SDK can find libgstreamer-1.0-0.dll or gstreamer-1.0-0.dll (for the latest GStreamer) at runtime, the GStreamer binaries are in the system path.
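One rough way to test the Windows condition described above is to search the directories in PATH for either DLL name. The sketch below is an approximation, not the loader's actual algorithm: Windows also searches other locations (such as the application directory), and the function name `find_on_path` is hypothetical.

```python
import os
from pathlib import Path

# DLL names mentioned in this article for older and newer GStreamer builds.
CANDIDATE_NAMES = ("libgstreamer-1.0-0.dll", "gstreamer-1.0-0.dll")


def find_on_path(names=CANDIDATE_NAMES, path_value=None):
    """Return the first matching file found in a PATH directory, else None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    names = tuple(names)
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty entries produced by stray separators
        for name in names:
            candidate = Path(directory) / name
            if candidate.is_file():
                return candidate
    return None
```

A `None` result suggests that the bin folder of the GStreamer installation is missing from PATH.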
Handling compressed audio is implemented using GStreamer. For licensing reasons GStreamer binaries are not compiled and linked with the Speech SDK. Developers need to install several dependencies and plugins.
Note
For mandatory general setup on Linux, see system requirements and setup instructions.
sudo apt install libgstreamer1.0-0 \
gstreamer1.0-plugins-base \
gstreamer1.0-plugins-good \
gstreamer1.0-plugins-bad \
gstreamer1.0-plugins-ugly
Handling compressed audio is implemented using GStreamer. For licensing reasons GStreamer binaries are not compiled and linked with the Speech SDK. Instead, you'll need to use the prebuilt binaries for Android. To download the prebuilt libraries, see installing for Android development.
libgstreamer_android.so is required. Make sure that all the GStreamer plugins (from the Android.mk file below) are linked in libgstreamer_android.so. When using the latest Speech SDK (1.16 and above) with GStreamer version 1.18.3, libc++_shared.so from the Android NDK must also be present.
GSTREAMER_PLUGINS := coreelements app audioconvert mpg123 \
audioresample audioparsers ogg opusparse \
opus wavparse alaw mulaw flac
Example Android.mk and Application.mk files are provided below. Follow these steps to create the GStreamer shared object, libgstreamer_android.so.
# Android.mk
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := dummy
LOCAL_SHARED_LIBRARIES := gstreamer_android
include $(BUILD_SHARED_LIBRARY)
ifndef GSTREAMER_ROOT_ANDROID
$(error GSTREAMER_ROOT_ANDROID is not defined!)
endif
ifndef APP_BUILD_SCRIPT
$(error APP_BUILD_SCRIPT is not defined!)
endif
ifndef TARGET_ARCH_ABI
$(error TARGET_ARCH_ABI is not defined!)
endif
ifeq ($(TARGET_ARCH_ABI),armeabi)
GSTREAMER_ROOT := $(GSTREAMER_ROOT_ANDROID)/arm
else ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
GSTREAMER_ROOT := $(GSTREAMER_ROOT_ANDROID)/armv7
else ifeq ($(TARGET_ARCH_ABI),arm64-v8a)
GSTREAMER_ROOT := $(GSTREAMER_ROOT_ANDROID)/arm64
else ifeq ($(TARGET_ARCH_ABI),x86)
GSTREAMER_ROOT := $(GSTREAMER_ROOT_ANDROID)/x86
else ifeq ($(TARGET_ARCH_ABI),x86_64)
GSTREAMER_ROOT := $(GSTREAMER_ROOT_ANDROID)/x86_64
else
$(error Target arch ABI not supported: $(TARGET_ARCH_ABI))
endif
GSTREAMER_NDK_BUILD_PATH := $(GSTREAMER_ROOT)/share/gst-android/ndk-build/
include $(GSTREAMER_NDK_BUILD_PATH)/plugins.mk
GSTREAMER_PLUGINS := $(GSTREAMER_PLUGINS_CORE) \
$(GSTREAMER_PLUGINS_CODECS) \
$(GSTREAMER_PLUGINS_PLAYBACK) \
$(GSTREAMER_PLUGINS_CODECS_GPL) \
$(GSTREAMER_PLUGINS_CODECS_RESTRICTED)
GSTREAMER_EXTRA_LIBS := -liconv -lgstbase-1.0 -lGLESv2 -lEGL
include $(GSTREAMER_NDK_BUILD_PATH)/gstreamer-1.0.mk
# Application.mk
APP_STL = c++_shared
APP_PLATFORM = android-21
APP_BUILD_SCRIPT = Android.mk
You can build libgstreamer_android.so using the following commands on Ubuntu 18.04 or 20.04. These commands have only been tested for GStreamer Android version 1.14.4 with Android NDK r16b.
# Assuming wget and unzip already installed on the system
mkdir buildLibGstreamer
cd buildLibGstreamer
wget https://dl.google.com/android/repository/android-ndk-r16b-linux-x86_64.zip
unzip -q -o android-ndk-r16b-linux-x86_64.zip
export PATH=$PATH:$(pwd)/android-ndk-r16b
export NDK_PROJECT_PATH=$(pwd)/android-ndk-r16b
wget https://gstreamer.freedesktop.org/data/pkg/android/1.14.4/gstreamer-1.0-android-universal-1.14.4.tar.bz2
mkdir gstreamer_android
tar -xjf gstreamer-1.0-android-universal-1.14.4.tar.bz2 -C $(pwd)/gstreamer_android/
export GSTREAMER_ROOT_ANDROID=$(pwd)/gstreamer_android
mkdir gstreamer
# Copy the Application.mk and Android.mk from the documentation above and put it inside $(pwd)/gstreamer
# Enable only one of the following at one time to create the shared object for the targeted ABI
echo "building for armeabi-v7a. libgstreamer_android.so will be placed in $(pwd)/armeabi-v7a"
ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=armeabi-v7a NDK_LIBS_OUT=$(pwd)
#echo "building for arm64-v8a. libgstreamer_android.so will be placed in $(pwd)/arm64-v8a"
#ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=arm64-v8a NDK_LIBS_OUT=$(pwd)
#echo "building for x86_64. libgstreamer_android.so will be placed in $(pwd)/x86_64"
#ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=x86_64 NDK_LIBS_OUT=$(pwd)
#echo "building for x86. libgstreamer_android.so will be placed in $(pwd)/x86"
#ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=x86 NDK_LIBS_OUT=$(pwd)
Once the shared object (libgstreamer_android.so) is built, the application developer needs to place it in the Android app so that the Speech SDK can load it.
The Speech SDK can use GStreamer to handle compressed audio. However, for licensing reasons, GStreamer binaries are not compiled and linked with the Speech SDK. Developers need to install several dependencies and plugins; see Installing on Linux. The Speech SDK for Go is supported only on Linux. See Speech SDK for Go to get started with the Microsoft Speech SDK in Go.
Example code using codec compressed audio input
To configure Speech SDK to accept compressed audio input, create PullAudioInputStream or PushAudioInputStream. Then, create an AudioConfig from an instance of your stream class, specifying the compression format of the stream. Find related sample code snippets in About the Speech SDK audio input stream API.
Let's assume that you have an input stream class called pullStream and are using OPUS/OGG. Your code may look like this:
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
// ... omitted for brevity
var speechConfig =
    SpeechConfig.FromSubscription(
        "YourSubscriptionKey",
        "YourServiceRegion");
// Create an audio config specifying the compressed
// audio format and the instance of your input stream class.
var audioFormat =
    AudioStreamFormat.GetCompressedFormat(
        AudioStreamContainerFormat.OGG_OPUS);
var audioConfig =
    AudioConfig.FromStreamInput(
        pullStream,
        audioFormat);
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);
var result = await recognizer.RecognizeOnceAsync();
var text = result.Text;
To configure Speech SDK to accept compressed audio input, create PullAudioInputStream or PushAudioInputStream. Then, create an AudioConfig from an instance of your stream class, specifying the compression format of the stream. Find related sample code in Speech SDK samples.
Let's assume that you have an input stream class called pushStream and are using OPUS/OGG. Your code may look like this:
using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
// ... omitted for brevity
auto config =
    SpeechConfig::FromSubscription(
        "YourSubscriptionKey",
        "YourServiceRegion"
    );
auto audioFormat =
    AudioStreamFormat::GetCompressedFormat(
        AudioStreamContainerFormat::OGG_OPUS
    );
auto audioConfig =
    AudioConfig::FromStreamInput(
        pushStream,
        audioFormat
    );
auto recognizer = SpeechRecognizer::FromConfig(config, audioConfig);
auto result = recognizer->RecognizeOnceAsync().get();
auto text = result->Text;
To configure Speech SDK to accept compressed audio input, create a PullAudioInputStream or PushAudioInputStream. Then, create an AudioConfig from an instance of your stream class, specifying the compression format of the stream. Find related sample code in Speech SDK samples.
Let's assume that you have an input stream class called pullStream and are using OPUS/OGG. Your code may look like this:
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechRecognitionResult;
import com.microsoft.cognitiveservices.speech.SpeechRecognizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import com.microsoft.cognitiveservices.speech.audio.AudioInputStream;
import com.microsoft.cognitiveservices.speech.audio.AudioStreamFormat;
import com.microsoft.cognitiveservices.speech.audio.PullAudioInputStream;
import com.microsoft.cognitiveservices.speech.audio.AudioStreamContainerFormat;
// ... omitted for brevity
SpeechConfig speechConfig =
    SpeechConfig.fromSubscription(
        "YourSubscriptionKey",
        "YourServiceRegion");
// Create an audio config specifying the compressed
// audio format and the instance of your input stream class.
AudioStreamFormat audioFormat =
    AudioStreamFormat.getCompressedFormat(
        AudioStreamContainerFormat.OGG_OPUS);
AudioConfig audioConfig =
    AudioConfig.fromStreamInput(
        pullStream,
        audioFormat);
SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig);
SpeechRecognitionResult result = recognizer.recognizeOnceAsync().get();
String text = result.getText();
To configure Speech SDK to accept compressed audio input, create PullAudioInputStream or PushAudioInputStream. Then, create an AudioConfig from an instance of your stream class, specifying the compression format of the stream.
Let's assume that your use case is to use PullStream for an MP3 file. Your code may look like this:
import time
import azure.cognitiveservices.speech as speechsdk
class BinaryFileReaderCallback(speechsdk.audio.PullAudioInputStreamCallback):
    """Pull-stream callback that serves the raw bytes of a file to the SDK."""
    def __init__(self, filename: str):
        super().__init__()
        self._file_h = open(filename, "rb")
    def read(self, buffer: memoryview) -> int:
        # Fill the SDK-provided buffer and return the number of bytes written.
        print('trying to read {} frames'.format(buffer.nbytes))
        try:
            size = buffer.nbytes
            frames = self._file_h.read(size)
            buffer[:len(frames)] = frames
            print('read {} frames'.format(len(frames)))
            return len(frames)
        except Exception as ex:
            print('Exception in `read`: {}'.format(ex))
            raise
    def close(self) -> None:
        print('closing file')
        try:
            self._file_h.close()
        except Exception as ex:
            print('Exception in `close`: {}'.format(ex))
            raise
def compressed_stream_helper(compressed_format,
                             mp3_file_path,
                             default_speech_auth):
    callback = BinaryFileReaderCallback(mp3_file_path)
    stream = speechsdk.audio.PullAudioInputStream(stream_format=compressed_format, pull_stream_callback=callback)
    speech_config = speechsdk.SpeechConfig(**default_speech_auth)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    done = False
    def stop_cb(evt):
        """callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True
    # Connect callbacks to the events fired by the speech recognizer
    speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # stop continuous recognition on either session stopped or canceled events
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)
    # Start continuous speech recognition
    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)
    speech_recognizer.stop_continuous_recognition()
def pull_audio_input_stream_compressed_mp3(mp3_file_path: str,
                                           default_speech_auth):
    # Create a compressed format
    compressed_format = speechsdk.audio.AudioStreamFormat(compressed_stream_format=speechsdk.AudioStreamContainerFormat.MP3)
    compressed_stream_helper(compressed_format, mp3_file_path, default_speech_auth)
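The example above uses a pull stream. With a push stream, your code reads the compressed bytes and hands them to the stream itself. The helper below is a minimal sketch of that feeding loop; `pump_file` is a hypothetical name, and it only assumes that the `write` argument accepts a `bytes` chunk, which is how `PushAudioInputStream.write` is expected to behave in the Python SDK.

```python
def pump_file(path, write, chunk_size=4096):
    """Read a file in chunks, pass each chunk to write, return bytes sent."""
    total = 0
    with open(path, "rb") as source:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break  # end of file
            write(chunk)
            total += len(chunk)
    return total
```

Under those assumptions, usage would look like `pump_file("audio.mp3", stream.write)` followed by `stream.close()` to signal the end of the audio.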
To configure the Speech SDK to accept compressed audio input, create a PullAudioInputStream or PushAudioInputStream. Then, create an AudioConfig from an instance of your stream class, specifying the compression format of the stream.
In the following example let's assume that your use case is to use PushStream for a compressed file.
package recognizer
import (
    "fmt"
    "strings"
    "time"
    "github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
    "github.com/Microsoft/cognitive-services-speech-sdk-go/samples/helpers"
    "github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
)
func RecognizeOnceFromCompressedFile(subscription string, region string, file string) {
    // Pick the container format from the file name; fall back to ANY.
    var containerFormat audio.AudioStreamContainerFormat
    if strings.Contains(file, ".mulaw") {
        containerFormat = audio.MULAW
    } else if strings.Contains(file, ".alaw") {
        containerFormat = audio.ALAW
    } else if strings.Contains(file, ".mp3") {
        containerFormat = audio.MP3
    } else if strings.Contains(file, ".flac") {
        containerFormat = audio.FLAC
    } else if strings.Contains(file, ".opus") {
        containerFormat = audio.OGGOPUS
    } else {
        containerFormat = audio.ANY
    }
    format, err := audio.GetCompressedFormat(containerFormat)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer format.Close()
    stream, err := audio.CreatePushAudioInputStreamFromFormat(format)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer stream.Close()
    audioConfig, err := audio.NewAudioConfigFromStreamInput(stream)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer audioConfig.Close()
    config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer config.Close()
    speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer speechRecognizer.Close()
    speechRecognizer.SessionStarted(func(event speech.SessionEventArgs) {
        defer event.Close()
        fmt.Println("Session Started (ID=", event.SessionID, ")")
    })
    speechRecognizer.SessionStopped(func(event speech.SessionEventArgs) {
        defer event.Close()
        fmt.Println("Session Stopped (ID=", event.SessionID, ")")
    })
    // Push the compressed file contents into the stream.
    helpers.PumpFileIntoStream(file, stream)
    task := speechRecognizer.RecognizeOnceAsync()
    var outcome speech.SpeechRecognitionOutcome
    select {
    case outcome = <-task:
    case <-time.After(40 * time.Second):
        fmt.Println("Timed out")
        return
    }
    defer outcome.Close()
    if outcome.Error != nil {
        fmt.Println("Got an error: ", outcome.Error)
        // Return here so the result is not read after a failed recognition.
        return
    }
    fmt.Println("Got a recognition!")
    fmt.Println(outcome.Result.Text)
}