Install and run Docker containers for the Speech service APIs
Containers enable you to run some of the Speech service APIs in your own environment, which can help you satisfy specific security and data governance requirements. In this article, you'll learn how to download, install, and run a Speech container.
Speech containers enable customers to build a speech application architecture that is optimized for both robust cloud capabilities and edge locality. There are several containers available, which use the same pricing as the cloud-based Azure Speech Services.
Important
The following Speech containers are now generally available:
- Standard Speech-to-text
- Custom Speech-to-text
- Standard Text-to-speech
- Neural Text-to-speech
The following Speech containers are in gated preview:
- Custom Text-to-speech
- Speech Language Detection
To use the Speech containers, you must submit an online request and have it approved. For more information, see the Request approval to run the container section below.
Container | Features | Latest |
---|---|---|
Speech-to-text | Analyzes sentiment and transcribes continuous real-time speech or batch audio recordings with intermediate results. | 2.11.0 |
Custom Speech-to-text | Using a custom model from the Custom Speech portal, transcribes continuous real-time speech or batch audio recordings into text with intermediate results. | 2.11.0 |
Text-to-speech | Converts text to natural-sounding speech with plain text input or Speech Synthesis Markup Language (SSML). | 1.13.0 |
Custom Text-to-speech | Using a custom model from the Custom Voice portal, converts text to natural-sounding speech with plain text input or Speech Synthesis Markup Language (SSML). | 1.13.0 |
Speech Language Detection | Detect the language spoken in audio files. | 1.0 |
Neural Text-to-speech | Converts text to natural-sounding speech using deep neural network technology, allowing for more natural synthesized speech. | 1.5.0 |
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
You must satisfy the following prerequisites before using Speech containers:
Required | Purpose |
---|---|
Docker Engine | You need the Docker Engine installed on a host computer. Docker provides packages that configure the Docker environment on macOS, Windows, and Linux. For a primer on Docker and container basics, see the Docker overview. Docker must be configured to allow the containers to connect with and send billing data to Azure. On Windows, Docker must also be configured to support Linux containers. |
Familiarity with Docker | You should have a basic understanding of Docker concepts, like registries, repositories, containers, and container images, as well as knowledge of basic docker commands. |
Speech resource | In order to use these containers, you must have an Azure Speech resource to get the associated API key and endpoint URI. Both values are available on the Azure portal's Speech Overview and Keys pages, and both are required to start the container. {API_KEY}: one of the two available resource keys on the Keys page. {ENDPOINT_URI}: the endpoint as provided on the Overview page. |
Gathering required parameters
Three primary parameters are required for all Cognitive Services containers: the end-user license agreement (EULA), which must be present with a value of `accept`, an endpoint URI, and an API key.
Endpoint URI {ENDPOINT_URI}
The endpoint URI value is available on the Azure portal Overview page of the corresponding Cognitive Services resource. Navigate to the Overview page, hover over the endpoint, and a Copy to clipboard icon appears. Copy and use the endpoint where needed.
Keys {API_KEY}
This key is used to start the container, and is available on the Azure portal's Keys page of the corresponding Cognitive Services resource. Navigate to the Keys page, and select the Copy to clipboard icon.
Important
These subscription keys are used to access your Cognitive Service API. Do not share your keys. Store them securely, for example, using Azure Key Vault. We also recommend regenerating these keys regularly. Only one key is necessary to make an API call. When regenerating the first key, you can use the second key for continued access to the service.
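For example, you can list and regenerate these keys with the Azure CLI. A minimal sketch, where the resource name and resource group are placeholder values:
# List the two keys for a Speech resource.
az cognitiveservices account keys list \
  --name my-speech-resource \
  --resource-group my-resource-group

# Regenerate Key1 (for example, after switching clients over to Key2).
az cognitiveservices account keys regenerate \
  --name my-speech-resource \
  --resource-group my-resource-group \
  --key-name Key1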
The host computer
The host is an x64-based computer that runs the Docker container. It can be a computer on your premises or a Docker hosting service in Azure, such as:
- Azure Kubernetes Service.
- Azure Container Instances.
- A Kubernetes cluster deployed to Azure Stack. For more information, see Deploy Kubernetes to Azure Stack.
Advanced Vector Extension support
The host must support Advanced Vector Extensions 2 (AVX2). You can check for AVX2 support on Linux hosts with the following command:
grep -q avx2 /proc/cpuinfo && echo AVX2 supported || echo No AVX2 support detected
Warning
The host computer is required to support AVX2. The container will not function correctly without AVX2 support.
Container requirements and recommendations
The following table describes the minimum and recommended allocation of resources for each Speech container.
Container | Minimum | Recommended |
---|---|---|
Speech-to-text | 2 core, 2-GB memory | 4 core, 4-GB memory |
Custom Speech-to-text | 2 core, 2-GB memory | 4 core, 4-GB memory |
Text-to-speech | 1 core, 2-GB memory | 2 core, 3-GB memory |
Custom Text-to-speech | 1 core, 2-GB memory | 2 core, 3-GB memory |
Speech Language Detection | 1 core, 1-GB memory | 1 core, 1-GB memory |
Neural Text-to-speech | 6 core, 12-GB memory | 8 core, 16-GB memory |
- Each core must be 2.6 gigahertz (GHz) or faster.
Core and memory correspond to the `--cpus` and `--memory` settings, which are used as part of the `docker run` command.
Note
The minimum and recommended allocations are based on Docker limits, not the host machine's resources. For example, speech-to-text containers memory-map portions of a large language model, and it's recommended that the entire file fit in memory, which requires an additional 4-6 GB. Also, the first run of either container may take longer, since models are being paged into memory.
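As a sketch, these allocations map directly to `docker run` flags. The following example uses the recommended values for the Neural Text-to-speech container; the repository path here is an assumption patterned on the speech-to-text path shown later in this article, and the billing values are explained in the Billing section.
docker run --rm -it -p 5000:5000 --memory 16g --cpus 8 \
mcr.microsoft.com/azure-cognitive-services/speechservices/neural-text-to-speech \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY}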
Request approval to run the container
Fill out and submit the request form to request access to the container.
The form requests information about you, your company, and the user scenario for which you'll use the container. After you submit the form, the Azure Cognitive Services team will review it and email you with a decision.
Important
- On the form, you must use an email address associated with an Azure subscription ID.
- The Azure resource you use to run the container must have been created with the approved Azure subscription ID.
- Check your email (both inbox and junk folders) for updates on the status of your application from Microsoft.
After you're approved, you'll be able to run the container after downloading it from the Microsoft Container Registry (MCR), as described later in this article.
You won't be able to run the container if your Azure subscription has not been approved.
Get the container image with docker pull
Container images for Speech are available in the Microsoft Container Registry.
- Speech-to-text
- Custom Speech-to-text
- Text-to-speech
- Neural Text-to-speech
- Custom Text-to-speech
- Speech Language Detection
Container | Repository |
---|---|
Speech-to-text | mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest |
Tip
You can use the docker images command to list your downloaded container images. For example, the following command lists the ID, repository, and tag of each downloaded container image, formatted as a table:
docker images --format "table {{.ID}}\t{{.Repository}}\t{{.Tag}}"
IMAGE ID REPOSITORY TAG
<image-id> <repository-path/name> <tag-name>
Docker pull for the Speech containers
- Speech-to-text
- Custom Speech-to-text
- Text-to-speech
- Neural Text-to-speech
- Custom Text-to-speech
- Speech Language Detection
Docker pull for the Speech-to-text container
Use the docker pull command to download a container image from the Microsoft Container Registry.
docker pull mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest
Important
The `latest` tag pulls the `en-US` locale. For additional locales, see Speech-to-text locales.
Speech-to-text locales
All tags, except for `latest`, are in the following format and are case-sensitive:
<major>.<minor>.<patch>-<platform>-<locale>-<prerelease>
The following tag is an example of the format:
2.6.0-amd64-en-us
For all of the supported locales of the speech-to-text container, see Speech-to-text image tags.
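For example, a sketch of pulling the specific locale tag shown above (assuming that tag is still published in the registry):
docker pull mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:2.6.0-amd64-en-us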
How to use the container
Once the container is on the host computer, use the following process to work with the container.
- Run the container, with the required billing settings. More examples of the `docker run` command are available.
- Query the container's prediction endpoint.
Run the container with docker run
Use the docker run command to run the container. Refer to gathering required parameters for details on how to get the `{ENDPOINT_URI}` and `{API_KEY}` values. Additional examples of the `docker run` command are also available.
- Speech-to-text
- Custom Speech-to-text
- Text-to-speech
- Neural Text-to-speech
- Custom Text-to-speech
- Speech Language Detection
To run the Standard Speech-to-text container, execute the following `docker run` command.
docker run --rm -it -p 5000:5000 --memory 4g --cpus 4 \
mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY}
This command:
- Runs a Speech-to-text container from the container image.
- Allocates 4 CPU cores and 4 gigabytes (GB) of memory.
- Exposes TCP port 5000 and allocates a pseudo-TTY for the container.
- Automatically removes the container after it exits. The container image is still available on the host computer.
Note
Containers support compressed audio input to the Speech SDK by using GStreamer. To install GStreamer in a container, follow the Linux instructions for GStreamer in Use codec compressed audio input with the Speech SDK.
Diarization on the speech-to-text output
Diarization is enabled by default. To get diarization in your response, use `diarize_speech_config.set_service_property`:
- Set the phrase output format to `Detailed`.
- Set the mode of diarization. The supported modes are `Identity` and `Anonymous`.
import azure.cognitiveservices.speech as speechsdk

# Point the Speech SDK at the container's websocket endpoint.
diarize_speech_config = speechsdk.SpeechConfig(host="ws://localhost:5000")

# Request the detailed phrase output format.
diarize_speech_config.set_service_property(
    name='speechcontext-PhraseOutput.Format',
    value='Detailed',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)

# Set the diarization mode (Identity or Anonymous).
diarize_speech_config.set_service_property(
    name='speechcontext-phraseDetection.speakerDiarization.mode',
    value='Identity',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)
Note
"Identity" mode returns "SpeakerId": "Customer"
or "SpeakerId": "Agent"
.
"Anonymous" mode returns "SpeakerId": "Speaker 1"
or "SpeakerId": "Speaker 2"
Analyze sentiment on the speech-to-text output
Starting in v2.6.0 of the speech-to-text container, you should use the Text Analytics v3.0 API endpoint instead of the preview one. For example:
https://westus2.api.cognitive.microsoft.com/text/analytics/v3.0/sentiment
https://localhost:5000/text/analytics/v3.0/sentiment
Note
The Text Analytics `v3.0` API is not backward compatible with Text Analytics `v3.0-preview.1`. To get the latest sentiment feature support, use `v2.6.0` of the speech-to-text container image and Text Analytics `v3.0`.
Starting in v2.2.0 of the speech-to-text container, you can call the sentiment analysis v3 API on the output. To call sentiment analysis, you will need a Text Analytics API resource endpoint. For example:
https://westus2.api.cognitive.microsoft.com/text/analytics/v3.0-preview.1/sentiment
https://localhost:5000/text/analytics/v3.0-preview.1/sentiment
If you're accessing a Text Analytics endpoint in the cloud, you'll need a key. If you're running Text Analytics locally, you may not need to provide one.
The key and endpoint are passed to the Speech container as arguments, as in the following example.
docker run -it --rm -p 5000:5000 \
mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY} \
CloudAI:SentimentAnalysisSettings:TextAnalyticsHost={TEXT_ANALYTICS_HOST} \
CloudAI:SentimentAnalysisSettings:SentimentAnalysisApiKey={SENTIMENT_APIKEY}
This command:
- Performs the same steps as the command above.
- Stores a Text Analytics API endpoint and key, for sending sentiment analysis requests.
Phraselist v2 on the speech-to-text output
Starting in v2.6.0 of the speech-to-text container, you can get the output with your own phrases, either as the whole sentence or as phrases in the middle. For example, the phrase the tall man in the following sentence:
- "This is a sentence the tall man this is another sentence."
To configure a phrase list, you need to add your own phrases when you make the call. For example:
phrase="the tall man"
recognizer = speechsdk.SpeechRecognizer(
speech_config=dict_speech_config,
audio_config=audio_config)
phrase_list_grammer = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
phrase_list_grammer.addPhrase(phrase)
dict_speech_config.set_service_property(
name='setflight',
value='xonlineinterp',
channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)
If you have multiple phrases to add, call `addPhrase()` for each phrase to add it to the phrase list, as shown in the sketch below.
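For example, a brief sketch with placeholder phrases:
# Add each phrase in a list to the recognizer's phrase list grammar.
for phrase in ["the tall man", "this is another sentence"]:
    phrase_list_grammar.addPhrase(phrase)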
Important
The `Eula`, `Billing`, and `ApiKey` options must be specified to run the container; otherwise, the container won't start. For more information, see Billing.
Query the container's prediction endpoint
Note
Use a unique port number if you're running multiple containers.
Containers | SDK Host URL | Protocol |
---|---|---|
Standard Speech-to-text and Custom Speech-to-text | `ws://localhost:5000` | WS |
Text-to-speech (including Standard, Custom, and Neural), Speech Language Detection | `http://localhost:5000` | HTTP |
For more information on using WSS and HTTPS protocols, see container security.
Speech-to-text (Standard and Custom)
The container provides websocket-based query endpoint APIs that are accessed through the Speech SDK. By default, the Speech SDK uses online speech services. To use the container, you need to change the initialization method.
Tip
When using the Speech SDK with containers, you do not need to provide the Azure Speech resource subscription key or an authentication bearer token.
See the examples below.
Change from using this Azure-cloud initialization call:
var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
To using this call with the container host:
var config = SpeechConfig.FromHost(new Uri("ws://localhost:5000"));
Analyze sentiment
If you provided your Text Analytics API credentials to the container, you can use the Speech SDK to send speech recognition requests with sentiment analysis. You can configure the API responses to use either a simple or detailed format.
Note
v1.13 of the Speech Service Python SDK has an identified issue with sentiment analysis. Please use v1.12.x or earlier if you're using sentiment analysis in the Speech Service Python SDK.
To configure the Speech client to use a simple format, add `"Sentiment"` as a value for `Simple.Extensions`. If you want to choose a specific Text Analytics model version, replace `'latest'` in the `speechcontext-phraseDetection.sentimentAnalysis.modelversion` property configuration.
# Return sentiment in the simple output format.
speech_config.set_service_property(
    name='speechcontext-PhraseOutput.Simple.Extensions',
    value='["Sentiment"]',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)

# Pin the Text Analytics model version, or keep 'latest'.
speech_config.set_service_property(
    name='speechcontext-phraseDetection.sentimentAnalysis.modelversion',
    value='latest',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)
`Simple.Extensions` returns the sentiment result in the root layer of the response:
{
"DisplayText":"What's the weather like?",
"Duration":13000000,
"Id":"6098574b79434bd4849fee7e0a50f22e",
"Offset":4700000,
"RecognitionStatus":"Success",
"Sentiment":{
"Negative":0.03,
"Neutral":0.79,
"Positive":0.18
}
}
If you want to completely disable sentiment analysis, add a `false` value to `sentimentanalysis.enabled`:
speech_config.set_service_property(
name='speechcontext-phraseDetection.sentimentanalysis.enabled',
value='false',
channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)
Text-to-speech (Standard, Neural and Custom)
The container provides REST-based endpoint APIs. There are many sample source code projects for platform, framework, and language variations available.
With the Standard or Neural Text-to-speech containers, you should rely on the locale and voice of the image tag you downloaded. For example, if you downloaded the `latest` tag, the default locale is `en-US` and the voice is `AriaNeural`. The `{VOICE_NAME}` argument would then be `en-US-AriaNeural`. See the following example SSML:
<speak version="1.0" xml:lang="en-US">
<voice name="en-US-AriaNeural">
This text will get converted into synthesized speech.
</voice>
</speak>
However, for Custom Text-to-speech you'll need to obtain the Voice / model from the Custom Voice portal. The custom model name is synonymous with the voice name. Navigate to the Training page, and copy the Voice / model to use as the `{VOICE_NAME}` argument. See the following example SSML:
<speak version="1.0" xml:lang="en-US">
<voice name="custom-voice-model">
This text will get converted into synthesized speech.
</voice>
</speak>
Let's construct an HTTP POST request, providing a few headers and a data payload. Replace the `{VOICE_NAME}` placeholder with your own value.
curl -s -v -X POST http://localhost:5000/speech/synthesize/cognitiveservices/v1 \
-H 'Accept: audio/*' \
-H 'Content-Type: application/ssml+xml' \
-H 'X-Microsoft-OutputFormat: riff-24khz-16bit-mono-pcm' \
-d '<speak version="1.0" xml:lang="en-US"><voice name="{VOICE_NAME}">This is a test, only a test.</voice></speak>' > output.wav
This command:
- Constructs an HTTP POST request for the `speech/synthesize/cognitiveservices/v1` endpoint.
- Specifies an `Accept` header of `audio/*`.
- Specifies a `Content-Type` header of `application/ssml+xml`. For more information, see request body.
- Specifies an `X-Microsoft-OutputFormat` header of `riff-24khz-16bit-mono-pcm`. For more options, see audio output.
- Sends the Speech Synthesis Markup Language (SSML) request given the `{VOICE_NAME}` to the endpoint.
Run multiple containers on the same host
If you intend to run multiple containers with exposed ports, make sure to run each container with a different exposed port. For example, run the first container on port 5000 and the second container on port 5001.
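For example, a sketch of running a second Speech-to-text container that maps host port 5001 to the container's port 5000 (billing values as before):
docker run --rm -it -p 5001:5000 --memory 4g --cpus 4 \
mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY}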
You can have this container and a different Azure Cognitive Services container running on the host together. You can also have multiple instances of the same Cognitive Services container running.
Validate that a container is running
There are several ways to validate that the container is running. Locate the external IP address and exposed port of the container in question, and open your favorite web browser. Use the request URLs below to validate that the container is running. The example request URLs listed below use `http://localhost:5000`, but your specific container may vary. Keep in mind that you should use your container's external IP address and exposed port.
Request URL | Purpose |
---|---|
`http://localhost:5000/` | The container provides a home page. |
`http://localhost:5000/ready` | Requested with GET, this provides a verification that the container is ready to accept a query against the model. This request can be used for Kubernetes liveness and readiness probes. |
`http://localhost:5000/status` | Also requested with GET, this verifies if the api-key used to start the container is valid without causing an endpoint query. This request can be used for Kubernetes liveness and readiness probes. |
`http://localhost:5000/swagger` | The container provides a full set of documentation for the endpoints and a Try it out feature. With this feature, you can enter your settings into a web-based HTML form and make the query without having to write any code. After the query returns, an example CURL command is provided to demonstrate the required HTTP headers and body format. |
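As a quick sketch, you could also check readiness from the command line, assuming the container is mapped to port 5000 on localhost:
curl -s http://localhost:5000/ready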
Stop the container
To shut down the container, in the command-line environment where the container is running, select Ctrl+C.
Troubleshooting
When starting or running the container, you may experience issues. Use an output mount and enable logging. Doing so allows the container to generate log files that are helpful when troubleshooting.
Tip
For more troubleshooting information and guidance, see Cognitive Services containers frequently asked questions (FAQ).
Billing
The Speech containers send billing information to Azure, using a Speech resource on your Azure account.
Queries to the container are billed at the pricing tier of the Azure resource that's used for the `ApiKey`.
Azure Cognitive Services containers aren't licensed to run without being connected to the metering / billing endpoint. You must enable the containers to communicate billing information with the billing endpoint at all times. Cognitive Services containers don't send customer data, such as the image or text that's being analyzed, to Microsoft.
Connect to Azure
The container needs the billing argument values to run. These values allow the container to connect to the billing endpoint. The container reports usage about every 10 to 15 minutes. If the container doesn't connect to Azure within the allowed time window, the container continues to run but doesn't serve queries until the billing endpoint is restored. The connection is attempted 10 times at the same time interval of 10 to 15 minutes. If it can't connect to the billing endpoint within the 10 tries, the container stops serving requests. See the Cognitive Services container FAQ for an example of the information sent to Microsoft for billing.
Billing arguments
The `docker run` command will start the container when all three of the following options are provided with valid values:
Option | Description |
---|---|
`ApiKey` | The API key of the Cognitive Services resource that's used to track billing information. The value of this option must be set to an API key for the provisioned resource that's specified in `Billing`. |
`Billing` | The endpoint of the Cognitive Services resource that's used to track billing information. The value of this option must be set to the endpoint URI of a provisioned Azure resource. |
`Eula` | Indicates that you accepted the license for the container. The value of this option must be set to `accept`. |
For more information about these options, see Configure containers.
Summary
In this article, you learned concepts and workflow for downloading, installing, and running Speech containers. In summary:
- Speech provides six Linux containers for Docker, encapsulating various capabilities:
- Speech-to-text
- Custom Speech-to-text
- Text-to-speech
- Custom Text-to-speech
- Neural Text-to-speech
- Speech Language Detection
- Container images are downloaded from the container registry in Azure.
- Container images run in Docker.
- Whether you use the REST API (Text-to-speech only) or the SDK (Speech-to-text or Text-to-speech), you specify the host URI of the container.
- You're required to provide billing information when instantiating a container.
Important
Cognitive Services containers are not licensed to run without being connected to Azure for metering. Customers need to enable the containers to communicate billing information with the metering service at all times. Cognitive Services containers do not send customer data (e.g., the image or text that is being analyzed) to Microsoft.
Next steps
- Review configure containers for configuration settings
- Learn how to use Speech service containers with Kubernetes and Helm
- Use more Cognitive Services containers