Install and run Speech Service containers

Speech containers enable customers to build one speech application architecture that is optimized to take advantage of both robust cloud capabilities and edge locality.

The two speech containers are speech-to-text and text-to-speech.

Speech-to-text
  • Features: Transcribes continuous real-time speech or batch audio recordings into text with intermediate results.
  • Latest version: 1.2.0
Text-to-speech
  • Features: Converts text to natural-sounding speech, with plain text input or Speech Synthesis Markup Language (SSML).
  • Latest version: 1.2.0

    If you don't have an Azure subscription, create a free account before you begin.

    Prerequisites

    You must meet the following prerequisites before using Speech containers:

    Docker Engine
      • You need the Docker Engine installed on a host computer. Docker provides packages that configure the Docker environment on macOS, Windows, and Linux. For a primer on Docker and container basics, see the Docker overview.
      • Docker must be configured to allow the containers to connect with and send billing data to Azure.
      • On Windows, Docker must also be configured to support Linux containers.

    Familiarity with Docker
      • You should have a basic understanding of Docker concepts, like registries, repositories, containers, and container images, as well as knowledge of basic docker commands.

    Speech resource
      • To use these containers, you must have an Azure Speech resource to get the associated API key and endpoint URI. Both values are available on the Azure portal's Speech Overview and Keys pages, and both are required to start the container.
      • {API_KEY}: One of the two available resource keys on the Keys page
      • {ENDPOINT_URI}: The endpoint as provided on the Overview page

    Request access to the container registry

    You must first complete and submit the Cognitive Services Speech Containers Request form to request access to the container.

    The form requests information about you, your company, and the user scenario for which you'll use the container. After you've submitted the form, the Azure Cognitive Services team reviews it to ensure that you meet the criteria for access to the private container registry.

    Important

    You must use an email address that's associated with either a Microsoft Account (MSA) or Azure Active Directory (Azure AD) account in the form.

    If your request is approved, you'll receive an email with instructions that describe how to obtain your credentials and access the private container registry.

    Use the Docker CLI to authenticate the private container registry

    You can authenticate with the private container registry for Cognitive Services Containers in any of several ways, but the recommended method from the command line is to use the Docker CLI.

    Use the docker login command, as shown in the following example, to log in to containerpreview.azurecr.io, the private container registry for Cognitive Services Containers. Replace <username> with the user name and <password> with the password that's provided in the credentials you received from the Azure Cognitive Services team.

    docker login containerpreview.azurecr.io -u <username> -p <password>
    

    If you've secured your credentials in a text file, you can pipe the contents of that file to the docker login command by using the cat command, as shown in the following example. Replace <passwordFile> with the path and name of the text file that contains the password and <username> with the user name that's provided in your credentials.

    cat <passwordFile> | docker login containerpreview.azurecr.io -u <username> --password-stdin
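
The same login can be scripted. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes the docker CLI is on the PATH. It builds the docker login command shown above and feeds the password file to it on standard input, so the password never appears on the command line.

```python
import subprocess
from pathlib import Path

# Registry name taken from this article.
REGISTRY = "containerpreview.azurecr.io"


def build_login_command(username: str, registry: str = REGISTRY) -> list[str]:
    """Return the docker login argv that reads the password from stdin."""
    return ["docker", "login", registry, "-u", username, "--password-stdin"]


def docker_login(username: str, password_file: str) -> int:
    """Send the password file to docker login on stdin; return the exit code."""
    password = Path(password_file).read_bytes()
    completed = subprocess.run(
        build_login_command(username), input=password, check=False
    )
    return completed.returncode
```

A call such as docker_login("<username>", "<passwordFile>") returns 0 when the login succeeds.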
    

    The host computer

    The host is an x64-based computer that runs the Docker container. It can be a computer on your premises or a Docker hosting service in Azure.

    Advanced Vector Extension support

    The host must support Advanced Vector Extensions 2 (AVX2). You can check this support on Linux hosts with the following command:

    grep -q avx2 /proc/cpuinfo && echo AVX2 supported || echo No AVX2 support detected
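
If you prefer to run this check from a deployment script, the following Python sketch does roughly the same thing. The function names are hypothetical. Unlike the grep command, it matches avx2 only as a whole token on a flags line, which is an assumption about how x64 Linux hosts lay out /proc/cpuinfo.

```python
from pathlib import Path


def cpuinfo_has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in the cpuinfo text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def host_supports_avx2(path: str = "/proc/cpuinfo") -> bool:
    """Read the Linux cpuinfo file and check it for the avx2 flag."""
    try:
        return cpuinfo_has_avx2(Path(path).read_text())
    except OSError:
        return False  # File missing, for example on a non-Linux host.
```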
    

    Container requirements and recommendations

    The following table describes the minimum and recommended CPU cores and memory to allocate for each Speech container.

    Container                            Minimum                  Recommended
    cognitive-services-speech-to-text    2 cores, 2-GB memory     4 cores, 4-GB memory
    cognitive-services-text-to-speech    1 core, 0.5-GB memory    2 cores, 1-GB memory

    • Each core must run at 2.6 gigahertz (GHz) or faster.

    Core and memory correspond to the --cpus and --memory settings, which are used as part of the docker run command.

    Note: The minimum and recommended values are based on Docker limits, not the host machine resources. For example, speech-to-text containers memory-map portions of a large language model, and it is recommended that the entire file fit in memory, which requires an additional 4-6 GB. Also, the first run of either container might take longer because models are being paged into memory.
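
As an illustration of how these values map onto docker run settings, the following Python sketch stores the table in a dictionary and returns the matching --cpus and --memory arguments. The dictionary and function names are hypothetical, and 0.5 GB is written as 512m on the assumption that Docker's binary memory units apply.

```python
# Values copied from the table above: (cores, memory) per container and level.
RESOURCE_LIMITS = {
    "cognitive-services-speech-to-text": {
        "minimum": (2, "2g"),
        "recommended": (4, "4g"),
    },
    "cognitive-services-text-to-speech": {
        "minimum": (1, "512m"),
        "recommended": (2, "1g"),
    },
}


def resource_flags(container: str, level: str = "recommended") -> list[str]:
    """Return the --cpus and --memory arguments for a container and level."""
    cpus, memory = RESOURCE_LIMITS[container][level]
    return ["--cpus", str(cpus), "--memory", memory]
```

These limits apply to the container only. The additional memory for the memory-mapped model comes from the host, so it is not part of the returned flags.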

    Get the container image with docker pull

    Container images for Speech are available in the following repositories of the private container registry.

    Container                            Repository
    cognitive-services-speech-to-text    containerpreview.azurecr.io/microsoft/cognitive-services-speech-to-text:latest
    cognitive-services-text-to-speech    containerpreview.azurecr.io/microsoft/cognitive-services-text-to-speech:latest

    Tip

    You can use the docker images command to list your downloaded container images. For example, the following command lists the ID, repository, and tag of each downloaded container image, formatted as a table:

    docker images --format "table {{.ID}}\t{{.Repository}}\t{{.Tag}}"
    
    IMAGE ID         REPOSITORY                TAG
    <image-id>       <repository-path/name>    <tag-name>
    

    Language locale is in container tag

    The latest tag pulls the en-us locale and jessarus voice.

    Speech to text locales

    All tags, except for latest, are in the following format, where <culture> indicates the locale of the container:

    <major>.<minor>.<patch>-<platform>-<culture>-<prerelease>
    

    The following tag is an example of the format:

    1.2.0-amd64-en-us-preview
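
To make the format concrete, the following Python sketch splits a speech-to-text tag into its parts. The class and function names are hypothetical. The pattern assumes that the culture is always two hyphen-joined two-letter parts, such as en-us, which holds for the locales listed in this article.

```python
import re
from typing import NamedTuple


class SpeechToTextTag(NamedTuple):
    version: str
    platform: str
    culture: str
    prerelease: str


# <major>.<minor>.<patch>-<platform>-<culture>-<prerelease>
_TAG_PATTERN = re.compile(
    r"^(?P<version>\d+\.\d+\.\d+)"
    r"-(?P<platform>[a-z0-9]+)"
    r"-(?P<culture>[a-z]{2}-[a-z]{2})"
    r"-(?P<prerelease>[a-z0-9.]+)$"
)


def parse_speech_to_text_tag(tag: str) -> SpeechToTextTag:
    """Split a speech-to-text tag into version, platform, culture, prerelease."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized speech-to-text tag: {tag!r}")
    return SpeechToTextTag(**match.groupdict())
```

The latest tag does not follow this format, so the function rejects it with a ValueError.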
    

    The following table lists the supported locales for speech-to-text in the 1.2.0 version of the container:

    Language      Locale tags
    Chinese       zh-cn
    English       en-us, en-gb, en-au, en-in
    French        fr-ca, fr-fr
    German        de-de
    Italian       it-it
    Japanese      ja-jp
    Korean        ko-kr
    Portuguese    pt-br
    Spanish       es-es, es-mx

    Text to speech locales

    All tags, except for latest, are in the following format, where <culture> indicates the locale and <voice> indicates the voice of the container:

    <major>.<minor>.<patch>-<platform>-<culture>-<voice>-<prerelease>
    

    The following tag is an example of the format:

    1.2.0-amd64-en-us-jessarus-preview
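
Going the other way, the following Python sketch assembles a text-to-speech tag and a full image reference from a locale and a voice. The function names are hypothetical, and the default version, platform, and prerelease values are taken from the example tag above rather than from any published list of tags.

```python
# Repository path taken from this article.
TEXT_TO_SPEECH_REPOSITORY = (
    "containerpreview.azurecr.io/microsoft/cognitive-services-text-to-speech"
)


def text_to_speech_tag(
    culture: str,
    voice: str,
    version: str = "1.2.0",
    platform: str = "amd64",
    prerelease: str = "preview",
) -> str:
    """Build <major>.<minor>.<patch>-<platform>-<culture>-<voice>-<prerelease>."""
    return f"{version}-{platform}-{culture.lower()}-{voice.lower()}-{prerelease}"


def text_to_speech_image(culture: str, voice: str, **tag_parts: str) -> str:
    """Return the repository path followed by the tag for a locale and voice."""
    tag = text_to_speech_tag(culture, voice, **tag_parts)
    return f"{TEXT_TO_SPEECH_REPOSITORY}:{tag}"
```

Whether a particular combination exists in the registry still depends on the supported locales and voices for the container version.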
    

    The following table lists the supported locales for text-to-speech in the 1.2.0 version of the container:

    Language      Locale tag    Supported voices
    Chinese       zh-cn         huihuirus, kangkang-apollo, yaoyao-apollo
    English       en-au         catherine, hayleyrus
    English       en-gb         george-apollo, hazelrus, susan-apollo
    English       en-in         heera-apollo, priyarus, ravi-apollo
    English       en-us         jessarus, benjaminrus, jessa24krus, zirarus, guy24krus
    French        fr-ca         caroline, harmonierus
    French        fr-fr         hortenserus, julie-apollo, paul-apollo
    German        de-de         hedda, heddarus, stefan-apollo
    Italian       it-it         cosimo-apollo, luciarus
    Japanese      ja-jp         ayumi-apollo, harukarus, ichiro-apollo
    Korean        ko-kr         heamirus
    Portuguese    pt-br         daniel-apollo, heloisarus
    Spanish       es-es         elenarus, laura-apollo, pablo-apollo
    Spanish       es-mx         hildarus, raul-apollo

    Docker pull for the speech containers

    Speech-to-text

    docker pull containerpreview.azurecr.io/microsoft/cognitive-services-speech-to-text:latest
    

    Text-to-speech

    docker pull containerpreview.azurecr.io/microsoft/cognitive-services-text-to-speech:latest
    

    How to use the container

    Once the container is on the host computer, use the following process to work with the container.

    1. Run the container, with the required billing settings. More examples of the docker run command are available.
    2. Query the container's prediction endpoint.

    Run the container with docker run

    Use the docker run command to run either of the two containers. The command uses the following parameters:

    During the preview, the billing settings must be valid to start the container, but you aren't billed for usage.

    Placeholder       Value
    {API_KEY}         This key is used to start the container, and is available on the Azure portal's Speech Keys page.
    {ENDPOINT_URI}    The billing endpoint URI value is available on the Azure portal's Speech Overview page.

    Replace these parameters with your own values in the following example docker run command.

    Text-to-speech

    docker run --rm -it -p 5000:5000 --memory 2g --cpus 1 \
    containerpreview.azurecr.io/microsoft/cognitive-services-text-to-speech \
    Eula=accept \
    Billing={ENDPOINT_URI} \
    ApiKey={API_KEY}
    

    Speech-to-text

    docker run --rm -it -p 5000:5000 --memory 2g --cpus 2 \
    containerpreview.azurecr.io/microsoft/cognitive-services-speech-to-text \
    Eula=accept \
    Billing={ENDPOINT_URI} \
    ApiKey={API_KEY}
    

    Each command:

    • Runs a Speech container from the container image
    • Allocates CPU cores and memory: 1 core and 2 gigabytes (GB) for text-to-speech, 2 cores and 2 GB for speech-to-text
    • Exposes TCP port 5000 and allocates a pseudo-TTY for the container
    • Automatically removes the container after it exits. The container image is still available on the host computer.
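
When the container is started from a script, it can help to assemble the command as an argument list instead of a single string. The following Python sketch mirrors the speech-to-text command above. The function name is hypothetical, and the placeholder values must be replaced with your own before the command can start a container.

```python
import shlex


def build_run_command(
    image: str,
    endpoint_uri: str,
    api_key: str,
    cpus: int = 2,
    memory: str = "2g",
    port: int = 5000,
) -> list[str]:
    """Return the docker run argv with the three required billing options."""
    return [
        "docker", "run", "--rm", "-it",
        "-p", f"{port}:5000",        # host port mapped to container port 5000
        "--memory", memory,
        "--cpus", str(cpus),
        image,
        "Eula=accept",
        f"Billing={endpoint_uri}",
        f"ApiKey={api_key}",
    ]


command = build_run_command(
    "containerpreview.azurecr.io/microsoft/cognitive-services-speech-to-text",
    endpoint_uri="{ENDPOINT_URI}",
    api_key="{API_KEY}",
)
print(shlex.join(command))  # Shell-quoted form, useful for logging.
```

Passing the list to subprocess.run would start the container. Because of the -it options, that call needs an interactive terminal.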

    Important

    The Eula, Billing, and ApiKey options must be specified to run the container; otherwise, the container won't start. For more information, see Billing.

    Query the container's prediction endpoint

    Container Endpoint
    Speech-to-text ws://localhost:5000/speech/recognition/dictation/cognitiveservices/v1
    Text-to-speech http://localhost:5000/speech/synthesize/cognitiveservices/v1

    Speech-to-text

    The container provides WebSocket-based query endpoint APIs that are accessed through the Speech SDK.

    By default, the Speech SDK uses online speech services. To use the container, you need to change the initialization method. See the examples below.

    For C#

    Change from using this Azure-cloud initialization call:

    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    

    to this call using the container endpoint:

    var config = SpeechConfig.FromEndpoint(
        new Uri("ws://localhost:5000/speech/recognition/dictation/cognitiveservices/v1"),
        "YourSubscriptionKey");
    

    For Python

    Change from using this Azure-cloud initialization call:

    speech_config = speechsdk.SpeechConfig(
        subscription=speech_key, region=service_region)
    

    to this call using the container endpoint:

    speech_config = speechsdk.SpeechConfig(
        subscription=speech_key, endpoint="ws://localhost:5000/speech/recognition/dictation/cognitiveservices/v1")
    

    Text-to-speech

    The container provides REST endpoint APIs that can be found here and samples can be found here.
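
As a rough illustration of calling the text-to-speech endpoint listed in the table above, the following Python sketch builds an HTTP POST request with an SSML body by using only the standard library. The function names are hypothetical. The headers and the output format value are assumptions based on the cloud Speech REST API, so check them against the container's own API documentation before relying on them.

```python
import urllib.request
from xml.sax.saxutils import escape

# Endpoint taken from the table in this article.
TTS_ENDPOINT = "http://localhost:5000/speech/synthesize/cognitiveservices/v1"


def build_ssml(text: str, voice: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document for one voice."""
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )


def build_synthesis_request(
    text: str, voice: str, endpoint: str = TTS_ENDPOINT
) -> urllib.request.Request:
    """Build (but don't send) a POST request that carries the SSML body."""
    return urllib.request.Request(
        endpoint,
        data=build_ssml(text, voice).encode("utf-8"),
        headers={
            # Assumed header values; see the lead-in above.
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        },
        method="POST",
    )
```

Sending the request with urllib.request.urlopen and reading the response body would return the synthesized audio, provided the container is running and the voice name matches the voice in the container image.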

    Validate that a container is running

    There are several ways to validate that the container is running. Locate the external IP address and exposed port of the container in question, and open your favorite web browser. Use the request URLs below to validate that the container is running. The example request URLs use http://localhost:5000, but your container's address might differ, so rely on your container's external IP address and exposed port.

    Request URL Purpose
    http://localhost:5000/ The container provides a home page.
    http://localhost:5000/status Requested with an HTTP GET, to validate that the container is running without causing an endpoint query. This request can be used for Kubernetes liveness and readiness probes.
    http://localhost:5000/swagger The container provides a full set of documentation for the endpoints and a Try it out feature. With this feature, you can enter your settings into a web-based HTML form and make the query without having to write any code. After the query returns, an example CURL command is provided to demonstrate the HTTP headers and body format that's required.
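
The status URL can also be checked from code, for example in a startup script that waits for the container. The following Python sketch is a minimal example with hypothetical function names. It assumes that a running container answers the GET request with HTTP status 200, and it accepts a replaceable opener so the check can be exercised without a live container.

```python
import urllib.request


def status_url(base_url: str) -> str:
    """Return the /status URL for a container base URL."""
    return base_url.rstrip("/") + "/status"


def container_is_running(
    base_url: str, timeout: float = 5.0, opener=urllib.request.urlopen
) -> bool:
    """Send an HTTP GET to /status; True only if the response status is 200."""
    try:
        with opener(status_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and HTTPError are subclasses of OSError, so this
        # covers refused connections, timeouts, and HTTP error responses.
        return False
```

For example, container_is_running("http://localhost:5000") returns False while nothing is listening on the port.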

    Stop the container

    To shut down the container, press Ctrl+C in the command-line environment where the container is running.

    Troubleshooting

    When you run the container, the container uses stdout and stderr to output information that is helpful to troubleshoot issues that happen while starting or running the container.

    Billing

    The Speech containers send billing information to Azure, using a Speech resource on your Azure account.

    Queries to the container are billed at the pricing tier of the Azure resource that's used for the <ApiKey>.

    Azure Cognitive Services containers aren't licensed to run without being connected to the billing endpoint for metering. You must enable the containers to communicate billing information with the billing endpoint at all times. Cognitive Services containers don't send customer data, such as the image or text that's being analyzed, to Microsoft.

    Connect to Azure

    The container needs the billing argument values to run. These values allow the container to connect to the billing endpoint. The container reports usage about every 10 to 15 minutes. If the container doesn't connect to Azure within the allowed time window, the container continues to run but doesn't serve queries until the billing endpoint is restored. The connection is attempted 10 times at the same time interval of 10 to 15 minutes. If it can't connect to the billing endpoint within the 10 tries, the container stops running.

    Billing arguments

    For the docker run command to start the container, all three of the following options must be specified with valid values:

    Option Description
    ApiKey The API key of the Cognitive Services resource that's used to track billing information.
    The value of this option must be set to an API key for the provisioned resource that's specified in Billing.
    Billing The endpoint of the Cognitive Services resource that's used to track billing information.
    The value of this option must be set to the endpoint URI of a provisioned Azure resource.
    Eula Indicates that you accepted the license for the container.
    The value of this option must be set to accept.
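
A script that starts the container can check these options before it calls docker run. The following Python sketch, with a hypothetical function name, checks only what can be verified locally: that Eula is set to accept, that Billing looks like an HTTP or HTTPS URI, and that ApiKey is not empty. It can't tell whether the key is valid for the resource.

```python
from urllib.parse import urlparse


def validate_billing_options(options: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if options.get("Eula") != "accept":
        problems.append("Eula must be set to accept")
    billing = urlparse(options.get("Billing") or "")
    if billing.scheme not in ("http", "https") or not billing.netloc:
        problems.append("Billing must be the endpoint URI of a provisioned Azure resource")
    if not options.get("ApiKey"):
        problems.append("ApiKey must be set to an API key for the resource in Billing")
    return problems
```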

    For more information about these options, see Configure containers.

    Developer samples

    Developer samples are available at our GitHub repository.

    View webinar

    Join the webinar to learn about:

    • How to deploy Cognitive Services to any machine using Docker
    • How to deploy Cognitive Services to AKS

    Summary

    In this article, you learned concepts and workflow for downloading, installing, and running Speech containers. In summary:

    • Speech provides two Linux containers for Docker, encapsulating speech-to-text and text-to-speech.
    • Container images are downloaded from the private container registry in Azure.
    • Container images run in Docker.
    • You can use either the REST API or SDK to call operations in Speech containers by specifying the host URI of the container.
    • You're required to provide billing information when instantiating a container.

    Important

    Cognitive Services containers are not licensed to run without being connected to Azure for metering. Customers need to enable the containers to communicate billing information with the metering service at all times. Cognitive Services containers do not send customer data (e.g., the image or text that is being analyzed) to Microsoft.

    Next steps