Speech service containers frequently asked questions (FAQ)

When using the Speech service with containers, refer to these frequently asked questions before contacting support. This article captures questions of varying degrees, from general to technical.

General questions

How do Speech containers work and how do I set them up?

For an overview, see Install and run Speech containers. When setting up the production cluster, there are several things to consider. First, setting up a single language with multiple containers on the same machine shouldn't be a large issue. If you're experiencing problems, it might be hardware-related, so first look at the resources, that is, the CPU and memory specifications.

Consider, for a moment, the ja-JP container and the latest model. The acoustic model is the most demanding piece CPU-wise, while the language model demands the most memory. In our benchmarks, it takes about 0.6 CPU cores to process a single speech to text request when audio flows in at real-time, for example, from a microphone. If you're feeding audio faster than real-time, for example, from a file, that usage can double (1.2 cores). Meanwhile, the memory in this context is operating memory for decoding speech. It doesn't take into account the actual full size of the language model, which resides in the file cache. For ja-JP that's an extra 2 GB; for en-US, it might be more (6-7 GB).

If you have a machine where memory is scarce, and you're trying to deploy multiple languages on it, it's possible that file cache is full, and the OS is forced to page models in and out. For a running transcription that could be disastrous and might lead to slowdowns and other performance implications.
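
For example, based on the figures above, hosting the en-US and ja-JP language models on the same machine would want roughly 6-7 GB + 2 GB, so about 8-9 GB of file cache for the language models alone, on top of the decoding memory used per request.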

Furthermore, we prepackage executables for machines with the Advanced Vector Extensions 2 (AVX2) instruction set. A machine with the AVX512 instruction set requires code generation for that target, and starting 10 containers for 10 languages might temporarily exhaust the CPU. A message like this one appears in the docker logs:

2020-01-16 16:46:54.981118943 
[W:onnxruntime:Default, tvm_utils.cc:276 LoadTVMPackedFuncFromCache]
Can't find Scan4_llvm__mcpu_skylake_avx512 in cache, using JIT...

You can set the number of decoders you want inside a single container by using the DECODER_MAX_COUNT variable. Start with your CPU and memory SKU, and then refer to the recommended host machine resource specifications.

Could you help with capacity planning and cost estimation of on-premises Speech to text containers?

For container capacity in batch processing mode, each decoder could handle 2-3x real-time, with two CPU cores, for a single recognition. We don't recommend keeping more than two concurrent recognitions per container instance, but we do recommend running more container instances for reliability and availability reasons, behind a load balancer.

That said, each container instance can run with more decoders. For example, you might be able to set up seven decoders per container instance on an eight-core machine, at more than 2x real-time each, yielding about 15x throughput. There's a parameter, DECODER_MAX_COUNT, to be aware of. In the extreme case, reliability and latency issues arise, while throughput increases significantly. For a microphone, recognition runs at 1x real-time, and the overall usage should be about one core for a single recognition.

For a scenario of processing 1,000 hours of audio per day in batch processing mode, in an extreme case, three VMs could handle it within 24 hours, but that isn't guaranteed. To handle spike days, failover, and updates, and to provide minimum backup and business continuity (BCP), we recommend four to five machines instead of three per cluster, and two or more clusters.
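
As a rough illustration of that extreme case, assuming 10 decoders per VM, each sustaining 2x real-time around the clock:

3 (VMs) * 10 (decoders per VM) * 2 (x real-time) * 24 (hours) = 1,440 hours of audio / day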

For hardware, we use the standard Azure VM DS13_v2 as a reference (each core must be 2.6 GHz or better, with the AVX2 instruction set enabled).

| Instance | vCPU(s) | RAM | Temp storage | Pay-as-you-go with AHB | 1-year reserved with AHB (% savings) | 3-year reserved with AHB (% savings) |
|---|---|---|---|---|---|---|
| DS13 v2 | 8 | 56 GiB | 112 GiB | $0.598/hour | $0.3528/hour (~41%) | $0.2333/hour (~61%) |

Based on the design reference (two clusters of five VMs to handle 1,000 hours/day of audio batch processing), one year of hardware cost is:

2 (clusters) * 5 (VMs per cluster) * $0.3528/hour * 365 (days) * 24 (hours) = $31 K / year

When mapping to a physical machine, a general estimation is 1 vCPU = 1 physical CPU core. In reality, 1 vCPU is more powerful than a single core.

For on-premises, all of these extra factors come into play:

  • What type the physical CPU is and how many cores it has
  • How many CPUs are running together on the same box/machine
  • How VMs are set up
  • How hyper-threading / multi-threading is used
  • How memory is shared
  • The OS, etc.

Normally an on-premises environment isn't as well tuned as the Azure environment. Considering other overhead, one estimation is that 10 physical CPU cores are equivalent to 8 Azure vCPUs, though popular CPUs only have eight cores. With on-premises deployment, the cost is higher than using Azure VMs. Also consider the depreciation rate.

Service cost is the same as the online service: $1/hour for speech to text. For 1,000 hours of audio per day over a year, the Speech service cost is:

$1/hour * 1,000 (hours/day) * 365 (days) = $365 K / year

Maintenance cost paid to Microsoft depends on the service level and content of the service. People cost isn't included. Other infrastructure costs such as storage, networks, and load balancers aren't included.

Can I use a custom acoustic model and language model with Speech container?

Only one model ID can be used: either a custom language model or a custom acoustic model. Passing acoustic and language models concurrently isn't supported.

Could you explain these errors from the custom speech to text container?

Error 1:

Failed to fetch manifest: Status: 400 Bad Request Body:
{
    "code": "InvalidModel",
    "message": "The specified model isn't supported for endpoint manifests."
}

If you're training with the latest custom model, we currently don't support that. If you train with an older version, it should be possible to use it. We're still working on supporting the latest versions.

Essentially, the custom containers don't support Halide or ONNX-based acoustic models (which are the default in the custom training portal). This is because custom models aren't encrypted and we don't want to expose ONNX models; language models are fine, however. The model size may be larger by 100 MB.

Error 2:

HTTPAPI result code = HTTPAPI_OK.
HTTP status code = 400.
Reason:  Synthesis failed.
StatusCode: InvalidArgument,
Details: Voice doesn't match.

You need to provide the correct voice name in the request, which is case-sensitive. Refer to the full service name mapping.
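
For example, with the Speech SDK in Python, the voice is set on the speech configuration before synthesis. The host URL and voice name below are placeholders; substitute the exact (case-sensitive) voice your container serves:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(host="http://localhost:5000")  # placeholder container host
# The voice name is case-sensitive and must exactly match a voice deployed in the container.
speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"  # placeholder voice name
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello, world.").get()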

Error 3:

{
    "code": "InvalidProductId",
    "message": "The subscription SKU \"CognitiveServices.S0\" isn't supported in this service instance."
}

You need to create a Speech resource, not an Azure AI services resource.

What API protocols are supported, REST or WS?

For the speech to text and custom speech to text containers, we currently support only the WebSocket-based protocol. The SDK only supports calling over WebSockets, not REST. For more information, see host URLs.
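
For illustration, here's a minimal sketch in Python that points the Speech SDK at a container over WebSockets. The host URL ws://localhost:5000 and the file name are assumptions, not fixed values:

import azure.cognitiveservices.speech as speechsdk

# Point the SDK at the container host instead of a cloud region.
# ws://localhost:5000 is a placeholder; use your container's host and port.
speech_config = speechsdk.SpeechConfig(host="ws://localhost:5000")
audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")  # placeholder file

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()
print(result.text)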

How can we benchmark a rough measure of transactions/second/core?

Here are some of the rough numbers to expect from the model prior to general availability (GA):

  • For files, the throttling is in the Speech SDK, at 2x. The first five seconds of audio aren't throttled. The decoder is capable of doing about 3x real-time. For this, the overall CPU usage is close to two cores for a single recognition.
  • For mic, it is at 1x real-time. The overall usage should be at about one core for a single recognition.

This can all be verified from the docker logs. We dump a line with session and phrase/utterance statistics, and that includes the real-time factor (RTF) numbers.

How do I make multiple containers run on the same host?

Set each container to listen on a different host port by using the -p argument: -p <outside_unique_port>:5000. For example: -p 5001:5000.

Technical questions

How can I get real-time APIs to handle audio > 30 seconds long?

The RecognizeOnce() SDK operation in interactive mode processes up to 30 seconds of audio, where utterances are expected to be short. Use StartContinuousRecognition() for dictation or conversation longer than 30 seconds.
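
Here's a hedged sketch of continuous recognition in Python; the container host URL and file name are placeholders:

import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(host="ws://localhost:5000")  # placeholder container host
audio_config = speechsdk.audio.AudioConfig(filename="long-audio.wav")  # placeholder file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

done = False

def stop_cb(evt):
    # Fires when the session stops or recognition is canceled.
    global done
    done = True

recognizer.recognized.connect(lambda evt: print(evt.result.text))
recognizer.session_stopped.connect(stop_cb)
recognizer.canceled.connect(stop_cb)

# Continuous recognition isn't limited to 30 seconds of audio.
recognizer.start_continuous_recognition()
while not done:
    time.sleep(0.5)
recognizer.stop_continuous_recognition()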

How many concurrent requests will a machine with 4 cores and 8 GB of RAM handle? If we have to serve, for example, 50 concurrent requests, how many cores and how much RAM are recommended?

At real-time, eight with our latest en-US model, so we recommend using more Docker containers beyond six concurrent requests. Beyond 16 cores, the setup becomes nonuniform memory access (NUMA) node sensitive. The following table describes the minimum and recommended allocation of resources for each Speech container.

| Container | Minimum | Recommended | Speech model |
|---|---|---|---|
| Speech to text | 4 core, 4-GB memory | 8 core, 8-GB memory | +4 to 8 GB memory |

Note

The minimum and recommended allocations are based on Docker limits, not the host machine resources. For example, the speech to text container memory maps portions of a large language model; we recommend that the entire file fit in memory. You need to add an extra 4 to 8 GB to load the speech models (see the preceding table). Also, the first run of either container might take longer because models are being paged into memory.

  • Each core must be at least 2.6 GHz or faster.
  • For files, the throttling will be in the Speech SDK, at 2x. The first 5 seconds of audio aren't throttled.
  • The decoder is capable of doing about 2-3x real-time. For this, the overall CPU usage will be close to two cores for a single recognition. That's why we don't recommend keeping more than two active connections per container instance. The extreme side would be to put about 10 decoders at 2x real-time in an eight-core machine like DS13_V2. For container version 1.3 and later, there's a parameter you could try setting: -e DECODER_MAX_COUNT=20.
  • For microphone, it is at 1x real-time. The overall usage should be at about one core for a single recognition.

Consider the total number of hours of audio you have. If the number is large, to improve reliability and availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker Compose.

As an example, to handle 1,000 hours of audio every 24 hours, we have tried setting up three to four VMs, with 10 instances/decoders per VM.

Does the Speech container support punctuation?

We have capitalization (ITN) available in the on-premises container. Punctuation is language-dependent, and not supported for some languages, including Chinese and Japanese.

We do have implicit and basic punctuation support in the existing containers, but it's off by default. What that means is that you can get the . character in your example, but not the 。 character. To enable this implicit logic, here's an example of how to do so in Python by using our Speech SDK. It would be similar in other languages:

import azure.cognitiveservices.speech as speechsdk

# speech_config is an existing speechsdk.SpeechConfig instance.
# Request implicit punctuation by passing the setting as a URI query parameter.
speech_config.set_service_property(
    name='punctuation',
    value='implicit',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)

Why am I getting 404 errors when attempting to POST data to speech to text container?

Speech to text containers don't support the REST API. The Speech SDK uses WebSockets. For more information, see host URLs.

Why is the container running as a nonroot user? What issues might occur because of this?

The default user inside the container is a nonroot user. This provides protection against processes escaping the container and obtaining escalated permissions on the host node. By default, some platforms like the OpenShift Container Platform already do this by running containers with an arbitrarily assigned user ID. For these platforms, the nonroot user must have permissions to write to any externally mapped volume that requires writes, for example, a logging folder or a custom model download folder.

When I use the speech to text service, why am I getting this error?

Error in STT call for file 9136835610040002161_413008000252496:
{
    "reason": "ResultReason.Canceled",
    "error_details": "Due to service inactivity the client buffer size exceeded. Resetting the buffer. SessionId: xxxxx..."
}

Cancellation can occur when the audio feed exceeds the Speech container's buffer size. You need to control the concurrency and the real-time factor (RTF) at which you send the audio. The RTF can fluctuate depending on the number of concurrent requests and the CPU and memory allocated.
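
One way to keep the feed close to real-time is to pace the audio yourself through a push stream. The following is a rough sketch in Python, assuming 16-kHz, 16-bit mono PCM audio; the host URL and file name are placeholders:

import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(host="ws://localhost:5000")  # placeholder container host
push_stream = speechsdk.audio.PushAudioInputStream()  # defaults to 16-kHz, 16-bit mono PCM
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
recognizer.recognized.connect(lambda evt: print(evt.result.text))
recognizer.start_continuous_recognition()

bytes_per_second = 16000 * 2          # 16-kHz, 16-bit mono PCM
chunk_size = bytes_per_second // 10   # 100 ms of audio per write

with open("audio.raw", "rb") as audio_file:  # placeholder file of raw PCM audio
    while True:
        chunk = audio_file.read(chunk_size)
        if not chunk:
            break
        push_stream.write(chunk)
        time.sleep(0.1)  # throttle writes to roughly 1x real-time

push_stream.close()
recognizer.stop_continuous_recognition()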