LuciaPozzan-8319 asked romungi-MSFT commented

SSML prosody tag

According to SSML 1.1 (https://www.w3.org/TR/speech-synthesis11/), the rate attribute of the prosody tag should only take a non-negative percentage or one of the predefined labels:

"rate: a change in the speaking rate for the contained text. Legal values are: a non-negative percentage or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When the value is a non-negative percentage it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well."

However, Microsoft TTS does not seem to follow this when a prosody rate is specified: <prosody rate="30.00%"> plays faster than 100% and seems to be interpreted as <prosody rate="+30.00%">.

Is this a bug or a conscious decision to depart from SSML standards? Is there a way to force the tag to be interpreted as intended?

azure-speech

@LuciaPozzan-8319 By design, Azure TTS treats the rate as a relative value, as described in the documentation (screenshot attached: 112789-image.png).

I did not find a way to force it to use the standard as mentioned above, though.
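
As a possible workaround, the conversion can be done on the client side before the SSML is sent. The sketch below is only an illustration, assuming (as described in this thread) that the service reads a percentage as a change relative to the default rate, so that the SSML 1.1 multiplier 50% corresponds to "-50.00%" and 200% to "+100.00%". The function names `w3c_rate_to_relative` and `prosody` are hypothetical helpers, not part of any Azure SDK.

```python
from xml.sax.saxutils import escape


def w3c_rate_to_relative(rate: str) -> str:
    """Convert an SSML 1.1 multiplier percentage (e.g. "50%") into a
    signed relative change (e.g. "-50.00%").

    Assumption: the target service treats percentages as relative changes,
    so 100% (no change in SSML 1.1) maps to "+0.00%".
    """
    value = float(rate.strip().rstrip("%"))
    if value < 0:
        raise ValueError("SSML 1.1 rate percentages must be non-negative")
    delta = value - 100.0  # distance from the default rate
    return f"{delta:+.2f}%"


def prosody(text: str, w3c_rate: str) -> str:
    """Wrap text in a prosody element using the converted rate (hypothetical helper)."""
    return f'<prosody rate="{w3c_rate_to_relative(w3c_rate)}">{escape(text)}</prosody>'


print(prosody("Hello world", "30.00%"))
```

With this mapping, the rate="30.00%" from the question would be sent as rate="-70.00%", which should come out slower than the default rather than faster, provided the relative interpretation holds.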

