HirofumiKojima-5332 asked Samir-6734 answered

Text to Speech with timestamp in JSON format

Hi,

Does Azure text-to-speech (TTS) have a feature similar to Amazon Polly speech marks?
For example, given a text, it will provide the following output.


input: "Mary had a little lamb."
output (json format): {"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
# "time" is the timestamp in milliseconds; "start" and "end" are byte offsets into the input text.


I plan to convert this JSON output to an SRT file for subtitles, so if Azure TTS can produce JSON like the example above, I would appreciate it if you could let me know.


Regards.


azure-speech

romungi-MSFT answered FathyEltanany-8942 commented

@HirofumiKojima-5332 Yes, this should be possible by subscribing to the WordBoundary events. This event is raised at the beginning of each new spoken word and will provide a time offset within the spoken stream and a text offset within the input prompt.

  • AudioOffset reports the output audio's elapsed time between the beginning of synthesis and the start of the next word. This is measured in hundred-nanosecond units (HNS) with 10,000 HNS equivalent to 1 millisecond.

  • WordOffset reports the character position in the input string (original text or SSML) immediately before the word that's about to be spoken.

You can also subscribe to viseme events alongside the word boundary events to get output similar to Amazon Polly's speech marks.
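Since the goal is an SRT file, the unit conversion described above can be sketched as follows. This is a minimal sketch, not SDK code: `hns_to_ms` and `ms_to_srt_timestamp` are hypothetical helper names, and the only fact taken from the service is that 10,000 HNS equal 1 millisecond.

```python
HNS_PER_MS = 10_000  # 10,000 hundred-nanosecond units = 1 millisecond


def hns_to_ms(audio_offset_hns: int) -> int:
    """Convert an AudioOffset value (HNS) to whole milliseconds."""
    return audio_offset_hns // HNS_PER_MS


def ms_to_srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


# An offset of 500,000 HNS is 50 ms into the audio.
print(ms_to_srt_timestamp(hns_to_ms(500_000)))  # 00:00:00,050
```

Each subtitle cue would then use the offset of its first word as the start time and the offset of the following cue (or the audio length) as the end time.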



@romungi-MSFT

I have tried WordBoundary and it returns the result below:
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, text_offset=192, word_length=9), audio offset in ms: 50.0ms

Is there any way to return the input word along with the audio offset?
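One possible approach, assuming `text_offset` and `word_length` index into the same string that was passed to the synthesizer (the plain text, or the full SSML if SSML was used), is to slice the word out of that string inside the event handler. The sketch below uses a hypothetical helper, `word_boundary_to_mark`, and made-up sample values; it does not call the Speech SDK, so the three arguments stand in for the fields of the event shown above.

```python
import json


def word_boundary_to_mark(text: str, audio_offset: int, text_offset: int, word_length: int) -> dict:
    """Build a Polly-style speech mark from word boundary fields.

    audio_offset is in hundred-nanosecond units; text_offset and
    word_length are assumed to be character positions in `text`.
    """
    end = text_offset + word_length
    return {
        "time": audio_offset // 10_000,  # HNS -> milliseconds
        "type": "word",
        "start": text_offset,
        "end": end,
        "value": text[text_offset:end],
    }


# Sample values chosen for illustration only.
text = "Mary had a little lamb."
mark = word_boundary_to_mark(text, audio_offset=500_000, text_offset=5, word_length=3)
print(json.dumps(mark))
# {"time": 50, "type": "word", "start": 5, "end": 8, "value": "had"}
```

In a real handler the same function would be called with `evt.audio_offset`, `evt.text_offset`, and `evt.word_length`. Note that these offsets are character positions, whereas Polly reports byte offsets, so the two differ for non-ASCII text.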

Samir-6734 answered

@romungi-MSFT I have tried to capture speech marks using WordBoundary, but I am not receiving the event. Here is the post I created with a detailed explanation. Would you be able to guide me on what might be wrong in my code?

https://docs.microsoft.com/en-us/answers/questions/849161/azure-text-to-speech-synthesizerwordboundary-metho.html


Thanks,
Samir
