Jeonghoon-0637 asked:

Get facial pose events

Hello,

I am using viseme events.

For example, when the text "hello" is synthesized to speech and the viseme events for that speech are emitted, the audio and the viseme events do not match.


Audio offset: 50ms, viseme id: 0.
Audio offset: 50ms, viseme id: 12.
Audio offset: 237ms, viseme id: 4.
Audio offset: 300ms, viseme id: 14.
Audio offset: 387ms, viseme id: 8.
Audio offset: 512ms, viseme id: 0.

The actual voice starts at 118 ms, but the reported audio offset is 50 ms. How can I solve this problem?
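For context, `AudioOffset` arrives in 100-nanosecond ticks, which is why the logging code divides by 10,000 to get milliseconds. The sketch below (Python, illustrative values only) shows that conversion, plus one possible workaround — re-basing the event timeline by the measured onset difference. The 118 ms onset is the figure from this question; treating the mismatch as a constant shift is an assumption, not a documented fix.

```python
# AudioOffset is reported in 100-nanosecond ticks; dividing by 10,000
# yields milliseconds (matching `e.AudioOffset / 10000` in the C# handler).
TICKS_PER_MS = 10_000

def ticks_to_ms(audio_offset_ticks: int) -> float:
    """Convert a 100-ns tick offset to milliseconds."""
    return audio_offset_ticks / TICKS_PER_MS

print(ticks_to_ms(500_000))  # -> 50.0

# Possible workaround (assumption): if the measured voice onset in the
# rendered audio (118 ms here) differs from the first non-silence viseme
# offset (50 ms), shift every event by the constant difference.
measured_onset_ms = 118.0
first_viseme_ms = 50.0
shift = measured_onset_ms - first_viseme_ms  # 68.0

events = [(50, 12), (237, 4), (300, 14), (387, 8)]  # (offset_ms, viseme_id)
shifted = [(off + shift, vid) for off, vid in events]
print(shifted[0])  # -> (118.0, 12)
```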


https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?pivots=programming-language-csharp

azure-speech

Hi, we weren't able to reproduce this issue. Can you kindly share the SSML used? Thanks.


Hi,

I am using Unity3D.

This is the code used to synthesize from SSML, and the resulting events:


var synthesizer = new SpeechSynthesizer(config, null);

// Log each viseme event; AudioOffset is in 100-ns ticks, so divide by 10,000 for ms.
synthesizer.VisemeReceived += (s, e) =>
{
    Debug.Log($"Viseme event received. Audio offset: {e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}");
};

string ssml = "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:mstts=\"https://www.w3.org/2001/mstts\" xml:lang=\"en-US\"><voice name=\"en-US-GuyNeural\">hello</voice></speak>";

var result = await synthesizer.SpeakSsmlAsync(ssml);

Audio offset: 50ms, viseme id: 0.
Audio offset: 50ms, viseme id: 12.
Audio offset: 237ms, viseme id: 4.
Audio offset: 300ms, viseme id: 14.
Audio offset: 387ms, viseme id: 8.
Audio offset: 512ms, viseme id: 0.


This is the code used to synthesize plain text, and the resulting events:

var result = await synthesizer.SpeakTextAsync("hello");

Audio offset: 50ms, viseme id: 0.
Audio offset: 50ms, viseme id: 12.
Audio offset: 200ms, viseme id: 4.
Audio offset: 325ms, viseme id: 14.
Audio offset: 400ms, viseme id: 8.
Audio offset: 600ms, viseme id: 0.
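As a side-by-side check, the SSML and plain-text runs above produce the same viseme sequence with different timings. A small script to diff the two timelines (values copied directly from the logs above):

```python
# Viseme timelines from the two runs above: (offset_ms, viseme_id).
ssml_run = [(50, 0), (50, 12), (237, 4), (300, 14), (387, 8), (512, 0)]
text_run = [(50, 0), (50, 12), (200, 4), (325, 14), (400, 8), (600, 0)]

for (s_off, s_id), (t_off, t_id) in zip(ssml_run, text_run):
    assert s_id == t_id  # the viseme sequence is identical; only timing differs
    print(f"viseme {s_id:2d}: ssml={s_off}ms  text={t_off}ms  delta={t_off - s_off:+d}ms")
```

Notably, both runs start at the same 50 ms offset, so whatever leading padding precedes the voice onset affects SSML and plain-text synthesis equally.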



0 Answers