63066220 asked GiftA-MSFT answered

Viseme Event time offsets in Custom Neural Voice are weird.

The Viseme event time offsets I get from a Custom Neural Voice look strange (ko-KR in use).
The listings below show the Viseme events output for speech synthesized from the same text with two different voice names (InJoonNeural and OHW_Neural).

Below is InJoonNeural, one of the existing Neural Voices.

InJoonNeural
1. Viseme : 0, Time : 50ms
2. Viseme : 20, Time : 50ms
3. Viseme : 4, Time : 350ms
4. Viseme : 21, Time : 450ms
5. Viseme : 2, Time : 525ms
6. Viseme : 19, Time : 625ms
7. Viseme : 12, Time : 650ms
8. Viseme : 4, Time : 650ms
9. Viseme : 6, Time : 775ms
10. Viseme : 8, Time : 806ms
11. Viseme : 0, Time : 50ms

Below is the Custom Neural Voice, OHW_Neural.

OHW_Neural
1. Viseme : 0, Time : 50ms
2. Viseme : 20, Time : 100ms
3. Viseme : 4, Time : 225ms
4. Viseme : 21, Time : 325ms
5. Viseme : 2, Time : 375ms
6. Viseme : 19, Time : 437ms
7. Viseme : 4, Time : 650ms
8. Viseme : 6, Time : 700ms
9. Viseme : 8, Time : 893ms
10. Viseme : 0, Time : 1087ms

In InJoonNeural, the time between No. 6 and No. 8 is 25 ms (625 ms to 650 ms), while the time between the corresponding events in OHW_Neural, No. 6 and No. 7, is 213 ms (437 ms to 650 ms), which is a large difference.
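The gaps can be recomputed directly from the offsets listed above. This is only a minimal sketch: the event lists are copied from the two listings, and the helper names `gaps_ms` and `gap_between` are made up for illustration.

```python
# Viseme events copied from the listings above as (viseme_id, offset_ms).
INJOON = [(0, 50), (20, 50), (4, 350), (21, 450), (2, 525), (19, 625),
          (12, 650), (4, 650), (6, 775), (8, 806), (0, 50)]
OHW = [(0, 50), (20, 100), (4, 225), (21, 325), (2, 375), (19, 437),
       (4, 650), (6, 700), (8, 893), (0, 1087)]


def gaps_ms(events):
    """Return the time difference between each pair of consecutive events."""
    return [b[1] - a[1] for a, b in zip(events, events[1:])]


def gap_between(events, first_no, second_no):
    """Gap in ms between two events, given their 1-based numbers in the listing."""
    return events[second_no - 1][1] - events[first_no - 1][1]


if __name__ == "__main__":
    print("InJoonNeural No. 6 -> No. 8:", gap_between(INJOON, 6, 8), "ms")
    print("OHW_Neural   No. 6 -> No. 7:", gap_between(OHW, 6, 7), "ms")
    print("OHW_Neural consecutive gaps:", gaps_ms(OHW))
```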

When I compared these offsets with the wav file synthesized directly, I found that Viseme events are still being output at points where none should appear, because the speech is almost finished. (This happens from No. 7 of OHW_Neural onward.)
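A check like this can be scripted. The sketch below assumes the synthesized audio was saved as a PCM WAV file (the only kind Python's standard `wave` module reads). The helper names and the 100 ms `tail_ms` threshold are arbitrary illustrative choices, not values from the Speech service.

```python
import wave


def wav_duration_ms(path_or_file):
    """Duration of a PCM WAV file in milliseconds, computed from its header."""
    with wave.open(path_or_file, "rb") as wav:
        return 1000.0 * wav.getnframes() / wav.getframerate()


def events_near_end(events, duration_ms, tail_ms=100.0):
    """Return events whose offset lies in the last `tail_ms` of the audio or past it.

    `events` is a list of (viseme_id, offset_ms) pairs. `tail_ms` is an
    arbitrary threshold chosen for illustration only.
    """
    cutoff = duration_ms - tail_ms
    return [(vid, t) for vid, t in events if t >= cutoff]
```

For example, `events_near_end(ohw_events, wav_duration_ms("ohw_neural.wav"))` would list the suspicious events, where `ohw_events` and the file name `ohw_neural.wav` are hypothetical.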

Is there any way to fix the problems described above?
I wonder whether improving the pronunciation accuracy of the training data used for the Custom Neural Voice would help.

azure-speech

Hi, I've forwarded your feedback to the product group, will share updates soon. Thanks.


1 Answer

GiftA-MSFT answered

Hi, following up. Different voices have different speaking rates, so viseme times can't be compared between voices. For the issue "When comparing this with the directly synthesized wav file, I found that the Viseme Event is being output when the Viseme Event should not appear as the voice is almost finished. (From No. 7 of OHW_Neural)", please redeploy your endpoint. Redeploying the endpoint picks up the latest code.
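If a rough side-by-side look at two voices is still wanted, one option is to rescale each voice's offsets to a fraction of its own utterance length. This is only an illustrative sketch, not an official method: it approximates the utterance length by the largest offset in the list, and phoneme durations and pauses do not scale uniformly with speaking rate. The function name `relative_positions` is hypothetical.

```python
def relative_positions(events):
    """Map each (viseme_id, offset_ms) to (viseme_id, fraction of the utterance).

    The utterance length is approximated by the largest offset in the list,
    which is an assumption of this sketch, not something the service reports.
    """
    total = max(t for _, t in events)
    if total <= 0:
        raise ValueError("need at least one event with a positive offset")
    return [(vid, t / total) for vid, t in events]
```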



--- Kindly Accept Answer if the information helps. Thanks.
