DougBergman-1312 asked GinoCuevas-1238 edited

How to get phonemes from the Azure Speech SDK

Hi, I am following the Microsoft Azure Speech-to-Text Python SDK tutorial. I would like to know if there is a way to return the phonemes, which I assume are an intermediate step in generating the interpreted text. Is that possible? If so, can you please refer me to the documentation and, hopefully, some sample code that does this? I searched and could not find anything that already answered my question.

Thanks!
Doug




azure-cognitive-services azure-speech

Hello Doug,

Thanks for reaching out to us. I don't think there is an official way to do that, but I can pass it to the product group as a feature request. Could you please tell me the specific reason or scenario for which you want the intermediate result?


Regards,
Yutong


Hi Yutong, thanks for the reply. I can't go into great detail but basically I would like to identify the boundaries of phonemes and syllables within wav files. I hope that helps to clarify.

SarthakAgarwal-6706 answered GinoCuevas-1238 edited

Hi @DougBergman-1312

Is this not what you're looking for? Here is a sample response with the phonemes of the word "Thank":

{
  "Duration": 4700000,
  "Offset": 11500000,
  "Phonemes": [
    {
      "Duration": 2100000,
      "Offset": 11500000,
      "Phoneme": "th",
      "PronunciationAssessment": {
        "AccuracyScore": 100.0
      }
    },
    {
      "Duration": 900000,
      "Offset": 13700000,
      "Phoneme": "ae",
      "PronunciationAssessment": {
        "AccuracyScore": 100.0
      }
    },
    {
      "Duration": 700000,
      "Offset": 14700000,
      "Phoneme": "ng",
      "PronunciationAssessment": {
        "AccuracyScore": 100.0
      }
    },
    {
      "Duration": 700000,
      "Offset": 15500000,
      "Phoneme": "k",
      "PronunciationAssessment": {
        "AccuracyScore": 100.0
      }
    }
  ],
  "PronunciationAssessment": {
    "AccuracyScore": 100.0,
    "ErrorType": "None"
  },
  "Word": "Thank"
}
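
For the phoneme-boundary use case, the sample response above can be turned into start and end times. The following is a minimal Python sketch, not code from the original poster. It assumes that Offset and Duration are expressed in 100-nanosecond units, as in the Speech service's JSON results; the helper name phoneme_boundaries is hypothetical.

```python
import json

# Assumption: Offset and Duration are in 100-nanosecond units (10,000,000 per second).
TICKS_PER_SECOND = 10_000_000


def phoneme_boundaries(word):
    """Return (phoneme, start_seconds, end_seconds) for each phoneme in a word entry."""
    boundaries = []
    for p in word["Phonemes"]:
        start = p["Offset"] / TICKS_PER_SECOND
        end = (p["Offset"] + p["Duration"]) / TICKS_PER_SECOND
        boundaries.append((p["Phoneme"], start, end))
    return boundaries


# Trimmed copy of the sample response above (accuracy scores omitted).
sample = json.loads("""
{
  "Duration": 4700000,
  "Offset": 11500000,
  "Phonemes": [
    {"Duration": 2100000, "Offset": 11500000, "Phoneme": "th"},
    {"Duration": 900000,  "Offset": 13700000, "Phoneme": "ae"},
    {"Duration": 700000,  "Offset": 14700000, "Phoneme": "ng"},
    {"Duration": 700000,  "Offset": 15500000, "Phoneme": "k"}
  ],
  "Word": "Thank"
}
""")

for phoneme, start, end in phoneme_boundaries(sample):
    print(f"{phoneme:>3}  {start:.2f}s - {end:.2f}s")
```

Under that assumption, the "th" phoneme in the sample runs from 1.15 s to 1.36 s into the audio.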


Hi @SarthakAgarwal-6706, yes, that is what I had in mind. Can you please post the script that returned this JSON?


Hi @DougBergman-1312
Were you able to figure out how @SarthakAgarwal-6706 was able to get the response with Phoneme results?

I can't figure out how to achieve such results.

YutongTie-MSFT answered

Hello Doug,

This is the pronunciation assessment feature.

Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence. Educators can use the capability to evaluate the pronunciation of multiple speakers in real time.

Note that the pronunciation assessment feature is currently available only in the westus, eastasia and centralindia regions, and it supports only the en-US language.

Please refer to the following sample code for how to set it up and retrieve the results.

https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/windows/console/samples/speech_recognition_samples.cpp#L633

Regards,
Yutong
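
Since the question was about the Python SDK and the linked sample is C++, here is a rough Python sketch of the same idea. It is an illustration rather than an official sample: the function names recognize_with_phonemes and words_from_json are hypothetical, the key, region, WAV path and reference text are values you supply, and it assumes the detailed JSON result carries an NBest list whose first entry holds the Words array.

```python
import json


def recognize_with_phonemes(wav_path, reference_text, key, region):
    """Run one recognition with phoneme-level pronunciation assessment; return the raw JSON."""
    # Imported here so the parsing helper below also works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, language="en-US", audio_config=audio_config)

    # Phoneme granularity asks the service for per-phoneme timing and scores.
    pron_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
    pron_config.apply_to(recognizer)

    # recognize_once handles a single utterance; longer files need continuous recognition.
    result = recognizer.recognize_once()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Recognition failed: {result.reason}")
    return result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult)


def words_from_json(raw_json):
    """Return the per-word entries of the top hypothesis (assumes an NBest list)."""
    doc = json.loads(raw_json)
    nbest = doc.get("NBest") or []
    return nbest[0].get("Words", []) if nbest else []
```

Calling words_from_json(recognize_with_phonemes("hello.wav", "Thank you", key, region)) with your own key, region and file should yield word entries, each with a Phonemes list when phoneme granularity is honored. Note that pronunciation assessment needs a reference text, so this suits audio whose transcript is already known.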
