question

Minseong-7022 asked YinheWei-3673 edited

Asking about Speech to Text Pronunciation Assessment (phoneme recognition)

Hello,

I'm looking for a feature that returns the exact phoneme that was recognized.
I found that Azure Speech to Text pronunciation assessment provides a score for each phoneme of the reference text.
But I'm wondering whether there is a way to get the phoneme that was actually recognized (for example, when the phoneme assessment returns a low score).

Thank you in advance for your help.
Have a nice day :D

azure-speech

@Minseong-7022 You can set the granularity configuration parameter to Phoneme to get scores at the full-text, word, and phoneme levels. This should give you word-level results and accuracy. For example, for this phrase:

She sells seashells by the sea shore.

The following result is available:

 Word-level details:
     1: word: she, accuracy score: 100.0, error type: None;
     2: word: sells, accuracy score: 57.0, error type: None;
     3: word: seashells, accuracy score: 34.0, error type: Mispronunciation;
     4: word: by, accuracy score: 94.0, error type: Insertion;
     5: word: the, accuracy score: 94.0, error type: Insertion;
     6: word: she, accuracy score: 84.0, error type: Insertion;
     7: word: shore, accuracy score: 74.0, error type: Insertion;

Is this what you are expecting?





No, I already know that.
What I'm asking about is not the score but the exact wrong phoneme that results in the low score.


1 Answer

YinheWei-3673 answered YinheWei-3673 edited

Hi, @Minseong-7022

We have a preview feature that can probably handle your request.
You can add one additional field, "NBestPhonemeCount", to the JSON config as below:

var pronAssessmentConfig = PronunciationAssessmentConfig.FromJson($"{{\"referenceText\":\"<reference text>\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"dimension\":\"Comprehensive\",\"enableMiscue\":\"False\",\"NBestPhonemeCount\":5}}");
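
Hand-escaping the JSON inside a string literal is easy to get wrong, especially when the reference text contains quotes. As a rough sketch, the same config can be built by serializing a dictionary. This example uses Python only to show the idea. The helper name `build_pron_config` is hypothetical, the field names are copied from the line above, and "NBestPhonemeCount" is the preview field described in this answer.

```python
import json


def build_pron_config(reference_text, nbest_phoneme_count=5):
    """Return the pronunciation assessment config as a JSON string.

    Hypothetical helper: the field names mirror the config shown in this
    answer and are not checked against the service here.
    """
    config = {
        "referenceText": reference_text,
        "gradingSystem": "HundredMark",
        "granularity": "Phoneme",
        "dimension": "Comprehensive",
        "enableMiscue": "False",  # kept as a string, as in the original config
        "NBestPhonemeCount": nbest_phoneme_count,  # preview field
    }
    # json.dumps escapes quotes in the reference text for us.
    return json.dumps(config, ensure_ascii=False)


config_json = build_pron_config('She said "hello"')
print(config_json)
```

The resulting string is what would be passed to the SDK's JSON-based config constructor, such as `PronunciationAssessmentConfig.FromJson` in the C# line above.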

The "NBestPhonemeCount" field triggers the "NBestPhonemes" section in the output JSON payload, which lists the phonemes most likely spoken by the speaker, ranked by a score that indicates the probability.
You can treat the top-1 entry as the phoneme that was actually spoken.
See the example below:

      "Words": [
         {
            "Word" : "Good",
            "Offset" : 500000,
            "Duration" : 2700000,
            "PronunciationAssessment": {
               "AccuracyScore" : 100.0,
               "ErrorType" : "None"
            },
            "Syllables" : [
               {
                  "Syllable" : "ɡuhd",
                  "Offset" : 500000,
                  "Duration" : 2700000,
                  "PronunciationAssessment" : {
                     "AccuracyScore": 100.0
                  }
               }
            ],
            "Phonemes": [
               {
                  "Phoneme" : "ɡ",
                  "Offset" : 500000,
                  "Duration": 1200000,
                  "PronunciationAssessment": {
                     "AccuracyScore": 100.0,
                     "NBestPhonemes": [
                        {
                            "Phoneme": "g",
                            "Score": 100.0
                        },
                        {
                            "Phoneme": "k",
                            "Score": 5.0
                        },
                        ... // remaining n best phonemes
                     ]
                  }
               },
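
To read the recognized phoneme out of a payload like the one above, something along these lines might work. It is a minimal Python sketch that assumes the result has already been parsed into an object holding the "Words" array. The function name `recognized_phonemes` is hypothetical, and the embedded sample is a trimmed copy of the example above, not real service output.

```python
import json

# Trimmed copy of the example payload above (illustrative values only).
payload = json.loads("""
{
  "Words": [
    {
      "Word": "Good",
      "Phonemes": [
        {
          "Phoneme": "ɡ",
          "PronunciationAssessment": {
            "AccuracyScore": 100.0,
            "NBestPhonemes": [
              {"Phoneme": "g", "Score": 100.0},
              {"Phoneme": "k", "Score": 5.0}
            ]
          }
        }
      ]
    }
  ]
}
""")


def recognized_phonemes(result):
    """Yield (word, expected, recognized, accuracy) for every phoneme."""
    for word in result.get("Words", []):
        for ph in word.get("Phonemes", []):
            assessment = ph.get("PronunciationAssessment", {})
            candidates = assessment.get("NBestPhonemes", [])
            # Pick the highest-scoring candidate explicitly rather than
            # relying on the order of the list.
            best = max(candidates, key=lambda c: c["Score"], default=None)
            yield (
                word["Word"],
                ph["Phoneme"],                      # phoneme from the reference text
                best["Phoneme"] if best else None,  # most likely spoken phoneme
                assessment.get("AccuracyScore"),
            )


rows = list(recognized_phonemes(payload))
for word, expected, recognized, accuracy in rows:
    print(f"{word}: expected {expected!r}, heard {recognized!r}, accuracy {accuracy}")
```

Note that the sample writes the reference phoneme as "ɡ" and the candidate as "g", so a plain string comparison between the two may need normalization first.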

Thanks,
Yinhe
