How do we get SAPI or IPA phoneme results for en-GB for a pronunciation assessment?

Umair Habib 0 Reputation points
2024-03-26T03:20:15.3233333+00:00

Hi,

I am trying to use the pronunciation assessment feature of speech to text in Azure AI Speech. Here is the code I used:

import json

import azure.cognitiveservices.speech as speechsdk
from decouple import config  # assumption: config() is python-decouple's settings reader


def speech_recognize_with_pronuncation(filePath, text_script):
    referenceText = text_script if text_script else ""
    speech_config = speechsdk.SpeechConfig(subscription=config("SPEECH_KEY"), region=config("SERVICE_REGION"))
    speech_config.speech_recognition_language = "en-GB"
    audio_config = speechsdk.audio.AudioConfig(filename=filePath)

    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_config
    )

    # Build the JSON with json.dumps so quotes in the reference text are escaped.
    pronunciation_config = speechsdk.PronunciationAssessmentConfig(json_string=json.dumps({
        "referenceText": referenceText,
        "gradingSystem": "HundredMark",
        "granularity": "Phoneme",
        "phonemeAlphabet": "IPA",
    }))

    pronunciation_config.enable_prosody_assessment()
    pronunciation_config.enable_content_assessment_with_topic("greeting")
    pronunciation_config.apply_to(speech_recognizer)

    # Recognize a single utterance from the audio file.
    speech_recognition_result = speech_recognizer.recognize_once()

    # The detailed assessment is returned as a JSON string in the result properties.
    pronunciation_assessment_result_json = speech_recognition_result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult)

    return pronunciation_assessment_result_json

The problem is that the phoneme names returned by the assessment are empty, even though accuracy scores are reported for those empty phonemes. The response below shows the empty phonemes together with their scores:

{
    "status_code": 200,
    "message": "Successful",
    "detail": {
        "Id": "d954ca5f17f54eb9a36978c67211b4fc",
        "RecognitionStatus": "Success",
        "Offset": 800000,
        "Duration": 4000000,
        "Channel": 0,
        "DisplayText": "Heart.",
        "SNR": 46.816277,
        "NBest": [
            {
                "Confidence": 0.91353595,
                "Lexical": "heart",
                "ITN": "heart",
                "MaskedITN": "heart",
                "Display": "Heart.",
                "PronunciationAssessment": {
                    "AccuracyScore": 42.0,
                    "FluencyScore": 100.0,
                    "ProsodyScore": 0.0,
                    "CompletenessScore": 100.0,
                    "PronScore": 28.4
                },
                "Words": [
                    {
                        "Word": "heart",
                        "Offset": 800000,
                        "Duration": 4000000,
                        "PronunciationAssessment": {
                            "AccuracyScore": 42.0,
                            "ErrorType": "Mispronunciation",
                            "Feedback": {
                                "Prosody": {
                                    "Break": {
                                        "ErrorTypes": [
                                            "None"
                                        ],
                                        "BreakLength": 0
                                    },
                                    "Intonation": {
                                        "ErrorTypes": [],
                                        "Monotone": {
                                            "SyllablePitchDeltaConfidence": 0.50218755
                                        }
                                    }
                                }
                            }
                        },
                        "Syllables": [
                            {
                                "Syllable": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 800000,
                                "Duration": 4000000
                            }
                        ],
                        "Phonemes": [
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 800000,
                                "Duration": 1300000,
                                "blend_shape": []
                            },
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 2200000,
                                "Duration": 500000,
                                "blend_shape": []
                            },
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 2800000,
                                "Duration": 700000,
                                "blend_shape": []
                            },
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 3600000,
                                "Duration": 1200000,
                                "blend_shape": []
                            }
                        ]
                    }
                ],
                "Incorrect_spoken_words": [
                    {
                        "Word": "heart",
                        "Offset": 800000,
                        "Duration": 4000000,
                        "PronunciationAssessment": {
                            "AccuracyScore": 42.0,
                            "ErrorType": "Mispronunciation",
                            "Feedback": {
                                "Prosody": {
                                    "Break": {
                                        "ErrorTypes": [
                                            "None"
                                        ],
                                        "BreakLength": 0
                                    },
                                    "Intonation": {
                                        "ErrorTypes": [],
                                        "Monotone": {
                                            "SyllablePitchDeltaConfidence": 0.50218755
                                        }
                                    }
                                }
                            }
                        },
                        "Syllables": [
                            {
                                "Syllable": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 800000,
                                "Duration": 4000000
                            }
                        ],
                        "Phonemes": [
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 800000,
                                "Duration": 1300000,
                                "blend_shape": []
                            },
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 2200000,
                                "Duration": 500000,
                                "blend_shape": []
                            },
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 2800000,
                                "Duration": 700000,
                                "blend_shape": []
                            },
                            {
                                "Phoneme": "",
                                "PronunciationAssessment": {
                                    "AccuracyScore": 42.0
                                },
                                "Offset": 3600000,
                                "Duration": 1200000,
                                "blend_shape": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
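
To make the symptom easier to check, here is a minimal sketch that lists every phoneme whose name is empty but which still carries a score. It assumes the NBest / Words / Phonemes layout shown in the response above; `find_empty_phonemes` and `sample` are hypothetical names used only for illustration, and the "detail" wrapper is specific to my own response format rather than the raw service JSON.

```python
import json


def find_empty_phonemes(result_json):
    """Return (word, offset, accuracy) for each phoneme with an empty name."""
    result = json.loads(result_json) if isinstance(result_json, str) else result_json
    # The response above nests the service JSON under "detail"; the raw
    # service result has "NBest" at the top level, so accept both shapes.
    payload = result.get("detail", result)
    empty = []
    for candidate in payload.get("NBest", []):
        for word in candidate.get("Words", []):
            for phoneme in word.get("Phonemes", []):
                if not phoneme.get("Phoneme"):
                    assessment = phoneme.get("PronunciationAssessment", {})
                    empty.append(
                        (word.get("Word"), phoneme.get("Offset"), assessment.get("AccuracyScore"))
                    )
    return empty


# Trimmed copy of the response above (first two phonemes of "heart").
sample = {
    "detail": {
        "NBest": [
            {
                "Words": [
                    {
                        "Word": "heart",
                        "Phonemes": [
                            {"Phoneme": "", "PronunciationAssessment": {"AccuracyScore": 42.0}, "Offset": 800000},
                            {"Phoneme": "", "PronunciationAssessment": {"AccuracyScore": 42.0}, "Offset": 2200000},
                        ],
                    }
                ]
            }
        ]
    }
}

print(find_empty_phonemes(sample))
```

A non-empty list from this check means the service returned scores without phoneme names, which is exactly what I see for en-GB.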

Please address this issue as soon as possible, since this feature is crucial to my project and our usage of it will be large.

Thank you
Umair Habib

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. navba-MSFT 17,115 Reputation points Microsoft Employee
    2024-03-26T04:25:58.4066667+00:00

    @Umair Habib Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!

    Please refer to this documentation.


    I also discussed this with the Product Owners. Support for en-GB was stopped recently. Please note that the SAPI phones we exposed for en-GB were incorrect, because we had just copied the en-US phones for en-GB.

    Also, en-GB does not have an accurate SAPI definition in our documentation. See the article below:

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-ssml-phonetic-sets

    Also note that if the phoneme name is not available, the syllable and the spoken phoneme are not available either.

    Regarding the ETA in the roadmap for the Pronunciation Assessment support for en-GB:

    If your Pronunciation Assessment usage is larger than 1k/month, or reaches the minimum of the Commitment Tiers, we can prioritize our work to support en-GB and expose the phoneme names.


    Hope this helps.
