Azure Pronunciation Assessment recognition offset lag

Andrew Pasquale 20 Reputation points
2024-04-22T16:43:22.98+00:00

I'm using the Pronunciation Assessment with the recognizeOnceAsync method.

We are presenting a word for assessment and measuring the response time. Sometimes the offset returned with the recognition result corresponds closely to the time reported by the sessionStarted event, but other times there is a lag between the two, and sometimes the person speaks the word so quickly that no recognition occurs at all.

Are there any strategies, or additional events to wait on, to know when the service is ready to recognize speech? One possibility might be to use continuous recognition and pause it when we aren't expecting speech. Our app relies on measuring timing accurately, and this seems like it would be an issue for most scenarios that use the recognizeOnceAsync method.
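
Roughly, the measurement pattern looks like this (assuming the JavaScript Speech SDK, microsoft-cognitiveservices-speech-sdk; the key, region, and target word below are placeholders):

    import * as sdk from "microsoft-cognitiveservices-speech-sdk";

    const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_KEY", "YOUR_REGION");
    const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
    const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

    const pronConfig = new sdk.PronunciationAssessmentConfig(
      "example", // the word presented for assessment
      sdk.PronunciationAssessmentGradingSystem.HundredMark,
      sdk.PronunciationAssessmentGranularity.Phoneme
    );
    pronConfig.applyTo(recognizer);

    let sessionStartedAt = 0;
    recognizer.sessionStarted = (_s, _e) => {
      sessionStartedAt = Date.now(); // wall-clock time when the session opens
    };

    recognizer.recognizeOnceAsync((result) => {
      // result.offset is in 100-nanosecond ticks from the start of the audio stream.
      const offsetMs = result.offset / 10000;
      console.log(`sessionStarted at ${sessionStartedAt}; recognition offset ${offsetMs} ms`);
    });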

Azure AI Speech

Accepted answer
    dupammi 6,645 Reputation points Microsoft Vendor
    2024-04-23T01:16:26.1266667+00:00

    Hi @Andrew Pasquale

    Thank you for using the Microsoft Q&A forum.

    Regarding your query, it seems that you are experiencing a lag between the offset returned with the recognition result and the time reported by the sessionStarted event when using the Pronunciation Assessment feature with the recognizeOnceAsync method in Azure AI Speech.

    One possible solution is to use continuous recognition and pause it when you aren't expecting speech. Because the session stays open across prompts, this helps you avoid the lag between the recognition offset and the sessionStarted time. Additionally, you can listen for the sessionStopped event, which fires when a session ends, to determine when it is safe to start the next session.
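
    A minimal sketch of that approach, assuming the JavaScript Speech SDK (microsoft-cognitiveservices-speech-sdk) with placeholder key, region, and reference word, might look like this:

        import * as sdk from "microsoft-cognitiveservices-speech-sdk";

        const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_KEY", "YOUR_REGION");
        const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
        const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

        const pronConfig = new sdk.PronunciationAssessmentConfig(
          "example", // the word being assessed
          sdk.PronunciationAssessmentGradingSystem.HundredMark,
          sdk.PronunciationAssessmentGranularity.Phoneme
        );
        pronConfig.applyTo(recognizer);

        // With continuous recognition, results arrive through the recognized
        // event instead of a one-shot callback.
        recognizer.recognized = (_s, e) => {
          if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
            const assessment = sdk.PronunciationAssessmentResult.fromResult(e.result);
            console.log(`offset ${e.result.offset / 10000} ms, accuracy ${assessment.accuracyScore}`);
          }
        };

        // sessionStopped fires once the session has fully ended, so it is a
        // safe point to start the recognizer again for the next word.
        recognizer.sessionStopped = () => {
          console.log("Session stopped; ready for the next prompt.");
        };

        // Start before presenting the word; stop once the word has been scored.
        recognizer.startContinuousRecognitionAsync();
        // ... later: recognizer.stopContinuousRecognitionAsync();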

    Another option is to set the granularity configuration parameter of the Pronunciation Assessment feature to Phoneme, which returns scores at the full-text, word, and phoneme levels and gives you word-level results and accuracy. You can also use the NBestPhonemeCount field in the PronunciationAssessmentConfig to get the phonemes most probably spoken by the speaker, ranked by a probability score; the top-ranked phoneme can be treated as the one actually spoken.

    To indicate whether confidence scores should be returned for potential spoken phonemes, and for how many, set the NBestPhonemeCount parameter to an integer value such as 5.
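
    For example, assuming the JavaScript Speech SDK, the configuration might look like this (the reference word is a placeholder):

        const pronConfig = new sdk.PronunciationAssessmentConfig(
          "example",
          sdk.PronunciationAssessmentGradingSystem.HundredMark,
          sdk.PronunciationAssessmentGranularity.Phoneme // full-text, word, and phoneme scores
        );
        pronConfig.phonemeAlphabet = "IPA"; // NBest phoneme results are reported with the IPA alphabet
        pronConfig.nbestPhonemeCount = 5;   // confidence scores for the top 5 candidate phonemes
        pronConfig.applyTo(recognizer);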

    I hope this helps. Thank you.


    Please don't forget to click Accept Answer and Yes if the provided answer was helpful.


0 additional answers
