Pronunciation assessment

Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence. Educators can use the capability to evaluate the pronunciation of multiple speakers in real time.

In this article, you'll learn how to set up PronunciationAssessmentConfig and retrieve the PronunciationAssessmentResult using the Speech SDK.


The pronunciation assessment feature currently supports the en-US language, which is available in all speech-to-text regions. Support for the en-GB and zh-CN languages is in preview and is available in the westus, eastasia, and centralindia regions.

Pronunciation assessment with the Speech SDK

In the samples below, you'll create a PronunciationAssessmentConfig, then apply it to a SpeechRecognizer.

The following snippets illustrate how to use pronunciation assessment in your apps:

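// C#
// speechConfig and audioConfig are assumed to be created elsewhere
var pronunciationAssessmentConfig = new PronunciationAssessmentConfig(
    "reference text", GradingSystem.HundredMark, Granularity.Phoneme);

using (var recognizer = new SpeechRecognizer(speechConfig, audioConfig))
{
    // apply the pronunciation assessment configuration to the speech recognizer
    pronunciationAssessmentConfig.ApplyTo(recognizer);
    var speechRecognitionResult = await recognizer.RecognizeOnceAsync();
    var pronunciationAssessmentResult =
        PronunciationAssessmentResult.FromResult(speechRecognitionResult);
    var pronunciationScore = pronunciationAssessmentResult.PronunciationScore;
}
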
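// C++
// speechConfig and audioConfig are assumed to be created elsewhere
auto pronunciationAssessmentConfig =
    PronunciationAssessmentConfig::Create("reference text",
        PronunciationAssessmentGradingSystem::HundredMark,
        PronunciationAssessmentGranularity::Phoneme);

auto recognizer = SpeechRecognizer::FromConfig(
    speechConfig,
    audioConfig);

// apply the pronunciation assessment configuration to the speech recognizer
pronunciationAssessmentConfig->ApplyTo(recognizer);
auto speechRecognitionResult = recognizer->RecognizeOnceAsync().get();
auto pronunciationAssessmentResult =
    PronunciationAssessmentResult::FromResult(speechRecognitionResult);
auto pronunciationScore = pronunciationAssessmentResult->PronunciationScore;
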
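// Java
// speechConfig and audioConfig are assumed to be created elsewhere
PronunciationAssessmentConfig pronunciationAssessmentConfig =
    new PronunciationAssessmentConfig("reference text",
        PronunciationAssessmentGradingSystem.HundredMark,
        PronunciationAssessmentGranularity.Phoneme);

SpeechRecognizer recognizer = new SpeechRecognizer(
    speechConfig,
    audioConfig);

// apply the pronunciation assessment configuration to the speech recognizer
pronunciationAssessmentConfig.applyTo(recognizer);
Future<SpeechRecognitionResult> future = recognizer.recognizeOnceAsync();
SpeechRecognitionResult result = future.get(30, TimeUnit.SECONDS);
PronunciationAssessmentResult pronunciationAssessmentResult =
    PronunciationAssessmentResult.fromResult(result);
Double pronunciationScore = pronunciationAssessmentResult.getPronunciationScore();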

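# Python
# speech_config and audio_config are assumed to be created elsewhere
pronunciation_assessment_config = \
        speechsdk.PronunciationAssessmentConfig(reference_text='reference text',
                grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
                granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, \
        audio_config=audio_config)

# apply the pronunciation assessment configuration to the speech recognizer
pronunciation_assessment_config.apply_to(speech_recognizer)
result = speech_recognizer.recognize_once()
pronunciation_assessment_result = speechsdk.PronunciationAssessmentResult(result)
pronunciation_score = pronunciation_assessment_result.pronunciation_score
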
// TypeScript
// speechConfig and audioConfig are assumed to be created elsewhere
var pronunciationAssessmentConfig = new SpeechSDK.PronunciationAssessmentConfig("reference text",
    SpeechSDK.PronunciationAssessmentGradingSystem.HundredMark,
    SpeechSDK.PronunciationAssessmentGranularity.Word, true);
var speechRecognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);
// apply the pronunciation assessment configuration to the speech recognizer
pronunciationAssessmentConfig.applyTo(speechRecognizer);

speechRecognizer.recognizeOnceAsync((result: SpeechSDK.SpeechRecognitionResult) => {
        var pronunciationAssessmentResult = SpeechSDK.PronunciationAssessmentResult.fromResult(result);
        var pronunciationScore = pronunciationAssessmentResult.pronunciationScore;
        var wordLevelResult = pronunciationAssessmentResult.detailResult.Words;
});

// Objective-C
// speechConfig and audioConfig are assumed to be created elsewhere
SPXPronunciationAssessmentConfiguration* pronunciationAssessmentConfig =
    [[SPXPronunciationAssessmentConfiguration alloc]init:@"reference text"
                                           gradingSystem:SPXPronunciationAssessmentGradingSystem_HundredMark
                                             granularity:SPXPronunciationAssessmentGranularity_Phoneme];

SPXSpeechRecognizer* speechRecognizer = \
        [[SPXSpeechRecognizer alloc] initWithSpeechConfiguration:speechConfig
                                              audioConfiguration:audioConfig];

// apply the pronunciation assessment configuration to the speech recognizer
[pronunciationAssessmentConfig applyToRecognizer:speechRecognizer];

SPXSpeechRecognitionResult *result = [speechRecognizer recognizeOnce];
SPXPronunciationAssessmentResult* pronunciationAssessmentResult = [[SPXPronunciationAssessmentResult alloc] init:result];
double pronunciationScore = pronunciationAssessmentResult.pronunciationScore;

Pronunciation assessment configuration parameters

This table lists the configuration parameters for pronunciation assessment.

Parameter | Description | Required?
ReferenceText | The text that the pronunciation will be evaluated against. | Required
GradingSystem | The point system for score calibration. The FivePoint system gives a 0-5 floating point score, and HundredMark gives a 0-100 floating point score. Default: FivePoint. | Optional
Granularity | The evaluation granularity. Accepted values are Phoneme (scores at the full-text, word, and phoneme levels), Word (scores at the full-text and word levels), and FullText (score at the full-text level only). Default: Phoneme. | Optional
EnableMiscue | Enables miscue calculation. When enabled, the pronounced words are compared to the reference text and marked as omissions or insertions based on the comparison. Accepted values are False and True. Default: False. | Optional
ScenarioId | A GUID indicating a customized point system. | Optional
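
The accepted values in the table above can be checked before a configuration is created. The following is a minimal sketch in plain Python, not part of the Speech SDK: build_assessment_parameters is a hypothetical helper, and the dictionary it returns simply mirrors the parameter names in the table rather than any format the SDK or service requires.

```python
# Accepted values, taken from the parameter table above.
GRADING_SYSTEMS = ("FivePoint", "HundredMark")
GRANULARITIES = ("Phoneme", "Word", "FullText")


def build_assessment_parameters(reference_text, grading_system="FivePoint",
                                granularity="Phoneme", enable_miscue=False,
                                scenario_id=None):
    """Hypothetical helper: validate the parameters and return them as a dict.

    Defaults follow the table (FivePoint, Phoneme, miscue disabled).
    """
    if not reference_text:
        raise ValueError("ReferenceText is required")
    if grading_system not in GRADING_SYSTEMS:
        raise ValueError(f"GradingSystem must be one of {GRADING_SYSTEMS}")
    if granularity not in GRANULARITIES:
        raise ValueError(f"Granularity must be one of {GRANULARITIES}")

    parameters = {
        "ReferenceText": reference_text,
        "GradingSystem": grading_system,
        "Granularity": granularity,
        "EnableMiscue": bool(enable_miscue),
    }
    # ScenarioId is optional, so it is included only when supplied.
    if scenario_id is not None:
        parameters["ScenarioId"] = scenario_id
    return parameters
```

Calling build_assessment_parameters("good morning", grading_system="HundredMark") returns a dictionary with the remaining parameters set to their defaults, while an unsupported value such as granularity="Sentence" raises ValueError.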

Pronunciation assessment result parameters

This table lists the result parameters of pronunciation assessment.

Parameter | Description
AccuracyScore | Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. Word-level and full-text-level accuracy scores are aggregated from the phoneme-level accuracy scores.
FluencyScore | Fluency of the given speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words.
CompletenessScore | Completeness of the speech, determined by calculating the ratio of pronounced words to the words in the reference text.
PronunciationScore | Overall score indicating the pronunciation quality of the given speech. This is a weighted aggregate of AccuracyScore, FluencyScore, and CompletenessScore.
ErrorType | Indicates whether a word is omitted, inserted, or badly pronounced compared to ReferenceText. Possible values are None (meaning no error on this word), Omission, Insertion, and Mispronunciation.
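
To make the Omission and Insertion error types and the completeness ratio more concrete, here is an illustrative sketch that aligns recognized words against the reference text with Python's standard difflib module. It is not the service's algorithm: label_miscues and completeness_score are hypothetical names, the sketch assumes both texts are whitespace-separated words without punctuation, and it cannot detect Mispronunciation because it only sees text.

```python
from difflib import SequenceMatcher


def label_miscues(reference_text, recognized_text):
    """Return (word, error_type) pairs using Omission/Insertion/None labels.

    Illustrative only: a simple word-level alignment, not the service's method.
    """
    reference = reference_text.lower().split()
    recognized = recognized_text.lower().split()
    matcher = SequenceMatcher(a=reference, b=recognized, autojunk=False)

    labels = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            labels += [(word, "None") for word in recognized[j1:j2]]
        # Reference words with no counterpart in the recognized text.
        if tag in ("delete", "replace"):
            labels += [(word, "Omission") for word in reference[i1:i2]]
        # Recognized words that do not appear in the reference text.
        if tag in ("insert", "replace"):
            labels += [(word, "Insertion") for word in recognized[j1:j2]]
    return labels


def completeness_score(reference_text, labels):
    """Share of reference words that were pronounced, on a 0-100 scale."""
    total = len(reference_text.split())
    if total == 0:
        raise ValueError("reference text must contain at least one word")
    omitted = sum(1 for _, error in labels if error == "Omission")
    return 100.0 * (total - omitted) / total
```

For the reference text "good morning everyone" and the recognized text "good morning", the last word is labeled Omission and two of the three reference words count as pronounced.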

Sample responses

A typical pronunciation assessment result in JSON:

{
  "RecognitionStatus": "Success",
  "Offset": "400000",
  "Duration": "11000000",
  "NBest": [
      {
        "Confidence" : "0.87",
        "Lexical" : "good morning",
        "ITN" : "good morning",
        "MaskedITN" : "good morning",
        "Display" : "Good morning.",
        "PronunciationAssessment":
        {
            "PronScore" : 84.4,
            "AccuracyScore" : 100.0,
            "FluencyScore" : 74.0,
            "CompletenessScore" : 100.0
        },
        "Words": [
            {
              "Word" : "Good",
              "Offset" : 500000,
              "Duration" : 2700000,
              "PronunciationAssessment":
              {
                "AccuracyScore" : 100.0,
                "ErrorType" : "None"
              }
            },
            {
              "Word" : "morning",
              "Offset" : 5300000,
              "Duration" : 900000,
              "PronunciationAssessment":
              {
                "AccuracyScore" : 100.0,
                "ErrorType" : "None"
              }
            }
        ]
      }
  ]
}

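A response like the sample in this section can be read with any JSON parser. The sketch below uses only Python's standard library and assumes the scores sit in a nested PronunciationAssessment object, both on the first NBest entry and on each word. The summarize function and the abbreviated SAMPLE_RESPONSE string are illustrative names, not part of the Speech SDK.

```python
import json

# Abbreviated copy of the sample response, with the assumed nesting.
SAMPLE_RESPONSE = """
{
  "RecognitionStatus": "Success",
  "NBest": [
    {
      "Display": "Good morning.",
      "PronunciationAssessment": {
        "PronScore": 84.4, "AccuracyScore": 100.0,
        "FluencyScore": 74.0, "CompletenessScore": 100.0
      },
      "Words": [
        {"Word": "Good",
         "PronunciationAssessment": {"AccuracyScore": 100.0, "ErrorType": "None"}},
        {"Word": "morning",
         "PronunciationAssessment": {"AccuracyScore": 100.0, "ErrorType": "None"}}
      ]
    }
  ]
}
"""


def summarize(response_json):
    """Return the overall scores and a (word, accuracy, error_type) list."""
    response = json.loads(response_json)
    best = response["NBest"][0]  # first candidate in the NBest list
    overall = best["PronunciationAssessment"]
    words = [
        (
            entry["Word"],
            entry["PronunciationAssessment"]["AccuracyScore"],
            entry["PronunciationAssessment"]["ErrorType"],
        )
        for entry in best["Words"]
    ]
    return overall, words
```

Calling summarize(SAMPLE_RESPONSE) returns the overall score dictionary together with one tuple per word, which is a convenient shape for showing word-level feedback to a learner.
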
Next steps

  • See the sample code on GitHub for pronunciation assessment.