Automatically identify the spoken language with the language identification model

Azure Video Analyzer for Media (formerly Video Indexer) supports automatic language identification (LID): it identifies the spoken language from the audio and sends the media file to be transcribed in the dominant identified language.

Currently LID supports: English, Spanish, French, German, Italian, Mandarin Chinese, Japanese, Russian, and Portuguese (Brazilian).

Make sure to review the Guidelines and limitations section below.

Choosing auto language identification on indexing

When indexing or re-indexing a video using the API, choose the auto detect option in the sourceLanguage parameter.
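As a rough illustration of the API route, the sketch below only builds a re-index request URL and sends nothing. The base URL, the path layout, the accessToken query parameter, and the literal value "auto" for the auto detect option are assumptions not stated in this article; check them against the API reference before use. The helper name build_reindex_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL and path layout; verify against the API reference.
API_BASE = "https://api.videoindexer.ai"


def build_reindex_url(location, account_id, video_id, access_token,
                      source_language="auto"):
    """Build a re-index URL that asks for automatic language identification."""
    query = urlencode({
        # "auto" is assumed to be the value that selects the auto detect option.
        "sourceLanguage": source_language,
        "accessToken": access_token,
    })
    return (f"{API_BASE}/{location}/Accounts/{account_id}"
            f"/Videos/{video_id}/ReIndex?{query}")


# Placeholder identifiers, for illustration only.
url = build_reindex_url("trial", "my-account-id", "my-video-id", "my-token")
print(url)
```

The resulting URL would then be sent with an HTTP client of your choice.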

When using the portal, go to your Account videos on the Video Analyzer for Media home page and hover over the name of the video that you want to re-index. In the bottom-right corner, click the re-index button. In the Re-index video dialog, choose Auto detect from the Video source language drop-down box.


Model output

Video Analyzer for Media transcribes the video according to the most likely language if the confidence for that language is > 0.6. If the language cannot be identified with confidence, it assumes the spoken language is English.
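The selection rule described above can be sketched as a small function. This mirrors the documented behavior and is not the service's actual implementation; the function name is hypothetical and "en-US" is an assumed code for the English fallback.

```python
CONFIDENCE_THRESHOLD = 0.6    # threshold stated in the text above
FALLBACK_LANGUAGE = "en-US"   # assumed language code for English


def choose_transcription_language(candidate, confidence):
    """Return the language used for transcription under the documented rule."""
    # The candidate is used only when its confidence is strictly above 0.6.
    if confidence > CONFIDENCE_THRESHOLD:
        return candidate
    return FALLBACK_LANGUAGE


print(choose_transcription_language("fr-FR", 0.8563))
print(choose_transcription_language("fr-FR", 0.41))
```

With these inputs, the first call keeps the identified language and the second falls back to English.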

The dominant language identified by the model is available in the insights JSON as the sourceLanguage attribute (under root/videos/insights). The corresponding confidence score is available in the sourceLanguageConfidence attribute.

"insights": {
        "version": "",
        "duration": "0:05:30.902",
        "sourceLanguage": "fr-FR",
        "language": "fr-FR",
        "transcript": [...],
        . . .
        "sourceLanguageConfidence": 0.8563

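A minimal sketch of reading these two attributes from an index document that has already been downloaded. It assumes that "videos" is a list and that the first entry is the one of interest; the function name and the sample document are illustrative only.

```python
import json


def read_source_language(index):
    """Return (language, confidence) from the first video's insights."""
    # Assumption: root/videos is a list and the first item is the relevant one.
    insights = index["videos"][0]["insights"]
    return (insights.get("sourceLanguage"),
            insights.get("sourceLanguageConfidence"))


# Trimmed sample that reuses the values shown in the excerpt above.
sample = json.loads("""
{"videos": [{"insights": {
    "sourceLanguage": "fr-FR",
    "language": "fr-FR",
    "sourceLanguageConfidence": 0.8563
}}]}
""")

language, confidence = read_source_language(sample)
print(language, confidence)
```

A missing attribute comes back as None, so callers can tell "not reported" apart from a low confidence score.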
Guidelines and limitations

  • Automatic language identification (LID) supports the following languages:

    English, Spanish, French, German, Italian, Mandarin Chinese, Japanese, Russian, and Portuguese (Brazilian).

  • Even though Video Analyzer for Media supports Arabic (Modern Standard and Levantine), Hindi, and Korean, these languages are not supported in LID.

  • If the audio contains languages other than those in the supported list above, the results are unpredictable.

  • If Video Analyzer for Media cannot identify the language with a high enough confidence (>0.6), the fallback language is English.

  • Files with mixed-language audio are not currently supported. If the audio contains more than one language, the results are unpredictable.

  • Low-quality audio may impact the model results.

  • The model requires at least one minute of speech in the audio.

  • The model is designed to recognize spontaneous conversational speech (not voice commands, singing, etc.).
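
Some of the guidelines above can be checked on the client before auto detect is chosen. The following is a hypothetical pre-check, not part of the service: the function name and its inputs (the languages you expect in the audio and an estimate of speech duration) are assumptions for illustration.

```python
# Languages that LID supports, as listed above.
LID_LANGUAGES = {
    "English", "Spanish", "French", "German", "Italian",
    "Mandarin Chinese", "Japanese", "Russian", "Portuguese (Brazilian)",
}
MIN_SPEECH_SECONDS = 60  # "at least one minute of speech"


def lid_warnings(expected_languages, speech_seconds):
    """Return a list of reasons why auto detect may give poor results."""
    warnings = []
    languages = set(expected_languages)
    unsupported = sorted(languages - LID_LANGUAGES)
    if unsupported:
        warnings.append("not supported by LID: " + ", ".join(unsupported))
    if len(languages) > 1:
        warnings.append("mixed-language audio is not supported")
    if speech_seconds < MIN_SPEECH_SECONDS:
        warnings.append("less than one minute of speech")
    return warnings


print(lid_warnings(["French"], 120))
print(lid_warnings(["Korean", "English"], 30))
```

An empty list means none of these checks raised a concern. Audio quality and speaking style are not covered by this sketch.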

Next steps