Example: Detect language with Text Analytics

The Language Detection feature of the Azure Text Analytics REST API evaluates text input for each document and returns language identifiers with a score that indicates the strength of the analysis.

This capability is useful for content stores that collect arbitrary text, where language is unknown. You can parse the results of this analysis to determine which language is used in the input document. The response also returns a score that reflects the confidence of the model. The score value is between 0 and 1.

The Language Detection feature can detect a wide range of languages, variants, dialects, and some regional or cultural languages. The exact list of languages for this feature isn't published.

If you have content expressed in a less frequently used language, you can try the Language Detection feature to see if it returns a code. The response for languages that can't be detected is (Unknown).

Tip

Text Analytics also provides a Linux-based Docker container image for language detection, so you can install and run the Text Analytics container close to your data.

Preparation

You must have JSON documents in this format: each document consists of an ID and its text.

Each document must be under 5,120 characters, and you can have up to 1,000 items (IDs) per collection. The collection is submitted in the body of the request. The following sample is an example of content you might submit for language detection:

 {
     "documents": [
         {
             "id": "1",
             "text": "This document is in English."
         },
         {
             "id": "2",
             "text": "Este documento está en inglés."
         },
         {
             "id": "3",
             "text": "Ce document est en anglais."
         },
         {
             "id": "4",
             "text": "本文件为英文"
         },                
         {
             "id": "5",
             "text": "Этот документ на английском языке."
         }
     ]
 }
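The size limits above can be checked before you submit a collection. The following Python sketch validates a payload against the limits stated in this article; the function and constant names are illustrative, not part of the API:

```python
# Sketch: validate a documents collection against the limits stated above.
MAX_CHARS = 5120   # maximum characters per document
MAX_DOCS = 1000    # maximum items (IDs) per collection

def validate_collection(collection):
    """Return a list of problems found; an empty list means the payload is OK."""
    problems = []
    docs = collection.get("documents", [])
    if len(docs) > MAX_DOCS:
        problems.append(f"too many documents: {len(docs)} > {MAX_DOCS}")
    for doc in docs:
        if len(doc.get("text", "")) >= MAX_CHARS:
            problems.append(f"document {doc.get('id')} is not under {MAX_CHARS} characters")
    return problems

payload = {"documents": [{"id": "1", "text": "This document is in English."}]}
print(validate_collection(payload))  # → []
```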

Step 1: Structure the request

For more information on request definition, see Call the Text Analytics API. The following points are restated for convenience:

  • Create a POST request. To review the API documentation for this request, see the Language Detection API.

  • Set the HTTP endpoint for language detection. Use either a Text Analytics resource on Azure or an instantiated Text Analytics container. It must include the /languages resource: https://westus.api.cognitive.microsoft.com/text/analytics/v2.1/languages.

  • Set a request header to include the access key for Text Analytics operations. For more information, see Find endpoints and access keys.

  • In the request body, provide the JSON documents collection you prepared for this analysis.
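Taken together, the points above can be sketched in Python. The endpoint and key below are placeholders; the Ocp-Apim-Subscription-Key header is the standard header for passing a Cognitive Services access key, but confirm the details against Call the Text Analytics API:

```python
import json

def build_language_detection_request(endpoint, access_key, documents):
    """Assemble the URL, headers, and body for a POST to the /languages resource."""
    url = endpoint.rstrip("/") + "/text/analytics/v2.1/languages"
    headers = {
        # The access key for Text Analytics operations goes in this header.
        "Ocp-Apim-Subscription-Key": access_key,
        "Content-Type": "application/json",
    }
    body = json.dumps({"documents": documents})
    return url, headers, body

url, headers, body = build_language_detection_request(
    "https://westus.api.cognitive.microsoft.com",   # placeholder resource endpoint
    "<your-access-key>",                            # placeholder access key
    [{"id": "1", "text": "This document is in English."}],
)
print(url)  # → https://westus.api.cognitive.microsoft.com/text/analytics/v2.1/languages
```

You can then POST the request with any HTTP client, for example `requests.post(url, headers=headers, data=body)`.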

Tip

Use Postman or open the API testing console in the documentation to structure a request and POST it to the service.

Step 2: POST the request

Analysis is performed upon receipt of the request. For information on the size and number of requests you can send per minute and second, see the data limits section in the overview.

Recall that the service is stateless. No data is stored in your account. Results are returned immediately in the response.

Step 3: View the results

All POST requests return a JSON-formatted response with the IDs and detected properties.

Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system. Then, import the output into an application that you can use to sort, search, and manipulate the data.
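As a sketch of that workflow, the following Python flattens a language detection response into rows you could export to a sorting or search tool; the function name is illustrative:

```python
def flatten_results(response):
    """Turn a language detection response into (id, name, ISO code, score) rows."""
    rows = []
    for doc in response.get("documents", []):
        for lang in doc.get("detectedLanguages", []):
            rows.append((doc["id"], lang["name"], lang["iso6391Name"], lang["score"]))
    return rows

response = {
    "documents": [
        {"id": "1", "detectedLanguages": [
            {"name": "English", "iso6391Name": "en", "score": 1}]},
    ],
    "errors": [],
}
print(flatten_results(response))  # → [('1', 'English', 'en', 1)]
```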

Results for the example request should look like the following JSON. Notice that the response is a single JSON document with a result for each input ID. Language names in the output are in English. Language identifiers include a friendly name and a language code in ISO 639-1 format.

A positive score of 1.0 expresses the highest possible confidence level of the analysis.

{
    "documents": [
        {
            "id": "1",
            "detectedLanguages": [
                {
                    "name": "English",
                    "iso6391Name": "en",
                    "score": 1
                }
            ]
        },
        {
            "id": "2",
            "detectedLanguages": [
                {
                    "name": "Spanish",
                    "iso6391Name": "es",
                    "score": 1
                }
            ]
        },
        {
            "id": "3",
            "detectedLanguages": [
                {
                    "name": "French",
                    "iso6391Name": "fr",
                    "score": 1
                }
            ]
        },
        {
            "id": "4",
            "detectedLanguages": [
                {
                    "name": "Chinese_Simplified",
                    "iso6391Name": "zh_chs",
                    "score": 1
                }
            ]
        },
        {
            "id": "5",
            "detectedLanguages": [
                {
                    "name": "Russian",
                    "iso6391Name": "ru",
                    "score": 1
                }
            ]
        }
    ],
    "errors": []
}

Ambiguous content

If the analyzer can't parse the input, it returns (Unknown). For example, this happens if you submit a text block that consists solely of Arabic numerals.

    {
      "id": "5",
      "detectedLanguages": [
        {
          "name": "(Unknown)",
          "iso6391Name": "(Unknown)",
          "score": "NaN"
        }
      ]

Mixed-language content

Mixed-language content within the same document returns the language with the largest representation in the content, but with a lower positive rating. The rating reflects the marginal strength of the assessment. In the following example, input is a blend of English, Spanish, and French. The analyzer counts characters in each segment to determine the predominant language.

Input

{
  "documents": [
    {
      "id": "1",
      "text": "Hello, I would like to take a class at your University. ¿Se ofrecen clases en español? Es mi primera lengua y más fácil para escribir. Que diriez-vous des cours en français?"
    }
  ]
}

Output

The resulting output consists of the predominant language, with a score of less than 1.0, which indicates a weaker level of confidence.

{
  "documents": [
    {
      "id": "1",
      "detectedLanguages": [
        {
          "name": "Spanish",
          "iso6391Name": "es",
          "score": 0.9375
        }
      ]
    }
  ],
  "errors": []
}
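A score below 1.0 is a signal that a document may mix languages. As a sketch, you could flag such documents for review; the 0.99 threshold here is an arbitrary assumption for illustration, not a service recommendation:

```python
def low_confidence_ids(response, threshold=0.99):
    """Return IDs whose detected language scored below the threshold."""
    flagged = []
    for doc in response.get("documents", []):
        for lang in doc.get("detectedLanguages", []):
            score = lang.get("score")
            # Skip (Unknown) entries, whose score is the string "NaN".
            if isinstance(score, (int, float)) and score < threshold:
                flagged.append(doc["id"])
    return flagged

response = {"documents": [
    {"id": "1", "detectedLanguages": [
        {"name": "Spanish", "iso6391Name": "es", "score": 0.9375}]},
], "errors": []}
print(low_confidence_ids(response))  # → ['1']
```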

Summary

In this article, you learned concepts and workflow for language detection by using Text Analytics in Azure Cognitive Services. The following points were explained and demonstrated:

  • Language detection is available for a wide range of languages, variants, dialects, and some regional or cultural languages.
  • JSON documents in the request body include an ID and text.
  • The POST request is to a /languages endpoint by using a personalized access key and an endpoint that's valid for your subscription.
  • Response output consists of language identifiers for each document ID. The output can be streamed to any app that accepts JSON, such as Excel or Power BI.

See also

Text Analytics overview
Frequently asked questions (FAQ)
Text Analytics product page

Next steps