Transparency Note for Azure Cognitive Service for Language

What is a Transparency Note?

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.

Microsoft's Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see Responsible AI principles from Microsoft.

General principles

This Transparency Note discusses Azure Cognitive Service for Language and the key considerations for making use of this technology responsibly. There are a number of things you need to consider when deciding how to use and implement AI-powered products and features:

  • Will this product or feature perform well in my scenario? Before deploying AI into your scenario, test how it performs using real-life data and make sure it can deliver the accuracy you need.
  • Are we equipped to identify and respond to errors? AI-powered products and features will not be 100% accurate, so consider how you will identify and respond to any errors that may occur.

Introduction to Azure Cognitive Service for Language

Azure Cognitive Service for language is a cloud-based service that provides Natural Language Processing (NLP) features for text mining and text analysis, including:

Read the overview to get an introduction to each feature and review the example use cases. See the How-to guides and the API reference to understand more details about what each feature does and what gets returned by the system.

This article contains basic guidelines for how to use Azure Cognitive Service for Language features responsibly, and specific guidance for a few features. Read the general information first and then jump to the specific article if you're using one of the features below.

General guidelines to understand and improve performance

Understand confidence scores

The sentiment, named entity recognition, language detection, and health functions all return a confidence score as part of the system response. The score indicates how confident the service is in that response: a higher value means the service is more confident that the result is accurate. For example, given the text "My NY driver's license number is 555 555 555", the system recognizes 555 555 555 as an entity of category U.S. Driver's License Number with a score of 0.75. Given the text "My NY DL number is 555 555 555", it might recognize the same entity and category with a score of 0.65. Because the first example provides more specific context, the system is more confident in its response. In many cases, the system response can be used without examining the confidence score. In other cases, you can choose to use a response only if its confidence is above a specified confidence score threshold.
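As a minimal sketch of the last point, the following Python snippet keeps only results whose confidence score meets a chosen threshold. The Entity class, the filter_by_confidence function, and the 0.70 threshold are hypothetical names and values chosen for illustration; they are not part of the service's API, and the sample scores simply mirror the example above.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical container for one recognized entity."""
    text: str
    category: str
    confidence_score: float


def filter_by_confidence(entities, threshold):
    """Return only the entities whose score is at or above the threshold."""
    return [e for e in entities if e.confidence_score >= threshold]


# Sample results mirroring the two driver's license sentences above.
results = [
    Entity("555 555 555", "U.S. Driver's License Number", 0.75),
    Entity("555 555 555", "U.S. Driver's License Number", 0.65),
]

# 0.70 is an arbitrary illustrative threshold, not a recommended value.
accepted = filter_by_confidence(results, threshold=0.70)
for entity in accepted:
    print(entity.category, entity.confidence_score)
```

With this threshold, only the result from the more specific sentence is kept. Whether discarding the lower-scored result is acceptable depends on your scenario.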

Understand and measure performance

The performance of Azure Cognitive Service for Language features is measured by examining how well the system recognizes the supported NLP concepts at a given threshold value, in comparison with a human judge. For named entity recognition (NER), for example, one might count the true number of phone number entities in some text based on human judgment, and then compare that count with the output of the system from processing the same text. Comparing the human judgment with the entities the system recognized allows you to classify the events into two kinds of correct (or "true") events and two kinds of incorrect (or "false") events.

Outcome | Correct/Incorrect | Definition | Example
True Positive | Correct | The system returns the same result that would be expected from a human judge. | The system correctly recognizes a PII entity of category Phone Number on the text 1-234-567-8910 when given the text: "You can reach me at my office number 1-234-567-8910".
True Negative | Correct | The system does not return a result, and this aligns with what would be expected from a human judge. | The system does not recognize any PII entity when given the text: "You can reach me at my office number".
False Positive | Incorrect | The system returns a result where a human judge would not. | The system incorrectly recognizes a PII entity of category Phone Number on the text "office number" when given the text: "You can reach me at my office number".
False Negative | Incorrect | The system does not return a result when a human judge would. | The system incorrectly misses a Phone Number PII entity on the text 1-234-567-8910 when given the text: "You can reach me at my office number 1-234-567-8910".
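The four outcomes above can be tallied with a few lines of Python. This is a sketch under stated assumptions: each test document is reduced to a pair of booleans (whether the human judge found a phone number, and whether the system did), and the function names and sample data are hypothetical. Precision and recall are standard summary measures added here for illustration; the text above does not prescribe them.

```python
from collections import Counter


def classify_outcome(human_found: bool, system_found: bool) -> str:
    """Map one (human, system) judgment pair to one of the four outcomes."""
    if human_found and system_found:
        return "true_positive"
    if not human_found and not system_found:
        return "true_negative"
    if system_found:
        return "false_positive"
    return "false_negative"


def summarize(pairs):
    """Count outcomes and derive precision and recall from them."""
    counts = Counter(classify_outcome(h, s) for h, s in pairs)
    tp = counts["true_positive"]
    fp = counts["false_positive"]
    fn = counts["false_negative"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return counts, precision, recall


# Hypothetical (human_found, system_found) pairs for five test documents.
pairs = [(True, True), (False, False), (False, True), (True, False), (True, True)]
counts, precision, recall = summarize(pairs)
print(dict(counts))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reducing each document to a single yes/no judgment keeps the sketch short. A real evaluation of entity recognition would usually compare individual entity spans and categories.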

Azure Cognitive Service for Language features will not always be correct. You'll likely experience both false negative and false positive errors. It's important to consider how each type of error will affect your system. Carefully think through scenarios where true events won't be recognized and where incorrect events will be recognized, and what the downstream effects will be in your implementation. Make sure to build in ways to identify, report, and respond to each type of error. Plan to periodically review the performance of your deployed system to ensure errors are being handled appropriately.

How to set confidence score thresholds

You can choose to make decisions in your system based on the confidence score the system returns. You can adjust the confidence score threshold your system uses to meet your needs. If it is more important to identify all potential instances of the NLP concepts you want, you can use a lower threshold. This means that you may get more false positives but fewer false negatives. If it is more important for your system to recognize only true instances of the feature you're calling, you can use a higher threshold. If you use a higher threshold, you may get fewer false positives but more false negatives. Different scenarios call for different approaches. In addition, threshold values may not behave consistently across individual features of Azure Cognitive Service for Language or across categories of entities. For example, do not assume that a threshold that works for the NER category Phone Number would be sufficient for another NER category, or that a threshold you use in NER would work similarly for Sentiment Analysis. Therefore, it is critical that you test each threshold you are considering against real data of the kind your system will process in production, to determine the effects of the various threshold values.
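The trade-off described above can be made concrete by sweeping a few thresholds over labeled results. The sketch below assumes you have a list of (confidence score, human judgment) pairs for one feature and one entity category. The function name, the sample scores, and the candidate thresholds are all hypothetical.

```python
def errors_at_threshold(scored, threshold):
    """Count false positives and false negatives at one threshold.

    scored is a list of (confidence_score, is_true_instance) pairs, where
    is_true_instance is the human judgment for that result.
    """
    false_positives = sum(
        1 for score, is_true in scored if score >= threshold and not is_true
    )
    false_negatives = sum(
        1 for score, is_true in scored if score < threshold and is_true
    )
    return false_positives, false_negatives


# Hypothetical labeled results for a single entity category.
scored = [
    (0.95, True), (0.80, True), (0.72, False),
    (0.65, True), (0.55, False), (0.40, True),
]

for threshold in (0.5, 0.7, 0.9):
    fp, fn = errors_at_threshold(scored, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this sample, raising the threshold reduces false positives and increases false negatives. Because thresholds may not transfer between features or categories, a sweep like this would need to be repeated for each one, using your own data.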

The quality of the incoming text to the system will affect your results

Azure Cognitive Service for Language features only process text. The fidelity and formatting of the incoming text will affect the performance of the system. Make sure you consider the following:

  • Speech transcription quality may affect the quality of the results. If your source data is voice, make sure you use the highest quality combination of automatic and human transcription to ensure the best performance. Consider using custom speech models for better quality results.
  • Lack of standard punctuation or casing may affect the quality of your results. If you are using a speech system, like Cognitive Services Speech to Text, be sure to select the option to include punctuation.
  • Optical character recognition (OCR) quality may affect the quality of the system. If your source data is images and you use OCR technology to generate the text, incorrect text generated may affect the performance of the system. Consider using custom OCR models to help improve the quality of results.
  • If your data includes frequent misspellings, consider using Bing Spell Check to correct misspellings.
  • Tabular data may not be identified correctly depending on how you send the table text to the system. Examine how you send text from tables in source documents to the service. For tables in documents, consider using Microsoft Form Recognizer or a similar service to extract the appropriate keys and values, and then send them to Azure Cognitive Service for Language so that each contextual key is close enough to its value for the system to properly recognize the entities.
  • Microsoft trained its Azure Cognitive Service for Language feature models (with the exception of language detection) using natural language text data that is composed primarily of fully formed sentences and paragraphs. Therefore, using this service for data that most closely resembles this type of text will yield the best performance. We recommend avoiding use of this service to evaluate incomplete sentences and phrases where possible, as the performance here may be reduced.
  • The service only supports single language text. If your text includes multiple languages like "the sandwich was Bueno", the output may not be accurate.
  • The language code must match the input text language to get accurate results. If you are not sure about the input language you can use the language detection feature.
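To illustrate the point about tabular data in the list above, the following sketch turns extracted key-value pairs into text in which each key sits directly next to its value. It assumes the keys and values have already been extracted by a separate service. The rows_to_text function, the "key: value" layout, and the sample row are hypothetical choices for illustration, not a format required by the service.

```python
def rows_to_text(rows):
    """Render each non-empty table cell as a 'key: value' line.

    rows is a list of dicts mapping a column header (the contextual key)
    to the cell value for one table row.
    """
    lines = []
    for row in rows:
        for key, value in row.items():
            if value:  # skip empty cells so a key never appears without a value
                lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Hypothetical row extracted from a table in a source document.
rows = [{"Name": "Contoso Ltd.", "Office phone": "1-234-567-8910", "Fax": ""}]
text = rows_to_text(rows)
print(text)
```

Because the models are trained mostly on fully formed sentences, it is worth testing whether a layout like this or a sentence-style rendering performs better on your data.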

Fairness

At Microsoft, we strive to empower every person on the planet to do more. An essential part of this goal is working to create technologies and products that are fair and inclusive. Fairness is a multi-dimensional, sociotechnical topic and impacts many different aspects of our product development. You can learn more about Microsoft’s approach to fairness here.

One dimension we need to consider is how well the system performs for different groups of people. This may include looking at the accuracy of the model as well as measuring the performance of the complete system. Research has shown that without conscious effort focused on improving performance for all groups, it is often possible for the performance of an AI system to vary across groups based on factors such as race, ethnicity, language, gender, and age.

Each service and feature is different, and our testing may not perfectly match your context or cover all scenarios required for your use case. We encourage developers to thoroughly evaluate error rates for the service with real-world data that reflects your use case, including testing with users from different demographic groups.
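One way to start such an evaluation is to compute the error rate separately for each group in your test data. The sketch below assumes each test record carries a group label and a flag saying whether the system's result matched the human judgment. The function name, group labels, and sample records are hypothetical.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Return the fraction of incorrect results for each group.

    records is a list of (group_label, is_correct) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        if not is_correct:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation records; real data should reflect your use case.
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]

rates = error_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: error rate {rate:.2f}")
```

A gap between groups in a small sample like this one is not conclusive on its own. It is a prompt to collect more representative data and look more closely.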

For Azure Cognitive Service for Language, certain dialects and language varieties within our supported languages, and text from some demographic groups, may not yet have enough representation in our current training datasets. We encourage you to review our responsible use guidelines, and if you encounter performance differences, we encourage you to let us know.

Performance varies across features and languages

Various languages are supported for each Azure Cognitive Service for Language feature. You may find that performance for one feature is not consistent with performance for another. You may also find that, for a particular feature, performance is not consistent across languages.

Next steps

If you are using any of the features below, be sure to review the specific information for that feature.

See also

Also, make sure to review: