Transparency note for Personally Identifiable Information (PII)


This article assumes that you're familiar with guidelines and best practices for the Text Analytics service. For more information, see Transparency note for Text Analytics.

The Text Analytics service supports named entity recognition to identify and categorize information in your text. The preview of NER v3.1 supports the personal (PII) categories of entities. A wide variety of personal entities such as names, organizations, addresses, phone numbers, financial account numbers or codes and government and country or region specific identification numbers can be recognized. A subset of these personal entities is protected health information (PHI). If you specify domain=phi in your request, you will only get the PHI entities returned. The full list of PII and PHI entity categories can be found in the table here.

Read example NER request and example response to see how to send text to the service and what to expect back.

Example use cases

Customers may want to recognize various categories of PII for several reasons:

  • Apply sensitivity labels - For example, based on the results from the PII service, a public sensitivity label might be applied to documents where no PII entities are detected. For documents where US addresses and phone numbers are recognized, a confidential label might be applied. A highly confidential label might be used for documents where bank routing numbers are recognized.
  • Redact some categories of personal information from documents that get wider circulation - For example, if customer contact records are accessible to first line support representatives, the company may want to redact the customer's personal information besides their name from the version of the customer history to preserve the customer's privacy.
  • Redact personal information in order to reduce unconscious bias - For example, during a company's resume review process, they may want to block name, address and phone number to help reduce unconscious gender or other biases.
  • Replace personal information in source data for machine learning to reduce unfairness – For example, if you want to remove names that might reveal gender when training a machine learning model, you could use the service to identify them and you could replace them with generic placeholders for model training.

Considerations when choosing a use case

  • Avoid high-risk automatic redaction or information classification scenarios – Any scenario where failures to redact personal information could expose people to the risk of identity theft and physical or psychological harms should include careful human oversight.
  • Avoid scenarios that use personal information for a purpose that consent was not obtained for - For example, a company has resumes from past job applicants. The applicants did not give their consent to be contacted for promotional events when they submitted their resumes. Based on this scenario, the PII service should not be used to identify contact information for the purpose of inviting the past applicants to a trade show.
  • Avoid scenarios that use the service to harvest personal information from publicly available content.
  • Avoid scenarios that replace personal information in text with the intent to mislead people.

Characteristics and limitations

Depending on your scenario, input data and the entities you wish to extract, you could experience different levels of performance. The following sections are designed to help you understand key concepts about performance as they apply to using the Text Analytics PII service.

Understand and measure performance

Since both false positive and false negative errors can occur, it is important to understand how both types of errors might affect your overall system. In redaction scenarios, for example, false negatives could lead to personal information leakage. For redaction scenarios, consider a process for human review to account for this type of error. For sensitivity label scenarios, both false positives and false negatives could lead to misclassification of documents. The audience may unnecessarily limited for documents labelled as confidential where a false positive occurred. PII could be leaked where a false negative occurred and a public label was applied.

You can adjust the threshold for confidence score your system uses to tune your system. If it is more important to identify all potential instances of PII, you can use a lower threshold. This means that you may get more false positives (non- PII data being recognized as PII entities), but fewer false negatives (PII entities not recognized as PII). If it is more important for your system to recognize only true PII data, you can use a higher threshold. Threshold values may not have consistent behavior across individual categories of PII entities. Therefore, it is critical that you test your system with real data it will process in production.

System limitations and best practices for enhancing performance

  • Make sure you understand all the entity categories that can be recognized by the system. Depending on your scenario, your data may include other information that could be considered personal but is not covered by the categories the service currently supports.

  • Context is important for all entity categories to be correctly recognized by the system, as it often is for humans to recognize an entity. For example, without context a ten-digit number is just a number, not a PII entity. However, given context like You can reach me at my office number 2345678901, both the system and a human can recognize the ten-digit number as a phone number. Always include context when sending text to the system to obtain the best possible performance.

  • Person names in particular require linguistic context. Send as much context as possible for better person name detection.

  • For conversational data, consider sending more than a single turn in the conversation to ensure higher likelihood that the required context is included with the actual entities.
    In the following conversation, if you send a single row at a time, the passport number will not have any context associated with it and the EU Passport Number PII category will not be recognized.

    Hi, how can I help you today?
    I want to renew my passport
    Sure, what is your current passport number?
    Its 123456789, thanks.

    However, if you send the whole conversation it will be recognized because the context is included.

  • Sometimes multiple entity categories can be recognized for the same entity. If we take the previous example:

    Hi, how can I help you today?
    I want to renew my passport
    Sure, what is your current passport number?
    Its 123456789, thanks.

    Several different countries have the same format for passport numbers, so several different specific entity categories may be recognized. In some cases, using the highest confidence score may not be sufficient to choose the right entity class. If your scenario depends on the specific entity category being recognized, you may need to disambiguate the result elsewhere in your system either through a human review or additional validation code. Thorough testing on real life data can help you identify if you're likely to see multiple entity categories for recognized for your scenario.

    Although many international entities are supported, currently the service only supports English text. Consider verifying the language the input text is in if you're not sure it will be all in English.

  • The PII service only takes text as an input. If you are redacting information from documents in other formats, make sure to carefully test your redaction code to ensure identified entities are not accidentally leaked.

See also