Transparency note for Sentiment Analysis

What is a transparency note?


This article assumes that you're familiar with guidelines and best practices for Azure Cognitive Service for language. For more information, see Transparency note for Azure Cognitive Service for language.

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.

Microsoft's Transparency notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see Responsible AI principles from Microsoft.

Introduction to Sentiment Analysis

Azure Cognitive Service for language's Sentiment Analysis feature evaluates text and returns sentiment scores and labels for each sentence. This is useful for detecting positive, neutral and negative sentiment in social media, customer reviews, discussion forums and other product and service scenarios.

Sentiment Analysis returns an overall sentiment label (positive, negative, neutral, mixed) and confidence scores for positive, negative and neutral for an entire document. An overall sentiment label (positive, negative and neutral) and confidence scores for positive, negative and neutral are also returned for each sentence within the document. Scores closer to 1 indicate a higher confidence in the label's classification, while lower scores indicate lower confidence. By default, the overall sentiment label is the greatest of the three confidence scores, though you can alternatively define a threshold for any or all of the individual sentiment confidence scores that works best for your scenario. For each document or each sentence, the predicted scores associated with the labels (positive, negative and neutral) add up to 1. Read more details about sentiment labels and scores.

In addition, the optional opinion mining feature returns aspects (such as the attributes of products or services) and their associated opinion words. For each aspect an overall sentiment label is returned along with confidence scores for positive and negative sentiment. For example, the sentence The restaurant had great food and our waiter was friendly has two aspects food and waiter, and their corresponding opinion words are great and friendly. The two aspects therefore receive sentiment classification positive, with confidence scores between 0 and 1.0. Read more details about opinion mining.

See the JSON response for this example.

Example use cases

Sentiment Analysis can be used in multiple scenarios across a variety of industries that this type of systems. Some examples include:

  • Monitor for positive and negative feedback trends in aggregate. After the introduction of a new product, a retailer could use the sentiment service to monitor multiple social media outlets for mentions of the product with their sentiment. They could review the trending sentiment in their weekly product meetings.
  • Run sentiment analysis on raw text results of surveys to gain insights for analysis and follow-up with participants (customers, employees, consumers, etc.). A store has an internal policy with following up with customers' negative reviews within 24 hours and thanking all positive reviews within a week. They use the sentiment service to categorize reviews for easy and timely follow up.
  • Help customer service staff improve the customer engagement through insights captured from real-time analysis of interactions. Extract insights from transcribed customer services calls to better understand customer-agent interactions and trends to improve customer engagements.

Considerations when choosing a use case

  • Avoid automatic actions without human intervention for high impact scenarios. For example, don't automatically give bonuses to employees based on sentiment scores from the service for their customer service interaction text. A person should always review source data when another person's economic situation, health or safety is affected.
  • Carefully consider scenarios outside of the product and service review domain. Since the model is trained on product and service reviews, the system may not accurately recognize sentiment focused language in other domains. Always make sure to test the system on operational test datasets to ensure you get the performance you need. Your operational test dataset should reflect the real data your system will see in production with all the characteristics and variation you will have when your product is deployed. Synthetic data and tests that don't reflect your end-to-end scenario won't be sufficient.
  • Carefully consider scenarios that take automatic action to filter or remove content. You can add a human review cycle and/or re-rank content (rather than filtering it completely) if your goal is to ensure content meets your community standards.

Characteristics and limitations

Depending on your scenario and input data, you could experience different levels of performance. The following information is designed to help you understand key concepts about performance as they apply to using Sentiment Analysis.

System limitations and best practices for enhancing performance

Because sentiment is somewhat subjective it is not possible to provide a universally applicable estimate of performance for the model. Ultimately, performance depends on a number of factors such as the subject domain, the characteristics of the text processed, the use case for the system, and how people interpret the system's output.

Key limitations and characteristics to bear in mind:

  • The machine learning model that is used to predict sentiment was trained on product and service reviews. That means the service will perform most accurately for similar scenarios and less accurately for scenarios outside product and service reviews. For example, personnel reviews may use different language to describe sentiment and thus, you might not get the results or performance you would expect for that scenario. Words like "strong" in phrases like "Shafali was a strong leader" may not have positive sentiment because the word strong may not have a clear positive sentiment in product and service reviews.

  • Since the model is trained on product and service reviews, certain dialects and language varieties less represented in data may have lower accuracy.

  • The model has no understanding about the relative importance of various sentences that are sent together. Since the overall sentiment is a simple aggregate score of the sentences, it overall sentiment score may not agree with a human's interpretation which would take into account the fact that some sentences may have more importance in determining the overall sentiment.

  • The model may have problems recognizing sarcasm. Context, like tone of voice, facial expression, the author of the text, the audience for the text, or prior conversation can be important to understand the sentiment in some cases. Often with sarcasm, additional context is needed to recognize if a text input is positive or negative. Given that the service only sees the text input, classifying sarcastic sentiment can be less accurate. For example, that was awesome, could be either positive or negative depending on the context, tone of voice, facial expression, author and the audience.

  • You may find confidence scores for positive, negative, and neutral sentiments differ according to your scenario. Instead of using the overall sentence level sentiment for the full document or sentence, you can define a threshold for any or all of the individual sentiment confidence scores that works best for your scenario. For example if it is more important to identify all potential instances of negative sentiment, you can use a lower threshold on the negative sentiment instead of looking at the overall sentiment label. This means that you may get more false positives (neutral or positive text being recognized as negative sentiment), but fewer false negatives (negative text not recognized as negative sentiment). For example, you might want to read all product feedback that has some potential negative sentiment for ideas for product improvement. In that case, you could use the negative sentiment score only and set a lower threshold. This may lead to extra work because you'd end up reading some reviews that aren't negative, but you're more likely to identify opportunities for improvement. If it is more important for your system to recognize only true negative text, you can use a higher threshold or use the overall sentiment label. For example, you may want to respond to product reviews that are negative. If you want to minimize the work to read and respond to negative reviews, you could only use the overall sentiment prediction and ignore the individual sentiment scores. While there may be some negative sentiment predicted that you miss, you're likely to get most of the truly negative reviews. Threshold values may not have consistent behavior across scenarios. Therefore, it is critical that you test your system with real data it will process in production.

  • The confidence score magnitude does not reflect the intensity of the sentiment. It is based on the confidence of the model for a particular sentiment (positive, neutral, negative). Therefore, if your system depends on the intensity of the sentiment, consider using a human reviewer or post processing logic on the individual opinion scores or the original text to help rank the intensity of the sentiment.

See also