Transparency note for extractive summarization
What is a transparency note?
An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.
Microsoft's Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see Responsible AI principles from Microsoft.
Introduction to extractive summarization
Extractive summarization is a feature in Azure Cognitive Service for language that produces a summary by extracting sentences that collectively represent the most important or relevant information within the original content.
This feature is designed to help with content that users consider too long to read. Extractive summarization condenses articles, papers, or documents to key sentences.
The basics of extractive summarization
The extractive summarization feature in Azure Cognitive Service for language uses natural language processing techniques to locate key sentences in an unstructured text document. These sentences collectively convey the main idea of the document. This feature is provided as an API for developers. They can use it to build intelligent solutions based on the relevant information extracted to support various use cases.
Unlike some other Azure Cognitive Service for language features, extractive summarization returns a rank score as part of the system response, along with the extracted sentences and their positions in the original document. A rank score indicates how relevant the model considers a sentence to be to the main idea of the document. The model gives each sentence a score between 0 and 1 and returns the highest scored sentences per request. If you request a three-sentence summary, the service returns the three highest scored sentences.
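The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the `ScoredSentence` class, its field names, the `top_sentences` helper, and the scores are all hypothetical and do not reflect the actual API response schema.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class ScoredSentence:
    """Hypothetical container; field names are illustrative, not the API schema."""
    text: str
    offset: int        # character position in the original document
    rank_score: float  # relevance score between 0 and 1 assigned by the model


def top_sentences(sentences, count):
    """Return the `count` highest-scored sentences, highest score first."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return heapq.nlargest(count, sentences, key=lambda s: s.rank_score)


# Made-up scores for a four-sentence document.
document = [
    ScoredSentence("Intro sentence.", 0, 0.41),
    ScoredSentence("Main finding.", 16, 0.97),
    ScoredSentence("Supporting detail.", 30, 0.78),
    ScoredSentence("Closing remark.", 49, 0.12),
]

summary = top_sentences(document, 3)
print([s.text for s in summary])
# prints ['Main finding.', 'Supporting detail.', 'Intro sentence.']
```

Requesting a three-sentence summary of this made-up document drops only the lowest-scored sentence; the offsets stay attached to each sentence so that the caller can still locate it in the source text.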
Example use cases
You might want to use extractive summarization to:
- Assist the processing of documents to improve efficiency.
- Distill critical information from lengthy documents, reports, and other text forms.
- Highlight key sentences in documents.
- Quickly skim documents in a library.
- Generate news feed content.
You can also use extractive summarization in multiple scenarios across a variety of industries. For example, you can use extractive summarization to:
- Extract key information from public news articles to produce insights such as trends and news spotlights.
- Classify documents by their key contents.
- Distill important information from long documents to empower solutions such as search, question answering, and decision support.
- Empower solutions for clustering documents by their relevant content.
There are other great use cases that we haven't planned for. We encourage you to come up with your own. As you do so, keep in mind the characteristics and limitations described in the next section. We want the service to be used for innovation, but with care. Please draw on actionable information that enables responsible integration in your use cases, and conduct your own testing specific to your scenarios.
Use case guidance
Extractive summarization isn't appropriate for all use case scenarios.
Do not use
Don't use extractive summarization for automatic actions without human intervention for high-impact scenarios. A person should always review source data when another person's economic situation, health, or safety is affected.
Characteristics and limitations
Based on your scenario and input data, you could experience different levels of performance. The following information explains key concepts about performance as they apply to using extractive summarization.
System limitations and best practices to enhance performance
- The extractive summarization feature in Azure Cognitive Service for language is trained on document-based texts, such as news articles, scientific reports, and legal documents. When used with texts in genres that are less represented in the training data, such as conversations and transcriptions, it might produce output with lower accuracy.
- When used with texts that may contain errors or are less similar to well-formed sentences, such as texts extracted from lists, tables, charts, or scanned in via OCR (Optical Character Recognition), the extractive summarization feature might produce output with lower accuracy.
- Most of the training data is in commonly used languages such as English, German, French, Chinese, Japanese, and Korean. The trained models might not perform as well on input in other languages.
- Documents must be "cracked," or converted, from their original format into plain and unstructured text.
- Although the service can handle a maximum of 25 documents per request, the API's latency increases (responses become slower) as documents grow larger. This is especially true if the documents contain close to the maximum of 125,000 characters.
- The model gives a score between 0 and 1 to each sentence and returns the highest scored sentences per request. If you request a three-sentence summary, the service returns the three highest scored sentences. If you request a five-sentence summary from the same document, the service returns the next two highest scored sentences in addition to the first three sentences.
- The service returns extracted sentences in the order in which they appear in the document by default. To change the order, specify sortBy. The accepted values for sortBy are Offset (default) and Rank. Offset orders the extracted sentences by their character positions, and Rank orders them by their rank scores.
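The request limits and ordering behavior in the list above can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the `Extracted` record, the `summarize` and `batch_documents` helpers, and the scores are hypothetical, the 125,000-character limit is treated as a per-document cap, characters are counted with Python's `len`, and Rank ordering is assumed to run from highest score to lowest. The service may differ on each of these points.

```python
from typing import NamedTuple

MAX_DOCS_PER_REQUEST = 25         # limit stated in the list above
MAX_CHARS_PER_DOCUMENT = 125_000  # assumed to apply to each document


class Extracted(NamedTuple):
    """Hypothetical record for one scored sentence."""
    text: str
    offset: int
    rank_score: float


def summarize(scored, count, sort_by="Offset"):
    """Pick the `count` highest-scored sentences, then order them.

    "Offset" orders by character position (document order, the default).
    "Rank" orders by rank score, highest first (direction is an assumption).
    """
    chosen = sorted(scored, key=lambda s: s.rank_score, reverse=True)[:count]
    if sort_by == "Offset":
        return sorted(chosen, key=lambda s: s.offset)
    if sort_by == "Rank":
        return chosen
    raise ValueError(f"unsupported sort_by: {sort_by!r}")


def batch_documents(documents):
    """Split documents into request-sized batches, rejecting oversized ones."""
    for index, doc in enumerate(documents):
        if len(doc) > MAX_CHARS_PER_DOCUMENT:
            raise ValueError(f"document {index} exceeds {MAX_CHARS_PER_DOCUMENT} characters")
    return [
        documents[start:start + MAX_DOCS_PER_REQUEST]
        for start in range(0, len(documents), MAX_DOCS_PER_REQUEST)
    ]


# Made-up scores for a six-sentence document.
scored = [
    Extracted("S1", 0, 0.30),
    Extracted("S2", 10, 0.90),
    Extracted("S3", 20, 0.10),
    Extracted("S4", 30, 0.70),
    Extracted("S5", 40, 0.50),
    Extracted("S6", 50, 0.80),
]

three = summarize(scored, 3)
five = summarize(scored, 5)
by_rank = summarize(scored, 3, sort_by="Rank")

# The five-sentence summary contains every sentence of the three-sentence one.
print(set(three) <= set(five))
print([s.text for s in three], [s.text for s in by_rank])
```

Because selection depends only on the scores, asking for a longer summary of the same document adds sentences without replacing the ones already chosen; `sort_by` changes only the order in which they are listed.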
See also
- Transparency note for Azure Cognitive Service for language
- Transparency note for Named Entity Recognition and Personally Identifying Information
- Transparency note for the health feature
- Transparency note for Key Phrase Extraction
- Transparency note for Language Detection
- Transparency note for Question answering
- Transparency note for Sentiment Analysis
- Data Privacy and Security for Azure Cognitive Service for language
- Guidance for integration and responsible use with Azure Cognitive Service for language