Transparency note for extractive summarization

Extractive summarization is a feature in Azure Text Analytics that produces a summary by extracting sentences that collectively represent the most important or relevant information within the original content.

This feature is designed to help address the problem of content that's too long for users to read. Extractive summarization condenses articles, papers, or documents into key sentences.

The basics of extractive summarization

The extractive summarization feature in Text Analytics uses natural language processing techniques to locate key sentences in an unstructured text document. These sentences collectively convey the main idea of the document. The feature is provided as an API that developers can use to build intelligent solutions based on the extracted information, supporting a variety of use cases.

Unlike some other Text Analytics features, extractive summarization returns a rank score as part of the system response, along with the extracted sentences and their positions in the original documents. A rank score indicates how relevant the model determines a sentence is to the main idea of the document. The model gives each sentence a score between 0 and 1 and returns the highest scored sentences per request. For example, if you request a three-sentence summary, the service returns the three highest scored sentences.
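
The following sketch shows one way to call the feature and read the rank scores, assuming the azure-ai-textanalytics Python package (version 5.3.0 or later, where the client exposes a begin_extract_summary method). The endpoint, key, and document text are placeholders.

```python
# Minimal sketch: extractive summarization with rank scores.
# Assumes azure-ai-textanalytics >= 5.3.0; endpoint and key are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

documents = [
    "The extractive summarization feature condenses long articles into key sentences. "
    "It assigns each sentence a rank score between 0 and 1 and returns the top-ranked "
    "sentences, along with their positions in the original text."
]

# Request a three-sentence summary: the three highest scored sentences are returned.
poller = client.begin_extract_summary(documents, max_sentence_count=3)
for result in poller.result():
    if result.is_error:
        print(f"Error {result.error.code}: {result.error.message}")
        continue
    for sentence in result.sentences:
        # Each extracted sentence carries its rank score and character offset.
        print(f"rank={sentence.rank_score:.2f} offset={sentence.offset}: {sentence.text}")
```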

Example use cases

You might want to use extractive summarization to:

  • Assist the processing of documents to improve efficiency.
  • Distill critical information from lengthy documents, reports, and other text forms.
  • Highlight key sentences in documents.
  • Quickly skim documents in a library.
  • Generate news feed content.

You can also use extractive summarization in multiple scenarios across a variety of industries. For example, you can use extractive summarization to:

  • Extract key information from public news articles to produce insights such as trends and news spotlights.
  • Classify documents by their key contents.
  • Distill important information from long documents to empower solutions such as search, question and answering, and decision support.
  • Empower solutions for clustering documents by their relevant content.

There are other use cases that we haven't anticipated, and we encourage you to come up with your own. As you do so, keep in mind the characteristics and limitations described in the next section. We want Text Analytics to be used for innovation, but with care: draw on the actionable information in this note to integrate the feature responsibly into your use cases, and conduct your own testing specific to your scenarios.

Use case guidance

Extractive summarization isn't appropriate for all use case scenarios.

Do not use

Don't use extractive summarization for automatic actions without human intervention in high-impact scenarios. A person should always review the source data when another person's economic situation, health, or safety is affected.

Characteristics and limitations

Based on your scenario and input data, you could experience different levels of performance. The following information explains key concepts about performance as they apply to using extractive summarization.

System limitations and best practices to enhance performance

  • The extractive summarization feature in Text Analytics is trained on document-based texts, such as news articles, scientific reports, and legal documents. When used with texts in genres that are less represented in the training data, such as conversations and transcriptions, the feature might produce output with lower accuracy.
  • When used with texts that contain errors or are less similar to well-formed sentences, such as text extracted from lists, tables, or charts, or scanned in via optical character recognition (OCR), the extractive summarization feature might produce output with lower accuracy.
  • Most of the training data is in commonly used languages such as English, German, French, Chinese, Japanese, and Korean. The trained models might not perform as well on input in other languages.
  • Documents must be "cracked," or converted, from their original format into plain and unstructured text.
  • Although the service can handle a maximum of 25 documents per request, API latency increases with larger documents. This is especially true for documents that approach the maximum of 125,000 characters.
  • The model gives a score between 0 and 1 to each sentence and returns the highest scored sentences per request. If you request a three-sentence summary, the service returns the three highest scored sentences. If you request a five-sentence summary from the same document, the service returns the next two highest scored sentences in addition to the first three sentences.
  • The service returns extracted sentences in the order in which they appear in the original document by default. To change the order, specify sortBy. The accepted values for sortBy are Offset (default) and Rank: Offset orders sentences by their character positions in the original document, and Rank orders them by their rank scores. The sketch after this list shows rank-ordered requests that stay within the per-request document limit.
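
The following sketch shows how the sort order and the 25-document limit might look in practice, again assuming the azure-ai-textanalytics Python package (version 5.3.0 or later), where the REST sortBy parameter surfaces as the order_by keyword. The endpoint and key are placeholders, and summarize_all is a hypothetical helper.

```python
# Minimal sketch: batch documents within the per-request limit and
# return summaries ordered by rank score instead of character offset.
# Assumes azure-ai-textanalytics >= 5.3.0; endpoint and key are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

MAX_DOCS_PER_REQUEST = 25  # service limit per request

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

def summarize_all(documents, sentence_count=5):
    """Summarize a large set of documents in batches of 25, ordered by rank score."""
    summaries = []
    for start in range(0, len(documents), MAX_DOCS_PER_REQUEST):
        batch = documents[start:start + MAX_DOCS_PER_REQUEST]
        poller = client.begin_extract_summary(
            batch,
            max_sentence_count=sentence_count,
            order_by="Rank",  # return sentences by rank score; "Offset" is the default
        )
        for result in poller.result():
            if result.is_error:
                summaries.append(None)  # keep positions aligned with the input list
                continue
            summaries.append(" ".join(s.text for s in result.sentences))
    return summaries
```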

See also