Content metadata properties used in Azure Cognitive Search

SharePoint Online and Azure Blob Storage can contain various content, and many of those content types have metadata properties that can be useful to index. Just as you can create search fields for standard blob properties like metadata_storage_name, you can create fields for metadata properties that are specific to a document format.

Supported document formats

Cognitive Search supports blob indexing and SharePoint Online document indexing for the following document formats:

  • PDF
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML(both 2003 and 2006 WORD XML)
  • Open Document formats: ODT, ODS, ODP
  • HTML
  • XML
  • KML (XML for geographic representations)
  • ZIP
  • GZ
  • EPUB
  • EML
  • RTF
  • Plain text files (see also Indexing plain text)
  • JSON (see Indexing JSON blobs)
  • CSV (see Indexing CSV blobs)

Properties by document format

The following table summarizes processing done for each document format, and describes the metadata properties extracted by a blob indexer and the SharePoint Online indexer.

Document format / content type Extracted metadata Processing details
HTML (text/html or application/xhtml+xml) metadata_content_encoding
metadata_content_type
metadata_language
metadata_description
metadata_keywords
metadata_title
Strip HTML markup and extract text
PDF (application/pdf) metadata_content_type
metadata_language
metadata_author
metadata_title
metadata_creation_date
Extract text, including embedded documents (excluding images)
DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) metadata_content_type
metadata_author
metadata_character_count
metadata_creation_date
metadata_last_modified
metadata_page_count
metadata_word_count
Extract text, including embedded documents
DOC (application/msword) metadata_content_type
metadata_author
metadata_character_count
metadata_creation_date
metadata_last_modified
metadata_page_count
metadata_word_count
Extract text, including embedded documents
DOCM (application/vnd.ms-word.document.macroenabled.12) metadata_content_type
metadata_author
metadata_character_count
metadata_creation_date
metadata_last_modified
metadata_page_count
metadata_word_count
Extract text, including embedded documents
WORD XML (application/vnd.ms-word2006ml) metadata_content_type
metadata_author
metadata_character_count
metadata_creation_date
metadata_last_modified
metadata_page_count
metadata_word_count
Strip XML markup and extract text
WORD 2003 XML (application/vnd.ms-wordml) metadata_content_type
metadata_author
metadata_creation_date
Strip XML markup and extract text
XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
Extract text, including embedded documents
XLS (application/vnd.ms-excel) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
Extract text, including embedded documents
XLSM (application/vnd.ms-excel.sheet.macroenabled.12) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
Extract text, including embedded documents
PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
metadata_slide_count
metadata_title
Extract text, including embedded documents
PPT (application/vnd.ms-powerpoint) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
metadata_slide_count
metadata_title
Extract text, including embedded documents
PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
metadata_slide_count
metadata_title
Extract text, including embedded documents
MSG (application/vnd.ms-outlook) metadata_content_type
metadata_message_from
metadata_message_from_email
metadata_message_to
metadata_message_to_email
metadata_message_cc
metadata_message_cc_email
metadata_message_bcc
metadata_message_bcc_email
metadata_creation_date
metadata_last_modified
metadata_subject
Extract text, including text extracted from attachments. metadata_message_to_email, metadata_message_cc_email and metadata_message_bcc_email are string collections, the rest of the fields are strings.
ODT (application/vnd.oasis.opendocument.text) metadata_content_type
metadata_author
metadata_character_count
metadata_creation_date
metadata_last_modified
metadata_page_count
metadata_word_count
Extract text, including embedded documents
ODS (application/vnd.oasis.opendocument.spreadsheet) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
Extract text, including embedded documents
ODP (application/vnd.oasis.opendocument.presentation) metadata_content_type
metadata_author
metadata_creation_date
metadata_last_modified
metadata_title
Extract text, including embedded documents
ZIP (application/zip) metadata_content_type Extract text from all documents in the archive
GZ (application/gzip) metadata_content_type Extract text from all documents in the archive
EPUB (application/epub+zip) metadata_content_type
metadata_author
metadata_creation_date
metadata_title
metadata_description
metadata_language
metadata_keywords
metadata_identifier
metadata_publisher
Extract text from all documents in the archive
XML (application/xml) metadata_content_type
metadata_content_encoding
metadata_language
Strip XML markup and extract text
KML (application/vnd.google-earth.kml+xml) metadata_content_type
metadata_content_encoding
metadata_language
Strip XML markup and extract text
JSON (application/json) metadata_content_type
metadata_content_encoding
Extract text
NOTE: If you need to extract multiple document fields from a JSON blob, see Indexing JSON blobs for details
EML (message/rfc822) metadata_content_type
metadata_message_from
metadata_message_to
metadata_message_cc
metadata_creation_date
metadata_subject
Extract text, including attachments
RTF (application/rtf) metadata_content_type
metadata_author
metadata_character_count
metadata_creation_date
metadata_last_modified
metadata_page_count
metadata_word_count
Extract text
Plain text (text/plain) metadata_content_type
metadata_content_encoding
metadata_language
Extract text
CSV (text/csv) metadata_content_type
metadata_content_encoding
Extract text
NOTE: If you need to extract multiple document fields from a CSV blob, see Indexing CSV blobs for details

See also