Content metadata properties used in Azure Cognitive Search
Several of the indexer-supported data sources, including Azure Blob Storage, Azure Data Lake Storage Gen2, and SharePoint, contain standalone files or embedded objects of various content types. Many of those content types have metadata properties that can be useful to index. Just as you can create search fields for standard blob properties like metadata_storage_name, you can create fields in a search index for metadata properties that are specific to a document format.
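As a minimal sketch of that idea, the Python snippet below builds the fields portion of an index definition that mixes a standard blob property with format-specific ones. The index name docs-index, the id key field, the make_field helper, and the choice of Edm.String for every field are illustrative assumptions rather than values prescribed by this article. The snippet only produces a JSON payload and does not call the service.

```python
import json


def make_field(name, edm_type="Edm.String", key=False):
    """Return one field definition shaped like an entry in an index schema.

    Hypothetical helper for illustration; the attribute choices are assumptions.
    """
    return {
        "name": name,
        "type": edm_type,
        "key": key,
        # Only non-key string fields are marked searchable in this sketch.
        "searchable": edm_type == "Edm.String" and not key,
        "retrievable": True,
    }


# Hypothetical index: one key field, one standard blob property,
# and three content metadata properties.
index_definition = {
    "name": "docs-index",
    "fields": [
        make_field("id", key=True),
        make_field("metadata_storage_name"),
        make_field("metadata_content_type"),
        make_field("metadata_author"),
        make_field("metadata_title"),
    ],
}

print(json.dumps(index_definition, indent=2))
```

Naming each index field after the metadata property it should receive keeps the schema easy to compare against the property table in this article. Whether a given property is populated depends on the document format being indexed.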
Supported document formats
Cognitive Search supports blob indexing and SharePoint document indexing for the following document formats:
- CSV (see Indexing CSV blobs)
- EML
- EPUB
- GZ
- HTML
- JSON (see Indexing JSON blobs)
- KML (XML for geographic representations)
- Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
- Open Document formats: ODT, ODS, ODP
- PDF
- Plain text files (see also Indexing plain text)
- RTF
- XML
- ZIP
Properties by document format
The following table summarizes processing done for each document format, and describes the metadata properties extracted by a blob indexer and the SharePoint indexer.
| Document format / content type | Extracted metadata | Processing details |
|---|---|---|
| CSV (text/csv) | metadata_content_type, metadata_content_encoding | Extract text. NOTE: If you need to extract multiple document fields from a CSV blob, see Indexing CSV blobs for details. |
| DOC (application/msword) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| DOCM (application/vnd.ms-word.document.macroenabled.12) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| EML (message/rfc822) | metadata_content_type, metadata_message_from, metadata_message_to, metadata_message_cc, metadata_creation_date, metadata_subject | Extract text, including attachments |
| EPUB (application/epub+zip) | metadata_content_type, metadata_author, metadata_creation_date, metadata_title, metadata_description, metadata_language, metadata_keywords, metadata_identifier, metadata_publisher | Extract text from all documents in the archive |
| GZ (application/gzip) | metadata_content_type | Extract text from all documents in the archive |
| HTML (text/html or application/xhtml+xml) | metadata_content_encoding, metadata_content_type, metadata_language, metadata_description, metadata_keywords, metadata_title | Strip HTML markup and extract text |
| JSON (application/json) | metadata_content_type, metadata_content_encoding | Extract text. NOTE: If you need to extract multiple document fields from a JSON blob, see Indexing JSON blobs for details. |
| KML (application/vnd.google-earth.kml+xml) | metadata_content_type, metadata_content_encoding, metadata_language | Strip XML markup and extract text |
| MSG (application/vnd.ms-outlook) | metadata_content_type, metadata_message_from, metadata_message_from_email, metadata_message_to, metadata_message_to_email, metadata_message_cc, metadata_message_cc_email, metadata_message_bcc, metadata_message_bcc_email, metadata_creation_date, metadata_last_modified, metadata_subject | Extract text, including text extracted from attachments. metadata_message_to_email, metadata_message_cc_email, and metadata_message_bcc_email are string collections; the rest of the fields are strings. |
| ODP (application/vnd.oasis.opendocument.presentation) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_title | Extract text, including embedded documents |
| ODS (application/vnd.oasis.opendocument.spreadsheet) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| ODT (application/vnd.oasis.opendocument.text) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| PDF (application/pdf) | metadata_content_type, metadata_language, metadata_author, metadata_title, metadata_creation_date | Extract text, including embedded documents (excluding images) |
| Plain text (text/plain) | metadata_content_type, metadata_content_encoding, metadata_language | Extract text |
| PPT (application/vnd.ms-powerpoint) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents |
| PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents |
| PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents |
| RTF (application/rtf) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text |
| WORD 2003 XML (application/vnd.ms-wordml) | metadata_content_type, metadata_author, metadata_creation_date | Strip XML markup and extract text |
| WORD XML (application/vnd.ms-word2006ml) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Strip XML markup and extract text |
| XLS (application/vnd.ms-excel) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| XLSM (application/vnd.ms-excel.sheet.macroenabled.12) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| XML (application/xml) | metadata_content_type, metadata_content_encoding, metadata_language | Strip XML markup and extract text |
| ZIP (application/zip) | metadata_content_type | Extract text from all documents in the archive |
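
The MSG row is the only one that states field types, so it makes a convenient worked example. The Python sketch below takes the twelve property names listed for MSG and assigns Collection(Edm.String) to the three properties described as string collections and Edm.String to the rest. The helper name msg_field_type and the list and set names are hypothetical, chosen for illustration only, and the output is a plain list of name/type pairs rather than a complete index schema.

```python
# Property names copied from the MSG row of the table.
MSG_PROPERTIES = [
    "metadata_content_type",
    "metadata_message_from",
    "metadata_message_from_email",
    "metadata_message_to",
    "metadata_message_to_email",
    "metadata_message_cc",
    "metadata_message_cc_email",
    "metadata_message_bcc",
    "metadata_message_bcc_email",
    "metadata_creation_date",
    "metadata_last_modified",
    "metadata_subject",
]

# The three properties the MSG row describes as string collections.
STRING_COLLECTIONS = {
    "metadata_message_to_email",
    "metadata_message_cc_email",
    "metadata_message_bcc_email",
}


def msg_field_type(name):
    """Return the field type for an MSG metadata property (hypothetical helper)."""
    if name in STRING_COLLECTIONS:
        return "Collection(Edm.String)"
    return "Edm.String"


msg_fields = [{"name": n, "type": msg_field_type(n)} for n in MSG_PROPERTIES]

for field in msg_fields:
    print(f'{field["name"]}: {field["type"]}')
```

Types for the other formats are not stated in the table, so this sketch deliberately covers MSG only.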