SharePoint 2013: Search crawl of Office documents reports "This item was partially parsed"
Consider the following scenario:
You start a Search Service Application crawl in a SharePoint Server 2013 environment. Crawled Office documents that contain embedded images report the following crawl warning:
This item was partially parsed. The item has been truncated in the index because it exceeds the maximum size.
The default SharePoint Search document parser extracts the contents of an Office document when the crawler processes the file. If the parser encounters an embedded photo, the associated schema reference indicates that a picture is present, and the crawler skips it. Embedded images such as bitmaps, however, are identified as a shape with no associated schema. Because the parser cannot determine how to process such an item, it generates the crawl warning. The native, built-in handlers for Office documents in SharePoint Server 2013 currently do not support embedded images.
Note: The warning message indicates that the embedded items are not processed. However, the remaining text and properties are still crawled and are searchable.
With a minimum server update level of July 2014 CU (15.0.4631.1001) for SharePoint Server 2013, you can use a third-party iFilter of your choice to parse bitmaps and override the built-in SharePoint 2013 handler.
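Before you change any file format settings, it can help to confirm that the farm meets this minimum build. The following is a minimal sketch, assuming it runs on a farm server with the SharePoint PowerShell snap-in available. The variable names are illustrative, and the farm build number may not reflect the patch level of every individual component.

```powershell
# Load the SharePoint snap-in if the session is not the SharePoint Management Shell.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Minimum build taken from the article: July 2014 CU for SharePoint Server 2013.
$minimumBuild = [version]'15.0.4631.1001'

# BuildVersion is a System.Version, so -ge compares it component by component.
$farmBuild = (Get-SPFarm).BuildVersion

if ($farmBuild -ge $minimumBuild) {
    Write-Host "Farm build $farmBuild meets the minimum build $minimumBuild."
} else {
    Write-Warning "Farm build $farmBuild is below $minimumBuild. Install the update first."
}
```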
The filter packs built into SharePoint 2013 do not have the full functionality of the filter packs from SharePoint 2010 and FAST. The Office 2010 Filter Pack referenced here can parse embedded images effectively and can be used to override the out-of-box iFilter for Office documents:
Microsoft Office 2010 Filter Packs
To enable iFilter support, use the Set-SPEnterpriseSearchFileFormatState cmdlet, which has the -UseIFilter parameter for this purpose. The full command to switch to the iFilter follows.
Note: Use a similar command for any iFilter that you want to replace after you install the iFilter.
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchFileFormat -SearchApplication $ssa -Identity docx
Set-SPEnterpriseSearchFileFormatState -SearchApplication $ssa -Identity docx -UseIFilter $true -Enable $true
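If several Office formats need the same override, the command above can be repeated in a loop. This is a sketch only: the list of formats is an assumption and should be limited to formats for which an iFilter is actually installed, and it assumes the farm has a single Search Service Application.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Assumed list of formats to override; adjust it to match the installed iFilters.
$formats = 'docx', 'pptx', 'xlsx'

foreach ($format in $formats) {
    # Same parameters as the single-format command, applied per format.
    Set-SPEnterpriseSearchFileFormatState -SearchApplication $ssa -Identity $format -UseIFilter $true -Enable $true

    # Display the format entry so the new state can be checked.
    Get-SPEnterpriseSearchFileFormat -SearchApplication $ssa -Identity $format
}
```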
On each server that hosts the Content Processing component, the Search Host Controller service must be restarted for the changes to take effect. Run the following commands:
net stop spsearchhostcontroller
net start spsearchhostcontroller
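The same restart can be done from PowerShell, which is convenient when more than one server hosts the Content Processing component. The sketch below assumes PowerShell remoting is enabled on those servers. The server names are hypothetical placeholders.

```powershell
# Hypothetical names of the servers that host the Content Processing component.
$contentProcessingServers = 'SPSEARCH01', 'SPSEARCH02'

# Restart the Search Host Controller service on each server.
# Requires PowerShell remoting and administrative rights on the target servers.
Invoke-Command -ComputerName $contentProcessingServers -ScriptBlock {
    Restart-Service -Name SPSearchHostController
}
```

On a single server, running Restart-Service -Name SPSearchHostController locally is equivalent to the net stop and net start pair.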
After you complete these steps, start a full crawl. The documents that previously generated the warning should now be reported as crawled successfully, because the Office 2010 Filter Pack parses the embedded images.
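A full crawl can also be started from PowerShell rather than from Central Administration. The following sketch starts a full crawl on every content source that is currently idle. Crawling all content sources is an assumption, so narrow the selection if only some sources hold the affected documents.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Retrieve all content sources of the Search Service Application.
$contentSources = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa

foreach ($source in $contentSources) {
    # Only start a crawl when no other crawl is running on this content source.
    if ($source.CrawlStatus -eq 'Idle') {
        Write-Host "Starting full crawl of '$($source.Name)'."
        $source.StartFullCrawl()
    } else {
        Write-Warning "Skipping '$($source.Name)': crawl status is $($source.CrawlStatus)."
    }
}
```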
If a future update to the native filter packs adds support for crawling embedded images, disable the overriding iFilters by running the previously mentioned PowerShell commands with -UseIFilter $false for each file type.
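For the docx format, reverting to the built-in handler might look like the following sketch. It keeps the format enabled and changes only the iFilter setting. As with the original change, the Search Host Controller service presumably needs a restart afterward.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Switch docx back to the native parser while keeping the format enabled.
Set-SPEnterpriseSearchFileFormatState -SearchApplication $ssa -Identity docx -UseIFilter $false -Enable $true

# Display the format entry to confirm the change.
Get-SPEnterpriseSearchFileFormat -SearchApplication $ssa -Identity docx
```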