Indexing images with FAST Search for SharePoint 2010
Most large-scale intranet projects include some sort of image files in the search index. Image search is a common customer requirement, but in my experience people don't know what to expect from it. Some people seem to think that just searching for the image name is good enough. Others have unrealistic expectations about recognizing objects and text in the image itself. Typical questions are:
- What image formats are supported?
- How is image metadata mapped by default?
- How are updates handled? Will images be reindexed when metadata changes?
- Will search recognize text and characters in the image (Optical Character Recognition/OCR)?
- How big can pictures be?
If you are feeling lazy, jump to the end to see the answers.
Image files usually carry the date and time they were taken, and possibly some technical information like focal length and other photo geek details. Cameras with GPS usually store geographical information about where the image was taken. In Windows, right-click an image file and choose Properties to bring up the metadata details.
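If you want to check what metadata is physically stored in a file before you index it, a small script can do the same job as the Properties dialog. The following is a minimal sketch for TIFF files only, using nothing but the Python standard library. It assumes the title, subject, tags and comments were written as the "XP" tags that Windows Explorer commonly uses (stored as UTF-16LE text); other tools may store the same information elsewhere, for example in XMP, which this sketch does not read. The function name `read_tiff_text_tags` is my own invention for illustration.

```python
import struct

# TIFF field type -> size in bytes of one value (a subset, enough for text tags)
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 7: 1}

# Standard TIFF text tags plus the "XP" tags commonly written by Windows Explorer
TEXT_TAGS = {
    270: "ImageDescription",
    306: "DateTime",
    315: "Artist",
    0x9C9B: "XPTitle",
    0x9C9C: "XPComment",
    0x9C9D: "XPAuthor",
    0x9C9E: "XPKeywords",
    0x9C9F: "XPSubject",
}


def read_tiff_text_tags(data: bytes) -> dict:
    """Return the known text tags found in the first IFD of a TIFF byte string."""
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")

    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    found = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag not in TEXT_TAGS or ftype not in TYPE_SIZES:
            continue
        size = TYPE_SIZES[ftype] * count
        if size <= 4:
            # Small values are stored inline in the entry itself
            raw = data[start + 8:start + 8 + size]
        else:
            # Larger values live elsewhere in the file; the entry holds the offset
            (offset,) = struct.unpack(endian + "I", data[start + 8:start + 12])
            raw = data[offset:offset + size]
        if tag >= 0x9C9B:
            text = raw.decode("utf-16-le", errors="replace")  # XP tags
        else:
            text = raw.decode("ascii", errors="replace")
        found[TEXT_TAGS[tag]] = text.rstrip("\x00")
    return found
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example `read_tiff_text_tags(open(path, "rb").read())` where `path` points to your own image. An empty result only means the file has none of these particular tags, not that it has no metadata at all.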
In this example my image is a screenshot from a page in some documentation. It had very little metadata, so I added a title, subject, tags and some comments. Let's see how it is indexed.
Indexing test files with docpush
Whenever I test search features I use docpush to index content. Unless you are indexing directories with lots of files, this is far quicker than using the SharePoint crawler. I usually include the -C LIVE option to wait for the callback telling me that the content is in the index.
My example files are on a share available to everyone.
docpush -c sp -C LIVE \\danglsp\share\TestImages\example.tif
I get a warning about not having the advanced filter pack enabled, but the content is still indexed.
Let's see what was indexed. In my test environment I have enabled the brilliant show all stylesheet to see what is actually indexed. Another way to inspect what gets indexed is to run psctrl doctrace on before you run docpush, then run doclog -a to see what happened in the processing pipeline. Yet another way is to add a Spy stage to the end of your pipeline.
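If you repeat the doctrace routine often, it can be scripted. Here is a rough sketch that runs the commands mentioned above in order. It assumes the FAST command-line tools are on the PATH (as they are in the FAST Search shell), and it adds a final psctrl doctrace off step so that tracing is not left enabled; that step and the helper names `trace_commands` and `run_traced_push` are my additions, not part of the original walkthrough.

```python
import subprocess


def trace_commands(file_path, collection="sp"):
    """Return the command lines for one traced docpush run, in order."""
    return [
        ["psctrl", "doctrace", "on"],
        ["docpush", "-c", collection, "-C", "LIVE", file_path],
        ["doclog", "-a"],
        ["psctrl", "doctrace", "off"],  # assumption: switch tracing off again afterwards
    ]


def run_traced_push(file_path, collection="sp", runner=subprocess.run):
    """Turn tracing on, push one file, show the log, and always turn tracing off."""
    on, push, log, off = trace_commands(file_path, collection)
    runner(on, check=True)
    try:
        runner(push, check=True)
        runner(log, check=True)
    finally:
        # Runs even if docpush or doclog fails
        runner(off, check=True)
```

The `runner` parameter is only there so the sequence can be tried out with a stand-in function on a machine without FAST installed. On the search server you would simply call `run_traced_push` with the path to your test file.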
Well, this isn't looking too good. The title is simply a link to my image. I was expecting it to be the title I set in my metadata. Also, I don't see the subject, tags or any other information from the metadata. The reference to chowinery.com is just a default prefix from docpush, so I don't care about that now. It seems the image was indexed as a plain file. Let's enable the advanced filter pack to see if we can get some more metadata indexed.
Testing with Advanced Filter Pack enabled
Let's repeat the test with the advanced filter pack enabled. This lets the search engine extract more content from hundreds of file types.
Now repeat the docpush command from above.
Looking good. No more warning. Searching for it again, we see that it looks a lot better. Our metadata is indexed.
For a full list of supported image file types, see the C:\FASTSearch\etc\formatdetector\converter_rules.xml file. It includes your normal JPEG/JPG, BMP, GIF and TIFF in addition to the more obscure Corel Draw, g3fax and even Harvard Graphics formats, whatever that is. Note that certain file types are excluded from the index by default. Make sure you enable indexing of the JPEG and JPG file extensions in Central Administration -> Content SSA -> Manage File Types.
This simple example is not a full image search application. We still need to preview images and do some simple redesign of the results page. This is covered in a separate post, Tutorial: Render images in search results with FAST Search for SharePoint 2010.
Let's review the questions from the start of this article.
What image formats are supported?
Basically any file format you have heard about, and then some.
How is image metadata mapped by default?
From our example we saw the default mapping:
title -> title
comments -> description
How are updates handled? Will images be reindexed when metadata changes?
Yes. Changing the metadata will force the image to be reindexed on the next incremental or full crawl.
Will search recognize text and characters in the image (Optical Character Recognition/OCR)?
Yes, you can do OCR for TIFF files in FAST Search for SharePoint.
How big can pictures be?
Very big. Pixel data is not stored in the index, so big pictures are no problem. Still, normal file size restrictions apply. Read an interesting discussion about max file size in FAST Search for SharePoint in the MSDN forums.