Mystery Solved - Crawled Properties in SharePoint (Part 1)
Crawled properties in SharePoint are metadata extracted from documents during crawls; which properties are extracted depends on the protocol handler used. Metadata includes information such as author, creation date, subject, and title. Administrators can control which crawled properties are mapped to managed properties and, in doing so, enhance the end-user search experience.
The dilemma arises when you’re trying to figure out which crawled property to map to a managed property. Several of the crawled properties have no descriptive name, only an integer (the propID). This makes it difficult to know which crawled property should be mapped to a managed property.
Viewing Crawled Properties
Crawled properties can be seen in the Shared Services Provider’s Metadata Properties page.
1. Launch SharePoint 3.0 Central Administration
2. Select Shared Services Administration
3. Select the Shared Services Provider
a. In the Shared Services Provider, you can see the categories of the Crawled Properties in one of two ways:
i. Search Settings
ii. Search Administration (if the Infrastructure Update is installed)
b. Select the Search Administration link
c. In the Queries and Results heading, select Metadata Properties.
d. On the Metadata Property Mapping page, select Crawled Properties to see the list of Crawled Property Categories.
e. From here, you can see the Crawled Properties View and the number of properties associated with each category.
NOTE: The number of crawled properties per category may differ from the screenshot due to environment differences.
From the screenshot, you can see there are 11 crawled property categories that are available out of the box. Some of the categories are self-explanatory while others may not be so clear. Given that, let’s take a look at what these categories represent.
Basic Category – can contain metadata that is associated with the gatherer, search, core, and storage property sets. In my environment, there are 10 different GUIDs (property sets) in the Basic Crawled Property Category.
Business Data Category – metadata that is associated with content in the Business Data Catalog
Internal Category – metadata internal to SharePoint
Mail Category – this metadata is associated with Microsoft Exchange Server
Notes Category – metadata that is associated with Lotus Notes
Office Category – metadata contained in Microsoft Office documents such as Word, Excel, PowerPoint, etc
People Category – metadata that is associated with the people profiles in SharePoint. The majority of these are also mapped to various managed properties from Active Directory and SharePoint information.
SharePoint Category – metadata that is part of the Microsoft Office schema available out of the box.
Tiff Category – contains metadata associated mainly with documents that have been scanned or faxed, as well as with word processing and Optical Character Recognition (OCR).
Web Category – HTML metadata associated with web pages
XML Category – includes metadata associated with the XML filter
We have just completed a very high-level overview of the “out of the box” crawled properties available in Microsoft Office SharePoint Server 2007. From here on, we’re going to dive into each of the crawled property categories, one at a time, to learn exactly what metadata is available to be crawled and what it means. Remember, some of the crawled properties may be different in your environment based on the content being crawled. Also, while I’ve tried to find all of the information, there are still a few crawled properties I simply cannot find.
The best approach to a big challenge is to break it up into smaller chunks. For our first example, let’s look closer at the Crawled Properties View – Basic and the property named Basic:12(Integer).
There are two properties with this exact name: one is mapped to Size and the other isn’t mapped to anything.
Select the Property Name Basic:12(Integer) that is mapped to Size to see the details of this property.
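To make the naming scheme concrete, here is a minimal Python sketch that splits a display name such as Basic:12(Integer) into its parts. The Category:identifier(Type) pattern is inferred only from the example above, and the function and field names are hypothetical; this is not part of any SharePoint API.

```python
import re
from typing import NamedTuple


class CrawledPropertyName(NamedTuple):
    category: str    # e.g. "Basic"
    identifier: str  # a propID such as "12", or a descriptive name
    type_label: str  # e.g. "Integer"


# Assumed display format: Category:identifier(Type), as in "Basic:12(Integer)".
_PATTERN = re.compile(
    r"^(?P<category>[^:]+):(?P<identifier>.+)\((?P<type_label>[^()]+)\)$"
)


def parse_display_name(text: str) -> CrawledPropertyName:
    """Split a crawled property display name into its three parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized crawled property name: {text!r}")
    return CrawledPropertyName(
        match["category"].strip(),
        match["identifier"].strip(),
        match["type_label"].strip(),
    )


parsed = parse_display_name("Basic:12(Integer)")
print(parsed.category, parsed.identifier, parsed.type_label)
# An all-digit identifier is a propID rather than a descriptive name.
print(parsed.identifier.isdigit())
```

Because two different properties can share the same display name, the name alone is not enough to identify a crawled property; the details page is what tells them apart.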
The Name and Information section is the key to finding the meaning behind any of the crawled properties, whether they are in this category or any of the other categories. Given that, let’s look closer at the six (6) elements that make up the Name and Information section.
Name and Information Section of a Crawled Property
As you can see, there are six (6) pieces of information that are associated with each crawled property: Property Name, Category, Property Set ID, Variant Type, Data Type, and Multi-valued.
Property Name – This is the name the development team gave this property when the program was written. It is hard-coded in the program and cannot be changed.
Category – This is a grouping of crawled properties based on the IFilter and Protocol Handler used to extract the metadata from the content. The category name can be edited, but this is not recommended because search functionality will break.
Property Set ID – A GUID that identifies the property set for the crawled property. Doing a search for the GUID B725F130-47EF-101A-A5F1-02608C9EEBAC and filtering the results for the Property Name of 12 yields several links to related content. One such link, on MSDN, provides a tremendous amount of information. It tells us that this property set is a System property set and that the propID of 12 is the file size:
The system-provided file system size of the item, in bytes.
name = System.Size
shellPKey = PKEY_Size
formatID = B725F130-47EF-101A-A5F1-02608C9EEBAC
propID = 12
inInvertedIndex = true
isColumn = true
isColumnSparse = false
columnIndexType = OnDisk
maxSize = 128
label = Size
invitationText = Add a file size
hideLabel = false
type = UInt64
groupingRange = Size
isInnate = true
multipleValues = false
isGroup = false
aggregationType = Sum
isTreeProperty = false
isViewable = true
isQueryable = true
includeInFullTextQuery = false
conditionType = String
defaultOperation = Equal
sortByAlias = None
additionalSortByAliases = None
defaultColumnWidth = 10
displayType = Number
alignment = Right
relativeDescriptionType = Size
defaultSortDirection = Descending
formatAs = General
formatAs = YesNo
formatAs = ByteSize
formatDurationAs = hh:mm:ss
formatAs = General
formatTimeAs = ShortTime
formatDateAs = ShortDate
useValueForDefault = False
minValue = 134217729
setValue = 134217729
text = >129 MB
control = Default
control = Default
control = Default
control = Default
Any of the GUIDs can be looked up on MSDN to gain better insight into each of their properties.
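If you end up researching many of these, it can help to keep your findings in a small lookup keyed on the pair (Property Set ID, propID). The sketch below is only an illustration: the table and function names are hypothetical, and the single entry is the one documented above. Anything else would have to be added from your own research.

```python
import uuid

# Known (property set GUID, propID) pairs. Only System.Size comes from
# this article; further entries must be filled in from your own lookups.
KNOWN_PROPERTIES = {
    (uuid.UUID("B725F130-47EF-101A-A5F1-02608C9EEBAC"), 12): "System.Size",
}


def describe(property_set_id: str, prop_id: int) -> str:
    """Return the documented name for a crawled property, if we know it."""
    # uuid.UUID normalizes case and accepts optional surrounding braces.
    key = (uuid.UUID(property_set_id), prop_id)
    return KNOWN_PROPERTIES.get(key, "unknown (look up the property set GUID)")


print(describe("{b725f130-47ef-101a-a5f1-02608c9eebac}", 12))
print(describe("B725F130-47EF-101A-A5F1-02608C9EEBAC", 13))
```

Keying on both values matters because the same propID can appear in more than one property set.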
Variant Type – The variant type defines the type of data for a property, e.g. text, date and time, yes/no, integer, etc. The following table describes some of the variant types used in SharePoint.
Data Type – Corresponds to the Variant Type
Multi-valued – Describes whether this property can hold more than one value. Most of the crawled properties are not multi-valued.
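The Variant Type and Multi-valued fields can be read together if the variant type is treated as a standard Windows VARTYPE code. The numeric constants in this sketch are the standard VARTYPE values; that SharePoint displays exactly these codes, and the helper name decode_variant_type, are my assumptions, so verify against your own environment.

```python
from typing import Tuple

VT_VECTOR = 0x1000    # flag bit: the value is an array of the base type
VT_TYPEMASK = 0x0FFF  # mask that isolates the base type

# A few standard VARTYPE codes; this list is not exhaustive.
BASE_TYPES = {
    2: "VT_I2",         # 16-bit signed integer
    3: "VT_I4",         # 32-bit signed integer
    5: "VT_R8",         # 64-bit floating point
    11: "VT_BOOL",      # yes/no
    20: "VT_I8",        # 64-bit signed integer
    21: "VT_UI8",       # 64-bit unsigned integer
    30: "VT_LPSTR",     # ANSI string
    31: "VT_LPWSTR",    # Unicode string
    64: "VT_FILETIME",  # date and time
    72: "VT_CLSID",     # GUID
}


def decode_variant_type(code: int) -> Tuple[str, bool]:
    """Return (base type name, is_multi_valued) for a VARTYPE code."""
    base = code & VT_TYPEMASK
    is_vector = bool(code & VT_VECTOR)
    return BASE_TYPES.get(base, f"unknown ({base})"), is_vector


print(decode_variant_type(20))    # plain 64-bit integer
print(decode_variant_type(4127))  # 0x1000 | 31: a vector of Unicode strings
```

Under this reading, a multi-valued property is one whose variant type has the VT_VECTOR bit set.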
Now that we know how to read the information on the properties, in the next post we'll see what they are all about.