Regular Expressions Support in SharePoint 2010 Crawling

01/21/2010

Search admins often need to exclude files that match a certain pattern from a crawl. For example:

· In a bank, file names starting with an SSN

· In a business site, file names containing a credit card number

· URLs having a specific value for a certain parameter of an aspx file

· etc.

The usual solution is to allow admins to create “crawl rules” that restrict crawlers from following specific links. The most basic crawl rule specifies a complete URL for the file to be crawled, which requires the admin to create as many rules as there are files in their repository. A more practical solution often implemented involves the use of the wildcard character: “*”. This character matches everything, so admins can create a rule using the wildcard to include (or omit) all files under a particular folder or path:

\\myfileserver\myclientsfolder\*

This works if all the files are located neatly in one folder, but what if they are spread across the repository (or Web site)? This is the problem that is solved by using regular expression (RegEx) syntax.

The SharePoint Solution

In SharePoint 2007, the wildcard operator “*” is the only operator supported in crawl rules for matching characters. As mentioned, it is a brute-force operator that matches everything. Wildcard-only rules do not give the admin the flexibility to, for example, recognize and omit URLs that contain Social Security numbers, or that have an aspx parameter with a specific value.

SharePoint 2010 includes some new capabilities in this area. The default behavior of crawl rules in SharePoint 2010 is the same as it was in SharePoint 2007, but with SharePoint Search 2010, administrators can create crawl rules to include or exclude URLs that match regular expressions. To enable regular expressions, the admin need only select the check box on the Crawl Rules creation UI as shown in the image below.

Regular Expression Operators

The table below lists and describes the regular expression operators that are supported for crawl rules in SharePoint 2010:

Operator	Symbol	Description	Example rule	Will match	Won’t match
Group	()	Characters can be grouped using round brackets. Any operator applied to the brackets applies to the whole group.
Match any character	.	This operator matches any character. It does not match with NULL.	https://mysite/page.ht..	https://mysite/page.html	https://mysite/page.htm
Match zero or one	?	The expression it is applied to may be absent from the target address or appear exactly once.	https://mysite/page(1)?.html	https://mysite/page.html AND https://mysite/page1.html	https://mysite/page11.html
Match zero or more	*	The expression it is applied to may be absent from the target address or repeat any number of times.	https://mysite/page(1)*.html	https://mysite/page.html AND https://mysite/page111.html	https://mysite/pag.html
Match at least one	+	The expression it is applied to must appear in the target address at least once.	https://mysite/page(1)+.html	https://mysite/page1.html AND https://mysite/page111.html	https://mysite/page.html
Exact count	{num}	This operator is denoted by a number inside “{}”, e.g. {5}. It restricts the expression on which it is applied to have exactly the specified number of repetitions in the target address.	https://myfiles/(9){4}-(0){2}.html	https://myfiles/9999-00.html	https://myfiles/999-00.html
Minimum count	{num, }	This operator is denoted by a number inside “{}” followed by a "," e.g. {5,}. It restricts the expression on which it is applied to have at least the specified number of repetitions in the target address.	https://myfiles/(9){4,}-(0){2}.html	https://myfiles/9999-00.html AND https://myfiles/99999-00.html	https://myfiles/999-00.html
Range count	{num1,num2}	This operator is denoted by two numbers inside “{}” separated by a ",", e.g. {5,8}. The first number defines the lower limit and the second number defines the upper limit. It restricts the expression on which it is applied to between num1 and num2 repetitions in the target address. A valid rule will always have num1 < num2.	https://myfiles/(9){4}-(0){2,3}.html	https://myfiles/9999-00.html AND https://myfiles/9999-000.html	https://myfiles/9999-0000.html
Alternation	|	This operator is applied to two expressions and matches exactly one of them.	file://myshare/((folder1)|(folder2))/.*	\\myshare\folder1\<any files> OR \\myshare\folder2\<any files>	\\myshare\folder1folder2\<any files>
List	[ <list of chars> ]	This operator is denoted by a list of characters inside “[]”. It matches any one of the characters in the list. A range of characters can be specified with the "-" operator.	https://testhost/test[1-3].htm	https://testhost/test1.htm OR https://testhost/test2.htm OR https://testhost/test3.htm	https://testhost/test.htm
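
The operators in the table can be tried out away from a SharePoint farm. The following is a minimal sketch that uses Python's re module as a stand-in for the crawler's matcher, which is an assumption: the two engines are not guaranteed to behave identically. It also assumes that a rule must match the whole URL, which is what the "Won't match" column suggests, and the helper name rule_matches is hypothetical. Literal dots are written as [.] because Python treats an unescaped "." as "any character", which would let page(1)?.html match page11.html.

```python
import re

def rule_matches(rule: str, url: str) -> bool:
    """Approximate a crawl rule check with Python's regex engine.

    Assumption: the rule has to match the entire URL, so fullmatch is used.
    """
    return re.fullmatch(rule, url) is not None

# (rule, url, expected result) triples adapted from the operator table.
CHECKS = [
    ("https://mysite/page.ht..", "https://mysite/page.html", True),
    ("https://mysite/page.ht..", "https://mysite/page.htm", False),
    ("https://mysite/page(1)?[.]html", "https://mysite/page1.html", True),
    ("https://mysite/page(1)?[.]html", "https://mysite/page11.html", False),
    ("https://mysite/page(1)*[.]html", "https://mysite/page111.html", True),
    ("https://mysite/page(1)+[.]html", "https://mysite/page.html", False),
    ("https://myfiles/(9){4,}-(0){2}[.]html", "https://myfiles/99999-00.html", True),
    ("https://myfiles/(9){4}-(0){2,3}[.]html", "https://myfiles/9999-0000.html", False),
    ("file://myshare/((folder1)|(folder2))/.*", "file://myshare/folder1/report.docx", True),
    ("file://myshare/((folder1)|(folder2))/.*", "file://myshare/folder1folder2/report.docx", False),
    ("https://testhost/test[1-3][.]htm", "https://testhost/test2.htm", True),
    ("https://testhost/test[1-3][.]htm", "https://testhost/test.htm", False),
]

for rule, url, expected in CHECKS:
    result = rule_matches(rule, url)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{status}: {rule!r} vs {url!r} -> {result}")
```

A local check like this is only a first pass; the crawl rule test option in the SharePoint administration UI remains the authority on what the crawler will actually do.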

Using RegEx Operators in Crawl Rules

Once you understand the RegEx operators above and how to enable them in the crawler, there are only a couple of other things you need to keep in mind:

Protocol part

Regular expression operators cannot be used in the protocol part of the URL. This means, for example, the following RegEx rule cannot be created:

.*//www.microsoft.com/.*

If you try to create a rule like this, the system will prepend https:// and treat “.*” as the next part of the URL. The resulting rule in this case will be:

https:// .*//www.microsoft.com/.*

which may not be what you intended.
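
A rule can be checked for this problem before it is typed into the UI. The sketch below is a hypothetical pre-flight helper, not part of SharePoint: the name protocol_is_plain and the set of characters it rejects are assumptions based on the operator table above. It accepts rules that start with a literal protocol (such as https:// or file://) or with a UNC prefix, and rejects everything else.

```python
# Characters the operator table lists as regex operators (assumed set).
REGEX_OPERATOR_CHARS = set(".?*+(){}[]|")

def protocol_is_plain(rule: str) -> bool:
    """Return True if the rule begins with a literal protocol or a UNC prefix."""
    # UNC-style rules such as \\myshare\.* have no protocol part to check.
    if rule.startswith("\\\\"):
        return True
    scheme, separator, _rest = rule.partition("://")
    # No "scheme://" at all: the system would prepend a protocol itself.
    if not separator or not scheme:
        return False
    # Reject any regex operator character inside the protocol part.
    return not any(ch in REGEX_OPERATOR_CHARS for ch in scheme)

print(protocol_is_plain("https://www.microsoft.com/.*"))
print(protocol_is_plain(".*//www.microsoft.com/.*"))
```

The first call accepts the rule and the second rejects it, which flags the rule from the example above before the system silently rewrites it.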

Case sensitive comparison

RegEx rules are case-insensitive by default. To make a rule match URLs case-sensitively, the administrator should select the “Match case” check box in the rule creation UI as shown below:

If the “Match case” check box is selected, the crawler will preserve the case of matching URLs during the crawl. In the example above, the rule will match https://test/AbC123.html and will not match https://test/Abc123.html.

This feature comes in handy when SharePoint is used to crawl web sites hosted on Unix-based web servers, which are case-sensitive.
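
The effect of the “Match case” option can be approximated with a regex flag. This is a minimal sketch in Python, again as a stand-in for the crawler's matcher; the rule string and the match_case parameter are hypothetical and chosen only to mirror the two URLs mentioned above.

```python
import re

def rule_matches(rule: str, url: str, match_case: bool = False) -> bool:
    """Match a whole URL against a rule, case-insensitively unless match_case is set."""
    flags = 0 if match_case else re.IGNORECASE
    return re.fullmatch(rule, url, flags) is not None

# Hypothetical rule: "AbC", one or more digits, then a literal ".html".
RULE = "https://test/AbC[0-9]+[.]html"

for url in ("https://test/AbC123.html", "https://test/Abc123.html"):
    print(url,
          "default:", rule_matches(RULE, url),
          "match case:", rule_matches(RULE, url, match_case=True))
```

With the default setting both URLs match; with match_case=True only the URL whose case is identical to the rule matches.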

Examples

Here are some interesting examples demonstrating the usefulness of regular expressions in crawl rules:

Rule	Description
\\myshare\.*	Match everything under the share “myshare”
file://.*/[0-9]{4}-[1,2]-[a-z]{4}.docx	Match all the links with file names having the following pattern: <4 digits>-<1 or 2>-<4 letters>.docx
\\myshare\((folder1)|(myfolder))\.*	Match all files in folder1 or myfolder in \\myshare
https://mysite/myasp.aspx[?]param1=value	Match a character that is itself a regex operator ("?" in this case) literally by placing it inside a list.
https://mysite/myasp.aspx[?]param[12]=.*	Match all links pointing to myasp.aspx with either param1 or param2 specified.
https://site/.*.aspx[?]category=1&subcategory=.*	Match all aspx links that have a specific value for one parameter, ignoring the value of the second parameter.
file://clientsdata/[0-9]{3}-[0-9]{2}-[0-9]{4}.*	Match all files whose names start with a Social Security number
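
To see how such rules behave as exclusion filters, the sketch below applies two of the rules from the table to a handful of made-up URLs. It uses Python's re module as an approximation of the crawler, assumes whole-URL, case-insensitive matching, and the names EXCLUDE_RULES, is_excluded and the candidate URLs are all hypothetical.

```python
import re

# Two exclusion rules taken from the examples table.
EXCLUDE_RULES = [
    r"file://clientsdata/[0-9]{3}-[0-9]{2}-[0-9]{4}.*",
    r"https://mysite/myasp.aspx[?]param[12]=.*",
]

def is_excluded(url: str) -> bool:
    """Return True if any exclusion rule matches the whole URL."""
    return any(re.fullmatch(rule, url, re.IGNORECASE) for rule in EXCLUDE_RULES)

# Made-up candidate URLs for illustration only.
candidates = [
    "file://clientsdata/123-45-6789_statement.pdf",
    "file://clientsdata/newsletter.pdf",
    "https://mysite/myasp.aspx?param2=abc",
    "https://mysite/myasp.aspx?param3=abc",
]

to_crawl = [url for url in candidates if not is_excluded(url)]
for url in to_crawl:
    print("crawl:", url)
```

Only the newsletter file and the param3 link survive the filter; the file whose name starts with an SSN-shaped prefix and the param2 link are dropped.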

Syed Anas Hashmi | SDET | Microsoft Enterprise Search Group