Differences Between MOSS Content Sources

I have been asked this question a few times now by some of my MOSS Search customers: what is the difference between setting up a "SharePoint Sites" MOSS Content Source and a "Web Sites" MOSS Content Source?

Here are some of my thoughts on the two:

  • Both Content Sources let you enter a Name that describes the Content Source, so that you can identify it for tracking and other Search-related tasks.

  • They both let you enter Start Addresses for crawling the Content Source. In a Web Sites Content Source, a start address can point to any content, from a single web page to an entire web site. In a SharePoint Sites Content Source, it can point to Office SharePoint Server sites and WSS sites.

  • The Crawl Settings are where the difference between the two comes in. In a Web Sites Content Source, you can choose to crawl only the server of each Start Address, or only the first page of each Start Address, or (my favorite) set up custom crawl settings. With custom settings you can specify Server Hops and Page Depth. These two options are not available in a SharePoint Sites Content Source.

  • Page Depth is the number of links to follow on the same host name. So for your Web Sites Content Source, if you have a Page Depth of 1, the crawler will follow the links on the start page and then stop.

  • Server Hops is the number of host name changes that the crawler will make. For example, if you have a Server Hops value of 1, a link on your site will be followed to any other host name, but links from that host to yet another host name will not be followed.

  • One additional difference is that the SharePoint Sites Content Source allows you to crawl a single WSS site collection, which is not possible in a Web Sites Content Source. That is, if you want to crawl only one site collection, you have to put its URL, such as https://myserver.com/sites/mikesitecollection, in a SharePoint Sites Content Source and select the radio button “Crawl only the SharePoint site of each start address”. If you put the same start address in a Web Sites Content Source, the crawler will go all the way to the top (https://myserver.com) and start crawling from there, because that is the default for all SharePoint content.

  • Also, the Web Sites Content Source can tell from the response header during the crawl that the start address is a SharePoint site, and then switch protocol handlers for crawling.

  • Of course, both content sources allow Full and Incremental Crawls and they both allow you to create schedules.
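
To make Page Depth and Server Hops a bit more concrete, here is a small Python sketch of how the two limits bound a crawl. This is only a toy model of the behavior described above, not how the MOSS crawler is actually implemented. The link graph, the host names and the crawl() function are all made up for illustration, and I am assuming that the depth count starts over after a hop to a new host name.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "http://intranet/": ["http://intranet/a", "http://partner/"],
    "http://intranet/a": ["http://intranet/a/b"],
    "http://intranet/a/b": [],
    "http://partner/": ["http://partner/x", "http://vendor/"],
    "http://partner/x": [],
    "http://vendor/": [],
}


def crawl(start, max_depth, max_hops, links=LINKS):
    """Return the sorted URLs reached from `start` within the two limits."""
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, page depth so far, server hops so far)
    while queue:
        url, depth, hops = queue.popleft()
        for target in links.get(url, []):
            if target in seen:
                continue  # each URL is visited once, by whichever path gets there first
            same_host = urlsplit(target).hostname == urlsplit(url).hostname
            if same_host and depth < max_depth:
                state = (target, depth + 1, hops)
            elif not same_host and hops < max_hops:
                # Assumption: the depth count starts over on a new host name.
                state = (target, 0, hops + 1)
            else:
                continue  # following this link would exceed a limit
            seen.add(target)
            queue.append(state)
    return sorted(seen)


if __name__ == "__main__":
    print(crawl("http://intranet/", max_depth=1, max_hops=0))
    print(crawl("http://intranet/", max_depth=1, max_hops=1))
```

With a Page Depth of 1 and no Server Hops, only the start page and http://intranet/a are reached. Allowing one Server Hop adds http://partner/ and http://partner/x, but http://vendor/ stays out of reach because it would need a second hop.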

Hope that helps.

Thanks
Mike