Indexing and Searching ASPX Files

Someone sent me a comment on one of last week’s postings lamenting the trouble with indexing ASPX files, particularly because MS product support says we don’t recommend searching them.  This person’s company has a lot of content in ASPX files and doesn’t want to go back and convert it all to HTML or Word files.  I don’t blame them.

Good news: indexing ASPX files is fine in many circumstances.  The default settings are there to ward off a couple of worst-case scenarios.

If you’re using ASPX files as a means for storing pages, that in and of itself is fine. If the pages are being built dynamically, that’s fine, too, except for two situations:

  1. If your ASPX file contains code that causes the page to render different content every time it’s retrieved, that could be a problem — but it might not be.  If the part of the page you need to index changes all the time, what’s in the index will always be wrong.  That would be bad.

    If, on the other hand, part of the page doesn’t change (or doesn’t change very often), and it’s that part that needs to be indexed, that’s probably fine.  For example, consider a page that displays the U.S. State Department’s traveler advisories for different countries: it could have hard-coded content such as “Traveler Advisory for Canada**” and code that fetches the current advisory over SOAP, RSS, or HTML clipping.

    In this case, (a) it’s the fact that there’s a page with traveler advisory info for Canada that matters most, and clicking the link within the search results to get to the content of the advisory is probably fine; and (b) the advisories don’t change that frequently, so even indexing the content of the advisory isn’t that risky.

    For extra credit, your ASPX page could keep the <META> tag that reports the page’s last-modified time in sync with the date on which the fetched advisory most recently changed.  That spares our index gatherer process from having to re-crawl the page every time (there’s a sketch of this after the list).

  2. If the ASPX page adapts its content to the person viewing it, it’s not a good candidate for indexing.  We index a given content source with one Windows account, ideally one with maximum privileges.  We try to retrieve the permissions for each piece of content and index those, too, so that when we return query results, they’re trimmed to display only the results you’re actually allowed to open.

    If, however, the page has only a little bit of content when you read it, but a lot of content when the index gatherer “reads” it, it could be returned in the search results as a false positive; when you click on it, the page you then try to read might not be relevant at all.

    If, on the other hand, you overcompensate by using a low-privilege account to index content, you’ll fail to record a lot of content, and pages that should match a query won’t be returned as results (a false negative condition).

    The only way any search architecture could get around this would be to execute each page with every known user’s credentials, which is completely impractical at best, non-performant in practice, and a potential security hole at worst.  We’re not going to be doing that.
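
Here’s a rough sketch of what the traveler-advisory page from scenario 1 might look like as a single-file ASPX page.  The feed URL, the RSS shape, and the control names are made up for illustration; the point is the pattern: a hard-coded heading the index can always rely on, dynamically fetched advisory text, and a last-modified value that tracks when the advisory changed rather than when the page was rendered.

    <%-- TravelAdvisory.aspx: an illustrative sketch, not production code. --%>
    <%@ Page Language="C#" %>
    <%@ Import Namespace="System.Xml" %>

    <script runat="server">
        // Hypothetical feed location; substitute whatever source you actually fetch from.
        const string AdvisoryFeedUrl = "http://example.gov/advisories/canada.xml";

        void Page_Load(object sender, EventArgs e)
        {
            // Fetch the current advisory (RSS assumed here; SOAP or HTML clipping would work the same way).
            XmlDocument feed = new XmlDocument();
            feed.Load(AdvisoryFeedUrl);

            XmlNode item = feed.SelectSingleNode("//item");
            string advisoryText = item.SelectSingleNode("description").InnerText;
            // Assumes the feed's pubDate is in a format DateTime.Parse understands (RFC 1123 dates parse fine).
            DateTime lastChanged = DateTime.Parse(item.SelectSingleNode("pubDate").InnerText);

            AdvisoryBody.Text = Server.HtmlEncode(advisoryText);

            // Report when the advisory last changed, not when this page was rendered,
            // so the index gatherer can skip the page if nothing is new.
            Response.Cache.SetLastModified(lastChanged);
            LastModifiedMeta.Text = String.Format(
                "<meta http-equiv=\"Last-Modified\" content=\"{0:R}\" />", lastChanged);
        }
    </script>

    <html>
    <head>
        <title>Traveler Advisory for Canada</title>
        <asp:Literal ID="LastModifiedMeta" runat="server" />
    </head>
    <body>
        <!-- Hard-coded content the index can always count on -->
        <h1>Traveler Advisory for Canada</h1>
        <!-- Dynamic content fetched at render time -->
        <p><asp:Literal ID="AdvisoryBody" runat="server" /></p>
    </body>
    </html>

This sets both the Last-Modified response header and a <META> tag; whether the gatherer honors the header, the tag, or both is worth verifying against your SPS version before relying on it to suppress re-crawls.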

Keep these two factors in mind, and if you can mitigate them, go into SPS’ search settings and tell it to index ASPX files with a clear conscience.

(**Don’t misinterpret this as anything but an abstract example.  I picked the U.S. State Department because I live in the United States.  I picked Canada because I’m a citizen of Canada.)