A LINQ provider for Web queries

To start a series of "LINQ provider" posts, I am uploading a sample provider today that, in some sense, treats the Internet as a database. For a SQL Server database, you make tables accessible to LINQ by writing classes with attributes that define how objects of these classes are retrieved from table rows; LINQ can then use these classes to issue queries against the database. Similarly, this provider lets you add attributes to classes to specify how their objects are retrieved from Web pages, and you can then issue LINQ queries against them.

The project "WebLinq" in the attached solution contains this provider. It is not very sophisticated and consists of just three files:
- WebLinqAttributes.cs contains the attributes that are recognized
- WebContext.cs contains the class your WebLinq-enabled classes inherit from
- Utils.cs contains helper functions to GET / POST to a web site and to find substrings in a text.

The project "WebSources" defines some classes for:
- Searching for articles on the CiteSeer Web site (see below)
- Searching for articles in the MSDN web sites
- Translating words / sentences
- Integrating functions of one variable
- Looking up the current values of stocks from the company symbol

The project "SimpleDemos" uses these two DLLs to demonstrate the last three classes.

The project "TestWebLinq" demonstrates access to the CiteSeer Web site.

CiteSeer is a database of computer science articles; you can search for articles by keyword, obtain information about them, and often retrieve them directly from the Web site.
To use the CiteSeer demo, enter for example "Support Vector Machines" in the text box labeled "Search terms" and click the "Retrieve" button. It takes a while to visit the Web pages that list the available articles, visit the page for each article, retrieve the information from it, and access another Web page for details. You should then see a list of paragraphs, each containing:
- Author's name(s)
- Title and year
- Some three lines of introduction
- URL for this article
- URL for downloading the article as a PDF file
- Information about the rights for this article

If you are only interested in newer articles, enter 2002 in the "Publication year >=" text field and click "Retrieve" again (I currently get 3 results back).

Here is how the corresponding query looks in the code:

var doc = new GoogleCiteSeer(searchTerms, 0);
var query = from art in doc.Articles
            where art.details.Document != null
               && art.details.Document.bibtex != null
               && art.details.Document.bibtex.year >= minYear
            select art.details;

Here is an example of a class that defines how to read the "BibTeX" part of the Web page with the details for an article:

public class CsBibTex {
    // Each field is filled with the page text found between its start and end markers.
    [StartPart("author = \"")] [EndPart("\"")] public string author;
    [StartPart("title = \"")] [EndPart("\"")] public string title;
    [StartPart("year = ")] [EndPart(",")] public int year;
}
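To give a feel for how such attributes could drive the extraction, here is a minimal sketch. It is not the code from the WebLinq project: the attribute definitions, their Marker property, and the PartReader helper are all assumptions made up for illustration, and the real WebLinqAttributes.cs and WebContext.cs may work quite differently. The sketch uses reflection to find each attributed field, cuts out the text between the two markers, and converts it to the field's type.

```csharp
using System;
using System.Globalization;
using System.Reflection;

// Hypothetical stand-ins for the attributes in WebLinqAttributes.cs.
[AttributeUsage(AttributeTargets.Field)]
public sealed class StartPartAttribute : Attribute
{
    public string Marker { get; }
    public StartPartAttribute(string marker) { Marker = marker; }
}

[AttributeUsage(AttributeTargets.Field)]
public sealed class EndPartAttribute : Attribute
{
    public string Marker { get; }
    public EndPartAttribute(string marker) { Marker = marker; }
}

public class CsBibTex
{
    [StartPart("author = \"")] [EndPart("\"")] public string author;
    [StartPart("title = \"")] [EndPart("\"")] public string title;
    [StartPart("year = ")] [EndPart(",")] public int year;
}

// Hypothetical helper, not part of the original sample.
public static class PartReader
{
    // Fills every public field that carries both attributes with the text
    // between its start and end markers, converted to the field's type.
    // Throws FormatException if the text cannot be converted (e.g. a non-numeric year).
    public static T Read<T>(string text) where T : new()
    {
        object result = new T(); // boxed so that SetValue also works for structs
        foreach (FieldInfo field in typeof(T).GetFields())
        {
            var start = field.GetCustomAttribute<StartPartAttribute>();
            var end = field.GetCustomAttribute<EndPartAttribute>();
            if (start == null || end == null) continue;

            int from = text.IndexOf(start.Marker, StringComparison.Ordinal);
            if (from < 0) continue; // start marker missing: keep the default value
            from += start.Marker.Length;

            int to = text.IndexOf(end.Marker, from, StringComparison.Ordinal);
            if (to < 0) continue;   // end marker missing: keep the default value

            string raw = text.Substring(from, to - from).Trim();
            field.SetValue(result,
                Convert.ChangeType(raw, field.FieldType, CultureInfo.InvariantCulture));
        }
        return (T)result;
    }
}
```

With this sketch, calling PartReader.Read<CsBibTex>(pageText) on a page whose text contains a line such as `year = 2003,` would leave 2003 in the year field; fields whose markers are not found keep their default values. Taking the first occurrence of each marker is the simplest possible choice and would need refinement for pages where a marker can appear more than once.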

This sample code is provided as-is and does not come with any warranty.
You can modify and use the code for commercial and non-commercial purposes.