Web-scraping with VB's XML support

There was an interesting article about using VB's XML support for generating HTML: http://www.infoq.com/news/2009/02/MVC-VB.

I've been using VB and XML for the reverse purpose -- scraping web pages to retreive information. I enjoy sailing, and I wanted to find historical data on windspeeds to know when would be the best time of year to set out on a long trip. (Answer: March and April have the best winds around Seattle).

I found an excellent site, to scrape from, http://www.almanac.com/, which has historical weather data for many places around the country. The first step in scraping is copyright law. Facts alone are not copyrightable, but the act of selecting and compiling facts is a creative work and so the compilation is copyrightable. Hence, for instance, a telephone directory is protected by copyright. So too is almanac.com's compilation. And that's why I only scraped their pages for my own personal use.

The almanac has URLs like this: http://www.almanac.com/weatherhistory/oneday.php?number=994014&wban=99999&day=1&month=4&year=2008&searchtype=. It's easy to see what the format is, and generate similar URLs myself.


The code to parse XHTML:

I looked at the HTML source code of a page from the almanac in Notepad, figured out its structure, and wrote some simple XML queries to dig into it. (Note: the function "Fetch" fetches HTML pages from the web, but converts them into XHTML ready for VB XML queries. More on that later). Here's the VB code. I highlighted the XML queries.


Option Strict On

Imports System.Net

Imports System.IO

Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">

Module Module1

Dim Places As Integer() = {994014}

Dim Years As Integer() = {2008}

Dim Months As Integer() = {4, 5}

Sub Main()

Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, "Date (Y/M/D)", "Location", "Temp (^F)", "Precipitation (in)", "Visibility (miles)", "Wind Mean (mph)", "Wind Sustained (mph)", "Wind Gust (mph)")

For Each year As Integer In Years

For Each month As Integer In Months

Dim d = New DateTime(year, month, 1)

Dim dnm = New DateTime(If(d.Month = 12, d.Year + 1, d.Year), If(d.Month = 12, 1, d.Month + 1), d.Day)

Dim lastDay = CInt((dnm - d).TotalDays)

For day As Integer = 1 To lastDay

For Each place As Integer In Places

Dim url = String.Format("http://www.almanac.com/weatherhistory/oneday.php?number={0}&wban=99999&day={1}&month={2}&year={3}&searchtype=", place, day, month, year)

Dim fn = Fetch(url)

Dim xml = XElement.Load(fn)

Dim body = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "yui-u first").FirstOrDefault

If body Is Nothing Then Continue For

Dim title = body.<xhtml:h2>.Value.ToString.Replace(",", " ")

If title.ToLower.StartsWith("no data") Then Continue For

Dim temp, precipitation, visibility, windMean, windSustained, windGust As Double?

Dim data = From i In body...<xhtml:td>

For Each td In data

Dim text = td.<xhtml:p>.FirstOrDefault

If text Is Nothing Then Continue For

Dim category = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower

text = td.<xhtml:b>.FirstOrDefault

If text Is Nothing Then Continue For

Dim svalue = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower

Dim value = 0.0 : If Not Double.TryParse(svalue, value) Then Continue For

If category Like "mean temperature" Then temp = value

If category Like "total precipitation" Then precipitation = value

If category Like "visibility" Then visibility = value

If category Like "mean wind speed" Then windMean = value

If category Like "maximum sustained" Then windSustained = value

If category Like "maximum gust" Then windGust = value


Dim s = String.Format("{0:0000}/{1:00}/{2:00}", year, month, day)

Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, s, title, temp, precipitation, visibility, windMean, windSustained, windGust)





End Sub

End Module

Fetching pages: HTML into XHTML

Goal: to use VB's XML support for reading the web page. That's because VB has such nice syntax (I find it easier than xpath, or beautiful soup, or the alternatives). The problem is that most web-pages are written in a sloppy kind of HTML that might render okay but certainly can't be loaded into XElement.Load.

Solution: download Tidy, an awesome open-source library and executable for, well, tidying HTML into proper XHTML. I downloaded "tidy.exe" and put it into my windows directory, so I could execute it without messing around with the path.

The above code calls a function "Fetch". This is the one that fetches pages, and invokes "tidy" to clean up the html. Here is the implementation of Fetch. It uses a function "InputAndOutputToEnd" to redirect input and output of tidy.exe when it runs it. I wrote about InputAndOutputToEnd last month.


Module Helpers

''' <summary>

''' GetAttr: x.GetAttr("attr") is equivalent to x.@attr. It's here to work around a MONO bug: MONO

''' will throw an exception on x.@attr if the attribute is absent; the CLR doesn't. This function

''' also doesn't throw.

''' </summary>

<System.Runtime.CompilerServices.Extension()> Function GetAttr(ByVal e As XElement, ByVal attr As String) As String

If e Is Nothing Then Return ""

For Each a In e.Attributes

If String.Compare(attr, a.Name.LocalName, True) = 0 Then Return a.Value


Return ""

End Function

''' <summary>

''' Fetch: this function fetches the given Url and saves it into a cache in a temporary directory.

''' It returns the filename. If the Url had given back "text/html", then this function invokes

''' "tidy.exe" (from http://tidy.sourceforge.net/) to turn the html into valid XHTML such as can

''' be read with XElement.Load. The function will throw an exception if anything bad happened,

''' e.g. WebException or BadUriException. If asked to fetch a url but this url had already been downloaded

''' previously, and the previous download was no more than "CacheAtLeastDays" old and hadn't

''' been deleted, then the previous download is used. The idea is that our program might well hammer

''' web-services, and we don't want to be too cruel on them, so even if they didn't specify caching

''' for a page then we might still want to cache it. (If the webservice specified a cache longer than

''' CacheAtLeastDays, then any number of internet proxies along the way might cache it, and so

''' CacheAtLeastDays is a minimum rather than a maximum.) This function is not protected against

''' multiple threads calling it. There might be contention if multiple threads call it and try to

''' download and write to the same file. Note: in the cache, URLs are escaped then truncated to 240

''' characters. So if they were longer than that (e.g. long query strings) then there'll be cache

''' conflicts and the wrong data might be returned.

''' </summary>

Function Fetch(ByVal Url As String, Optional ByVal CacheAtLeastDays As Double = 7) As String

Dim dir = IO.Path.GetTempPath & My.Application.Info.AssemblyName & "\fetch"

If Not Directory.Exists(dir) Then Directory.CreateDirectory(dir)

' Note: if the directory already existed, then CreateDirectory just proceeds silently without fuss.

Dim fn = dir & "\" & Uri.EscapeDataString(Url.Replace("http://", "").Replace("/", "_")).Replace("%", "#")

' MONO: If you try to XElement.Load(fn) where fn includes %escapes, then it tries to unescape them.

' So we make sure there are no %escapes in the filename. (CLR doesn't have this quirk.)

fn = fn.Substring(0, Math.Min(240, fn.Length))

' MONO on unix: is fine so long as every directory/filename component is <=240 characters.

' CLR on windows: requires the entire path "fn" to be <=240 characters.

' http://blogs.msdn.com/bclteam/archive/2007/02/13/long-paths-in-net-part-1-of-3-kim-hamilton.aspx

If File.Exists(fn) Then

Dim age = DateTime.Now - File.GetLastWriteTime(fn)

If age.TotalDays <= CacheAtLeastDays Then Return fn


End If

Dim x = WebRequest.Create(Url)

Using r = x.GetResponse

Dim t = ""

Using rs As New StreamReader(r.GetResponseStream)

t = rs.ReadToEnd

End Using

If Not r.ContentType.StartsWith("text/html") Then

My.Computer.FileSystem.WriteAllText(fn, t, False, Text.Encoding.UTF8)

Return fn

End If

Using tidy As New System.Diagnostics.Process

Dim cmd = "tidy"

Dim args = "-asxml -numeric -quiet --doctype omit"

' MONO: XElement.Load throws an exception if DOCTYPE is present. CLR doesn't. Hence we omit the DOCTYPE.

tidy.StartInfo.FileName = cmd

tidy.StartInfo.Arguments = args

tidy.StartInfo.UseShellExecute = False

tidy.StartInfo.RedirectStandardInput = True

tidy.StartInfo.RedirectStandardOutput = True

tidy.StartInfo.RedirectStandardError = True


Dim err = "", op = ""

tidy.InputAndOutputToEnd(t, op, err)


If tidy.HasExited Then

' We had already asked ("-numeric") for tidy to escape non-ascii characters. But

' nonetheless, XElement.Load will throw an exception if there are any, and we really

' don't want that, so we'll do belt-and-braces here:

Dim op2 As New Text.StringBuilder(op.Length)

For i = 0 To op.Length - 1

Dim c = AscW(op(i))

If (c >= 32 AndAlso c < 127) OrElse c = 13 OrElse c = 10 OrElse c = 9 Then


End If


My.Computer.FileSystem.WriteAllText(fn, op2.ToString, False, Text.Encoding.ASCII)

Return fn

End If



End Using

End Using

Return ""

End Function

''' <summary>

''' InputAndOutputToEnd: Given a started process, this lets you supply a string as input if you want,

''' and will read all output and error to the end. This function has no timeout: if we give it an input string

''' but the process fails to read it to completion, or if we ask for standard-output/error but the process

''' fails to close these streams, then the function will block indefinitely. The function will throw

''' an exception if there was an error reading from the streams. The caller is expected to have started

''' the process before calling the function, and the caller is expected to wait for the process to close

''' and to dispose of it afterwards. If the caller uses this function, then the caller should do no

''' other input/output to the process.

''' </summary>

<Runtime.CompilerServices.Extension()> Sub InputAndOutputToEnd(ByVal p As Diagnostics.Process, ByVal StandardInput As String, ByRef StandardOutput As String, ByRef StandardError As String)

If p Is Nothing Then Throw New ArgumentException("process must be non-null", "p")

' Assume p has started. Alas there's no way to check.

If p.StartInfo.UseShellExecute Then Throw New ArgumentException("Set StartInfo.UseShellExecute to false")

If (p.StartInfo.RedirectStandardInput <> (StandardInput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Input only when StartInfo.RedirectStandardInput")

If (p.StartInfo.RedirectStandardOutput <> (StandardOutput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Output only when StartInfo.RedirectStandardOutput")

If (p.StartInfo.RedirectStandardError <> (StandardError IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Error only when StartInfo.RedirectStandardError")


' MSDN notes, http://msdn.microsoft.com/en-us/library/system.diagnostics.processstartinfo.redirectstandardoutput.aspx,

' that "Synchronous read operations introduce a dependency between the caller reading from the StandardOutput stream

' and the child process writing to that stream. These dependencies can cause deadlock conditions." We avoid the deadlock

' by running in a separate thread.


Dim outputData As New InputAndOutputToEndData

Dim errorData As New InputAndOutputToEndData


If p.StartInfo.RedirectStandardOutput Then

outputData.Stream = p.StandardOutput

outputData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)


End If

If p.StartInfo.RedirectStandardError Then

errorData.Stream = p.StandardError

errorData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)


End If


If p.StartInfo.RedirectStandardInput Then



End If


If p.StartInfo.RedirectStandardOutput Then outputData.Thread.Join() : StandardOutput = outputData.Output

If p.StartInfo.RedirectStandardError Then errorData.Thread.Join() : StandardError = errorData.Output

If outputData.Exception IsNot Nothing Then Throw outputData.Exception

If errorData.Exception IsNot Nothing Then Throw errorData.Exception

End Sub

Private Class InputAndOutputToEndData

Public Thread As Threading.Thread

Public Stream As IO.StreamReader

Public Output As String

Public Exception As Exception

End Class

Private Sub InputAndOutputToEndProc(ByVal data_ As Object)

Dim data = DirectCast(data_, InputAndOutputToEndData)

Try : data.Output = data.Stream.ReadToEnd : Catch e As Exception : data.Exception = e : End Try

End Sub

End Module