Gov2.0 and Data

An important thing happened today, and hopefully it will influence the Gov2.0 direction that the UK takes. The Cabinet Office has announced that Tim Berners-Lee is helping the UK government to be more open and accessible on the web. So aside from some kudos points for getting the inventor of the World Wide Web to help, why is this such a big thing?

His talk at TED outlines his position and is well worth watching: we need to get data onto the web in a format that lets us link it together. In his words:

‘a web for open, linked data that could do for numbers what the Web did for words, pictures, video: unlock our data and reframe the way we use it together’

There is a vast amount of data on the web, but it is designed to be read by humans rather than computers, and it gets really tricky to reverse-engineer the links back in. An example of this is the new Google Squared project, which tries to take data from the web and present it in tabular, i.e. linked, form. Try searching for US Presidents and it shows a nice list of Presidents; it all works well because it is a pretty simple query. Add an extra column of 'weight', though, and we start to hit problems: Richard Nixon is apparently 11 pounds and Harry Truman a pretty impressive 66,200 pounds.

It is a really good example of what happens if you do not link data. Without that link specified, you have to infer it, and the inference can go badly wrong: Harry Truman's weight of 66,200 pounds is actually the weight of the propeller on the US aircraft carrier 'USS Harry S. Truman', not really what we wanted.

A better example of linked data can be found at dbpedia.org, which has taken all the structured data within Wikipedia and hosted it as raw linked data. I can now query my home town of Oxford and get back raw data such as its location and the famous people who live here, as well as linked data: for example, Oxford is on the A420 and therefore linked to Swindon and Chippenham (not the most gripping example, I know…). With linked data, someone (with more inspiration than me) can mine this data and use it. Take, for example, Hans Rosling and his use of statistics at TED (again, well worth watching); this is what you can do when you start linking data.
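To make that concrete, here is a minimal sketch of pulling Oxford's raw data out of dbpedia. It queries dbpedia's public SPARQL endpoint; the Python SPARQLWrapper library and the LIMIT of 20 are just my choices to keep the example short, not anything dbpedia requires.

from SPARQLWrapper import SPARQLWrapper, JSON

# Ask dbpedia's public SPARQL endpoint for properties of the Oxford resource.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?property ?value WHERE {
        <http://dbpedia.org/resource/Oxford> ?property ?value .
    }
    LIMIT 20
""")
sparql.setReturnFormat(JSON)

# Each binding is one (property, value) pair of raw linked data.
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["property"]["value"], "->", row["value"]["value"])

Each row that comes back is a link: a property URI pointing at a value, which may itself be another resource you can follow.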

There are still issues: can you trust the data, and is it correct? For example, Oxford does not have a population of 38, as currently reported. Ideally, Wikipedia would feed off the dbpedia data and then add the text around it; that would ensure better-quality data.

Another way of doing this is through Microformats, which allow us to build context into the data on a web page so that machines can consume it. For example, compare the plain HTML:

 <span>The British Prime Minister, Gordon Brown lives at 
10 Downing Street, London SW1A 2AA</span>

with the same sentence marked up using the hCard microformat:

 <span class="vcard">The <span class="role">British Prime Minister</span>,
<span class="fn">Gordon Brown</span> lives at
<span class="adr"><span class="street-address">10 Downing Street</span>,
<span class="locality">London</span>
<span class="postal-code">SW1A 2AA</span></span></span>

In the first example a machine has to infer what the data is from the human-readable text; the second can be read easily and accurately by computers. Once rendered in a browser, both look the same to a human, but they are radically different to a computer.
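To show what that buys us, here is a minimal sketch of a machine consuming the marked-up version. It uses Python's BeautifulSoup HTML parser, which is simply my choice for the example; any HTML parser that can find elements by class would do.

from bs4 import BeautifulSoup

html = """<span class="vcard">The <span class="role">British Prime Minister</span>,
<span class="fn">Gordon Brown</span> lives at
<span class="adr"><span class="street-address">10 Downing Street</span>,
<span class="locality">London</span>
<span class="postal-code">SW1A 2AA</span></span></span>"""

# Because the hCard class names are a shared vocabulary, extraction is a
# straight lookup rather than guesswork over free text.
card = BeautifulSoup(html, "html.parser")
for field in ("fn", "role", "street-address", "locality", "postal-code"):
    print(field, "->", card.find(class_=field).get_text())

Run against the first, unmarked-up snippet, there is nothing for the parser to find; you are back to guessing which words are a name and which are an address.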

We need to publish data in a structured manner, and then build web sites that consume that data. Just building websites with the data buried in them means that everyone else has to infer what the data means.