Exploring data with F# type providers
Guest post by Thomas Denny, Microsoft Student Partner at the University of Oxford
Hi, I am currently studying Computer Science at the University of Oxford.
I am the president (2017-18) and was the secretary (2016-17) of the Oxford University Computer Society, and a member of the Oxford University History and Cross Country societies. I also lead a group of Microsoft Student Partners at the university to run talks and hackathons.
F# is an incredibly flexible language, and amongst its many benefits is the ability to use type providers to access and manipulate data from external sources. A type provider allows you to create a .NET type at compile time without the need to declare the type in code - this facility is not dissimilar to LISP's macro features. In F# you might use a type provider in place of code generation, e.g. for writing wrapper types for a database schema. In this article we use a web page to generate a type that we then use for extracting data from other similar pages, and then we look at how to extract data from a CSV file.
So long as you have F# and NuGet installed you can follow this guide using any editor, but you can make your experience a little easier by also installing Visual Studio Code and the Ionide F# plugin. This plugin has several useful features, but the most useful are its IntelliSense and type annotations features, which are even available for types created by a type provider!
Visual Studio Code
To install the F# Data package, run the following in the NuGet Package Manager Console:
PM> Install-Package FSharp.Data -Version 2.3.3
Parsing and consuming data from HTML is traditionally a heavy task requiring a large amount of code; often a task as simple as extracting the column names of a table will require dozens of lines of code.
We're going to take a look at a simple problem: each year the cast and crew members of a film will often win several different awards (e.g. Academy Award, Golden Globe), and we would like to find the names of the cast or crew members that won the most awards for that particular film.
To start off with, we'll take a look at the accolades received by Spotlight, 2016's Best Picture winner at the Oscars. The results are presented in a table on the film's Wikipedia accolades page.
First, we need to use the HTML type provider to create a new type based on this page. Create a new file called awards.fsx (an F# script):
#r "FSharp.Data.2.3.3/lib/net40/FSharp.Data.dll"
open FSharp.Data
type AccoladeData = HtmlProvider<"https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)">
Next, we have to request the data for that specific page:
let spotlightData = AccoladeData.Load("https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)")
spotlightData is an object of type AccoladeData, which has the properties Tables and Lists; these are standard across all types created by the HTML type provider. However, the properties available on each of these vary with the page that the type was generated from. In our case, the Tables property has an Accolades property, which contains the table data from the page. If you use the Ionide plugin with Visual Studio Code, as described above, you can see this in the IntelliSense suggestions.
Collecting the results together can be done in a few lines of F#. We need to do the following:
- Filter out any results that were not wins
- Group results by the winner
- Count the number of wins for each winner
- Sort the winners by number of wins
This can be done as a simple F# function that takes the loaded page data as an argument:
let awardNumbers (data: AccoladeData) =
    data.Tables.Accolades.Rows
    |> Seq.filter (fun row -> row.Result = "Won")
    |> Seq.groupBy (fun row -> row.``Recipient(s) and nominee(s)``)
    |> Seq.map (fun (person, awards) -> (person, Seq.length awards))
    |> Seq.sortByDescending (fun (person, count) -> count)
Each table row is also of a type constructed by the type provider, and it will have properties for each column (e.g. the result, the recipient, etc). Finally, we can print the results:
for (person, count) in awardNumbers spotlightData do
    printfn "%s,%d" person count
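The filter, group, count, and sort steps can also be tried without any network access. The following is a minimal sketch that runs the same pipeline over an in-memory list. The AwardRow record, the countWins function, and the sample names are all hypothetical stand-ins for the row type that the type provider generates and for the real table contents.

```fsharp
// Hypothetical stand-in for a table row; the real row type is generated
// by the HTML type provider and is not declared in code.
type AwardRow = { Recipient: string; Result: string }

// Made-up sample data, for illustration only.
let sampleRows =
    [ { Recipient = "Alice"; Result = "Won" }
      { Recipient = "Bob"; Result = "Nominated" }
      { Recipient = "Alice"; Result = "Won" }
      { Recipient = "Bob"; Result = "Won" }
      { Recipient = "Carol"; Result = "Nominated" } ]

let countWins (rows: seq<AwardRow>) =
    rows
    |> Seq.filter (fun row -> row.Result = "Won")                 // keep wins only
    |> Seq.groupBy (fun row -> row.Recipient)                     // group by winner
    |> Seq.map (fun (person, wins) -> (person, Seq.length wins))  // count wins
    |> Seq.sortByDescending snd                                   // most wins first
    |> Seq.toList

let wins = countWins sampleRows

for (person, count) in wins do
    printfn "%s,%d" person count
```

Because the pipeline only relies on each row exposing a result and a recipient, the same shape carries over to the provider-generated rows.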
Whilst this example is interesting for a single page, what about other pages with the same table of data? Simply by changing the URL that we load from we can also print the same results for another film:
let moonlightData = AccoladeData.Load("https://en.wikipedia.org/wiki/List_of_accolades_received_by_Moonlight_(2016_film)")
for (person, count) in awardNumbers moonlightData do
    printfn "%s,%d" person count
Finally, we could then collect this data for several films at once in parallel and then print the results for each film:
let urls =
    [ "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)"
      "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Moonlight_(2016_film)"
      "https://en.wikipedia.org/wiki/List_of_accolades_received_by_La_La_Land_(film)" ]

let allMovies =
    urls
    |> Seq.map AccoladeData.AsyncLoad
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.map awardNumbers

for movie in allMovies do
    for (p,c) in movie do
        printfn "%s,%d" p c
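To see how the parallel step behaves in isolation, here is a small sketch that swaps the real page load for a stub. The fakeLoad function and the example.org URLs are assumptions made for illustration; fakeLoad simply sleeps briefly and returns the URL with its length, standing in for AccoladeData.AsyncLoad.

```fsharp
// Hypothetical stand-in for AccoladeData.AsyncLoad: it sleeps briefly
// instead of making a network request.
let fakeLoad (url: string) : Async<string * int> =
    async {
        do! Async.Sleep 50
        return (url, url.Length)
    }

// Placeholder URLs, not real data sources.
let sampleUrls = [ "https://example.org/a"; "https://example.org/bb" ]

let results =
    sampleUrls
    |> Seq.map fakeLoad        // build one async computation per URL
    |> Async.Parallel          // combine them into a single Async<_[]>
    |> Async.RunSynchronously  // block until all of them have finished

for (url, length) in results do
    printfn "%s,%d" url length
```

Async.Parallel returns its results as an array in the same order as the input computations, so each result can be matched back to the URL it came from.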
Extracting data from CSVs
The F# Data package also provides a type provider for CSV files. Much like the HTML provider, you can also access all the column names as properties. Here's a simple example that extracts data from the British Government's list of MOT testing stations:
let [<Literal>] MOTUrl = "https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/613984/active-mot-testing-stations.csv"

// No need to specifically declare a type from the type provider if we are
// loading from one source
let data = new CsvProvider<MOTUrl>()

let stationsPerArea =
    data.Rows
    // Once again, column headers are the properties
    |> Seq.groupBy (fun row -> row.``VTS Address Line 4``)
    |> Seq.map (fun (location, rows) -> (location, Seq.length rows))
    |> Seq.sortBy (fun (location, count) -> count)

for (area, count) in stationsPerArea do
    printfn "%s,%d" area count
This is just a small glimpse of what you can do with F# type providers - the F# Data package also includes type providers for JSON files, for example.
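As a rough illustration of the JSON provider, the sketch below infers a type from an inline sample document and then parses a second document of the same shape. Both JSON documents and the Person type name are made up for this example, and the #r path assumes the same package layout used for awards.fsx earlier.

```fsharp
#r "FSharp.Data.2.3.3/lib/net40/FSharp.Data.dll"
open FSharp.Data

// The inline sample is made up; the provider infers a string property
// Name and an integer property Age from it.
type Person = JsonProvider<""" { "name": "Ada", "age": 36 } """>

// Parse another (also made-up) document with the same shape.
let person = Person.Parse(""" { "name": "Grace", "age": 45 } """)

printfn "%s,%d" person.Name person.Age
```

As with the HTML and CSV providers, the property names come from the sample data rather than from a type declared by hand.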
Try F# Online, and learn more about F# at Microsoft Research F#: https://www.microsoft.com/en-us/research/project/f-at-microsoft-research/#