Web Scraping in C#

Guest Post by Ivan Lukianchuk

Ivan Lukianchuk is a seasoned startup founder and award-winning pitch artist turned consultant who currently runs Strattenburg Inc. He's a full-stack .NET developer who has done a lot of work with Azure over the years, building entire solutions from the ground up for clients ranging from independent professionals to funded startups to $50MM organizations. You can follow Ivan on Twitter: @Chronos or contact him via email at ivan@strattenburg.com

In my last post I talked about playing with APIs and how great they are for getting information from awesome sites like GitHub. Those sites are friendly enough to provide APIs so we can access their data, but other sites decide not to be so friendly, and we bump up against a brick wall. This happened to me recently while working on a new startup idea, so I decided to dive into web scraping. It was something I had always wanted to avoid, as it seemed messy, ugly and full of complicated regexes.

I spent a lot of time trying to find the best and easiest methods to do what I wanted, and for the longest time I was convinced I’d have to leave my beloved .NET platform to do it well! Luckily, after enough Googling and testing, I came across a project called AngleSharp. I believe I found it early on, but maybe I didn’t understand exactly what it was for, or I was looking at the problem through a different lens. The problem with screen scraping, legalities and terms of service aside, is that there are a few different ways to skin the proverbial cat, and many of the resources you find online in the .NET world seem a little stagnant or outdated, which always scares me when learning something new.

I actually found a great blog post by another intrepid explorer who evaluated some of the .NET solutions out there and even gave some stats on which performed best. His conclusion was also AngleSharp.

AngleSharp is pretty awesome, and I’m only just scratching the surface. It allows you to pull in a web page and then query it using either LINQ or CSS selectors! Now you can put your jQuery-fu to work and forget about complicated regexes or doing things manual style. You can even interact with JavaScript, although I haven’t touched that myself yet.
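To make the "LINQ or CSS selectors" point concrete, here is a minimal sketch that runs the same query both ways. It assumes the AngleSharp NuGet package and C# top-level statements; the markup, class names and URLs are made up for illustration, and the page is parsed from a string so nothing is fetched over the network.

```csharp
// Sketch only: requires the AngleSharp NuGet package and C# 9+ top-level statements.
using System;
using System.Linq;
using AngleSharp;
using AngleSharp.Dom;

// Hypothetical markup, parsed from a string so the sketch needs no network access.
const string html = @"<ul>
  <li class='company'><a href='https://a.example'>Acme</a></li>
  <li class='company'><a href='https://b.example'>Bolt</a></li>
  <li class='person'><a href='https://c.example'>Carol</a></li>
</ul>";

var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

// CSS selector, jQuery style: anchors directly inside list items with class "company".
var viaCss = document.QuerySelectorAll("li.company > a")
    .Select(a => a.TextContent)
    .ToList();

// LINQ over the DOM: the same filter written with Where/Select on every element.
var viaLinq = document.All
    .Where(e => e.LocalName == "a"
             && e.ParentElement?.ClassList.Contains("company") == true)
    .Select(e => e.TextContent)
    .ToList();

Console.WriteLine(string.Join(", ", viaCss));
Console.WriteLine(string.Join(", ", viaLinq));
```

Both lists should contain the same two company names. The selector version is usually the shorter one, while the LINQ version is handy when the filter depends on logic a selector cannot express.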

In my initial searches, I thought I might need to log in or control a headless browser, so I spent a lot of time looking into PhantomJS, but found it had a steep learning curve and was just frustrating to figure out and use. I even found a good .NET port for it here. Luckily, with AngleSharp you can actually control a headless browser and do form submissions, whoopee! The docs are fairly good, although not as fleshed out as I’d like, but everything is in active development, with several other tie-in projects that are currently in alpha. Googling for AngleSharp tutorials will turn up other resources on sites like CodeProject. While dated, they help explain a few more beginner concepts in greater depth.

Getting started is pretty simple:

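    // Needs the AngleSharp NuGet package and must run inside an async method.
    // Assumes: using AngleSharp; using AngleSharp.Dom; (Text() is an extension method)
    var configuration = AngleSharp.Configuration.Default.WithDefaultLoader();
    var context = BrowsingContext.New(configuration);
    await context.OpenAsync("http://yoursite.com");
    // First element matching the CSS selector, or null if nothing matches.
    var contentSearch = context.Active.QuerySelector("#name");
    // Company is your own class with Name and Url string properties.
    var company = new Company();
    company.Name = contentSearch.Text();
    company.Url = contentSearch.GetAttribute("href");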

Adding some try/catch and null checks is always helpful, but this gets you in there pretty quickly! You open a page in the “context” and then you query it! “Active” is the document currently loaded in that context, so you are querying the live DOM and not just a static grab of the markup. You can do other things too, like making adjustments and additions to the HTML!
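As a rough idea of what those null checks and try/catch might look like, here is a minimal sketch. `ExtractLink` is a hypothetical helper, not part of AngleSharp, and the markup is invented and parsed from a string. The sketch also assumes AngleSharp reports a selector it cannot parse by throwing `DomException`, so treat that catch clause as an assumption to verify against the version you use.

```csharp
// Sketch only: requires the AngleSharp NuGet package and C# 9+ top-level statements.
using System;
using AngleSharp;
using AngleSharp.Dom;

// Hypothetical markup standing in for a fetched page.
const string html = "<a id='name' href='https://acme.example'> Acme Inc. </a>";

var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

var found = ExtractLink(document, "#name");
var missing = ExtractLink(document, "#does-not-exist");

Console.WriteLine(found?.Name ?? "(nothing found)");
Console.WriteLine(missing?.Name ?? "(nothing found)");

// Hypothetical helper: returns null instead of throwing when the data is not there.
static (string Name, string Url)? ExtractLink(IDocument document, string selector)
{
    try
    {
        // QuerySelector returns null when no element matches.
        var element = document.QuerySelector(selector);
        if (element is null) return null;

        // GetAttribute returns null when the attribute is absent.
        var url = element.GetAttribute("href");
        if (string.IsNullOrEmpty(url)) return null;

        return (element.TextContent.Trim(), url);
    }
    catch (DomException)
    {
        // Assumption: raised for a selector that cannot be parsed.
        return null;
    }
}
```

Returning a nullable tuple keeps the calling code to a single null check per scraped item, which helps once you are pulling dozens of fields from pages whose layout you do not control.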

There is more power under the hood, but this covered 98% of my needs and got me rolling! Repeat the process for each page and selector you care about, and you’ve pulled all the info you need and can go along your merry way!

Web scraping doesn’t need to be scary anymore and no data can now evade your grasp, unless it’s in Flash! Happy hunting!