GitHub Language Correlation for Jan 1 2014 - Feb 1 2015
GitHub seems like a great place to view some statistics regarding development languages.
I wanted to see what I could find out there with regard to GitHub data.
I had a couple of choices: use their open API (which allows only a limited number of requests per day) or use Google's BigQuery engine, which happens to have a ton of GitHub data available in its list of public data sets.
So I went with Google's BigQuery engine. And it's pretty slick. Despite a lot of intermittent failures when running queries or downloading data, I was able to do some pretty cool querying.
I found some data in there (old data, from 2012) regarding language correlation. I knew I wanted to do something like that, but definitely needed something more recent than ancient 2012.
And thanks to Data Hacker MD's post on Language Use On GitHub I was able to get started.
The steps were pretty simple:
1) Get a free 60-day trial account on Google's BigQuery
2) Filter out the data I wanted, and shove that into a new "table"
3) Export the table to a very large CSV file
4) Run Data Hacker MD's Python script over the data (which uses Spearman's Rank Correlation to determine the correlation between any two languages)
5) Export the results as an SVG file (Scalable Vector Graphics) that you can see below
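To make step 4 a bit more concrete, here is a minimal sketch of how Spearman's Rank Correlation between two languages could be computed from an exported CSV. This is not Data Hacker MD's actual script; the column names (user, language, count) and the helper names are assumptions made up for illustration, and the ranking is done by hand so the example needs nothing outside the standard library.

```python
import csv
import io
from collections import defaultdict


def rank(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranked values."""
    return pearson(rank(xs), rank(ys))


def load_counts(csv_file):
    """Read user,language,count rows (assumed layout) into nested dicts."""
    counts = defaultdict(lambda: defaultdict(int))
    users = set()
    for row in csv.DictReader(csv_file):
        counts[row["language"]][row["user"]] += int(row["count"])
        users.add(row["user"])
    return counts, sorted(users)


def language_correlation(counts, users, lang_a, lang_b):
    """Correlate per-user activity in two languages (0 when a user has none)."""
    xs = [counts[lang_a].get(user, 0) for user in users]
    ys = [counts[lang_b].get(user, 0) for user in users]
    return spearman(xs, ys)


if __name__ == "__main__":
    # Tiny made-up sample, only to show the shape of the input.
    sample = io.StringIO(
        "user,language,count\n"
        "alice,PowerShell,5\nalice,TypeScript,4\n"
        "bob,PowerShell,1\nbob,TypeScript,2\n"
        "carol,Julia,9\nbob,Julia,3\n"
    )
    counts, users = load_counts(sample)
    print(language_correlation(counts, users, "PowerShell", "TypeScript"))
    print(language_correlation(counts, users, "PowerShell", "Julia"))
```

A value near +1 means people who are active in one language tend to be active in the other, a value near -1 means the opposite, and a value near 0 means there is no clear relationship either way. Running this for every pair of languages gives the grid of values that the chart colors in.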
What we have here is an answer to the question:
"If person A writes code in Language X, are they also likely to write code in Language Y?"
Or in other words:
Some interesting conclusions:
- There is a fairly strong positive correlation between PowerShell and TypeScript...very weird
- C# users aren't likely to avoid other languages (I don't see any negative correlations in the C# column...well except for Puppet)
- But Visual Basic users are a bit more insular and seem to favor certain languages over others.
- Plus, very few languages had a strong negative correlation (XSLT and Julia...)
Overall, and maybe it's just me, comparing my chart for Jan 2014 - Feb 2015 with Data Hacker MD's chart seems to show that software developers are increasingly working across multiple languages (i.e. there is less reddish coloring).