Querying 10 years of GitHub data with GHTorrent and Libraries.io

There is a way to explore GitHub data without any local infrastructure using open source datasets.

I’m always on the lookout for new datasets that we can use to show off the power of my team's work. CHAOSSEARCH turns your Amazon S3 object storage data into a fully searchable Elasticsearch-like cluster. With the Elasticsearch API or tools like Kibana, you can then query whatever data you find.

I was excited to find the GHTorrent project. GHTorrent aims to build an offline version of all data available through the GitHub APIs. If datasets are your thing, this project is worth checking out, and you might even consider donating one of your GitHub API keys.

Accessing GHTorrent data

There are many ways to gain access to and use GHTorrent’s data, which is available in NDJSON format. The project does a great job of making the data available in multiple forms, including CSV for restoring into a MySQL database, MongoDB dumps of all objects, and Google BigQuery (free) for exporting data directly into Google’s object storage. There is one caveat: the dataset is nearly complete from 2008 to 2017 but less complete from 2017 to today. That limits how confidently we can draw conclusions from queries, but it is still an exciting amount of information.

I chose Google BigQuery to avoid running any database myself, so I was quickly able to download a full corpus of data including users and projects. CHAOSSEARCH can natively analyze the NDJSON format, so after uploading the data to Amazon S3 I was able to index it in just a few minutes. The CHAOSSEARCH platform doesn’t require users to set up index schemas or define mappings for their data, so it discovered all of the fields (strings, integers, etc.) itself.
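To make the NDJSON format and the idea of field discovery concrete, here is a minimal Python sketch. It is not how CHAOSSEARCH works internally (the article doesn't describe that); it only shows that NDJSON is one JSON object per line and that field names and value types can be read off the records. The two sample records and the `discover_fields` helper are invented for illustration.

```python
import json
from collections import defaultdict

# Two made-up NDJSON records (one JSON object per line); not real GHTorrent rows.
SAMPLE_NDJSON = """\
{"id": 1, "name": "alpha", "language": "JavaScript", "created_at": "2012-03-01T10:00:00Z"}
{"id": 2, "name": "beta", "language": null, "forked_from": 1}
"""


def discover_fields(ndjson_text):
    """Map each top-level field name to the sorted list of value types seen for it."""
    seen = defaultdict(set)
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


fields = discover_fields(SAMPLE_NDJSON)
for name, types in sorted(fields.items()):
    print(name, types)
```

A field that is missing or null in some records, like `language` here, is exactly the kind of case the later queries exclude with an empty-string `must_not` clause.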

With my data fully indexed and ready for search and aggregation, I wanted to dive in and see what insights I could find, such as which programming languages are the most popular for GitHub projects.

(A note on formatting: this is a valid JSON query, compacted onto a single line to avoid scroll fatigue. To format it properly, copy it locally and pipe it through a command-line utility like jq.)

{"aggs":{"2":{"terms":{"field":"root.language","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["root.created_at","root.updated_at"],"query":{"bool":{"must":[],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"root.language":{"query":""}}}]}}}
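If jq isn't installed, Python's standard `json` module can do the same reformatting. The sketch below pretty-prints a shortened fragment of the query (only the `size` and `must_not` parts, to keep the output small); the variable names are illustrative.

```python
import json

# A shortened fragment of the compact query above, used as sample input.
compact = '{"size":0,"query":{"bool":{"must_not":[{"match_phrase":{"root.language":{"query":""}}}]}}}'

parsed = json.loads(compact)           # raises ValueError if the JSON is invalid
pretty = json.dumps(parsed, indent=2)  # re-serialize with two-space indentation
print(pretty)
```

From a shell, `python -m json.tool query.json` gives a similar result for a query saved to a file (the file name here is hypothetical).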

 

This result is of little surprise to anyone who’s followed the state of open source languages over recent years.

JavaScript is still the reigning champion, and while some believe JavaScript is on its way out, it remains the 800-pound gorilla and is likely to stay that way for some time. Java faces similar rumors, yet this data shows that it remains a major part of the open source ecosystem.

Given the popularity of projects like Docker and Kubernetes, you might be wondering, “What about Go (Golang)?” This is a good time for a reminder that the GitHub dataset discussed here contains some gaps, most significantly after 2017, which is about when I saw Golang projects popping up everywhere. I hope to repeat this search with a complete GitHub dataset and see if it changes the rankings at all.

Now let's explore the rate of project creation. (Reminder: this is valid JSON consolidated for readability.)

{"aggs":{"2":{"date_histogram":{"field":"root.created_at","interval":"1M","time_zone":"America/New_York","min_doc_count":1}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["root.created_at","root.updated_at"],"query":{"bool":{"must":[],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"root.language":{"query":""}}}]}}}
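For readers unfamiliar with `date_histogram`, the following local sketch mimics what the aggregation computes: a count of records per calendar month of `created_at`. The timestamps are invented, the `monthly_counts` helper is hypothetical, and the sketch ignores the `time_zone` shift that the real query applies.

```python
from collections import Counter
from datetime import datetime

# Invented creation timestamps standing in for the root.created_at field.
created_at = [
    "2012-01-15T08:30:00",
    "2012-01-31T23:59:59",
    "2012-02-01T00:00:00",
    "2012-04-10T12:00:00",
]


def monthly_counts(timestamps):
    """Count timestamps per calendar month, keyed as 'YYYY-MM'."""
    buckets = Counter()
    for ts in timestamps:
        buckets[datetime.fromisoformat(ts).strftime("%Y-%m")] += 1
    # Months with no records simply don't appear, similar to min_doc_count: 1.
    return dict(sorted(buckets.items()))


counts = monthly_counts(created_at)
print(counts)
```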

 

The rate at which new projects are created is impressive as well, with tremendous growth starting around 2012.

Now that I knew the rate of projects created as well as the most popular languages used to create these projects, I wanted to find out what open source licenses these projects chose. Unfortunately, this data doesn’t exist in the GitHub projects dataset, but the fantastic team over at Tidelift publishes a detailed list of GitHub projects, licenses used, and other details regarding the state of open source software in their Libraries.io data. Ingesting this dataset into CHAOSSEARCH took just minutes, letting me see which open source software licenses are the most popular on GitHub:

(Reminder: this is valid JSON consolidated for readability.)

{"aggs":{"2":{"terms":{"field":"Repository License","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["Created Timestamp","Last synced Timestamp","Latest Release Publish Timestamp","Updated Timestamp"],"query":{"bool":{"must":[],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"Repository License":{"query":""}}}]}}}

 

The results show some significant outliers:

As you can see, the MIT license and the Apache 2.0 license far outweigh most of the other open source licenses used for these projects, while various BSD and GPL licenses follow far behind. I can’t say that I’m surprised by these results given GitHub’s open model. I would guess that users, not companies, create most projects and that they use the MIT license to make it simple for other people to use, share, and contribute. That Apache 2.0 licensing is right behind also makes sense, given just how many companies have an open source component to their businesses and want to ensure their trademarks are respected.

Now that I had identified the most popular licenses, I was curious to see the least used ones. By adjusting my last query, I reversed the top 10 into the bottom 10 and found just two projects using the University of Illinois/NCSA Open Source License. I had never heard of this license before, but it’s pretty close to Apache 2.0. It’s interesting to see just how many different software licenses are in use across all GitHub projects.
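The adjusted query isn't shown, so the following is only a guess at the change: flipping the terms aggregation's `order` from descending to ascending count. The sketch uses a trimmed copy of the license query, and `reverse_order` is a hypothetical helper.

```python
import copy
import json

# Trimmed copy of the license query: only the aggregation and size are kept.
top_licenses = json.loads(
    '{"aggs":{"2":{"terms":{"field":"Repository License","size":10,'
    '"order":{"_count":"desc"}}}},"size":0}'
)


def reverse_order(query, agg_name="2"):
    """Return a copy of the query whose terms aggregation sorts by ascending count."""
    flipped = copy.deepcopy(query)  # leave the original query untouched
    flipped["aggs"][agg_name]["terms"]["order"] = {"_count": "asc"}
    return flipped


bottom_licenses = reverse_order(top_licenses)
print(json.dumps(bottom_licenses, separators=(",", ":")))
```

One caution: Elasticsearch's documentation discourages ordering a terms aggregation by ascending count because the counts can be less accurate on sharded indices, so a bottom-10 list is best treated as approximate.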

[Figure: The University of Illinois/NCSA open source license.]


After that, I dove into a specific language (JavaScript) to see the most popular license used there. (Reminder: this is valid JSON consolidated for readability.)

{"aggs":{"2":{"terms":{"field":"Repository License","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["Created Timestamp","Last synced Timestamp","Latest Release Publish Timestamp","Updated Timestamp"],"query":{"bool":{"must":[{"match_phrase":{"Repository Language":{"query":"JavaScript"}}}],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"Repository License":{"query":""}}}]}}}

 

There were some surprises in this output.

Even though the default license for npm modules created with npm init is the one from the Internet Systems Consortium (ISC), you can see that a considerable number of these projects use MIT as well as Apache 2.0 for their open source license.

Since the Libraries.io dataset is rich in open source project content, and since the GHTorrent data is missing the last few years’ data (and thus missing any details about Golang projects), I decided to run a similar query to see how Golang projects license their code.

(Reminder: this is valid JSON consolidated for readability.)

{"aggs":{"2":{"terms":{"field":"Repository License","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["Created Timestamp","Last synced Timestamp","Latest Release Publish Timestamp","Updated Timestamp"],"query":{"bool":{"must":[{"match_phrase":{"Repository Language":{"query":"Go"}}}],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"Repository License":{"query":""}}}]}}}
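The JavaScript and Go queries differ only in the language string, so they could be generated from one function. This is a minimal sketch, not the tooling used for the article: `license_query` is a hypothetical helper, and it leaves out the `_source`, `stored_fields`, `script_fields`, and `docvalue_fields` keys on the assumption that they aren't needed when only the aggregation is wanted.

```python
import json


def license_query(language, size=10):
    """Build a top-licenses terms aggregation restricted to one repository language."""
    return {
        "aggs": {
            "2": {
                "terms": {
                    "field": "Repository License",
                    "size": size,
                    "order": {"_count": "desc"},
                }
            }
        },
        "size": 0,  # return aggregation buckets only, no individual documents
        "query": {
            "bool": {
                "must": [{"match_phrase": {"Repository Language": {"query": language}}}],
                "filter": [{"match_all": {}}],
                # Skip repositories whose license field is empty.
                "must_not": [{"match_phrase": {"Repository License": {"query": ""}}}],
            }
        },
    }


for lang in ("JavaScript", "Go"):
    print(json.dumps(license_query(lang), separators=(",", ":")))
```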

 

The results were quite different from the JavaScript results.

Golang offers a stunning reversal from JavaScript: nearly three times as many Golang projects are licensed under Apache 2.0 as under MIT. While it’s hard to explain precisely why this is the case, Golang has seen massive growth over the last few years, especially among companies building projects and software offerings, both open source and commercial.

As I guessed above, many of these companies want to protect their trademarks, so the move to the Apache 2.0 license makes sense.

Conclusion

In the end, I found some interesting results by diving into the GitHub users and projects data dump. Some of these I definitely would have guessed, but a few were surprises to me as well, especially outliers like the rarely used NCSA license.

All in all, you can see how quickly and easily the CHAOSSEARCH platform lets us find complicated answers to interesting questions. I dove into this dataset and received deep analytics without having to run any databases myself, and even stored the data inexpensively on Amazon S3—so there’s little maintenance involved. Now I can ask any other questions regarding the data anytime I want.

What other questions are you asking your data, and what data sets do you use? Let me know in the comments or on Twitter @petecheslock.

A version of this article was originally posted on CHAOSSEARCH.


About the author

Pete Cheslock - Pete is currently the VP, Products for CHAOSSEARCH - a cloud search and analytics company. Prior to CHAOSSEARCH, Pete spent the last 20 years in Technical Operations, working to build and manage large-scale SaaS infrastructure at companies like Dyn, Threat Stack, and Sonian. When Pete's not working, he's checking the status of his briskets and ribs using his IoT-powered smoker.