How to use data from millions of open source projects

This month released metadata on over 25 million open source projects.
452 readers like this.
Open lightbulbs.

What if we applied the techniques Google applied to index the internet back in 1998 to the world of open source software? That's exactly the thought Andrew Nesbitt had in 2014 which lead to the creation of, an open source project for indexing other open source projects. This month released metadata on over 25 million open source projects.

You can download it right now from Zenodo, but what can you do with it? To understand what is contained within this dataset, I'll take a quick look at how it's collected.

How data is collected

Everything in begins with package managers. We index project metadata from 33 package managers, filling in gaps from their source repositories where we can. We parse project manifests—a gemfile, package.json, or similar—that includes code from other projects and stores the links between them.

Projects and repositories are one of the key distinctions made in this dataset. Projects are typically the components distributed through one or more package managers. Repositories may belong to a project but most frequently they are consumers, incorporating projects into an application or service.

Repository dependencies—over 100 million of them—are the very core of this dataset. They define links between repositories and the projects that they build upon. Using these links we can easily uncover the most popular projects used in open source. But we can go further.

Dependencies define the links between projects, each with their own set of dependencies, known as transitive or deep dependencies. We can walk the tree-like structures that emerge to uncover the most valuable projects used in open source.

These two use-cases are the glue that brought Andrew and I together: Andrew built to solve the discovery and recommendation problem, while I was looking at identifying critical OSS projects after the Heartbleed issue in OpenSSL.

Of course, no piece of software is an immutable constant. Versions define a release of a project upon which dependencies are mapped, each version having its own dependency tree. Repositories may define a dependency on one single 'pinned' version or a range of versions i.e. >1.1.5. In cases where a package manager does not support the concept of a version tags are used to track releases of projects instead, pulling tagged releases from their source.

Finally, we have the convenience of a de-normalized table projects with related repository fields which in-lines the project and repository data to save you creating those links yourself.

So that's what we have to work with, what can we do with it? provides us with some examples.


It's unsurprising that the approach taken by Google transposed onto open source software results in a great search engine. itself is primarily a cross-platform search engine, using the popularity and value measures to prioritize the most commonly used components. also supports search on third party sites like the package manager Bower and search provider DuckDuckGo.

The goal is for to create a stronger open source ecosystem. We'll do this quickest by becoming a utility for developer-centered services. This data release can be used to create the map and the's API can be used to keep the map up to date. The team is particularly keen to support other package managers struggling with search.

Developer tools

The rate of production and complexity of software is accelerating. Developers rely on tools to automate many of the processes that would otherwise take time away from solving new, interesting problems. helps developers stay afloat of the constantly-shifting dependencies for the foundation of their applications, highlighting potential issues to maintainers, such as outdated, deprecated, and unmaintained software, and projects with license incompatibilities. Our fledgling CI tool builds these checks into the development process while supporting the development of

Prioritization and contribution is already working with organizations who are concerned by the issues highlighted in Nadia Egbhal's Roads and Bridges report for the Ford Foundation, one of the two foundations supporting the team. This report brought to light the very real problem of how under-supported some of the key open source projects that we all depend upon are—at work, at home, and everywhere in-between. The world is waking up to this problem and is one of the central resources being used to tackle it.

For the last two years, we've been experimenting with this data ourselves. We've built pages to highlight important projects, projects with too little attention and too few contributors. Shining a light on these kinds of projects, recommending them to contributors who are looking to make an impact with their time or to benefit directly from their own contributions is one of the approaches is taking to create a stronger ecosystem and raise the quality of all software.

If you would like to use's data you can download it right now from Zenodo. For more information check out the documentation.

Ben with a coffee.
The other guy at @librariesio and @dependencyci.

Comments are closed.

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.