What do the numbers behind an open source project tell us about where it is headed? That's the subject of Jesus M. Gonzalez-Barahona's OSCON 2014 talk later this month, where he looks at four open source cloud computing projects—OpenStack, CloudStack, Eucalyptus, and OpenNebula—and turns those numbers into a meaningful analysis.
And Gonzalez-Barahona knows analytics. As co-founder of Bitergia, he is an expert in the quantitative aspects of open source software projects. Bitergia's goal is to analyze software development metrics and to help projects and communities put these numbers to use by managing and improving their processes.
In this interview, Gonzalez-Barahona spoke with me about his OSCON 2014 talk, leveraging metrics, trends, visualization, and more.
Without giving too much away, what will you be discussing at your OSCON talk?
I will be presenting an analysis of the development communities around four cloud-infrastructure projects: OpenStack, CloudStack, Eucalyptus and OpenNebula. I will try to show how, despite being wildly different in many aspects, they (or at least some of them) also show some similar trends.
Tell us a little bit about Bitergia. What is exciting about the exclusive focus on open source projects?
We believe that to make rational decisions involving free and open source software, you need to have the communities, and especially the development communities, in mind. And for that, you need the numbers and data that characterize them. We aim to provide that data and to help you understand it. Open source software projects are the most exciting area of the whole software development landscape. Helping people understand them, and giving them tools to improve their knowledge about projects and communities, keeps us really happy and is a continuous source of fun.
What metrics do you think are most important for open source projects to be aware of? Do they differ from project to project?
They may differ from project to project, but some metrics are useful almost everywhere. For example, those that group developers by cohort (or generation) according to when they joined, which show almost immediately how well a project is attracting and retaining developers over time. With just a quick browse you can determine whether experienced people are still around, whether you're getting new blood, or whether you have high burn-out. Company participation is also interesting, from a diversity point of view. And of course, there are the metrics related to neutrality: how companies and independent developers interact with each other, and whether some of them get favored when they interact. Activity metrics have been used for many years, and those are obviously interesting too. And now, we're working a lot on performance metrics: how well your bug fixing is working, or which bottlenecks you have in your code review process.
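As an illustration of the cohort idea described above, here is a minimal sketch in Python. It is not Bitergia's tooling; the function name `cohort_activity`, the `(author, date)` input format, and the sample data are all assumptions made for this example. It assigns each developer to a cohort by the year of their first commit, then counts how many members of each cohort are still active in each later year.

```python
from collections import defaultdict
from datetime import date


def cohort_activity(commits):
    """Count active developers per cohort for each year of activity.

    `commits` is an iterable of (author, date) pairs (a hypothetical
    input format). A developer's cohort is the year of their first commit.
    Returns {activity_year: {cohort_year: number_of_active_authors}}.
    """
    commits = list(commits)  # the data is traversed twice

    # First pass: find the year each author first committed.
    first_year = {}
    for author, day in commits:
        if author not in first_year or day.year < first_year[author]:
            first_year[author] = day.year

    # Second pass: record which authors of each cohort were active each year.
    active = defaultdict(lambda: defaultdict(set))
    for author, day in commits:
        active[day.year][first_year[author]].add(author)

    return {
        year: {cohort: len(authors) for cohort, authors in sorted(cohorts.items())}
        for year, cohorts in sorted(active.items())
    }


# Made-up sample data, for illustration only.
commits = [
    ("alice", date(2011, 3, 1)),
    ("bob", date(2011, 6, 15)),
    ("alice", date(2012, 1, 10)),
    ("carol", date(2012, 5, 2)),
    ("alice", date(2013, 2, 20)),
    ("carol", date(2013, 7, 9)),
    ("dave", date(2013, 9, 30)),
]

result = cohort_activity(commits)
for year, cohorts in result.items():
    print(year, cohorts)
```

In this sample, the 2011 cohort shrinks from two active developers to one in 2012 (retention), while the 2012 and 2013 cohorts each bring in one newcomer (attraction). A real analysis would also need to merge multiple identities belonging to the same developer, which this sketch ignores.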
How might a project leverage metrics to inform decision making? What is the best example you can think of showing how a project can improve from what they have learned?
Just two examples:
After seeing their aging metrics, a certain company decided to invest in a whole new policy to keep developers involved in their pet projects, because they realized they were losing too many of them from certain cohorts and were at real risk of having no experienced developers left in one or two years.
With some open source foundations, we have been working on determining very precisely the participation and effort of developers affiliated with certain companies, because that was central to the negotiations between these companies in deciding how to coordinate to support the project.
As you've worked with the metrics of various open source projects through the years, what has stood out to you as surprising? Are there any trends which seem to be emerging?
Something that you see time and again is how much corporate support matters in some big projects. Granted, individual developers are important, but a medium or large company can decide to double the number of developers just by assigning experienced people to the project, thus boosting it and generating a lot of momentum. But the process is not easy: you have to carefully understand the dynamics, so that you don't cause burn-out in volunteer developers or in developers from other companies that are not investing in the project at the same pace. Some may think that "helping to accelerate" a project is just a matter of pouring money (or developers) into it. On the contrary, we see it's almost an art: you have to carefully track what's happening, react very quickly to problems, and maybe even slow down a bit so as not to completely trash the effort. But we've also seen how, when it works, it is really possible to go from almost zero to hundreds or even thousands of developers in just a few years, and still have a sustainable and healthy community.
How can visualizations help quickly provide a snapshot of data to a project's community in a meaningful way?
There is so much data around that you need the right visualization to find the interesting information. The appropriate chart, or in some cases just the appropriate number, can give you much more insight than a lot of raw data. I guess this is usual in big-data problems anyway. Consider that analyzing large open source software projects is a matter of analyzing millions of records (commits, changes in tickets, posts, etc.). Either you have the right visualization, or you're lost.
Why is it important to have open source software tools to analyze open source projects?
If you look around, several systems for analyzing and visualizing open source software development are emerging. But unfortunately, most of them are proprietary. And I say unfortunately because that's a pity for the open source software community at large. It's not only that maybe you don't have the resources to use those systems, or that you cannot use them as you would like because you don't control them. Even when they are free of charge, and you basically have what you need, they may still not be benefiting you as much as they could. You cannot innovate as you would like. You cannot adapt the system as your needs evolve. You cannot plug in other data, or connect to other systems, at will.
In short: you are not in control. This is not new; it is just the list of reasons why we all prefer open source software and find it more convenient and competitive. But it is of special concern to me that, in a field where we need to better understand how our projects work, our only option would be to use proprietary systems or services. Having all the software be open source, and all the data public and available (including intermediate data), is of paramount importance to the distributed control and improvement of open source software in the coming years.