Daniel Izquierdo, co-founder of software development analytics provider Bitergia, has been analyzing data for his upcoming talk at OpenStack Summit in Austin. In this interview, he offers a preview of his talk, Gender-diversity analysis of technical contributions in the OpenStack projects.
When did you start analyzing gender diversity in the OpenStack ecosystem? And how has it changed over the years?
The idea was raised during the last OpenStack Summit [Tokyo, October 2015], after the talk titled Overcoming Subtle Bias. Some numbers about gender diversity were presented, but someone in the community asked for data on actual technical contributions by women in the OpenStack project. This is what we do at Bitergia: analyze open source communities from several perspectives. This analysis specifically defines those technical contributions as commits, patchsets, and code review actions.
With respect to the data, there's a clear difference between the beginning of the project and now. The data shows growth of at least five times since 2010, although the current share is around 6% of the contributions and 10% of the population. However, that growth seems to have been more stable during the last couple of years. The same growth is visible in the project as a whole, in terms of developers and reviewers. It is also notable that even though the trend of new women coming into OpenStack has stabilized, there has been a jump in the number of women who are core reviewers (reviewers who can vote +2 or -2 in a code review process).
Tell us how you gathered and analyzed the data. And what "manual polishing" was required?
This consisted of an initial analysis of the developers' first names using an API provided by the Genderize team. We know that whether a name is used for females or males can change depending on the year, and even depending on the country. This is where the API helps a lot, by providing a probability that a given name is female or male.
With this approach, we can easily work with a list of names. To give some context, we covered around 60% of the approximately 10,000 names aggregated from Git and Gerrit repositories.
Up to this point, the analysis provided an automated way of looking for gender within the OpenStack repositories. But some developers use nicknames, or have names that are more difficult to analyze, such as people originally from Asia, where the API does not help that much. The polishing process consisted of manually checking the most important developers, up to a certain level, to be sure that we had the big numbers right. We also wanted to be sure that the women listed in the WOO wiki were in the analysis. In the automated process, eight of them were not properly detected, given the use of nicknames and other similar issues.
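The two-step process described above (an automated first-name lookup, followed by manual overrides for nicknames and hard-to-classify names) could be sketched roughly as follows. This is only an illustration, not the code used for the analysis: the lookup is a local stub with made-up sample responses rather than a real call to the Genderize API, and the 0.8 probability threshold, the function names, and the override table are all assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

# A lookup takes a lowercase first name and returns (gender, probability).
# In a real run this would wrap a name-to-gender service; here it is a stub.
Lookup = Callable[[str], Tuple[Optional[str], float]]

# Hypothetical sample data standing in for service responses.
SAMPLE_RESPONSES: Dict[str, Tuple[str, float]] = {
    "maria": ("female", 0.98),
    "daniel": ("male", 0.99),
    "andrea": ("female", 0.60),  # usage differs by country, so less certain
}


def stub_lookup(name: str) -> Tuple[Optional[str], float]:
    """Return a canned (gender, probability) pair, or (None, 0.0) if unseen."""
    return SAMPLE_RESPONSES.get(name, (None, 0.0))


def first_name(full_name: str) -> str:
    """Take the first whitespace-separated token of a name, lowercased."""
    parts = full_name.strip().split()
    return parts[0].lower() if parts else ""


def classify(
    full_name: str,
    lookup: Lookup,
    overrides: Dict[str, str],
    threshold: float = 0.8,  # assumed cut-off, not taken from the interview
) -> str:
    """Classify one identity as 'female', 'male', or 'unknown'."""
    # Manual polishing takes precedence over the automated guess.
    if full_name in overrides:
        return overrides[full_name]
    gender, probability = lookup(first_name(full_name))
    # Accept the automated answer only when it is confident enough.
    if gender is not None and probability >= threshold:
        return gender
    return "unknown"


if __name__ == "__main__":
    manual_overrides = {"xk42": "female"}  # hypothetical nickname entry
    for identity in ["Maria Lopez", "Andrea Rossi", "xk42", "qwerty"]:
        print(identity, "->", classify(identity, stub_lookup, manual_overrides))
```

Keeping the overrides in a separate table means the manual corrections survive a re-run of the automated step, which matters when the list of identities grows between analyses.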
What did the data tell you about the types of contributions women are making to the OpenStack project?
For some reason, there's a difference in the type of contributions. For women, work on code, measured in commits, accounts for 40% of contributions, while for men that value rises to 60%. That said, the five projects with the highest number of commits made by women are, in order: documentation, infrastructure, Neutron, Nova, and Horizon.
You mainly focused on analyzing Git and Gerrit repositories. What additional areas of analysis might help improve the accuracy of your results?
A community is not only code, but this is a first approach. A project as big as OpenStack has a lot of extra data sources, such as mailing lists, the Askbot instance, Launchpad, or IRC channels, among others. If we focus on technical contributions, Launchpad seems to be another candidate for the analysis. And mailing lists and IRC are communication channels where a lot of technical discussions take place. Adding those to the analysis would certainly provide more data.
Regarding the accuracy of the results, there are still a number of identities not linked to any gender group, accounting for around 7% of the contributions and 24% of the total identities. As Asian countries usually show a higher percentage of women working on technical tasks, such as development, this analysis may be underestimating those results. A proper database of Asian names or extra manual effort would help a lot to improve the analysis.
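Figures like "7% of the contributions and 24% of the total identities" are two different shares computed over the same classified list. A minimal sketch of that arithmetic is shown below; the function name and the sample records are hypothetical, and the numbers are invented for illustration rather than taken from the OpenStack data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def shares_by_group(
    records: Iterable[Tuple[str, int]],
) -> Dict[str, Tuple[float, float]]:
    """Map each gender group to (share of identities, share of contributions).

    Each record is (group, contribution_count) for one identity.
    Returns an empty dict when there are no identities or no contributions.
    """
    identities: Dict[str, int] = defaultdict(int)
    contributions: Dict[str, int] = defaultdict(int)
    for group, count in records:
        identities[group] += 1
        contributions[group] += count

    total_identities = sum(identities.values())
    total_contributions = sum(contributions.values())
    if total_identities == 0 or total_contributions == 0:
        return {}

    return {
        group: (
            identities[group] / total_identities,
            contributions[group] / total_contributions,
        )
        for group in identities
    }


if __name__ == "__main__":
    # Invented sample: four identities with their contribution counts.
    sample = [("male", 50), ("male", 30), ("female", 10), ("unknown", 10)]
    for group, (id_share, contrib_share) in shares_by_group(sample).items():
        print(f"{group}: {id_share:.0%} of identities, "
              f"{contrib_share:.0%} of contributions")
```

A group can hold a large share of identities but a small share of contributions when its members are mostly occasional contributors, which is the pattern the unclassified identities show in the interview.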
What were a few other things you learned from the OpenStack community analysis?
We've been analyzing OpenStack since around 2012 through the Activity Board dashboard, and later by producing the quarterly reports. I'd mention the way the community has self-regulated and grown exponentially as one of the key factors in its success. As an example, the quarterly reports show a continuous increase in the number of core reviewers that is in line with the continuous growth of the community. Even though big players may be competitors, in the OpenStack community they play by the same rules under the same umbrella, and this helps to show a stable development process.
Also, it has been interesting to see how the [OpenStack] Foundation started to produce specific policies when the community itself felt that efficiency was decaying (e.g., time to merge changesets was increasing). Again, as seen in the quarterly reports, those numbers have decreased thanks to the policies applied to newcomers, such as training sessions during the summits, as well as adding new core reviewers and letting developers know about the importance of reviewing on time.
This is, of course, more a matter for each of the project teams and how they perceive the project to be behaving, but data helps a lot in this process. The Gitdm dataset, Russell Bryant's scripts for code review, and Stackalytics have helped raise more and more awareness about how OpenStack is being developed.
Which other open source communities would you like to analyze for diversity or other community insights?
I'd like to check whether some policies applied within the communities are useful relative to the effort put in by community members and the other resources invested. I can think of the Outreachy programs. My personal perception is that having numbers, and perhaps repeating this analysis each year, may help a lot to understand whether those programs are useful for bringing diversity to the project. As the number of people involved in these programs is not that high, it is easy to check whether the developers they attracted are still in the project after a while.
In addition, something I'd like to study is whether the large core of women coming to OpenStack are really coming because some organization brought developers to the project. If so, the Foundation may consider additional actions focused on the organizations working in the project, or perhaps aimed at the future developers of OpenStack who are currently in high school, where they will decide their immediate future.
More generally, I'd say that any open source project community is aware that diversity is a plus for them. Foundations such as Wikimedia, Mozilla, and others are probably a good starting point for further analysis. This should allow a comparison among the several policies they apply to deal with the diversity gap. If data shows that some policies applied in certain communities are useful, those could later be applied to others. But to reach that level, we first need to produce the data to make those decisions.
See Daniel's OpenStack Summit talk in Austin on Thursday, April 28, from 11:50am-12:30pm.