The previous two articles in this series looked at open source community health and the metrics used to understand it. They showed examples of how open source communities have measured their health through metrics. This final article brings those ideas together, discussing the challenges of implementing community health metrics for your own community.
First, you must decide which metrics you want to examine. This requires understanding your questions about reaching your goals as a community. The metrics relevant to you are those that can answer your questions. Otherwise, you risk being overwhelmed by the amount of data available.
Second, you need to anticipate how you want to react to the metrics. This is about making decisions based on what your data shows you. For example, this includes managing engagement with other community members, as discussed in previous articles.
Third, you must differentiate between good and bad results in your metrics. A common pitfall is to compare your community to other communities, but the truth is that every community works and behaves differently. You can't necessarily even compare metrics within the same project. For example, you may be unable to compare the number of commits in repositories within the same project because one may be squashing commits while the other might have hundreds of micro commits. Instead, establish a baseline of where you are and where you have been, and then see whether you improve over time.
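One way to measure a community against its own history is to average an early window of a metric and express later periods relative to that average. The following is a minimal sketch of that idea. The function name, the three-period window, and the monthly numbers are all made up for illustration.

```python
from statistics import mean

def change_from_baseline(values, baseline_periods=3):
    """Return each later period's change relative to the baseline average.

    The baseline is the mean of the first `baseline_periods` values.
    """
    if len(values) <= baseline_periods:
        raise ValueError("need more periods than the baseline window")
    baseline = mean(values[:baseline_periods])
    if baseline == 0:
        raise ValueError("baseline is zero; relative change is undefined")
    return [(v - baseline) / baseline for v in values[baseline_periods:]]

# Illustrative monthly counts of merged change requests (made-up numbers).
monthly_merged = [40, 44, 36, 50, 30]
for change in change_from_baseline(monthly_merged):
    print(f"{change:+.0%} versus baseline")
```

Because the comparison stays within one repository and one metric, differences in workflow between projects (such as squashing commits) don't distort the result.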
The final organizational challenge I want to discuss is the handling of Personally Identifiable Information (PII). One of open source's core values and strengths is the transparency of how contributors work. This means everyone has information about who's engaged, including their name, email address, and possibly other information. There are ethical considerations about using that data.
In recent years, regulations like the European General Data Protection Regulation (GDPR) have defined legal requirements for what you can and cannot do with PII. The key question is whether you need to ask everyone's permission before you process their data, which is an opt-in strategy. Alternatively, you might choose to use the data and provide an opt-out process.
This distinction is important. For instance, suppose you're providing metrics and dashboards as a service to your community. In an effort to improve the community, you might make the case that the (already publicly available) information has greater value for the community once it's processed. Either way, make it clear what data you use and how you use it.
Where is your community data being collected? To answer this, consider all the places and platforms your community is engaging in. This includes the software repository, whether it's GitLab, GitHub, Bitbucket, Codeberg, or just a mailing list and a Git server. It may also include issue trackers, a change request workflow system like Gerrit, or a wiki.
But don't stop at the software development interactions. Where else does the community exist? These could be forums, mailing lists, instant messaging channels, question-and-answer sites, or meetups. There's a lot of activity in open source communities that doesn't strictly involve software development work but that you want to recognize in your metrics. These non-coding activities may be hard to track automatically, but you should pay special attention to them or risk ignoring important community members.
With all of these considerations addressed, it's time to take action.
1. Retrieve the data
Once you've identified the data sources, you must get the data and make it useful. Collecting raw data is almost always the easiest step because logs and APIs already exist for that. Once collection is set up, the main (and hopefully occasional) challenge is keeping up when APIs and log formats change.
2. Data enrichment
Once you have the data, you probably need to enrich it.
First, you must unify the data. This step includes converting data into a standard format, which is no small feat. Just think of all the different ways to express a simple date. The order of the year, month, and day varies between regions; dates may use dots, slashes, or other symbols, or they can be expressed as seconds since the Unix epoch. And that's just a timestamp!
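To make the date problem concrete, here is a small sketch that converts a few common representations into ISO 8601 timestamps in UTC. The list of formats and the function name are assumptions for illustration. A slash-separated date is treated as month-first here, which cannot be detected from the value alone, so in practice you would configure the expected format per data source.

```python
from datetime import datetime, timezone

# Formats this sketch knows about; extend the list per data source.
KNOWN_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"]

def to_utc_iso(raw):
    """Convert a date in one of several representations to ISO 8601 (UTC)."""
    # Numbers (or purely numeric strings) are treated as Unix epoch seconds.
    if isinstance(raw, (int, float)) or str(raw).strip().isdigit():
        return datetime.fromtimestamp(float(raw), tz=timezone.utc).isoformat()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        # Assumption: dates without a time zone are taken to be UTC.
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError(f"unrecognized date format: {raw!r}")

for value in ["2023-03-15", "15.03.2023", "03/15/2023", 1678838400]:
    print(value, "->", to_utc_iso(value))
```

All four inputs in the loop describe the same day, so they normalize to the same timestamp.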
Whatever your raw data format is, make it consistent for analysis. You also want to determine the level of detail. For example, when you look at a Git log, you may only be interested in when a commit was made and by whom, which is high-level information. Then again, maybe you also want to know what files were touched or how many lines were added and removed. That's a detailed view.
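The two levels of detail can be sketched with a small parser. The sample text below is made up and shaped like the output of `git log --numstat --pretty=format:'%H|%ae|%aI'` (commit hash, author email, and author date, followed by per-file added and removed line counts). The parser skips blank lines, so the exact spacing between commits doesn't matter. The function name and the dictionary layout are illustrative choices.

```python
# Made-up sample in the shape of `git log --numstat` output (see lead-in).
SAMPLE_LOG = """\
a1b2c3|dev@example.org|2023-03-01T10:00:00+00:00
10\t2\tsrc/app.py
-\t-\tdocs/logo.png

d4e5f6|doc@example.org|2023-03-02T09:30:00+00:00
3\t0\tREADME.md
"""

def parse_log(text):
    """Parse header lines and numstat lines into a list of commit dicts."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if "\t" in line:
            # Numstat line: added, removed, path (belongs to the last commit).
            added, removed, path = line.split("\t", 2)
            commits[-1]["files"].append({
                "path": path,
                # git prints "-" instead of counts for binary files.
                "added": int(added) if added.isdigit() else 0,
                "removed": int(removed) if removed.isdigit() else 0,
            })
        else:
            # Header line: hash, author email, author date.
            sha, email, date = line.split("|", 2)
            commits.append({"sha": sha, "author": email, "date": date, "files": []})
    return commits

commits = parse_log(SAMPLE_LOG)

# High-level view: who committed and when.
print([(c["author"], c["date"]) for c in commits])

# Detailed view: total lines added across all touched files.
print(sum(f["added"] for c in commits for f in c["files"]))
```

Storing the detailed form lets you derive the high-level view later, while the reverse is not possible.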
You may also want to track metadata about different contributions. This may involve adding contextual information on how the data was collected or the circumstances under which it was created. For example, you could tag contributions made during the Hacktoberfest event.
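As a sketch of such tagging, the function below marks every contribution created in October, the month in which Hacktoberfest takes place. The record layout and the tag name are assumptions. A date alone cannot show that a contributor actually took part in the event, so the tag only describes the period.

```python
from datetime import date

def tag_contribution(contribution):
    """Return a copy of the contribution with contextual tags added."""
    tags = set(contribution.get("tags", []))
    # Hacktoberfest runs during October; this marks the period only.
    if contribution["created"].month == 10:
        tags.add("hacktoberfest-period")
    return {**contribution, "tags": sorted(tags)}

# Made-up records for illustration.
contributions = [
    {"id": 1, "created": date(2023, 10, 5)},
    {"id": 2, "created": date(2023, 3, 1)},
]
for item in contributions:
    print(tag_contribution(item))
```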
Finally, standardize the data into a format suitable for analysis and visualization.
When you care about who is active in your community (and possibly what organizations they work for), you must pay special attention to identity. This can be a challenge because contributors may use different usernames and email addresses across the various platforms. You need a mechanism to track an individual by several online identifiers across platforms such as an issue tracker, a mailing list, and chat.
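A very simple form of such a mechanism is a lookup table that maps each known (platform, identifier) pair to one person. The sketch below assumes a hand-maintained table. All names and identifiers are invented, and real tooling typically adds matching heuristics and a review step on top.

```python
from collections import defaultdict

# Hypothetical, hand-maintained mapping from (platform, identifier) to a person.
KNOWN_IDENTITIES = {
    ("git", "jane@example.org"): "jane",
    ("issues", "jdoe"): "jane",
    ("chat", "jane_d"): "jane",
}

def person_key(platform, identifier):
    """Resolve a platform-specific identifier to a single person key."""
    key = (platform, identifier.strip().lower())
    # Unknown identities stay separate until someone confirms a match.
    return KNOWN_IDENTITIES.get(key, f"{platform}:{key[1]}")

def activity_per_person(events):
    """Count events per resolved person instead of per account."""
    counts = defaultdict(int)
    for platform, identifier in events:
        counts[person_key(platform, identifier)] += 1
    return dict(counts)

events = [("git", "Jane@Example.org"), ("issues", "jdoe"), ("chat", "sam")]
print(activity_per_person(events))
```

Keeping unmatched accounts separate is a deliberately cautious choice: wrongly merging two people distorts the metrics more than leaving one person split across two accounts.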
You can also pre-process data and calculate metrics during the data enrichment phase. For example, the original raw data may have a timestamp of when an issue was opened and closed, but you really want to know the number of days the issue has been open. You may also have categorization criteria for contributions, such as identifying which contributions came from core contributors (those who have been doing a lot in a project), how many "fly by" contributors show up and then leave, and so on. Doing these calculations during the enrichment phase makes it easier to visualize and analyze the data and requires less overhead at later stages.
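Both kinds of pre-processing can be sketched in a few lines. The function names, the threshold of ten contributions for a core contributor, and the rule that a single contribution counts as "fly by" are assumptions. Each community would choose its own criteria.

```python
from collections import Counter
from datetime import datetime

def days_open(opened, closed):
    """Whole days between two ISO 8601 timestamps."""
    return (datetime.fromisoformat(closed) - datetime.fromisoformat(opened)).days

def categorize(authors, core_threshold=10):
    """Label each author by how many contributions they made.

    `authors` holds one entry per contribution. The thresholds are
    illustrative assumptions, not an established standard.
    """
    labels = {}
    for author, count in Counter(authors).items():
        if count >= core_threshold:
            labels[author] = "core"
        elif count == 1:
            labels[author] = "fly-by"
        else:
            labels[author] = "regular"
    return labels

print(days_open("2023-03-01T10:00:00+00:00", "2023-03-04T09:00:00+00:00"))
print(categorize(["ana"] * 12 + ["ben"] * 3 + ["cy"]))
```

Storing the derived values alongside the raw records means dashboards can read them directly instead of recomputing them on every query.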
3. Make data useful
Now that your data is ready, you must decide how to make it useful. This involves figuring out who the user of the information is and what they want to do with it. This helps determine how to present and visualize the data. One thing to remember is that the data may be interesting but not impactful by itself. The best way to use the data is to make it part of a story about your community.
You can use the data in two ways to tell your community story:
- Have a story in mind, and then verify that the data supports how you perceive the community. You can use the data as evidence to corroborate the story. Of course, you should also look for evidence that your story is incorrect and try to refute it, similar to how you test a scientific hypothesis.
- Use data to find anomalies and interesting developments you wouldn't have otherwise observed. The results can help you construct a data-driven story about the community by providing a new perspective, especially once the community has outgrown casual observation.
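For the second approach, one simple way to surface anomalies is to flag periods that lie far from the average of a metric. The sketch below uses a z-score with a threshold of two standard deviations. Both the technique and the threshold are illustrative choices rather than a recommendation, and the weekly numbers are made up.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return the indexes of values more than `threshold` standard
    deviations away from the mean."""
    avg = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - avg) / spread > threshold]

# Made-up weekly counts of newly opened issues.
weekly_issues = [20, 22, 19, 21, 20, 23, 60, 21]
for index in find_anomalies(weekly_issues):
    print(f"week {index + 1}: {weekly_issues[index]} issues looks unusual")
```

A flagged period is only a starting point. The story comes from finding out what happened in the community at that time.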
Solve problems with open source
Before you address the technical challenges, I want to give you the good news that you're in open source technology, and others have already solved many of the challenges you're facing. There are several open source solutions available to you:
- CHAOSS GrimoireLab: The industry standard and enterprise-ready solution for community health analytics.
- CHAOSS Augur: A research project with a well-defined data model and bleeding-edge functionality for community health analytics.
- Apache Kibble: The Apache Software Foundation's solution for community health analytics.
- CNCF Dev Analytics: CNCF's GitHub statistics for community health analytics.
To overcome organizational challenges, rely on the CHAOSS Project, a community of practice around community health.
The important thing to remember is that you and your community aren't alone. You're a part of a larger community that's constantly growing.
I've covered a lot in the past three articles. Here's what I hope you take away:
- Use metrics to identify where your community needs help.
- Track whether specific actions lead to changes.
- Track metrics early, and establish a baseline.
- Gather the easy metrics first, and get more sophisticated later.
- Present metrics in context. Tell a story about your community.
- Be transparent with your community about metrics. Provide a public dashboard and publish reports.