Most software produces data, and many data owners are currently working out how to release their data publicly as part of a wider “data for good” movement that includes groups like the Engine Room, NGOs, private individuals, communities, and companies.
Ushahidi users are no exception, and we’ve been working hard both on ways for them to access and release their datasets and on the issues they need to consider when doing so. We’re writing plugins and API code, but we’re also active in groups like the Responsible Data Forum, where we are thinking about what it means to balance the potential social good of wider dataset release against the potential risks that come with making any data public.
Ushahidi is a global organization that empowers people to make a serious impact with open source technologies, cross-sector partnerships, and ground-breaking ventures.
Do no harm
Ushahidi is also a crowdsourcing platform, handling inputs from users via direct reports, SMS, Twitter, Facebook, and other social media and specialist data gathering platforms. Ushahidi platforms are used widely in contexts ranging from monitoring conflicts and election violence to disasters, development initiatives, wildlife monitoring, and park benches.
Although making the location of a park bench public carries little risk, many of the datasets managed by Ushahidi users contain information that is personal, often gathered under extreme circumstances, and potentially dangerous to its subjects, collectors, or managers. Sharing data from these platforms isn’t just about clicking on a share button. If you make a dataset public, you have a responsibility, to the best of your knowledge and skills and with whatever advice you can get, to do no harm to the people connected to that dataset. You have to balance making data available to people who can do good with it against protecting the data subjects, sources, and managers. You should also do a data risk analysis (what could happen, and how bad would the consequences be?) that covers the population, mappers, organizations, and leads. That places a lot of responsibility on anyone releasing a crowdsourced dataset, and even for datasets whose release is benign, there are still questions of "Who owns this data?" to be addressed.
Types of data
A typical Ushahidi instance contains several types of data.
Direct reports: Messages input by crowdsourcers via web forms or SMS. Ushahidi has a standard set of fields for these (title, description, category list etc), but also creates custom form fields to capture items of specific interest to the map owner.
Indirect reports: Messages scraped from other applications (Twitter, Facebook etc.) either through APIs or by crowdsourcers adding them as direct reports.
Geolocations: Latitude and longitude for each location name in a list. These are usually either a) found automatically by the platform using a gazetteer like Nominatim, b) input by site administrators, or c) input by the person submitting a direct report.
Category lists: The categories that reports are tagged with; these lists are usually created by the site administrator.
Media: Images, video, audio files added to a report by the reporter or an administrator.
Who owns the data?
Ownership is a recurring issue here. If a community of people adds reports to a site, and that site also sucks in data from social media, who owns that data? For example, third-party data (e.g. Twitter messages) comes with restrictions on storage and ownership that, even with the original sender’s permission, could make it illegal for you to distribute it or keep it on your site. Questions about ownership have already been asked and answered in many open data and social media sites, often involving much work and lost data as licenses are transferred (see OpenStreetMap’s license moves, for example). Having crowdsourcers sign a contributor agreement, and having a data license statement on any dataset you put out, is a good start.
The ethical process
Risk is a recurring issue too. There are privacy concerns for data subjects (e.g. accidentally making locations and phone numbers public); safety concerns for crowdsourcers reporting conflict, violence, hate speech etc.; safety risks from revealing sensitive locations (e.g. personal homes, or addresses of places like rape crisis centers that need to be kept secret); and team privacy and safety concerns (e.g. admins’ email addresses, but also things like activity data for team members who are being actively stalked, and the possibility of members being arrested as spies in their home countries).
Life isn’t always clear-cut, and when life isn’t clear-cut, we often start talking about processes:
Ethical process: Assessing the potential risks in sharing a dataset; selecting which data you should and should not share. Balancing the potential harms from sharing information from the deployment against the potential good. If you’re not sure, don’t share, but if you’ve checked, cleaned, double-checked and the risk is minimal (and ethical: you’re working with other people’s information here), seriously consider it. If it’s come from a personal source (SMS, email etc), check it. At least twice.
Legal process: choosing who to share with, writing nondisclosure agreements, academic ethics agreements etc. You might want to share data that’s been in the media because it’s already out there, but you could find yourself in interesting legal territory if you do (see under: GDELT). In some countries, slander and libel laws could also be a consideration.
Physical process: where to put cleaned data and how to make it available. There are many data warehouses which specialise in hosting social-good datasets: these include the Humanitarian Data Exchange (HDX), which specialises in disaster-related data, and sites like datahub.io. Ushahidi data can also be shared by making it public on an Ushahidi instance with an API (e.g. crowdmap.com) or CSV download button (this is an Ushahidi plugin), or by making data available to people on request.
As a crisismapper, I often go through the ethical process. I generally do a manual investigation first, or supervise someone who already has access to the deployment dataset while they do it, weeding out all the obvious PII and worrisome data points. I then ask someone local to the deployment area to do a manual investigation for problems that aren’t obvious to people outside the area (for example, in Homs, the location of a bakery was dangerous information to release, because of targeted bombing).
Some of the things I look for on a first pass include:
- Identification of reports and subjects: Phone numbers, email addresses, names, personal addresses.
- Military information: actions, activities, equipment.
- Uncorroborated crime reports: violence, corruption etc that aren’t also supported by local media reports.
- Inflammatory statements (these might re-ignite local tensions).
- Veracity: Are these reports true, or at least supported by external information?
Things that make this difficult include untranslated sections of text (you’ll need a native speaker or good auto translate software), codes (e.g. what does “41” mean as a message?) and the amount of time it takes to check through every report by hand. This can be hard work, but if you don’t do these things, you’re not doing due diligence on your data, and that can be desperately important.
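Some of the first-pass checks above can be partly automated before the manual review starts. The following is a minimal sketch, not an Ushahidi feature: the regular expressions, the `flag_report` and `first_pass` names, and the "short numeric message" rule are all assumptions made for illustration. It will miss many formats (names, addresses, and non-Latin scripts among them), so it can only help prioritise reports for human review, never replace it.

```python
import re

# Hypothetical first-pass patterns; real data will contain formats these miss.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def flag_report(text):
    """Return a list of reasons why a report needs manual review."""
    reasons = []
    if EMAIL_RE.search(text):
        reasons.append("email")
    if PHONE_RE.search(text):
        reasons.append("phone")
    # Very short all-digit messages may be codes that need local knowledge.
    stripped = text.strip()
    if stripped.isdigit() and len(stripped) <= 4:
        reasons.append("possible code")
    return reasons


def first_pass(reports):
    """Map report id -> reasons, keeping only the reports that were flagged."""
    flagged = {}
    for report_id, text in reports.items():
        reasons = flag_report(text)
        if reasons:
            flagged[report_id] = reasons
    return flagged
```

A report that comes back unflagged is not cleared for release; it has simply not matched any of these few patterns, and still needs the local, manual check described above.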
The meta data level
I also look at the data release on a meta level, starting with "Who needs this data?" and "How accurate does it really need to be for them?"
Open data is, by its nature, open, and it’s difficult to anticipate all the uses that people have for a dataset you release, but some example users are:
- Academics: analysis of social media use, group dynamics etc.
- People in your organization: for lessons learned reports, for illustration, for visualizations, for analysis of things like report tempo (how frequently did information arrive for processing, translation etc.).
- Data subjects: to check veracity of data about them, and to ask for data removal (sometimes under laws designed for this purpose, e.g. EU privacy laws). I haven’t seen this happen in a crowdsourced instance yet, but it’s only a matter of time.
Useful meta-level questions that I’ve seen asked by communities include:
- How geographically accurate does your data release have to be? E.g. would it be okay, or even better, to release data at a lower geographical precision (e.g. to town level rather than street)?
- Do you need to release every report? Most deployments have a lot of junk messages (usually tagged unverified). Remember, the less data you release, the less you have to manage (and worry about).
- Would aggregated data match the need of the person asking for data? E.g. if they just need dates, locations and categories, do you need to release text too?
- Time: You might want to leave time after your deployment for the dataset to be less potentially damaging. When you release a dataset, you should also have a “data retirement” plan in place, detailing whether there’s a last date that you want that data to be available, and a process to archive it and any associated comments on it.
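The questions about precision, junk messages, and aggregation can be combined into one release step. The sketch below is one possible approach under stated assumptions: the field names (`date`, `lat`, `lon`, `category`, `verified`) and the `coarsen` and `aggregate` functions are hypothetical, not an Ushahidi export format, and dropping every unverified report is only one of several reasonable policies. Rounding to one decimal place gives cells of roughly 11 km in latitude.

```python
from collections import Counter


def coarsen(lat, lon, decimals=1):
    """Round coordinates; one decimal place is roughly 11 km of latitude."""
    return (round(lat, decimals), round(lon, decimals))


def aggregate(reports, decimals=1):
    """Count verified reports by (date, coarse location, category).

    Free text and exact coordinates are not carried into the output.
    """
    counts = Counter()
    for report in reports:
        if not report["verified"]:
            continue  # assumed policy: unverified reports are not released
        location = coarsen(report["lat"], report["lon"], decimals)
        counts[(report["date"], location, report["category"])] += 1
    return counts
```

Coarse, aggregated data is not automatically safe. A cell containing a single report in a sparsely populated area may still point to one household, so the output needs the same local review as the raw reports.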
That’s a quick canter through some of the issues and potential processes for releasing crowdsourced data. The bottom line is to always start with the “first, do no harm” rule, and have the possibility of accidental release in mind when you assess the risks of opening up data. Please open up as much social-good data as possible, but do it responsibly too. We’ve seen too many instances of datasets that should have been kept private making it into the public domain, as well as datasets that should have become public but never did, and carefully pruned datasets whose release was still criticized.
This article is part of the HFOSS column coordinated by Jen Wike Huger.