Data, like code, is better open

What can you do with open data?

Play a word association game and the word "open" will almost surely be followed by "source." And open source is certainly an important force for preserving user freedoms and access to computing. However, code isn't the only form of openness that's important.

Open data

Open data has been discussed for at least a decade. At the OSCON conference in 2007, Tim O'Reilly kicked off a bit of a ruckus when he suggested that open data might actually be more important than open code. Open data in this context mostly referred to the ability to export the user-created "Web 2.0" data, which was becoming important at that time. Tim Bray, then at Sun Microsystems, highlighted the issue when he wrote:

At the end of the day, information outlives software and transcends software and is more valuable than software.

At the same time, other aspects of open data were starting to come to the fore—including access to public data sources. Even when public data was already available to researchers and others, often it wasn't in a form that could be freely and easily accessed. For example, when I looked into using river-level information from the US Geological Survey around that time, I found that I would need to do some complicated web page scraping to get the information into a form I could import into a program. Many other types of data weren't available online at all.

That situation started to change in a systematic way. In May 2009, then-US chief information officer Vivek Kundra launched Data.gov. This, in turn, led to a 2013 executive order that "made open and machine-readable data the new default for government information." Many states and municipalities also expanded the data that they made available. In March 2016, the White House launched the Opportunity Project to focus on tools for visualizing and using public data in useful ways. Eight US cities are currently participating in the project: Baltimore, Detroit, Kansas City (Missouri), New Orleans, New York, Philadelphia, San Francisco, and Washington, D.C.

Many of these datasets represent an event, a measurement, or a physical object at a specific location. As I've written about previously, such data can be visualized by taking map data from a source such as OpenStreetMap and embedding it in a web page with a JavaScript library like Leaflet.
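One common way to hand point data to a mapping library is GeoJSON. The following is a minimal sketch, assuming each record already carries latitude and longitude under the hypothetical keys `lat` and `lon`; the sample records are invented for illustration and are not taken from any real dataset.

```python
import json

def to_geojson(records):
    """Convert dicts with 'lat' and 'lon' keys into a GeoJSON FeatureCollection."""
    features = []
    for rec in records:
        # Everything except the coordinates becomes a feature property.
        props = {k: v for k, v in rec.items() if k not in ("lat", "lon")}
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [float(rec["lon"]), float(rec["lat"])],
            },
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}

# Hypothetical sample records, not real data.
sample = [
    {"lat": "42.3736", "lon": "-71.1097", "kind": "pothole"},
    {"lat": 42.3625, "lon": -71.0862, "kind": "tree"},
]
geojson_text = json.dumps(to_geojson(sample), indent=2)
```

The resulting text could be saved to a file and loaded into a Leaflet map through its `L.geoJSON` layer. Note the coordinate order: GeoJSON puts longitude first, which is an easy thing to get backwards.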

To make things more concrete, let's take a look at data from one specific city: Cambridge, Massachusetts. Cambridge makes 160 datasets available. These include health inspection data, accidents, crime reports, census information, city-maintained trees, pothole repair requests, and much more.

Data can be downloaded in a variety of formats (JSON, XML, CSV). Which one you use will depend on your preferences and on whether you want to work with the data programmatically or in a more typical end-user tool, such as a spreadsheet. You'll notice that much of this data refers to locations, although you'll typically need to convert street addresses to geographical coordinates (i.e., latitude and longitude) with a geocoding service before you can display it using the aforementioned Leaflet and OpenStreetMap. Nominatim is a search engine for OpenStreetMap data that can do this conversion. Other options include Google Maps.
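As a rough illustration of the geocoding step, here is a minimal sketch that uses only the Python standard library and the public Nominatim search endpoint. The helper names and the `user_agent` string are my own assumptions, and anyone using the public server should check its usage policy (rate limits, identifying header) before sending real traffic.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_search_url(address):
    """Build a Nominatim search URL that asks for one JSON result."""
    params = {"q": address, "format": "json", "limit": 1}
    return NOMINATIM_SEARCH + "?" + urlencode(params)

def parse_first_result(body):
    """Return (lat, lon) from a Nominatim JSON response, or None if empty."""
    results = json.loads(body)
    if not results:
        return None
    # Nominatim returns coordinates as strings, so convert them.
    return float(results[0]["lat"]), float(results[0]["lon"])

def geocode(address, user_agent="open-data-example/0.1"):
    """Look up one address over the network (hypothetical helper)."""
    # The User-Agent value is a placeholder; the public service asks
    # clients to identify themselves.
    req = Request(build_search_url(address), headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return parse_first_result(resp.read().decode("utf-8"))
```

Splitting URL building and response parsing away from the network call keeps the pieces easy to check offline. For more than a handful of addresses, caching results locally is kinder to the service than repeating lookups.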

Your data explorations, however, don't need to be limited to sticking pins on a map. It isn't hard to imagine doing more complex aggregations and correlations of different datasets using a wide range of statistical techniques and visualizations. (D3.js is a particularly popular JavaScript library for manipulating documents based on data, and is a powerful tool for displaying data in ways that can be both visually arresting and the source of genuine insights.) For example, imagine looking at how city services are provided in different neighborhoods throughout the city; these sorts of patterns can be the basis for evidence-based data journalism.
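A simple aggregation of this kind might look like the sketch below, which groups service requests by neighborhood and reports the median time to close them. The CSV text, the column names `neighborhood` and `days_to_close`, and every value in it are invented for illustration; a real export would have its own schema.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical CSV export; the column names and values are invented.
SAMPLE_CSV = """neighborhood,days_to_close
Area A,3
Area A,5
Area B,12
Area B,8
Area B,10
"""

def median_days_by_neighborhood(csv_text):
    """Group rows by neighborhood and return the median days to close."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["neighborhood"]].append(float(row["days_to_close"]))
    return {name: median(days) for name, days in groups.items()}

summary = median_days_by_neighborhood(SAMPLE_CSV)
```

The median is used here rather than the mean because a few very slow requests would otherwise dominate the summary; which statistic is appropriate depends on the question being asked.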

That said, a caveat is in order: open data is subject to the same misinterpretation and misuse as data from any other source. Understand the provenance and limitations of any datasets that you use. In general, there is an increasingly wide range of open data available from trusted sources that have collected it using relatively rigorous techniques. However, even this sort of data can get stale, or it may simply not communicate the information you think it does based on a quick initial look.

Also be aware of the potential pitfalls associated with aggregating data at different scales, as well as broader issues related to demonstrating causality. One needs to be especially careful about aggregating data for spatial visualizations. For example, if you aggregate data and color-code a map to display the level of some activity by census block or city ward, that level may be influenced more by the population or size of the block than by actual differences in the underlying rate of the activity.
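To see how raw counts can mislead, consider this minimal sketch that converts counts into rates per 1,000 residents. The ward names, counts, and populations are hypothetical numbers chosen only to illustrate the effect.

```python
def rate_per_1000(counts, populations):
    """Convert raw counts per area into rates per 1,000 residents."""
    rates = {}
    for area, count in counts.items():
        pop = populations.get(area, 0)
        # Leave the rate undefined when the population is missing or zero.
        rates[area] = None if pop <= 0 else 1000 * count / pop
    return rates

# Hypothetical figures, not real data.
counts = {"Ward 1": 120, "Ward 2": 45}
populations = {"Ward 1": 24000, "Ward 2": 5000}
rates = rate_per_1000(counts, populations)
```

With these made-up figures, Ward 1 has far more incidents in absolute terms, yet its rate is 5 per 1,000 residents against 9 per 1,000 in Ward 2. A map colored by raw counts would tell the opposite story from one colored by rates.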

Increasingly, a wide range of data and other information is available in a form that's easy to consume and doesn't put limits on its use. In addition to the types of local government data that I discussed above, there's also expanded public access to the results of federally funded research, for example. Open data in areas such as these is particularly significant because it can increase collaboration and make it easier to build upon the work of others, just as the open source development model has demonstrated.

About the author

Gordon Haff - Gordon Haff is a Red Hat technology evangelist and a frequent and highly acclaimed speaker at customer and industry events who helps develop strategy across Red Hat’s full portfolio of cloud solutions. He is the co-author of Pots and Vats to Computers and Apps: How Software Learned to Package Itself in addition to numerous other publications. Prior to Red Hat, Gordon wrote hundreds of research notes, was frequently quoted in publications like The New York Times on a wide range of IT topics, and...