Data, like code, is better open
What can you do with open data?
Play a word association game and the word "open" will almost surely be followed by "source." And open source is certainly an important force for preserving user freedoms and access to computing. However, code isn't the only form of openness that's important.
Open data has been discussed for at least a decade. At the OSCON conference in 2007, Tim O'Reilly kicked off a bit of a ruckus when he suggested that open data might actually be more important than open code. Open data in this context mostly referred to the ability to export the user-created "Web 2.0" data, which was becoming important at that time. Tim Bray, then at Sun Microsystems, highlighted the issue when he wrote:
At the end of the day, information outlives software and transcends software and is more valuable than software.
At the same time, other aspects of open data were starting to come to the fore—including access to public data sources. Even when public data was already available to researchers and others, often it wasn't in a form that could be freely and easily accessed. For example, when I looked into using river-level information from the US Geological Survey around that time, I found that I would need to do some complicated web page scraping to get the information into a form I could import into a program. Many other types of data weren't available online at all.
This started to change in a systematic way. In May 2009, then-US chief information officer Vivek Kundra launched Data.gov. This, in turn, led to a 2013 executive order that "made open and machine-readable data the new default for government information." Many states and municipalities also expanded the data that they made available. In March 2016, the White House launched the Opportunity Project to focus on tools for visualizing and using public data in useful ways. Eight US cities are currently participating in the project: Baltimore, Detroit, Kansas City (Missouri), New Orleans, New York, Philadelphia, San Francisco, and Washington, D.C.
To make things more concrete, let's take a look at data from one specific city: Cambridge, Massachusetts. Cambridge makes 160 datasets available. These include health inspection data, accident and crime reports, census information, city-maintained trees, pothole repair requests, and much more.
Data can be downloaded in a variety of formats (JSON, XML, CSV). Which you use will depend on your preferences and whether you want to work with the data programmatically or in a more typical end-user tool, such as a spreadsheet. You'll notice that much of this data refers to locations, although you'll typically need to convert street addresses to geographical coordinates (i.e., latitude and longitude) using a geocoding service before you can display it on a map with tools such as Leaflet and OpenStreetMap. Nominatim is a search engine for OpenStreetMap data that can perform this conversion. Other options include Google Maps.
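To illustrate working with such data programmatically, here is a minimal Python sketch that reads the same records from CSV and from JSON, then builds a geocoding query for each street address. The sample records and column names (ticket_id, address, status) are made up for illustration and do not come from any real Cambridge dataset. The helper names are likewise hypothetical, and the Nominatim endpoint and query parameters shown are assumptions to check against the service's current documentation and usage policy. The sketch only constructs the URL and does not send a request.

```python
import csv
import io
import json
from urllib.parse import urlencode

# Hypothetical sample records; real column names depend on the dataset.
CSV_TEXT = """ticket_id,address,status
1001,795 Massachusetts Ave,Open
1002,449 Broadway,Closed
"""

JSON_TEXT = json.dumps([
    {"ticket_id": "1001", "address": "795 Massachusetts Ave", "status": "Open"},
    {"ticket_id": "1002", "address": "449 Broadway", "status": "Closed"},
])


def load_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def geocode_url(address, city="Cambridge, MA"):
    """Build (but do not send) a Nominatim search URL for a street address.

    The endpoint and parameters are assumptions; verify them before use.
    """
    params = {"q": f"{address}, {city}", "format": "json", "limit": 1}
    return "https://nominatim.openstreetmap.org/search?" + urlencode(params)


if __name__ == "__main__":
    rows = load_csv(CSV_TEXT)
    # Both formats yield the same list of records once parsed.
    assert rows == load_json(JSON_TEXT)
    for row in rows:
        print(row["ticket_id"], geocode_url(row["address"]))
```

Note that CSV values always arrive as strings, so numeric columns need explicit conversion, whereas JSON can carry numbers directly.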
That said, a caveat is in order: open data is subject to the same misinterpretation and misuse as data from any other source. Understand the provenance and limitations of any datasets that you use. In general, there is an increasingly wide range of open data available from trusted sources that have collected it using relatively rigorous techniques. However, even this sort of data can get stale, or it may simply not communicate the information you think it does based on a quick initial look.
Also be aware of the potential pitfalls associated with aggregating data at different scales, as well as broader issues related to demonstrating causality. One needs to be especially careful about aggregating data for spatial visualizations. For example, if you aggregate data and color-code it to display the level of some activity by census block or city ward, that level may be influenced more by the population or size of the block than by actual differences in the underlying rate of the activity.
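The aggregation pitfall is easy to see with a small worked example. The Python sketch below uses made-up counts and populations for three hypothetical wards and compares a ranking by raw count with a ranking by rate per 1,000 residents. The function names are illustrative only.

```python
# Hypothetical counts and populations for three made-up wards.
incidents = {"Ward A": 120, "Ward B": 60, "Ward C": 30}
population = {"Ward A": 24000, "Ward B": 6000, "Ward C": 2000}


def rate_per_1000(counts, pops):
    """Return incidents per 1,000 residents for each area."""
    return {area: 1000 * counts[area] / pops[area] for area in counts}


def rank(values):
    """Return area names ordered from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)


if __name__ == "__main__":
    rates = rate_per_1000(incidents, population)
    print("By raw count:", rank(incidents))
    print("By rate per 1,000:", rank(rates))
```

With these numbers, Ward A has the most incidents but the lowest rate (5 per 1,000), while Ward C has the fewest incidents but the highest rate (15 per 1,000). A map colored by raw count would mostly show where people live.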
Increasingly, a wide range of data and other information is available in a way that's easy to consume and doesn't put limits on its use. In addition to the types of local government data discussed above, there's also expanded public access to the results of federally funded research, for example. Open data in areas such as these is particularly significant because it can increase collaboration and make it easier to build upon the work of others, just as the open source development model has done for code.