Open data portals should be API [First]

What is API [First]?

Not long ago, I spoke at a conference of the National Association of Government Web Professionals, where Mark Headd was also speaking. We covered different open data topics: my talk was about the difference between open government and open data, and his was about API [First].

Luckily they had us scheduled at different times, so I had the opportunity to see him speak on the API [First] strategy for website development using open data. His subtitle was "Open Data as a Foundation for Better Websites."

API [First] is essentially device-agnostic design. Mark's premise was that websites often retrofit an API (application programming interface) after deployment. That API may or may not be a response to scraping by people seeking data from the website. I enjoyed this talk, and I understood why thinking about APIs before thinking about a website makes sense.
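To make the idea concrete, here is a minimal sketch of what "API before website" can look like in code. It is an illustration only, not anything shown in Mark's talk: the `PERMITS` records and all function names are hypothetical. The point is that the data is exposed once as JSON, and the web page is just one client among several.

```python
import json
from html import escape

# Hypothetical sample records standing in for a real open data set.
PERMITS = [
    {"id": 1, "type": "building", "status": "issued"},
    {"id": 2, "type": "electrical", "status": "pending"},
]


def api_get_permits(status=None):
    """The API layer: returns JSON text and knows nothing about devices or pages."""
    rows = [r for r in PERMITS if status is None or r["status"] == status]
    return json.dumps({"count": len(rows), "rows": rows})


def render_html(status=None):
    """The website is one client of the API."""
    payload = json.loads(api_get_permits(status))
    items = "".join(
        f"<li>{escape(r['type'])}: {escape(r['status'])}</li>"
        for r in payload["rows"]
    )
    return f"<ul>{items}</ul>"


def render_text(status=None):
    """A second client (say, a command-line tool) reusing the same API."""
    payload = json.loads(api_get_permits(status))
    return "\n".join(
        f"{r['id']}\t{r['type']}\t{r['status']}" for r in payload["rows"]
    )
```

Because both renderers go through `api_get_permits`, a new device or display format means writing another small client rather than retrofitting an API onto an existing site.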

This article looks at where open data might be heading. Open data today is still following the Gartner Hype Cycle that the World Wide Web followed 20 years ago. We will see that machines do, and should, consume more data than humans do. This is a hypothesis about how data is used, based on anecdotal and survey evidence. I cannot demonstrate anything conclusive without more data, but I can hypothesize the following:

Data consumed by humans has lower re-use value in that it is not being redistributed
Data served on a web/mobile [First] platform takes more work to re-use than data served on an API [First] platform
Power users from the media, technical, and academic worlds overwhelmingly spend the most time looking at the Open Raleigh data portal
The inherent UX problems with elements of data portals make them difficult to use

Data portals are not just any website

I want to go a little further with this. Data online should be API [First], and data portals need to be replaced with something more useful and less annoying. In November, I completed a survey on Open Raleigh and received over 100 responses from the public. Respondents were favorable about the types of data available, the frequency of updates, and the quality of the data, but they resoundingly disliked the interface. Most of the interface comments were about latency and, in some cases, display incompatibility.

I completely agree with their conclusions. The data and the retrofitted API are great, but they are locked in an almost unusable cage. What if the platform had been built around an API instead of a web [First] strategy? We might find that developers could adapt the data, and the exploration widgets that go with it, to current and future devices with less work on the part of the City of Raleigh.

Figure 1: October 2014 Rows Loaded

If we look at how data sites are actually used rather than how they are marketed, we start to see some patterns emerge. Open Raleigh has had 1,115,125 human page views in the last 18 months, with a majority of those in the last six months. In October 2014, we peaked at 17,000,000 API calls. In that one month, machines made roughly 15 times as many requests as humans viewed pages in the entire 18 months.
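As a quick check on the scale of that gap, the two figures quoted above can be compared directly. The second ratio rests on an assumption that is not in the data: that page views were spread evenly across the 18 months. Since most views came in the last six months, treat it as a rough upper bound on the monthly gap, not a measurement.

```python
# Figures quoted in the article.
human_page_views_18_months = 1_115_125
api_calls_october_2014 = 17_000_000

# One peak month of API calls against all 18 months of page views.
ratio_vs_18_months = api_calls_october_2014 / human_page_views_18_months

# Assumption for illustration only: page views spread evenly over 18 months.
avg_monthly_page_views = human_page_views_18_months / 18
ratio_vs_avg_month = api_calls_october_2014 / avg_monthly_page_views

print(f"{ratio_vs_18_months:.1f}x")  # 15.2x
print(f"{ratio_vs_avg_month:.0f}x")  # 274x
```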

As you can see from the figure, September and November are not far below October. Some readers may wonder why there is a surge in API calls starting in May 2014. We spent May through October building open source service architectures on Red Hat JBoss SwitchYard that could mine data and automatically append it to data sets within the Open Raleigh portal.

Open Raleigh uses a responsive web design that is friendly to most handheld devices, but the API needs a little help to push data into the portal. The portal itself releases every data set as an API endpoint, and that API is read-only. With some additional code, we can have the Socrata portal accept appended data sets. Socrata is not alone in the web/mobile [First] category. ESRI, CKAN, and to some extent, Junar are architected on the same principles. This is not a direct criticism or endorsement of any particular platform.
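For readers who have not used one of these read-only endpoints, the sketch below builds a query URL in the style of Socrata's SODA API, where each data set is addressed as `/resource/<dataset id>.json` and query options are prefixed with `$`. The domain `data.example.gov` and the id `abcd-1234` are placeholders, not real Open Raleigh identifiers, and the helper `soda_url` is a hypothetical name. The code only constructs the URL; it does not call any service.

```python
from urllib.parse import urlencode


def soda_url(domain, dataset_id, **params):
    """Build a read-only SODA-style query URL for one data set."""
    # SODA query options carry a "$" prefix, for example $limit and $where.
    query = urlencode({f"${k}": v for k, v in params.items()}, safe="$")
    base = f"https://{domain}/resource/{dataset_id}.json"
    return f"{base}?{query}" if query else base


# Placeholder domain and data set id, used for illustration only.
url = soda_url("data.example.gov", "abcd-1234", limit=100, where="year = 2014")
print(url)
# https://data.example.gov/resource/abcd-1234.json?$limit=100&$where=year+%3D+2014
```

Reading is a single GET request against a URL like this one. Appending data is what required the extra service code described above.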

The consequences of getting it wrong

Discussing multi-nodal approaches and espousing an API [First] strategy may seem esoteric until one looks at anecdotal issues with some recent portal launches. Minneapolis recently launched its open data portal to scathing reviews. Most of the criticism centered on the performance of the site: latency, non-responsive design, and crashing web pages. Note that these were all complaints from citizens trying to use the portal through different browsers. The city blamed ESRI, but latency and poorly designed pages that do not validate are not inherent in the platform. ESRI is not an API [First] product. The city said it should have gone with Socrata. Given the city’s ability to manage a rollout, it seems clear that the lack of a multi-nodal, standards-based approach was a significant, though not the only, reason for the beta failure.

CKAN also recently announced the launch of a new, comprehensive open data portal on the Ebola crisis through a tweet. This is not responsive design. When I look at it on my mobile, I see a tiny version of the full site and no way to meaningfully consume or re-use the data unless I switch to a larger interface. Now let’s think of the consequences of that:

Who is the data for? If it's for field workers, this is a huge fail. The most common field devices are tablets and mobile phones. Not having some kind of app consuming an API would be an obstacle to data re-use.
How do I consume and re-use the data? Following the rabbit hole of links, I can get to geo-data about the crisis. Most of the catalog listings are CSV data sets alongside PDFs giving the context of the data. This is good in that I have metadata, but bad in that there is no obvious API endpoint I can query to merge this site's data with other data for my own analysis.
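To illustrate the extra work a CSV-only catalog pushes onto the person re-using the data, here is a minimal sketch of the manual merge step. Everything in it is made up for illustration: the column names, the region names, and the numbers do not come from the Ebola portal, and the helper `cases_per_million` is hypothetical.

```python
import csv
import io

# Made-up sample rows; a real catalog's columns will differ.
CASES_CSV = """region,week,cases
Region A,W1,100
Region B,W1,250
Region C,W1,40
"""

# A second, equally hypothetical data set to merge with.
POPULATION = {"Region A": 1_000_000, "Region B": 500_000}


def cases_per_million(csv_text, population):
    """Parse the CSV and join it to population figures by region name."""
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pop = population.get(row["region"])
        if pop is None:
            continue  # no match in the second data set, so skip the row
        merged.append({
            "region": row["region"],
            "week": row["week"],
            "cases_per_million": int(row["cases"]) * 1_000_000 / pop,
        })
    return merged


for record in cases_per_million(CASES_CSV, POPULATION):
    print(record)
```

With a file download, every consumer has to repeat the fetch, parse, and type-conversion steps before any joining can happen. A queryable endpoint would let the filtering and typing happen at the source.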

Conclusions

This is only the tip of the iceberg. Aside from the technical issues around not using an API [First] strategy, we have policy issues around PII (in the case of Minneapolis) and UX issues in the CKAN instance. So, comparing human-consumed and API-consumed data, I conclude the following:

Data consumed by humans has lower re-use value in that it is not being redistributed
Data served on a web/mobile [First] platform takes more work to re-use than data served on an API [First] platform