Open data portals should be API [First]


What is API [First]?

Not long ago, I spoke at the National Association of Government Web Professionals conference, where Mark Headd was also speaking. We covered different open data topics: my talk was about the difference between open government and open data, and his was about API [First].

Luckily, they had us scheduled at different times, so I had the opportunity to see him speak on the API [First] strategy for website development using open data. His subtitle was "Open Data as a Foundation for Better Websites."

API [First] is essentially device-agnostic design. Mark's premise was that websites often retrofit an API (application programming interface) after deployment, sometimes in response to scraping by those seeking data from the website. I enjoyed this talk, and I understood why thinking about APIs before thinking about a website makes sense.
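To make the idea concrete, here is a minimal sketch (in Python, using Flask) of what "API before website" looks like: the machine-readable endpoint is the product, and any web page or mobile app is just one more client of it. The endpoint path, dataset, and field names here are invented for illustration.

```python
# A minimal API [First] sketch: design the data endpoint before any
# presentation layer. All data below is hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

# The canonical, machine-readable source of truth.
PERMITS = [
    {"permit_id": 1, "type": "building", "issued": "2014-10-01"},
    {"permit_id": 2, "type": "electrical", "issued": "2014-10-03"},
]

@app.route("/api/v1/permits")
def list_permits():
    # Web pages, mobile apps, and other machines all consume this same
    # endpoint; no client gets a privileged, screen-only view of the data.
    return jsonify(PERMITS)

if __name__ == "__main__":
    app.run()
```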

This article discusses what might come next. Open data today is still following the Gartner Hype Cycle that the World Wide Web followed 20 years ago. We will see that machines do, and should, consume data more than humans do. This is a hypothesis about how data is used, based on anecdotal and survey evidence. I cannot demonstrate anything conclusive without more data, but I can hypothesize the following:

  • Data consumed by humans has lower re-use value because it is not being redistributed
  • Data served on a web/mobile [First] platform requires more work to re-use than data served on an API [First] platform
  • Power users from the media, technical, and academic worlds overwhelmingly spend the most time looking at the Open Raleigh data portal
  • The inherent UX problems with elements of data portals make them difficult to use

Data portals are not just any website

I want to go a little further with this. Data online should be API [First], and data portals need to be replaced with something more useful and less annoying. In November, I conducted a survey on Open Raleigh and received over 100 responses from the public. Respondents were favorable toward the types, update frequency, and quality of the data available, but they resoundingly disliked the interface. Most of the interface comments were about latency and, sometimes, display incompatibility.

I completely agree with their conclusions. The data and the retrofitted API are great, but they are locked in an almost unusable cage. What if the platform had been built around an API instead of a web [First] strategy? Developers could then adapt the data, and the exploration widgets that go with it, to current and future devices with less work on the part of the City of Raleigh.

Figure 1: October 2014 Rows Loaded

If we look at how data sites are actually used rather than how they are marketed, patterns start to emerge. Open Raleigh has had 1,115,125 human page views in the last 18 months, the majority of them in the last six months. In October 2014, we peaked at 17,000,000 API calls. In one month, we had 17 times more machine views than human views.

As you can see from the figure, September and November are not far below October. Some readers may wonder why there is a surge in API calls starting in May 2014. May through October was spent building open source service architectures on Red Hat JBoss SwitchYard that could mine and automatically append data sets within the Open Raleigh portal.

Open Raleigh uses a responsive web design that is friendly to most handheld devices, but the API needs a little help to push data into the portal. The portal itself releases every data set as an API endpoint, but that API is read-only. With some code, we can have the Socrata portal let us append data sets. Socrata is not alone in the web/mobile [First] category: ESRI, CKAN, and to some extent Junar are architected on the same principles. This is not a direct criticism or endorsement of any particular platform.
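To give a flavor of what this looks like in practice, here is a hedged sketch of reading from, and appending to, a Socrata dataset over its SODA endpoint. The dataset ID "abcd-1234", the row schema, and the credentials are all placeholders; a real append also requires publisher rights on the dataset.

```python
# A sketch of reading from, and appending to, a Socrata (SODA) endpoint.
# The dataset ID, schema, and credentials below are placeholders.
import requests

ENDPOINT = "https://data.raleighnc.gov/resource/abcd-1234.json"

# Read: every data set on the portal is published read-only at an endpoint
# shaped like this one.
rows = requests.get(ENDPOINT, params={"$limit": 100}).json()
print(len(rows), "rows fetched")

# Append (upsert): with an app token and publisher credentials, the same
# endpoint accepts POSTed rows -- the "little help" mentioned above for
# pushing data into the portal.
new_rows = [{"sensor": "hypothetical-sensor", "reading": 42}]
resp = requests.post(
    ENDPOINT,
    json=new_rows,
    headers={"X-App-Token": "YOUR_APP_TOKEN"},
    auth=("publisher@example.com", "password"),
)
resp.raise_for_status()
```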

The consequences of getting it wrong

Discussing multi-nodal approaches and espousing an API [First] strategy may seem esoteric until one looks at anecdotal issues with some recent portal launches. Minneapolis recently launched its open data portal to scathing reviews. Most of the criticism centered on the performance of the site: latency, non-responsive design, and crashing web pages. Note that these were all complaints from citizens trying to use the portal through different browsers. The city blamed ESRI, but latency and poorly designed pages that do not validate are not inherent in the platform. ESRI is not an API [First] product, and the city said it should have gone with Socrata. Given the city's handling of the rollout, it seems clear that the lack of a multi-nodal, standards-based approach was a significant, though not the sole, reason for the beta failure.

CKAN also recently announced, through a tweet, the launch of a new, comprehensive open data portal on the Ebola crisis. This is not responsive design. When I look at it on my mobile device, I see a tiny version of the full site and no way to meaningfully consume or re-use the data unless I switch to a larger interface. Now let's think about the consequences of that:

  • Who is the data for? If it's for field workers, this is a huge fail. The most common field devices are tablets and mobile phones. Not having some kind of app consuming an API is an obstacle to data re-use.
  • How do I consume and re-use the data? Following the rabbit hole of links, I can get to geo-data about the crisis. Most of the catalog lists are CSV data sets alongside PDFs giving the context of the data. This is good in that I have metadata, bad in that I cannot query an obvious API endpoint to merge this site's data with other data for my own analysis. (A sketch of what such a query could look like follows this list.)
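For illustration, here is a sketch of querying a CKAN catalog programmatically, assuming the portal exposes the standard CKAN "action" API, as most CKAN instances do. The portal URL is a placeholder; substitute the actual CKAN instance.

```python
# A hedged sketch of searching a CKAN catalog via the standard action API.
# The portal URL below is a placeholder.
import requests

portal = "https://data.example.org"
resp = requests.get(
    f"{portal}/api/3/action/package_search",
    params={"q": "ebola", "rows": 10},
)
resp.raise_for_status()
for pkg in resp.json()["result"]["results"]:
    # Each package lists its resources (CSVs, PDFs, ...) with download URLs.
    print(pkg["title"], [r.get("format") for r in pkg.get("resources", [])])
```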

Conclusions

This is only the tip of the iceberg. Aside from the technical issues around not using an API [First] strategy, we have policy issues around PII (in the case of Minneapolis) and UX issues in the CKAN instance. Comparing human-consumed and API-consumed data, I conclude the following:

  • Data consumed by humans has lower re-use value because it is not being redistributed
  • Data served on a web/mobile [First] platform requires more work to re-use than data served on an API [First] platform
Jason Hare
Two decades of experience analyzing how users interact with web applications. Experience includes developing user interfaces using rapid prototyping and an iterative project management style to create award-winning, user-centered information portals. Primary interests include big data and open data applications and community engagement in the public sector.

3 Comments

Hi Jason,

Indeed, I couldn't agree more. APIs are a key element in fostering data-based innovation.

At OpenDataSoft (http://www.opendatasoft.com), we are building and operating a cloud-based data management platform. This platform is built "API first," meaning that any feature that can be accessed from the portal is also available as an API call. In fact, the portal is itself the first consumer of the API. Available APIs include (see http://docs.opendatasoft.com/collection/1382-using-apis for more details):
- Dataset catalog APIs (keyword and faceted search of datasets within the catalog).
- Dataset APIs (search within dataset records, geo clustering of geo dataset records, numerical aggregations of dataset records).

So, not only can you fetch raw data from the portal through API calls, but you can also access high-level features such as geo clustering and analytics to directly power advanced uses on any kind of device. This makes it extremely easy for an application developer to quickly build a first MVP without having to build any back-end.
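As a concrete illustration of the two layers described above, here is a small sketch against the public OpenDataSoft instance. The dataset ID is a placeholder, and the response field names follow the v1 API as I understand it, so treat them as assumptions.

```python
# A sketch of the catalog API and the dataset (records) API described above.
# Dataset ID and response field names ("datasets", "nhits") are assumptions.
import requests

BASE = "https://public.opendatasoft.com/api"

# 1. Catalog API: keyword (and faceted) search across datasets in the catalog.
catalog = requests.get(f"{BASE}/datasets/1.0/search/", params={"q": "crime"}).json()
print(len(catalog.get("datasets", [])), "datasets found")

# 2. Dataset API: search within one dataset's records, server-side, so a
#    client application never needs its own back-end.
records = requests.get(
    f"{BASE}/records/1.0/search/",
    params={"dataset": "some-dataset-id", "q": "theft", "rows": 5},  # hypothetical ID
).json()
print(records.get("nhits"), "matching records")
```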

While APIs are a key feature of open data portals, one must not forget a major caveat: the potential lack of interoperability. While some advanced features can still be made available through non-standardized, proprietary APIs, it is really key for a data platform to maximize its support for standards: message formats (JSON, GeoJSON, RSS, RDF...), protocols (REST, OData...), and security frameworks (OpenID, OAuth, SAML...). Supporting standards is the only way to get the developer community eager to use data portal APIs.

Another good property of "API first" development is that it encourages the development of reusable frameworks based on these APIs. Indeed, an open data portal shouldn't exclusively target experienced developers. It should also give any citizen the ability to simply reuse Web components to build their own dashboards and data visualizations.
- This is why any standard data visualization built with the OpenDataSoft platform can easily be embedded in a third party Web application (http://public.opendatasoft.com/explore/dataset/chicago_incidents_2001_p…).
- This is also why we recently launched as an Open Source library a set of reusable HTML widgets which can be easily assembled to build data dashboards (http://opendatasoft.github.io/ods-widgets/docs/#/api/ods-widgets.direct…).

So, if I may summarize this in a few statements:
- API [First] development should be mandatory.
- Developing and supporting standards will become more and more important.
- Providing tools and frameworks that ease the re-use of datasets exposed on an open data portal is also a key success factor for an open data policy.

Awesome post, Jason. I think you're spot on about the power of open data APIs to enable reuse of open data in the context where it is most useful, such as on mobile devices and tablets. I'd add that APIs not only allow device flexibility, but also allow developers to build user experiences tailored to the particular citizen-facing problems they're trying to solve.

Data portals aren't enough. We need developers and entrepreneurs reusing open data to build applications and tools that get data to the places where it becomes actionable to citizens: into their mobile devices in context aware ways, into their cars, and into their day to day workflows. Data that lives in a data catalog is not enough - it needs to be brought into the places where citizens actually need it. That's how open data will change, improve, and hopefully even save lives.

One thing I'd add to your thesis, however: while I totally agree on an "API first" strategy, we shouldn't stop there. It is my firm belief that, at least in the case of open data, only providing access via APIs is not enough. APIs often provide a gated, use-case-driven experience, while bulk open data is more useful for researchers and analysts. Also, whether we like it or not, APIs change or go away, and we're dependent on the owner of that API to provide continued access to the data contained therein. Our duty to open data and transparency should include making it possible to create archival copies of those datasets (as long as the license allows it). This is why Socrata is committed to providing not only great APIs to access open data (http://dev.socrata.com), but also to allowing bulk access through downloadable snapshots of datasets.
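As a sketch of that bulk-access side, most Socrata datasets can be exported whole as a single CSV snapshot alongside the row-level API. The domain and the dataset ID below are placeholders.

```python
# A sketch of pulling a whole-dataset CSV snapshot from a Socrata portal.
# The domain and the "4x4" dataset ID are placeholders.
import requests

domain = "data.example.gov"
dataset_id = "abcd-1234"  # hypothetical Socrata 4x4 dataset ID

url = f"https://{domain}/api/views/{dataset_id}/rows.csv"
with requests.get(url, params={"accessType": "DOWNLOAD"}, stream=True) as resp:
    resp.raise_for_status()
    with open(f"{dataset_id}.csv", "wb") as out:
        for chunk in resp.iter_content(chunk_size=8192):
            out.write(chunk)  # an archival copy, independent of the API's lifetime
```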

I actually gave a fun talk on this topic at API Days Paris (http://www.apidays.io/) last December. You can flip through my slides if you want and I think a recording may be available soon: https://socrata.github.io/presentations/conferences/2014-12-03-open-dat…

Hey Chris

Thank you for the links and the comments. I think we are onto something here: API [First], but not API only. While I feel that machine-to-machine data re-use has a higher return on investment (it certainly scales faster), humans and human-readable visualizations of data are important. This is especially true for disclosures about public sector activity, such as public safety and public finance.

Would you like to work together on another post regarding multi-nodal data publishing? I am all in if you are.

Best regards

Jason


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.