Journalist creates open source solution to extract data from PDFs

A group of journalists is announcing the launch of an open source tool that tackles a problem many writers and journalists face: how to take data locked in PDFs or images and easily convert it to a spreadsheet or other usable format.

Editor Charles Duncan Pardo and his team of reporters at the Raleigh Public Record are like many small newsrooms—they don't have the staff to do data entry for hundreds of pages of information, nor the budget to hire some unfortunate college student to do it for them. He says:

This is a problem we at the Record have been trying to overcome for more than two years. The story started with Wake County campaign finance returns. The returns are filed as paper, and staff at the Wake County Board of Elections scan them in and put the images online. The problem is, the only way to view the data is to look at it page by page, and the only way to analyze it is to go through by hand and enter the data into a spreadsheet one row at a time.

Duncan created DocHive with his brother, full-time programmer Edward Duncan. It uses XML to break a page up into smaller sections, separating each into its own image file, then uses optical character recognition (OCR) technology to read the few words or numbers in each section and insert them into a text file.
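The pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DocHive's actual code: the article does not describe the XML schema, so the `template` and `field` elements and their attributes are hypothetical, the page is modeled as a plain grid of characters rather than a scanned image, and the OCR step is a stand-in function that a real OCR engine would replace.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the element and attribute names are assumptions,
# since the article does not describe DocHive's real XML format.
TEMPLATE = """
<template>
  <field name="donor"  x="0" y="0" width="5" height="1"/>
  <field name="amount" x="5" y="0" width="3" height="1"/>
</template>
"""

def parse_template(xml_text):
    """Return a list of (name, x, y, width, height) regions from the XML."""
    root = ET.fromstring(xml_text.strip())
    regions = []
    for field in root.iter("field"):
        regions.append((
            field.get("name"),
            int(field.get("x")),
            int(field.get("y")),
            int(field.get("width")),
            int(field.get("height")),
        ))
    return regions

def crop(page, x, y, width, height):
    """Cut a rectangle out of a page stored as a list of pixel rows."""
    return [row[x:x + width] for row in page[y:y + height]]

def extract_fields(page, xml_text, ocr):
    """Crop every template region and run the supplied OCR callable on it."""
    record = {}
    for name, x, y, width, height in parse_template(xml_text):
        record[name] = ocr(crop(page, x, y, width, height))
    return record

def fake_ocr(region):
    """Stand-in for OCR: the toy 'pixels' are characters, so just join them."""
    return "".join("".join(row) for row in region)

# Toy one-row "page"; a real page would be a scanned image.
page = [list("SMITH250")]
record = extract_fields(page, TEMPLATE, fake_ocr)
print(record)  # {'donor': 'SMITH', 'amount': '250'}
```

Passing the OCR step in as a function keeps the template logic separate from whichever recognition engine is used, which is one plausible way to let the same template be reused across hundreds of identically laid-out pages.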

DocHive will be officially released on February 28 at the annual Computer-Assisted Reporting conference organized by Investigative Reporters & Editors and the National Institute for Computer-Assisted Reporting. The code will live on GitHub, and the Record is setting up a wiki on its server to host documentation and shared templates. The team has not yet chosen a license.

The technology comes as welcome news to journalists and other writers across the country, who will now have a quick, easy way to convert scanned documents into structured data.

According to Tyler Dukes, managing editor at Reporter’s Lab (a resource for journalists seeking technology solutions), extracting data from scanned documents is a common struggle for journalists, yet few others have tried to solve it. Neither Paul Bradshaw, a British journalist and author of "Scraping for Journalists," nor Pete Warden, developer and author of O’Reilly’s "Data Source Handbook," had heard of a similar solution. Dukes continues:

Public records provide a wealth of information reporters can use to hold the powerful accountable. But antiquated government and corporate bookkeeping can make dealing with these documents expensive, time-consuming and out of reach for many newsrooms.

By equipping reporters with the ability to automatically digitize these scanned paper records, DocHive could take hours of drudgery out of the early stages of document-driven investigative reporting and dramatically upgrade the watchdog capabilities of newsrooms large and small.

Duncan and his brother will present DocHive at the NICAR conference in Louisville, KY, on March 1.