New phase of DocHive, open source tool for data extraction

In February of this year, I reported that the Raleigh Public Record, a local online news publication in Raleigh, NC, was creating an open source solution to extract data from PDFs. Many journalists need to convert data and images from the documents behind their reports into a usable format, and given the nature of their job, they need to do it quickly and easily (see an example here).

The project, DocHive, is now moving into its next cycle of development under the leadership of Edward Duncan. He tells us what he has planned for his team over the next six months. But first, I asked:


How does this project being open source help other journalists?

Raw data is often readily available to journalists who are investigating and writing news reports. They have access to PDF forms and other structured image-based files, but reporters, especially those working for local publications around the country and the world (like the Raleigh Public Record), benefit greatly when they can spend less time on raw data extraction.

Because DocHive is an open source tool, anyone can use it for free and modify it to fit the needs of their job or publication. It helps them extract information more efficiently, which lets them spend more time interpreting and analyzing it.

What does DocHive's next development cycle look like?

It will span approximately six months and kick off in August 2013.

The first half of development will introduce new features and enhance existing components. During August, the initial activities will be directed toward the web-based template builder system and support resources. In September, we will introduce more ways to interact with the extracted data and work on front-end development.

The second half of the development cycle will cover additional testing, improving accuracy and performance, and deployment testing with multiple environments.

What's on the schedule?

Currently, I am focusing on several preliminary activities: creating a more detailed schedule, writing initial documentation, and looking for a few more individuals to fill some development roles.

As the project moves into August, I will be focusing on the template builder upgrades and documentation while the team works on template sharing and some analytical functions.

We will be launching an online beta version of DocHive during the development cycle to get feedback from users and improve runtime efficiency. Access to the beta version will be controlled through invitation codes from the Raleigh Public Record.

Tell us about your high and mid-level objectives.

Our high-level objectives are:

  1. Improve the overall ease of use
  2. Incorporate dynamic templates for variable content length
  3. Build comprehensive support resources and documentation
  4. Support multiple deployment environments

Our mid-level objectives are:

  1. Improve automatic template matching
  2. Enable full access to metadata
  3. Improve post-conversion reporting for documents
  4. Introduce configurable options for easily extending the system

Jen leads a team of community managers for the Digital Communities team at Red Hat. She lives in Raleigh with her husband and daughters, June and Jewel.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.