Textricator: Data extraction made simple

New open source tool extracts complex data from PDF docs, no programming skills required.

Opensource.com

You probably know the feeling: You ask for data and get a positive response, only to open the email and find a whole bunch of PDFs attached. Data, interrupted.

We understand your frustration, and we’ve done something about it: Introducing Textricator, our first open source product.

We’re Measures for Justice, a criminal justice research and transparency organization. Our mission is to provide data transparency for the entire justice system, from arrest to post-conviction. We do this by producing a series of up to 32 performance measures covering the entire criminal justice system, county by county. We get our data in many ways (all legal, of course). Many state and county agencies are data-savvy and give us quality, formatted data in CSVs, but often the data is bundled inside software with no simple way to get it out, and PDF reports are the best the agency can offer.

Developers Joe Hale and Stephen Byrne have spent the past two years developing Textricator to extract tens of thousands of pages of data for our internal use. Textricator can process just about any text-based PDF format—not just tables, but complex reports with wrapping text and detail sections generated from tools like Crystal Reports. Simply tell Textricator the attributes of the fields you want to collect, and it chomps through the document, collecting and writing out your records.

Not a software engineer? Textricator doesn’t require programming skills; rather, the user describes the structure of the PDF and Textricator handles the rest. Most users run it via the command line; however, a browser-based GUI is available.
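
To make the idea of "describing the structure" concrete, here is a minimal Python sketch of the general technique: fields are declared by where they sit on the page, and text fragments are grouped into records, with wrapped lines appended to the field they continue. This is not Textricator's code or its configuration format. The `Fragment` type, the `FIELDS` column bounds, the `RECORD_START` rule, and the sample data are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position on the page (hypothetical model)."""
    x: float  # left edge of the text
    y: float  # vertical position, increasing down the page
    text: str


# Hypothetical field description: field name -> (min_x, max_x) column bounds.
FIELDS = {
    "case_id": (0, 100),
    "charge": (100, 300),
    "disposition": (300, 400),
}
# Assumption: a fragment landing in this column begins a new record.
RECORD_START = "case_id"


def field_for(fragment, fields):
    """Return the name of the field whose column contains the fragment."""
    for name, (lo, hi) in fields.items():
        if lo <= fragment.x < hi:
            return name
    return None


def extract_records(fragments, fields=FIELDS, record_start=RECORD_START):
    """Group positioned text fragments into records, one dict per record."""
    records = []
    current = None
    # Read the page top to bottom, then left to right.
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        name = field_for(frag, fields)
        if name is None:
            continue  # text outside every declared column is ignored
        if name == record_start:
            current = {key: "" for key in fields}
            records.append(current)
        if current is None:
            continue  # skip text that appears before the first record
        # Wrapped text on a later line is appended to the same field.
        current[name] = (current[name] + " " + frag.text).strip()
    return records


# Made-up sample: the first record's charge wraps onto a second line.
SAMPLE = [
    Fragment(10, 10, "A-001"),
    Fragment(110, 10, "Possession of a"),
    Fragment(310, 10, "Dismissed"),
    Fragment(110, 20, "controlled substance"),
    Fragment(10, 30, "A-002"),
    Fragment(110, 30, "Trespass"),
    Fragment(310, 30, "Guilty"),
]

if __name__ == "__main__":
    for record in extract_records(SAMPLE):
        print(record)
```

The point of the sketch is the division of labor: the user supplies only a description of where fields live, and a generic loop does the collecting. A real tool also has to deal with page headers, fonts, and multi-page records, which this toy version ignores.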

We evaluated other great open source solutions like Tabula, but they just couldn’t handle the structure of some of the PDFs we needed to scrape. “Textricator is both flexible and powerful and has cut the time we spend to process large datasets from days to hours,” says Andrew Branch, director of technology.

At MFJ, we’re committed to transparency and knowledge-sharing, which includes making our software available to anyone, especially those trying to free and share data publicly. Textricator is available on GitHub and released under the GNU Affero General Public License Version 3.

You can see the results of our work, including data processed via Textricator, on our free online data portal. Textricator is an essential part of our process and we hope civic tech and government organizations alike can unlock more data with this new tool.

If you use Textricator, let us know how it helped solve your data problem. Want to improve it? Submit a pull request.

Steve (Spike) Spiker is the Data Evangelist for Measures for Justice and the co-founder and former ED of OpenOakland, a civic tech organization focused on supporting open, agile, and engaged government. He was previously the Director of Research & Technology with Urban Strategies Council in Oakland and runs Stealing Beauty Photography.
developer at Measures for Justice

1 Comment

Sometimes I wish PDF would just die.

Every week I download my diet, distributed as a PDF. I could download the whole week at once, but then the pages run on continuously; if I want to print one day per sheet, I need to download each day separately and then use pdfjoiner for a quick join. Except not all days are 2 pages, some are 1 page, so I have to filter out which ones to join.

As if that weren't enough, there are unnecessary borders, so I need to run croppdf on all the files before printing so the text is more visible.

As such, every week I waste around 10-15 minutes printing out 7 pieces of paper. Too bad this tool isn't going to be much help in this case, but I can see its uses if I ever need to do some more serious work.

All of that would be a simple operation, but not with PDF.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.