6 open source tools for data journalism

When I was in journalism school back in the late 1980s, gathering data for a story usually involved hours of poring over printed documents or microfiche.

A lot has changed since then. While printed resources are still useful, more and more information is available to journalists on the web. That’s helped fuel a boom in what’s come to be known as data journalism. At its most basic, data journalism is the act of finding and telling stories using data—like census data, crime statistics, demographics, and more.

There are a number of powerful, and expensive, tools that enable journalists to gather, clean, analyze, and visualize data for their stories. But many smaller or struggling news organizations, not to mention independent journalists, just don’t have the budget for those tools. That doesn’t mean they’re out in the cold, though.

There are a number of solid open source tools for data journalists that do the job both efficiently and impressively. This article looks at six tools that can help data journalists get the information that they need.

Grabbing the data

Much of the data that journalists find on the web can be downloaded as a spreadsheet or as CSV or PDF files. But a lot of information is embedded in web pages themselves. Instead of manually copying and pasting that information, a trick just about every data journalist uses is scraping. Scraping is the act of using an automated tool to grab information embedded in a web page, often in the form of an HTML table.

If you, or someone in your organization, is of a technical bent, then Scrapy might be the tool for you. Written in Python, Scrapy is a command line tool that can quickly extract structured data from web pages. Scrapy is a bit challenging to install and set up, but once it’s up and running you can take advantage of a number of useful features. Python-savvy programmers can also quickly extend those features.
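
To give you a sense of what that looks like in practice, here is a minimal sketch of a Scrapy spider that pulls the rows out of any HTML tables on a page. The URL, the field names, and the CSS selectors are placeholders you would adapt to the page you’re scraping.

```python
# table_spider.py -- a minimal Scrapy spider (the URL and selectors are placeholders)
import scrapy


class TableSpider(scrapy.Spider):
    name = "table_spider"
    # Replace with the page that holds the table you want to grab
    start_urls = ["https://example.com/some-data-page"]

    def parse(self, response):
        # Walk every row of every HTML table on the page
        for row in response.css("table tr"):
            cells = row.css("td::text").getall()
            if cells:
                # Yield one item per row; Scrapy handles the export format
                yield {f"col_{i}": cell.strip() for i, cell in enumerate(cells)}
```

Running scrapy runspider table_spider.py -o table.csv writes whatever the spider yields straight to a CSV file, which you can then open in a spreadsheet.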

Spreadsheets are one of the basic tools of the data journalist. In the open source world, LibreOffice Calc is the most widely used spreadsheet editor. Calc isn’t just for viewing and manipulating data. By taking advantage of its Web Page Query import filter, you can point Calc at a web page containing data in tables and grab one or all of the tables on the page. While it’s not as fast or efficient as Scrapy, Calc gets the job done nicely.

Dealing with PDFs

Whether by accident or by design, a lot of data on the web is locked in PDF files. Many of those PDFs can contain useful information. If you’ve done any work with PDFs, you know that getting data out of them can be a chore.

That’s where DocHive, a tool developed by the Raleigh Public Record for extracting data from PDFs, comes in. DocHive works with PDFs created from scanned documents. It analyzes the PDF, separates it into smaller pieces, and then uses optical character recognition to read the text and write it to a CSV file. Read more about DocHive in this article.
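
DocHive is a ready-made tool, but if you’re curious what that pipeline amounts to, here is a rough sketch of the same idea in Python. This is an illustration, not DocHive’s own code: it leans on the pdf2image and pytesseract libraries, assumes the Poppler and Tesseract packages are installed, and the file names are placeholders.

```python
# ocr_sketch.py -- a rough sketch of the scan-to-text idea behind tools like DocHive
# (not DocHive itself; assumes the pdf2image and pytesseract libraries,
#  plus the Poppler and Tesseract packages, are installed)
import csv

from pdf2image import convert_from_path  # renders each PDF page as an image
import pytesseract                       # Python wrapper around the Tesseract OCR engine

pages = convert_from_path("scanned_report.pdf")  # placeholder file name

with open("scanned_report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["page", "text"])
    for number, image in enumerate(pages, start=1):
        # Run OCR on the page image and keep the raw text for later cleanup
        text = pytesseract.image_to_string(image)
        writer.writerow([number, text.strip()])
```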

Tabula is similar to DocHive. It’s designed to grab tabular information in a PDF and convert it to a CSV file or a Microsoft Excel spreadsheet. All you need to do is find a table in the PDF, select the table, and let Tabula do the rest. It’s fast and efficient.
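
Tabula itself is a point-and-click tool, but there is also a community Python wrapper, tabula-py, if you’d rather script the extraction. Here is a minimal sketch, assuming the wrapper and a Java runtime are installed; the file name is a placeholder.

```python
# extract_tables.py -- a minimal sketch using the community tabula-py wrapper
# (it drives Tabula's Java extraction engine, so a Java runtime is required)
import tabula

# Pull every table from every page; recent versions return a list of pandas DataFrames
tables = tabula.read_pdf("budget_report.pdf", pages="all")  # placeholder file name

# Save each table as its own CSV file for cleanup and analysis
for i, table in enumerate(tables):
    table.to_csv(f"budget_table_{i}.csv", index=False)
```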

Cleaning your data

Often, the data you’ll grab may contain spelling and formatting errors or problems with character encoding. That makes the data inconsistent and unreliable, and makes cleaning the data essential.

If you have a small data set, one that consists of a few hundred rows of information, then you can use LibreOffice Calc and your eyeballs to do the cleanup. But if you have larger data sets, doing the job manually will be a long, slow, inefficient process.

Instead, turn to OpenRefine. It automates the process of manipulating and cleaning your data. OpenRefine can sort your data, automatically find duplicate entries, and reorder your data. The real power of OpenRefine comes from facets. Facets are like filters in spreadsheets that let you zoom in on specific rows of data. You can use facets to ferret out blank cells and duplicate data, as well as see how often certain values appear in the data.
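
To make the facet idea concrete, here is a short plain-Python sketch of what a text facet computes over one column of a CSV file: how often each value appears, plus which rows are blank. This is an illustration only, not OpenRefine itself, and the file and column names are placeholders.

```python
# facet_sketch.py -- a plain-Python illustration of what an OpenRefine text facet shows
# (not OpenRefine itself; the file and column names are placeholders)
import csv
from collections import Counter

counts = Counter()
blank_rows = []

with open("crime_stats.csv", newline="") as f:
    for line_number, row in enumerate(csv.DictReader(f), start=2):
        value = (row.get("neighborhood") or "").strip()
        if value:
            counts[value] += 1
        else:
            blank_rows.append(line_number)

# How often each value appears; spelling variants show up as separate entries
for value, count in counts.most_common():
    print(f"{value}: {count}")

print("Rows with a blank cell:", blank_rows)
```

Seeing that "Raleigh", "raleigh", and "Ralegh" are counted as three different values is exactly the kind of inconsistency a facet surfaces, and OpenRefine then lets you fix those values in bulk.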

OpenRefine can do a lot more than that. You can get an idea about what OpenRefine can do by browsing the documentation.

Visualizing your data

Having the data and writing a story with it is all well and good. A good graphic based on that data can be a boon when trying to summarize, communicate, and understand data. That explains the popularity of infographics on the web and in print.

You don’t need to be a graphic design wizard to create an effective visualization. If your needs aren’t too complex, Datawrapper is a good choice. It’s an online tool that breaks creating a visualization into four steps: copy data from a spreadsheet, describe your data, choose the type of chart you want, then generate the graphic. You don’t get a wide range of chart types with Datawrapper, but the process couldn’t be easier.

Obviously, this isn’t an exhaustive list of open source data journalism tools. But the tools discussed in this article provide a solid platform for a journalism organization on a budget, or even an intrepid freelancer, to use data to generate story ideas and to back those stories up.

That idiot Scott Nesbitt ...
I'm a long-time user of free/open source software, and write various things for both fun and profit. I don't take myself all that seriously and I do all of my own stunts.

Comments

What’s missing is: secure your data, secure your sources, secure your communication!

Thanks for sharing these resources, Scott. I was in journalism school in the late 90s/early 2000s and took a class called "Internet Journalism." We learned the basics of how to search for information online, pre-Google. We were told to do all of our searches on ixquick.com. Later, when I became a city hall reporter, I used to have to trek downtown to city hall to pick up the city council agenda every other week. These days the full agendas are online, including supporting materials, not to mention a wealth of other data from city departments. I have mad respect for journalists who reported in pre-Internet days, especially on data-driven stories. It's amazing how far we've come and how much has changed in such a short time.

Ginny, times definitely have changed. When I was in J-school in the 80s, an investigative reporter visited my class and described his typical day: head over to the hall of records (or wherever) first thing in the morning, comb through documents, take a break for lunch, comb through more documents. With a few breaks in between and attempts to ferret out sources. And more documents ...

But, as Nicolas Kayser-Bril pointed out, journalists should be extremely careful before reusing a dataset that was proactively published by a government. Or by anyone else, for that matter.

As a beginning data journalist, you’ll want to develop a sense of the tools others are using to do the work you admire. You won’t be able to learn them all at once, and you shouldn’t try. You should, however, develop a sort of ambient awareness of the tools in use. Keep a list of tools to check out. Watch the demos and browse the documentation or code. Then, when your projects create the need, you’ll remember enough to get you started.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.