Data journalism with open source software
6 open source tools for data journalism
When I was in journalism school back in the late 1980s, gathering data for a story usually involved hours of poring over printed documents or microfiche.
A lot has changed since then. While printed resources are still useful, more and more information is available to journalists on the web. That’s helped fuel a boom in what’s come to be known as data journalism. At its most basic, data journalism is the act of finding and telling stories using data—like census data, crime statistics, demographics, and more.
There are a number of powerful and expensive tools that enable journalists to gather, clean, analyze, and visualize data for their stories. Many smaller or struggling news organizations, let alone independent journalists, just don’t have the budget for those tools. But that doesn’t mean they’re out in the cold.
There are a number of solid open source tools for data journalists that do the job both efficiently and impressively. This article looks at six tools that can help data journalists get the information that they need.
Grabbing the data
Journalists can download much of the data they find on the web as a spreadsheet or as CSV or PDF files. But there’s a lot of information that’s embedded in web pages. Instead of manually copying and pasting that information, a trick just about every data journalist uses is scraping. Scraping is the act of using an automated tool to grab information embedded in a web page, often in the form of an HTML table.
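To make the idea of scraping concrete, here is a minimal sketch that uses only Python’s standard library. It assumes the page has already been downloaded and that its tables are simple (no tables nested inside other tables). The class name, the helper function, and the sample crime figures are all hypothetical, made up for illustration.

```python
from html.parser import HTMLParser
import csv
import io


class TableScraper(HTMLParser):
    """Collect the rows of every <table> in a page as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(rows):
    """Serialize a list of rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical page fragment; a real scraper would download the HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Burglaries</th></tr>
  <tr><td>Springfield</td><td>120</td></tr>
  <tr><td>Shelbyville</td><td>95</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
crime_csv = table_to_csv(scraper.tables[0])
print(crime_csv)
```

The result is CSV text that can be opened in a spreadsheet. Dedicated scraping tools handle the harder parts that this sketch skips, such as fetching pages, following links, and coping with messy markup.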
If you, or someone in your organization, has a technical bent, then Scrapy might be the tool for you. Written in Python, Scrapy is a command line tool that can quickly extract structured data from web pages. Scrapy is a bit challenging to install and set up, but once it’s up and running you can take advantage of a number of useful features. Python-savvy programmers can also quickly extend those features.
Spreadsheets are one of the basic tools of the data journalist. In the open source world, LibreOffice Calc is the most widely used spreadsheet editor. Calc isn’t just for viewing and manipulating data. By taking advantage of its Web Page Query import filter, you can point Calc to a web page containing data in tables and grab one or all of the tables on the page. While it’s not as fast or efficient as Scrapy, Calc gets the job done nicely.
Dealing with PDFs
Whether by accident or by design, a lot of data on the web is locked in PDF files. Many of those PDFs can contain useful information. If you’ve done any work with PDFs, you know that getting data out of them can be a chore.
That’s where DocHive, a tool developed by the Raleigh Public Record for extracting data from PDFs, comes in. DocHive works with PDFs created from scanned documents. It analyzes the PDF, separates it into smaller pieces, and then uses optical character recognition to read the text and write it to a CSV file.
Tabula is similar to DocHive. It’s designed to grab tabular information in a PDF and convert it to a CSV file or a Microsoft Excel spreadsheet. All you need to do is find a table in the PDF, select the table, and let Tabula do the rest. It’s fast and efficient.
Cleaning your data
Often, the data you grab will contain spelling and formatting errors or problems with character encoding. That makes the data inconsistent and unreliable, so cleaning it is essential.
If you have a small data set, one that consists of a few hundred rows of information, then you can use LibreOffice Calc and your eyeballs to do the cleanup. But if you have larger data sets, doing the job manually will be a long, slow, inefficient process.
Instead, turn to OpenRefine. It automates the process of manipulating and cleaning your data. OpenRefine can sort your data, automatically find duplicate entries, and reorder your data. The real power of OpenRefine comes from facets. Facets are like filters in spreadsheets that let you zoom in on specific rows of data. You can use facets to ferret out blank cells and duplicate data, as well as see how often certain values appear in the data.
OpenRefine can do a lot more than that. You can get an idea about what OpenRefine can do by browsing the documentation.
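The idea behind a facet can be sketched in a few lines of plain Python. This is not how OpenRefine is implemented; it is only an illustration of counting the values in a column to expose blanks, inconsistent spellings, and duplicate rows. The sample rows, column names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical scraped rows with the kinds of problems described above:
# inconsistent capitalization, stray whitespace, a blank cell, a duplicate.
rows = [
    {"city": "Springfield", "offense": "Burglary"},
    {"city": "springfield ", "offense": "burglary"},
    {"city": "Shelbyville", "offense": "Theft"},
    {"city": "", "offense": "Theft"},
    {"city": "Shelbyville", "offense": "Theft"},
]


def normalize(value):
    """Trim whitespace and unify capitalization so variants compare equal."""
    return value.strip().title()


def text_facet(rows, column):
    """Count how often each normalized value appears in one column."""
    return Counter(normalize(row[column]) for row in rows)


def find_duplicates(rows):
    """Return the normalized rows that appear more than once."""
    counts = Counter(tuple(normalize(v) for v in row.values()) for row in rows)
    return [row for row, n in counts.items() if n > 1]


city_facet = text_facet(rows, "city")
blank_cities = city_facet[""]  # how many rows have an empty city cell
duplicates = find_duplicates(rows)

print(city_facet)
print("blank cells:", blank_cities)
print("duplicates:", duplicates)
```

Normalizing before counting is the key design choice here: without it, "Springfield" and "springfield " would be counted as two different cities, which is exactly the kind of inconsistency a cleanup pass is meant to catch.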
Visualizing your data
Having the data and writing a story with it is all well and good. But a good graphic based on that data can be a boon when you’re trying to summarize, communicate, and understand it. That explains the popularity of infographics on the web and in print.
You don’t need to be a graphic design wizard to create an effective visualization. If your needs aren’t too complex, Datawrapper can do the job. It’s an online tool that breaks creating a visualization into four steps: copy data from a spreadsheet, describe your data, choose the type of image you want, then generate the graphic. You don’t get a wide range of image types with Datawrapper, but the process couldn’t be easier.
Obviously, this isn’t an exhaustive list of open source data journalism tools. But the tools discussed in this article provide a solid platform for a journalism organization on a budget, or even an intrepid freelancer, to use data to generate story ideas and to back those stories up.