Open source data integration with Karma

Image by: Opensource.com

Karma is a free, open source data integration tool that makes it easy to convert data from a variety of formats into linked data.

I recently attended a half-day workshop on Karma with Pedro Szekely, our instructor. He started by warning us that he knows very little about libraries, but a ton about data. The files we needed for the workshop are on GitHub, if you’re interested in checking them out. You can follow the tutorial steps on the wiki, and, of course, you can find Karma itself on GitHub.

The basics

Karma is a web-based tool, but both the server and the browser run right on your own machine, so we had computers with the tool already installed to play with.

Users load into Karma the ontologies for their application and a data sample from each of the data files to be converted. Karma makes the conversion process easy by providing an intuitive graphical user interface for visualizing and editing the mapping of data files to ontologies.
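To make the idea of "mapping data files to ontologies" a bit more concrete, here is a minimal sketch in plain Python. It is not Karma's code or its mapping format, just an illustration of what such a mapping does. The column names, the example.org namespace, and the sample rows are all made up, and the choice of Dublin Core properties is my own assumption.

```python
import csv
import io

# Hypothetical mapping from column names to ontology properties.
# The two properties are Dublin Core terms, chosen only for illustration.
COLUMN_TO_PROPERTY = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
}

BASE_URI = "http://example.org/item/"  # made-up namespace for subjects


def rows_to_triples(csv_text, id_column="id"):
    """Turn each mapped, non-empty CSV cell into a (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE_URI + row[id_column]
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells
                triples.append((subject, prop, value))
    return triples


# Made-up sample data: the second row has no creator.
SAMPLE = "id,title,creator\n1,Moby Dick,Herman Melville\n2,Emma,\n"

for triple in rows_to_triples(SAMPLE):
    print(triple)
```

In Karma you build this kind of mapping by clicking through the interface rather than writing a dictionary by hand, but the result is the same in spirit: each column ends up tied to a property in an ontology.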

Karma is flexible: it can import data from a wide variety of formats (SQL, XML, JSON, CSV, Excel, Avro, web services).

Karma scales to very large datasets (40 million documents, 1 billion triples) and can refresh them periodically (e.g., every hour).

Hands on

The rest of the workshop was hands-on experience with Karma.

After we loaded some sample data into Karma, we mapped it to a few ontologies. When we clicked on the title field, for example, Karma gave us four suggestions for what our titles might need to be mapped to. It can make these suggestions because the tool learns from earlier mappings (including any mapping mistakes you made in the past). This can be a huge time saver if you’re often working with the same types of data. Pedro did remind us that Karma does not know the right mapping; the user gets to choose whatever they want, even if it’s “wrong”.
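I don't know how Karma actually learns its suggestions, and the workshop didn't go into it. Purely as a toy illustration of the idea, here is a sketch that ranks candidate properties by how often a column name was mapped to them before. The class name, method names, and property labels are all hypothetical, and the real tool is certainly more sophisticated than a frequency count.

```python
from collections import Counter, defaultdict


class MappingSuggester:
    """Toy suggester: ranks properties by how often a column name was mapped to them."""

    def __init__(self):
        # column name (lowercased) -> Counter of properties chosen for it
        self._history = defaultdict(Counter)

    def record(self, column_name, prop):
        """Remember that the user mapped this column to this property."""
        self._history[column_name.lower()][prop] += 1

    def suggest(self, column_name, limit=4):
        """Return up to `limit` previously chosen properties, most frequent first."""
        counts = self._history.get(column_name.lower(), Counter())
        return [prop for prop, _ in counts.most_common(limit)]


suggester = MappingSuggester()
suggester.record("Title", "dcterms:title")
suggester.record("title", "dcterms:title")
suggester.record("title", "schema:name")
suggester.record("title", "dcterms:creator")  # a past mistake is remembered too

print(suggester.suggest("title"))
```

Notice that the mistaken mapping still shows up in the suggestions, which is exactly why the final choice stays with the user.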

Once your data is loaded, you can use Python scripts to clean it up if you’d like. Each column has a ‘PyTransform’ option in its menu. I personally have never written Python, but it looks pretty simple, and Pedro assured us that before he used Karma he didn’t know Python either, but found that every question he had was already asked and answered on Stack Overflow.
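For a sense of how small these cleanup scripts can be, here is a sketch of the kind of logic you might write. It is a standalone Python function, not a copy of anything from the workshop: how Karma hands the cell value to a PyTransform script isn't covered here, so the function name and its argument are hypothetical, and the messy title is made up.

```python
import re


def clean_title(raw):
    """Collapse runs of whitespace, then drop one trailing slash or period."""
    text = re.sub(r"\s+", " ", raw).strip()
    # Catalog records often end titles with " /" or "."; remove one such mark.
    text = re.sub(r"\s*[/.]$", "", text)
    return text


print(clean_title("  Moby   Dick /  "))
```

A few lines like this, applied to a whole column, can tidy up a lot of inconsistent data before you map it.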

Once you’re done working with your data, you can generate RDF, MySQL, JSON, or many other formats for use with web applications.

When we were editing the data in a column, Pedro made a very funny comment about one of the options we had to choose from. He said, “You should never do that,” and when asked why it was an option at all, he said, “Because someone asked us to add it.” I find myself answering this question in exactly the same way when teaching people how to use open source tools. Open source is full of features that are there just because someone asked for them.

Conclusions

What I learned from this workshop is that Karma is awesomely powerful! We have so much messy data out there that a tool like this can be very handy, and of course it’s open source, which makes it that much more appealing. I also learned that I’m probably not really cut out for working with a tool like Karma on a daily basis, but I know a lot of people who will be, and I hope this summary will help them out.