Getting Groovy with data

Learn how to get started with Groovy programming and add it to your data analysis toolkit.

Opensource.com

Groovy is an almost perfect complement to Java, providing a compact, highly expressive and compatible scripting environment for my use. Of course, Groovy isn't totally perfect; as with any programming language, its design is based on a series of trade-offs that need to be understood in order to produce quality results. But for me, Groovy's advantages far outweigh its disadvantages, making it an indispensable part of my data analysis toolkit. In a series of articles, I'll explain how and why.

In the late 1990s, I found myself becoming increasingly interested in the Java programming language, using it more and more for work that was too complicated for AWK and for most of what I used to do in C. By 2005, the low cost and great functionality of Linux had convinced me to ditch my beloved but aging Sun workstation. And for my kind of work, AWK, sort(1), paste(1), and join(1) had serious competition in the Linux environment, first from Perl, then from Python. The syntax of Perl has never been to my taste, but I found Python intriguing because of its readability, its "batteries included" philosophy, and its level of integration with all sorts of other things, such as delimited text files, spreadsheets, databases, and graphing. There was just one problem: it didn't give me what felt like clean and transparent access to the whole Java universe that was becoming increasingly central to my workflow.

And then I "discovered" Groovy.

Delimited text files

In my AWK-centric data analysis universe, working with data generally means working with delimited text files. This came about through a combination of two factors. The first was that Unix text file processing facilities generally recognized that data was often encountered in delimited text files—that is, text files whose lines, delimited by newline characters, were separated into fields, delimited by a field separator character (for example, a TAB, or some other unusual character, such as the vertical bar). The second was that tools such as spreadsheets tended to provide an "export" facility that produced comma-separated value text files, whose first line was by convention the names of each field, and whose remaining lines consisted of fields of data separated by commas (or, in countries that used the comma as a decimal point, by semicolons).

AWK is pretty good at dealing with delimited text files, unless the field or line delimiters also show up within the data. Moreover, AWK is also really geared toward being used to write stanzas of code that react to the data presented, and is not nearly so attractive when the data presentation is complex (for example, hierarchical). Nor does AWK really provide any good way to read from, or write to, a relational database, or a spreadsheet, or a binary format such as dBase, without passing through an intermediate delimited-text format.

This is where a more complete programming language, such as Python or Groovy, starts to become interesting. But before getting to those kinds of direct integration examples, I'm going back to delimited text. Let's write some code! But first, let's get some data! But wait: we'd better install Groovy first.

Getting Groovy

The best way to find out how to install Groovy is to go to the installation instructions at groovy-lang.org. I prefer to use SDKMAN for this purpose (documented midway down the installation instructions), but you can also install the version in your repositories. Note that Java is a prerequisite. These days, I use Java 8. Again, you can install the version in your repositories.

Getting data

Now that you have Groovy, use your browser to visit the open world population data from the World Bank site. On the right, you'll see a Download button. Get the data in CSV format; it comes zipped in a directory called API_SP.POP.TOTL_DS2_en_csv_v2. Unzip this directory into a good place on your system. Then open a terminal window and cd into that directory.

Finally—some code!

Here is a simple Groovy script to read one of the CSV files you downloaded and print it to your terminal window:

String mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"

new File(mdCountryCSV).withReader { reader ->
    reader.eachLine { line ->
        println line
    }
}

This script gives a good overview of what Groovy provides for Java programmers.

First the String mdCountryCSV = .... This is "just like Java"—we are declaring a String variable that is initialized to a String literal. Oh yes, Groovy allows us to drop line-ending semicolons in most cases.

Next, new File(mdCountryCSV).withReader { reader ->, which is closed by a } four lines later. The new File() part is also just like Java; however, Groovy enhances a lot of java.lang.*, java.io.*, java.util.*, and other parts of the standard Java libraries. And in this case, Groovy enhances the File class with a method called withReader. This method accepts a closure as an argument, which in this case we manifest as the block of code { reader -> ... }. The reader -> defines the argument to the closure as the variable reader.

What does this Groovy newness accomplish? Functionally, withReader creates a Reader instance and calls the closure code with that instance assigned to the variable reader. When the closure finishes, withReader closes the Reader and releases its resources, even if the closure throws an exception. Effectively, this lets the Groovy programmer pass anonymous blocks of code as parameters to other method calls. Moreover, the surrounding context is available inside the closure without any special hocus-pocus.
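To see both points in isolation (access to the surrounding context, and the automatic cleanup), here is a minimal sketch. It assumes a StringReader as a stand-in for the file so that it runs without any data on disk; the names sample, lineCount, source, and closed are made up for illustration.

```groovy
// A StringReader stands in for a file here (an assumption for illustration),
// so the sketch runs without any data on disk.
String sample = "alpha\nbeta\ngamma\n"
int lineCount = 0                       // declared outside the closures

Reader source = new StringReader(sample)
source.withReader { reader ->
    reader.eachLine { line ->
        lineCount++                     // the closure updates the outer variable
    }
}

// By the time withReader returns, the reader has been closed,
// so a further read attempt throws an IOException.
boolean closed = false
try {
    source.read()
} catch (IOException e) {
    closed = true
}

println "Read ${lineCount} lines; reader closed: ${closed}"
```

Run as a script, this should report three lines and a closed reader, with no explicit close() call anywhere in the code.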

Next, reader.eachLine { line ->, which is closed by a } two lines later. Again, we are seeing a Groovy-enhanced Reader method, eachLine, being called with a closure as an argument, which in this case we manifest with { line -> ... }. Here the Reader instance calls the closure for each line of the file read.

Finally, println line simply prints the line read by the Reader instance. Here Groovy shows us that it's OK to omit parentheses around arguments to method calls, and also that println is available directly, with no need to spell out System.out.println.

Save this code block as ex01.groovy in the same directory as the data and execute it from the terminal command line with:

groovy ex01.groovy

What do you see?

At this point, it's worth noting that Groovy also quietly did away with the imports and public class definitions that need to happen in a Java program that might carry out the same task.

Dealing with fields

So far, our Groovy script has dealt with line delimiters, but has yet to split the lines into fields. A quick examination of the file Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv will show that it is the most complex kind of CSV file—it uses commas as the field separator and quotes fields that can contain field or line separators.

Look at the third line, for Angola; in the fourth field, the phrase "Based on IMF data, national accounts" appears, with a comma inside the field. And in the ninth line, for Argentina, the same field contains not only commas but also carriage-return/line-feed pairs. Finally, on line 199, that field contains a double-quote character, which is shown as two successive double quotes: a "quoted quote," which is not to be confused with two successive double quotes as the only content of a field, implying an empty field. Ugly!
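To make the problem concrete, here is a small sketch of what goes wrong when a quoted record is split naively on commas. The record is a made-up placeholder in the style of the metadata file, not a line copied from it; only the phrase in the last field comes from the file as described above.

```groovy
// A made-up four-field record; the last field contains a comma inside the quotes.
String record = '"XYZ","Some region","Some income group","Based on IMF data, national accounts"'

// Splitting on every comma ignores the quoting entirely.
String[] naive = record.split(',')

println "Naive split yields ${naive.length} pieces"
naive.each { piece -> println "  [" + piece + "]" }
```

The record has four fields, but the naive split produces five pieces, cutting the last field in two and leaving the quote characters in place. Embedded line breaks and doubled quotes only make matters worse, which is why a CSV-aware parser is worth having.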

In AWK, dealing with this kind of messy stuff is less than pleasant; however, in Groovy, we can make use of a fine Java library called opencsv. Download the .jar file from SourceForge. Put that .jar file in Groovy's default lookup path—in your home directory, in the .groovy/lib subdirectory.

At this point, the first program can become field-aware:


import com.opencsv.CSVReader

String mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"

new File(mdCountryCSV).withReader { reader ->
    CSVReader csvReader = new CSVReader(reader)
    csvReader.each { fields ->
        println fields
    }
}

Save this as ex02.groovy in the same directory and run it with the groovy command.

What's new here?

First, we have to import the CSVReader capability. Then we create a CSVReader instance from the reader handed to us by withReader. Finally, we print the fields yielded by the CSVReader instance. Here, we use the each method that Groovy puts on every object and the line-splitting that opencsv provides in order to process the lines in the file. csvReader.each { fields -> gives us each line split into fields—that is, an array of Strings. We can refer to the first field as fields[0], the second as fields[1] and so on.

Given that the first line of this kind of CSV file provides the field names, we can adjust the above code to let us refer to the fields by name, as follows:

import com.opencsv.CSVReader

String mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"

new File(mdCountryCSV).withReader { reader ->
    CSVReader csvReader = new CSVReader(reader)
    String[] csvFieldNames = csvReader.readNext()
    HashMap fieldValuesByName = new HashMap()
    csvReader.each { fieldValuesByNumber ->
        csvFieldNames.eachWithIndex { name, number ->
            fieldValuesByName[name] = fieldValuesByNumber[number]
        }
        println "fieldValuesByName[\"Country Code\"] = " +
            fieldValuesByName["Country Code"] +
            " fieldValuesByName[\"IncomeGroup\"] = " +
            fieldValuesByName["IncomeGroup"]
    }
}

Save this as ex03.groovy in the same directory and run it.

In the above code, we call readNext on the CSVReader instance right away to get the first record, and save the field names in a String array. Then, every time we read a record, we execute the eachWithIndex method on the field names to copy the values from the array of fields delivered by csvReader.each() into a map where the key is the field name and the value comes from the corresponding field on the record.

The println statement shows us accessing field values by name, for example, fieldValuesByName["Country Code"].
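For readers who want to try the name-to-value mapping without opencsv or the downloaded data, here is a minimal sketch of the same eachWithIndex technique on in-memory arrays. The field names and records are hypothetical stand-ins for what the CSVReader would deliver, and the rows list is an addition for illustration: it creates a fresh map per record so the records can be kept, whereas the script above reuses a single map.

```groovy
// Hypothetical in-memory stand-ins for the header and records from the CSV file
String[] csvFieldNames = ["Country Code", "Region", "IncomeGroup"]
List<String[]> records = [
    ["AAA", "Region one", "High income"] as String[],
    ["BBB", "Region two", "Low income"] as String[]
]

List<Map> rows = []
records.each { fieldValuesByNumber ->
    Map fieldValuesByName = [:]         // a fresh map for each record
    csvFieldNames.eachWithIndex { name, number ->
        fieldValuesByName[name] = fieldValuesByNumber[number]
    }
    rows << fieldValuesByName           // keep the record for later use
}

rows.each { row ->
    println "${row['Country Code']}: ${row['IncomeGroup']}"
}
```

Keeping one map per record costs a little memory, but it means the values from earlier records are not overwritten as later ones are read.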

Where to next?

That's probably plenty of Groovy to get started. Here are good references to enhance the experience:

Quite a number of Groovy books with good introductions to the language are also available.

The next installment in this Groovy series will take the themes already introduced further: making the last example groovier, reading multiple CSV files, linking their data, and writing out a composite/summary.


Chris Hermansen (Temuco, Chile)
Seldom without a computer of some sort since graduating from the University of British Columbia in 1978, I have been a full-time Linux user since 2005, a full-time Solaris and SunOS user from 1986 through 2005, and UNIX System V user before that.

2 Comments

Hi, thanks for the post,

I too work with CSV files quite often. I wrote a small library to help me with such tasks.

https://github.com/kayr/fuzzy-csv

example:
import static fuzzycsv.FuzzyCSVTable.parseCsv

parseCsv(('Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv' as File).text)
.select('Country Code', 'IncomeGroup')
.printTable()

Ronald, thank you very much for the super-informative comment! I look forward to trying your library, good show!


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.