Managing data with Groovy: Lookups and accumulators

Image by Florida Memory. Modified by Opensource.com. CC BY-SA 2.0.

In my first article on getting started with the Groovy programming language, I left off with an example of reading a CSV file in Groovy. In this article, I'm going to move to a more idiomatic Groovy style (make it groovier, as some would say), cover the use of Groovy maps as lookup tables, and finish up by using maps to calculate some results.

First things first—here is the final example from the last article, in more idiomatic Groovy:

import com.opencsv.CSVReader

def mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"

new File(mdCountryCSV).withReader { reader ->
    def csvReader = new CSVReader(reader)
    def fNam = csvReader.readNext()
    csvReader.each { fVal ->
        def valByNam = [fNam,fVal].transpose().collectEntries()
        println valByNam
    }
}

Save this code snippet as ex04.groovy in the same directory as the first three examples and the data you downloaded from the World Bank, then run it. What do you see?

Some explanations are in order.

First, just to make the examples more readable, I've shortened my long variable names.

Second, Groovy takes a different view toward type checking than Java. As Burt Beckwith describes in his excellent book on Grails (and Groovy), "Programming Grails," Groovy supports optional typing, allowing the user to control typing or leave it up to the language to figure out at run-time by using the def keyword to declare variables. I will add to Burt's excellent exposition that old adage, "with great power comes great responsibility." Leaving type decisions up to the language can introduce difficult-to-find bugs, and in any case, puts off until run-time the detection of any typing errors. But this kind of language design decision also means Groovy can provide not only more concise code (less to read can mean less to debug), but also some interesting dynamic capabilities related to materializing behavior at run-time.
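To make the trade-off concrete, here is a minimal sketch (the variable names are mine, not from the earlier examples). It assumes ordinary dynamic Groovy, with no static type-checking annotations in effect:

```groovy
// Dynamically typed: with def, the run-time type follows the assigned value.
def flexible = 42
assert flexible instanceof Integer
flexible = "forty-two"          // the same variable can later hold a String
assert flexible instanceof String

// Explicitly typed: a mismatched assignment is rejected, but only at run-time.
int strict = 42
def caught = false
try {
    strict = "forty-two"        // compiles, then fails when this line executes
} catch (ClassCastException e) {
    caught = true
}
assert caught
assert strict == 42             // the failed assignment left the value untouched
```

The second half shows the "puts off until run-time" part: the bad assignment is not reported until the script actually reaches that line.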

Third, the already-concise code used to copy the array of column names and corresponding array of column values provided by csvReader for each line of the file has been replaced with some of Groovy's nice methods for handling collections and maps:

[fNam, fVal] creates a new list composed of two elements: the array of field names and the array of field values on the current line
.transpose() converts this into a new list where each element is a pair of [name,value] for each field
.collectEntries() turns the list of pairs of [name,value] into a map where the keys are the field names and the values are the field values
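The three steps can be tried in isolation. In this sketch, plain lists stand in for the arrays that csvReader returns, and the sample row is only illustrative:

```groovy
// Illustrative header and data row (stand-ins for what csvReader returns).
def fNam = ["Country Code", "Region", "IncomeGroup"]
def fVal = ["ABW", "Latin America & Caribbean", "High income"]

// Step 1 and 2: pair each field name with the value in the same position.
def pairs = [fNam, fVal].transpose()
assert pairs == [["Country Code", "ABW"],
                 ["Region", "Latin America & Caribbean"],
                 ["IncomeGroup", "High income"]]

// Step 3: turn the [name, value] pairs into a map keyed by field name.
def valByNam = pairs.collectEntries()
assert valByNam == ["Country Code": "ABW",
                    Region: "Latin America & Caribbean",
                    IncomeGroup: "High income"]
```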

Finally, the line println valByNam just dumps the map generated by the above.

Ok, enough explaining. The above code is only marginally interesting because, really, who needs to re-format the contents of the country metadata as maps / key-value pairs? Let's do something with that data!

Creating lookup tables

One thing we see in the country metadata is that the World Bank has a system for classifying countries according to their income. This column is called IncomeGroup (no space) in the file. As an example of linking two separate files together, let's:

Create a lookup table of country code versus income group from the country metadata — we'll call this iGLU;
Use that lookup table to calculate population growth rates by income group from the population data file.
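Before working with the real files, here is a rough sketch of those two steps on in-memory data. The rows, the numbers, and the variable names are all made up for illustration; they are not World Bank figures:

```groovy
// Hypothetical stand-ins for the two CSV files.
def metadataRows = [["ABW", "High income"], ["AFG", "Low income"], ["AND", "High income"]]
def populationRows = [["ABW", 100L, 110L], ["AFG", 200L, 210L],
                      ["AND", 50L, 55L], ["ZZZ", 1L, 2L]]   // ZZZ has no metadata

// Step 1: build the lookup table, country code -> income group.
def lookup = metadataRows.collectEntries { [(it[0]): it[1]] }

// Step 2: accumulate start-year and end-year population per income group.
def startPop = [:]
def endPop = [:]
populationRows.each { row ->
    def (code, p1, p2) = row
    def group = lookup[code]
    if (group) {                                  // skip codes missing from the lookup
        startPop[group] = (startPop[group] ?: 0L) + p1
        endPop[group] = (endPop[group] ?: 0L) + p2
    }
}

assert startPop == ["High income": 150L, "Low income": 200L]
assert endPop == ["High income": 165L, "Low income": 210L]
```

The rest of the article builds the same thing from the actual CSV files.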

To do this, we first need to declare a lookup table and populate it. We'll repurpose the above code:

import com.opencsv.CSVReader

def iGLU = [:] // income group lookup table

def mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"

new File(mdCountryCSV).withReader { reader ->
    def csvReader = new CSVReader(reader)
    def fNam = csvReader.readNext()
    csvReader.each { fVal ->
        def valByNam = [fNam,fVal].transpose().collectEntries()
        iGLU[valByNam."Country Code"] = valByNam.IncomeGroup
    }
}
println iGLU

Save this code as ex05.groovy and run it.

The notation [:] creates an empty map (by default a java.util.LinkedHashMap, which remembers insertion order). We could be more specific, for example by declaring a hash-table-based map with String keys and String values. If you're not too familiar with maps and hash tables, the Groovy documentation, section 2.2, provides more details.
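A small sketch of the difference, using variable names of my own choosing:

```groovy
// [:] gives an empty, untyped map.
def untyped = [:]
assert untyped instanceof LinkedHashMap
assert untyped.isEmpty()

// A more specific declaration: a hash-table-based map from String to String.
Map<String, String> typed = new HashMap<String, String>()
typed["ABW"] = "High income"
assert typed["ABW"] == "High income"
assert typed["XXX"] == null      // looking up a missing key yields null
```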

Note also the use of dot notation to access the map. Quotes around IncomeGroup are not necessary; they are needed around Country Code only because of its embedded blank. However, if what is to the right of the dot is a variable rather than a constant, we must either surround it with parentheses or go back to the square brackets and leave out the dot.
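The access styles side by side, on an illustrative one-row map (the row and the variable named key are my own):

```groovy
def row = ["Country Code": "ABW", IncomeGroup: "High income"]

// Constant keys: dot notation, quoted only when the key contains a blank.
assert row.IncomeGroup == "High income"
assert row."Country Code" == "ABW"

// Key held in a variable: square brackets, or an interpolated string after the dot.
def key = "IncomeGroup"
assert row[key] == "High income"
assert row."$key" == "High income"

// Pitfall: a bare name after the dot is taken literally, not as the variable.
assert row.key == null
```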

Creating and initializing accumulators

To calculate the population growth by income group, we're going to have to create accumulators. Since we're using the classifications for income group supplied in the country metadata, it makes sense if our accumulators are defined as maps. Also, we'll need two: one for population in the start year, one for population in the end year. Finally, we need to decide whether to initialize these accumulators at the start or every time a new income group is encountered while reading the population data. For purposes of this exercise, we're going to initialize first:

def iGSet = iGLU.values() as Set

def pop1 = [:] // population by income group in start year
def pop2 = [:] // population by income group in end year
iGSet.each { ig ->
    pop1[ig] = 0l
    pop2[ig] = 0l
}

Append this to the end of ex05.groovy (you can delete the println).

The first line of code gets all the values from the iGLU income group lookup map and converts them to a Set, which is a kind of collection where each element occurs at most once. We'll use this iGSet to iterate over the unique values of income group as defined in the country metadata.
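For instance, with a hypothetical three-country lookup table (named lu here to avoid clashing with iGLU), the duplicate income group collapses to a single element:

```groovy
// Hypothetical lookup table; two countries share an income group.
def lu = [ABW: "High income", AFG: "Low income", AND: "High income"]

def groups = lu.values() as Set
assert lu.values().size() == 3                      // values() keeps duplicates
assert groups.size() == 2                           // the Set does not
assert groups == ["High income", "Low income"] as Set
```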

Then we define our two accumulators, pop1, which we use to accumulate totals by income group in the start year, and pop2, which we use for the end year.

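The next four lines initialize pop1 and pop2 to zero as long values (that's what the l suffix in 0l means; it can also be written 0L, which is harder to mistake for the digit 1). This is a good moment to note how Groovy types numeric literals: an unqualified integer literal is an Integer, a Long, or a BigInteger, whichever is the smallest of the three that can hold the value, while an unqualified decimal literal is a BigDecimal. I tend to avoid the software-implemented types (BigInteger and BigDecimal), so I use the suffixes to state the type I want.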
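These typing rules are easy to check for yourself; each line of this short sketch is an assertion that passes:

```groovy
// Integer literals take the smallest of Integer, Long, BigInteger that fits.
assert 0 instanceof Integer
assert 3000000000 instanceof Long                    // too big for 32 bits
assert 12345678901234567890 instanceof BigInteger    // too big for 64 bits

// A suffix states the type explicitly.
assert 0l instanceof Long
assert 0L instanceof Long
assert 1.5d instanceof Double

// Decimal literals without a suffix are BigDecimal.
assert 1.5 instanceof BigDecimal

// Adding an Integer to a Long keeps the result a Long.
assert (0L + 1) instanceof Long
```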

Processing another file using lookups and accumulating

At this point, we can read the population data and accumulate it:

def populationCSV = "API_SP.POP.TOTL_DS2_en_csv_v2.csv"

new File(populationCSV).withReader { reader ->
    def csvReader = new CSVReader(reader)
    def fNam
    (1..5).each { fNam = csvReader.readNext() }
    csvReader.each { fVal ->
        def valByNam = [fNam,fVal].transpose().collectEntries()
        def country = valByNam."Country Code"
        if (country && iGLU.containsKey(country)) {
            pop1[iGLU[country]] += Long.parseLong(valByNam."2014" ?: "0")
            pop2[iGLU[country]] += Long.parseLong(valByNam."2015" ?: "0")
        }
    }
}

Append this to the end of ex05.groovy.
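As an aside (this is not part of ex05.groovy), the other option mentioned earlier, initializing an accumulator the first time its income group is encountered, could be sketched with Groovy's Map.withDefault. The increments below are made up, and the variable name pop is mine:

```groovy
// withDefault supplies (and stores) 0L the first time a missing key is read.
def pop = [:].withDefault { 0L }

pop["High income"] += 100L      // first use: starts from the default 0L
pop["High income"] += 50L
pop["Low income"] += 25L

assert pop["High income"] == 150L
assert pop["Low income"] == 25L
assert pop.size() == 2
```

One thing to keep in mind with this approach: merely reading a missing key adds it to the map, so a typo in a key silently creates a new zero-valued entry.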

Reading this file works much like reading the metadata file (it's all just CSV, after all), but the population CSV is not well formed: it has four title lines before the column heading line. Therefore, defining the list of column names is more complex. The code (1..5) uses the Groovy range to generate a list of 5 elements 1, 2, 3, 4, 5, and the each executes the closure once for each element. This is equivalent in effect to a C or Java (or Groovy!) for (int i = 0; i < 5; i++) {...} but doesn't require us to define a spurious variable. Through this code, fNam is eventually set to the fifth line, that is, the column headers.
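The skipping idiom can be tried without the real file. This sketch reads placeholder text through a plain BufferedReader rather than CSVReader, so it has no dependencies; the title lines and the data row are invented:

```groovy
// Placeholder text: four title lines, then the header, then one data row.
def text = """title line 1
title line 2
title line 3
title line 4
Country Name,Country Code
Aruba,ABW
"""

def reader = new BufferedReader(new StringReader(text))
def header
// Each pass overwrites header, so after five reads it holds the fifth line.
(1..5).each { header = reader.readLine() }

assert header == "Country Name,Country Code"
assert reader.readLine() == "Aruba,ABW"     // the reader is now at the first data row
```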

The if statement first checks that country is non-null and non-empty and then makes sure it's found among the keys in the income group lookup table. This would be a good moment to take a break and study some Groovy semantics, especially the meaning of truth in Groovy (§5 on that page). Basically, non-null, non-empty, non-zero values are true.
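A few assertions illustrate these rules, followed by the same kind of guard on a hypothetical one-entry lookup table (lu and the usable closure are my own names):

```groovy
// Values that count as false.
assert !null
assert !""
assert !0
assert ![]
assert ![:]

// Values that count as true.
assert "ABW"
assert 1
assert " "      // a whitespace-only string is non-empty, hence true

// The guard: a non-empty code that is also a key in the lookup table.
def lu = [ABW: "High income"]
def usable = { country -> (country && lu.containsKey(country)) as boolean }
assert usable("ABW")
assert !usable("")
assert !usable(null)
assert !usable("ZZZ")
```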

Finally, we're using Long values at this point because there are too many people to be counted in a 32-bit integer.

Summarizing results

Now that we have the data accumulated, the one remaining step is to write it out:

iGSet.each { ig ->
    printf "group %s growth rate %.2f %%\n",
        (ig ?: "Unspecified"),100d * (pop2[ig] - pop1[ig]) / pop1[ig]
}

Append this to the end of ex05.groovy. At this point, you have a full program and can run it, producing the following output:

group High income growth rate 0.56 %
group Low income growth rate 2.73 %
group Upper middle income growth rate 0.78 %
group Unspecified growth rate 1.27 %
group Lower middle income growth rate 1.46 %

Note that (ig ?: "Unspecified") above uses the Groovy Elvis operator, which is a short form of (ig ? ig : "Unspecified"), and a great example of the DRY principle (Don't Repeat Yourself). We're using this construct so that every income group has a text string printed, even if one wasn't specified in the original data.
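Both pieces of the output line can be checked on their own. In this sketch the closure names and the population figures are invented for illustration:

```groovy
// Elvis operator: fall back to a label when the group is empty or null.
def label = { ig -> ig ?: "Unspecified" }
assert label("High income") == "High income"
assert label("") == "Unspecified"
assert label(null) == "Unspecified"

// Growth rate in percent; the 100d forces floating-point arithmetic.
def growthPct = { long startPop, long endPop -> 100d * (endPop - startPop) / startPop }
assert growthPct(1000L, 1010L) == 1.0d

// Same format string as the program; Locale.ROOT fixes the decimal separator.
def line = String.format(Locale.ROOT, "group %s growth rate %.2f %%",
        label(""), growthPct(1000L, 1010L))
assert line == "group Unspecified growth rate 1.00 %"
```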

As a data analyst, my purpose is not to interpret these results; however, the overall process in which I take some data (in this case, publicly available) and mine it for relationships is precisely what a lot of my work is about. With this example, you can see why Groovy's concise and powerful collection and map abstractions, coupled with all the Java libraries out there, make it an indispensable tool for me.

Where to Next?

In the next installment, we will take a look at other structured data sources besides CSV files.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.