Don't like loops? Try Java Streams

It's 2020 and time to learn about Java Streams.

Person drinking a hat drink at the computer

Image by:

In this article, I will explain how to not write loops anymore.

What? Whaddaya mean, no more loops?

Yep, that's my 2020 resolution—no more loops in Java. Understand that it's not that loops have failed me, nor have they led me astray (well, at least, I can argue that point). Really, it is that I, a Java programmer of modest abilities since 1997 or so, must finally learn about all this new Streams stuff, saying "what" I want to do and not "how" I want to do it, maybe being able to parallelize some of my computations, and all that other good stuff.

I'm guessing that there are other Java programmers out there who also have been programming in Java for a decent amount of time and are in the same boat. Therefore, I'm offering my experiences as a guide to "how to not write loops in Java anymore."

Find a problem worth solving

If you're like me, then the first show-stopper you run into is "right, cool stuff, but what am I solving for, and how do I apply this?" I realized that I can spot the perfect opportunity camouflaged as Something I've Done Before.

In my case, it's sampling land cover within a specific area and coming up with an estimate and a confidence interval around that estimate for the land cover across the whole area. The specific problem involves deciding whether an area is "forested" or not, given a specific legal definition: if at least 10% of the soil is covered over by tree crowns, then the area is considered to be forested; otherwise, it's something else.

It's a pretty esoteric example of a recurring problem; I'll grant you. But there it is. For the ecologists and foresters out there who are accustomed to cool temperate or tropical forests, 10% might sound kind of low, but in the case of dry areas with low-growing shrubs and trees, that's a reasonable number.

So the basic idea is: use images to stratify the area (i.e., areas completely devoid of trees, areas of predominantly small trees spaced quite far apart, areas of predominantly small trees spaced closer together, areas of somewhat larger trees), locate some samples in those strata, send the crew out to measure the samples, analyze the results, and calculate the proportion of soil covered by tree crowns across the area. Simple, right?

What the field data looks like

In the current project, the samples are rectangular areas 20 meters wide by 25 meters long, so 500 square meters each. On each patch, the field crew measured each tree: its species, its height, the maximum and minimum width of its crown, and the diameter of its trunk at trunk height (nominally 30cm above the ground). This information was collected, entered into a spreadsheet, and exported to a bar separated value (BSV) file for me to analyze. It looks like this:

Stratum#	Sample#	Tree#	Species	Trunk diameter (cm)	Crown diameter 1 (m)	Crown diameter 2 (m)	Height (m)
1	1	1	Ac	6	3.6	4.6	2.4
1	1	2	Ac	6	2.2	2.3	2.5
1	1	3	Ac	16	2.5	1.7	2.4
1	1	4	Ac	6	1.5	2.1	1.8
1	1	5	Ac	5	0.9	1.7	1.7
1	1	6	Ac	6	1.7	1.3	1.6
1	1	7	Ac	5	1.82	1.32	1.8
1	1	1	Ac	1	0.3	0.25	0.9
1	1	2	Ac	2	1.2	1.2	1.7

The first column is the stratum number (where 1 is "predominantly small trees spaced quite far apart," 2 is "predominantly small trees spaced closer together," and 3 is "somewhat larger trees"; we didn't sample the areas "completely devoid of trees"). The second column is the sample number (there are 73 samples altogether, located in the three strata in proportion to the area of each stratum). The third column is the tree number within the sample. The fourth is the two-letter species code, the fifth the trunk diameter (in this case, 10cm above ground or exposed roots), the sixth the smallest distance across the crown, the seventh the largest distance, and the eighth the height of the tree.

For the purposes of this exercise, I'm only concerned with the total amount of ground covered by the tree crowns—not the species, nor the height, nor the diameter of the trunk.

In addition to the measurement information above, I also have the areas of the three strata, also in a BSV:

stratum	hectares
1	114.89
2	207.72
3	29.77

What I want to do (not how I want to do it)

In keeping with one of the main design goals of Java Streams, here is "what" I want to do:

Read the stratum area BSV and save the data as a lookup table.
Read the measurements from the measurement BSV file.
Accumulate each measurement (tree) to calculate the total area of the sample covered by tree crowns.
Accumulate the sample tree crown area values and count the number of samples to estimate the mean tree crown area coverage and standard error of the mean for each stratum.
Summarize the stratum figures.
Weigh the stratum means and standard errors by the stratum areas (looked up from the table created in step 1) and accumulate them to estimate the mean tree crown area coverage and standard error of the mean for the total area.
Summarize the weighted figures.

Generally speaking, the way to define "what" with Java Streams is by creating a stream processing pipeline of function calls that pass over the data. So, yes, there is actually a bit of "how" that ends up creeping in… in fact, quite a bit of "how." But, it needs a very different knowledge base than the good, old fashioned loop.

I'll go through each of these steps in detail.

Build the stratum area table

The first job is to convert the stratum areas BSV file to a lookup table:

String fileName = "stratum_areas.bsv";
Stream<String> inputLineStream = Files.lines(Paths.get(fileName));  // (1)

final Map<Integer,Double> stratumAreas =   // (2)
    inputLineStream  	// (3)
        .skip(1)                   // (4)
        .map(l -> l.split("\\|"))  // (5)
        .collect(                  // (6)
            Collectors.toMap(      // (7)
                a -> Integer.parseInt(a[0]),  // (8)
                a -> Double.parseDouble(a[1]) // (9)
            )
        );
inputLineStream.close();   // (10)

System.out.println("stratumAreas = " + stratumAreas);  // (11)

I'll take this a line or two at a time, where the numbers in comments following the lines above—e.g., // (3)— correspond to the numbers below:

java.nio.Files.lines() gives a stream of strings corresponding to lines in the file.
The goal is to create the lookup table, stratumAreas, which is a Map<Integer,Double>. Therefore, I can get the double value area for stratum 2 as stratumAreas.get(2).
This is the beginning of the stream "pipeline."
Skip the first line in the pipeline since it's the header line containing the column names.
Use map() to split the String input line into an array of String fields, with the first field being the stratum # and the second being the stratum area.
Use collect() to materialize the results.
The materialized result will be produced as a sequence of Map entries.
The key of each map entry is the first element of the array in the pipeline—the int stratum number. By the way, this is a Java lambda expression—an anonymous function that takes an argument and returns that argument converted to an int.
The value of each map entry is the second element of the array in the pipeline—the double stratum area.
Don't forget to close the stream (file).

Print out the result, which looks like:

stratumAreas = {1=114.89, 2=207.72, 3=29.77}

Build the measurements table and accumulate the measurements into the sample totals

Now that I have the stratum areas, I can start processing the main body of data—the measurements. I combine the two tasks of building the measurements table and accumulating the measurements into the sample totals since I don't have any interest in the measurement data per se.

fileName = "sample_data_for_testing.bsv";
inputLineStream = Files.lines(Paths.get(fileName));

 final Map<Integer,Map<Integer,Double>> sampleValues =
    inputLineStream
        .skip(1)
        .map(l -> l.split("\\|"))
        .collect(                  // (1)
            Collectors.groupingBy(a -> Integer.parseInt(a[0]),     // (2)
                Collectors.groupingBy(b -> Integer.parseInt(b[1]), // (3)
                    Collectors.summingDouble(                      // (4)
                        c -> {                                     // (5)
                            double rm = (Double.parseDouble(c[5]) +
                                Double.parseDouble(c[6]))/4d;      // (6)
                            return rm*rm * Math.PI / 500d;         // (7)
                        })
                )
            )
        );
inputLineStream.close();

System.out.println("sampleValues = " + sampleValues);  // (8)

Again, a line or two or so at a time:

The first seven lines are the same in this task and the previous, except the name of this lookup table is sampleValues; and it is a Map of Maps.
The measurement data is grouped into samples (by sample #), which are, in turn, grouped into strata (by stratum #), so I use Collectors.groupingBy() at the topmost level to separate data into strata, with a[0] here being the stratum number.
I use Collectors.groupingBy() once more to separate data into samples, with b[1] here being the sample number.
I use the handy Collectors.summingDouble() to accumulate the data for each measurement within the sample within the stratum.
Again, a Java lambda or anonymous function whose argument c is the array of fields, where this lambda has several lines of code that are surrounded by { and } with a return statement just before the }.
Calculate the mean crown radius of the measurement.
Calculate the crown area of the measurement as a proportion of the total sample area and return that value as the result of the lambda.

Again, similar to the previous task. The result looks like (with some numbers elided):

sampleValues = {1={1=0.09083231861452731, 66=0.06088002082602869, ... 28=0.0837823490804228}, 2={65=0.14738326403381743, 2=0.16961183847374103, ... 63=0.25083064794883453}, 3={64=0.3306323635177101, 32=0.25911911184680053, ... 30=0.2642668470291564}}

This output shows the Map of Maps structure clearly—there are three entries in the top level corresponding to the strata 1, 2, and 3, and each stratum has subentries corresponding to the proportional area of the sample covered by tree crowns.

Accumulate the sample totals into the stratum means and standard errors

At this point, the task becomes more complex; I need to count the number of samples, sum up the sample values in preparation for calculating the sample mean, and sum up the squares of the sample values in preparation for calculating the standard error of the mean. I may as well incorporate the stratum area into this grouping of data as well, as I'll need it shortly to weigh the stratum results together.

So the first thing to do is create a class, StratumAccumulator, to handle the accumulation and provide the calculation of the interesting results. This class implements java.util.function.DoubleConsumer, which can be passed to collect() to handle accumulation:

class StratumAccumulator implements DoubleConsumer {
    private double ha;
    private int n;
    private double sum;
    private double ssq;
    public StratumAccumulator(double ha) { // (1)
        this.ha = ha;
        this.n = 0;
        this.sum = 0d;
        this.ssq = 0d;
    }
    public void accept(double d) { // (2)
        this.sum += d;
        this.ssq += d*d;
        this.n++;
    }
    public void combine(StratumAccumulator other) { // (3)
        this.sum += other.sum;
        this.ssq += other.ssq;
        this.n += other.n;
    }
    public double getHa() {  // (4)
        return this.ha;
    }
    public int getN() {  // (5)
        return this.n;
    }
    public double getMean() {  // (6)
        return this.n > 0 ? this.sum / this.n : 0d;
    }
    public double getStandardError() {  // (7)
        double mean = this.getMean();
        double variance = this.n > 1 ? (this.ssq - mean*mean*n)/(this.n - 1) : 0d;
        return this.n > 0 ? Math.sqrt(variance/this.n) : 0d;
    }
}

Line-by-line:

The constructor StratumAccumulator(double ha) takes an argument, the area of the stratum in hectares, which allows me to merge the stratum area lookup table into instances of this class.
The accept(double d) method is used to accumulate the stream of double values, and I use it to:

a. Count the number of values.

b. Sum the values in preparation for computing the sample mean.

c. Sum the squares of the values in preparation for computing the standard error of the mean.
The combine() method is used to merge substreams of StratumAccumulators (in case I want to process in parallel).
The getter for the area of the stratum
The getter for the number of samples in the stratum
The getter for the mean sample value in the stratum
The getter for the standard error of the mean in the stratum

Once I have this accumulator, I can use it to accumulate the sample values pertaining to each stratum:

final Map<Integer,StratumAccumulator> stratumValues =   // (1)
    sampleValues.entrySet().stream()   // (2)
        .collect(                      // (3)
            Collectors.toMap(          // (4)
                e -> e.getKey(),       // (5)
                e -> e.getValue().entrySet().stream()   // (6)
                    .map(Map.Entry::getValue)           // (7)
                    .collect(          // (8)
                        () -> new StratumAccumulator(stratumAreas.get(e.getKey())),   // (9)
                        StratumAccumulator::accept,     // (10)
                        StratumAccumulator::combine)    // (11)
            )
        );

Line-by-line:

This time, I'm using the pipeline to build stratumValues, which is a Map<Integer,StratumAccumulator>, so stratumValues.get(3) will return the StratumAccumulator instance for stratum 3.
Here, I'm using the entrySet().stream() method provided by Map to get a stream of (key, value) pairs; recall these are Maps of sample values by stratum.
Again, I'm using collect() to gather the pipeline results by stratum…
using Collectors.toMap() to generate a stream of Map entries…
whose keys are the key of the incoming stream (that is, the stratum #)…
and whose values are the Map of sample values, and I again use entrySet().stream() to convert to a stream of Map entries, one for each sample.
Using map() to get the value of the sample Map entry; I'm not interested in the key by this point.
Yet again, using collect() to accumulate the sample results into the StratumAccumulator instances.
Telling collect() how to create a new StratumAccumulator—I need to pass the stratum area into the constructor here, so I can't just use StratumAccumulator::new.
Telling collect() to use the accept() method of StratumAccumulator to accumulate the stream of sample values.
Telling collect() to use the combine() method of StratumAccumulator to merge StratumAccumulator instances.

Summarize the stratum figures

Whew! After all of that, printing out the stratum figures is pretty straightforward:

stratumValues.entrySet().stream()
    .forEach(e -> {
        StratumAccumulator sa = e.getValue();
        int n = sa.getN();
        double se66 = sa.getStandardError();
        double t = new TDistribution(n - 1).inverseCumulativeProbability(0.975d);
        System.out.printf("stratum %d n %d mean %g se66 %g t %g se95 %g ha %g\n",
            e.getKey(), n, sa.getMean(), se66, t, se66 * t, sa.getHa());
    });

In the above, once again, I use entrySet().stream() to transform the stratumValues Map to a stream, and then apply the forEach() method to the stream. ForEach() is pretty much what it sounds like—a loop! But the business of finding the head of the stream, finding the next element, and checking to see if hits the end is all handled by Java Streams. So, I just get to say what I want to do for each record, which is basically to print it out.

My code looks a bit more complicated because I declare some local variables to hold some intermediate results that I use more than once—n, the number of samples, and se66, the standard error of the mean. I also calculate the inverse T value to convert my standard error of the mean to a 95% confidence interval.

The result looks like this:

stratum 1 n 24 mean 0.0903355 se66 0.0107786 t 2.06866 se95 0.0222973 ha 114.890
stratum 2 n 38 mean 0.154612 se66 0.00880498 t 2.02619 se95 0.0178406 ha 207.720
stratum 3 n 11 mean 0.223634 se66 0.0261662 t 2.22814 se95 0.0583020 ha 29.7700

Accumulate the stratum means and standard errors into the total

Once again, the task becomes more complex, so I create a class, TotalAccumulator, to handle the accumulation and provide the calculation of the interesting results. This class implements java.util.function.Consumer<T>, which can be passed to collect() to handle accumulation:

class TotalAccumulator implements Consumer<StratumAccumulator> {
    private double ha;
    private int n;
    private double sumWtdMeans;
    private double ssqWtdStandardErrors;
    public TotalAccumulator() {
        this.ha = 0d;
        this.n = 0;
        this.sumWtdMeans = 0d;
        this.ssqWtdStandardErrors = 0d;
    }
    public void accept(StratumAccumulator sa) {
        double saha = sa.getHa();
        double sase = sa.getStandardError();
        this.ha += saha;
        this.n += sa.getN();
        this.sumWtdMeans += saha * sa.getMean();
        this.ssqWtdStandardErrors += saha * saha * sase * sase;
    }
    public void combine(TotalAccumulator other) {
        this.ha += other.ha;
        this.n += other.n;
        this.sumWtdMeans += other.sumWtdMeans;
        this.ssqWtdStandardErrors += other.ssqWtdStandardErrors;
    }
    public double getHa() {
        return this.ha;
    }
    public int getN() {
        return this.n;
    }
    public double getMean() {
        return this.ha > 0 ? this.sumWtdMeans / this.ha : 0d;
    }
    public double getStandardError() {
        return this.ha > 0 ? Math.sqrt(this.ssqWtdStandardErrors) / this.ha : 0;
    }
}

I'm not going to go into much detail on this, since it's structurally pretty similar to StratumAccumulator. Of main interest:

The constructor takes no arguments, which simplifies its use.
The accept() method accumulates instances of StratumAccumulator, not double values, hence the use of the Consumer<T> interface.
As for the calculations, they are assembling a weighted average of the StratumAccumulator instances, so they make use of the stratum areas, and the formulas might look a bit strange to anyone who's not used to stratified sampling.

As for actually carrying out the work, it's easy-peasy:

final TotalAccumulator totalValues =
    stratumValues.entrySet().stream()
        .map(Map.Entry::getValue)
        .collect(TotalAccumulator::new, TotalAccumulator::accept, TotalAccumulator::combine);

Same old stuff as before:

Use entrySet().stream() to convert the stratumValue Map entries to a stream.
Use map() to replace the Map entries with their values—the instances of StratumAccumulator.
Use collect() to apply the TotalAccumulator to the instances of StratumAccumulator.

Summarize the total figures

Getting the interesting bits out of the TotalAccumulator instance is also pretty straightforward:

int nT = totalValues.getN();
double se66T = totalValues.getStandardError();
double tT = new TDistribution(nT - stratumValues.size()).inverseCumulativeProbability(0.975d);
System.out.printf("total n %d mean %g se66 %g t %g se95 %g ha %g\n",
    nT, totalValues.getMean(), se66T, tT, se66T * tT, totalValues.getHa());

Similar to the StratumAccumulator, I just call the relevant getters to pick out the number of samples nT and the standard error se66T. I calculate the T value tT (using "n – 3" here since there are three strata), and then I print the result, which looks like this:

total n 73 mean 0.139487 se66 0.00664653 t 1.99444 se95 0.0132561 ha 352.380

In conclusion

Wow, that looks like a bit of a marathon. It feels like it, too. As is often the case, there is a great deal of information about how to use Java Streams, all illustrated with toy examples, which kind of help, but not really. I found that getting this to work with a real-world (albeit very simple) example was difficult.

Because I've been working in Groovy a lot lately, I kept finding myself wanting to accumulate into "maps of maps of maps" rather than creating accumulator classes, but I was never able to pull that off except in the case of totaling up the measurements in the sample. So, I worked with accumulator classes instead of maps of maps, and maps of accumulator classes instead of maps of maps of maps.

I don't feel like any kind of master of Java Streams at this point, but I do feel I have a pretty solid understanding of collect(), which is deeply important, along with various methods to reformat data structures into streams and to reformat stream elements themselves. So yeah, more to learn!

Speaking of collect(), in the examples I presented above, we can see moving from a very simple use of this fundamental method - using the Collectors.summingDouble() accumulation method - through defining an accumulator class that extends one of the pre-defined interfaces - in this case DoubleConsumer - to defining a full-blown accumulator of our own, used to accumulate the intermediate stratum class. I was tempted - sort of - to work backward and implement fully custom accumulators for the stratum and sample accumulators, but the point of this exercise was to learn more about Java Streams, not to become an expert in one single part of it all.

What's your experience with Java Streams? Done anything big and complicated yet? Please share it in the comments.