Machine learning is a powerful computing capability for predicting or forecasting things that conventional algorithms find challenging. The machine learning journey begins with collecting and preparing data, a lot of it, and then building mathematical models from that data. While multiple tools can be used for these tasks, I like to use the shell.
A shell is an interface for performing operations using a defined language. This language can be invoked interactively or scripted. The concept of the shell was introduced in Unix operating systems in the 1970s. Some of the most popular shells include Bash, tcsh, and Zsh. They are available for all operating systems, including Linux, macOS, and Windows, which gives them high portability. For this exercise, I'll use Bash.
This article is an introduction to using a shell for data collection and data preparation. Whether you are a data scientist looking for efficient tools or a shell expert looking at using your skills for machine learning, I hope you will find valuable information here.
The example problem in this article is creating a machine learning model to forecast temperatures for US states. It uses shell commands and scripts to do the following data collection and data preparation steps:
- Download data
- Extract the necessary fields
- Aggregate data
- Make time series
- Create the train, test, and validate data sets
You may be asking why you should do this with the shell when you can do all of it in a machine learning programming language such as Python. It is a fair question. If data processing is performed with an easy, friendly, and capable technology like the shell, a data scientist can focus on machine learning modeling rather than the details of a language.
Prerequisites
First, you need to have a shell interpreter installed. If you use Linux or macOS, it will already be installed, and you may already be familiar with it. If you use Windows, try MinGW or Cygwin.
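To check which shell you are using and which version is installed, you can run the following (the output varies by system):
$ echo $SHELL
/bin/bash
$ bash --version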
For more information, see:
- Bash tutorials here on opensource.com
- The shell scripting tutorial by Steve Parker
- The Bash Guide for Beginners by the Linux Documentation Project
- If you need help with a specific command, type <commandname> --help in the shell; for example: ls --help.
Get started
Now that your shell is set up, you can start preparing data for the machine learning temperature-prediction problem.
1. Download data
The data for this tutorial comes from the US National Oceanic and Atmospheric Administration (NOAA). You will train your model using the last 10 complete years of data. The data source is at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/, and the data is in .csv format and gzipped.
Download and unzip the data using a shell script. Use your favorite text editor to create a file named download.sh and paste in the code below. The comments in the code explain what the commands do:
#!/bin/sh
# This line is called a shebang (or hashbang). It identifies the interpreter used to run this file.
# In this case, the script is executed by the shell itself.
# If it is not specified, the program to execute the script must be named explicitly.
# With shebang: ./download.sh; Without shebang: sh ./download.sh
FROM_YEAR=2010
TO_YEAR=2019
year=$FROM_YEAR
# For all years, one by one, starting from FROM_YEAR=2010 up to TO_YEAR=2019
while [ $year -le $TO_YEAR ]
do
# show the year being downloaded now
echo $year
# Download
wget https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/${year}.csv.gz
# Unzip
gzip -d ${year}.csv.gz
# Move to next year by incrementing
year=$(($year+1))
done
Notes:
- If you are behind a proxy server, consult Mark Grennan's how-to, and use:
export http_proxy=http://username:password@proxyhost:port/
export https_proxy=https://username:password@proxyhost:port/
- Make sure all standard commands are already in your PATH (such as /bin or /usr/bin). If not, set your PATH.
- Wget is a utility for downloading files from web servers on the command line. If Wget is not installed on your system, download it.
- Make sure you have gzip, a utility used for compression and decompression.
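A quick way to confirm that both utilities are available is the command -v built-in (the paths in this output vary by system):
$ command -v wget
/usr/bin/wget
$ command -v gzip
/usr/bin/gzip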
Run this script to download, extract, and make 10 years' worth of data available as CSVs:
$ ./download.sh
2010
--2020-10-30 19:10:47-- https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2010.csv.gz
Resolving www1.ncdc.noaa.gov (www1.ncdc.noaa.gov)... 205.167.25.171, 205.167.25.172, 205.167.25.178, ...
Connecting to www1.ncdc.noaa.gov (www1.ncdc.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170466817 (163M) [application/gzip]
Saving to: '2010.csv.gz'
0K .......... .......... .......... .......... .......... 0% 69.4K 39m57s
50K .......... .......... .......... .......... .......... 0% 202K 26m49s
100K .......... .......... .......... .......... .......... 0% 1.08M 18m42s
...
The ls command lists the contents of a folder. Use ls 20*.csv to list all your files with names beginning with 20 and ending with .csv.
$ ls 20*.csv
2010.csv 2011.csv 2012.csv 2013.csv 2014.csv 2015.csv 2016.csv 2017.csv 2018.csv 2019.csv
2. Extract average temperatures
Extract the TAVG (average temperature) data from the CSVs for US regions:
extract_tavg_us.sh
#!/bin/sh
# For each file with name that starts with "20" and ens with ".csv"
for csv_file in `ls 20*.csv`
do
# Message that says file name $csv_file is extracted to file TAVG_US_$csv_file
# Example: 2010.csv extracted to TAVG_US_2010.csv
echo "$csv_file -> TAVG_US_$csv_file"
# grep "TAVG" $csv_file: Extract lines in file with text "TAVG"
# |: pipe
# grep "^US": From those extract lines that begin with text "US"
# > TAVG_US_$csv_file: Save extracted lines to file TAVG_US_$csv_file
grep "TAVG" $csv_file | grep "^US" > TAVG_US_$csv_file
done
Run the script:
$ ./extract_tavg_us.sh
2010.csv -> TAVG_US_2010.csv
...
2019.csv -> TAVG_US_2019.csv
It creates these files:
$ ls TAVG_US*.csv
TAVG_US_2010.csv TAVG_US_2011.csv TAVG_US_2012.csv TAVG_US_2013.csv
TAVG_US_2014.csv TAVG_US_2015.csv TAVG_US_2016.csv TAVG_US_2017.csv
TAVG_US_2018.csv TAVG_US_2019.csv
Here are the first few lines of TAVG_US_2010.csv:
$ head TAVG_US_2010.csv
USR0000AALC,20100101,TAVG,-220,,,U,
USR0000AALP,20100101,TAVG,-9,,,U,
USR0000ABAN,20100101,TAVG,12,,,U,
USR0000ABCA,20100101,TAVG,16,,,U,
USR0000ABCK,20100101,TAVG,-309,,,U,
USR0000ABER,20100101,TAVG,-81,,,U,
USR0000ABEV,20100101,TAVG,-360,,,U,
USR0000ABEN,20100101,TAVG,-224,,,U,
USR0000ABNS,20100101,TAVG,89,,,U,
USR0000ABLA,20100101,TAVG,59,,,U,
The head command is a utility for displaying the first several lines (by default, 10 lines) of a file.
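For example, to show only the first three lines, pass a line count (this assumes the extraction step above has been run):
$ head -n 3 TAVG_US_2010.csv
USR0000AALC,20100101,TAVG,-220,,,U,
USR0000AALP,20100101,TAVG,-9,,,U,
USR0000ABAN,20100101,TAVG,12,,,U,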
The data has more information than you need. Limit the number of columns by eliminating column 3 (since all the data is average temperature) and column 5 onward. In other words, keep columns 1 (climate station), 2 (date), and 4 (temperature recorded).
key_columns.sh
#!/bin/sh
# For each file with name that starts with "TAVG_US_" and ens with ".csv"
for csv_file in `ls TAVG_US_*.csv`
do
echo "Exractiing columns $csv_file"
# cat $csv_file: 'cat' is to con'cat'enate files - here used to show one year csv file
# |: pipe
# cut -d',' -f1,2,4: Cut columns 1,2,4 with , delimitor
# > $csv_file.cut: Save to temporary file
| > $csv_file.cut:
cat $csv_file | cut -d',' -f1,2,4 > $csv_file.cut
# mv $csv_file.cut $csv_file: Rename temporary file to original file
mv $csv_file.cut $csv_file
# The file is processed and saved back to the same file
# There are other ways to do this;
# using an intermediate file is the most reliable method.
done
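As an aside, if the moreutils package happens to be available on your system (an assumption; it is not installed everywhere), its sponge utility can write a pipeline's output back onto the input file without an explicit temporary file:
# sponge soaks up all of its input before writing the file,
# so reading and rewriting the same file in one pipeline is safe
cut -d',' -f1,2,4 TAVG_US_2010.csv | sponge TAVG_US_2010.csv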
Run the script:
$ ./key_columns.sh
Extracting columns TAVG_US_2010.csv
...
Extracting columns TAVG_US_2019.csv
The first few lines of TAVG_US_2010.csv with the unneeded data removed are:
$ head TAVG_US_2010.csv
USR0000AALC,20100101,-220
USR0000AALP,20100101,-9
USR0000ABAN,20100101,12
USR0000ABCA,20100101,16
USR0000ABCK,20100101,-309
USR0000ABER,20100101,-81
USR0000ABEV,20100101,-360
USR0000ABEN,20100101,-224
USR0000ABNS,20100101,89
USR0000ABLA,20100101,59
Dates are in YYYYMMDD string form. To train your model correctly, your algorithms need to recognize date fields in the comma-separated Y,M,D form (for example, 20100101 becomes 2010,01,01). You can convert them with the sed utility.
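To see what the two sed expressions used in the script below do, try them on a single sample record from the data:
$ echo "USR0000AALC,20100101,-220" | sed 's/,..../&,/' | sed 's/,....,../&,/'
USR0000AALC,2010,01,01,-220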
date_format.sh
#!/bin/sh
# For each file with a name that starts with "TAVG_" and ends with ".csv"
for csv_file in `ls TAVG_*.csv`
do
echo Date formatting $csv_file
# This inserts , after year
sed -i 's/,..../&,/' $csv_file
# This inserts , after month
sed -i 's/,....,../&,/' $csv_file
done
Run the script:
$ ./date_format.sh
Date formatting TAVG_US_2010.csv
...
Date formatting TAVG_US_2019.csv
The first few lines of TAVG_US_2010.csv with the comma-separated date format are:
$ head TAVG_US_2010.csv
USR0000AALC,2010,01,01,-220
USR0000AALP,2010,01,01,-9
USR0000ABAN,2010,01,01,12
USR0000ABCA,2010,01,01,16
USR0000ABCK,2010,01,01,-309
USR0000ABER,2010,01,01,-81
USR0000ABEV,2010,01,01,-360
USR0000ABEN,2010,01,01,-224
USR0000ABNS,2010,01,01,89
USR0000ABLA,2010,01,01,59
3. Aggregate states' average temperature data
The weather data comes from climate stations located in US cities, but you want to forecast whole states' temperatures. To convert the climate-station data to state data, first map each climate station to its state.
Download the list of climate stations using wget:
$ wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
Extract the US stations with the grep utility by searching for lines that begin with the text "US". The > is a redirection operator that writes the output to a file, in this case, a file named us_stations.txt:
$ grep "^US" ghcnd-stations.txt > us_stations.txt
This file was formatted for pretty printing, so the column separators are inconsistent runs of spaces:
$ head us_stations.txt
US009052008 43.7333 -96.6333 482.0 SD SIOUX FALLS (ENVIRON. CANADA)
US10RMHS145 40.5268 -105.1113 1569.1 CO RMHS 1.6 SSW
US10adam001 40.5680 -98.5069 598.0 NE JUNIATA 1.5 S
...
Make them consistent by using cat to print the file, using tr to squeeze repeated spaces and output to a temp file, and copying the temp file back over the original, all in one line:
$ cat us_stations.txt | tr -s ' ' > us_stations.txt.tmp; cp us_stations.txt.tmp us_stations.txt;
The first lines of the command's output:
$ head us_stations.txt
US009052008 43.7333 -96.6333 482.0 SD SIOUX FALLS (ENVIRON. CANADA)
US10RMHS145 40.5268 -105.1113 1569.1 CO RMHS 1.6 SSW
US10adam001 40.5680 -98.5069 598.0 NE JUNIATA 1.5 S
...
This contains a lot of info (GPS coordinates and such), but you only need the station code and state. Use cut:
$ cut -d' ' -f1,5 us_stations.txt > us_stations.txt.tmp; mv us_stations.txt.tmp us_stations.txt;
The first lines of the command's output:
$ head us_stations.txt
US009052008 SD
US10RMHS145 CO
US10adam001 NE
US10adam002 NE
...
Make this a CSV and change the spaces to comma separators using sed:
$ sed -i s/' '/,/g us_stations.txt
The first lines of the command's output:
$ head us_stations.txt
US009052008,SD
US10RMHS145,CO
US10adam001,NE
US10adam002,NE
...
Although you used several commands for these tasks, it is possible to perform all the steps in one run. Try it yourself before looking at the sketch below.
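Here is one possible combined run (a sketch; it reproduces the steps above in a single pipeline):
$ grep "^US" ghcnd-stations.txt | tr -s ' ' | cut -d' ' -f1,5 | sed s/' '/,/g > us_stations.txt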
Now, replace the station codes with their state locations by using AWK, which performs well on large data processing tasks.
station_to_state_data.sh
#!/bin/sh
PATTERN_FILE=us_stations.txt
# For each file with a name that starts with "TAVG_US_" and ends with ".csv"
for DATA_FILE in `ls TAVG_US_*.csv`
do
echo ${DATA_FILE}
awk -F, \
'FNR==NR { x[$1]=$2; next; } { $1=x[$1]; print $0 }' \
OFS=, \
${PATTERN_FILE} ${DATA_FILE} > ${DATA_FILE}.tmp
mv ${DATA_FILE}.tmp ${DATA_FILE}
done
Here is what these parameters mean:
- -F, : Field separator is ,
- FNR : Line number in each file
- NR : Line number across both files together
- FNR==NR : TRUE only for the first file, ${PATTERN_FILE}
- { x[$1]=$2; next; } : Runs while FNR==NR is TRUE (for all lines in $PATTERN_FILE only)
- x : Variable that stores the station=state map
- x[$1]=$2 : Adds a station=state entry to the map
- $1 : First column in the first file (station codes)
- $2 : Second column in the first file (state codes)
- x : Map of all stations, e.g., x[US009052008]=SD, x[US10RMHS145]=CO, ..., x[USW00096409]=AK
- next : Go to the next line matching FNR==NR (essentially, this creates a map of all station-state pairs from ${PATTERN_FILE})
- { $1=x[$1]; print $0 } : Runs when FNR==NR is FALSE (for all lines in $DATA_FILE only)
- $1=x[$1] : Replaces the first field with x[$1]; essentially, replaces the station code with the state code
- print $0 : Prints all columns (including the replaced $1)
- OFS=, : Output field separator is ,
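To see the FNR==NR pattern in action, try it on tiny, made-up inputs (map.txt and the piped record here are hypothetical):
$ printf 'US1,AK\nUS2,AZ\n' > map.txt
$ printf 'US1,2010,01,01,-220\n' | awk -F, 'FNR==NR { x[$1]=$2; next; } { $1=x[$1]; print $0 }' OFS=, map.txt -
AK,2010,01,01,-220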
The CSV with station codes:
$ head TAVG_US_2010.csv
USR0000AALC,2010,01,01,-220
USR0000AALP,2010,01,01,-9
USR0000ABAN,2010,01,01,12
USR0000ABCA,2010,01,01,16
USR0000ABCK,2010,01,01,-309
USR0000ABER,2010,01,01,-81
USR0000ABEV,2010,01,01,-360
USR0000ABEN,2010,01,01,-224
USR0000ABNS,2010,01,01,89
USR0000ABLA,2010,01,01,59
Run the script:
$ ./station_to_state_data.sh
TAVG_US_2010.csv
...
TAVG_US_2019.csv
Stations are now mapped to states:
$ head TAVG_US_2010.csv
AK,2010,01,01,-220
AZ,2010,01,01,-9
AL,2010,01,01,12
AK,2010,01,01,16
AK,2010,01,01,-309
AK,2010,01,01,-81
AK,2010,01,01,-360
AK,2010,01,01,-224
AZ,2010,01,01,59
AK,2010,01,01,-68
Every state has several temperature readings for each day, so you need to calculate the average of each state's readings for a day. Use AWK for text processing, sort to ensure the final results are in a logical order, and rm to delete the temporary file after processing.
TAVG_avg.sh
#!/bin/sh
# For each file with a name that starts with "TAVG_US_" and ends with ".csv"
for DATA_FILE in `ls TAVG_US_*.csv`
do
echo ${DATA_FILE}
awk -v FILE=$DATA_FILE -F, '
{
state_day_sum[$1 "," $2 "," $3 "," $4] = state_day_sum[$1 "," $2 "," $3 "," $4] + $5
state_day_num[$1 "," $2 "," $3 "," $4] = state_day_num[$1 "," $2 "," $3 "," $4] + 1
}
END {
for (state_day_key in state_day_sum) {
print state_day_key "," state_day_sum[state_day_key]/state_day_num[state_day_key]
}
}
' OFS=, $DATA_FILE > STATE_DAY_${DATA_FILE}.tmp
sort -t',' -k1,1 -k2,2 -k3,3 -k4,4 < STATE_DAY_${DATA_FILE}.tmp > STATE_DAY_$DATA_FILE
rm -f STATE_DAY_${DATA_FILE}.tmp
done
Here is what the AWK parameters mean:
- FILE=$DATA_FILE : The CSV file being processed, passed to AWK as the variable FILE
- -F, : Field separator is ,
- state_day_sum[$1 "," $2 "," $3 "," $4] = state_day_sum[...] + $5 : Running sum of temperatures ($5) for the state ($1) on year ($2), month ($3), day ($4)
- state_day_num[$1 "," $2 "," $3 "," $4] = state_day_num[...] + 1 : Running count of temperature readings for the state ($1) on year ($2), month ($3), day ($4)
- END : At the end, after collecting sums and counts for all states, years, months, and days, calculate the averages
- for (state_day_key in state_day_sum) : For each state-year-month-day
- print state_day_key "," state_day_sum[state_day_key]/state_day_num[state_day_key] : Prints state,year,month,day,average
- OFS=, : Output field separator is ,
- $DATA_FILE : The input file (all files with names starting with TAVG_US_ and ending with .csv, one by one)
- > STATE_DAY_${DATA_FILE}.tmp : Saves the result to a temporary file
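To convince yourself of the sum-and-count logic, run the same AWK program on two made-up readings for a single state and day (hypothetical values):
$ printf 'AK,2010,01,01,-100\nAK,2010,01,01,-200\n' | awk -F, '
{
state_day_sum[$1 "," $2 "," $3 "," $4] += $5
state_day_num[$1 "," $2 "," $3 "," $4] += 1
}
END {
for (k in state_day_sum) print k "," state_day_sum[k]/state_day_num[k]
}'
AK,2010,01,01,-150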
Run the script:
$ ./TAVG_avg.sh
TAVG_US_2010.csv
TAVG_US_2011.csv
TAVG_US_2012.csv
TAVG_US_2013.csv
TAVG_US_2014.csv
TAVG_US_2015.csv
TAVG_US_2016.csv
TAVG_US_2017.csv
TAVG_US_2018.csv
TAVG_US_2019.csv
These files are created:
$ ls STATE_DAY_TAVG_US_20*.csv
STATE_DAY_TAVG_US_2010.csv STATE_DAY_TAVG_US_2015.csv
STATE_DAY_TAVG_US_2011.csv STATE_DAY_TAVG_US_2016.csv
STATE_DAY_TAVG_US_2012.csv STATE_DAY_TAVG_US_2017.csv
STATE_DAY_TAVG_US_2013.csv STATE_DAY_TAVG_US_2018.csv
STATE_DAY_TAVG_US_2014.csv STATE_DAY_TAVG_US_2019.csv
See one year of data for all states (less is a utility for viewing output one page at a time):
$ less STATE_DAY_TAVG_US_2010.csv
AK,2010,01,01,-181.934
...
AK,2010,01,31,-101.068
AK,2010,02,01,-107.11
...
AK,2010,02,28,-138.834
...
WY,2010,01,01,-43.5625
...
WY,2010,12,31,-215.583
Merge all the data files into one:
$ cat STATE_DAY_TAVG_US_20*.csv > TAVG_US_2010-2019.csv
You now have one file, with all states, for all years:
$ cat TAVG_US_2010-2019.csv
AK,2010,01,01,-181.934
...
WY,2018,12,31,-167.421
AK,2019,01,01,-32.3386
...
WY,2019,12,30,-131.028
WY,2019,12,31,-79.8704
4. Make time-series data
A problem like this is fittingly addressed with a time-series model such as long short-term memory (LSTM), which is a recurrent neural network (RNN). The input data is organized into time slices; consider 20 days to be one slice.
This is one time slice (as in STATE_DAY_TAVG_US_2010.csv):
X (input: 20 days):
AK,2010,01,01,-181.934
AK,2010,01,02,-199.531
...
AK,2010,01,20,-157.273
y (the 21st day, the prediction that follows these 20 days):
AK,2010,01,21,-165.31
This time slice is represented as (temperature values, where the first 20 days are X and the 21st is y):
AK, -181.934,-199.531, ... ,
-157.273,-165.31
The slices are time-contiguous. For example, the end of 2010 continues into 2011:
AK,2010,12,22,-209.92
...
AK,2010,12,31,-79.8523
AK,2011,01,01,-59.5658
...
AK,2011,01,10,-100.623
Which results in the prediction:
AK,2011,01,11,-106.851
This time slice is taken as:
AK, -209.92, ... ,-79.8523,-59.5658, ... ,-100.623,-106.851
and so on, for all states, years, months, and dates. For more explanation, see this tutorial on time-series forecasting.
Write a script to create time slices:
timeslices.sh
#!/bin/sh
TIME_SLICE_PERIOD=20
file=TAVG_US_2010-2019.csv
# For each state in file
for state in `cut -d',' -f1 $file | sort | uniq`
do
# Get all temperature values for the state
state_tavgs=`grep $state $file | cut -d',' -f5`
# How many time slices will this result in?
# Number of temperatures recorded minus the size of one time slice
num_slices=`echo $state_tavgs | wc -w`
num_slices=$((${num_slices} - ${TIME_SLICE_PERIOD}))
# Initialize
slice_start=1; num_slice=0;
# For each timeslice
while [ $num_slice -lt $num_slices ]
do
# One timeslice is from slice_start to slice_end
slice_end=$(($slice_start + $TIME_SLICE_PERIOD - 1))
# X (1-20)
sliceX="$slice_start-$slice_end"
# y (21)
slicey=$(($slice_end + 1))
# Print state and timeslice temperature values (column 1-20 and 21)
echo $state `echo $state_tavgs | cut -d' ' -f$sliceX,$slicey`
# Increment
slice_start=$(($slice_start + 1)); num_slice=$(($num_slice + 1));
done
done
Run the script. It uses spaces as column separators; make them commas with sed:
$ ./timeslices.sh > TIMESLICE_TAVG_US_2010-2019.csv; sed -i s/' '/,/g TIMESLICE_TAVG_US_2010-2019.csv
Here are the first few lines and the last few lines of the output .csv:
$ head -3 TIMESLICE_TAVG_US_2010-2019.csv
AK,-271.271,-290.057,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338
AK,-290.057,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338,-130.829
AK,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338,-130.829,-123.979
$ tail -3 TIMESLICE_TAVG_US_2010-2019.csv
WY,-76.9167,-66.2315,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898
WY,-66.2315,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898,-131.028
WY,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898,-131.028,-79.8704
While this does work, it is not very performant. It can be optimized; can you give it a try? Report in the comments below how you did it.
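As one possible direction (a sketch, not the only answer): AWK can emit all slices in a single pass over the merged file by keeping a sliding window of the last 21 temperatures per state, avoiding the repeated grep and cut calls. This assumes each state's rows appear in chronological order in the file, which the earlier sort steps provide:
$ awk -F, '
{
s = $1
# Append this temperature to the sliding window for state s
w[s, n[s]++] = $5
# Once 21 values are buffered, print one slice and slide the window by one
if (n[s] == 21) {
line = s
for (i = 0; i < 21; i++) line = line "," w[s, i]
print line
for (i = 1; i < 21; i++) w[s, i-1] = w[s, i]
n[s] = 20
}
}' TAVG_US_2010-2019.csv > TIMESLICE_TAVG_US_2010-2019.csv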
5. Create train, test, and validate data sets
Split the data into train, test, and validate sets.
data_sets.sh
#!/bin/sh
GEN=SEQ
# GEN=RAN
FILE=TIMESLICE_TAVG_US_2010-2019.csv
TRAIN_SET_PERCENT=70
TEST_SET_PERCENT=20
VAL_SET_PERCENT=$(( 100 - $TRAIN_SET_PERCENT - $TEST_SET_PERCENT ))
TRAIN_DATA=TRAIN_$FILE
TEST_DATA=TEST_$FILE
VAL_DATA=VAL_$FILE
> $TRAIN_DATA
> $TEST_DATA
> $VAL_DATA
for state in `cut -d',' -f1 $FILE | sort | uniq`
do
NUM_STATE_DATA=`grep "$state" $FILE | wc -l`
echo "$state: $NUM_STATE_DATA"
TRAIN_NUM_DATA=$(( $NUM_STATE_DATA * $TRAIN_SET_PERCENT / 100 ))
TEST_NUM_DATA=$(( $NUM_STATE_DATA * $TEST_SET_PERCENT / 100 ))
VAL_NUM_DATA=$(( $NUM_STATE_DATA - $TRAIN_NUM_DATA - $TEST_NUM_DATA ))
if [ $GEN == "SEQ" ]
then
echo "Sequential"
STATE_DATA=`grep $state $FILE`
elif [ $GEN == "RAN" ]
then
echo "Randomized"
STATE_DATA=`grep $state $FILE | shuf`
else
echo "Unknown data gen type: " $GEN
exit 1
fi
# Train set
per=$TRAIN_SET_PERCENT
num=$TRAIN_NUM_DATA; from=1; to=$(($from + $num - 1));
echo Train set: $per% $num from=$from to=$to
echo "$STATE_DATA" | head -$to >> $TRAIN_DATA
# Test set
per=$TEST_SET_PERCENT
num=$TEST_NUM_DATA; from=$(($to + 1)); to=$(($from + $num - 1));
echo Test set: $per% $num from=$from to=$to
echo "$STATE_DATA" | head -$to | tail -$num >> $TEST_DATA
# Validate set
per=$VAL_SET_PERCENT
num=$VAL_NUM_DATA; from=$(($to + 1)); to=$NUM_STATE_DATA;
echo Validate set: $per% $num from=$from to=$to
echo "$STATE_DATA" | tail -$num >> $VAL_DATA
echo
done
This script generates data sets that can be sequential or randomized by setting GEN=SEQ or GEN=RAN in the script; the randomized option shuffles the data with shuf.
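To see what shuf does, try it on a trivial input (the order of the output lines is random and will differ on each run):
$ seq 5 | shuf
3
1
5
2
4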
Run the script:
$ ./data_sets.sh
AK: 3652
Sequential
Train set: 70% 2556 from=1 to=2556
Test set: 20% 730 from=2557 to=3286
Validate set: 10% 366 from=3287 to=3652
...
WY: 3352
Sequential
Train set: 70% 2346 from=1 to=2346
Test set: 20% 670 from=2347 to=3016
Validate set: 10% 336 from=3017 to=3352
It creates these data files:
$ ls *_TIMESLICE_TAVG_US_2010-2019.csv
TEST_TIMESLICE_TAVG_US_2010-2019.csv VAL_TIMESLICE_TAVG_US_2010-2019.csv
TRAIN_TIMESLICE_TAVG_US_2010-2019.csv
Data processing with shell
When you need to process a humongous amount of data for your next machine learning project, think about shell commands and scripts. The shell is proven and ready to use, with friendly communities to guide and help you.
This article introduced the shell for data processing, and the scripts demonstrate its opportunities. Much more is possible. Do you want to take it forward? Tell us in the comments below.