Machine learning is a powerful computing capability for predicting or forecasting things that conventional algorithms find challenging. The machine learning journey begins with collecting and preparing data, a lot of it, and then building mathematical models from that data. While multiple tools can be used for these tasks, I like to use the shell.
A shell is an interface for performing operations using a defined language. This language can be invoked interactively or scripted. The concept of the shell was introduced in Unix operating systems in the 1970s. Some of the most popular shells include Bash, tcsh, and Zsh. They are available for all operating systems, including Linux, macOS, and Windows, which gives them high portability. For this exercise, I'll use Bash.
This article is an introduction to using a shell for data collection and data preparation. Whether you are a data scientist looking for efficient tools or a shell expert looking at using your skills for machine learning, I hope you will find valuable information here.
The example problem in this article is creating a machine learning model to forecast temperatures for US states. It uses shell commands and scripts to do the following data collection and data preparation steps:
- Download data
- Extract the necessary fields
- Aggregate data
- Make time series
- Create the train, test, and validate data sets
You may be asking why you should do this with the shell when you can do all of it in a machine learning programming language such as Python. It is a fair question. If data processing is performed with an easy, friendly, and capable technology like the shell, a data scientist can focus on machine learning modeling rather than the details of a language.
Prerequisites
First, you need to have a shell interpreter installed. If you use Linux or macOS, it will already be installed, and you may already be familiar with it. If you use Windows, try MinGW or Cygwin.
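To check which shell you are using and which version is installed, you can run the following (the output varies by system):
$ echo $SHELL
/bin/bash
$ bash --version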
For more information, see:
- Bash tutorials here on opensource.com
- The shell scripting tutorial by Steve Parker
- The Bash Guide for Beginners by the Linux Documentation Project
- If you need help with a specific command, type <commandname> --help in the shell; for example: ls --help.
Get started
Now that your shell is set up, you can start preparing data for the machine learning temperature-prediction problem.
1. Download data
The data for this tutorial comes from the US National Oceanic and Atmospheric Administration (NOAA). You will train your model using the last 10 complete years of data. The data source is at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/, and the data is in .csv format and gzipped.
Download and unzip the data using a shell script. Use your favorite text editor to create a file named download.sh and paste in the code below. The comments in the code explain what the commands do:
#!/bin/sh
# This line is called a shebang (or hashbang). It identifies the interpreter used to run this file.
# In this case, the script is executed by the shell itself.
# If it is not specified, the program to execute the script must be named explicitly.
# With shebang: ./download.sh; Without shebang: sh ./download.sh
FROM_YEAR=2010
TO_YEAR=2019
year=$FROM_YEAR
# For all years, one by one, starting from FROM_YEAR=2010 up to TO_YEAR=2019
while [ $year -le $TO_YEAR ]
do
# show the year being downloaded now
echo $year
# Download
wget https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/${year}.csv.gz
# Unzip
gzip -d ${year}.csv.gz
# Move to next year by incrementing
year=$(($year+1))
done
Notes:
- If you are behind a proxy server, consult Mark Grennan's how-to, and use:
export http_proxy=http://username:password@proxyhost:port/
export https_proxy=https://username:password@proxyhost:port/
- Make sure all standard commands are already in your PATH (such as /bin or /usr/bin). If not, set your PATH.
- Wget is a utility for downloading files from web servers on the command line. If Wget is not installed on your system, download it.
- Make sure you have gzip, a utility used for compression and decompression.
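A quick way to confirm that both utilities are available is the command -v built-in (the paths in this output vary by system):
$ command -v wget
/usr/bin/wget
$ command -v gzip
/usr/bin/gzip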
Run this script to download, extract, and make 10 years' worth of data available as CSVs:
$ ./download.sh
2010
--2020-10-30 19:10:47-- https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2010.csv.gz
Resolving www1.ncdc.noaa.gov (www1.ncdc.noaa.gov)... 205.167.25.171, 205.167.25.172, 205.167.25.178, ...
Connecting to www1.ncdc.noaa.gov (www1.ncdc.noaa.gov)|205.167.25.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170466817 (163M) [application/gzip]
Saving to: '2010.csv.gz'
0K .......... .......... .......... .......... .......... 0% 69.4K 39m57s
50K .......... .......... .......... .......... .......... 0% 202K 26m49s
100K .......... .......... .......... .......... .......... 0% 1.08M 18m42s
...
The ls command lists the contents of a folder. Use ls 20*.csv to list all your files with names beginning with 20 and ending with .csv.
$ ls 20*.csv
2010.csv 2011.csv 2012.csv 2013.csv 2014.csv 2015.csv 2016.csv 2017.csv 2018.csv 2019.csv
2. Extract average temperatures
Extract the TAVG (average temperature) data from the CSVs for US regions:
extract_tavg_us.sh
#!/bin/sh
# For each file with name that starts with "20" and ens with ".csv"
for csv_file in `ls 20*.csv`
do
# Message that says file name $csv_file is extracted to file TAVG_US_$csv_file
# Example: 2010.csv extracted to TAVG_US_2010.csv
echo "$csv_file -> TAVG_US_$csv_file"
# grep "TAVG" $csv_file: Extract lines in file with text "TAVG"
# |: pipe
# grep "^US": From those extract lines that begin with text "US"
# > TAVG_US_$csv_file: Save extracted lines to file TAVG_US_$csv_file
grep "TAVG" $csv_file | grep "^US" > TAVG_US_$csv_file
done
Run the script:
$ ./extract_tavg_us.sh
2010.csv -> TAVG_US_2010.csv
...
2019.csv -> TAVG_US_2019.csv
It creates these files:
$ ls TAVG_US*.csv
TAVG_US_2010.csv TAVG_US_2011.csv TAVG_US_2012.csv TAVG_US_2013.csv
TAVG_US_2014.csv TAVG_US_2015.csv TAVG_US_2016.csv TAVG_US_2017.csv
TAVG_US_2018.csv TAVG_US_2019.csv
Here are the first few lines of TAVG_US_2010.csv:
$ head TAVG_US_2010.csv
USR0000AALC,20100101,TAVG,-220,,,U,
USR0000AALP,20100101,TAVG,-9,,,U,
USR0000ABAN,20100101,TAVG,12,,,U,
USR0000ABCA,20100101,TAVG,16,,,U,
USR0000ABCK,20100101,TAVG,-309,,,U,
USR0000ABER,20100101,TAVG,-81,,,U,
USR0000ABEV,20100101,TAVG,-360,,,U,
USR0000ABEN,20100101,TAVG,-224,,,U,
USR0000ABNS,20100101,TAVG,89,,,U,
USR0000ABLA,20100101,TAVG,59,,,U,
The head command is a utility for displaying the first several lines (by default, 10 lines) of a file.
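For example, to show only the first three lines, pass a line count (this assumes the extraction step above has been run):
$ head -n 3 TAVG_US_2010.csv
USR0000AALC,20100101,TAVG,-220,,,U,
USR0000AALP,20100101,TAVG,-9,,,U,
USR0000ABAN,20100101,TAVG,12,,,U,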
The data has more information than you need. Limit the number of columns by eliminating column 3 (since all the data is average temperature) and column 5 onward. In other words, keep columns 1 (climate station), 2 (date), and 4 (temperature recorded).
key_columns.sh
#!/bin/sh
# For each file with name that starts with "TAVG_US_" and ens with ".csv"
for csv_file in `ls TAVG_US_*.csv`
do
echo "Exractiing columns $csv_file"
# cat $csv_file: 'cat' is to con'cat'enate files - here used to show one year csv file
# |: pipe
# cut -d',' -f1,2,4: Cut columns 1,2,4 with , delimitor
# > $csv_file.cut: Save to temporary file
| > $csv_file.cut:
cat $csv_file | cut -d',' -f1,2,4 > $csv_file.cut
# mv $csv_file.cut $csv_file: Rename temporary file to original file
mv $csv_file.cut $csv_file
# The file is processed and saved back to the same file
# There are other ways to do this;
# using an intermediate file is the most reliable method.
done
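As an aside, if the moreutils package happens to be available on your system (an assumption; it is not installed everywhere), its sponge utility can write a pipeline's output back onto the input file without an explicit temporary file:
# sponge soaks up all of its input before writing the file,
# so reading and rewriting the same file in one pipeline is safe
cut -d',' -f1,2,4 TAVG_US_2010.csv | sponge TAVG_US_2010.csv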
Run the script:
$ ./key_columns.sh
Extracting columns TAVG_US_2010.csv
...
Extracting columns TAVG_US_2019.csv
The first few lines of TAVG_US_2010.csv with the unneeded data removed are:
$ head TAVG_US_2010.csv
USR0000AALC,20100101,-220
USR0000AALP,20100101,-9
USR0000ABAN,20100101,12
USR0000ABCA,20100101,16
USR0000ABCK,20100101,-309
USR0000ABER,20100101,-81
USR0000ABEV,20100101,-360
USR0000ABEN,20100101,-224
USR0000ABNS,20100101,89
USR0000ABLA,20100101,59
Dates are in YYYYMMDD string form. To train your model correctly, your algorithms need to recognize date fields in the comma-separated Y,M,D form (for example, 20100101 becomes 2010,01,01). You can convert them with the sed utility.
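To see what the two sed expressions used in the script below do, try them on a single sample record from the data:
$ echo "USR0000AALC,20100101,-220" | sed 's/,..../&,/' | sed 's/,....,../&,/'
USR0000AALC,2010,01,01,-220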
date_format.sh
#!/bin/sh
# For each file with a name that starts with "TAVG_" and ends with ".csv"
for csv_file in `ls TAVG_*.csv`
do
echo Date formatting $csv_file
# This inserts , after year
sed -i 's/,..../&,/' $csv_file
# This inserts , after month
sed -i 's/,....,../&,/' $csv_file
done
Run the script:
$ ./date_format.sh
Date formatting TAVG_US_2010.csv
...
Date formatting TAVG_US_2019.csv
The first few lines of TAVG_US_2010.csv with the comma-separated date format are:
$ head TAVG_US_2010.csv
USR0000AALC,2010,01,01,-220
USR0000AALP,2010,01,01,-9
USR0000ABAN,2010,01,01,12
USR0000ABCA,2010,01,01,16
USR0000ABCK,2010,01,01,-309
USR0000ABER,2010,01,01,-81
USR0000ABEV,2010,01,01,-360
USR0000ABEN,2010,01,01,-224
USR0000ABNS,2010,01,01,89
USR0000ABLA,2010,01,01,59
3. Aggregate states' average temperature data
The weather data comes from climate stations located in US cities, but you want to forecast whole states' temperatures. To convert the climate-station data to state data, first map each climate station to its state.
Download the list of climate stations using wget:
$ wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
Extract the US stations with the grep utility by searching for lines that begin with the text "US". The > is a redirection operator that writes the output to a file, in this case, a file named us_stations.txt:
$ grep "^US" ghcnd-stations.txt > us_stations.txt
This file was formatted for pretty printing, so the column separators are inconsistent runs of spaces:
$ head us_stations.txt
US009052008 43.7333 -96.6333 482.0 SD SIOUX FALLS (ENVIRON. CANADA)
US10RMHS145 40.5268 -105.1113 1569.1 CO RMHS 1.6 SSW
US10adam001 40.5680 -98.5069 598.0 NE JUNIATA 1.5 S
...
Make them consistent by using cat to print the file, using tr to squeeze repeated spaces and output to a temp file, and copying the temp file back over the original, all in one line:
$ cat us_stations.txt | tr -s ' ' > us_stations.txt.tmp; cp us_stations.txt.tmp us_stations.txt;
The first lines of the command's output:
$ head us_stations.txt
US009052008 43.7333 -96.6333 482.0 SD SIOUX FALLS (ENVIRON. CANADA)
US10RMHS145 40.5268 -105.1113 1569.1 CO RMHS 1.6 SSW
US10adam001 40.5680 -98.5069 598.0 NE JUNIATA 1.5 S
...
This contains a lot of info (GPS coordinates and such), but you only need the station code and state. Use cut:
$ cut -d' ' -f1,5 us_stations.txt > us_stations.txt.tmp; mv us_stations.txt.tmp us_stations.txt;
The first lines of the command's output:
$ head us_stations.txt
US009052008 SD
US10RMHS145 CO
US10adam001 NE
US10adam002 NE
...
Make this a CSV and change the spaces to comma separators using sed:
$ sed -i s/' '/,/g us_stations.txt
The first lines of the command's output:
$ head us_stations.txt
US009052008,SD
US10RMHS145,CO
US10adam001,NE
US10adam002,NE
...
Although you used several commands for these tasks, it is possible to perform all the steps in one run. Try it yourself before looking at the sketch below.
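Here is one possible combined run (a sketch; it reproduces the steps above in a single pipeline):
$ grep "^US" ghcnd-stations.txt | tr -s ' ' | cut -d' ' -f1,5 | sed s/' '/,/g > us_stations.txt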
Now, replace the station codes with their state locations by using AWK, which performs well on large data processing tasks.
station_to_state_data.sh
#!/bin/sh
PATTERN_FILE=us_stations.txt
# For each file with a name that starts with "TAVG_US_" and ends with ".csv"
for DATA_FILE in `ls TAVG_US_*.csv`
do
echo ${DATA_FILE}
awk -F, \
'FNR==NR { x[$1]=$2; next; } { $1=x[$1]; print $0 }' \
OFS=, \
${PATTERN_FILE} ${DATA_FILE} > ${DATA_FILE}.tmp
mv ${DATA_FILE}.tmp ${DATA_FILE}
done
Here is what these parameters mean:
- -F, : Field separator is ,
- FNR : Line number in each file
- NR : Line number across both files together
- FNR==NR : TRUE only for the first file, ${PATTERN_FILE}
- { x[$1]=$2; next; } : Runs while FNR==NR is TRUE (for all lines in $PATTERN_FILE only)
- x : Variable that stores the station=state map
- x[$1]=$2 : Adds a station=state entry to the map
- $1 : First column in the first file (station codes)
- $2 : Second column in the first file (state codes)
- x : Map of all stations, e.g., x[US009052008]=SD, x[US10RMHS145]=CO, ..., x[USW00096409]=AK
- next : Go to the next line matching FNR==NR (essentially, this creates a map of all station-state pairs from ${PATTERN_FILE})
- { $1=x[$1]; print $0 } : Runs when FNR==NR is FALSE (for all lines in $DATA_FILE only)
- $1=x[$1] : Replaces the first field with x[$1]; essentially, replaces the station code with the state code
- print $0 : Prints all columns (including the replaced $1)
- OFS=, : Output field separator is ,
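To see the FNR==NR pattern in action, try it on tiny, made-up inputs (map.txt and the piped record here are hypothetical):
$ printf 'US1,AK\nUS2,AZ\n' > map.txt
$ printf 'US1,2010,01,01,-220\n' | awk -F, 'FNR==NR { x[$1]=$2; next; } { $1=x[$1]; print $0 }' OFS=, map.txt -
AK,2010,01,01,-220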
The CSV with station codes:
$ head TAVG_US_2010.csv
USR0000AALC,2010,01,01,-220
USR0000AALP,2010,01,01,-9
USR0000ABAN,2010,01,01,12
USR0000ABCA,2010,01,01,16
USR0000ABCK,2010,01,01,-309
USR0000ABER,2010,01,01,-81
USR0000ABEV,2010,01,01,-360
USR0000ABEN,2010,01,01,-224
USR0000ABNS,2010,01,01,89
USR0000ABLA,2010,01,01,59
Run the script:
$ ./station_to_state_data.sh
TAVG_US_2010.csv
...
TAVG_US_2019.csv
Stations are now mapped to states:
$ head TAVG_US_2010.csv
AK,2010,01,01,-220
AZ,2010,01,01,-9
AL,2010,01,01,12
AK,2010,01,01,16
AK,2010,01,01,-309
AK,2010,01,01,-81
AK,2010,01,01,-360
AK,2010,01,01,-224
AZ,2010,01,01,59
AK,2010,01,01,-68
Every state has several temperature readings for each day, so you need to calculate the average of each state's readings for a day. Use AWK for text processing, sort to ensure the final results are in a logical order, and rm to delete the temporary file after processing.
TAVG_avg.sh
#!/bin/sh
# For each file with a name that starts with "TAVG_US_" and ends with ".csv"
for DATA_FILE in `ls TAVG_US_*.csv`
do
echo ${DATA_FILE}
awk -v FILE=$DATA_FILE -F, '
{
state_day_sum[$1 "," $2 "," $3 "," $4] = state_day_sum[$1 "," $2 "," $3 "," $4] + $5
state_day_num[$1 "," $2 "," $3 "," $4] = state_day_num[$1 "," $2 "," $3 "," $4] + 1
}
END {
for (state_day_key in state_day_sum) {
print state_day_key "," state_day_sum[state_day_key]/state_day_num[state_day_key]
}
}
' OFS=, $DATA_FILE > STATE_DAY_${DATA_FILE}.tmp
sort -t',' -k1,1 -k2,2 -k3,3 -k4,4 < STATE_DAY_${DATA_FILE}.tmp > STATE_DAY_$DATA_FILE
rm -f STATE_DAY_${DATA_FILE}.tmp
done
Here is what the AWK parameters mean:
- FILE=$DATA_FILE : The CSV file being processed, passed to AWK as the variable FILE
- -F, : Field separator is ,
- state_day_sum[$1 "," $2 "," $3 "," $4] = state_day_sum[...] + $5 : Running sum of temperatures ($5) for the state ($1) on year ($2), month ($3), day ($4)
- state_day_num[$1 "," $2 "," $3 "," $4] = state_day_num[...] + 1 : Running count of temperature readings for the state ($1) on year ($2), month ($3), day ($4)
- END : At the end, after collecting sums and counts for all states, years, months, and days, calculate the averages
- for (state_day_key in state_day_sum) : For each state-year-month-day
- print state_day_key "," state_day_sum[state_day_key]/state_day_num[state_day_key] : Prints state,year,month,day,average
- OFS=, : Output field separator is ,
- $DATA_FILE : The input file (all files with names starting with TAVG_US_ and ending with .csv, one by one)
- > STATE_DAY_${DATA_FILE}.tmp : Saves the result to a temporary file
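To convince yourself of the sum-and-count logic, run the same AWK program on two made-up readings for a single state and day (hypothetical values):
$ printf 'AK,2010,01,01,-100\nAK,2010,01,01,-200\n' | awk -F, '
{
state_day_sum[$1 "," $2 "," $3 "," $4] += $5
state_day_num[$1 "," $2 "," $3 "," $4] += 1
}
END {
for (k in state_day_sum) print k "," state_day_sum[k]/state_day_num[k]
}'
AK,2010,01,01,-150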
Run the script:
$ ./TAVG_avg.sh
TAVG_US_2010.csv
TAVG_US_2011.csv
TAVG_US_2012.csv
TAVG_US_2013.csv
TAVG_US_2014.csv
TAVG_US_2015.csv
TAVG_US_2016.csv
TAVG_US_2017.csv
TAVG_US_2018.csv
TAVG_US_2019.csv
These files are created:
$ ls STATE_DAY_TAVG_US_20*.csv
STATE_DAY_TAVG_US_2010.csv STATE_DAY_TAVG_US_2015.csv
STATE_DAY_TAVG_US_2011.csv STATE_DAY_TAVG_US_2016.csv
STATE_DAY_TAVG_US_2012.csv STATE_DAY_TAVG_US_2017.csv
STATE_DAY_TAVG_US_2013.csv STATE_DAY_TAVG_US_2018.csv
STATE_DAY_TAVG_US_2014.csv STATE_DAY_TAVG_US_2019.csv
See one year of data for all states (less is a utility for viewing output one page at a time):
$ less STATE_DAY_TAVG_US_2010.csv
AK,2010,01,01,-181.934
...
AK,2010,01,31,-101.068
AK,2010,02,01,-107.11
...
AK,2010,02,28,-138.834
...
WY,2010,01,01,-43.5625
...
WY,2010,12,31,-215.583
Merge all the data files into one:
$ cat STATE_DAY_TAVG_US_20*.csv > TAVG_US_2010-2019.csv
You now have one file, with all states, for all years:
$ cat TAVG_US_2010-2019.csv
AK,2010,01,01,-181.934
...
WY,2018,12,31,-167.421
AK,2019,01,01,-32.3386
...
WY,2019,12,30,-131.028
WY,2019,12,31,-79.8704
4. Make time-series data
A problem like this is fittingly addressed with a time-series model such as long short-term memory (LSTM), which is a recurrent neural network (RNN). The input data is organized into time slices; consider 20 days to be one slice.
This is one time slice (as in STATE_DAY_TAVG_US_2010.csv):
X (input: 20 days):
AK,2010,01,01,-181.934
AK,2010,01,02,-199.531
...
AK,2010,01,20,-157.273
y (the 21st day, the prediction that follows these 20 days):
AK,2010,01,21,-165.31
This time slice is represented as (temperature values, where the first 20 days are X and the 21st is y):
AK, -181.934,-199.531, ... ,
-157.273,-165.31
The slices are time-contiguous. For example, the end of 2010 continues into 2011:
AK,2010,12,22,-209.92
...
AK,2010,12,31,-79.8523
AK,2011,01,01,-59.5658
...
AK,2011,01,10,-100.623
Which results in the prediction:
AK,2011,01,11,-106.851
This time slice is taken as:
AK, -209.92, ... ,-79.8523,-59.5658, ... ,-100.623,-106.851
and so on, for all states, years, months, and dates. For more explanation, see this tutorial on time-series forecasting.
Write a script to create time slices:
timeslices.sh
#!/bin/sh
TIME_SLICE_PERIOD=20
file=TAVG_US_2010-2019.csv
# For each state in file
for state in `cut -d',' -f1 $file | sort | uniq`
do
# Get all temperature values for the state
state_tavgs=`grep $state $file | cut -d',' -f5`
# How many time slices will this result in?
# Number of temperatures recorded minus the size of one time slice
num_slices=`echo $state_tavgs | wc -w`
num_slices=$((${num_slices} - ${TIME_SLICE_PERIOD}))
# Initialize
slice_start=1; num_slice=0;
# For each timeslice
while [ $num_slice -lt $num_slices ]
do
# One timeslice is from slice_start to slice_end
slice_end=$(($slice_start + $TIME_SLICE_PERIOD - 1))
# X (1-20)
sliceX="$slice_start-$slice_end"
# y (21)
slicey=$(($slice_end + 1))
# Print state and timeslice temperature values (column 1-20 and 21)
echo $state `echo $state_tavgs | cut -d' ' -f$sliceX,$slicey`
# Increment
slice_start=$(($slice_start + 1)); num_slice=$(($num_slice + 1));
done
done
Run the script. It uses spaces as column separators; make them commas with sed:
$ ./timeslices.sh > TIMESLICE_TAVG_US_2010-2019.csv; sed -i s/' '/,/g TIMESLICE_TAVG_US_2010-2019.csv
Here are the first few lines and the last few lines of the output .csv:
$ head -3 TIMESLICE_TAVG_US_2010-2019.csv
AK,-271.271,-290.057,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338
AK,-290.057,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338,-130.829
AK,-300.324,-277.603,-270.36,-293.152,-292.829,-270.413,-256.674,-241.546,-217.757,-158.379,-102.585,-24.9517,-1.7973,15.9597,-5.78231,-33.932,-44.7655,-92.5694,-123.338,-130.829,-123.979
$ tail -3 TIMESLICE_TAVG_US_2010-2019.csv
WY,-76.9167,-66.2315,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898
WY,-66.2315,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898,-131.028
WY,-45.1944,-27.75,-55.3426,-81.5556,-124.769,-137.556,-90.213,-54.1389,-55.9907,-30.9167,-9.59813,7.86916,-1.09259,-13.9722,-47.5648,-83.5234,-98.2963,-124.694,-142.898,-131.028,-79.8704
While this does work, it is not very performant. It can be optimized; can you give it a try? Report in the comments below how you did it.
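As one possible direction (a sketch, not the only answer): AWK can emit all slices in a single pass over the merged file by keeping a sliding window of the last 21 temperatures per state, avoiding the repeated grep and cut calls. This assumes each state's rows appear in chronological order in the file, which the earlier sort steps provide:
$ awk -F, '
{
s = $1
# Append this temperature to the sliding window for state s
w[s, n[s]++] = $5
# Once 21 values are buffered, print one slice and slide the window by one
if (n[s] == 21) {
line = s
for (i = 0; i < 21; i++) line = line "," w[s, i]
print line
for (i = 1; i < 21; i++) w[s, i-1] = w[s, i]
n[s] = 20
}
}' TAVG_US_2010-2019.csv > TIMESLICE_TAVG_US_2010-2019.csv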
5. Create train, test, and validate data sets
Split the data into train, test, and validate sets.
data_sets.sh
#!/bin/sh
GEN=SEQ
# GEN=RAN
FILE=TIMESLICE_TAVG_US_2010-2019.csv
TRAIN_SET_PERCENT=70
TEST_SET_PERCENT=20
VAL_SET_PERCENT=$(( 100 - $TRAIN_SET_PERCENT - $TEST_SET_PERCENT ))
TRAIN_DATA=TRAIN_$FILE
TEST_DATA=TEST_$FILE
VAL_DATA=VAL_$FILE
> $TRAIN_DATA
> $TEST_DATA
> $VAL_DATA
for state in `cut -d',' -f1 $FILE | sort | uniq`
do
NUM_STATE_DATA=`grep "$state" $FILE | wc -l`
echo "$state: $NUM_STATE_DATA"
TRAIN_NUM_DATA=$(( $NUM_STATE_DATA * $TRAIN_SET_PERCENT / 100 ))
TEST_NUM_DATA=$(( $NUM_STATE_DATA * $TEST_SET_PERCENT / 100 ))
VAL_NUM_DATA=$(( $NUM_STATE_DATA - $TRAIN_NUM_DATA - $TEST_NUM_DATA ))
if [ $GEN == "SEQ" ]
then
echo "Sequential"
STATE_DATA=`grep $state $FILE`
elif [ $GEN == "RAN" ]
then
echo "Randomized"
STATE_DATA=`grep $state $FILE | shuf`
else
echo "Unknown data gen type: " $GEN
exit 1
fi
# Train set
per=$TRAIN_SET_PERCENT
num=$TRAIN_NUM_DATA; from=1; to=$(($from + $num - 1));
echo Train set: $per% $num from=$from to=$to
echo "$STATE_DATA" | head -$to >> $TRAIN_DATA
# Test set
per=$TEST_SET_PERCENT
num=$TEST_NUM_DATA; from=$(($to + 1)); to=$(($from + $num - 1));
echo Test set: $per% $num from=$from to=$to
echo "$STATE_DATA" | head -$to | tail -$num >> $TEST_DATA
# Validate set
per=$VAL_SET_PERCENT
num=$VAL_NUM_DATA; from=$(($to + 1)); to=$NUM_STATE_DATA;
echo Validate set: $per% $num from=$from to=$to
echo "$STATE_DATA" | tail -$num >> $VAL_DATA
echo
done
This script generates data sets that can be sequential or randomized by setting GEN=SEQ or GEN=RAN in the script; the randomized option shuffles the data with shuf.
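To see what shuf does, try it on a trivial input (the order of the output lines is random and will differ on each run):
$ seq 5 | shuf
3
1
5
2
4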
Run the script:
$ ./data_sets.sh
AK: 3652
Sequential
Train set: 70% 2556 from=1 to=2556
Test set: 20% 730 from=2557 to=3286
Validate set: 10% 366 from=3287 to=3652
...
WY: 3352
Sequential
Train set: 70% 2346 from=1 to=2346
Test set: 20% 670 from=2347 to=3016
Validate set: 10% 336 from=3017 to=3352
It creates these data files:
$ ls *_TIMESLICE_TAVG_US_2010-2019.csv
TEST_TIMESLICE_TAVG_US_2010-2019.csv VAL_TIMESLICE_TAVG_US_2010-2019.csv
TRAIN_TIMESLICE_TAVG_US_2010-2019.csv
Data processing with shell
When you need to process a humongous amount of data for your next machine learning project, think about shell commands and scripts. The shell is proven and ready to use, with friendly communities to guide and help you.
This article introduced the shell for data processing, and the scripts demonstrate its opportunities. Much more is possible. Do you want to take it forward? Tell us in the comments below.