10 command-line tools for data analysis in Linux

Why load everything into a spreadsheet when the terminal can be faster, more powerful, and more easily scriptable?
739 readers like this.
Open data brain

Opensource.com

So you've landed on some data you want to analyze. Where do you begin?

Many people used to working in a graphical environment might default to using a spreadsheet tool, but there's another way that might prove to be faster and more efficient, with just a little more effort. And you don't need to become an expert in a statistical modeling language or a big data toolset to take advantage of these tools.

You can learn a lot about a dataset without ever leaving your terminal.

I'm talking about the Linux command line. Just using some tools that you've probably already got installed on your computer, you can learn a lot about a dataset without ever leaving your terminal. Long-time Linux users will of course laugh—they've been using many of these tools for years to parse logs and understand configuration tools. But for the Linux newcomer, the revelation that you've got a whole data analysis toolkit already at your fingertips can be a welcomed surprise.

Most of these tools aren't strictly speaking limited to Linux, either. Most hearken back to the days of Unix, and users of other Unix-like operating systems likely have them installed already or can do so with ease. Many are a part of the GNU Coreutils package, while a few are individually maintained, and with some work, you can even use them on Windows.

 

 

So let's try out a few of the many simple open source tools for data analysis and see how they work! If you'd like to follow along with these examples, go ahead and download this sample data file, from GitHub, which is a CSV (comma separated value) list of articles we published to Opensource.com in January.


head and tail

First, let's get started by getting a handle on the file. What's in it? What does its format look like? You can use the cat command to display a file in the terminal, but that's not going to do us much good if you're working with files more than a few dozen lines.

Enter head and tail. Both are utilities for showing a specified number of lines from the top or bottom of the file. If you don't specify the number of lines you want to see, you'll get 10. Let's try it with our file.

$ tail -n 3 jan2017articles.csv 
02 Jan 2017,Article,Scott Nesbitt,3 tips for effectively using wikis for documentation,1,/article/17/1/tips-using-wiki-documentation,"Documentation, Wiki",710
02 Jan 2017,Article,Jen Wike Huger,The Opensource.com preview for January,0,/article/17/1/editorial-preview-january,,358
02 Jan 2017,Poll,Jason Baker,What is your open source New Year's resolution?,1,/poll/17/1/what-your-open-source-new-years-resolution,,186

Looking at those last three lines, I can pick out a date, author name, title, and a few other pieces of information immediately. But I don't know what every column is. Let's look at the top of the file and see if it has headers to explain what each column means:

$ head -n 1 jan2017articles.csv 
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count

Okay, that all makes sense now. Looks like we've got a list of articles with the date they were published, the type of content for each one, the author, title, number of comments, relative URL, the tags each article has, and the word count.

wc

That's great, but how big is this file? Are we talking about dozens of articles we want to analyze, or hundreds, or even thousands? The wc command can help with that. Short for "word count," wc can count the number of bytes, characters, words, or lines in the file. In our case, we want to know the number of lines.

$ wc -l jan2017articles.csv 
93 jan2017articles.csv

And, there it is. 93 lines in this file; since we know the first row contained headers, we can surmise that this is a list of 92 articles.

grep

Okay, now let's ask ourselves: Out of these 92 articles, how many of them were about a security topic? For our purposes, let's say we're interested in articles that mention security anywhere in the entry, whether in the title, the list of tags, or somewhere else. The grep tool can help us with that. With grep, you can search a file or other input for a particular pattern of characters. grep is an incredibly powerful tool, thanks to the regular expressions you can build to match very precise patterns. But for now, let's just search for a simple string.

$ grep -i "security" jan2017articles.csv
30 Jan 2017,Article,Tiberius Hefflin,4 ways to improve your security online right now,3,/article/17/1/4-ways-improve-your-online-security,Security and encryption,1242
28 Jan 2017,Article,Subhashish Panigrahi,How communities in India support privacy and software freedom,0,/article/17/1/how-communities-india-support-privacy-software-freedom,Security and encryption,453
27 Jan 2017,Article,Alan Smithee,Data Privacy Day 2017: Solutions for everyday privacy,5,/article/17/1/every-day-privacy,"Big data, Security and encryption",1424
04 Jan 2017,Article,Daniel J Walsh,50 ways to avoid getting hacked in 2017,14,/article/17/1/yearbook-50-ways-avoid-getting-hacked,"Yearbook, 2016 Open Source Yearbook, Security and encryption, Containers, Docker, Linux",2143

The format we used was grep, followed by the -i flag (which tells grep not to be case sensitive), followed by the pattern we wanted to search for, and then the file in which we were searching. It looks like we had four security-related articles last month. But let's imagine we got a much longer list than we could easily count. Using a pipe, we could combine grep with the wc command we just learned about above, to get a count of the total lines mentioning security.

$ grep -i "security" jan2017articles.csv | wc -l
4

In this case, wc took the output of our grep command, and used it as its input, without ever having to worry about saving it anywhere first. This is why piping input and output, in particular when combined with a little shell scripting, makes the terminal such a powerful tool for data analysis.

tr

A CSV file is a pretty helpful format for many analysis scenarios, but what if you need to convert the file to a different format for use in a different application? Maybe you need tab separators instead of commas, or maybe you want to change them to some HTML so that you can use the data output in a table. The tr command can help you with that, by translating from one type of character to another. Like the other examples, you can also pipe input and output to this command.

Let's try another multi-part example, by creating a TSV (tab separated values) file with just the articles that published on January 20.

$ grep "20 Jan 2017" jan2017articles.csv | tr ',' '\t' > jan20only.tsv

What's going on here? First, we searched for the date in question, using grep. We piped this output to the tr command, which we used to replace the commas with tabs (denoted with '\t'). But where did it go? Well, the > character redirected the output to our new file instead of the screen. All of this work in one command sequence. We can then verify that the jan20only.tsv file contains the data that we expected.

$ cat jan20only.tsv 
20 Jan 2017     Article Kushal Das      5 ways to expand your project's contributor base        2       /article/17/1/expand-project-contributor-base    Getting started 690
20 Jan 2017     Article D Ruth Bavousett        How to write web apps in R with Shiny   2       /article/17/1/writing-new-web-apps-shiny Web development 218
20 Jan 2017     Article Jason Baker     "Top 5: Shell scripting  the Cinnamon Linux desktop environment  and more"       0       /article/17/1/top-5-january-20  Top 5   214
20 Jan 2017     Article Tracy Miranda   How is your community promoting diversity?      1       /article/17/1/take-action-diversity-tech Diversity and inclusion 1007

sort

What if we wanted to learn more details about a particular column? Which article in our new list of articles is the longest? Let's build on our last example. Now that we have a list of articles from just January 20, we can use the sort command to sort by the word count column. Of course, we don't strictly speaking need an intermediate file here; we could have piped the output of the last command instead. But sometimes it's simply easier to break long steps into smaller ones rather than creating gigantic chains of commands.

$ sort -nr -t$'\t' -k8 jan20only.tsv | head -n 1
20 Jan 2017     Article Tracy Miranda   How is your community promoting diversity?      1       /article/17/1/take-action-diversity-tech Diversity and inclusion 1007

This is another long example, so let's break down what's happening. FIrst, we're using the sort command to sort by the number of words. The -nr option tells sort to do a numeric sort, and to return the results in reverse order (largest to smallest). The next -t$'\t' tells sort that the delimiter is the tab ('\t'). (You can read why you need the dollar sign here; in short, it's to tell the shell that this is a string that needs processing to turn the \n into an actual tab). The -k8 portion of the command tells sort to use the eighth column, which is the column for word count in our example.

Finally, the whole output is piped to head with instructions just to show the top line, which is our result, the article from this file with the highest word count.

sed

You might want to select specific lines of a file. sed, short for stream editor, is one way to do this. What if you wanted to combine multiple files that all had headers? You would only want one set of headers to appear for the whole file, so you would need a way to scrape out the extras. Or what if you wanted to grab only a particular range of lines? sed is your tool. It's also a great way to do a bulk find and replace a file.

Let's create a new file with no headers from our list of articles, suitable for combining with other files (if, for example, I had a different file for every month and wanted to put them together).

$ sed '1 d' jan2017articles.csv > jan17no_headers.csv

The '1 d' option tells sed to delete the first line. sed is far more powerful than this, and I'd recommend reading up further on its replacement powers.

cut

What if, instead of wanting to remove a row, I wanted to remove a column? What if I wanted to pick out just one column? Let's create a new list of authors for our example we built above.

$ cut -d',' -f3 jan17no_headers.csv > authors.txt

In this simple example, we told cut with -d',' that this is a comma-delimited file, that we wanted the third column (-f3), and to send the output to a new file called authors.txt.

uniq

That last example left us with a list of authors, but, how many unique authors are on that list? How many articles did each author write? Enter uniq. With uniq, you can easily find out. Let's sort the file, find uniques, then outputs a file that has a count of the number of articles written by each author.

sort authors.txt | uniq -c > authors-sorted.txt

Glimpsing at the file, we can now see how many articles each author had. Let's just look at the last three lines to be sure it worked.

$ tail -n3 authors-sorted.txt
      1 Tracy Miranda
      1 Veer Muchandi
      3 VM (Vicky) Brasseur

awk

Let's look at one more tool in our command-line data analysis toolbelt today, awk. awk is another one of those tools that I'm going to give far too little credit to; it's really a powerhouse worth exploring on its own. It is another great tool for replacement, but also much more. Let's go back to the TSV file we made earlier of just the January 20 articles, and use that to create a new list of just the authors of those articles, along with the number of words each author wrote.

$ awk -F "\t" '{print $3 "  " $NF}' jan20only.tsv
Kushal Das  690
D Ruth Bavousett  218
Jason Baker  214
Tracy Miranda  1007

What's going on here? The -F "\t" we pass to awk simply tells it that we're working with tab-separated data. Within the braces, we're actually telling awk to execute just a little bit of code. We're telling it to print the third column with $3, and then the last column with $NF (the "number of fields"), and place two spaces between them to make it a little more legible.


So what? Can't we do all of this faster in a spreadsheet, or just by looking at the file in some cases? Sure we can! Now stop and imagine that instead of a 93 line file, we were working with a 93,000 or even one much larger. Can your spreadsheet utility load it without crashing or slowing down significantly? Or imagine instead of one file with one month's worth of articles, you had a different file for every month of the past seven years. Suddenly, a spreadsheet isn't the best option work processing your data, but you're not nearly in the territory yet where you need a true big data tool to work with your dataset.

You could choose to load the files into a database tool and work with the data there. But is that the right choice? It might be overkill. What if you're just examining the data to get a sense of what it contains? With these simple tools and a little scripting to recurse through a directory, you can work with large amounts of data with ease. Professionals and amateurs alike who work with data on a regular basis would do well to spend some time learning these and other command line data analysis tools.

This introduction only scratches the surface of each of these tools. They are far, far more powerful than these simple examples would let on, which is why volumes of books have been written about most. I hope you'll take the time to read a man page, do a little work in the search engine, or pick up a book and learn more about this interesting suite of tools that you have ready at your fingertips.

Jason Baker
Former Red Hatter. Now a consultant and aspiring entrepreneur. Map nerd, maker, and enthusiastic installer of open source desktop and self-hosted software.

11 Comments

I never get around to it, but I keep suggesting to myself that I need to make my own references for these commands. I will find a use for one of them, spend some time sorting out the syntax, translating the man pages to something that is human-readable, and do something pretty cool. Unfortunately, if I haven't used them in a while, I forget the details, and have to relearn.

Remembering the exact argument format and sequence is definitely a challenge with these tools; I do keep my own notes for exactly that purpose. For most of the tools (excepting sed and awk, for sure), rather than wading through the man pages, running them with --help will give you enough of an idea to eek out what you need.

In reply to by Greg P

The only one I really don't use is tr. I find it easier to convert my files to XML using sed or awk script which can be easily converted Spreadsheet XML 2003 using an XSLT with custom formatting.

Thanks for this.
One comment ... the line "sort authors.txt | uniq -c > authors.txt" I think empties out the authors.txt

Good catch; I changed the output file to have a different name. Thanks!

In reply to by Rob Kellington (not verified)

I use all of these tools. You can do crazy things with them. But in most cases it makes only sense for batch work. You forgot PERL which includes all your mentioned tools possibilities. There is even a PERL-modul to create Excel sheets. ;-)

A common pipe line we use is:
grep /tmp/file | cut -d: -f1,2 | sort | uniq -c

e.g count the hits by an IP per minute:
grep 24.333.222.111 /var/log/httpd/access_log | cut -d: -f2,3 | sort | uniq -c
70 23:45
74 23:46
76 23:47
36 23:48

perl gives you all the power of sed and awk and grep combined and extended. It works very well in a pipeline.

Jason, when you say "with just a little more effort", I think you must clarify that this means learning the tools because when you become familiar with these, that translates into a lot less effort. Another important note is that you could develop your own custom tools, at this point you will be a ninja. Thanks!

Great post :-) Another command line tool that's really handy is "Petit". It can hash a file which is very useful when looking through log files where a datestamp may change on a line and you want to group the events together. For example, if you're looking for all behavior on a url or an IP address, you can summarize the logs and get a clearer picture. I have found it useful when I see a script hit a page multiple times and want a list of IP's for that one URL.

Handy little tool :-)

Two command-line I use quite often for data analysis are jq and q:
- jq is a "Command-line JSON processor". Sometimes I get lengthy and ugly JSON responses from my curl commands. I'll pipe it into jq and strip out all the JSON parts I don't want, and it's pretty-printed too.
- q allows me to run SQL-like queries on CSV files. I can even do joins! Sure, it's only useful for people that know SQL queries, which isn't as many as I'd hope, but since I know how to do them, it comes in handy.

Great additions! I wanted to focus in this article on tools that are installed by default on most Linux systems, but I think there are many potential follow-ups on additional tools. Maybe you'd like to write about these tools?

In reply to by sriley

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.