Getting started with awk, a powerful text-parsing tool

Let's jump in and start using it.


Awk is a powerful text-parsing tool for Unix and Unix-like systems, but because it has programmed functions that you can use to perform common parsing tasks, it's also considered a programming language. You probably won't be developing your next GUI application with awk, and it likely won't take the place of your default scripting language, but it's a powerful utility for specific tasks.

What those tasks may be is surprisingly diverse. The best way to discover which of your problems might be best solved by awk is to learn awk; you'll be surprised at how awk can help you get more done but with a lot less effort.

Awk's basic syntax is:

awk [options] 'pattern {action}' file

To get started, create this sample file and save it as colours.txt:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

This data is separated into columns by one or more spaces. It's common for data that you are analyzing to be organized in some way. It may not always be columns separated by whitespace, or even a comma or semicolon, but especially in log files or data dumps, there's generally a predictable pattern. You can use patterns of data to help awk extract and process the data that you want to focus on.

Printing a column

In awk, the print statement displays whatever you specify. There are many predefined variables you can use, but some of the most common are the field variables: a dollar sign followed by an integer that designates a column in a text file. Try it out:

$ awk '{print $2;}' colours.txt
color
red
yellow
red
purple
green
purple
brown
brown
yellow

In this case, awk displays the second column, denoted by $2. This is relatively intuitive, so you can probably guess that print $1 displays the first column, and print $3 displays the third, and so on.

To display the whole line, with all of its columns, use $0.

The number after the dollar sign ($) is an expression, so $2 and $(1+1) mean the same thing.
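
Because the field number is an expression, it can also come from a variable. Here is a minimal sketch, assuming the built-in variable NF (the number of fields on the current line, which this article has not introduced) and three rows of the sample data supplied on standard input rather than read from colours.txt:

```shell
# Three rows of the sample data, supplied inline so the snippet runs on its own.
sample='name       color  amount
apple      red    4
banana     yellow 6'

# $(1+1) is evaluated first, so it selects the same column as $2.
second=$(printf '%s\n' "$sample" | awk '{print $(1+1)}')

# NF holds the number of fields on the line, so $NF is the last column.
last=$(printf '%s\n' "$sample" | awk '{print $NF}')

printf '%s\n' "$second" "$last"
```

The first command prints the color column and the second prints the amount column, whatever the number of columns happens to be.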

Conditionally selecting columns

The example file you're using is very structured. It has a row that serves as a header, and the columns relate directly to one another. By defining conditional requirements, you can qualify what you want awk to return when looking at this data. For instance, to view items in column 2 that match "yellow" and print the contents of column 1:

$ awk '$2=="yellow"{print $1}' colours.txt
banana
pineapple
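
The action is optional: a pattern on its own prints every matching line in full, and conditions can be combined with && or ||. A short sketch of both, using a few inline rows as a stand-in for colours.txt (the variable names are only for illustration):

```shell
# A few rows of the sample data, supplied inline so the snippet runs on its own.
sample='name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
pineapple  yellow 5'

# Pattern with no action: awk prints each matching line unchanged.
yellow_rows=$(printf '%s\n' "$sample" | awk '$2=="yellow"')

# Two conditions joined with &&: red items with an amount above 3.
red_big=$(printf '%s\n' "$sample" | awk '$2=="red" && $3>3 {print $1}')

printf '%s\n' "$yellow_rows" "$red_big"
```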

Regular expressions work as well. This condition matches rows whose second column contains the letter p, followed by one or more characters of any kind, followed by another p:

$ awk '$2 ~ /p.+p/ {print $0}' colours.txt
grape      purple 10
plum       purple 2
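
The ~ operator matches anywhere in the field unless the expression is anchored. A small sketch, assuming the standard ^ (start) and $ (end) regular expression anchors, which this article does not cover, and a few inline rows of the sample data:

```shell
# A few rows of the sample data, supplied inline so the snippet runs on its own.
sample='grape      purple 10
plum       purple 2
potato     brown  9
pineapple  yellow 5
apple      green  8'

# ^p matches only names that begin with p.
starts_with_p=$(printf '%s\n' "$sample" | awk '$1 ~ /^p/ {print $1}')

# e$ matches only names that end with e.
ends_with_e=$(printf '%s\n' "$sample" | awk '$1 ~ /e$/ {print $1}')

printf '%s\n' "$starts_with_p" "$ends_with_e"
```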

Numbers are interpreted naturally by awk. For instance, to print any row with a third column containing an integer greater than 5:

$ awk '$3>5 {print $1, $2}' colours.txt
name color
banana yellow
grape purple
apple green
potato brown
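
The header row shows up in that output because "amount" is not a number, so awk falls back to comparing strings, and in typical locales the string "amount" sorts after "5". One way to leave the header out is sketched here, assuming the built-in variable NR (the current line number, not introduced in this article) and the sample data supplied inline:

```shell
# The full sample data, supplied inline so the snippet runs on its own.
sample='name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5'

# NR>1 skips line 1 (the header); the rest of the test is unchanged.
over_five=$(printf '%s\n' "$sample" | awk 'NR>1 && $3>5 {print $1, $2}')

printf '%s\n' "$over_five"
```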

Field separator

By default, awk uses whitespace as the field separator. Not all text files use whitespace to define fields, though. For example, create a file called colours.csv with this content:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

Awk can treat the data in exactly the same way, as long as you specify which character it should use as the field separator in your command. Use the -F option to define the delimiter (GNU awk also accepts the long form, --field-separator):

$ awk -F"," '$2=="yellow" {print $1}' colours.csv
banana
pineapple
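
The separator can also be set from inside the awk program. This sketch assumes the built-in variables FS and OFS (the input and output field separators) and a BEGIN block, which runs before any input is read; none of these are covered in this article, and the data is a shortened inline copy of colours.csv:

```shell
# A shortened copy of the CSV data, supplied inline so the snippet runs on its own.
csv='name,color,amount
apple,red,4
banana,yellow,6
pineapple,yellow,5'

# Setting the separator with -F on the command line...
with_flag=$(printf '%s\n' "$csv" | awk -F, '$2=="yellow" {print $1}')

# ...gives the same result as assigning FS in a BEGIN block.
with_begin=$(printf '%s\n' "$csv" | awk 'BEGIN {FS=","} $2=="yellow" {print $1}')

# OFS is what print places between its comma-separated arguments.
as_csv=$(printf '%s\n' "$csv" | awk 'BEGIN {FS=","; OFS=","} $2=="yellow" {print $1, $3}')

printf '%s\n' "$with_flag" "$with_begin" "$as_csv"
```

Setting OFS is useful when the output should stay in the same delimited format as the input.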

Saving output

Using output redirection, you can write your results to a file. For example:

$ awk -F, '$3>5 {print $1, $2}' colours.csv > output.txt

This creates a file with the contents of your awk query.

You can also split a file into multiple files grouped by column data. For example, if you want to split colours.txt into multiple files according to what color appears in each row, you can cause awk to redirect per query by including the redirection in your awk statement:

$ awk '{print > $2".txt"}' colours.txt

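This produces files named yellow.txt, red.txt, and so on. Because the header row is processed like any other row, it also produces a color.txt containing just the header.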
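
A slightly more defensive sketch of the same idea follows. It assumes a scratch directory created with mktemp, passes that path into awk as a variable named dir (a name chosen for this example), skips the header with NR>1 (NR is awk's built-in line counter, not covered here), and wraps the file name in parentheses, since some awk implementations parse an unparenthesized concatenation after > differently:

```shell
# Work in a scratch directory so no files land in the current one.
workdir=$(mktemp -d)
cat > "$workdir/colours.txt" <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# NR>1 skips the header; the parentheses make the target file name unambiguous.
awk -v dir="$workdir" 'NR>1 {print > (dir "/" $2 ".txt")}' "$workdir/colours.txt"

# List the results: colours.txt plus one file per color.
ls "$workdir"
```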

In the next article, you'll learn more about fields, records, and some powerful awk variables.


This article is adapted from an episode of Hacker Public Radio, a community technology podcast.

Seth Kenlon
Seth Kenlon is a UNIX geek, free culture advocate, independent multimedia artist, and D&D nerd. He has worked in the film and computing industry, often at the same time.
Dave Morriss is a retired IT Manager based in Edinburgh, Scotland. He worked in the UK higher education sector providing IT services to students and staff.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.