How to use regular expressions in awk

In awk, regular expressions (regex) allow for dynamic and complex pattern definitions. You're not limited to searching for simple strings; you can also match patterns within patterns.

The syntax for using regular expressions to match lines in awk is:

word ~ /match/

The inverse of that is not matching a pattern:

word !~ /match/
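As a quick sketch of the two operators side by side, the snippet below runs both against the same input. The three-word fruit list is made-up inline data (not the sample file), so the snippet runs on its own:

```shell
# Hypothetical inline data: three one-word records.
fruits=$(printf 'apple\nbanana\ngrape\n')

# ~ keeps records whose first field matches the pattern.
matched=$(printf '%s\n' "$fruits" | awk '$1 ~ /an/ {print $1}')

# !~ keeps records whose first field does not match it.
unmatched=$(printf '%s\n' "$fruits" | awk '$1 !~ /an/ {print $1}')

printf 'match:\n%s\n' "$matched"
printf 'no match:\n%s\n' "$unmatched"
```

Only banana contains the letters an, so it is the single match; apple and grape are what the inverse operator returns.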

If you haven't already, create the sample file from our previous article:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
raspberry  red    99
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

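Save the file as colours.txt and run the following command. The -e option used throughout this article belongs to GNU awk; other awk implementations may not accept it, in which case you can leave it out: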

$ awk -e '$1 ~ /p[el]/ {print $0}' colours.txt
apple      red    4
grape      purple 10
apple      green  8
plum       purple 2
pineapple  yellow 5

You have selected all records whose first field contains the letter p followed by either an e or an l.

Adding an o inside the square brackets creates a new pattern to match:

$ awk -e '$1 ~ /p[elo]/ {print $0}' colours.txt
apple      red    4
grape      purple 10
apple      green  8
plum       purple 2
potato     brown  9
pineapple  yellow 5

Regular expression basics

Certain characters have special meanings when they're used in regular expressions.

Anchors

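Anchor	Function
^	Indicates the beginning of the string being tested (a whole record or a single field)
$	Indicates the end of the string being tested
\`	Denotes the beginning of a string (GNU awk only)
\'	Denotes the end of a string (GNU awk only)
\y	Marks a word boundary (GNU awk only; in awk, \b means a backspace character)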

For example, this awk command prints any record whose first field contains an r character:

$ awk -e '$1 ~ /r/ {print $0}' colours.txt
strawberry red    3
raspberry  red    99
grape      purple 10

Add a ^ symbol to select only records where r occurs at the beginning of the first field:

$ awk -e '$1 ~ /^r/ {print $0}' colours.txt
raspberry  red    99
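The $ anchor works the same way at the other end of the string. A minimal sketch, using an inline copy of three rows from the sample file so that it runs without colours.txt:

```shell
# Inline copy of three sample rows (name, color, amount).
rows=$(printf 'strawberry red 3\nraspberry red 99\ngrape purple 10\n')

# $ anchors the match to the end of the first field,
# so only names ending in y are selected.
ends_in_y=$(printf '%s\n' "$rows" | awk '$1 ~ /y$/ {print $1}')

printf '%s\n' "$ends_in_y"
```

This prints strawberry and raspberry; grape is skipped because its last letter is e.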

Characters

Character	Function
[ad]	Selects a or d
[a-d]	Selects any character a through d (a, b, c, or d)
[^a-d]	Selects any character except a through d (e, f, g, h…)
\w	Selects any word character: a letter, digit, or underscore (GNU awk only)
\s	Selects any whitespace character (GNU awk only)
[0-9]	Selects any digit (awk does not support the \d shorthand)

The capital versions of w and s are negations; for example, \S selects any character that is not whitespace. To exclude digits, use [^0-9].

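POSIX regex offers easy mnemonics for character classes. Each one must appear inside a bracket expression, so a pattern that matches a single digit is written [[:digit:]]: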

POSIX mnemonic	Function
[:alnum:]	Alphanumeric characters
[:alpha:]	Alphabetic characters
[:space:]	Space characters (such as space, tab, and formfeed)
[:blank:]	Space and tab characters
[:upper:]	Uppercase alphabetic characters
[:lower:]	Lowercase alphabetic characters
[:digit:]	Numeric characters
[:xdigit:]	Characters that are hexadecimal digits
[:punct:]	Punctuation characters (i.e., characters that are not letters, digits, control characters, or space characters)
[:cntrl:]	Control characters
[:graph:]	Characters that are both printable and visible (e.g., a space is printable but not visible, whereas an a is both)
[:print:]	Printable characters (i.e., characters that are not control characters)
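Here is a small sketch of a POSIX class in use. It assumes an awk that supports POSIX character classes (GNU awk does; very old implementations may not), and it uses three inline rows in place of the sample file:

```shell
# Inline copy of three sample rows (name, color, amount).
rows=$(printf 'apple red 4\nraspberry red 99\ngrape purple 10\n')

# The class name goes inside a bracket expression, hence the
# doubled brackets. Two of them between ^ and $ means the third
# field must be exactly two digits long.
two_digit=$(printf '%s\n' "$rows" |
  awk '$3 ~ /^[[:digit:]][[:digit:]]$/ {print $1, $3}')

printf '%s\n' "$two_digit"
```

The apple row is dropped because its amount, 4, has only one digit.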

Quantifiers

Quantifier	Function
.	Matches any character
+	Modifies the preceding set to mean one or more times
*	Modifies the preceding set to mean zero or more times
?	Modifies the preceding set to mean zero or one time
{n}	Modifies the preceding set to mean exactly n times
{n,}	Modifies the preceding set to mean n or more times
{n,m}	Modifies the preceding set to mean between n and m times

Quantifiers modify the character or set that precedes them. For example, . matches any single character exactly once, but .* matches any sequence of characters, including none at all. Here's an example; look at the regex patterns carefully:

$ printf "red\nrd\n"
red
rd
$ printf "red\nrd\n" | awk -e '$0 ~ /^r.d/ {print}'
red
$ printf "red\nrd\n" | awk -e '$0 ~ /^r.*d/ {print}'
red
rd
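The ? and + quantifiers can be compared in the same way. The following sketch uses three made-up spellings as inline input, anchored at both ends so that each pattern has to account for the whole word:

```shell
# Hypothetical inline data: three spellings, one per record.
words=$(printf 'color\ncolour\ncolouur\n')

# u? allows zero or one u.
optional_u=$(printf '%s\n' "$words" | awk '$0 ~ /^colou?r$/')

# u+ requires one or more u characters.
repeated_u=$(printf '%s\n' "$words" | awk '$0 ~ /^colou+r$/')

printf 'u?:\n%s\n' "$optional_u"
printf 'u+:\n%s\n' "$repeated_u"
```

The first pattern selects color and colour; the second selects colour and colouur.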

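Similarly, numbers in braces specify the number of times something occurs. To find records in which the second field contains two e characters in a row (because the pattern is not anchored, a longer run of e characters would match as well):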

$ awk -e '$2 ~ /e{2}/ {print $0}' colours.txt
apple      green  8

Grouped matches

Pattern	Function
(red)	Parentheses group the enclosed letters, which must appear contiguously and in that order
|	Means or, usually between the alternatives of a grouped match, as in (red|green)

For instance, the pattern (red) matches the words red and ordered but not any word that contains all three of those letters in another order (such as the word order).
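In awk's regular expressions, alternation inside a group is written with a plain | character. A minimal sketch, using four inline rows in place of the sample file:

```shell
# Inline copy of four sample rows (name, color, amount).
rows=$(printf 'apple red 4\nbanana yellow 6\napple green 8\nplum purple 2\n')

# (red|green) matches either word; ^ and $ keep the pattern from
# matching longer strings that merely contain red or green.
red_or_green=$(printf '%s\n' "$rows" |
  awk '$2 ~ /^(red|green)$/ {print $1, $2}')

printf '%s\n' "$red_or_green"
```

Both apple rows are selected, while the yellow and purple rows are not.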

Awk like sed with sub() and gsub()

Awk features several functions that perform find-and-replace actions, much like the Unix command sed. You can call them in awk rules to replace matched text with a new string, which can be given as a literal string or as a variable.

The sub function substitutes the first match in its target with a replacement string. The target is the whole record by default, or the field or variable given as an optional third argument. For example, if you have this rule in an awk script:

{ sub(/apple/, "nut", $1);
    print $1 }

running it on the example file colours.txt produces this output:

name
nut
banana
strawberry
raspberry
grape
nut
plum
kiwi
potato
pinenut

Both apple and pineapple were changed (to nut and pinenut) because, in each case, apple is the first match in the field. If the records were different, then the results could differ:

$ printf "apple apple\npineapple apple\n" | \
awk -e 'sub(/apple/, "nut")'
nut apple
pinenut apple

The gsub function substitutes all matching items:

$ printf "apple apple\npineapple apple\n" | \
awk -e 'gsub(/apple/, "nut")'
nut nut
pinenut nut
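Both sub() and gsub() also return the number of substitutions they made, which is why they can stand alone as a pattern in the commands above: a nonzero result counts as true, so the modified record is printed. This sketch stores the count in a variable (n is just an illustrative name) and prints it in front of each record:

```shell
# The first two records are the ones used above; the banana
# record is added to show a count of zero.
counts=$(printf 'apple apple\npineapple apple\nbanana\n' |
  awk '{ n = gsub(/apple/, "nut"); print n, $0 }')

printf '%s\n' "$counts"
```

The first two records each report 2 replacements, and the banana record reports 0 and is left unchanged.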

Gensub

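An even more complex version of these functions, called gensub(), is also available in GNU awk.

Like sub() and gsub(), the gensub function allows you to use the & character to recall the matched text. Unlike them, it returns the modified string instead of changing the original, which is why the examples here wrap it in print. For example, if you have a file with the word Awk and you want to change it to GNU Awk, you could use this rule: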

{ print gensub(/(Awk)/, "GNU &", 1) }

This searches for the group of characters Awk and stores it in memory, represented by the special character &. Then it substitutes the match with GNU &, meaning GNU Awk. The 1 at the end tells gensub() to replace only the first occurrence.

$ printf "Awk\nAwk is not Awkward" \
| awk -e ' { print gensub(/(Awk)/, "GNU &",1) }'
GNU Awk
GNU Awk is not Awkward

There's a time and a place

Awk is a powerful tool, and regex are complex. You might think awk is so very powerful that it could easily replace grep and sed and tr and sort and many more, and in a sense, you'd be right. However, awk is just one tool in a toolbox that's overflowing with great options. You have a choice about what you use and when you use it, so don't feel that you have to use one tool for every job great and small.

With that said, awk really is a powerful tool with lots of great functions. The more you use it, the better you get to know it. Remember its capabilities, and fall back on it occasionally so you can get comfortable with it.

Our next article will cover looping in Awk, so come back soon!

This article is adapted from an episode of Hacker Public Radio, a community technology podcast.