How to remove duplicate lines from files with awk

Learn how to use awk '!visited[$0]++' to remove duplicate lines without sorting them or changing their order.

Suppose you have a text file and you need to remove all of its duplicate lines.

TL;DR

To remove the duplicate lines while preserving their order in the file, use:

awk '!visited[$0]++' your_file > deduplicated_file

How it works

The script keeps an associative array with indices equal to the unique lines of the file and values equal to their occurrences. For each line of the file, if the line's occurrence count is zero, it increases the count by one and prints the line; otherwise, it just increases the count without printing the line.

I was not familiar with awk, and I wanted to understand how this can be accomplished with such a short script (awkward). I did my research, and here is what is going on:

  • The awk "script" !visited[$0]++ is executed for each line of the input file.
  • visited[] is a variable of type associative array (a.k.a. Map). We don't have to initialize it because awk will do it the first time we access it.
  • The $0 variable holds the contents of the line currently being processed.
  • visited[$0] accesses the value stored in the map with a key equal to $0 (the line being processed), a.k.a. the occurrences (which we increment below).
  • The ! negates the occurrences' value: in awk, zero and the empty string evaluate to false, and everything else to true, so !visited[$0] is true the first time a line is seen and false afterwards.
  • The ++ operation increases the variable's value (visited[$0]) by one.
    • If the value is empty, awk converts it to 0 (number) automatically and then it gets increased.
    • Note: The operation is executed after we access the variable's value (post-increment), so the negation always sees the value from before the increment.

Summing up, the whole expression evaluates to:

  • true if the occurrences are zero or the empty string
  • false if the occurrences are greater than zero
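The evaluation order can be made visible with a small awk experiment. This is just a sketch: the before and keep variables are illustrative names, not part of the original one-liner.

```shell
# For each input line, capture the value the ! operator sees (before)
# and the resulting keep/skip decision (keep). 'before' and 'keep' are
# illustrative names, not part of the original one-liner.
out=$(printf 'A\nA\nB\n' | awk '{ before = visited[$0] + 0; keep = !visited[$0]++; print $0, before, keep }')
echo "$out"
```

The first A and the B are seen with a count of 0 and kept; the second A is seen with a count of 1 and skipped.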

awk statements consist of a pattern-expression and an associated action.

<pattern/expression> { <action> }

If the pattern succeeds, then the associated action is executed. If we don't provide an action, awk, by default, prints the current input line.

An omitted action is equivalent to { print $0 }.
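A pattern with an omitted action is easy to see in isolation. For example, the hypothetical pattern NR % 2 == 1 (NR is awk's current record number) keeps only the odd-numbered lines:

```shell
# With no action given, awk defaults to { print $0 } for every line
# the pattern matches; here the pattern is true on odd-numbered lines.
odd=$(seq 5 | awk 'NR % 2 == 1')
echo "$odd"
```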

Our script consists of one awk statement with an expression, omitting the action. So this:

awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to this:

awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file

For every line of the file, if the expression succeeds, the line is printed to the output. Otherwise, the action is not executed, and nothing is printed.

Why not use the uniq command?

The uniq command removes only the adjacent duplicate lines. Here's a demonstration:

$ cat test.txt
A
A
A
B
B
B
A
A
C
C
C
B
B
A
$ uniq < test.txt
A
B
A
C
B
A

Other approaches

Using the sort command

We can also use the following sort command to remove the duplicate lines, but the line order is not preserved.

sort -u your_file > sorted_deduplicated_file
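For example, running it on the same contents as the test.txt file from the uniq demonstration (recreated here with printf) removes all the duplicates, but the result comes out sorted rather than in the original order:

```shell
# Same lines as the test.txt used in the uniq section above;
# sort -u removes all duplicates but sorts the output.
sorted=$(printf 'A\nA\nA\nB\nB\nB\nA\nA\nC\nC\nC\nB\nB\nA\n' | sort -u)
echo "$sorted"
```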

Using cat, sort, and cut

The previous approach would produce a de-duplicated file whose lines would be sorted based on the contents. Piping a bunch of commands can overcome this issue:

cat -n your_file | sort -uk2 | sort -nk1 | cut -f2-

How it works

Suppose we have the following file:

abc
ghi
abc
def
xyz
def
ghi
klm

cat -n test.txt prepends the line number to each line.

1       abc
2       ghi
3       abc
4       def
5       xyz
6       def
7       ghi
8       klm

sort -uk2 sorts the lines based on the second field onward (-k2 option) and keeps only the first occurrence of the lines with the same sort key (-u option).

1       abc
4       def
2       ghi
8       klm
5       xyz

sort -nk1 sorts the lines based on their first field (-k1 option), treating the field as a number (-n option), which restores the original line order.

1       abc
2       ghi
4       def
5       xyz
8       klm

Finally, cut -f2- prints each line starting from its second field until the end (-f2- option: note the - suffix, which instructs it to include the rest of the line).

abc
ghi
def
xyz
klm
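As a sanity check, the pipeline and the awk one-liner can be run side by side on the sample input above; both should yield the de-duplicated lines in their original order:

```shell
# Sample input from the walkthrough above
input='abc
ghi
abc
def
xyz
def
ghi
klm'

# De-duplicate with the awk one-liner and with the pipeline
via_awk=$(printf '%s\n' "$input" | awk '!visited[$0]++')
via_pipeline=$(printf '%s\n' "$input" | cat -n | sort -uk2 | sort -nk1 | cut -f2-)
echo "$via_awk"
```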

That's all.


This article originally appeared on the iridakos blog by Lazarus Lazaridis under a CC BY-NC 4.0 License and is republished with the author's permission.

I am a software developer. I have studied Computer Science at Athens University of Economics and Business, and I live in Athens, Greece. I usually code in Ruby, especially when it's on Rails, but I also speak Java, Go, bash, and C#. I love open source, and I like writing tutorials and creating tools and utilities.

1 Comment

I never liked awk, but based on my past 2-3 years of experience I have to admit that everyone who does complex text processing has to know it, because it is often simpler than a long bash pipeline of other commands.

However, there is an obvious bottleneck, which is speed. This is especially true for large files.
In my experience, I have often found the (long) pipelines to be faster, but sometimes the awk one-liners took the lead.

Here I just wanted to see whether the awk one-liner above or the pipeline approach is faster.

I expected the pipeline to run faster, but it really isn't:

# creating 100 million random integers from 0 to 999999
cat /dev/urandom | tr -dc '0-9' | fold -w 6 | head -n100000000 >random.txt

# awk's speed: 42.99s user 0.36s system 99% cpu 43.355 total
time (awk '!visited[$0]++' random.txt >/dev/null)

# pipeline's speed: 225.02s user 3.17s system 100% cpu 3:46.86 total
time (cat -n random.txt | sort -uk2 | sort -nk1 | cut -f2- >/dev/null)

Conclusion: the awk one-liner was ~5.2 times faster than the pipeline, which is clearly not marginal.

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.