Replace smart quotes with the Linux sed command

Banish "smart" quotes with your favorite version of sed.

In typography, the two marks in a pair of quotation marks were traditionally oriented toward one another. They look like this:

“smart quotes”

As computers became popular in the mid-twentieth century, the orientation was often abandoned. Early character sets didn't have much room to spare, so it makes sense that two double quotes and two single quotes were reduced to just one of each in the ASCII specification. These days the common character set is Unicode, with plenty of space for lots of fancy quotation marks and apostrophes, but many people have become used to the minimalism of just one character for both opening and closing quotes. Besides that, computers actually see the different kinds of quotation marks and apostrophes as distinct characters. In other words, to a computer the right double quote is different from the left double quote or a straight quote.
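
To see the difference for yourself, here is a minimal sketch that dumps the bytes behind each kind of double quote. It assumes the standard od and tr utilities are available. The octal escapes are the UTF-8 encodings of U+201C and U+201D, and the helper name show_bytes is made up for this illustration.

```shell
#!/bin/sh
# show_bytes: print the hexadecimal bytes of whatever arrives on stdin.
show_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# Left double quote (U+201C), written as octal UTF-8 bytes for portability
printf '\342\200\234' | show_bytes   # prints e2809c
# Right double quote (U+201D)
printf '\342\200\235' | show_bytes   # prints e2809d
# Plain ASCII straight double quote
printf '"' | show_bytes              # prints 22
```

Each curly quote takes three bytes in UTF-8, while the straight quote is a single byte, so a search for one never matches the other.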

Replacing smart quotes with sed

Computers aren't typewriters. When you press a key on your keyboard, you're not pressing a lever with an ink stamp attached to it. You're just pressing a button that sends a signal to your computer, which the computer interprets as a request to display a specific predefined character. The request depends on your keyboard map. As a Dvorak typist, I've witnessed the confusion on people's faces when they discover that "asdf" on my keyboard produces "aoeu" on the screen. You may also have pressed special combinations of keys to produce characters, such as ™ or ß or ≠, that aren't even printed on your keyboard.

Each letter or character, whether it's printed on your keyboard or not, has a code. Character encoding can be expressed in different ways, but to a computer the Unicode code points U+2018 and U+2019 (written \u2018 and \u2019 in the script below) produce ‘ and ’, while U+201C and U+201D produce the “ and ” characters. Knowing these "secret" codes means you can replace them programmatically using a command like sed. Any version of sed will do, so you can use GNU sed or BSD sed or even Busybox sed.

Here's the simple shell script I use:

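#!/bin/sh
# GNU All-Permissive License
# Usage: sh ./fixquotes.sh FILE (FILE is edited in place)
SED=$(which sed)
# Curly single and double quotes, built from their Unicode code points.
# Note: this relies on an echo that understands -e and \u escapes, as bash's does.
SDQUO=$(echo -ne '\u2018\u2019')
RDQUO=$(echo -ne '\u201C\u201D')
# Replace either curly quote in each pair with its straight equivalent
$SED -i -e "s/[$SDQUO]/\'/g" -e "s/[$RDQUO]/\"/g" "${1}"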
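
The script above leans on echo -ne and \u escapes, which not every sh supports. As a hedged alternative, the sketch below builds the same four characters with printf and octal UTF-8 byte escapes, and works as a filter rather than editing in place. The function name fix_quotes and the output filename in the usage comment are assumptions made for this example, not part of the original script.

```shell
#!/bin/sh
# fix_quotes: read text on stdin and write it to stdout with curly quotes
# replaced by straight ones. Each character gets its own expression, so
# no bracket expression has to hold a multibyte character.
fix_quotes() {
    lsq=$(printf '\342\200\230')   # U+2018 left single quote
    rsq=$(printf '\342\200\231')   # U+2019 right single quote
    ldq=$(printf '\342\200\234')   # U+201C left double quote
    rdq=$(printf '\342\200\235')   # U+201D right double quote
    sed -e "s/$lsq/'/g" -e "s/$rsq/'/g" \
        -e "s/$ldq/\"/g" -e "s/$rdq/\"/g"
}

# Quick demonstration on a single line of input
printf '\342\200\230Single quote\342\200\231\n' | fix_quotes

# To filter a file into a new file instead of editing in place:
# fix_quotes < test.txt > test-fixed.txt
```

Writing to a new file leaves the original untouched, which makes it easy to compare the two before committing to the change.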

Save this script as fixquotes.sh and then create a separate test file containing smart quotes:

‘Single quote’
“Double quote”

Run the script, and then use the cat command to see the results:

$ sh ./fixquotes.sh test.txt
$ cat test.txt
'Single quote'
"Double quote"
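
On a longer file, reading the output by eye is not practical. One possible check, sketched here with a hypothetical helper named count_smart_quotes, is to count the lines that still contain any of the four curly quotes using grep with fixed-string patterns.

```shell
#!/bin/sh
# count_smart_quotes: print how many lines on stdin still contain a curly quote.
# grep exits non-zero when nothing matches, so "|| true" keeps the function
# itself from reporting failure when the count is 0.
count_smart_quotes() {
    grep -c -F \
        -e "$(printf '\342\200\230')" -e "$(printf '\342\200\231')" \
        -e "$(printf '\342\200\234')" -e "$(printf '\342\200\235')" || true
}

# Two lines of input, one of which still has curly double quotes
printf '\342\200\234Double quote\342\200\235\nplain\n' | count_smart_quotes   # prints 1
```

Running count_smart_quotes < test.txt after the conversion should print 0 if every curly quote was replaced.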

Install sed

If you’re using Linux, BSD, or macOS, then you already have GNU or BSD sed installed. These are two unique reimplementations of the original sed command, and for the script in this article they are functionally the same (that's not true for all scripts, though).
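
One place where GNU and BSD sed are known to differ is the argument to -i. A form both accept is a backup suffix attached directly to the option, as in -i.bak. The sketch below uses that form. The function name fix_quotes_backup and the .bak suffix are choices made for this example.

```shell
#!/bin/sh
# fix_quotes_backup FILE: straighten curly quotes in FILE, keeping the
# original as FILE.bak. The suffix is attached directly to -i, a form
# that both GNU sed and BSD sed accept.
fix_quotes_backup() {
    sed -i.bak \
        -e "s/$(printf '\342\200\230')/'/g" \
        -e "s/$(printf '\342\200\231')/'/g" \
        -e "s/$(printf '\342\200\234')/\"/g" \
        -e "s/$(printf '\342\200\235')/\"/g" \
        "$1"
}

# Usage: fix_quotes_backup test.txt
```

A side benefit is that the untouched original stays next to the edited file until you delete it.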

On Windows, you can install GNU sed with Chocolatey.

Seth Kenlon
Seth Kenlon is a UNIX geek, free culture advocate, independent multimedia artist, and D&D nerd. He has worked in the film and computing industry, often at the same time.

3 Comments

I prefer the term typographic quotes. In Scribus, we have the opposite issue - converting typewriter quotes to typographic.

Great, and where is the explanation?

`echo` explanation:

* `-n` - no new line
* `-e` - enable interpretation of backslash (`\`) escapes

How the hell could I know that \u2018 is the "left single quotation mark"? These codes can be found in the `gnome-characters` application or on websites like https://www.unicodepedia.com/unicode/general-punctuation/2018/left-single-quotation-mark/

So, finally, the variables are assigned these values: SDQUO="‘’"; RDQUO="“”"

`sed` explanation:

* `-i` - edit file "in place" (overwrite)
* `-e` - add the script (expression) that follows to the commands to run

Script 1: "s/[‘’]/\'/g"

Script 2: "s/[“”]/\"/g"

* `s/text_to_find/replace_text/g` - substitute (`s/`): find "text_to_find", replace it with "replace_text", and do it for every occurrence on the line (`/g`)
* text_to_find=`[‘’]` - a bracket expression that matches any one of the characters listed inside it
* replace_text=`'` and `"` - the characters to put in their place. In the scripts they are written as `\'` and `\"`. The script is inside a double-quoted shell string, so `"` must be preceded with `\` to keep it from ending the string. The `\` before `'` is not strictly needed there, but it does no harm.

Ugh... Explaining it is not easy... My explanation is probably not precise enough and needs to be... further explained :) Anyway, for a "beginner scripter" like me, it should be a little more understandable now ;)

This is a great deconstruction of the article. For more information on how to use sed, check out these articles, as well:

https://opensource.com/article/20/12/sed

https://opensource.com/article/21/3/sed-cheat-sheet

In reply to Danniello

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.