A gawk script to convert smart quotes

A gawk script to convert smart quotes

Plus, get our awk cheat sheet.

Image by : 

opensource.com

x

Get the newsletter

Join the 85,000 open source advocates who receive our giveaway alerts and article roundups.

I manage a personal website and edit the web pages by hand. Since I don't have many pages on my site, this works well for me, letting me "scratch the itch" of getting into the site's code.

When I updated my website's design recently, I decided to turn all the plain quotes into "smart quotes," or quotes that look like those used in print material: “” instead of "".

Editing all of the quotes by hand would take too long, so I decided to automate the process of converting the quotes in all of my HTML files. But doing so via a script or program requires some intelligence. The script needs to know when to convert a plain quote to a smart quote, and which quote to use.

You can use different methods to convert quotes. Greg Pittman wrote a Python script for fixing smart quotes in text. I wrote mine in GNU awk (gawk).

Get our awk cheat sheet. Free download.

To start, I wrote a simple gawk function to evaluate a single character. If that character is a quote, the function determines if it should output a plain quote or a smart quote. The function looks at the previous character; if the previous character is a space, the function outputs a left smart quote. Otherwise, the function outputs a right smart quote. The script does the same for single quotes.

function smartquote (char, prevchar) {
        # print smart quotes depending on the previous character
        # otherwise just print the character as-is

        if (prevchar ~ /\s/) {
                # prev char is a space
                if (char == "'") {
                        printf("‘");
                }
                else if (char == "\"") {
                        printf("“");
                }
                else {
                        printf("%c", char);
                }
        }
        else {
                # prev char is not a space
                if (char == "'") {
                        printf("’");
                }
                else if (char == "\"") {
                        printf("”");
                }
                else {
                        printf("%c", char);
                }
        }
}

With that function, the body of the gawk script processes the HTML input file character by character. The script prints all text verbatim when inside an HTML tag (for example, <html lang="en">. Outside any HTML tags, the script uses the smartquote() function to print text. The smartquote() function does the work of evaluating when to print plain quotes or smart quotes.

function smartquote (char, prevchar) {
        ...
}

BEGIN {htmltag = 0}

{
        # for each line, scan one letter at a time:

        linelen = length($0);

        prev = "\n";

        for (i = 1; i <= linelen; i++) {
                char = substr($0, i, 1);

                if (char == "<") {
                        htmltag = 1;
                }

                if (htmltag == 1) {
                        printf("%c", char);
                }
                else {
                        smartquote(char, prev);
                        prev = char;
                }

                if (char == ">") {
                        htmltag = 0;
                }
        }

        # add trailing newline at end of each line
        printf ("\n");
}

Here's an example:

gawk -f quotes.awk test.html > test2.html

Sample input:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Test page</title>
  <link rel="stylesheet" type="text/css" href="/test.css" />
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width" />
</head>
<body>
  <h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
  <p>"Hi there!"</p>
  <p>It's and its.</p>
</body>
</html>

Sample output:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Test page</title>
  <link rel="stylesheet" type="text/css" href="/test.css" />
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width" />
</head>
<body>
  <h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
  <p>&ldquo;Hi there!&rdquo;</p>
  <p>It&rsquo;s and its.</p>
</body>
</html>

 

About the author

photo of Jim Hall
Jim Hall - Jim Hall is an open source software developer and advocate, probably best known as the founder and project coordinator for FreeDOS. Jim is also very active in the usability of open source software, as a mentor for usability testing in GNOME Outreachy, and as an occasional adjunct professor teaching a course on the Usability of Open Source Software. From 2016 to 2017, Jim served as a director on the GNOME Foundation Board of Directors. At work, Jim is Chief Information Officer in local...