A gawk script to convert smart quotes

A gawk script to convert smart quotes

Plus, get our awk cheat sheet.

Open source prescription.
Image by : 
Opensource.com
x

Subscribe now

Get the highlights in your inbox every week.

I manage a personal website and edit the web pages by hand. Since I don't have many pages on my site, this works well for me, letting me "scratch the itch" of getting into the site's code.

When I updated my website's design recently, I decided to turn all the plain quotes into "smart quotes," or quotes that look like those used in print material: “” instead of "".

Editing all of the quotes by hand would take too long, so I decided to automate the process of converting the quotes in all of my HTML files. But doing so via a script or program requires some intelligence. The script needs to know when to convert a plain quote to a smart quote, and which quote to use.

You can use different methods to convert quotes. Greg Pittman wrote a Python script for fixing smart quotes in text. I wrote mine in GNU awk (gawk).

Get our awk cheat sheet. Free download.

To start, I wrote a simple gawk function to evaluate a single character. If that character is a quote, the function determines if it should output a plain quote or a smart quote. The function looks at the previous character; if the previous character is a space, the function outputs a left smart quote. Otherwise, the function outputs a right smart quote. The script does the same for single quotes.

function smartquote (char, prevchar) {
        # print smart quotes depending on the previous character
        # otherwise just print the character as-is

        if (prevchar ~ /\s/) {
                # prev char is a space
                if (char == "'") {
                        printf("‘");
                }
                else if (char == "\"") {
                        printf("“");
                }
                else {
                        printf("%c", char);
                }
        }
        else {
                # prev char is not a space
                if (char == "'") {
                        printf("’");
                }
                else if (char == "\"") {
                        printf("”");
                }
                else {
                        printf("%c", char);
                }
        }
}

With that function, the body of the gawk script processes the HTML input file character by character. The script prints all text verbatim when inside an HTML tag (for example, <html lang="en">. Outside any HTML tags, the script uses the smartquote() function to print text. The smartquote() function does the work of evaluating when to print plain quotes or smart quotes.

function smartquote (char, prevchar) {
        ...
}

BEGIN {htmltag = 0}

{
        # for each line, scan one letter at a time:

        linelen = length($0);

        prev = "\n";

        for (i = 1; i <= linelen; i++) {
                char = substr($0, i, 1);

                if (char == "<") {
                        htmltag = 1;
                }

                if (htmltag == 1) {
                        printf("%c", char);
                }
                else {
                        smartquote(char, prev);
                        prev = char;
                }

                if (char == ">") {
                        htmltag = 0;
                }
        }

        # add trailing newline at end of each line
        printf ("\n");
}

Here's an example:

gawk -f quotes.awk test.html > test2.html

Sample input:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Test page</title>
  <link rel="stylesheet" type="text/css" href="/test.css" />
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width" />
</head>
<body>
  <h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
  <p>"Hi there!"</p>
  <p>It's and its.</p>
</body>
</html>

Sample output:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Test page</title>
  <link rel="stylesheet" type="text/css" href="/test.css" />
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width" />
</head>
<body>
  <h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
  <p>&ldquo;Hi there!&rdquo;</p>
  <p>It&rsquo;s and its.</p>
</body>
</html>

 

About the author

photo of Jim Hall
Jim Hall - Jim Hall is an open source software advocate and developer, probably best known as the founder of FreeDOS. Jim is also very active in usability testing for open source software projects like GNOME. At work, Jim is CEO of IT Mentor Group, an IT executive consulting company that helps CIOs and IT Leaders with strategic planning and organizational development.