A gawk script to convert smart quotes

Plus, get our awk cheat sheet.

Opensource.com

I manage a personal website and edit the web pages by hand. Since I don't have many pages on my site, this works well for me, letting me "scratch the itch" of getting into the site's code.

When I updated my website's design recently, I decided to turn all the plain quotes into "smart quotes," or quotes that look like those used in print material: “” instead of "".

Editing all of the quotes by hand would take too long, so I decided to automate the process of converting the quotes in all of my HTML files. But doing so via a script or program requires some intelligence. The script needs to know when to convert a plain quote to a smart quote, and which quote to use.

You can use different methods to convert quotes. Greg Pittman wrote a Python script for fixing smart quotes in text. I wrote mine in GNU awk (gawk).


To start, I wrote a simple gawk function to evaluate a single character. If that character is a quote, the function decides whether to output a left or a right smart quote by looking at the previous character: if the previous character is whitespace, the function outputs a left smart quote; otherwise, it outputs a right smart quote. The function handles single quotes the same way, and prints any other character as-is.

function smartquote (char, prevchar) {
	# print smart quotes (as HTML entities) depending on the previous character
	# otherwise just print the character as-is

	if (prevchar ~ /\s/) {
		# prev char is whitespace: use a left (opening) quote
		if (char == "'") {
			printf("&lsquo;");
		}
		else if (char == "\"") {
			printf("&ldquo;");
		}
		else {
			printf("%c", char);
		}
	}
	else {
		# prev char is not whitespace: use a right (closing) quote
		if (char == "'") {
			printf("&rsquo;");
		}
		else if (char == "\"") {
			printf("&rdquo;");
		}
		else {
			printf("%c", char);
		}
	}
}

With that function, the body of the gawk script processes the HTML input file character by character. The script prints all text verbatim when inside an HTML tag (for example, <html lang="en">). Because the htmltag flag is initialized once in BEGIN rather than on every line, a tag that wraps onto a second line is still printed verbatim. Outside any HTML tags, the script uses the smartquote() function to print text. The smartquote() function does the work of deciding whether to print a character as-is or as a left or right smart quote.

function smartquote (char, prevchar) {
	...
}

BEGIN {htmltag = 0}

{
	# for each line, scan one letter at a time:

	linelen = length($0);

	prev = "\n";

	for (i = 1; i <= linelen; i++) {
		char = substr($0, i, 1);

		if (char == "<") {
			htmltag = 1;
		}

		if (htmltag == 1) {
			printf("%c", char);
		}
		else {
			smartquote(char, prev);
			prev = char;
		}

		if (char == ">") {
			htmltag = 0;
		}
	}

	# add trailing newline at end of each line
	printf ("\n");
}

Here's an example:

gawk -f quotes.awk test.html > test2.html

Sample input:


<!DOCTYPE html>
<html lang="en">
<head>
  <title>Test page</title>
  <link rel="stylesheet" type="text/css" href="https://opensource.com/test.css" />
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width" />
</head>
<body>
  <h1><a href="https://opensource.com/"><img src="https://opensource.com/logo.png" alt="Website logo" /></a></h1>
  <p>"Hi there!"</p>
  <p>It's and its.</p>
</body>
</html>

Sample output:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Test page</title>
  <link rel="stylesheet" type="text/css" href="https://opensource.com/test.css" />
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width" />
</head>
<body>
  <h1><a href="https://opensource.com/"><img src="https://opensource.com/logo.png" alt="Website logo" /></a></h1>
  <p>&ldquo;Hi there!&rdquo;</p>
  <p>It&rsquo;s and its.</p>
</body>
</html>

Jim Hall is an open source software advocate and developer, best known for usability testing in GNOME and as the founder + project coordinator of FreeDOS.

3 Comments

I like the way you've protected the quotes inside tags.
I don't think it's possible to create a perfect script. There will always be problems, like 'twas, which should be a right single quote, but will probably be made a left single quote. I've noticed that word processors make this mistake, too.
A very tricky aspect of my Scribus script was designing for multiple languages.

Good point. My script doesn't handle leading apostrophes when the apostrophe was written with a single straight quote. Instead, my script will turn them into left single quotes (like most word processors).

I don't have many (any?) leading apostrophes on the websites I code by hand, so this isn't an issue for me. But worth noting for others using the script.

For example:
(from https://www.grammarbook.com/punctuation/apostro.asp)

When an apostrophe comes before a word or number, take care that it's truly an apostrophe (’) rather than a single quotation mark (‘).

Incorrect: ‘Twas the night before Christmas.
Correct: ’Twas the night before Christmas.

Incorrect: I voted in ‘08.
Correct: I voted in ’08.

I have to say I don't agree with grammarbook. The apostrophe in don't is typically treated as a right single quote, so to me 'twas should have a right single quote. There are proper uses of apostrophes for feet and inches, and also minutes and seconds (time and degrees), and these should not be curly quotes.

In reply to Jim Hall

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.