Convert documents with Pandoc like a pro | Opensource.com
Pandoc is a powerhouse for converting any file into the format you want. Check out our handy cheat sheet.
Has anyone ever sent you a document in a format that just isn't quite right for you? Maybe you don't have access to the application used to create the document, or maybe you don't need the document so much as you need what's in it, or maybe you just flat out don't like the format. There's no wrong reason for disliking a file format. If it's not your preferred format, whether you find it cumbersome to use or you just don't like how its metadata is organized, then that's enough of a reason for you to convert it. However, there's rarely a good reason to convert a document manually, and Pandoc is here to ensure you never have to.
If you're on Linux, you can install pandoc from your software repository.
On Fedora or CentOS or similar:
$ sudo dnf install pandoc
On Ubuntu, Elementary, Debian, or similar:
$ sudo apt install pandoc
Once you have it installed, you can verify with a simple version check:
$ pandoc --version
At its most basic, the pandoc command is among the easiest commands to use. You type pandoc into a terminal, provide it the file you want to convert, then type --output and a name for the output file you want. Pandoc can usually auto-detect both formats from their filename extensions and convert from one to the other.
Here's a simple example to convert from a .docx file to .odt:
$ pandoc ~/Documents/example.docx --output ~/Documents/example.odt
If you're not used to using a terminal, keep in mind that in most modern terminal applications, you can drag-and-drop a file from your desktop into the terminal to have it translated into a full path that your computer understands.
You can specify nearly any format you can think of:
$ pandoc ~/Documents/example.docx --output ~/public_html/example.html
That's right: Pandoc enables you to output many different formats from one single source format.
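If you regularly need several outputs from one source, a small shell function can loop over the target extensions for you. This is a minimal sketch: the function name convert_all is hypothetical (it is not part of Pandoc), and it assumes every extension you pass is one Pandoc can infer a format from.

```shell
#!/bin/sh
# convert_all is a hypothetical helper name, not a Pandoc feature.
# Usage: convert_all SOURCE EXT [EXT...]
convert_all() {
    src=$1
    shift
    base=${src%.*}    # strip the extension: example.docx -> example
    for ext in "$@"; do
        # Pandoc infers each output format from the target extension.
        pandoc "$src" --output "$base.$ext" || return 1
    done
}

# Example (not run here):
#   convert_all ~/Documents/example.docx odt html epub
```

The function stops at the first failed conversion, so a typo in an extension doesn't go unnoticed.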
Find your source format
It doesn't take long to realize that Pandoc is possibly more flexible than you are, or at least, it's more flexible than you care to be. Because it's just a piece of software, Pandoc doesn't care whether you've written your latest thesis paper in LaTeX, Docbook, Markdown, or even JSON (warning: don't write your thesis paper in JSON). It can process whatever you have handy and turn it into whatever format you need. As with so many open source projects, you have the freedom to choose which tool you like best.
If you know rudimentary HTML and want to write everything in that, then grab a good HTML editor and start writing. Pandoc will convert it to whatever your boss or client or professor needs. Or maybe you prefer Docbook, or LaTeX, CommonMark, Org mode, or just a plain old LibreOffice .odt. It doesn't matter to Pandoc. Find your favorite format, the one that lets you concentrate on getting your work done, and let Pandoc do the hard part.
It may not seem like it, but now you know all the basics of Pandoc. It's a straightforward command that converts from one document format to another. If that's all you need, you're finished with this article.
However, Pandoc is a big application with lots of options for every format it can process. If you're already a Pandoc user or you want to delve deeper into what Pandoc can do, you need to look at its command options.
From and to
The first options you need to know are the --from and --to flags. These explicitly tell Pandoc what format to process from and to, and you can use them when Pandoc's output doesn't match what you expected, or when you need to differentiate between formats that may share the same extension.
For example, CommonMark, Markdown, markdown_phpextra, markdown_strict, and gfm (GitHub-Flavored Markdown, which replaces the deprecated markdown_github) may all use either the .md or .txt extension. Both HTML and HTML5 use the .html extension, and EPUB versions 2 and 3 both use the .epub extension. Specifying exactly what format conversion you want ensures Pandoc provides you with the expected output:
$ pandoc --from docx --to commonmark example.docx --output example.md
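Explicit formats matter most when there are no filenames at all. When Pandoc reads from standard input and writes to standard output, it has no extensions to inspect and falls back to its defaults (Markdown in, HTML out) unless you say otherwise. The sketch below wraps that pattern in a function; the name convert_stream is hypothetical.

```shell
#!/bin/sh
# convert_stream is a hypothetical wrapper, shown for illustration only.
# Usage: convert_stream FROM_FORMAT TO_FORMAT < input > output
convert_stream() {
    # No input file and no --output: pandoc reads stdin and writes stdout,
    # so both formats are named explicitly.
    pandoc --from "$1" --to "$2"
}

# Example (not run here):
#   printf '# Title\n\nSome *emphasized* text.\n' | convert_stream commonmark html
#
# To see which format names your installed version accepts:
#   pandoc --list-input-formats
#   pandoc --list-output-formats
```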
Table of contents
It varies from format to format, but Pandoc doesn't always provide a table of contents. The --table-of-contents option, or --toc for short, ensures that a document with chapter breaks (or heading markers such as h2 in HTML, ## in Markdown, and so on) is prepended with a list of chapters.
If you have chapters with subsections and sections in those subsections, then you may use --toc-depth to set how many subheadings are listed under each chapter.
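As a minimal sketch, the two options can be combined like this. The function name build_with_toc and the default depth of 2 are assumptions made for illustration. The --standalone flag is included because, according to the Pandoc manual, --toc has no effect on formats such as HTML unless a complete standalone document is being produced.

```shell
#!/bin/sh
# build_with_toc is a hypothetical helper name.
# Usage: build_with_toc INPUT OUTPUT [DEPTH]
build_with_toc() {
    # DEPTH defaults to 2 when the third argument is omitted (an arbitrary
    # choice for this sketch).
    pandoc --standalone --toc --toc-depth="${3:-2}" "$1" --output "$2"
}

# Example (not run here): list chapters, sections, and subsections
#   build_with_toc book.md book.html 3
```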
Epub for eBooks
Epub, an open standard, is one of the most popular formats for eBooks. You can generate them from applications like LibreOffice, Calibre, Scribus, and many others, or you can just convert to Epub using Pandoc. If you know a little bit of CSS, you can easily style your Epub by providing a stylesheet when running Pandoc:
$ pandoc --css my.css foo.md --output foo.epub
Additionally, you can set your own metadata so that Epub readers know how to sort the book. To do this, create a simple XML file in any text editor:
<dc:title>Be a Pandoc Pro</dc:title>
Save the file as data.xml, and then use it as your metadata source when converting:
$ pandoc --css my.css \
--epub-cover-image cover_front.jpg \
--epub-metadata data.xml \
foo.md --output foo.epub
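If you build several books, you might generate the metadata file from a script instead of typing it by hand. The following is a sketch under stated assumptions: the function name write_epub_metadata is hypothetical, the author and language values are placeholders, and the values are written as-is, so characters such as & or < would need to be escaped as XML first.

```shell
#!/bin/sh
# write_epub_metadata is a hypothetical helper name.
# Usage: write_epub_metadata FILE TITLE AUTHOR LANGUAGE
write_epub_metadata() {
    # Writes three Dublin Core elements, one per line.
    # Values are not XML-escaped in this sketch.
    cat > "$1" <<EOF
<dc:title>$2</dc:title>
<dc:creator>$3</dc:creator>
<dc:language>$4</dc:language>
EOF
}

# Example (not run here):
#   write_epub_metadata data.xml "Be a Pandoc Pro" "A. Writer" "en-US"
#   pandoc --epub-metadata data.xml foo.md --output foo.epub
```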
Most POSIX systems have the ability to "print" to PDF. This makes generating PDFs easy, but sometimes it results in some quirks, like incorrect metadata. If you purchase independent and RPG eBooks, then you've surely come across an otherwise professional-quality PDF with an embedded title of "Word Document.docx" or a PDF with hyperlinks rendered in bright blue regardless of the document style (and they often aren't even active).
One way to control how your PDF renders is to use Pandoc. Pandoc produces PDFs through an external engine (LaTeX by default), so you need one installed. With Pandoc, you can use LaTeX commands in your source document to affect PDF output, and you can add your own metadata keys and values:
$ pandoc --metadata=title:"My Professional Report" foo.odt --output foo.pdf
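Link color, one of the quirks mentioned above, can be set from the command line as well. The sketch below assumes Pandoc 2.0 or later (for --pdf-engine) and Pandoc's default LaTeX template, which reads the colorlinks, linkcolor, and urlcolor variables. The function name make_pdf and the choice of black links are illustrative only.

```shell
#!/bin/sh
# make_pdf is a hypothetical helper name.
# Usage: make_pdf INPUT OUTPUT TITLE [ENGINE]
make_pdf() {
    # ENGINE defaults to pdflatex; it must be installed separately.
    # The --variable settings apply to Pandoc's default LaTeX template.
    pandoc "$1" \
        --metadata title="$3" \
        --pdf-engine="${4:-pdflatex}" \
        --variable colorlinks=true \
        --variable linkcolor=black \
        --variable urlcolor=black \
        --output "$2"
}

# Example (not run here):
#   make_pdf foo.odt foo.pdf "My Professional Report"
```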
Pandoc is a powerhouse for anyone who needs to convert document formats. Even when it fails to give you exactly what you want, it's almost always able to get you closer to what you need. Use open and standardized formats when writing content, and rest assured that Pandoc can convert to whatever else you need. The more you use Pandoc, the more you're sure to discover.
To help you along with your exploration, we've developed an updated Pandoc cheat sheet as a handy reference. The cheat sheet hardly covers everything Pandoc is capable of, but it provides some common commands in common contexts and provides a sense of the general workflow you can expect.