Electronic books, or eBooks, have been around for a long time, but convenient devices on which to read them are a relatively recent development. Between mobile phones, tablets, and dedicated eBook readers, chances are you have some device in your life that you can use to read an electronic book. That's great for increasing how much you read, but it raises the question of which open file formats exist for eBooks, and which ones are best.
Ebooks are great. I have been a user of them for far longer than there have been dedicated readers.
I appreciate that eBooks enable me to take several volumes of text with me everywhere I go without actually having to bear the weight of several reams of paper. Sure, I might not actually read the entire works of Jules Verne whilst waiting in line at the grocer's, but it comforts me to know that I have them just in case. I also appreciate the ability they give me to find my favorite quotes or passages, and the fact that readers who have limited sight can enlarge the text so they can read it, or that fully blind readers can have their computer read the text to them. I appreciate that I can take notes at a conference in reStructuredText and convert my notes with Pandoc to a fully hyperlinked eBook to review on my plane ride home.
First of all, let's establish that we'll look only at open file formats in this article, and why. The most obvious advantage to an open format for eBooks is that an open format can be converted to any other format, meaning that your book can be read on any device. Technology defeats itself when you get a book in one format and have a device or operating system (OS) that reads every format but that one. This does not happen when the baseline is an open format.
Open formats also ensure that anyone can create an eBook. Pen and paper are pretty universal, so if the format for eBooks is closed, then I may as well just learn shorthand.
And finally, I am only interested in fully open formats; there are those that are "open enough" to convert from, but not so open that any OS can create them, or vice versa. That doesn't work for me, especially as a PowerPC Linux user where x86 binaries to create a "sort of open" eBook are not of any use.
With that preamble out of the way, let's look at what formats are out there.
No beating around the bush: the EPUB format is the best thing to happen to eBooks since eBooks. It's simple, it's lightweight, cross-platform, and versatile. It is the very model of the open source ideal of building new technology by combining perfectly functional existing technology. An EPUB file is basically a collection of HTML files with some metadata in a zip file.
It might sound fancy and technical, but making an EPUB can be as easy as a one-line command with Pandoc, or exporting from LibreOffice.
This is the command I use to convert conference notes to EPUB:
$ pandoc -f rst -t epub3 notes.rst \
A more complex command, used for a book recently published online:
$ pandoc -f markdown -t epub3 book.md colophon.md \
-N --epub-stylesheet=style.css \
In a pinch, you can even generate an EPUB with standard system tools. A one-line mimetype file placed at the head of a zip container, identifying the container as an EPUB document, along with a directory containing all HTML assets and files, renders a working EPUB file. A proper table of contents can get more complex, but this is enough to get you started:
$ echo "application/epub+zip" > mimetype
$ zip -0Xq book.epub mimetype
$ zip -Xr9D book.epub META-INF/ OEBPS/
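The same container can be assembled programmatically. The following is a minimal sketch using Python's standard `zipfile` module, not a full EPUB builder: the function name `build_epub` and its `files` argument are hypothetical, and the `OEBPS/content.opf` path in the container file is an assumption that mirrors the `OEBPS/` directory used in the commands above. The caller is expected to supply the package document and the HTML files; strict readers will reject a book without them.

```python
import zipfile

# META-INF/container.xml points the reader at the package document.
# The OEBPS/content.opf location is an assumption for this sketch.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(out_path, files):
    """Write an EPUB-style zip container.

    files maps archive paths (such as 'OEBPS/chapter1.xhtml') to text content.
    """
    with zipfile.ZipFile(out_path, "w") as zf:
        # The mimetype entry goes first and is stored without compression,
        # which is what `zip -0X` does in the shell version.
        zf.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        zf.writestr(
            "META-INF/container.xml",
            CONTAINER_XML,
            compress_type=zipfile.ZIP_DEFLATED,
        )
        # Everything else can be compressed normally.
        for name, text in files.items():
            zf.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)
```

Called as `build_epub("book.epub", {"OEBPS/content.opf": opf_text, "OEBPS/chapter1.xhtml": chapter_text})`, this writes the mimetype entry first, as the shell commands do.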
The EPUB format is popular and well supported. Many devices support it, or they support applications that can read it. The Firefox addon EPUB Reader allows easy access to EPUBs on anything that runs Firefox. FBReader opens them on computers and mobiles. In a pinch, an EPUB file can even be unzipped and viewed as raw HTML.
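Because an EPUB is a zip file, any zip library can look inside it. As a small illustration, this sketch lists the HTML documents in a book; the function name `list_html_members` is made up for the example, and the set of file extensions it checks is an assumption about how the book's files are named.

```python
import zipfile


def list_html_members(epub_path):
    """Return the names of the HTML/XHTML files inside an EPUB."""
    with zipfile.ZipFile(epub_path) as zf:
        return [
            name
            for name in zf.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        ]
```

To read those files in a browser, extract them first, for example with `ZipFile.extractall()`.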
In short, there are no barriers in the EPUB format. It is open, it is accessible, and it is powerful.
- Uses HTML along with simple, open source technologies to construct a universal eBook format. Scalable, dynamic, and capable of handling most anything that HTML can handle.
- Using standard HTML and CSS for formatting, EPUBs permit the user to override style as needed, meaning that fonts, colors, and sizes can all be modified by a user as needed.
- Well-supported on many e-reader devices and most computers and mobiles via open source EPUB reader applications.
- Since EPUB is based on HTML, it is easy to convert.
- Some devices may not support EPUB without conversion.
EPUB is the format that eBooks deserve, and is, for me, the standard by which all other formats are measured. It delivers a rich, easy-to-read book in a lightweight, open, and sensible container.
The original eBook format is plain old ASCII (although hopefully one would opt for Unicode today). This is the most universal format in computing; any OS on any platform can read it, and any text processor can convert from it.
- Plain text can be created by anyone, without any special tools or knowledge.
- Plain text is a safe and reliable source for conversion into other formats.
- Most devices support viewing plain text, one way or another; Android devices can view plain text in any note or office app, computers have text editors, eReaders usually open plain text in the default reading app. Even some portable music players (and any music player loaded with Rockbox) can view plain text.
- Plain text can rewrap sentences so that the content dynamically adjusts to different sized screens.
- It's arguable that plain text is not really an eBook format. It cannot hyperlink from a table of contents to a specific chapter, it does not support images, and so on. In a sense, it does not take advantage of any of the point-and-click benefits that electronic books offer in terms of making information quicker to locate.
- You can dynamically change the layout of text for a different size screen, but if the author hard-coded line breaks after 80 characters, then no matter how your reader adjusts things, the formatting is going to either limit itself to a maximum of 80 characters per line, or have line breaks at unusual places.
It's fairly safe to say that while plain text is durably future-proof and ensures compatibility across devices, it's not the ideal format for eBooks. However, as long as the document is consistent in its layout, plain text is easily parsed to other formats. To ensure consistency in formatting, consider using a lightweight markup convention such as reStructuredText (RST).
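The hard-coded line break problem is often fixable when the layout is consistent. Here is a minimal sketch, assuming paragraphs are separated by blank lines and that every other line break is just wrapping; the function name `unwrap_paragraphs` is made up for this example. It would mangle verse, code listings, and indented blocks, so it is a starting point rather than a general solution.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines so each paragraph becomes one long line."""
    # Blank lines (possibly containing stray whitespace) separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within a paragraph, replace each line break with a single space.
    return "\n\n".join(
        " ".join(line.strip() for line in paragraph.splitlines())
        for paragraph in paragraphs
    )
```

The result can then be re-wrapped by the reading application to whatever width the screen allows.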
A FictionBook (.fb2) eBook is an XML format that places an entire book in one single file, including any images. It is, therefore, not intended to store scanned documents (such as a full scan of an entire comic book or a historical facsimile), but as a mostly-text based book with one or a few images throughout.
Being XML, it inherits all the modern features that we would expect from an eBook. It can contain hyperlinks, font styles, and complex layouts. It is natively dynamic, so it will fluidly wrap text to accommodate any size screen.
Generating a FictionBook file is as easy as generating an XML file. It can be done in any text editor (or a lot of echo statements, if you're a masochist) and the eReader will do all the conversion on-the-fly to display the book as an easy-to-read document.
Being XML, the file format is well-structured, to say the least, and is easily parsed by computers, and even by humans if the human is blind to XML tags.
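For a sense of what generating one involves, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The function name `make_fb2` is hypothetical, and the element set is a deliberately small subset: the result has not been validated against the FictionBook schema, which expects more metadata (author, genre, document information) than is shown here, so treat it as a skeleton.

```python
import xml.etree.ElementTree as ET

# Namespace used by FictionBook 2 documents.
FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"


def make_fb2(title, paragraphs, lang="en"):
    """Return a skeletal FictionBook document as UTF-8 encoded bytes."""
    # Serialize with a default xmlns instead of an ns0: prefix.
    ET.register_namespace("", FB2_NS)

    def q(tag):
        return f"{{{FB2_NS}}}{tag}"

    root = ET.Element(q("FictionBook"))

    # Metadata: only a title and language in this sketch.
    description = ET.SubElement(root, q("description"))
    title_info = ET.SubElement(description, q("title-info"))
    ET.SubElement(title_info, q("book-title")).text = title
    ET.SubElement(title_info, q("lang")).text = lang

    # Content: one section holding one <p> per paragraph.
    body = ET.SubElement(root, q("body"))
    section = ET.SubElement(body, q("section"))
    for text in paragraphs:
        ET.SubElement(section, q("p")).text = text

    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Writing the returned bytes to a file ending in `.fb2` gives an application such as FBReader something to try opening, though stricter readers may want the missing metadata filled in.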
- FictionBook is an open format in a ubiquitous markup language, with all content contained in literally one file.
- Rich text, with images, hyperlinks, and native dynamic content- and word- wrap.
- Easy to create and convert, as long as you are comfortable with XML and its related toolchain (such as xmllint and tools like nxml-mode).
- Dismal support on dedicated eReader devices. However, if your device runs FBReader, then you can read an .fb2 file.
- The one book/one file tradition may be phased out with .fb3, which is moving toward a zip file containing a cover image, the book file, and metadata files (like an EPUB).
The FictionBook format is not popular within the English eBook market, although it's got support in some languages, so you may or may not have the occasion to obtain an .fb2 file without seeking it out specifically.
If you do happen upon one, or if you intend to generate your own, then whether or not it's the "right" choice for you depends entirely on what you use. If you use, or are happy to start using, an application that reads .fb2 files, then it's a very nice format that is self-contained and robust.
The language of the web, HTML is a powerful document format with hyperlinking, dynamic text flow, styling, image linking, and more. It seems like it would be ideal as an eBook format, and indeed it turns out to be the basis for many of the most popular eBook formats, EPUB included.
HTML as a format is not only simple and easy to learn, it's everywhere. You can write it on any platform and, obviously, view it on any platform. It's simple and blissfully concise compared to XML:
<p style="color: #666;">
Hello <a href="http://example.com">World</a>.
<img src="https://opensource.com/images/tux.png" />
</p>
HTML may be easy to write and easy to read, but it turns out to have some weaknesses. You might be able to link to images, but where do you store them? eReaders tend to do poorly with directories as books, so somehow the paths to images would need to be kept intact. A work in HTML also tends to be broken into several pages, so a book of 25 chapters might have in excess of 25 files associated with it; how do you manage all of those within an eReader?
The answer is, of course, you don't. If you've downloaded a collection of HTML documents from the web and want to take them with you for later, try wrapping the HTML in a format that your eReader can treat as an eBook. The conversion is simple as long as all the paths are correct (if you view a folder full of HTML files in Firefox and everything looks correct, then your paths are correct). You can convert from HTML using Pandoc. A simple example, assuming your target device is happy with EPUB:
$ pandoc -f html -t epub3 index.html about.html \
chapter1.html colophon.html -o book.epub
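If you would rather check paths before converting than eyeball them in a browser, a short script can do it. This is a minimal sketch using only Python's standard library; the function name `missing_images` is hypothetical, it only looks at `<img src="...">` attributes, and it ignores query strings and percent-encoding, so treat it as a sanity check rather than a validator.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class _ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(html_file):
    """Return local image paths referenced by html_file that do not exist."""
    html_file = Path(html_file)
    parser = _ImgCollector()
    parser.feed(html_file.read_text(encoding="utf-8"))

    missing = []
    for src in parser.sources:
        # Skip anything with a URL scheme (http:, https:, data:, and so on).
        if urlparse(src).scheme:
            continue
        # Resolve the path relative to the HTML file's own directory.
        if not (html_file.parent / src).exists():
            missing.append(src)
    return missing
```

An empty list from `missing_images("index.html")` suggests the local image references in that file resolve correctly.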
- HTML is the lingua franca of the web.
- HTML is easy to write and easy to read. It supports dynamic and hyperlinked text, style, and images. Being well-structured, it is easy to convert from or to HTML.
- Disparate files and assets which many e-readers view as distinct documents.
- Not supported by all e-readers.
HTML is a great format and ideal for ebooks in terms of file specification, but struggles with compatibility due to the tendency for eReaders to assume one file per book. If you happen to have a directory full of HTML that you want to take with you on the go, convert it to a proper eBook format for better compatibility.
The PDF file format was developed as a method to deliver content meant for the printed page. It was originally considered a "pre-flight" renderer: a digital version of exactly what a user could expect to see emerge from the printer.
Presumably for lack of anything more obvious, people eventually began to use the PDF as a means for distributing nearly any document that they did not intend for another user to directly edit.
- Preserves the print layout of a book, whether it is practical for digital screens or not.
- Well-supported format on most eReaders and devices.
- Intended for print, PDFs are not resolution independent and do not adapt well to screen sizes.
- A nearly dead-end format in terms of conversion. Copying or extracting the underlying text may be possible (if the text has been embedded), but otherwise there is little you can reliably convert from a PDF to another format.
- Can render very large files, depending on image compression options used at the time of creation.
- Devices may support the PDF format, but are not able to display most PDFs in a readable and convenient way. Everything from image-type support to font conflicts can also hinder whether content is rendered at all.
The PDF format is, in a sense, better suited for delivering style rather than content. PDFs tend to be big, inefficient, and resolution-specific. Due to the popularity of eBooks, a "re-flow" feature has been developed, although the PDF must have been created with re-flow written into it, and even then few devices support the feature.
If you have the choice, avoid PDFs for eBooks. If you are generating the content yourself, use anything but PDF. If you are given a PDF without any say in the matter and are having a difficult time reading it on your device (constant zooming-in on the content to read sentences through a microscope lens gets strenuous after the 10th page or so), use pdftotext or pdftohtml to extract the content from a PDF:
$ pdftotext ~/book.pdf book-text.txt
$ pdftohtml ~/book.pdf book-html.html
These are good solutions to get to the content of the PDF, but the results (and readability) vary.
Fact is, converting books from traditional typesetting to digital formats is not an easy task. If it were, Project Gutenberg would be complete by now. Even books that are scanned and run through optical character recognition require extensive clean-up and at least a once-over for errors.
Some books just can't exactly be converted as text documents, because they aren't just text documents. To preserve some books, the best solution is to scan each page and then dump all of the scans into some container.
It might seem that this is the perfect use case for a PDF, and for compatibility's sake it may be, but if your eReader is particularly feature-rich, or if you read your eBooks on a computer or mobile, then you may have support for Comic Book Archives or DjVu.
The DjVu format is not just an eBook format, it's also a compression format. DjVu files are frequently smaller even than JPEGs, but of equal quality. For small eBooks, the difference might be trivial, but for larger works it can mean the difference between an 80 MB download and a 20 MB download.
Unfortunately, .djvu support is almost non-existent on dedicated eReader devices. While it's not often a built-in feature on mobiles and computers, several applications support viewing and creating DjVu files, including the djvulibre package, the Evince document viewer, and FBReader on Android.
- Better-than-jpeg compression for facsimiles of the printed page, embeddable text; functionally, a more efficient, open, and simpler replacement for PDF.
- Not widely supported on eReaders.
- Resolution specific.
A good format with benefits in file size and efficiency, but due to limited support, it may not be practical for daily use.
Comic Book Archives
As the name implies, a Comic Book Archive is a format intended for digital storage and consumption of comic books and graphic novels. It's just as well suited for any book that you either have no text for or that you want or need to view as a graphic.
Of course, this inherently has the same problem as PDFs in the sense that a graphic cannot re-wrap itself dynamically for your eReader screen, but the format itself is remarkably open and allows for restructuring if you have the time and patience for it.
In fact, a comic book archive is nothing but a .zip or .rar archive containing a sequence of images (.cbz and .cbr, respectively). eBook readers see the file as a book and display the images in sequence, decompressing each one on-the-fly.
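Since the zip variant is that simple, creating one takes only a few lines. The sketch below uses Python's standard `zipfile` module; the function name `make_cbz` and the list of image extensions are assumptions for the example. It relies on file names sorting into reading order, so zero-padded names such as `page-001.png` are a good idea.

```python
import zipfile
from pathlib import Path

# Extensions treated as pages; adjust to whatever your reader supports.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def make_cbz(image_dir, out_path):
    """Pack the images in image_dir into a .cbz file, sorted by file name."""
    pages = sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    # The images are already compressed, so store them as-is.
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_STORED) as zf:
        for page in pages:
            zf.write(page, arcname=page.name)
    return [p.name for p in pages]
```

For example, `make_cbz("scans", "book.cbz")` packs every image in a `scans` directory and returns the page order it used.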
- Open format using only existing technology. No bloat; a very clean format.
- Mostly indifferent to image type (although your ereader may not be), so image compression levels are adjustable.
- Resolution specific.
As a format for storage and consumption, this is the ideal way to digitally archive comics and scanned facsimiles. It being nothing more than a zipped directory of images, you could even store a high-quality edition as a master copy, and create a lower-quality "portable" version for your devices.
The eBook format conundrum boils down to this: there are source formats, there are formats for consumption (often dictated by what your e-reading device supports), and there's what a vendor or distributor is offering you.
Unfortunately, these are not always in alignment with one another.
It would be nice if the world would default to open formats, because open formats are easily converted, plus they can be generated programmatically to offer you the best choice according to your needs. That is not always the case, so it's up to you to decide what format works best for you.
The good news is that, DRM (digital restrictions placed upon books by the vendor) notwithstanding, conversion is usually an option. Take your content, process it the way you need to process it so that it works for you, and always keep the most open format as a backup.