Your guide to 7 open eBook formats

Opensource.com

Electronic books, or eBooks, have been around for a long time, but convenient devices on which to read them are a relatively recent development. Between mobile phones, tablets, and dedicated eBook readers, chances are you have some device in your life that you can use to read an electronic book. That's great for leveling up on how much you read, but it begs the question of what open file formats are out there for eBooks, and which ones are best.

Why eBooks?

eBooks are great. I have been using them for far longer than there have been dedicated readers.

I appreciate that eBooks enable me to take several volumes of text with me everywhere I go without actually having to bear the weight of several reams of paper. Sure, I might not actually read the entire works of Jules Verne whilst waiting in line at the grocer's, but it comforts me to know that I have them just in case. I also appreciate the ability they give me to find my favorite quotes or passages, and the fact that readers who have limited sight can enlarge the text so they can read it, or that fully blind readers can have their computer read the text to them. I appreciate that I can take notes at a conference in reStructuredText and convert my notes with Pandoc to a fully hyperlinked eBook to review on my plane ride home.

Why open?

First of all, let's establish that we'll look only at open file formats in this article, and why. The most obvious advantage to an open format for eBooks is that an open format can be converted to any other format, meaning that your book can be read on any device. Technology defeats itself when you get a book in one format but your device or operating system (OS) reads every format except that one. This does not happen when an open format is the baseline.

Open formats also ensure that anyone can create an eBook. Pen and paper are pretty universal, so if the format for eBooks is closed, then I may as well just learn shorthand.

And finally, I am only interested in fully open formats; there are those that are "open enough" to convert from, but not so open that any OS can create them, or vice versa. That doesn't work for me, especially as a PowerPC Linux user where x86 binaries to create a "sort of open" eBook are not of any use.

With that preamble out of the way, let's look at what formats are out there.


EPUB

No beating around the bush: the EPUB format is the best thing to happen to eBooks since eBooks. It's simple, lightweight, cross-platform, and versatile. It is the very model of the open source ideal of building new technology by combining perfectly functional existing technology. An EPUB file is basically a collection of HTML files with some metadata in a zip file.

It might sound fancy and technical, but making an EPUB can be as easy as a one-line command with Pandoc, or an export from LibreOffice.

This is the command I use to convert conference notes to EPUB:

$ pandoc -f rst -t epub3 notes.rst \
  -o allThingsOpen2015.epub

A more complex command, used for a book recently published online:

$ pandoc -f markdown -t epub3 book.md colophon.md \
  -N --epub-stylesheet=style.css \
  --epub-metadata=metadata.xml \
  --epub-cover-image=./images/cover-front.svg \
  --epub-embed-font=kabel.ttf \
  --epub-embed-font=Nouveau_IBM.ttf \
  -o slackermedia.epub

In a pinch, you can even generate an EPUB with standard system tools. A one-line mimetype file placed at the head of a zip container, identifying the container as an EPUB document, along with a directory containing all HTML assets and files, renders a working EPUB file. A proper table of contents can get more complex, but this is enough to get you started:

$ printf "application/epub+zip" > mimetype
$ zip -0Xq book.epub mimetype
$ zip -Xr9D book.epub META-INF/ OEBPS/
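The zip commands above assume that META-INF/ and OEBPS/ already exist and contain the right files. As a rough sketch of the smallest tree an EPUB 3 reader is likely to accept, the following creates them from scratch. The file names under OEBPS/, the title, the identifier, and the date are all placeholders invented for illustration, and the result has not been run through a validator such as epubcheck, so treat it as a starting point rather than a reference layout.

```shell
# Sketch: build the directories that the zip commands expect.
# Everything under OEBPS/ uses placeholder names and placeholder metadata.
mkdir -p META-INF OEBPS

# container.xml tells the reader where the package document lives.
cat > META-INF/container.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
EOF

# The package document: metadata, a manifest of files, and the reading order.
cat > OEBPS/content.opf <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Example Book</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2015-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
EOF

# The navigation document doubles as a simple table of contents.
cat > OEBPS/nav.xhtml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Contents</title></head>
  <body>
    <nav epub:type="toc">
      <ol>
        <li><a href="chapter1.xhtml">Chapter 1</a></li>
      </ol>
    </nav>
  </body>
</html>
EOF

# One placeholder chapter, written as XHTML.
cat > OEBPS/chapter1.xhtml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body>
    <h1>Chapter 1</h1>
    <p>Hello, EPUB.</p>
  </body>
</html>
EOF
```

Each additional chapter would need its own item in the manifest, an itemref in the spine, and (if you want it listed) an entry in nav.xhtml.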

The EPUB format is popular and well supported. Many devices support it, or they support applications that can read it. The Firefox addon EPUB Reader allows easy access to EPUBs on anything that runs Firefox. FBReader opens them on computers and mobiles. In a pinch, an EPUB file can even be unzipped and viewed as raw HTML.

In short, there are no barriers in the EPUB format. It is open, it is accessible, and it is powerful.

Advantages

  • Uses HTML along with simple, open source technologies to construct a universal eBook format. Scalable, dynamic, and capable of handling most anything that HTML can handle.
  • Using standard HTML and CSS for formatting, EPUBs permit the user to override style as needed, meaning that fonts, colors, and sizes can all be modified by a user as needed.
  • Well-supported on many e-reader devices and most computers and mobiles via open source EPUB reader applications.
  • Since EPUB is based on HTML, it is easy to convert.

Disadvantages

  • Some devices may not support EPUB without conversion.

Verdict

EPUB is the format that eBooks deserve, and is, for me, the standard by which all other formats are measured. It delivers a rich, easy-to-read book in a lightweight, open, and sensible container.


Plain text

The original eBook format is plain old ASCII (although hopefully one would opt for Unicode today). This is the most universal format in computing; any OS on any platform can read it, and any text processor can convert from it.

Advantages

  • Plain text can be created by anyone, without any special tools or knowledge.
  • Plain text is a safe and reliable source for conversion into other formats.
  • Most devices support viewing plain text, one way or another; Android devices can view plain text in any note or office app, computers have text editors, eReaders usually open plain text in the default reading app. Even some portable music players (and any music player loaded with Rockbox) can view plain text.
  • Plain text can rewrap sentences so that the content dynamically adjusts to different sized screens.

Disadvantages

  • It's arguable that plain text is not really an eBook format. It cannot hyperlink from a table of contents to a specific chapter, it does not support images, and so on. In a sense, it does not take advantage of any of the point-and-click benefits that electronic books offer in terms of making information quicker to locate.
  • You can dynamically change the layout of text for a different size screen, but if the author hard-coded line breaks after 80 characters, then no matter how your reader adjusts things, the formatting is going to either limit itself to a maximum of 80 characters per line, or have line breaks at unusual places.
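If you do end up with a hard-wrapped text file, one possible workaround is to join the lines of each paragraph back together before reading or converting it. This is only a sketch: it assumes paragraphs are separated by blank lines, and the function name unwrap and the sample text are made up for the demonstration.

```shell
# unwrap: join hard-wrapped lines inside each paragraph into one long line,
# keeping a blank line between paragraphs. Reads stdin, writes stdout.
# RS="" puts awk in paragraph mode, so each record is one paragraph.
unwrap() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
}

# Demonstration with a small stand-in sample (not from a real book).
printf '%s\n' \
    'This line was wrapped' \
    'by the author.' \
    '' \
    'So was' \
    'this one.' > sample.txt

unwrap < sample.txt > sample-unwrapped.txt
```

Be aware that this also flattens anything that relies on its line breaks, such as poetry, tables, or ASCII art, so check the result before discarding the original.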

Verdict

It's fairly safe to say that while plain text is durably future-proof and ensures compatibility across devices, it's not the ideal format for eBooks. However, as long as the document is consistent in its layout, plain text is easily parsed to other formats. To ensure consistency in formatting, consider using Markdown-style rules such as reStructuredText (RST).


FictionBook

A FictionBook (.fb2) eBook is an XML format that places an entire book in one single file, including any images. It is, therefore, not intended to store scanned documents (such as a full scan of an entire comic book or a historical facsimile), but as a mostly text-based book with one or a few images throughout.

Being XML, it inherits all the modern features that we would expect from an eBook. It can contain hyperlinks, font styles, and complex layouts. It is natively dynamic, so it will fluidly wrap text to accommodate any size screen.

Generating a FictionBook file is as easy as generating an XML file. It can be done in any text editor (or a lot of echo statements, if you're a masochist) and the eReader will do all the conversion on-the-fly to display the book as an easy-to-read document.

Being XML, the file format is well-structured, to say the least, and is easily parsed by computers, and even by humans if the human is blind to XML tags.
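To give a feel for what that XML looks like, here is a minimal sketch of an .fb2 file written out with a heredoc. The element layout follows the FictionBook 2.0 schema as far as I understand it, but the title, author, id, date, and genre value are placeholders chosen for illustration, and the file has not been validated against the schema.

```shell
# Sketch: write a tiny FictionBook 2 file by hand.
# All names and values below are placeholders, not from a real book.
cat > hello.fb2 <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0"
             xmlns:l="http://www.w3.org/1999/xlink">
  <description>
    <title-info>
      <genre>reference</genre>
      <author>
        <first-name>Example</first-name>
        <last-name>Author</last-name>
      </author>
      <book-title>Hello FictionBook</book-title>
      <lang>en</lang>
    </title-info>
    <document-info>
      <author><nickname>example</nickname></author>
      <date value="2015-01-01">2015</date>
      <id>example-hello-fb2</id>
      <version>1.0</version>
    </document-info>
  </description>
  <body>
    <section>
      <title><p>Chapter 1</p></title>
      <p>Hello, <emphasis>world</emphasis>.</p>
    </section>
  </body>
</FictionBook>
EOF
```

If you have xmllint installed, running xmllint --noout hello.fb2 is a quick way to confirm the file is at least well-formed before you load it into a reader.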

Advantages

  • FictionBook is an open format in a ubiquitous markup language, with all content contained in literally one file.
  • Rich text, with images, hyperlinks, and native dynamic content and word wrap.
  • Easy to create and convert, as long as you are comfortable with XML and its related toolchain (such as xmllint and tools like nxml-mode).

Disadvantages

  • Dismal support on dedicated eReader devices. However, if your device runs FBReader, then you can read an .fb2 file.
  • The one book/one file tradition may be phased out with .fb3, which is moving toward a zip file containing a cover image, the book file, and metadata files (like an EPUB).

Verdict

The FictionBook format is not popular within the English eBook market, although it's got support in some languages, so you may or may not have the occasion to obtain an .fb2 file without seeking it out specifically.

If you do happen upon one, or if you intend to generate your own, then whether or not it's the "right" choice for you depends entirely on what you use. If you use, or are happy to start using, an application that reads .fb2 files, then it's a very nice format that is self-contained and robust.


HTML

The language of the web, HTML is a powerful document format with hyperlinking, dynamic text flow, styling, image linking, and more. It seems like it would be ideal as an eBook format, and indeed it turns out to be the basis for many of the most popular eBook formats, EPUB included.

HTML as a format is not only simple and easy to learn, it's everywhere. You can write it on any platform and, obviously, view it on any platform. It's simple and blissfully concise compared to XML:

<p style="color: #666;">
Hello <a href="http://example.com">World</a>.
<img src="https://opensource.com/images/tux.png" />
</p>

HTML may be easy to write and easy to read, but it turns out to have some weaknesses. You might be able to link to images, but where do you store them? eReaders tend to do poorly with directories as books, so somehow the paths to images would need to be kept intact. A work in HTML also tends to be broken into several pages, so a book of 25 chapters might have in excess of 25 files associated with it; how do you manage all of those within an eReader?

The answer is, of course, you don't. If you've downloaded a collection of HTML documents from the web and want to take them with you for later, try wrapping the HTML in a format that your eReader can treat as an eBook. The conversion is simple: as long as all the paths are correct (if you view a folder full of HTML files in Firefox and everything looks correct, then your paths are correct), you can convert from HTML using Pandoc. A simple example, assuming your target device is happy with EPUB:

$ pandoc -f html -t epub3 index.html about.html \
  chapter1.html colophon.html -o book.epub

Advantages

  • HTML is the lingua franca of the web.
  • HTML is easy to write and easy to read. It supports dynamic and hyperlinked text, style, and images. Being well-structured, it is easy to convert from or to HTML.

Disadvantages

  • Disparate files and assets which many e-readers view as distinct documents.
  • Not supported by all e-readers.

Verdict

HTML is a great format and ideal for ebooks in terms of file specification, but struggles with compatibility due to the tendency for eReaders to assume one file per book. If you happen to have a directory full of HTML that you want to take with you on the go, convert it to a proper eBook format for better compatibility.


PDF

The PDF file format was developed as a method to deliver content meant for the printed page. It was originally considered a "pre-flight" renderer: a digital version of exactly what a user could expect to see emerge from the printer.

Presumably for lack of anything more obvious, people eventually began to use the PDF as a means for distributing nearly any document that they did not intend for another user to directly edit.

Advantages

  • Preserves the print layout of a book, whether it is practical for digital screens or not.
  • Well-supported format on most eReaders and devices.

Disadvantages

  • Intended for print, PDFs are not resolution independent and do not adapt well to screen sizes.
  • A nearly dead-end format in terms of conversion. Manual copying and pasting of underlying text may be possible (if the text has been embedded), but otherwise there is no reliable way to convert from PDF to another format.
  • Can produce very large files, depending on image compression options used at the time of creation.
  • Devices may support the PDF format, but are not able to display most PDFs in a readable and convenient way. Everything from image-type support to font conflicts can also hinder whether content is rendered at all.

Verdict

The PDF format is, in a sense, better suited for delivering style rather than content. PDFs tend to be big, inefficient, and resolution-specific. Due to the popularity of eBooks, a "re-flow" feature has been developed, although the PDF must have been created with re-flow written into it, and even then few devices support the feature.

If you have the choice, avoid PDFs for eBooks. If you are generating the content yourself, use anything but PDF. If you are given a PDF without any say in the matter and are having a difficult time reading it on your device (constant zooming-in on the content to read sentences through a microscope lens gets strenuous after the 10th page or so), use pdftotext to extract the text from a PDF:

$ pdftotext ~/book.pdf book-text.txt

Or pdftohtml.

$ pdftohtml ~/book.pdf book-html.html

These are good solutions to get to the content of the PDF, but the results (and readability) vary.


Fact is, converting books from traditional typesetting to digital formats is not an easy task. If it were, Project Gutenberg would be complete by now. Even books that are scanned and run through optical character recognition require extensive clean-up and at least a once-over for errors.

Some books just can't exactly be converted as text documents, because they aren't just text documents. To preserve some books, the best solution is to scan each page and then dump all of the scans into some container.

It might seem that this is the perfect use case for a PDF, and for compatibility's sake it may well be, but if your e-reader is particularly feature-rich, or if you read your eBooks on a computer or mobile, then you may have support for Comic Book Archives or DjVu.

DjVu

The DjVu format is not just an eBook format, it's also a compression format. DjVu files are frequently even smaller than JPEGs, but of equal quality. For small eBooks, the difference might be trivial, but for larger works, it can mean the difference between an 80 MB download and a 20 MB download.

Unfortunately, .djvu support is almost nonexistent on dedicated eReader devices. While it's not often a built-in feature on mobiles and computers, several applications support viewing and creating DjVu files, including the djvulibre package, the Evince document viewer, and FBReader on Android.

Advantages

  • Better-than-JPEG compression for facsimiles of the printed page, embeddable text; functionally, a more efficient, open, and simpler replacement for PDF.

Disadvantages

  • Not widely supported on eReaders.
  • Resolution specific.

Verdict

A good format with benefits in file size and efficiency, but due to limited support, it may not be practical for daily use.


Comic Book Archives

As the name implies, a Comic Book Archive is a format intended for digital storage and consumption of comic books and graphic novels. It's just as well suited for any book that you either have no text for or that you want or need to view as a graphic.

Of course, this inherently has the same problem as PDFs in the sense that a graphic cannot re-wrap itself dynamically for your eReader screen, but the format itself is remarkably open and allows for restructuring if you have the time and patience for it.

In fact, a comic book archive is nothing but a .zip or .rar archive containing a sequence of images (.cbz and .cbr, respectively). eBook readers see the file as a book and display the images in sequence, decompressing each one on-the-fly.

Advantages

  • Open format using only existing technology. No bloat; a very clean format.
  • Mostly indifferent to image type (although your ereader may not be), so image compression levels are adjustable.

Disadvantages

  • Resolution specific.

Verdict

As a format for storage and consumption, this is the ideal way to digitally archive comics and scanned facsimiles. It being nothing more than a zipped directory of images, you could even store a high-quality edition as a master copy, and create a lower-quality "portable" version for your devices.


Overview

The eBook format conundrum boils down to this: there are source formats, there are formats for consumption (often dictated by what your e-reading device supports), and there's what a vendor or distributor is offering you.

Unfortunately, these are not always in alignment with one another.

It would be nice if the world would default to open formats, because open formats are easily converted, plus they can be generated programmatically to offer you the best choice according to your needs. That is not always the case, so it's up to you to decide what format works best for you.

The good news is that, DRM (digital restrictions placed upon books by the vendor) notwithstanding, conversion is usually an option. Take your content, process it the way you need to process it so that it works for you, and always keep the most open format as a backup.

Seth Kenlon
Seth Kenlon is a UNIX geek, free culture advocate, independent multimedia artist, and D&D nerd. He has worked in the film and computing industry, often at the same time.

17 Comments

Great article, Seth! It occurs to me that plain text is the truest to the original form, in that its disadvantages are the same as for dead-tree books.

Thanks!

The greatest / only frustration I have had with plain text is inconsistency in formatting. As long as the plain text is consistent, it can be parsed and up-converted to rst or markdown, and then to epub, which brings the otherwise humble format into modern e-reading convenience standards.

The problem is when you get fancy stylized plain text with surprising indentations and ASCII art and stuff like that. Then it's a matter of desperately trying to normalise the text into some parse-able format, but in the end it's basically manual conversion. Or you just live with plain text, funky line breaks and all.

I guess the moral of that story is that we as content creators should NEVER assume anything about how people will be consuming the deliverable. For every 7 people who will use an e-reader, there will be those 3 whackos using a phone, a web browser, and a TI-85.

In reply to by bcotton

Nice article, but how can I convert an HTML file(s) (with css and images) to an EPUB format?? All of my efforts with LibreOffice and writer2epub have so far failed. The result shows up in FBReader as the raw HTML with all of its tags, etc.

Pandoc tends to be my (and, from the look of it, Seth's) go-to app for converting between document formats. EPUB only supports a subset of HTML tags and CSS rules and regarding image support, many dedicated ereaders are limited to just JPEG support (so avoid PNG and GIF).

As a secondary option, you may also want to try Calibre. It has some basic conversion capability built into it and, if I recall correctly, it can parse HTML. If you can't get up and running with Pandoc, that might be a decent fallback.

In reply to by dru (not verified)

Yep, I agree with Jason. Pandoc makes it pleasantly simple to convert HTML to epub.

It's easiest if all your html is in one file, but that is not always the case, so you might have to point pandoc to each file...

Assuming a Linux shell:

$ pandoc -f html -t epub3 index.html preface.html chapter1.html chapter2.html -N --epub-stylesheet=stylesheet.css --epub-metadata=metadata.xml --epub-cover-image=cover.jpg --epub-embed-font=liberation.ttf --epub-embed-font=Nouveau_IBM.ttf -o mybook.epub

The stylesheet is optional, and the metadata file is something you would need to generate manually. It doesn't need to be at all complex:

$ cat metadata.xml
<dc:title>Book of Foo and Bar</dc:title>
<dc:creator>Seth Kenlon</dc:creator>

The result will be an epub that appears in any ebook management software as "Book of Foo and Bar" by "Seth Kenlon", and in your filesystem as "mybook.epub".

I use this kind of command fairly regularly; I pull a directory off of the web and convert the pages to ebook so that I can read it whilst offline. I'm reading the GNU Gawk manual in exactly this way.
Hope that helps.

In reply to by Jason van Gumster

Thanks, pandoc works really well !! I appreciate your quick replies!

In reply to by sethkenlon

An amazing article, Seth. I have not comprehended it all. I've only used ePub and PDF to date. But this is a great resource and I'm sharing it with my edtech friends who may be attempting to convert files.

For the record, the easiest method I have found to produce really nice epubs is to use RST or Markdown as an input format, then run that through pandoc to convert to epub. Really easy and the learning curves are really small.

In reply to by Don Watkins

Why did you include text but not RTF or ODT? Both are arguably as open as txt, and they offer a lot more features. Also, none of the three are ebook formats.

And are you sure FB2 is an open format? I thought it was proprietary with a freely available spec and tools.

Thanks for reading the article!

On ODT:
I did not include ODT because it did not occur to me to include it. I like the idea of using ODT as a quick-and-dirty ebook format; sorry I left it out!

On Ebooks:
Don't get too caught up in the modern ebook market when looking for definitions of the term "ebook". Ebooks were around long before e-readers existed, and many of us were reading books electronically (electronic books / e-books / ebooks) long before anyone thought to develop a special format. So I'm arguing that yes, plain text, RTF, and ODT can all be ebook formats if that's how they are being used, just as a handwritten manuscript is a "book" with or without mass publication or fancy binding.

On RTF:
The original draft of this article did include RTF, but we chose not to include it because it's functionally equivalent to something a lot more elegant (like .fb2) but with an uglier spec (compare a 'hello world' in RTF to a 'hello world' in XML). RTF is being abandoned by Microsoft, and it was never open sourced (it is subject to an "open promise") so its validity as an "open" file format is debatable; programmers reverse-engineering something until MS promises not to persecute them for doing so is not exactly the same thing as a developer posting the spec online out of the desire for people to know what it is that they are using.

On fb2:
Yes, .fb2 is open. It is a schema for XML, sort of like Docbook. There is no further "source" or claim of ownership that requires an "open promise". The full schema is posted online at http://gribuser.ru/xml/fictionbook/2.0/xsd/FictionBook2.xsd and is available for anyone to use or modify (although, if you modify it, it's no longer fictionbook, just like docbook is no longer docbook if you change it). If it were proprietary, then its schema would not be available, but it could probably be reverse-engineered from the resulting XML source; but none of this is the case. The schema was published by the developer, Dmitry, for everyone to use.

This is pretty standard practise for schemas, though I guess the question of how schemas are licensed might make for an interesting topic for an article by a legal geek [not me], sometime. Its author, Dmitry Gribov, has a website and email address that might provide further clarification for the very curious.

In reply to by n hoffleder (not verified)

Great article, Seth!

Another plus for EPUB is that if you're planning on trying to sell an EPUB file (of your own writing/notes, of course), nearly all vendors/distributors will accept an EPUB (well... EPUB2 at least). They may auto-convert to their closed format on the backend, so you'll have to make sure that conversion goes cleanly, but for the most part, it's pretty seamless.

" it begs the question"

No, it doesn't. (Go look up this phrase).

You assume that I'm using the word "beg" archaically! Join my gang of modernist grammarians and you too can abuse myriad ancient phrases.

In reply to by n hoffleder (not verified)

I think you missed a few others, including one that you actually mentioned in passing but didn't call out as one of the options. That was markdown (and its variants). A number of e-reader apps can consume and display it these days.

The article itself raises (not begs) the question, "What would a good open source tool chain for creating ebooks look like?" Once the final output format(s) are defined, deciding how you want to write and what you're going to write can create a fascinating selection of options. I smell an opportunity for another article. :-)

Great point! I guess the "problem" with markdown-style formats would be images and other includes; where do they get neatly stored so that they are available when the markdown text is rendered by the e-reader? But if there are no includes, there is no problem and md or rst or whatever seems like a great option.

The tool chain for open source ebook production is all I know, so I never really thought of it as article potential. Between pandoc, docbook and various XML processors, latex, and even Libre Office, that just might be an article worth writing!

In reply to by sgtrock (not verified)

nice article thanks for sharing with us

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.