Use this Python script to encode binary in Base94

Image by:

Public Domain via LibreShot

Humans transfer information in many different ways. On the internet, the primary format is text, which is how you're reading this article. However, there is other data on the internet, such as images and sound files and so on. It might seem easy to post an image online or attach a document to an email, until you realize that HTTP/1.1 and SMTP are text-based protocols. Data transferred over such protocols must be represented as a subset of ASCII text (specifically, characters 33 to 126).

A digital image is already encoded as binary data by the computer that displays it. In other words, a digital image isn't like a physical photograph that's printed on paper: it's a collection of computer shorthand that is decoded by whatever image viewer you're using to look at it, whether that image viewer is a web browser, a photo-editing application, or any software that can display images.

To re-encode an image into ASCII, it's common to use Base64, a system of binary-to-text encoding rules that can represent binary data as an ASCII string. Here's a single black pixel, saved in the webp format, in Base64:

UklGRiYAAABXRUJQVlA4IBoAAAAwAQCdASoBAAEAAIAOJaQAA3AA/v9gGAAAAA==

This article is about binary/text converters, their most popular implementations, and a non-standard approach that uses a variable alphabet. It's a theoretical prototype, meant as a purely academic exercise, because the time and space complexities make it applicable only to small files (up to a few tens of kilobytes). However, this implementation allows you to choose any base, with no dependencies on powers of two (such as seven or 77).

Converting binary files

The main purpose of a converter like this is to put a binary file into a form applicable to send over a channel with a limited range of supported symbols. A good example is any text-based network protocol, where all transmitted binary data must be reversibly converted to a pure text form and include no control symbols in the data. The ASCII codes from 0 to 31 are considered control characters, and they are lost when transmitting over any logical channel that doesn't allow endpoints to transmit full eight-bit bytes (binary) with codes from 0 to 255.

The standard solution nowadays for this purpose is the Base64 algorithm, as demonstrated above and defined in the IETF's RFC 4648. This RFC also describes Base32 and Base16 as possible variations. The key point here is that they all share the same characteristic: they are all powers of two. The wider a range of supported symbols (codes), the more space-efficient the conversion result. It will be bigger, but the question is how much bigger. For example, Base64 encoding gives an approximately 33% bigger output, because three input (eight valued bits) bytes are translated to four output (six valued bits, 2⁶=64) bytes. So, the ratio is always 4/3; that is, the output is bigger by 1/3 or 33.(3)%. Practically speaking, Base32 is very inefficient because it implies translating five input (eight valued bits) bytes to eight output (five valued bits, 2⁵=32) bytes, and the ratio is 8/5; that is, the output is bigger by 3/5 or 60%. In this context, it is hard to consider any sort of efficiency of Base16, as its output size is bigger by 100% (each byte with eight valued bits is represented by two four-valued bits bytes, also know as the nibbles, 2⁴=16). It is not even a translation, but rather just a representation of an eight-bit byte in the hexadecimal view.

These input and output byte ratios were calculated for the Base64/32/16 encodings using the least common multiple (LCM). You can calculate it, and for that, you need one more function: the greatest common divisor (GCD).

Base64 (Input: eight bits, Output: six bits):
- LCM(8, 6) = 8*6/GCD(8,6) = 24 bit
- Input: 24/8 = 3 bytes
- Output: 24/6 = 4 bytes
- Ratio (Output/Input): 4/3
Base32 (Input: eight bits, Output: five bits):
- LCM(8, 5) = 8*5/GCD(8,5) = 40 bit
- Input: 40/8 = 5 bytes
- Output: 40/5 = 8 bytes
- Ratio (Output/Input): 8/5
Base16 (Input: eight bits, Output: four bits):
- LCM(8, 4) = 8*4/GCD(8,4) = 8 bit
- Input: 8/8 = 1 byte
- Output: 8/4 = 2 bytes
- Ratio (Output/Input): 2/1

Limited character sets

What if a channel is able to transmit only a few (like nine or 17) different symbols? That is, if you have a file represented by a 256-symbol alphabet (a normal eight-bit byte), then you're not limited by computational power or memory constraints of either the encoder or decoder. But what if you can send only seven different symbols instead of 256? Base64, 32, and 16 are not suitable in such a scenario. If you only have seven symbols you can use to encode, then Base7 is the only possible output format.

Or, what if an amount of transmitted data is a concern for a channel? Base64 always increases the data by 33%, no matter what is transmitted. For instance, Base94 increases output by only 22%.

It might seem that Base94 is not the limit. If the first 32 ASCII codes are control characters and there are 256 codes in total, what stops you from using an alphabet of 256-32=224 symbols? The reason: Not all of those 224 ASCII codes have printable characters. In general, only seven bits (0 to 127) are standardized, and the rest (128 to 255) are used for a variety of locales, such as Koi8-R, Windows-1251, and so on. That means only 128-32=96 are available in the standardized range. In addition, the ASCII code 32 is the space character, and 127 doesn't have a visible character either. So 96-2 gives you 94 printable characters, each of which has the same association with their codes on all possible machines.

Using this method of encoding, you could feasibly transmit essentially any data with the most rudimentary of symbols. You could even, quite literally, send a photograph using only smoke signals!

Use this Python script

This solution is pretty simple, but this simplicity includes a significant computational constraint. The whole input file can be treated as one big number with a base 256. It might be a really big number and require thousands of bits. Then you just need to convert this big number to a different base. That's it.

Python 3 makes it even simpler! Usually, conversions between different bases are done with an intermediate Base10. The good news is that Python 3 has built-in support for big number calculations (it is integrated with Int), and the Int class has a method that reads any number of bytes and automatically represents them in a big Base10 number with a desired endian. So all these complications can be implemented in just two lines of code, which is pretty amazing!

with open('input_file', 'rb') as f:
    in_data = int.from_bytes(f.read(), 'big')

Here, variable in_data is the big number with Base10. Most of the computation happens and most of the time is consumed in just these two lines. From this initial encoding, you can convert the data to any other base, as it's usually done with normal small decimal numbers.

Here's a sample script, which I keep updated at on my GitHub repository.

#!/usr/bin/env python3

from sys import argv
from math import ceil

base = 42 
abc = '''!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'''

def to_base(fn_src, fn_dst, base=base):
    out_data = []

    # represent a file as a big decimal number
    with open(fn_src, 'rb') as f:
        in_data = int.from_bytes(f.read(), 'big')
    
    # convert a big decimal number to a baseN
    d, r = in_data % base, in_data // base
    out_data.append(abc[d])
    while r: 
        d, r = r % base, r // base
        out_data.append(abc[d])

    # write a result as a string to a file
    with open(fn_dst, 'wb') as f:
        f.write(''.join(out_data).encode())

def from_base(fn_src, fn_dst, base=base):
    out_data = 0

    # read one long string at once to memory
    with open(fn_src, 'rb') as f:
        in_data = f.read().decode()

    # convert a big baseN number to decimal
    for i, ch in enumerate(in_data):
        out_data = abc.index(ch)*(base**i) + out_data

    # write a big decimal number to a file as a sequence of bytes
    with open(fn_dst, 'wb') as f:
        f.write(out_data.to_bytes(ceil(out_data.bit_length()/8), 'big'))

def usage():
    print(f'usage: {argv[0]} <-e|-d> src dst [base={base}]')
    raise SystemExit(1)

def main():
    if len(argv) == 5:
        base = int(argv[4])
    elif len(argv) < 4:
        usage()

    if argv[1] not in ('-e', '-d'):
        usage()
    elif argv[1] == '-e':
        to_base(argv[2], argv[3], base)
    elif argv[1] == '-d':
        from_base(argv[2], argv[3], base)
    else:
        usage()

if __name__ == '__main__':
    main()