How to manage binary blobs with Git

Image by: Jason Baker. CC BY-SA 4.0.

Read:

Part 1: What is Git?
Part 2: Getting started with Git
Part 3: Creating your first Git repository
Part 4: How to restore older file versions in Git
Part 5: 3 graphical tools for Git
Part 6: How to build your own Git server
Part 7: How to manage binary blobs with Git

In the previous six articles in this series we learned how to manage version control on text files with Git. But what about binary files? Git has extensions for handling binary blobs such as multimedia files, so today we will learn how to manage binary assets with Git.

One thing everyone seems to agree on is that Git is not great for big binary blobs. Keep in mind that a binary blob is different from a large text file; you can use Git on large text files without a problem, but Git can't do much with an opaque binary file except treat it as one big black box and commit it as-is.

Say you have a complex 3D model for the exciting new first-person puzzle game you're making, and you save it in a binary format, resulting in a 1 gigabyte file. You commit it once, adding a gigabyte to your repository's history. Later, you give the model a different hair style and commit your update; Git can't tell the hair apart from the head or the rest of the model, so you've just committed another gigabyte. Then you change the model's eye color and commit that small change: another gigabyte. That is three gigabytes for one model with a few minor changes made on a whim. Scale that across all the assets in a game, and you have a serious problem.
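The growth described above can be reproduced at a smaller scale. This is only a sketch: it assumes a throwaway repository in a temporary directory, a made-up file name (model.bin), and 1 MiB of random bytes standing in for a binary model whose contents change wholesale on every save.

```shell
#!/bin/sh
# Sketch: commit three versions of a 1 MiB binary file and measure the object store.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

for version in 1 2 3; do
    # Random bytes stand in for a binary model that is rewritten on every save.
    head -c 1048576 /dev/urandom > model.bin
    git add model.bin
    git commit -q -m "model version $version"
done

# "size:" is the disk space used by loose objects, in KiB.
size_kib=$(git count-objects -v | awk '/^size:/ {print $2}')
echo "object store: ${size_kib} KiB after 3 commits of a 1 MiB file"
```

Each commit adds roughly the full size of the file, because random data gives Git nothing to compress or share between versions.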

Contrast that with a text format like .obj. The first commit stores everything, just as with the binary model, but an .obj file is a series of lines of plain text describing the vertices of a model. If you modify the model and save it back out to .obj, Git can compare the two files line by line, create a diff of the changes, and process a fairly small commit. The more refined the model becomes, the smaller the commits get, and it's a standard Git use case. The file is big, but Git can layer the small changes on top of one another to build a complete picture of the current state of your data.
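For comparison, here is a minimal sketch of the text case. It assumes a tiny hand-written .obj file (a single triangle) in a temporary repository; the file contents and the demo identity are invented for illustration.

```shell
#!/bin/sh
# Sketch: change one vertex in a plain-text model and see how small the diff is.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

# A one-triangle model: three vertices and one face.
printf 'v 0.0 0.0 0.0\nv 1.0 0.0 0.0\nv 0.0 1.0 0.0\nf 1 2 3\n' > model.obj
git add model.obj
git commit -q -m "add triangle"

# Move a single vertex, as a modeling tool might on a small edit.
sed 's/^v 1.0 0.0 0.0$/v 2.0 0.0 0.0/' model.obj > model.tmp
mv model.tmp model.obj

# --numstat prints: lines added, lines deleted, file name.
changes=$(git diff --numstat)
echo "$changes"
```

Only the edited line shows up in the diff; the rest of the file is untouched as far as Git is concerned.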

However, not everything works in plain text, and these days everyone wants to work with Git. A solution was required, and several have surfaced.

Git Large File Storage (LFS) is an open source project from GitHub that began life as a fork of git-media. Git-portal and git-annex are also extensions to Git meant to manage large files. All three are different approaches to the same problem, and they each have advantages. These aren't official statements from the projects themselves, but in my experience, the unique aspects of each are:

Git-annex favors a distributed model; you and your users create repositories, and each repository gets a local .git/annex directory where big files are stored. The annexes are synchronized regularly so that all assets are available to all users as needed. Unless configured otherwise with annex-cost, git-annex prefers local storage before off-site storage.
Git-portal is also distributed, and like git-annex has the option to synchronize to a central location. It uses common utilities that you likely already have installed (specifically Bash, Git, and rsync).
Git-LFS is a centralized model, a repository for common assets. You tell Git-LFS where your large files are stored, whether that's a hard drive, a server, or a cloud storage service, and each user on your project treats that location as the central master location for large assets.

git-portal

Git-portal offers simplified blob management for Git using standard UNIX tools like Bash, Git itself, and optionally rsync. It copies big binary files to local or remote storage, replacing them with symlinks that you can commit along with the rest of your project files.

Git-portal is simple at the expense of being a little more manual sometimes (it doesn't have automated garbage collection, for instance). It's ideal for users who need to manage big files that aren't normally agreeable to Git management, but don't want to have to learn a whole new system. Git-portal mimics Git itself, so there's only a minimal learning curve.

You can install Git-portal from its project page on GitLab.

All Git-portal commands mirror Git itself. To use Git-portal in a project:

$ git-portal init

To add a file:

$ git-portal add bigfile.png

Once a file's been "sent through the portal" (in the project's terminology), your interactions with Git are exactly the same as usual. You add files as usual, relatively oblivious to the fact that some of them are secretly symlinks to the _portal directory.
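To make the symlink idea concrete, the sketch below does by hand roughly what the text describes: it moves a large file into an ignored _portal directory and commits a symlink in its place. It is an illustration with plain Git and coreutils, not Git-portal's actual implementation; the file contents are random bytes rather than a real PNG, and it assumes a filesystem that supports symlinks.

```shell
#!/bin/sh
# Sketch: replace a big file with a symlink into an ignored _portal directory.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

# Stand-in for a large binary asset (random bytes, not a real image).
head -c 65536 /dev/urandom > bigfile.png

# Keep the real file out of Git's sight.
mkdir _portal
echo "_portal/" > .gitignore
mv bigfile.png _portal/bigfile.png

# Leave a symlink where the file used to be, and commit that instead.
ln -s _portal/bigfile.png bigfile.png
git add .gitignore bigfile.png
git commit -q -m "track bigfile.png as a symlink"

# Mode 120000 is how Git records a symbolic link.
mode=$(git ls-files -s bigfile.png | cut -d' ' -f1)
echo "bigfile.png is stored with mode $mode"
```

The repository only ever stores the short link target, while the bytes of the asset stay in _portal for you to back up however you like.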

Everything in the _portal directory, which is entirely ignored by Git, can be backed up separately to any kind of storage you like, whether it's a USB drive or a remote server. Because Git-portal provides some Git hooks (automated scripts triggered by specific Git actions, like a push or a pull), you can even set up a special Git remote entry so your _portal directory is synchronized automatically with a remote location:

$ git remote add _portal alice@myserver.com:/home/alice/umbrella.git/_portal

Git-portal has the advantage of being a simple and Linux-native system. With a fairly minimal stack of common utilities and only a few extra commands to remember, you can manage large project dependencies and even share them between collaborators. Git-portal has been used on media projects, indie video games, and a gaming blog to manage everything from minor splash images to big PDFs to even bigger 3D models.

git-annex

git-annex has a slightly different workflow, and defaults to local repositories, but the basic ideas are the same. You should be able to install git-annex from your distribution's repository, or you can get it from the website as needed. Every user who works with git-annex must install it on their machine.

The initial setup is simple. To create a bare repository on your server, run this command, substituting your own path:

$ git init --bare --shared /opt/jupiter.git

Then clone it onto your local computer, and mark it as a git-annex location:

$ git clone seth@example.com:/opt/jupiter.git
Cloning into 'jupiter'...
warning: You appear to have cloned an empty repository.
Checking connectivity... done.
$ cd jupiter
$ git annex init "seth workstation"
init seth workstation ok
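If you want to rehearse the bare-repository and clone steps before touching a real server, a local directory can stand in for the remote host. This is a sketch under that assumption: the temporary paths and the demo identity are made up, and the git annex steps are left out so that it runs with plain Git.

```shell
#!/bin/sh
# Sketch: bare "server" repository and a working clone, both on the local disk.
set -e
base=$(mktemp -d)

# "Server" side: a shared bare repository.
git init -q --bare --shared "$base/jupiter.git"

# "Workstation" side: clone it (Git warns that the repository is empty).
git clone -q "$base/jupiter.git" "$base/jupiter"
cd "$base/jupiter"
git config user.email "seth@example.com"
git config user.name "Seth"

echo "Jupiter project" > README
git add README
git commit -q -m "initial commit"

# First push: -u records the upstream branch for later pushes and pulls.
git push -q -u origin HEAD

# Count the branches that now exist in the bare repository.
branches=$(git -C "$base/jupiter.git" for-each-ref --format='%(refname)' refs/heads | wc -l | tr -d ' ')
echo "branches on the server side: $branches"
```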

Rather than using filters to identify media assets or large files, you configure what gets classified as a large file by using the git annex command:

$ git annex add bigblobfile.flac
add bigblobfile.flac (checksum) ok
(Recording state in Git...)

Committing is done as usual:

$ git commit -m 'added flac source for sound fx'

But pushing is different, because git annex uses its own branch to track assets. The first push you make may need the -u option, depending on how you manage your repository:

$ git push -u origin master git-annex 
To seth@example.com:/opt/jupiter.git 
* [new branch] master -> master 
* [new branch] git-annex -> git-annex

A normal git push does not copy your assets to the server; it only sends information about the media. When you're ready to share your assets with the rest of the team, run the sync command:

$ git annex sync --content

If someone else has shared assets to the server and you need to pull them, git annex sync will prompt your local checkout to pull assets that are not present on your machine, but that exist on the server.

git-annex is a finely tuned solution that is both flexible and user-friendly. It's robust and well-tested.

git-lfs

git-lfs is written in Go, and you can install it from source code or as a downloadable binary. Instructions are on the website. Each user who wants to use git-lfs needs to install it, but it is cross-platform, so that generally doesn't pose a problem.

After installing git-lfs, you can set what filetypes you want Git to ignore and Git-LFS to manage. For example, for Git-LFS to track all .png files:

$ git lfs track "*.png"

Any time you add a filetype for Git-LFS tracking, you must add and then commit the file .gitattributes:

$ git add .gitattributes
$ git commit -m "LFS track *.png"
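The tracking rule itself is an ordinary Git attribute, so you can inspect it without Git-LFS installed. The sketch below assumes the line that git lfs track "*.png" normally writes to .gitattributes and asks Git which paths it applies to; the file names splash.png and notes.txt are hypothetical and don't need to exist.

```shell
#!/bin/sh
# Sketch: see which paths a Git-LFS tracking rule applies to, using plain Git.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Assumed to match what `git lfs track "*.png"` appends to .gitattributes.
echo '*.png filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# check-attr reports the value of an attribute for a given path.
png_filter=$(git check-attr filter -- splash.png)
txt_filter=$(git check-attr filter -- notes.txt)
echo "$png_filter"
echo "$txt_filter"
```

Paths matching the pattern report the lfs filter, and everything else is left to Git as usual.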

When you stage a file of those types, the file is copied to .git/lfs.
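What Git itself commits in place of the large file is a small text pointer. The sketch below builds such a pointer by hand to show its shape, following the pointer layout in the Git-LFS specification (a version line, a SHA-256 object ID, and a size in bytes). It is an illustration only, not how Git-LFS generates pointers internally; asset.bin is a made-up file, and sha256sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Sketch: hand-build a Git-LFS style pointer for a small stand-in file.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for a large asset.
printf 'hello\n' > asset.bin

# The pointer records the SHA-256 of the content and its size in bytes.
oid=$(sha256sum asset.bin | cut -d' ' -f1)
size=$(wc -c < asset.bin | tr -d ' ')

printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
    "$oid" "$size" > pointer.txt
cat pointer.txt
```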

Git-LFS is a project by GitHub and it's heavily tied in to GitHub's infrastructure. If you want to run a Git server that allows for Git-LFS integration, you can study up on the Git-LFS server specification and the reference server (also written in Go) and implement the API.

Both git-portal and git-annex are flexible and can use local repositories instead of a server, so they're just as useful for managing private local projects, too.

Large files and Git

While Git is meant for text files, that doesn't mean you can't use it just because you have binary assets. There are very good solutions to help you manage large files. There really aren't many excuses for not using Git if you want to, so try it out today!