How to manage binary blobs with Git




In the previous six articles in this series we learned how to manage version control on text files with Git. But what about binary files? Git has extensions for handling binary blobs such as multimedia files, so today we will learn how to manage binary assets with Git.

One thing everyone seems to agree on is that Git is not great for big binary blobs. Keep in mind that a binary blob is different from a large text file; you can use Git on large text files without a problem, but Git can't do much with an impervious binary file except treat it as one big solid black box and commit it as-is.

Say you have a complex 3D model for the exciting new first-person puzzle game you're making, and you save it in a binary format, resulting in a 1-gigabyte file. You git commit it once, adding a gigabyte to your repository's history. Later, you give the model a different hair style and commit your update; Git can't tell the hair apart from the head or the rest of the model, so you've just committed another gigabyte. Then you change the model's eye color and commit that small change: another gigabyte. That is three gigabytes for one model with a few minor changes made on a whim. Scale that across all the assets in a game, and you have a serious problem.
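To see the effect on a small scale, here is a minimal sketch that commits a stand-in "model" three times and then measures the repository's object store. It assumes a binary format whose bytes change throughout on every save (as compressed formats tend to), which is simulated here with 1 MB of random data; the file name model.bin, the commit messages, and the temporary directory are made up for the demonstration.

```shell
set -eu
# Work in a throwaway repository (location chosen by mktemp; demo only).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

# Stand-in for a binary model: 1 MB of random bytes, rewritten from scratch
# on every "save" so that no two versions share any content.
for message in "add model" "change hair style" "change eye color"; do
    head -c 1048576 /dev/urandom > model.bin
    git add model.bin
    git commit -q -m "$message"
done

# Repack first, so the measurement reflects Git's best effort at compression.
git gc -q

objects_kb=$(du -sk .git/objects | cut -f1)
echo "model.bin is 1024 KB; .git/objects is ${objects_kb} KB after 3 commits"
```

Under these assumptions the object store ends up roughly three times the size of the file, because Git finds nothing in common between the three versions.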

Contrast that with a text file like the .obj format. The first commit stores everything, just as with the binary model, but an .obj file is a series of lines of plain text describing the vertices of a model. If you modify the model and save it back out to .obj, Git can read the two files line by line, create a diff of the changes, and process a fairly small commit. The more refined the model becomes, the smaller the changes (and so the commits) tend to get, and it's a standard Git use case. The file itself is big, but Git can build a complete picture of the current state of your data from the stored differences rather than keeping a full copy for every revision.
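The text-file case can be sketched the same way. The example below commits a tiny, made-up .obj file (a single triangle), moves one vertex, and asks Git how many lines changed; the file name and vertex values are invented for illustration.

```shell
set -eu
# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

# A minimal .obj file: three vertices and one face, one element per line.
printf 'v 0.0 0.0 0.0\nv 1.0 0.0 0.0\nv 0.0 1.0 0.0\nf 1 2 3\n' > model.obj
git add model.obj
git commit -q -m "add triangle"

# Move the third vertex and save the file again.
printf 'v 0.0 0.0 0.0\nv 1.0 0.0 0.0\nv 0.0 2.0 0.0\nf 1 2 3\n' > model.obj

# --numstat prints: lines added, lines deleted, path.
changed=$(git diff --numstat -- model.obj)
echo "$changed"
```

Git reports one line added and one line removed, no matter how many other vertices the file contains.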

However, not everything works in plain text, and these days everyone wants to work with Git. A solution was required, and several have surfaced.

Git Large File Storage (LFS) is an open source project from GitHub that began life as a fork of git-media. Git-media, git-annex, and Git-portal are likewise extensions to Git meant to manage large files. These tools take different approaches to the same problem, and each has its advantages. These aren't official statements from the projects themselves, but in my experience, the unique aspects of each are:

  • Git-annex favors a distributed model; you and your users create repositories, and each repository gets a local .git/annex directory where big files are stored. The annexes are synchronized regularly so that all assets are available to all users as needed. Unless configured otherwise with annex-cost, git-annex prefers local storage before off-site storage.

  • Git-portal is also distributed, and like git-annex has the option to synchronize to a central location. It uses common utilities that you likely already have installed (specifically Bash, Git, and rsync).

  • Git-LFS is a centralized model, a repository for common assets. You tell Git-LFS where your large files are stored, whether that's a hard drive, a server, or a cloud storage service, and each user on your project treats that location as the central master location for large assets.

git-portal

Git-portal offers simplified blob management for Git using standard UNIX tools like Bash, Git itself, and optionally rsync. It copies big binary files to local or remote storage, replacing them with symlinks that you can commit along with the rest of your project files.

Git-portal is simple at the expense of being a little more manual sometimes (it doesn't have automated garbage collection, for instance). It's ideal for users who need to manage big files that aren't normally agreeable to Git management, but don't want to have to learn a whole new system. Git-portal mimics Git itself, so there's only a minimal learning curve.

You can install Git-portal from its project page on GitLab.

All Git-portal commands mirror Git itself. To use Git-portal in a project:

$ git-portal init

To add a file:

$ git-portal add bigfile.png

Once a file's been "sent through the portal" (in the project's terminology), your interactions with Git are exactly the same as usual. You add files as usual, relatively oblivious to the fact that some of them are secretly symlinks to the _portal directory.
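The symlink idea itself can be sketched in a few lines of plain shell. This is not Git-portal's actual code, only an illustration of the technique described above: the file content moves into the _portal directory, and a symlink with the original name is left behind. The file content and the temporary working directory are made up.

```shell
set -eu
# Throwaway working directory standing in for a project.
work=$(mktemp -d)
cd "$work"
mkdir _portal

# Stand-in for a large binary asset.
printf 'pretend this is a large image\n' > bigfile.png

# Move the real content into _portal and leave a relative symlink behind.
mv bigfile.png _portal/bigfile.png
ln -s _portal/bigfile.png bigfile.png

# The symlink is what a commit would record; the content is still
# readable through it.
target=$(readlink bigfile.png)
content=$(cat bigfile.png)
echo "bigfile.png -> $target"
echo "$content"
```

Because the link is relative, it keeps working wherever the project is checked out, as long as the _portal directory travels with it.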

Everything in the _portal directory, which is entirely ignored by Git, can be backed up separately to any kind of storage you like, whether it's a USB drive or a remote server. Because Git-portal provides some Git hooks (automated scripts triggered by specific Git actions, like a push or a pull), you can even set up a special Git remote entry so your _portal directory is synchronized automatically with a remote location:

$ git remote add _portal alice@myserver.com:/home/alice/umbrella.git/_portal

Git-portal has the advantage of being a simple and Linux-native system. With a fairly minimal stack of common utilities and only a few extra commands to remember, you can manage large project dependencies and even share them between collaborators. Git-portal has been used on media projects, indie video games, and a gaming blog to manage everything from minor splash images to big PDFs to even bigger 3D models.

git-annex

git-annex has a slightly different workflow, and defaults to local repositories, but the basic ideas are the same. You should be able to install git-annex from your distribution's repository, or you can get it from the website as needed. As with git-media, every user of git-annex must install it on their own machine.

The initial setup is simpler than git-media. To create a bare repository on your server run this command, substituting your own path:

$ git init --bare --shared /opt/jupiter.git

Then clone it onto your local computer, and mark it as a git-annex location:

$ git clone seth@example.com:/opt/jupiter.git jupiter.clone
Cloning into 'jupiter.clone'...
warning: You appear to have cloned an empty repository.
Checking connectivity... done.
$ cd jupiter.clone
$ git annex init "seth workstation"
init seth workstation ok
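If you want to rehearse the repository setup without a server, the same bare-repository-and-clone arrangement can be sketched with local paths. The temporary directory and the local path used in place of seth@example.com are assumptions for illustration, and the git annex init step is left out because it requires git-annex to be installed.

```shell
set -eu
# Throwaway directory standing in for both the server and the workstation.
work=$(mktemp -d)
cd "$work"

# "Server" side: a bare, shared repository.
git init -q --bare --shared jupiter.git

# "Workstation" side: clone it by path instead of over SSH.
# Git warns on stderr that the repository is empty, which is expected.
git clone -q "$work/jupiter.git" jupiter.clone
cd jupiter.clone

# Confirm the arrangement.
bare=$(git -C "$work/jupiter.git" rev-parse --is-bare-repository)
origin=$(git remote get-url origin)
echo "server repository is bare: $bare"
echo "clone tracks: $origin"
```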

Rather than using filters to identify media assets or large files, you configure what gets classified as a large file by using the git annex command:

$ git annex add bigblobfile.flac
add bigblobfile.flac (checksum) ok
(Recording state in Git...) 

Committing is done as usual:

$ git commit -m 'added flac source for sound fx'

But pushing is different, because git annex uses its own branch to track assets. The first push you make may need the -u option, depending on how you manage your repository:

$ git push -u origin master git-annex 
To seth@example.com:/opt/jupiter.git 
* [new branch] master -> master 
* [new branch] git-annex -> git-annex 

As with git-media, a normal git push does not copy your assets to the server; it only sends information about the media. When you're ready to share your assets with the rest of the team, run the sync command:

$ git annex sync --content

If someone else has shared assets to the server and you need to pull them, git annex sync will prompt your local checkout to pull assets that are not present on your machine, but that exist on the server.

git-annex is a finely tuned solution, flexible and user-friendly. It's robust and well-tested.

git-lfs

git-lfs is written in Go, and you can install it from source code or as a downloadable binary. Instructions are on the website. Each user who wants to use git-lfs needs to install it, but it is cross-platform, so that generally doesn't pose a problem.

After installing git-lfs, you can set what filetypes you want Git to ignore and Git-LFS to manage. For example, for Git-LFS to track all .png files:

$ git lfs track "*.png"

Any time you add a filetype for Git-LFS tracking, you must add and then commit the file .gitattributes:

$ git add .gitattributes
$ git commit -m "LFS track *.png"
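The .gitattributes file is plain text, and the tracking rule is an ordinary Git attribute. As a rough sketch, the example below writes by hand the kind of line that git lfs track "*.png" adds, and then uses plain Git (no Git-LFS needed) to check which filter would apply to a file. The file name splash.png and the temporary repository are hypothetical.

```shell
set -eu
# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# The attribute line that routes *.png files through the LFS filter.
printf '*.png filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes

# Ask Git which filter applies to a given path (the file need not exist).
attr=$(git check-attr filter -- splash.png)
echo "$attr"
```

This is why .gitattributes has to be committed: collaborators only get the same routing if they receive the same attribute file.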

When you stage a file of one of those types, the file is copied to .git/lfs, and Git itself records only a small text pointer in its place.
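What Git keeps in the repository in place of the real file is a small text pointer that records a SHA-256 checksum and the file size. The sketch below builds such a pointer by hand to show its shape; it is an illustration rather than Git-LFS's own code, the file content is made up, and it assumes the sha256sum utility is available.

```shell
set -eu
# Throwaway working directory.
work=$(mktemp -d)
cd "$work"

# Stand-in for a tracked binary file.
printf 'pretend image data\n' > splash.png

# The two facts a pointer records about the real content.
oid=$(sha256sum splash.png | cut -d' ' -f1)
size=$(wc -c < splash.png | tr -d ' ')

# Assemble the pointer text: version line, object id, size in bytes.
pointer=$(printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size")
echo "$pointer"
```

The pointer stays a few lines long however large the asset is, which is what keeps the repository's history small.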

Git-LFS is a project by GitHub and it's heavily tied in to GitHub's infrastructure. If you want to run a Git server that allows for Git-LFS integration, you can study up on the Git-LFS server specification (also written in Go) and implement the API.

Both git-portal and git-annex are flexible and can use local repositories instead of a server, so they're just as useful for managing private local projects, too.

Large files and Git

While Git is meant for text files, that doesn't mean you can't use it just because you have binary assets. There are very good solutions to help you manage large files. There really aren't many excuses for not using Git if you want to, so try it out today!

Seth Kenlon
Seth Kenlon is a UNIX geek, free culture advocate, independent multimedia artist, and D&D nerd. He has worked in the film and computing industry, often at the same time.

8 Comments

The difficulty I have with any of the large file support schemes is that (as far as I can tell), you're not actually versioning any of your assets. Sure, you have large files and you have a log of the changes you've made, but unless I'm misunderstanding something fundamental, there's no mechanism for rolling back those large binary files to previous versions. That's a dealbreaker for me since the ability to branch and roll back are my two primary reasons for using versioning.

In short, please tell me I'm wrong. Does Git LFS, git-media, or git-annex support branching and rolling back large binary files and the documentation is just really obtuse? Or have I, sadly, read the documentation correctly?

Took a bit of research, but it turns out that I was mistaken. Large files *are* versioned... the revisions are just stored separately. I get that now. When you do a checkout, you get only the files that you need for that revision/branch. However, if you're doing ad-hoc versioning without a central server (that is, versioning a project in place and not necessarily sharing/collaborating with anyone), that could be a bit troublesome to set up... maybe. I guess there's only one way to know for sure. :)

In reply to Alexander Teterkin

Time to get pedantic: there's version control and then there's version control. For the blobs I manage, having git be "aware" that something changed yesterday when everything else was at 33md6465b3 does me little good. I don't need the versions of my graphic or video to match up with yesterday's commit. I don't need the ability to step through my project and see the different stages of the visuals. I want backups in case I totally ruin a graphic, but aside from that I don't need versions. It's cool, however, that git-lfs does version large files, but it isn't something I need. In fact, I actually rolled my own BASH solution to simplify the process when git-media and git-annex are overkill.

In reply to Jason van Gumster

Did you know? "Blob" stands for "binary large object".

"Binary blobs" are thus binary binary large objects.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.