Using rsync to back up your Linux system

Find out how to use rsync in a backup scenario.
Image: hard drives. William Warby, modified by Jason Baker. Creative Commons BY-SA 2.0.

Backups are an incredibly important aspect of a system administrator’s job. Without good backups and a well-planned backup policy and process, it is a near certainty that sooner or later some critical data will be irretrievably lost.

All companies, regardless of how large or small, run on their data. Consider the financial and business cost of losing all of the data you need to run your business. There is not a business today, from the smallest sole proprietorship to the largest global corporation, that could survive the loss of all or even a large fraction of its data. Your place of business can be rebuilt using insurance, but your data can never be rebuilt.

By loss here, I don't mean stolen data; that is an entirely different type of disaster. I mean the complete destruction of the data.

Even if you are an individual and not running a large corporation, backing up your data is very important. I have two decades of personal financial data, as well as data for my now-closed businesses, including a large number of electronic receipts. I also have many documents, presentations, and spreadsheets of various types that I have created over the years. I really don't want to lose all of that.

So backups are imperative to ensure the long-term safety of my data.

Backup options

There are many options for performing backups. Most Linux distributions are provided with one or more open source programs specially designed to perform backups. There are many commercial options available as well. But none of those directly met my needs, so I decided to use basic Linux tools to do the job.

In my article for the Open Source Yearbook last year, Best Couple of 2015: tar and ssh, I showed that fancy and expensive backup programs are not really necessary to design and implement a viable backup program.

Since last year, I have been experimenting with another backup option, the rsync command, which has some very interesting features that I have been able to use to good advantage. My primary objectives were to create backups from which users could locate and restore files without having to untar a backup tarball, and to reduce the amount of time taken to create the backups.

This article is intended only to describe my own use of rsync in a backup scenario. It is not a look at all of the capabilities of rsync or the many ways in which it can be used.

The rsync command

The rsync command was written by Andrew Tridgell and Paul Mackerras and first released in 1996. The primary intention for rsync is to remotely synchronize the files on one computer with those on another. Did you notice what they did to create the name there? rsync is open source software and is provided with almost all major distributions.

The rsync command can be used to synchronize two directories or directory trees, whether they are on the same computer or on different computers, but it can do so much more than that. rsync creates or updates the target directory to be identical to the source directory. The target directory is freely accessible by all the usual Linux tools because it is not stored in a tarball or zip file or any other archival file type; it is just a regular directory with regular files that can be navigated by regular users using basic Linux tools. This meets one of my primary objectives.

One of the most important features of rsync is the method it uses to synchronize preexisting files that have changed in the source directory. Rather than copying the entire file from the source, it uses checksums to compare blocks of the source and target files. If all of the blocks in the two files are the same, no data is transferred. If the data differs, only the block that has changed on the source is transferred to the target. This saves an immense amount of time and network bandwidth for remote syncs.

For example, when I first used my rsync Bash script to back up all of my hosts to a large external USB hard drive, it took about three hours because all of the data had to be transferred. Subsequent syncs took 3-8 minutes of real time, depending upon how many files had been changed or created since the previous sync. I used the time command to determine this, so it is empirical data. Last night, for example, it took just over three minutes to complete a sync of approximately 750GB of data from six remote systems and the local workstation. Of course, only a few hundred megabytes of data were actually altered during the day and needed to be synchronized.
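As a quick illustration, you can reproduce this effect with the time command and the simple rsync command introduced below; the paths here are hypothetical stand-ins for a real source and backup target.

time rsync -aH /home/user/ /backup/user/    # first run: all data is copied, so it is slow
time rsync -aH /home/user/ /backup/user/    # second run: unchanged files are skipped, so it is fast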

The following simple rsync command can be used to synchronize the contents of two directories and any of their subdirectories. That is, the contents of the target directory are synchronized with the contents of the source directory so that at the end of the sync, the target directory is identical to the source directory.

rsync -aH sourcedir targetdir

The -a option is for archive mode, which implies recursion and preserves permissions, ownership, timestamps, and symbolic (soft) links. The -H option preserves hard links. Note that either the source or the target directory can be on a remote host.
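For example, either end of the transfer can be written with the host:path syntax; remote1 here is a hypothetical hostname.

rsync -aH sourcedir remote1:targetdir    # push to a remote host
rsync -aH remote1:sourcedir targetdir    # pull from a remote host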

Now let's assume that yesterday we used rsync to synchronize two directories. Today we want to resync them, but we have deleted some files from the source directory. By default, rsync simply copies all the new or changed files to the target location and leaves the deleted files in place on the target. This may be the behavior you want, but if you would prefer that files deleted from the source also be deleted from the target, you can add the --delete option to make that happen.
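With that option added, the basic command becomes the following.

rsync -aH --delete sourcedir targetdir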

Another interesting option, and my personal favorite because it increases the power and flexibility of rsync immensely, is the --link-dest option. The --link-dest option allows a series of daily backups that take up very little additional space for each day and also take very little time to create.

Specify the previous day's target directory with this option and a new directory for today. rsync then creates today's new directory, and for each file in yesterday's directory a hard link is created in today's directory. So we now have a bunch of hard links to yesterday's files in today's directory; no new files have been created or duplicated, just a bunch of hard links. (Wikipedia has a very good description of hard links.) After creating today's target directory with this set of hard links to yesterday's target directory, rsync performs its sync as usual, but when a change is detected in a file, the target hard link is replaced by a copy of the file from yesterday and the changes to the file are then copied from the source to the target.

So now our command looks like the following.

rsync -aH --delete --link-dest=yesterdaystargetdir sourcedir todaystargetdir
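If you want to see for yourself that the unchanged files really are hard links rather than copies, ls -i prints inode numbers; a file that did not change between the two backups shows the same inode number in both directories.

ls -i yesterdaystargetdir/somefile todaystargetdir/somefile    # identical inodes mean one file, two links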

There are also times when it is desirable to exclude certain directories or files from being synchronized. For this, there is the --exclude option. Use this option with the pattern for the files or directories you want to exclude. You might want to exclude browser cache files, so your new command will look like this.

rsync -aH --delete --exclude Cache --link-dest=yesterdaystargetdir sourcedir todaystargetdir

Note that each file pattern you want to exclude must have a separate exclude option.
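For example, to exclude temporary files as well as the browser cache, repeat the option; the *.tmp pattern here is just an illustration.

rsync -aH --delete --exclude Cache --exclude '*.tmp' --link-dest=yesterdaystargetdir sourcedir todaystargetdir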

rsync can sync files with remote hosts as either the source or the target. For the next example, let's assume that the source directory is on a remote computer with the hostname remote1 and the target directory is on the local host. Even though SSH is the default communications protocol used when transferring data to or from a remote host, I always add the -e ssh option to make that explicit. The command now looks like this.

rsync -aH -e ssh --delete --exclude Cache --link-dest=yesterdaystargetdir remote1:sourcedir todaystargetdir

This is the final form of my rsync backup command.

rsync has a very large number of options that you can use to customize the synchronization process. For the most part, the relatively simple commands that I have described here are perfect for making backups for my personal needs. Be sure to read the extensive man page for rsync to learn about more of its capabilities as well as the options discussed here.

Performing backups

I automated my backups because of the tenet "automate everything." I wrote a Bash script that handles the details of creating a series of daily backups using rsync. This includes ensuring that the backup medium is mounted, generating the names for yesterday's and today's backup directories, creating appropriate directory structures on the backup medium if they are not already there, performing the actual backups, and unmounting the medium.
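My actual script is linked below, but a minimal sketch of that same sequence of steps might look something like this; the device name, mount point, and backup paths are assumptions for illustration, not the values from my script.

#!/bin/bash
# Minimal sketch of a daily rsync backup; device and paths are assumptions.
BACKUP_DEV=/dev/sdb1          # the backup medium
BACKUP_MNT=/media/backup      # where it gets mounted
SOURCE=/home                  # the data to back up
TODAY=$(date +%Y-%m-%d)
YESTERDAY=$(date -d yesterday +%Y-%m-%d)

# Ensure the backup medium is mounted
mountpoint -q "$BACKUP_MNT" || mount "$BACKUP_DEV" "$BACKUP_MNT" || exit 1

# Create the backup directory structure if it is not already there
mkdir -p "$BACKUP_MNT/backups"

# Hard-link against yesterday's backup if one exists
LINKDEST=""
if [ -d "$BACKUP_MNT/backups/$YESTERDAY" ]; then
    LINKDEST="--link-dest=$BACKUP_MNT/backups/$YESTERDAY"
fi

# Perform the actual backup
rsync -aH --delete $LINKDEST "$SOURCE" "$BACKUP_MNT/backups/$TODAY"

# Unmount the medium when finished
umount "$BACKUP_MNT"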

I run the script as a cron job early every morning to ensure that I never forget to perform my backups.
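The crontab entry for that might look something like the following; the time and the script path are hypothetical.

# Run the backup script at 01:00 every morning
0 1 * * * /usr/local/bin/rsbu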

My script, rsbu, and its configuration file, rsbu.conf, are available at https://github.com/opensourceway/rsync-backup-script

Recovery testing

No backup regimen would be complete without testing. You should regularly test recovery of random files or entire directory structures to ensure not only that the backups are working, but that the data in the backups can actually be recovered for use after a disaster. I have seen too many instances where a backup could not be restored for one reason or another, and valuable data was lost because the lack of testing prevented discovery of the problem.

Just select a file or directory to test and restore it to a test location such as /tmp so that you won't overwrite a file that may have been updated since the backup was performed. Verify that the files' contents are as you expect them to be. Restoring files from a backup made using the rsync commands above is simply a matter of finding the file you want to restore in the backup and then copying it to the location where you want to restore it.
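For example, restoring and checking a single file might look like this; the backup and file paths are hypothetical.

cp /media/backup/backups/2017-01-15/home/user/somefile /tmp/    # restore to a safe test location
diff /tmp/somefile /home/user/somefile                          # verify the contents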

I have had a few circumstances where I have had to restore individual files and, occasionally, a complete directory structure. Most of the time this has been self-inflicted when I accidentally deleted a file or directory. At least a few times it has been due to a crashed hard drive. So those backups do come in handy.

The last step

But just creating the backups will not save your business. You need to make regular backups and keep the most recent copies at a remote location, one that is not in the same building or even within a few miles of your business location, if at all possible. This helps to ensure that a large-scale disaster does not destroy all of your backups.

A reasonable option for most small businesses is to make daily backups on removable media and take the latest copy home at night. The next morning, take an older backup back to the office. You should have several rotating copies of your backups. Even better would be to take the latest backup to the bank and place it in your safe deposit box, then return with the backup from the day before.

David Both
David Both is an Open Source Software and GNU/Linux advocate, trainer, writer, and speaker. He has been working with Linux and Open Source Software since 1996 and with computers since 1969. He is a strong proponent of and evangelist for the "Linux Philosophy for System Administrators."

21 Comments

Great article, David! I love rsync, and though I haven't used it lately, we did use it when I was a system administrator to back up our Lotus Domino server, which ran Red Hat Enterprise. I too wrote a bash script to do that, and it worked flawlessly. We did experience a catastrophic disk failure once, and with the backups we had, we were well prepared to restore our email data intact.

If you want to keep several days' worth of backups, your storage requirements will grow dramatically with this approach. A tool called rdiff-backup, based on rsync, gets around this issue.

Agreed, I use rdiff-backup because I found my rsync backups were getting cluttered with old files, and sometimes the lack of versioned backups was problematic. I'm a big fan of rdiff-backup. I don't think it actually leverages rsync, as such, but librsync. It's a great tool.

In reply to WRD (not verified)

I think that rsync is great but tools like dar, attic, bup, rdiff-backup or obnam are better. I use obnam, as it does "snapshot backups" with deduplication.
Full disclosure: I'm part of the project (translator of the manual).

In reply to WRD (not verified)

Thanks for sharing.
What's the code you are using to handle the yesterday/today directories?
Show us the script!))

I will be happy to share my script as it is GPL'ed. I just need to make it available for download and I will add another comment to this article when it is.

Thanks for your interest.

In reply to Vicvicvic (not verified)

I would like to comment on off-site backups. Back in the 1970s I was an IT manager, and I set up a backup and recovery system for the company I worked for. Initially I set up the off-site backup as is recommended in this article. The latest backup was shipped overnight by company courier to an off-site location in a city 90 miles away. The most recent on-site backup was the -1 generation, which was brought back by the courier before 8:00 am.

This system had a big problem on the recovery side. When we needed to recover a file, we either had to send somebody on a 180-mile round trip to get the backup or else restore the -1 backup and rerun everything necessary to bring the recovered file up to date.

After a couple of long delays caused by long recovery times from production failures I modified the system. We duplicated the various backup jobs and sent one current backup off-site and kept one on site. That modification worked significantly better than the original backup and recovery system design.

------------------------------
Steve Stites

Thanks for posting your code, David. I'm especially interested in how you worked out a clean-up routine for older archives. I have a similar bash/rsync script that uses hardlinks as well. Very similar results but not as clean-looking as yours... and I need to manually go in and remove older backups to free up space.... I've got some studying to do!

Great article! I would just like to point out: if you are working with servers that are not under your control and plan to use rsync with the --delete option, make sure that they store the same data. If one of them stores only new data and deletes data that has already been synced, using the --delete option might be a catastrophe.

Cheers!

rsnapshot... it uses rsync for the backup; very good setup.

The only real trouble with the underlying rsync is that it can still take considerable time with large file systems and remote systems over slow links. You might want to think about snapshots as well (not just the rsnapshot ones, but file system ones).

We can also mention lsyncd, a sort of "live rsync" that syncs in real time, opening the road to other uses of sync.

Many years ago I was using a tool called SnapBack2, which basically used rsync in the manner you described, complete with the rotating directories of snapshots on a daily, weekly, and monthly basis.

I fell out of practice with using it, but I wrote up a how-to for setting it up, and I should probably follow it again: http://www.gbgames.com/blog/2005/05/snapback2-how-to/

I use a combination of DropBox and SpiderOak to backup various things offsite, but not everything I care about is backed up to those services, and I shouldn't tempt fate by delaying the setup of an automated backup system any more.

When rsync-ing to a USB (or any other) hard drive, all the data on both disks still has to be read. The speed-up can come from "remembering" what was changed since the last write to the disk.

I've been using rsync to back up servers since 2010. It's a great way to do backups, and it is very cost efficient compared to commercial backup systems. I use it to back up Linux and Windows machines, and I even used it to back up Netware 6.5 boxes years ago (that was interesting, given how weak Netware's command-line processing was).

I have 4 old IBM servers with removable SATA drives (2-4 TB each) that I use as the destination backup servers. I swap the SATA drives each morning. Using 4 servers helps spread the load when the backups run, although I do have to manually specify which backup goes to which server when I write the script. (That does cause a bit of a problem when I do restores, since I have to remember where a given file or directory was backed up to. So there are drawbacks to this method.)

It also costs a pretty penny to buy enough SATA drives and swap trays/carriers to have 2 weeks' worth of Monday-through-Thursday disks, and another 9 disks for Fridays (2 months' worth of Fridays), and then to do that for all 4 backup servers. But IMHO it's worth it, since that gives many copies of (most of) the files.

An added bonus is that the SATA backup disks can be mounted in any Linux box and the files retrieved without needing any software (or license keys) other than the normal Linux tools. No more waiting to retrieve the file list from some huge catalog file; just copy the needed files from the proper directory and the restore is done.

The biggest problem I've run into with rsync is retrying files that are stuck or have failed. rsync has options for this, but IMHO they aren't on par with what is in the commercial backup programs. It's one of those things I've been meaning to go deeper into, but, ehh, it's been 7 years and I haven't yet.

I've been using rsync with hard links and rotating directories for over a decade now. I wrote a Bash script that runs the backup automatically when I insert a USB drive, or it can back up to a local drive via a cron job: simply create a symlink in /etc/cron.daily to the script executable. A person can either install it using make (shudder) or build it as an RPM and install that. It's well documented and easy to configure for what directories should be backed up, when, as well as how many rotations should be retained.

The code used to be on SourceForge (shudder, again) and Google Code, but I just uploaded it to my GitHub account, here:

https://github.com/yocum137/ddback

Enjoy!

Great work David, great content!

Great article, and Btrfs can be an ideal complement, doing even better than the hard links managed by rsync. I have been experimenting for several months with rsync and Btrfs for my home systems.
- rsync is responsible for optimized file transfer.
- Btrfs is responsible for optimized storage, with high-level features directly usable for standard backup tasks.

Daily backup images are made with the snapshot and deduplication features (subvolumes). FS corruption is very unlikely because the copy-on-write feature natively and efficiently plays the role of journaling. Compression makes it possible to avoid ZIP files, and the backup structure is a mirror of the source, as you propose in your article. You can also easily build highly available backup storage by adding disks; Btrfs natively knows how to organize them as a RAID 10 cluster (or other RAID types). There are many other great features in Btrfs. With these two tools you can do what the big backup tools do, in a cheap and simple way.

I also like LSyncd, which can do a real-time sync, or you can delay the sync to a designated time span. I have one server where archive materials are stored, and LSyncd syncs these files to a second server as a backup, provided the second server is on the network. This is a use case where immediate syncs work very well.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.