Automating Incremental Backups (to AWS)

We should all have some sort of off-site backup of our data. My approach involved writing a script, which I'll walk through in this article. Perhaps others will find it useful!

Background

I run a Network-Attached Storage (NAS) computer at home using ZFS on Linux.

ZFS supports quick and easy snapshotting thanks to its copy-on-write design, and it can output a diff between any two arbitrary snapshots. This is how the incremental nature of my backups works!
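
For example, an incremental stream containing only the changes between two snapshots can be produced with zfs send (the dataset and snapshot names here are placeholders):

# Full stream of everything up to a snapshot:
zfs send tank1/my_dataset@snap-old > baseline.stream

# Incremental stream of just the changes between two snapshots:
zfs send -i tank1/my_dataset@snap-old tank1/my_dataset@snap-new > incremental.stream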

Requirements

I have goals for these backups that lead me to conclude I must roll my own solution. Here are some of those goals:

Self-Encryption

I have no intention of trusting claims like...

Your data is private and encrypted by us and we promise we will never look at it!

or even...

We encrypt it with your encryption key and only you will ever know what the key is. Therefore, your data is private!

Even setting aside the wishful thinking, I will never be able to fully vet (and keep re-vetting, with every update!) that their software isn't saving my key or doing anything else nefarious. So my approach is designed in a way that I'd be comfortable sending my backups straight to <insert your adversarial organization of choice here>!

Fault-Tolerant

Let's assume the cloud provider can suffer bit rot that they can't or don't fix. We must handle that ourselves.

Compatible

Perhaps it's a legacy mindset, but you never know if a cloud provider will impose a maximum file size low enough to be a problem. So, to be safe, I'm following the lead of FAT32's maximum file size and splitting my backups into 4GiB files.

Also, none of the tools used to create/read the backup should be proprietary to any closed-source platform.

Considerations

I have about 10TB of data to back up and that is sure to grow.

Internet

Your internet connection can be a bottleneck, both in speed and in whether you have a data cap. I am fortunate enough to have symmetric fiber internet with no cap, but if your internet access is limited, this approach may not be suitable for you. Some providers let you physically ship hard drives containing your data instead, which is worth considering.

Storage (local)

In my design, we're creating an entire backup of the data right on the system we're backing up. This requires the system to have as much free space available as the size of the dataset that we are backing up. If you're backing up everything at once, this means your system’s capacity needs to be double the size of your data.

However, you can plug in one or more external drives as a staging ground for the backup files. This could even double as your on-site backup!

I also suggest separating your data into logical sets like "cameras", "media", etc. This allows you to snapshot individual datasets and only be limited to needing enough free space to match your largest dataset.

Storage (cloud)

Let's spend as little as possible (of course!). Generally, the cheapest tier is called cold storage and is cheaper because they remove the drives (maybe even tape?) and shelve them until you request the data. The drawback is that you have to wait for your data to become available, which is usually counted in hours. But is that a problem?

Ideally, I will never need to download this backup. And if something catastrophic happens to my NAS, I will probably have a day or more of lead time while I wait for replacement hardware to be delivered. Likewise, the cold storage tier needs several hours of lead time to bring the drives back online before I can begin downloading. As long as I coordinate those two lead times, the access delay of cold storage becomes a non-issue.

Here's the comparison of the main contenders.

Provider       Tier                   Storage ($/TB/yr)   Download ($/TB)
AWS            Glacier Deep Archive   $11.88              $2.50
Azure          Archive                $11.88              $20.00
Google Cloud   Archive                $14.40              $50.00
Backblaze      B2                     $60.00              $10.00

AWS and Azure are both the cheapest for storage, but the download pricing gives AWS the win.

"But you only compared four!"

Yes, I was not exhaustive in my search, but I have checked other providers like IDrive and Carbonite. Their pricing models are more of a flat rate with a storage cap. IDrive for example wants $99.50/yr for 10TB ($9.95/TB/yr) or $199.50/yr for 20TB ($9.98/TB/yr).

This comes out to be cheaper than AWS or Azure if you want to use exactly their limit. But there is no mention of prorated pricing! Needing to pay for 20TB of storage for, say, 10.1TB of data does not sit well with me.

Overview

Self-Encryption

Let's encrypt the data ourselves. We can use OpenSSL to encrypt it using a command like:

openssl enc -aes-256-cbc -md sha512 -pbkdf2 -iter 250000 -pass pass:my_password
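
Decryption later is the same command with -d added and the same parameters:

openssl enc -d -aes-256-cbc -md sha512 -pbkdf2 -iter 250000 -pass pass:my_password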

Fault-Tolerant

Par2 is a tool that generates parity files, which can later be used to repair a damaged file. Let's generate them for each of our backup files using a command like:

par2create -r5 -n1 -m3000 -q my_filename

Which would generate two files:

my_filename.par2
my_filename.vol000+100.par2
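
With -r5, up to roughly 5% of the file's blocks can be reconstructed. If bit-rot does strike, those parity files are used to verify and repair the original:

par2 verify my_filename.par2
par2 repair my_filename.par2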

Compatible

The backups should be split into 4GiB files. We can simply use the Linux split command:

split -b 4G --suffix-length=6 - my_filename_prefix_

Which would generate files like:

my_filename_prefix_aaaaaa
my_filename_prefix_aaaaab
my_filename_prefix_aaaaac
...
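
Because the suffixes sort lexicographically, a restore just concatenates the parts back together, pipes them through the decryption command above, and hands the stream to zfs receive. A minimal sketch (names are placeholders, and an incremental stream must be received on top of its matching earlier snapshot):

cat my_filename_prefix_* \
    | openssl enc -d -aes-256-cbc -md sha512 -pbkdf2 -iter 250000 -pass pass:my_password \
    | zfs receive tank1/my_dataset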

Uploading

Finally, we need a command to upload the files to AWS, which can look like:

s3 sync --exclude "*" \
        --include "my_filename_prefix_*" \
        my_local_directory \
        s3://path/to/incremental/directory \
        --storage-class DEEP_ARCHIVE
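
One caveat with DEEP_ARCHIVE: objects must be restored to a readable tier before they can be downloaded, which is where the hours of lead time mentioned earlier come in. A restore request for a single object looks something like this (the bucket and key are placeholders):

aws s3api restore-object \
    --bucket my_aws_bucket \
    --key cameras/incrementals/F20230719-T20230726/cameras-F20230719-T20230726-part-aaaaaa \
    --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}'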

Design

Naming and file structure

Let's have sensible names/prefixes to make this work. The date format goes from broad to specific, i.e. year, then month, then day (the only order that should exist).

For the ZFS snapshots, I settled on the name format offsite-YYYYMMDD.
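
Creating a snapshot in that format is a one-liner (the dataset name is a placeholder):

zfs snapshot tank1/my_dataset@offsite-$(date +%Y%m%d)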

The incremental backup would need to specify the snapshot date range that it pertains to, so for a given dataset my_dataset, the prefix format is my_dataset-FYYYYMMDD-TYYYYMMDD-part-. Where F is for "from" and T is for "to".

On AWS, there are two distinct directories to maintain per dataset.

Incremental backups need a baseline backup to increment on:

my_aws_bucket/my_dataset/baseline-YYYYMMDD/

Then the increments themselves go into:

my_aws_bucket/my_dataset/incrementals/FYYYYMMDD-TYYYYMMDD/

Altogether, here's a realistic example with actual dates:

ZFS Snapshots

$ zfs list -t all | grep "cameras.*offsite"
tank1/cameras@offsite-20201123
tank1/cameras@offsite-20220610
tank1/cameras@offsite-20220620
...
tank1/cameras@offsite-20230726

Incremental backup files and their parity files

$ ls -alh | grep cameras
-rw-r--r--  1 root root 4.0G Jul 26 15:01 cameras-F20230719-T20230726-part-aaaaaa
-rw-r--r--  1 root root  40K Jul 26 15:02 cameras-F20230719-T20230726-part-aaaaaa.par2
-rw-r--r--  1 root root 206M Jul 26 15:02 cameras-F20230719-T20230726-part-aaaaaa.vol000+100.par2
-rw-r--r--  1 root root 3.9G Jul 26 15:01 cameras-F20230719-T20230726-part-aaaaab
-rw-r--r--  1 root root  40K Jul 26 15:03 cameras-F20230719-T20230726-part-aaaaab.par2
-rw-r--r--  1 root root 199M Jul 26 15:03 cameras-F20230719-T20230726-part-aaaaab.vol000+100.par2

On AWS S3

Baseline files:

$ s3 ls my_bucket/cameras/baseline-20201123/
2020-11-25 23:25:10          0
2020-11-27 20:09:56 4294967296 cameras-T20201123-part-aaaaaa
2020-11-27 20:09:56      40428 cameras-T20201123-part-aaaaaa.par2
2020-11-27 20:09:56  215037572 cameras-T20201123-part-aaaaaa.vol000+100.par2
2020-11-27 20:09:58 4294967296 cameras-T20201123-part-aaaaab
2020-11-27 20:09:56      40428 cameras-T20201123-part-aaaaab.par2
2020-11-27 20:09:56  215037572 cameras-T20201123-part-aaaaab.vol000+100.par2
2020-11-27 20:09:56 4294967296 cameras-T20201123-part-aaaaac
2020-11-27 20:10:16      40428 cameras-T20201123-part-aaaaac.par2
2020-11-27 20:10:18  215037572 cameras-T20201123-part-aaaaac.vol000+100.par2
...

Incremental files:

$ s3 ls my_bucket/cameras/incrementals/F20230719-T20230726/
2023-07-26 22:03:55 4294967296 cameras-F20230719-T20230726-part-aaaaaa
2023-07-26 22:03:55      40436 cameras-F20230719-T20230726-part-aaaaaa.par2
2023-07-26 22:03:55  215037628 cameras-F20230719-T20230726-part-aaaaaa.vol000+100.par2
2023-07-26 22:03:55 4152776288 cameras-F20230719-T20230726-part-aaaaab
2023-07-26 22:03:55      40436 cameras-F20230719-T20230726-part-aaaaab.par2
2023-07-26 22:03:55  207928428 cameras-F20230719-T20230726-part-aaaaab.vol000+100.par2

Script Procedure

It is a relatively straightforward operation (sketched in bash after this list):

  1. Find the latest backup that is on AWS (date A).

  2. Create a new snapshot (date B).

  3. Export the incremental stream from date A to date B.

    1. Pipe to openssl for encryption.

    2. Pipe to split to become 4GiB files.

  4. Run par2create on each split file.

  5. Upload all of it to AWS.
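
Here's a bash sketch of that pipeline, with the dataset name, dates, bucket path, and password hard-coded as placeholders (the real script figures out date A from AWS and reads its config for the password):

#!/usr/bin/env bash
set -euo pipefail

DATASET="tank1/cameras"
PREV="20230719"                 # date A: latest snapshot already on AWS
TODAY="$(date +%Y%m%d)"         # date B
PREFIX="cameras-F${PREV}-T${TODAY}-part-"

# 2. Create a new snapshot (date B)
zfs snapshot "${DATASET}@offsite-${TODAY}"

# 3. Export the incremental stream, encrypt it, and split it into 4GiB parts
zfs send -i "${DATASET}@offsite-${PREV}" "${DATASET}@offsite-${TODAY}" \
    | openssl enc -aes-256-cbc -md sha512 -pbkdf2 -iter 250000 -pass pass:my_password \
    | split -b 4G --suffix-length=6 - "${PREFIX}"

# 4. Generate parity files for each part (the glob expands before the loop runs)
for f in "${PREFIX}"*; do
    par2create -r5 -n1 -m3000 -q "$f"
done

# 5. Upload the parts and their parity files to AWS
aws s3 sync --exclude "*" --include "${PREFIX}*" . \
    "s3://my_aws_bucket/cameras/incrementals/F${PREV}-T${TODAY}/" \
    --storage-class DEEP_ARCHIVE

The baseline backup is the same pipeline minus the -i flag (with the T-only naming shown earlier).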

Automating

We'll just use crontab. I set mine to run weekly, on Wednesdays at 15:00:

0 15 * * 3 bash -lc "/path/to/backup.sh"

Storing the encryption keys

We don't want to put our passwords directly into the command, so the script reads them from a JSON file passed in with the --config flag.

The JSON format is:

{
    "my_dataset_A": {
        "pass": "my_password"
    },
    "my_dataset_B": {
        "pass": "my_password2"
    }
}

WARNING: Take special care to escape any special characters in your password!
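
For reference, one way a script could pull a password out of that file is with jq, and handing it to openssl through an environment variable keeps it off the command line (the key name and path are placeholders):

# Read one dataset's password from the config (requires jq)
PASS="$(jq -r '.my_dataset_A.pass' /path/to/config.json)"

# env: keeps the password out of the process list, unlike pass:
export BACKUP_PASS="$PASS"
openssl enc -aes-256-cbc -md sha512 -pbkdf2 -iter 250000 -pass env:BACKUP_PASS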

And that's all for now! Visit the GitHub project to explore the script yourself!