treehash.py – Amazon S3 Glacier Tree Hash
I am planning to implement a bare-bones system to back up pictures, videos, financial records, and software development projects. Among Google, Microsoft, and Amazon, I find the long-term storage services offered by Amazon to be cost effective at $0.0036 per GB per month, or $3.60/mo for 1 TB. Amazon S3 Glacier also provides a mechanism for fetching an inventory of the files uploaded to the service. This inventory includes a tree hash, or checksum, for each “archive” that is uploaded. (At $0.00099 per GB per month, or $0.99/mo for 1 TB, an even more cost-effective alternative is the S3 Glacier Deep Archive storage class for data stored in Amazon S3 buckets. Amazon S3 Glacier differs from Amazon S3 with the S3 Glacier Deep Archive storage class in that the former deals with vaults and archives, whereas the latter deals with buckets and storage classes. Unfortunately, the Amazon S3 service does not provide a reliable mechanism for retrieving checksum data.)
Amazon claims their S3 services achieve 99.999999999% (“eleven nines”) durability. I doubt that I can achieve the same level of durability on my own. As part of my backup system, I need to periodically check for differences between my local files and my backups. To confirm that my local copies are identical to the archives uploaded to Amazon Web Services, I implemented a standalone Python script that generates the Amazon S3 Glacier tree hash checksum: treehash.py.
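To make the periodic check concrete, here is a minimal sketch of how a vault inventory might be compared against locally computed tree hashes. It relies on the ArchiveList, ArchiveDescription, and SHA256TreeHash fields of the JSON inventory that Glacier returns. The function name find_mismatches and the local_hashes dictionary are hypothetical, and the sketch assumes that each archive was uploaded with its local file name as the archive description, which may not match every setup.

```python
import json


def find_mismatches(inventory_json, local_hashes):
    """Compare a Glacier vault inventory with locally computed tree hashes.

    inventory_json: text of a vault inventory in JSON format.
    local_hashes:   dict mapping a file name to its hex tree hash
                    (hypothetical; computed separately).
    Assumption: ArchiveDescription holds the local file name.
    """
    inventory = json.loads(inventory_json)
    remote = {
        archive["ArchiveDescription"]: archive["SHA256TreeHash"].lower()
        for archive in inventory["ArchiveList"]
    }
    local = {name: digest.lower() for name, digest in local_hashes.items()}

    # Present on both sides, but the checksums disagree.
    changed = sorted(n for n in remote if n in local and remote[n] != local[n])
    # Listed in the inventory, but no local hash was supplied.
    missing_locally = sorted(n for n in remote if n not in local)
    # Hashed locally, but absent from the inventory.
    not_uploaded = sorted(n for n in local if n not in remote)
    return changed, missing_locally, not_uploaded
```

All three returned lists being empty would suggest that the local files and the vault agree, at least as of the date the inventory was generated.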
treehash.py can be run from the command line as follows:
python3 treehash.py inputfile.bin
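For readers unfamiliar with the checksum itself, the following is a minimal sketch of the tree hash calculation as Amazon documents it, not necessarily how treehash.py is written. The input is split into 1 MiB chunks, each chunk is hashed with SHA-256, and adjacent digests are concatenated and hashed again until a single root digest remains. An unpaired digest at the end of a level is carried up unchanged. The names tree_hash, tree_hash_file, and CHUNK_SIZE are illustrative, and the sketch assumes the stream returns full chunks until end of file, as regular files opened in binary mode do.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # tree hashes are built over 1 MiB chunks


def tree_hash(stream):
    """Return the hex SHA-256 tree hash of a binary file-like object."""
    # Leaf level: one SHA-256 digest per 1 MiB chunk.
    level = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        level.append(hashlib.sha256(chunk).digest())
    if not level:
        # Empty input: use the SHA-256 of zero bytes as the only leaf.
        level = [hashlib.sha256(b"").digest()]

    # Hash adjacent pairs of digests until a single root remains.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            pair = level[i] + level[i + 1]
            next_level.append(hashlib.sha256(pair).digest())
        if len(level) % 2 == 1:
            # An unpaired digest is promoted to the next level as-is.
            next_level.append(level[-1])
        level = next_level
    return level[0].hex()


def tree_hash_file(path):
    """Convenience wrapper: tree hash of the file at the given path."""
    with open(path, "rb") as f:
        return tree_hash(f)
```

For a file of 1 MiB or less there is only one leaf, so the tree hash equals the plain SHA-256 of the file. This makes for an easy sanity check against a tool such as sha256sum.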