A co-worker of mine has a reputation for saying that backups that aren’t regularly tested aren’t actual backups. While this seems a bit harsh, it’s absolutely correct. I present to you (from my home environment) yet another reason why this is true…
As you can see from the above screenshot, I have a file named nc-data-backup-20211221.tar.gz that has two different hashes on two different systems. While this isn’t terribly surprising, what is surprising is that it was successfully copied over from the remote system, validated, and then corrupted while being written to disk. This wasn’t a one-time incident either… it was consistently repeatable, over a dozen times, as I was trying to find the solution.
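For anyone who wants to run the same kind of check, here is a minimal Python sketch of comparing a file on disk against a hash recorded on the source system. The choice of SHA-256 is an assumption (the same idea works with whatever hash tool you use), and `sha256_of` and `verify_copy` are hypothetical helper names of my own for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use flat even for multi-gigabyte archives.
    SHA-256 is an assumption here; any hash both systems agree on will do.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path, expected_hex):
    """Compare the local file's digest with the one recorded on the source."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Run `sha256_of` on the source system, carry the digest over alongside the archive, then call `verify_copy` on the destination after the file has been written. A `False` result means the two copies differ.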
Over the course of testing I was finally able to identify the problematic element (not the actual root cause) as the Cache Mode on my QEMU/KVM disks.
When using the default caching mode (writeback) I experienced a 100% failure rate on large file transfers. I haven’t isolated the actual root cause of this behavior; however, I did test with all the available cache modes, as shown below.
Those tests were mostly intended to give me an idea of the performance impact I would see using other caching modes, and whether I would continue to experience errors. Below are the results.
So, there you have it. Had I not tested my restore process on a new VM, I would have run a high risk of total data loss of all my important documents, photos, etc. in the event of an actual failure. Still a far cheaper price to pay than an enterprise failure, but an important lesson nonetheless.
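If you want to repeat this kind of cache-mode comparison inside your own VM, a rough approximation might look like the Python sketch below. This is not the exact procedure I used; `write_and_verify` is a hypothetical helper, and the file size and chunk size are arbitrary assumptions. It writes random data to disk, hashes it on the way out, forces it to disk, then reads it back and compares the hashes, returning the write time as a crude performance figure.

```python
import hashlib
import os
import time


def write_and_verify(path, size_mib=64, chunk_mib=1):
    """Write random data to `path`, read it back, and compare hashes.

    Returns (matches, seconds_spent_writing). The sizes are illustrative
    assumptions; use a file large enough to resemble your real backups.
    """
    chunk_size = chunk_mib * 1024 * 1024
    written = hashlib.sha256()

    start = time.perf_counter()
    with open(path, "wb") as fh:
        for _ in range(size_mib // chunk_mib):
            chunk = os.urandom(chunk_size)
            written.update(chunk)   # hash the data before it reaches the disk
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())       # ask the OS to push the data to the disk
    elapsed = time.perf_counter() - start

    read_back = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            read_back.update(chunk)

    return written.hexdigest() == read_back.hexdigest(), elapsed
```

Run it once per cache mode and note both the result and the timing. One caveat: the read-back may be served from the guest’s page cache rather than the virtual disk, so a passing result right after the write is weaker evidence than re-hashing the file after a reboot.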