RAID and snapshots of servers

Tags: ACG

We have had a number of questions about the snapshot process on EC2 RAID disks.  Below I hope to address some of the issues raised.

What is RAID?

Let's start with what a RAID set is. A RAID set is a group of disks (2 to 18+) that are controlled and addressed so that they appear to the operating system as a single volume. This can be done in different ways to achieve different results: faster I/O, a larger volume, or increased fault tolerance (though rarely all three at once). All enterprise storage makes use of one form of RAID or another. In the context of EC2 we are only considering operating-system-managed RAID, and only mirrored or striped sets, in line with AWS recommendations.

The core concept of RAID is that disk access is basically slow; by spreading the data across multiple disks, we can perform reads and writes in parallel, which is faster. (I am intentionally not going to discuss the fault tolerance options of RAID here: they are complicated and not particularly relevant to this very basic discussion, because EBS is already highly fault tolerant and OS-level RAID does not add much to that.)

To achieve this, the operating system cuts the data up into chunks (4 to 128 KB depending on the service) and spreads the write and read actions over the set of disks in parallel.  In this simple form, if one of those chunks were lost, you could not reassemble the original data (it would have holes).
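To make the chunking idea concrete, here is a minimal sketch of striping in Python. It is a toy model, not how any real RAID driver is implemented: the function names, the 4-byte chunk size and the two in-memory "disks" are all assumptions chosen for illustration.

```python
def stripe(data: bytes, disks: int, chunk: int) -> list:
    """Cut data into fixed-size chunks and deal them round-robin across disks."""
    layout = [[] for _ in range(disks)]
    for offset in range(0, len(data), chunk):
        layout[(offset // chunk) % disks].append(data[offset:offset + chunk])
    return layout


def reassemble(layout: list) -> bytes:
    """Read the chunks back in the order they were dealt out."""
    rows = max(len(disk) for disk in layout)
    return b"".join(
        disk[row] for row in range(rows) for disk in layout if row < len(disk)
    )


# Toy example: 10 bytes striped over 2 "disks" in 4-byte chunks.
layout = stripe(b"ABCDEFGHIJ", disks=2, chunk=4)
intact = reassemble(layout)

# Simulate losing one chunk on the second disk: the data now has a hole.
layout[1][0] = b"\x00" * 4
damaged = reassemble(layout)
```

With every chunk present, `reassemble` returns the original bytes. Once a single chunk is blanked out, the middle of the data is gone, even though the other disk is perfectly healthy.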

There are options to make RAID sets more fault tolerant, but not to the extent that it would compensate for out-of-sync snapshots.

Snapshot

If you decide to take a snapshot of the RAID set, you need all the snapshots of the individual disks to be in sync so that when you restore the snapshots all of the chunks match up and there are no holes. Imagine the mess if half the data blocks that make up a busy payroll table were out of sync by a few seconds. Yuck!

When you take an AWS snapshot of an EC2 EBS volume, it is normally an action on the 'individual' virtual disks, not the set as a whole. There is no coordination to ensure that blocks are in sync.  Consequently it is almost certain that you will end up with mismatched parts of data when you try to use the snapshots to restore the RAID set. Not good! Try it in the lab.
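The mismatch is easy to reproduce without any AWS resources. The following is a toy simulation, assuming a two-disk striped set held in Python lists and a write that lands between the two per-disk snapshots; the names and the 4-byte chunk size are made up for illustration.

```python
CHUNK = 4  # toy chunk size in bytes (an assumption for this sketch)


def write_striped(disks: list, data: bytes) -> None:
    """Overwrite the striped set with data, dealing chunks round-robin."""
    for disk in disks:
        disk.clear()
    for offset in range(0, len(data), CHUNK):
        disks[(offset // CHUNK) % len(disks)].append(data[offset:offset + CHUNK])


def read_striped(disks: list) -> bytes:
    """Reassemble the data by reading one chunk from each disk in turn."""
    rows = max(len(disk) for disk in disks)
    return b"".join(
        disk[row] for row in range(rows) for disk in disks if row < len(disk)
    )


disks = [[], []]
old = b"old-old-old-old-"
new = b"new-new-new-new-"

write_striped(disks, old)
snap_disk0 = list(disks[0])   # snapshot of disk 0 taken first...

write_striped(disks, new)     # ...the application keeps writing...
snap_disk1 = list(disks[1])   # ...and disk 1 is snapshotted a moment later

restored = read_striped([snap_disk0, snap_disk1])
```

The restored volume is neither the old data nor the new data, but an interleaving of both. Each individual snapshot is valid; only the combination is broken.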

The solution

  1. As of May 2019 AWS offers a consistent snapshot option: crash-consistent snapshots taken across multiple EBS volumes attached to the same instance.
  2. Use a backup service or agent that interacts with the OS to ensure that all chunks are captured in sync.
  3. Stop the instance, or at least deactivate (quiesce) the RAID disks so that the OS brings the volumes into sync and stops writing to them until the backup is complete. This can be done very simply with a bit of scripting, but you want to ensure it is rock solid if you are going to rely on it in production.
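As a rough sketch of how options 1 and 3 might be combined in a script: freeze the filesystem that sits on the RAID set, ask EC2 for a multi-volume snapshot, then thaw. This assumes a Linux instance with the util-linux `fsfreeze` command, root privileges, and a boto3 EC2 client passed in by the caller. The function names, the mount point and the `run` parameter (there only so the command runner can be swapped out for testing) are hypothetical. Treat it as a starting point, not a production-ready backup job.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen(mount_point, run=subprocess.run):
    """Suspend writes to a mounted filesystem for the duration of the block."""
    run(["fsfreeze", "--freeze", mount_point], check=True)
    try:
        yield
    finally:
        # Always thaw, even if the snapshot request fails.
        run(["fsfreeze", "--unfreeze", mount_point], check=True)


def snapshot_raid_set(ec2_client, instance_id, mount_point, run=subprocess.run):
    """Freeze the RAID filesystem, then request snapshots of the data volumes."""
    with frozen(mount_point, run=run):
        response = ec2_client.create_snapshots(
            Description=f"RAID set on {instance_id}",
            InstanceSpecification={
                "InstanceId": instance_id,
                # Assumption: the RAID set is built from data volumes only.
                "ExcludeBootVolume": True,
            },
        )
    return [snap["SnapshotId"] for snap in response["Snapshots"]]


# Example call (not executed here; needs AWS credentials and root):
#   import boto3
#   ids = snapshot_raid_set(boto3.client("ec2"), "i-0123456789abcdef0", "/mnt/raid")
```

Do not point this at the root filesystem, since freezing it can hang the instance. Whatever approach you script, test a full restore in the lab before relying on it.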

I hope that helps clarify some of your questions.

If you are interested in the topic of RAID, there are plenty of good articles on the options, methods and mathematics of RAID solutions.

If you need help, please contact Pluralsight Support.