EBS is a block storage service offered by AWS. If you are running an EC2 instance, you are almost certainly using it, because it acts as the storage disk for your server. However, it is not immune to failure, and you should still make regular backups.
“Fault-tolerant” does not mean safe
Of course, EBS is quite fault tolerant on the backend. AWS isn't a bunch of cowboys running a JBOD array; they have planned for individual device failures, so a single bad drive will not take out your server.
However, EBS failures can and do happen: EBS volumes have an annual failure rate (AFR) of between 0.1% and 0.2%. That is very low compared to a single hard drive's ~4%, but it is not nothing. Your particular EBS volume is unlikely to fail on you, but if you run lots of them, chances are you will run into a problem here and there.
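To put that AFR in perspective, here is a quick back-of-the-envelope calculation, using the top of the published 0.1%–0.2% range and assuming failures are independent:

```python
# Probability of at least one EBS volume failing within a year,
# assuming independent failures at a 0.2% annual failure rate (AFR).
AFR = 0.002

def p_at_least_one_failure(volumes: int, afr: float = AFR) -> float:
    # P(no failures) = (1 - afr)^n, so P(at least one) is its complement.
    return 1 - (1 - afr) ** volumes

print(f"{p_at_least_one_failure(1):.4%}")    # one volume: ~0.2%
print(f"{p_at_least_one_failure(100):.1%}")  # a fleet of 100: ~18%
```

One volume is a rounding error; a fleet of a hundred makes a failure somewhere in a given year more likely than not over a few years.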
The simple fix, of course, is to make backups. EBS provides a good tool for this: the snapshot feature. You can create a snapshot that acts as a backup stored in S3, which is far more durable. In the event of an EBS failure, you can restore from the snapshot. You do not even need to automate this yourself, as the EBS Lifecycle Manager can handle it for you, though it is not enabled by default. You do have to pay extra to store the data in S3, but snapshot storage is cheaper than EBS.
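As a rough illustration of "cheaper than EBS", compare the monthly cost of 100 GB at illustrative us-east-1 list prices (roughly $0.08/GB-month for gp3 volumes and $0.05/GB-month for standard snapshot storage; check current AWS pricing before relying on these numbers):

```python
# Rough monthly cost comparison for 100 GB of data.
# Prices are illustrative approximations, not authoritative figures.
GB = 100
EBS_GP3_PER_GB = 0.08    # ~gp3 volume storage, $/GB-month
SNAPSHOT_PER_GB = 0.05   # ~standard snapshot storage, $/GB-month

ebs_cost = GB * EBS_GP3_PER_GB
snapshot_cost = GB * SNAPSHOT_PER_GB
print(f"EBS: ${ebs_cost:.2f}/mo, snapshots: ${snapshot_cost:.2f}/mo")
```

And because snapshots are incremental, the real snapshot bill is usually lower still, since you only store changed data.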
AWS does not try to hide this fact, and recommends that you take regular snapshot backups. Most people would recommend making backups in general, but it's easy to get caught up in the magic of the cloud and forget about it. At the end of the day, the cloud is just someone else's computer, and it can fail like any other. An extreme example of this came in September 2019, when an AWS US-EAST-1 data center had a power outage followed by a generator failure, which took out EBS servers and the data on them.
Amazon AWS had a power outage, their backup generators failed, which killed their EBS servers, which took all of our data with it. Then it took four days to figure this out and tell us about it.
Reminder: The cloud is just a computer in Reston with a poor power supply.
– Andy Hunt (@PragmaticAndy) September 3, 2019
The primary driving force behind high-availability architecture, and cloud computing in general, is to ensure that when isolated failures inevitably occur, they do not take down the entire application. You should still take steps to prevent failures in the first place, but sometimes, as with hard drives, it's a hardware problem, not something you can fix with code.
S3, on the other hand, is very durable, at 99.999999999% (that's eleven nines). If you store 10,000,000 objects in S3, you can on average expect to lose a single object once every 10,000 years. This is because, unlike EBS, S3 is fully replicated across at least three Availability Zones and is constantly monitored for drive failures within each zone. Even if an entire data center goes up in flames, your S3 buckets, and the snapshots in them, should still be safe.
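The "one object every 10,000 years" figure falls straight out of the durability number:

```python
# Expected object loss under S3's eleven-nines design durability.
p_loss_per_object_year = 1e-11     # i.e., 99.999999999% durability
objects = 10_000_000

expected_losses_per_year = objects * p_loss_per_object_year
years_per_single_loss = 1 / expected_losses_per_year
print(round(years_per_single_loss))  # 10000
```

Ten million objects times a one-in-a-hundred-billion annual loss probability works out to one expected loss every ten millennia.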
How do EBS snapshots work?
EBS snapshots are incremental backups. Each subsequent backup only stores the data that has changed since the last one, so you will not pay crazy storage costs for taking regular snapshots.
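A toy model shows why incremental backups stay cheap: the second snapshot only needs new storage for blocks that actually changed. (This is purely illustrative; EBS tracks changed blocks itself and you never do this by hand.)

```python
# Toy model of incremental snapshots: count how many blocks the second
# snapshot must actually store, i.e. those that differ from the first.
def changed_blocks(previous: list[bytes], current: list[bytes]) -> int:
    return sum(1 for old, new in zip(previous, current) if old != new)

snapshot_1 = [b"block-a", b"block-b", b"block-c", b"block-d"]
snapshot_2 = [b"block-a", b"block-B2", b"block-c", b"block-d"]

print(changed_blocks(snapshot_1, snapshot_2))  # 1 -- only one block re-stored
```

Unchanged blocks are simply referenced from the earlier snapshot, which is why a daily snapshot of a mostly-static volume costs almost nothing beyond the first full copy.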
Turning them on is pretty easy. From the EC2 console, go to Elastic Block Store > Lifecycle Manager in the sidebar and create a new policy.
You can set the schedule for the policy as well as the snapshot retention policy. You usually do not need backups going far back, so keeping a handful of snapshots, depending on how often you take them, should be fine.
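If you would rather script this than click through the console, the same "snapshot daily, keep seven" setup can be expressed as a Data Lifecycle Manager policy document. A sketch of the `PolicyDetails` payload you would pass to the DLM API (the tag key/value and schedule name are illustrative, not required values):

```python
# Sketch of a DLM policy: snapshot every volume carrying the (illustrative)
# Backup=true tag daily at 03:00 UTC and retain the 7 most recent snapshots.
# You would pass this as PolicyDetails to
# boto3.client("dlm").create_lifecycle_policy(...), along with a
# description, a state, and an IAM execution role ARN.
policy_details = {
    "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Backup", "Value": "true"}],  # illustrative tag
    "Schedules": [
        {
            "Name": "DailySnapshots",
            "CreateRule": {
                "Interval": 24,
                "IntervalUnit": "HOURS",
                "Times": ["03:00"],
            },
            "RetainRule": {"Count": 7},  # keep the last 7 snapshots
            "CopyTags": True,
        }
    ],
}
print(policy_details["Schedules"][0]["RetainRule"]["Count"])  # 7
```

The `RetainRule` count is the knob from the paragraph above: with daily snapshots, a count of 7 gives you a rolling week of restore points.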
If you are serious about high availability, you can also enable Fast Snapshot Restore, which delivers restored volumes fully initialized, with no first-access latency. It is quite expensive, though, so it is not something everyone should turn on.