RAID

Modern disks are remarkably reliable, but nothing lasts forever. What happens to your computer if the disk fails? You have backups, of course, right? But what if you don't want to have even a brief outage?

Most servers use some kind of redundant storage system, the most important of which is RAID[1] (redundant array of independent (originally "inexpensive") disks). RAID improves both reliability and availability. It is based on two ideas:

  • redundancy: By storing data more than once, we can restore it if one of the disks fails.
  • striping: Spread the data over several disks. Reads (and writes) can then proceed simultaneously, which improves throughput, though not necessarily the latency of a single transaction.

RAID comes in various levels:

RAID level 0 is striping only, with zero redundancy. Essentially, N disks are treated like one big disk that is N times as big. (E.g., five 1-TB drives are treated as one 5-TB drive.) By striping the data across the disks, we can increase bandwidth (throughput) because data can be written to and read from multiple disks simultaneously.
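To make striping concrete, here is a minimal sketch of how a logical block number could be mapped to a disk and a position on that disk. The function name locate_block and the simple round-robin rule are assumptions for illustration; real RAID 0 implementations stripe in larger chunks.

```python
def locate_block(logical_block, num_disks):
    """Map a logical block number to (disk index, block offset on that disk).

    Hypothetical round-robin striping: consecutive blocks go to
    consecutive disks, wrapping around after the last disk.
    """
    disk = logical_block % num_disks
    offset = logical_block // num_disks
    return disk, offset


# With five disks, blocks 0..4 land on disks 0..4, then block 5 wraps to disk 0.
for block in range(7):
    print(block, locate_block(block, num_disks=5))
```

Because consecutive blocks sit on different disks, a large sequential read can keep all five disks busy at once.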

RAID level 1 is mirroring. Buy two disks and write everything to both. Read from either disk. Reading can be faster, since two reads can be served concurrently, one from each disk; writing is slightly slower, since data has to be written twice (though the two writes happen in parallel). In general, buy twice as many disks as the amount of storage you want to have. Of course, this doubles the cost of your disk drives.

RAID level 5 is very common. (Note that the number 5 has nothing to do with the number of disks; RAID 5 can be done with a minimum of 3 disks and no maximum.) Picture N+1 disks storing N disks' worth of data. Stripe by pages so that consecutive pages are on different disks. Read in parallel. When you write a page, write it on one disk, and on another disk write a parity block, consisting of the parity bits computed over N consecutive pages. The parity blocks are spread out over all the disks (unlike RAID level 4, which keeps them all on one disk). By a clever trick, we don't need to read all N pages in order to update one; the old data, the new data, and the old parity are enough:

NewParity = (OldData XOR NewData) XOR OldParity

If parity is baffling to you, see the next section.
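The parity update formula above can be checked with a small sketch. The helper names parity_of and update_parity and the tiny two-byte "pages" are made up for illustration; the point is only that the shortcut gives the same parity as recomputing it from all the pages.

```python
from functools import reduce


def parity_of(pages):
    """Bytewise XOR of a list of equal-length pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def update_parity(old_data, new_data, old_parity):
    """NewParity = (OldData XOR NewData) XOR OldParity, byte by byte."""
    return bytes(o ^ n ^ p for o, n, p in zip(old_data, new_data, old_parity))


# Three tiny made-up pages and their parity.
pages = [bytes([0x0F, 0xA0]), bytes([0x33, 0x01]), bytes([0x55, 0xFF])]
old_parity = parity_of(pages)

# Overwrite the second page, using only its old contents and the old parity.
new_page = bytes([0xC3, 0x10])
shortcut_parity = update_parity(pages[1], new_page, old_parity)

# Recomputing from all the pages gives the same answer.
pages[1] = new_page
assert shortcut_parity == parity_of(pages)
```

The shortcut touches two disks (the data disk and the parity disk) no matter how large N is.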

RAID 5 has increased read throughput and reasonably good write performance, although each small write costs extra I/O to keep the parity up to date. You can even have parallel writes. Individual disks can be hot-swappable. The extra cost in disk drives is only 1/N (one extra disk for every N disks of data), so it's cheaper than alternatives like RAID 10, though its performance is not as good.

RAID level 10 (1+0) is a combination of RAID 1 (mirroring) and RAID 0 (striping): the data is striped across pairs of mirrored disks. It is becoming popular as disks continue to get cheaper.
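To compare the costs of the levels described so far, here is a rough sketch that counts how many disks' worth of usable space each level gives. The function usable_disks is hypothetical, and it assumes equal-sized disks, a RAID 1 set in which every disk holds a full copy, and two-way mirrors for RAID 10.

```python
def usable_disks(level, num_disks):
    """Disks' worth of usable space for a few RAID levels (a rough sketch)."""
    if level == 0:
        return num_disks                # striping only, no redundancy
    if level == 1:
        return 1                        # every disk holds a full copy
    if level == 5:
        if num_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return num_disks - 1            # one disk's worth goes to parity
    if level == 10:
        if num_disks < 4 or num_disks % 2:
            raise ValueError("RAID 10 here needs an even number of disks, at least 4")
        return num_disks // 2           # assumes two-way mirrors, then striping
    raise ValueError("level not covered by this sketch")


for level, disks in [(0, 5), (1, 2), (5, 5), (10, 4)]:
    print(f"RAID {level} with {disks} disks: {usable_disks(level, disks)} usable")
```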

You can also dump the database to disk periodically. Recovery from the write-ahead log then only needs to work forward from the last complete dump.

But can you dump while the system is active (a fuzzy dump)? It's sometimes necessary, but it's very complex.

Parity

RAID 5 is based on a trick with parity, which you may have heard of in CS 240. Parity is a redundant bit of information, namely whether the total number of ones in a binary pattern is odd or even.

Let's start with simple 4-bit quantities, half of a byte, called a nibble. Suppose we want to transmit the number 9 across a wire or store it to an unreliable medium. In binary, 9 is 1001:

1001   # data

Suppose we adopt even parity. We want to make the total number of ones even, so for 9 (1001), the parity bit is 0.

We transmit or store 10010. (Arbitrarily, we put the parity bit on the right; it could be anywhere.)

10010   # data and parity

Suppose later, we read or receive those five bits but one of them (any one of them) is unknown. Let's suppose that it's the second one. So we got

1?010   # data and parity

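Since we know that the parity is supposed to be even, we can reconstruct the unknown bit. The four bits we do know contain two ones, which is already an even number, so the unknown bit must be 0.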
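The same reasoning can be written as a short sketch. The function names even_parity_bit and recover_missing are made up for this example, and an unknown bit is represented by None.

```python
def even_parity_bit(bits):
    """Parity bit that makes the total number of ones even."""
    return sum(bits) % 2


def recover_missing(bits):
    """Fill in the single unknown bit (None) so the total number of ones is even."""
    missing = bits.index(None)
    known_ones = sum(b for b in bits if b is not None)
    repaired = list(bits)
    repaired[missing] = known_ones % 2   # 1 only if the known ones are odd
    return repaired


data = [1, 0, 0, 1]                       # 9 in binary
stored = data + [even_parity_bit(data)]   # data and parity
damaged = [1, None, 0, 1, 0]              # the second bit is unknown
assert recover_missing(damaged) == stored
```

Note that this only works because we know which bit is missing; parity alone cannot tell us which bit is wrong.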

Let's jump up an enormous number of levels and talk about storing whole pages of bits on 5 independent disks. Each page might be 4K bits. Suppose we want to store 4 such pages.

First, we compute a 5th page consisting of parity bits, one for each of the 4K bit positions across the four pages. So, essentially, we do 4K computations like the one we did above for the bit pattern 1001. (In that example, the first bit (1) would come from the first page, the second bit (0) from the second page, the third bit (0) from the third page, and the fourth bit (1) from the fourth page.)

So, now we have four data pages and one parity page. Let's store them on five independent disks:

DDDDP

Now, suppose that one of the disks dies. Could be any of the five disks. Suppose, again, it happens to be the second disk:

D?DDP

We can reconstruct the missing page of data from the other four disks!
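Here is a minimal sketch of that reconstruction, using random bytes as stand-ins for real pages. The name xor_pages is made up, and a page is taken to be 512 bytes (4K bits). XOR over the pages is the same thing as computing even parity at every bit position.

```python
import os
from functools import reduce

PAGE_BYTES = 512  # 4K bits


def xor_pages(pages):
    """Bytewise XOR of equal-length pages (even parity at every bit position)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


# Four data pages plus one parity page, one per disk: DDDDP
data_pages = [os.urandom(PAGE_BYTES) for _ in range(4)]
disks = data_pages + [xor_pages(data_pages)]

# The second disk dies: D?DDP
lost = 1
survivors = [page for i, page in enumerate(disks) if i != lost]

# XOR of the four surviving pages gives back the missing one.
rebuilt = xor_pages(survivors)
assert rebuilt == disks[lost]
```

The same code works whichever disk is lost, including the parity disk.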

This is how RAID 4 and RAID 5 work. The extra trick with RAID 5 is that we don't always use the same disk for the parity page, but the parity pages are spread equally over all the disks.
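One way to picture the spreading of parity pages is to rotate the parity position from one stripe to the next. The rotation rule below is just one possible choice, picked for illustration; real implementations use several different layouts, and the names parity_disk and stripe_layout are hypothetical.

```python
def parity_disk(stripe, num_disks):
    """Disk that holds the parity page for a given stripe (one possible rotation)."""
    return (num_disks - 1 - stripe) % num_disks


def stripe_layout(stripe, num_disks):
    """Show a stripe as a string of D (data) and P (parity), one letter per disk."""
    p = parity_disk(stripe, num_disks)
    return "".join("P" if disk == p else "D" for disk in range(num_disks))


# Over five stripes, every disk takes exactly one turn holding parity.
for stripe in range(5):
    print(stripe_layout(stripe, num_disks=5))
```

With RAID 4, by contrast, every stripe would look like DDDDP, so the one parity disk is involved in every write.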

A computer with a RAID system can suffer a loss of an entire disk and continue to function normally. (Well, it might tell the sysadmin that one of the disks needs to be replaced.) Some RAID systems can allow you to "hot-swap" a new disk in and rebuild its data without even needing to be rebooted. Amazing.

Real World Information

The old Tempest used RAID 1, RAID 5, and RAID 50!

The new Tempest puts most of its disk space on academicstore. Don Nightingale told me (4/17/2020):

It's three 12-disk RAID 6 arrays (each with a hot spare disk) connected via eSATA to a controller. All RAID is done in software.

This configuration is a good example of choosing data resilience over performance.


  1. Basically, I'm taking this opportunity to talk about RAID, which is not really a database or web application topic, but which you should know a little about.