And now on to our guest editorial by Mike Pepe...
Choosing the right way to protect your data can be a daunting task. Many system administrators simply add more drives to a system, implement mirroring, and consider the task done. However, there are many options available to you, and understanding the implications of one data protection scheme over another will help you make the best choice.
The basic forms of RAID
RAID (Redundant Array of Independent Disks) has been around for decades. The most commonly encountered RAID types are identified by number: 0, 1, 5 and (most recently) 6. Let's quickly review these RAID levels and what they really mean:
- RAID Level 0 (that's a zero) is sometimes called "non-RAID" or simply "striping". In this scheme, data is read and written across some number (n) of disks simultaneously. This improves read and write performance up to n times that of a single drive, and also gives you n times the capacity of a single drive. The big downside to RAID 0 is that the failure rate of such an array is roughly n times that of a single drive. When a drive does fail, the array becomes unavailable and the data is irrevocably lost. In some circumstances, however, when raw performance or capacity is the only concern, RAID 0 may be a good choice.
- RAID Level 1 is often called "mirroring". A volume using RAID-1 will contain (at least) two drives, and the data will be written to all drives simultaneously and in the same order, often at the sector level. Read performance may be improved up to n times, as there are now multiple copies of the data to potentially read from. Write performance may suffer, however, as it may take up to n times as long to commit a write to all the copies when compared to a single standalone drive. Capacity remains fixed to the size of one drive, no matter how many copies you make.
- RAID Level 5 may also be called "parity" by some folks. A RAID-5 array must consist of at least three drives. In this type of array, data is written to n-1 drives, and a "parity" unit is calculated and written to the remaining drive. The drive holding the parity chunk per stripe is rotated through the physical drives, distributing it evenly across the drives in the array. The primary advantage of using a RAID-5 array is that the failure of any single drive does not produce any data loss; the missing data can be reconstructed from the parity and the array can continue to operate, albeit with some degradation in performance. Capacity of the array is n-1 drives, and read performance can be improved by up to n-1 times that of a single drive. Write speeds can suffer in RAID-5 arrays since there must be a parity computation before the stripe can be committed to disk.
- RAID Level 6 improves upon the ideas of RAID-5 by providing another, different parity calculation and distributing it across the available drives. You need at least 4 drives for RAID-6, and the capacity of the array will be that of n-2 drives. The chief advantage is that a RAID-6 array can sustain two disk failures without loss of data, again at the penalty of having to computationally reconstruct the missing data. Writes suffer as they do with RAID-5, and more so, because two different parity chunks must be computed for every write to the array.
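The parity reconstruction described for RAID-5 can be illustrated with a short Python sketch. It assumes byte-wise XOR parity, which is the usual choice for single parity; the helper name `xor_blocks` and the sample chunk contents are made up for illustration and do not correspond to any controller's interface. The second, different parity calculation used by RAID-6 is not shown.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data chunks of one stripe (n - 1 = 3 data drives in a 4-drive array).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)            # written to the fourth drive

# Simulate losing the drive that held the second chunk.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)      # XOR of everything that is left

print(rebuilt == data[1])
```

Because XOR is its own inverse, combining the surviving chunks with the parity chunk yields the missing chunk, which is why a degraded array can keep serving reads at the cost of extra computation.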
RAID-10 and compound RAID levels
RAID Level 10 is a compound RAID level. More precisely, it's RAID 1+0 (or sometimes, RAID 0+1), and it combines both striping and mirroring. A RAID-10 array must consist of at least 4 drives (a stripe across two two-drive mirrors is the minimum) but can be built from any larger number of drives equal to the stripe width (n) multiplied by the number of mirrored copies (m).
RAID-10 arrays combine excellent performance characteristics with good data integrity. There are potentially a great number of drives to pull data from, meaning there is a theoretical read performance of n * m times that of a single drive. Their biggest downfalls are capacity, which is only that of n drives; cost, since you must purchase n times m drives; and write performance, which depends on how well your data stripes across the drives, something we will explain later.
Other "compound" RAID levels are possible, for instance striping across multiple RAID-5 arrays (RAID 50) or mirroring two RAID-5 arrays (RAID 51), although not every controller supports these more complex scenarios. These compound RAID levels are not officially defined and therefore may not be portable across systems or controllers.
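To compare the capacity trade-offs side by side, here is a minimal Python sketch that encodes the rules of thumb given so far. The function name `raid_summary` and its parameters are hypothetical, and the fault-tolerance figure is the worst case (the number of drive failures the array always survives), not the best case.

```python
def raid_summary(level, drives, drive_tb, copies=2):
    """Return (usable_tb, drive_failures_always_survived) for a RAID level.

    `drives` is the total number of physical drives; `copies` is the
    number of mirrored copies (m) and only matters for RAID 10.
    """
    if level == 0:
        return drives * drive_tb, 0
    if level == 1:
        return drive_tb, drives - 1
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_tb, 1
    if level == 6:
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * drive_tb, 2
    if level == 10:
        if drives < 4 or drives % copies:
            raise ValueError("RAID 10 needs n * m drives, at least 4")
        return (drives // copies) * drive_tb, copies - 1
    raise ValueError(f"unsupported RAID level: {level}")

for level, drives in [(0, 4), (1, 2), (5, 4), (6, 4), (10, 4)]:
    usable, tolerated = raid_summary(level, drives, drive_tb=4)
    print(f"RAID {level:>2} with {drives} x 4 TB: {usable} TB usable, "
          f"survives any {tolerated} drive failure(s)")
```

With four 4TB drives, for example, RAID-5 yields 12TB usable while RAID-6 and two-copy RAID-10 both yield 8TB, which makes the cost of the extra redundancy explicit.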
Different types of RAID controllers
Now that we've reviewed the different ways in which we can use multiple disk drives for varying degrees of performance, capacity and reliability, you may have already decided on the best scheme for your application. But RAID type is only one part of the equation. How you control these disks is also important. We can bundle RAID controllers into three distinct categories.
Hardware RAID controllers offer the best performance since they are, in effect, self-contained computers dedicated to running RAID arrays. The controller manages all the aspects of the RAID, and the host system is free to do other tasks while the RAID controller manages everything behind the scenes. Hardware RAID controllers often have their own cache to improve performance, and often have an option for a battery back-up to prevent data loss if the contents of a write cache were not written to disk. All this power has a price, however, and in this case literally: the best high-end RAID controllers can be very expensive. There are other potential pitfalls as well, which we will discuss a little later.
Software-based RAID uses your host operating system to virtualize your storage into RAID (or RAID-like) groups. For instance, creating a mirror (RAID-1) of your boot disk in Windows Disk Management is a simple example of software RAID. On the other end of the complexity spectrum, Windows Server 2012 introduces a storage management system called Storage Spaces. Using Spaces, you can make a pool out of your storage and apply different protection schemes to the individual virtual disks you carve from that pool, rather than committing whole physical disks to a single scheme. Software RAID has the advantage of being the least expensive option in most cases, since the functionality is part of the operating system and requires no additional hardware beyond, at most, relatively low-cost host bus adapters if you need more ports. Software RAID also has the potential to be the most flexible. For instance, it is possible in Windows to create a RAID-1 mirror using half the capacity of two disks, and then create a RAID-0 volume out of the remaining storage. You'd then have a volume for data that needs protection and one for data that's not critical, all on the same two disks. The main disadvantage of a software RAID setup is that your operating system must manage it, so performance may suffer as CPU time is used for disk I/O rather than for your application. We'll also examine the real-world implications of this later.
Somewhere in the middle are "hybrid" RAID controllers. These controllers are marginally more expensive than (or in some cases, the same price as) non-RAID host bus adapters. They generally have firmware that the host CPU actually runs to provide the RAID controller functionality, and OS drivers that do the same. In that sense, they are not much different from software RAID. However, these devices may have some form of caching or dedicated hardware to help speed up operation of a RAID array, for instance a hardware parity calculator for RAID-5 and RAID-6 arrays. These devices therefore sit somewhere between software and hardware in functionality, and the pros and cons of both may apply.
Choosing the right RAID controller
So which one is best? Most people would assume that a high-end hardware RAID controller is obviously the best choice, but that's not always the case. At the entry-level end of the server spectrum, a high-end RAID controller can be more costly than an entire server! Some of them have their own out-of-band network configuration and can be rather complex devices for the non-techie to get working. Interchangeability is also a potential issue for hardware and hybrid RAID controllers: if your controller fails, you'd likely need at least something from the same product family with similar firmware installed to ensure you can read and recover the disks. Good luck to the system administrator who has to track down a specific version of a RAID controller that hasn't been made in half a decade!
Contrast this with software RAID, where there's a very good chance that any machine running the same operating system can have transplanted disks from a failed server back up and running very quickly. Recoverability in the event of a crisis may be better here, unless you keep a spare RAID controller card handy. The software and hybrid solutions do use your system's resources to a much greater degree than the hardware solutions, but except in the most demanding and critical systems, the few percentage points of processor utilization are hardly likely to be noticed.
Price versus performance is a second key decision point, but let's talk more about recoverability. We touched on this earlier with a key advantage of software-based RAID: the RAID volume should be readable in any machine running the same operating system, whereas with a hardware RAID controller there's a good chance that your RAID volume would not be readable with another brand or type of controller. However, there is one exception to this: a RAID-1 that consumes an entire disk. Often these volumes are simple block-by-block copies of what would normally be written to a single disk. In many cases it is indeed possible to take one of the copies of a failed mirror volume, put it into any random machine, and read it.
Recovery time and reliability
Recovery time and reliability are another point of consideration. As of this writing, 4TB is the largest drive capacity available. The average transfer rate of a drive this size is somewhere around 180 megabytes per second, which means it would take, on average, a little over six hours just to fill the drive completely. (In the real world, the time to rebuild an active RAID-5 using drives of this size would be two or more times that!)
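The fill-time estimate is simple arithmetic, sketched here in Python. The sketch assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a constant sustained rate; the function name `hours_to_fill` is made up for illustration.

```python
def hours_to_fill(capacity_bytes, rate_bytes_per_s):
    """Hours needed to write a whole drive at a constant sustained rate."""
    return capacity_bytes / rate_bytes_per_s / 3600

capacity = 4 * 10**12      # 4 TB, decimal units (assumption)
rate = 180 * 10**6         # 180 MB/s average transfer rate (from the text)
print(f"{hours_to_fill(capacity, rate):.1f} hours")
```

With these inputs the raw sequential pass comes out at roughly six hours. Any slowdown from competing I/O on a busy array only lengthens it, which is why real rebuilds take considerably longer.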
Why is this important? Let's consider a RAID-5 built with five 4TB drives. One drive fails and is replaced, and the rebuild process begins. Since hard drives are electro-mechanical devices, there is an engineered-in error rate. In this case, our drives are rated at one uncorrectable bit error per 10^14 bits read. In order to reconstruct the RAID-5, we must read all four surviving drives, a total of 16TB of data, which is 1.28 x 10^14 bits! There's a very real chance that during the rebuild we'll encounter an uncorrectable error on one of the remaining drives. If the controller deems that drive bad, we'll have a RAID-5 array with two dead drives, the entire array will fail, and our data disappears.
RAID-6 will help here, since it will continue operating even if two drives fail. However, given the high likelihood of an error, even RAID-6 starts to look less and less attractive.
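To put a rough number on that risk, the following Python sketch estimates the probability of hitting at least one uncorrectable read error while reading the four surviving drives. It treats the quoted error rate as an independent per-bit probability, which is a simplification (real errors cluster, and drives often do better than their rated figure), and the function name is hypothetical.

```python
import math

def p_unrecoverable_read(bytes_read, error_rate_per_bit=1e-14):
    """Probability of at least one unrecoverable bit error while reading
    `bytes_read` bytes, assuming independent errors at a fixed rate."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed with log1p/expm1 to avoid rounding loss.
    return -math.expm1(bits * math.log1p(-error_rate_per_bit))

surviving_drives = 4
drive_bytes = 4 * 10**12   # 4 TB, decimal units (assumption)
p = p_unrecoverable_read(surviving_drives * drive_bytes)
print(f"{p:.0%} chance of at least one read error during the rebuild")
```

Under these assumptions the result is about 72%, so an error-free rebuild is the less likely outcome. Whether a single read error actually takes the array down depends on how the controller reacts to it.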
The value of triple redundancy
Given that there is a statistical chance of catastrophic failure of a parity-based RAID group, you should always remember a few things. First and foremost, RAID is not a replacement for a sound backup (and recovery) strategy. Make sure you have backups in place, and test them periodically to make sure that they are recoverable. Secondly, consider triple-redundant options using RAID-10 striping and mirroring.
It's probably safe to say that many people have encountered random silent corruption in their daily lives. It's that picture that won't display anymore, or the video that's broken at some point in playback. Sure, these things can happen with single drives and single copies of data, but they also appear when disk mirroring is in place. Why would that be? Consider the following scenario: a server running a RAID-1 array with two drives crashes or loses power. A spurious write corrupts a random sector on one of the hard drives. When the machine comes back up, the controller detects a dirty shutdown, re-mirrors the drive, and encounters a data difference. Which block is the correct one? It's entirely possible the RAID controller doesn't know, and there's potentially a 50% chance that it'll guess wrong, permanently corrupting the file.
What if there were three copies instead of two? Well, in that case, the RAID can take a vote; if two of the blocks agree, it's probably the "right" data. Add a checksumming filesystem, such as Windows Server 2012's ReFS, on top of Storage Spaces with triple mirroring, and the chances of silent corruption in your data drop dramatically.
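The voting idea can be sketched in a few lines of Python. This is only an illustration of majority voting among mirror copies, not a description of how any particular controller or Storage Spaces resolves mismatches; the function name and sample blocks are invented.

```python
from collections import Counter

def majority_block(copies):
    """Return the block value held by a strict majority of copies, or None
    when no value has a majority (e.g. a two-way mirror that disagrees)."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) / 2 else None

good = b"invoice-2012.pdf sector 42"
bad = b"invoice-2012.pdf sect\x00r 42"

print(majority_block([good, bad]))        # two copies: no majority
print(majority_block([good, bad, good]))  # three copies: the good block wins
```

With two copies the function has nothing to go on and returns None, which mirrors the guessing problem described above. A checksum stored alongside each block goes one step further, because it can identify the good copy even when only two exist.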
Stripe size and RAID performance
Also consider performance of your stripes. RAID types that stripe data across disks have what is known as a "stripe size". A common stripe size is 64k, meaning that data is written to each drive in 64k chunks. As an example, a 4 drive RAID-5 would then commit data to disks in chunks of 256k (4 drives, 64k each). There is nothing wrong with this, as long as your files are generally larger than 256k. If they are not, updating the smaller files within this stripe will require a read of all 256k, a modification to the data, recalculation of parity, and then a 256k write back to all the drives! If you have a lot of very small files, the performance penalty to write or modify them can be enormous.
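The small-file penalty can be estimated with a short Python sketch. It follows the simplified model in the paragraph above (every update reads and rewrites each full stripe it touches) and assumes files are aligned to stripe boundaries; real controllers may update only the affected chunk and the parity, so treat the figures as an upper bound. The function name and defaults are hypothetical.

```python
def small_write_cost(file_kb, chunk_kb=64, drives=4):
    """I/O for updating one file under a full-stripe read-modify-write.

    Returns (kb_read, kb_written, amplification), where amplification is
    total KB moved divided by the KB the application actually changed.
    """
    stripe_kb = chunk_kb * drives          # 256 KB in the example above
    stripes = -(-file_kb // stripe_kb)     # ceiling division
    kb_read = kb_written = stripes * stripe_kb
    return kb_read, kb_written, (kb_read + kb_written) / file_kb

for size in (4, 64, 256, 1024):
    print(size, small_write_cost(size))
```

In this model a 4k file costs 512k of disk traffic, an amplification of 128 times, while a 1024k file costs only twice its own size. Matching the stripe size to your typical file or I/O size keeps that ratio down.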
A few guidelines concerning RAID
Armed with these basics, the data protection scheme you choose is a balance between needed capacity, performance, and the ever-present constraints of budget. Here are some guidelines based on real-world experience:
- Mirrors of single, whole drives are simplest. The ability to take one drive out of an array and read it elsewhere can be a real timesaver over restoring from backup.
- If your application demands utmost performance, consider investing in a hardware-based RAID controller and RAID-10. Otherwise, a hybrid or software-based RAID-10 solution may be sufficient.
- If capacity needs are high and performance and budget requirements are low, a parity-based solution may be a good fit. Consider using RAID-6, particularly if the array will have large numbers of high capacity drives.
- If data integrity is of the utmost importance, consider a three-copy mirror and ReFS using Storage Spaces. Background data scrubbing and majority-vote-wins concepts will significantly reduce the chance of spurious data corruption.
- And most importantly: Make sure you have a good backup strategy, and you know you can restore!
About Mike Pepe
Mike Pepe joined Microsoft in 2006 after working in the IT field for ten years providing clustering, backup, and storage solutions for the telecommunications industry. He is currently a Service Engineer for the Bing Information Platform at Microsoft, working on datacenter-scale automation and service design.
Send us feedback
Got any comments or stories concerning RAID solutions and controllers? Let us know at email@example.com