Wednesday, July 13, 2011

Greyhole Vs Raid 5/6

If you're here you're probably wondering about Greyhole. It lets you span drives, but why is it any better/worse/different than Raid 5/6? Well, it's different for several reasons. It's a fundamentally different approach to pooling drives. Let's get visual for a moment.

A rough diagram of how Greyhole works. (As of 0.9.9)

An over-simplification of a raid configuration
The above two diagrams (excuse my glossing over details -- we're here to get theoretical, not technical) show a simple overview of Greyhole and Raid. Can you spot the differences? The most important distinction: Pooling is done above the filesystem level in Greyhole. Okay, so what does this mean? Well, in a raid system, your storage space is pooled before you create a file system. This means the filesystem only exists on the assembled array and can only function when the drives are together (minus, at most, the one or two drives raid 5/6 can afford to lose).

What difference does this distinction make?

Modularity
  • Individual drives from a raid 5/6 array are not readable on other machines
    • The drives only have part of a filesystem
  • Since the Greyhole pooling is done above the filesystem level, the individual drives are readable on other machines
    • If you were to take any Greyhole pool drive from your server and hook it up to another pc, all the data that lives on that drive is right there and easily accessible.
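To make "pooling above the filesystem" concrete, here is a minimal Python sketch under simplified assumptions. The directory names (`landing_zone`, `drive1/gh`, etc.) and the `store` helper are hypothetical and are not Greyhole's exact on-disk layout; the point is only that each pool drive holds complete, ordinary files, while the share holds symlinks pointing at them.

```python
import os
import tempfile

# Hypothetical, simplified layout: directory names are illustrative only,
# not Greyhole's exact on-disk paths.
root = tempfile.mkdtemp()
share = os.path.join(root, "landing_zone", "Photos")
pool = [os.path.join(root, f"drive{n}", "gh", "Photos") for n in (1, 2)]
for path in [share, *pool]:
    os.makedirs(path)

def store(name: str, data: bytes, drives: list[str]) -> None:
    """Write a complete copy of the file to each chosen pool drive, then
    leave a symlink in the share that points at the first copy."""
    for drive in drives:
        with open(os.path.join(drive, name), "wb") as f:
            f.write(data)
    os.symlink(os.path.join(drives[0], name), os.path.join(share, name))

store("Your_Picture.jpg", b"fake jpeg bytes", pool)

# Through the share, the symlink resolves to an ordinary file on drive 1.
link = os.path.join(share, "Your_Picture.jpg")
print(os.path.islink(link), os.readlink(link))

# Drive 2 on its own holds a complete, ordinary copy of the file.
with open(os.path.join(pool[1], "Your_Picture.jpg"), "rb") as f:
    print(f.read())
```

Because nothing in `drive2` depends on `drive1` or on the share, you could unplug it, mount it elsewhere and read the file directly, which is the modularity point above.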
Flexibility
  • Since Greyhole is really just creating a logical mapping between symlinks and the files it has moved to pool drives, you can add new pool volumes instantly
  • On a raid volume you'd need to reshape your entire array (for large arrays this can take more than 24 hours) and then expand the file system to take up the extra space that now exists in your logical raid volume
    • The same thing holds for removing drives; however, removing a drive will take significantly longer than adding one in Greyhole, since Greyhole must migrate all of that drive's data onto other volumes
  • Unlike raid, since Greyhole is simply moving whole files onto different pool volumes, it can use different sized drives
    • Got a 100GB, a 1TB, and a 3TB drive? No problem
    • Note: In the above example you would not be able to get 2x file copy redundancy for everything if all the drives were full (think about it: the 100GB and 1TB drives together hold only 1.1TB, so 1.9TB of the files on the 3TB drive would have no other drive to hold their second copy)
  • Raid needs volumes of the same size (unless you use something like lvm to combine smaller partitions to the size of other volumes) to make full use of them: it calculates parity across all the drives stripe by stripe, so every member contributes the same number of bytes, and any space on a larger drive beyond the size of the smallest member goes unused
    • You could partition larger drives out, but if you put two partitions from the same drive in an array you've completely negated your fault tolerance: if that larger drive died, raid 5 would be shot and raid 6 would be at the end of its fault tolerance
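The capacity trade-offs above can be estimated with a short Python sketch. It assumes files can be split finely enough to fill drives exactly, and it ignores the landing zone and filesystem overhead. `greyhole_capacity` and `raid_capacity` are hypothetical helper names. The Greyhole side relies on the fact that a drive can hold at most one copy of any given file, so D units of data fit with k copies only if the drives can supply k × D units of space with no drive contributing more than D.

```python
def greyhole_capacity(sizes: list[int], copies: int) -> int:
    """Largest amount of data (same units as `sizes`) that fits with `copies`
    full copies of every file, each copy on a different drive.

    Assumes perfectly divisible files; ignores landing zone and overhead.
    A drive holds at most one copy of each file, so D units of data fit
    iff sum(min(size, D)) >= copies * D. That test holds for every D up to
    the answer and fails above it, so a binary search finds the maximum.
    """
    if copies < 1 or copies > len(sizes):
        return 0
    lo, hi = 0, sum(sizes) // copies
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(s, mid) for s in sizes) >= copies * mid:
            lo = mid
        else:
            hi = mid - 1
    return lo

def raid_capacity(sizes: list[int], parity_drives: int) -> int:
    """Usable space of raid 5 (parity_drives=1) or raid 6 (parity_drives=2):
    every member is treated as the size of the smallest one."""
    if len(sizes) <= parity_drives:
        return 0
    return (len(sizes) - parity_drives) * min(sizes)

drives_gb = [100, 1000, 3000]
print(greyhole_capacity(drives_gb, 1))  # 4100
print(greyhole_capacity(drives_gb, 2))  # 1100
print(raid_capacity(drives_gb, 1))      # 200
```

For the 100GB / 1TB / 3TB example, two file copies top out at 1.1TB of data, leaving 1.9TB of the 3TB drive with nowhere to put second copies, while raid 5 would treat all three drives as 100GB.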
Fault Tolerance
  • Greyhole and Raid 5/6 each have their pros and cons for parity / redundancy
  • Raid 5/6 can handle one / two drive failures and still keep your data intact, and it can do this while only sacrificing one / two drives' worth of space to parity, regardless of the total number of drives you're using
    • This is efficient, but it can also take a long time for large volumes to repair once new drive(s) are added in to replace the failed ones
    • Once you step beyond one / two failures all the data is dead and gone completely
      • Since your data is spanned across all your volumes, it is unlikely that any file survives intact once the array is past its fault tolerance; once that degraded array goes down, it won't be coming back up
  • Greyhole lets you set X file copies per Samba share, so you can set it to two and Greyhole will create two copies of every file you transfer to it
    • This is, of course, less efficient: instead of giving up 1/x (or 2/x) of your space to parity, where x is the total number of drives, two file copies give up half of your space, a one to one backup
    • You can potentially lose data with just two failures
      • If you have data on drive 1 and Greyhole creates a backup copy on drive 2 and then both those drives fail, you've lost said data
        • Note: This assumes you have 2x file copies set. If you set 3x file copies it would take 3 failures to lose data -- but then you would need 3x the hard drive space for 1x data
        • See the chart at the end of this section for a visual explanation of how two failures could result in data loss even when you're creating 2x file copies
      • In the above situation all data would be safe in raid 6, but would be gone in raid 5
  • Surviving drives are not impacted by failures in Greyhole
    • If you have five drives in a Greyhole pool and drive three dies, the other four drives are fine and still accessible
*Note: If you diligently keep backups, fault tolerance is not the biggest concern in the world.
Notice that Your_Picture.jpg is on Drive 1 and Drive 2. Even though you've got redundancy, if both those drives died, Your_Picture.jpg would be lost forever :(
* Note that the above configuration is illustrating a share with 2 file copies and the 'most_available_space' dir selection algorithm 
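The two-failure scenario in the chart can be reproduced with a toy model in Python. `place` is a hypothetical, heavily simplified stand-in for the 'most_available_space' selection (each file's copies go to the drives with the most free space at that moment); Greyhole's real placement logic has more to it. The drive sizes and file names are made up for illustration.

```python
from itertools import combinations

def place(files: dict[str, int], free: dict[str, int], copies: int) -> dict[str, set[str]]:
    """Toy 'most_available_space': put each file's copies on the drives with
    the most free space at that moment (ties go to the drive listed first).
    Mutates `free` as space is used."""
    placement = {}
    for name, size in files.items():
        chosen = sorted(free, key=free.get, reverse=True)[:copies]
        for drive in chosen:
            free[drive] -= size
        placement[name] = set(chosen)
    return placement

def lost_files(placement: dict[str, set[str]], failed: set[str]) -> list[str]:
    """A file is lost only when every drive holding a copy has failed."""
    return sorted(name for name, drives in placement.items() if drives <= failed)

# Made-up example: four equal 1000GB drives, 2 file copies, sizes in GB.
free = {"Drive 1": 1000, "Drive 2": 1000, "Drive 3": 1000, "Drive 4": 1000}
files = {"Your_Picture.jpg": 5, "notes.txt": 1, "movie.mkv": 700, "song.mp3": 8}
placement = place(files, free, copies=2)

for pair in combinations(["Drive 1", "Drive 2", "Drive 3", "Drive 4"], 2):
    print(pair, lost_files(placement, set(pair)))
```

In this layout two of the six possible two-drive failures lose files (Drive 1 + Drive 2 takes Your_Picture.jpg with it) and the other four lose nothing. With raid 6 every one of these pairs is survivable; with raid 5 every one of them takes the whole array down.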


Interface
  • As of right now Greyhole relies on Samba to capture file system events
    • This means all file operations must occur through Samba or Greyhole will not know about them
    • If you want to manipulate files in a Greyhole share locally on your server, you must mount the Samba share locally with cifs
    • Note: There's an open ticket on the Greyhole github about using an alternative mechanism for logging file system events which would decouple Greyhole from Samba, but this has not been done as of 0.9.9
  • Since raid 5/6 is done below the filesystem level, this is not an issue, and you can modify files locally, through Samba, etc.
Performance
  • Greyhole will give you little, if any, read performance boost
    • It's possible you can end up reading multiple files from different drives at the same time and get a speed boost this way, but that's about it
  • Raid 5 / 6 stripe data across all your drives at a low level
    • When you read a file there's a high likelihood that the file actually spans many or all the drives in your raid volume
      • This means that when you read a file the system doesn't have to wait for one drive to finish before moving on; it can request the next chunks from the other drives at the same time, giving you a good boost in read speeds
  • Greyhole writes are no slower than writing directly to a drive*
    • *Once the file is written to the landing zone, Greyhole has to copy it again to one of your pool drives, so in reality it takes something like twice the amount of time to write a file, plus any overhead for creating metadata -- though this is largely transparent to the end user!
  • Raid 5 / 6 has a good amount of overhead associated with writing data since it must update parity on every write; a write smaller than a full stripe means reading the old data and parity back from the disks before the new data and parity can be written
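The write overhead can be sketched by counting disk operations in an idealized model (no caching, no full-stripe writes). For a small raid 5 write, the parity does not have to be recomputed from every drive: because parity is XOR, it can be patched as P' = P xor D_old xor D_new, but that still requires reading the old data and old parity first. Raid 6's second parity block uses different math (not plain XOR), so the XOR part below only illustrates raid 5, although the read-then-write pattern and the I/O count carry over. The Greyhole count follows the description above of writing to the landing zone and then copying onto pool drives. All helper names are hypothetical.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks (raid 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def raid_small_write_ios(parity_blocks: int) -> int:
    """Read-modify-write of one data block: read the old data block and each
    old parity block, then write the new data block and each new parity block."""
    return 2 * (1 + parity_blocks)

def greyhole_write_ios(copies: int) -> int:
    """Simplified model: one write into the landing zone, then one write per
    file copy as Greyhole puts the file onto pool drives."""
    return 1 + copies

# Made-up raid 5 stripe: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(*data)

# Small write to data[1]: patch the parity from the old values instead of
# reading every other drive, P' = P xor D_old xor D_new.
new_block = b"\xaa\xbb"
parity = xor_blocks(parity, data[1], new_block)
data[1] = new_block
assert parity == xor_blocks(*data)  # matches parity recomputed from scratch

print(raid_small_write_ios(1), raid_small_write_ios(2), greyhole_write_ios(2))  # 4 6 3
```

The raid counts apply to every small write the client makes, while in Greyhole only the landing zone write happens while you wait; the pool copies are the part described above as largely transparent to the end user.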
So what does all this mean, really? Well, Greyhole is generally a far more flexible framework for pooling. It is limited, though, by its dependence on Samba. It also won't give you any significant performance boosts. It does allow you to instantly grow your pool. Raid 5 / 6 is more space efficient, but has a far more rigid structure and in extreme failure scenarios results in total data loss. In cases of minor failure each system has its merits depending on your viewpoint.

The bottom line for me: when dealing with a large volume, raid 6 was a hassle. Reshaping or recovering the array took about 40 hours, even over eSata with a reasonably fast dual core processor. With Greyhole things are far more flexible. If my sata controller decides to spontaneously reset a port, I do not have to worry about an array falling over and the ensuing forced reassembly. In fact, when the drives have reset, Greyhole hasn't even blinked. For a home user scenario, if you're willing to deal with 2x the space usage* (more akin to Raid 10), then Greyhole is a clear winner for its flexibility. If you're willing to put up with raid's rigidity and you cannot abide the space required for one to one redundancy in Greyhole, then raid 5 or 6 is by no means a bad choice. Like most things it comes down to your situation and preferences, but I hope this has given you a basis for making an informed decision between the two.

* You do not have to use 2x storage space. If you are confident in your backup strategy you could set 1 file copy and use all your storage space for files and no redundancy. In this case any failure will result in data loss -- which isn't a big deal if your backups are up to date.

Recap


Greyhole

  1. Support for varied volume size
  2. Flexible architecture -- easily and quickly add / remove drives
    1. Any individual drive can be read from other machines
  3. Better worst-case fault tolerance
  4. Only provides for one to one backups (2x the space)
  5. No performance gains
  6. Coupled to Samba
    1. Requires locally mounting through Samba to change storage pool files on the server
Raid
  1. Generally provides a read performance boost
  2. More efficient fault tolerance
    1. Raid 5 gives up 1 drive's worth of space to parity no matter how many drives there are in total, and raid 6 gives up 2 drives' worth
    2. In raid 5 any single drive failure can be tolerated and the array will rebuild once you replace the failed volume, raid 6 can handle any two drive failures
  3. Is like any other volume and does not require Samba or any other interface for interaction with array files
  4. Rigid requirements
    1. Drives are only readable when all are together in an assembled array
    2. Individual drives cannot be read on other machines
    3. Requires identically sized volumes
  5. Rebuild / reshaping times for large volumes can be slow
  6. If you surpass the fault tolerance (2 or 3 failures depending on your raid level) your data is completely gone.

4 comments:

  1. When you say Greyhole "Only provides for one to one backups", I'm unclear what this means. In the text you say you can choose the number of backups, and in the diagram it shows 2 copies of a file. I'm not sure how this is different from RAID-5 in that regard? Or RAID-6, where you choose 3 copies of everything?

  2. It's different in that you don't actually have two full copies or three full copies of everything when you use Raid 5 and Raid 6 respectively; you just have parity. You're only dedicating 1/x or 2/x drives worth of space (where x is the total number of drives) to this parity. If a drive goes down you can recreate it with math magic. As a simple example (which is most certainly NOT how the parity is calculated): if 1 + 2 + 1 + x = 7, we can determine that x was 3 and thus recreate a given byte from the missing drive, and repeat until all bytes are accounted for. What this means is that no actual full file backup is ever created in Raid 5/6; this is only done in Raid 1 (and Raid 1+0). Greyhole can create one or more backups of your files, but it can't do parity, so each backup takes a full factor of space. (If you want 1 backup it'll take 2x the space; 2 backups, 3x the space.) Incidentally, it's for this reason that you'll see it pasted across the internet that Raid 5 and Raid 6 are not the same as having backups.

  3. Hi Andrew, What are the implications of your root drive going down while all your gh pooled drives are fine? Is it possible to recover the gh pool if the mysql data is lost? When setting up a gh pool are there other backups that could be done to help recover from a loss of the root drive?

  4. If the landing zone goes down it doesn't really hurt anything. Same goes for the mysql data. You'll have to set a new landing zone or get a working mysql instance up, obviously, but there's very little damage done.

    In the case of a landing zone failure you'd set a new landing zone, reset samba and greyhole and then run a greyhole fsck. Greyhole will walk the graveyards (or metastores depending on your version) and figure out what symlinks are missing (in this case all of them) from the landing zone and recreate them. If this was your system drive that went down the same solution applies although you'd need to recreate your greyhole config.

    As far as mysql goes I believe that holds queued operations and scheduled fscks. I could be wrong -- I haven't looked at the code in a year or so and I definitely haven't kept up with the latest developments -- but I don't think there's anything critical that is lost if the mysql data gets wiped.

    I suppose you could backup the greyhole config to hedge against a system drive failure, but I see no value added in backing up the landing zone or mysql data.

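The "1 + 2 + 1 + x = 7" example in the second comment is, as it says, not how parity is really calculated. For raid 5 the parity block is the byte-wise XOR of the data blocks in a stripe, and any single missing block can be rebuilt by XOR-ing everything that survives. Here is a minimal Python sketch with made-up 4-byte blocks; raid 6's second parity block uses Galois-field arithmetic rather than plain XOR, so this covers raid 5 only.

```python
from functools import reduce

def xor_parity(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up stripe on a 4-drive raid 5: three data blocks plus their parity.
stripe = [b"HOME", b"DATA", b"RAID"]
parity = xor_parity(stripe)

# Drive 2 dies. XOR-ing the survivors with the parity cancels every surviving
# block (x xor x == 0) and leaves exactly the missing one.
rebuilt = xor_parity([stripe[0], stripe[2], parity])
print(rebuilt)  # b'DATA'
```

No full copy of `b"DATA"` was stored anywhere, which is why raid 5/6 costs only one or two drives' worth of space but is still not the same thing as having a backup.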
