20120426

Cluster File Systems, Beginning Trials

My first foray into cluster file systems taught me a great deal, though plenty remains to learn.  Since coming to understand what OCFS2 can do for me, I have employed it thus (in a sandbox environment):

  • Create one iSCSI target - this would be our backing store, or "SAN".
  • Create three different accessing nodes - my iSCSI initiators.
  • Configure the three initiators into the same OCFS2 cluster.
  • Use OCFS2 to manage the iSCSI store effectively.
In short, it worked.  All three initiators connected to the iSCSI target simultaneously and, using OCFS2, were able to read and write the file system together in real time.
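
For the record, here is a rough sketch of what each initiator ran.  The portal address, device name, and mount point are placeholders from my sandbox, and the OCFS2 cluster itself still has to be defined in /etc/ocfs2/cluster.conf and brought online with the o2cb service before mounting:

   # discover and log in to the iSCSI target (open-iscsi)
   iscsiadm -m discovery -t sendtargets -p 192.168.1.10
   iscsiadm -m node --login

   # format once, from any single node (-N sets the number of node slots)
   mkfs.ocfs2 -L sandbox -N 4 /dev/sdb

   # mount on every node
   mkdir -p /mnt/shared
   mount -t ocfs2 /dev/sdb /mnt/shared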

After browsing through an Ubuntu document on high-availability iSCSI, I think the road-map will be as follows:
  1. Set up a data-store server (server 1) with DRBD and iSCSI Enterprise Target (a rough sketch of both follows this list).
  2. Set up a second data-store server (server 2) to mirror the first server.
  3. Configure High-Availability services.
  4. Configure one or more new iSCSI initiators to use this store and OCFS2.
  5. Test fail-over (during reading and writing).
  6. Test live-migration of some sample VMs.
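
For steps 1 and 2, the heart of each data-store server would be a DRBD resource backed by local storage and then exported over iSCSI.  A minimal sketch, assuming DRBD 8.x syntax and iSCSI Enterprise Target; the hostnames (san1/san2), addresses, backing device, and IQN are all placeholders:

   # /etc/drbd.d/store0.res - replicate the backing device between the two servers
   resource store0 {
     protocol C;
     device    /dev/drbd0;
     disk      /dev/md0;
     meta-disk internal;
     on san1 { address 10.0.0.1:7789; }
     on san2 { address 10.0.0.2:7789; }
   }

   # /etc/ietd.conf - export the DRBD device as the iSCSI target (on the active server)
   Target iqn.2012-04.example:store0
       Lun 0 Path=/dev/drbd0,Type=blockio
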
This raises the question: what are the limits of this cluster?  Well, the physical file system limits are the limits of the technology and of what I'm able to attach.  To that end, I can expand the systems to be quite large, although performance may suffer.  Increasing the size of the data-store cluster is also an option, and a very viable one even without a technology to bind the cluster members together (in terms of their storage).  As far as any member of the OCFS2 cluster is concerned, there can be any number of drive targets on the cluster, which equates (in our case) to any number of iSCSI targets.

Adding more data-store servers basically means adding one or more new iSCSI targets.  All the initiators will access all of the targets, and data-migration from one target to another can happen anywhere, at any time.  Live-migration will continue to work, and the only thing we really lose out on is increased redundancy - that is to say, the redundancy does not improve, but it also does not necessarily diminish.  As the servers are already intended to be highly-available, I think this is the point of diminishing returns.

I will hopefully post configuration file samples soon, so that this information may live on.

20120425

OCFS2 - The Short Explanation

What is a Shared Disk Cluster File System?  OCFS2.  What follows is an explanation by example.  Some of it also turns into stream of consciousness as I try to work out exactly what I need, what I don't need, and where I should best invest my resources.

The Reason for OCFS2

Suppose you have a SAN, implemented perhaps as an iSCSI target.  Suppose now that you have multiple machines that want to access this SAN simultaneously.  You could create multiple targets for them, or separate LUNs.  But if they actually need a shared file system (for the purpose of, say, VM live migration from one host to another - both hosts must be able to access the VM's hard drive image for that to work), then you need something like NFS.  But NFS, although good, has some issues.

OCFS2 is a shared-disk-access file system, meaning that it represents a file system that multiple machines can access simultaneously.  However, that is the limit of its capabilities.  It doesn't provide the storage, it merely shares it.  (Don't get me wrong - that's a big deal, in-and-of itself.)  So, with our iSCSI target and perhaps 3 or 4 servers connected to it, with only one LUN to Rule Them All, we can use OCFS2 to ensure they don't stomp one another or corrupt the entire file system.  In the example of the SAN, the target media and the participating hosts are different machines.
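
What makes those 3 or 4 servers "a cluster" to OCFS2 is a small configuration file, /etc/ocfs2/cluster.conf, identical on every node.  A minimal sketch for two nodes (the cluster name, node names, and addresses are placeholders; additional nodes just repeat the node stanza):

   cluster:
           node_count = 2
           name = sandbox

   node:
           ip_port = 7777
           ip_address = 192.168.1.21
           number = 0
           name = host1
           cluster = sandbox

   node:
           ip_port = 7777
           ip_address = 192.168.1.22
           number = 1
           name = host2
           cluster = sandbox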

This also works for dual-primary DRBD clusters.  In this case you have a file system that is being replicated between two hosts by DRBD, and both hosts have read/write access to it.  Typically with DRBD, only one host is the primary, and the other host is a silent, non-participating standby.  Now, DRBD only provides the block device, on which we must put a file system.  Using OCFS2 as the file system on our DRBD device allows both hosts to be primaries, both reading and writing, without corrupting the file system.  So, in this case, the target media and the participating hosts are the same machines.
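
The DRBD side of dual-primary comes down to a couple of options in the resource definition.  A minimal sketch, assuming DRBD 8.x (everything else about the resource - devices, hosts, addresses - is as it would be for a normal single-primary setup):

   resource r0 {
     startup {
       become-primary-on both;
     }
     net {
       allow-two-primaries;
       # the DRBD documentation also recommends setting split-brain
       # recovery policies here when running two primaries
     }
     # ... device, disk, meta-disk, and "on <host>" sections as usual ...
   }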

Scalability

Let's talk scalability now.  We can scale up the number of participating OCFS2 nodes, meaning we can have lots and lots of hosts all accessing the same file system.  Grand.  What about the backing storage?  Well, since OCFS2 doesn't provide it, OCFS2 doesn't care.  That being said, whatever backing storage we use must be accessible to all the participating machines.  So scaling up our iSCSI target means scaling up the servers, in quality and/or quantity.
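
One practical note on scaling the node count: OCFS2 allocates a fixed number of node slots per volume, so it pays to size that generously up front.  The device and counts below are placeholders, and depending on the tools version, raising the slot count later may require the volume to be unmounted everywhere:

   # at mkfs time, -N sets how many nodes may mount the volume at once
   mkfs.ocfs2 -L shared -N 8 /dev/sdb

   # the slot count can also be raised later
   tunefs.ocfs2 -N 16 /dev/sdb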

By Quality
I'm going to assume for the rest of this article that we're talking about a Linux-driven solution.  Certainly there are tons of sexy expensive toys you can buy to solve these problems.  My problem is I have no sexy expensive toy budget.  That being said....

If I have two DRBD-driven storage hosts, I can beef up the (mdadm-governed) RAID stores on them.  I can swap the drives out for larger ones and rebuild my way up to more space.  With LVM I can even link multiple RAID arrays together into still larger storage pools.  This would have to be chained as such:

   RAID arrays (as PVs) -> VG -> LV -> DRBD -> OCFS2
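
In LVM terms, that chain would be assembled roughly like this (device and volume names are placeholders):

   # RAID arrays become physical volumes, pooled into one volume group
   pvcreate /dev/md0 /dev/md1
   vgcreate store_vg /dev/md0 /dev/md1

   # one big logical volume spanning the group
   lvcreate -l 100%FREE -n store_lv store_vg

   # /dev/store_vg/store_lv then becomes the backing "disk" for the DRBD
   # resource, and OCFS2 is created on the DRBD device itself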

Store Failure Tolerance:

  • RAID: Single (or dual for level 6) drive failure per array
  • DRBD: Single host failure
With the right networking equipment, this could be a very fast and reliable configuration, and provide nearly continuous storage up-time.  Barring massive power-outages, I would wager that this could serve at least four nines (99.99%, or roughly 52 minutes of downtime per year).



By Quantity
Increasing the number of storage servers could potentially provide additional redundancy, or at least increased performance.  The key to redundancy is, of course, redundant copies of the data.  The downside to increasing the quantity of servers is managing to chain all that storage together.  The more links we have in the storage management chain, the slower that storage operates.  Worse yet, finding a technology that effectively chains storage together is rather difficult, and not without its risks.

We could, for instance, throw GlusterFS on the stack, since it can tether nodes together in an LVM-like fashion and create one unified file system.  Is that worth the trouble?  Is it worth the risk, considering the state of the technology?  That's not to say it's a bad system, but it seems as though that much flexibility does not necessarily justify the sheer cost of growing the cluster this way.  And, of course, there are other thoughts that must now go into a separate post.