What is a Shared Disk Cluster File System? OCFS2, for one. What follows is an explanation by example. Some of it also turns into stream of consciousness as I try to work out exactly what I need, what I don't need, and where I should best invest my resources.
The Reason for OCFS2
Suppose you have a SAN, implemented perhaps by an iSCSI target. Suppose now that you have multiple machines that want to access this SAN simultaneously. You could create multiple targets for them, or separate LUNs. But if they actually need a
shared file system (for, say, VM live migration from one host to another - both hosts must be able to access the VM's hard drive image for that to work), then you need something like NFS. But NFS, although good, has its issues.
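To make the setup concrete, here is a rough sketch (the portal address and IQN are placeholders I've made up) of how each participating host would attach to the same iSCSI LUN with open-iscsi, after which every host sees the same block device:

    # on every participating host: discover the target and log in
    iscsiadm -m discovery -t sendtargets -p 192.168.1.50
    iscsiadm -m node -T iqn.2013-01.com.example:storage.lun1 -p 192.168.1.50 --login
    # the LUN now shows up as a local block device, e.g. /dev/sdb

Put an ordinary file system on that device and mount it everywhere, with nothing coordinating the hosts, and you get corruption. That is the gap OCFS2 fills.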
OCFS2 is a shared-disk file system, meaning it is a file system that multiple machines can mount and access simultaneously. However, that is the limit of its capabilities: it doesn't provide the storage, it merely shares it. (Don't get me wrong - that's a big deal in and of itself.) So, with our iSCSI target and perhaps 3 or 4 servers connected to it, with only one LUN to Rule Them All, we can use OCFS2 to ensure they don't stomp on one another or corrupt the entire file system. In the SAN example, the target media and the participating hosts are different machines.
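A minimal sketch of the OCFS2 side (cluster name, node names, addresses, and device paths are all placeholders): every participating host carries the same /etc/ocfs2/cluster.conf, brings the o2cb cluster stack online, and then mounts the very same device.

    # /etc/ocfs2/cluster.conf - identical on every node, one node: stanza per server
    cluster:
            node_count = 4
            name = ocfs2demo

    node:
            ip_port = 7777
            ip_address = 10.0.0.1
            number = 0
            name = host1
            cluster = ocfs2demo
    # ...repeat the node: stanza for host2, host3 and host4

    # bring the cluster stack online on every node
    /etc/init.d/o2cb online ocfs2demo

    # format once, from any single node; -N is how many nodes may mount it at once
    mkfs.ocfs2 -L shared01 -N 4 /dev/sdb

    # mount on every node
    mount -t ocfs2 /dev/sdb /srv/shared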
This also works for dual-primary DRBD clusters. In that case you have a block device that is being replicated between two hosts by DRBD, and both hosts have read/write access to it. Typically with DRBD, only one host is the primary, and the other is a silent, non-participating standby. Now, DRBD only provides the block device, on which we must put a file system. Using OCFS2 as the file system on our DRBD device allows both hosts to be primaries, both reading and writing, without corrupting the file system. So, in this case, the target media and the participating hosts are the same machines.
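Sketching the DRBD half of that (resource name, host names, backing devices, and addresses are assumptions on my part), the key pieces are allow-two-primaries plus sane split-brain policies; OCFS2 then goes on /dev/drbd0 exactly as above:

    # /etc/drbd.d/r0.res - hypothetical dual-primary resource
    resource r0 {
        net {
            allow-two-primaries;                 # both nodes may be primary at once
            after-sb-0pri discard-zero-changes;  # split-brain recovery policies
            after-sb-1pri discard-secondary;
            after-sb-2pri disconnect;
        }
        startup {
            become-primary-on both;
        }
        on host1 {
            device    /dev/drbd0;
            disk      /dev/vg_store/lv_drbd;
            address   10.0.0.1:7789;
            meta-disk internal;
        }
        on host2 {
            device    /dev/drbd0;
            disk      /dev/vg_store/lv_drbd;
            address   10.0.0.2:7789;
            meta-disk internal;
        }
    }

Then mkfs.ocfs2 and mount /dev/drbd0 on both hosts, the same as with the iSCSI LUN.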
Scalability
Let's talk scalability now. We can scale up the number of participating OCFS2 nodes, meaning we can have lots and lots of hosts all accessing the same file system. Grand. What about the backing storage? Well, since OCFS2 doesn't provide it, OCFS2 doesn't care. That said, whatever backing storage we use must be accessible to all the participating machines. So scaling up our iSCSI target means scaling up the servers, in quality and/or quantity.
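On the OCFS2 side, growing the node count is cheap; as a small sketch (device path assumed), an existing file system can be given more node slots with tunefs.ocfs2:

    # raise the number of node slots (simultaneous mounters) on an existing volume
    tunefs.ocfs2 -N 8 /dev/sdb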
By Quality
I'm going to assume for the rest of this article that we're talking about a Linux-driven solution. Certainly there are tons of sexy, expensive toys you can buy to solve these problems. My problem is that I have no sexy-expensive-toy budget. That being said...
If I have two DRBD-driven storage hosts, I can beef up the (mdadm-governed) RAID stores on them. I can swap the drives out for larger ones and rebuild my way up to more space. With LVM I can even link multiple RAID arrays together into even larger storage pools. The stack would have to be chained as such:
RAID arrays (as PVs) -> VG -> LV -> DRBD -> OCFS2
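In command form, that chain looks roughly like this (disk names, array sizes, and the volume names are all placeholders; the DRBD resource is the one sketched earlier):

    # two RAID-6 arrays become LVM physical volumes
    mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[b-e]
    mdadm --create /dev/md1 --level=6 --raid-devices=4 /dev/sd[f-i]
    pvcreate /dev/md0 /dev/md1

    # chain the arrays into one volume group, carve out a logical volume
    vgcreate vg_store /dev/md0 /dev/md1
    lvcreate -n lv_drbd -l 100%FREE vg_store

    # /dev/vg_store/lv_drbd backs the DRBD resource, and OCFS2 sits on /dev/drbd0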
Store Failure Tolerance:
- RAID: Single (or dual for level 6) drive failure per array
- DRBD: Single host failure
With the right networking equipment, this could be a very fast and reliable configuration, and provide nearly continuous storage up-time. Barring massive power outages, I would wager that this could serve at least four nines (99.99%) of availability.
By Quantity
Increasing the number of storage servers could provide additional redundancy, or at least increased performance. The key to redundancy is, of course, redundant copies of the data. The downside to increasing the quantity of servers is, of course, having to chain all that storage together. The more links we have in the storage management chain, the slower that storage operates. Worse yet, finding a technology that effectively chains storage together is rather difficult, and is not without its risks.
We could, for instance, throw GlusterFS on the stack, since it can tether nodes together in LVM fashion and present one unified file system. Is that worth the trouble? Is it worth the risk, considering the state of the technology? That's not to say it's a bad system, but it seems almost as though that much flexibility doesn't necessarily justify the sheer cost of growing the cluster. And, of course, there are other thoughts that must now go into a separate post.
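For completeness, a hedged sketch of what that GlusterFS layer would look like (host names, brick paths, and the volume name are mine, not anything prescriptive):

    # from any one node: join the peers and aggregate their bricks into a single volume
    gluster peer probe server2
    gluster volume create gv0 server1:/export/brick1 server2:/export/brick1
    gluster volume start gv0
    # (adding "replica 2" before the brick list trades half the capacity for redundancy)

    # clients (or the servers themselves) mount the unified file system
    mount -t glusterfs server1:/gv0 /mnt/gluster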