20151107

High-Availability, High-Performance Data Stores for Virtualization: A Recap, and an Experiment

This post recaps most of the routes I have taken along the way toward a cluster-appropriate, highly-available virtualization backing-store solution.  It will also explore my latest endeavor: Ceph with ZFS.

Goals

My initial goal had been to have a cluster-aware backing store on which to keep virtual machine images.  This store would serve multiple hypervisors.  The store needed to serve the same images to all hypervisors (to support rapid live migration), and needed to be highly-available.  It also needed to provide sufficient performance.  Throughout my experiments and production systems, I have used gigabit connections, often bonded using various kinds of Linux bonding modes.  I won't recap them all here; suffice it to say that my current standard is balance-alb.  That doesn't mean it's the best, only that it's what I have settled on for the sake of having bonding that actually uses more than one slave interface at a time.
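
For reference, here is a minimal sketch of that bonding on a Debian/Ubuntu-style /etc/network/interfaces; the interface names and addresses are placeholders for whatever your hardware and network use:

    # each physical NIC is enslaved to bond0
    auto eth0
    iface eth0 inet manual
        bond-master bond0

    auto eth1
    iface eth1 inet manual
        bond-master bond0

    # the bond carries the address; balance-alb spreads traffic across
    # both slaves without requiring any switch-side configuration
    auto bond0
    iface bond0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bond-mode balance-alb
        bond-miimon 100
        bond-slaves none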

As my solutions evolved, so too did my goals.  Eventually I wanted to make the maximum use of my hardware.  I wanted every storage server to actually serve data, to maximize both drive I/O and network utilization.  Despite my best efforts, I have never been able to max out my network.

The Past

MooseFS

This was my first foray into a cluster file system.  It was interesting to me because it balanced blocks across all the participating nodes, and was very easy to set up and get running.  Any machine running the MooseFS client could gain access to the data, and adding additional storage was as trivial as adding more nodes or more hard drives.  It worked on replica counts: as long as a sufficient number of chunk replicas were present in the cluster at large, the system was satisfied.

At the time, I wasn't very familiar with utilities such as Pacemaker, and so my chief complaint with MooseFS was the single "mfsmaster" node.  If this node went down, all clients hung until it was brought back up.  And clients only knew about that one node.  I'm not sure whether this was corrected in later versions, but at the time it became part of the deal-breaker.  The other part was performance: I couldn't get the latency and throughput I wanted from the cluster.

DRBD + OCFS2

The next route I took was toward OCFS2 on top of DRBD.  Eventually this would also turn into OCFS2 on top of DRBD, on top of ZFS.  Two nodes can be mirrored easily with DRBD, but to turn their mirrored data into a file system that could be read by either node at the same time required a cluster-aware file system.  Enter OCFS2.
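
As a sketch of that mirroring layer (DRBD 8.4-style syntax; the resource name, hostnames, devices, and addresses are placeholders), the key piece is dual-primary mode so OCFS2 can be mounted on both nodes at once:

    resource vmstore {
        net {
            protocol C;
            allow-two-primaries yes;    # required for active/active OCFS2
        }
        on storage1 {
            device    /dev/drbd0;
            disk      /dev/zvol/tank/vmstore;   # a ZFS zvol in the later variant;
            address   10.0.0.1:7788;            # a plain partition works the same way
            meta-disk internal;
        }
        on storage2 {
            device    /dev/drbd0;
            disk      /dev/zvol/tank/vmstore;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }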

This solution had several moving parts: the cluster file system, the mirroring, the cluster management software to keep everything running, and of course the KVM hypervisors.  Everything was manually configured and maintained.  While it ran for a very long time, it also did not perform well.  Part of that was probably the hardware: slow hard drives and too few of them, such that ZFS was configured to double up on its writes for at least a modicum of data redundancy per node.  It was also very limited in terms of growth: you can't really have more than three or maybe four DRBD nodes, and after that you are out of luck.  I want to say you could run them in active/active, but every write had to be picked up by both DRBD nodes, so your throughput was only as fast as your slowest drive (or your network link).  I dedicated links to DRBD, but it didn't matter.  The drive access was so abysmal that even on a data sync between the nodes, the ethernet bond chuckled and went back to idling.

DRBD + OCFS2 + iSCSI

The next logical step was to remove the data store from the hypervisors and have it be stand-alone.  I shared the block devices via iSCSI, and brought the hypervisors (which acted as the iSCSI initiators) into agreement with OCFS2.  This, sadly, was another mixed bag.  Again, everything begins at your physical storage.  If it isn't fast, nothing else is going to be fast, and every step along the way adds a bit of latency and costs some throughput.

One thing this did achieve was the ability to have more than two or three hypervisors.  I could theoretically have as many as my network and storage could manage.  Sadly, this setup would never see more than three.  Storage was still tenuous, though now I could swap out a storage node without interrupting the hypervisors or their VMs.

There was an elephant in the room, however, and it was OCFS2.  Long story short, OCFS2 can easily bring an entire cluster to its knees.  Fencing is absolutely required.  Without it, a single node failure leaves OCFS2 holding everyone up while it waits for a fence that never comes.  This is the way it was meant to be, and unfortunately my fencing wasn't all that...

As for iSCSI, I was still stuck with one node being the active beast, while the other sat idly by replicating every write.  I did split my storage into multiple targets, so that I could get both nodes involved, but since both still had to participate in the DRBD link, it was as if they were a single machine.
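
For illustration, splitting the storage into multiple targets looked something like the following; this sketch uses tgt's targets.conf syntax, the IQNs and backing devices are placeholders, and each node would publish the targets for which it was currently serving as DRBD primary:

    <target iqn.2015-11.local.storage:vmstore0>
        backing-store /dev/drbd0
    </target>

    <target iqn.2015-11.local.storage:vmstore1>
        backing-store /dev/drbd1
    </target>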

What could I have done differently?  I could have thrown storage at it.  At the time, I was mainly interested in redundancy, and so performance suffered.  I later learned how to stripe redundant arrays inside ZFS and get better performance, but that requires an abundance of drives I simply didn't have.

DRBD + NFS

Seeing as how one of the chief problems with OCFS2 was how it could lock up an entire virtualization cluster with ease, I decided to investigate an alternative.  NFS had received relatively high marks for performance, and, being a network file system by design, it needed no cluster-aware file system layered on top.  This, coupled with the fact that it was becoming increasingly difficult to integrate OCFS2 + Pacemaker + CMAN on my nodes with every new release of Ubuntu, made for a compelling reason to test.

The benefits: it's all server-side, so if a hypervisor goes belly-up, the share does not go down with it.  It's relatively fast and clean.  Fail-over is fairly painless.

The detriments: as with iSCSI, you can only use one node to serve your data.  But actually, it's a little worse than that.  Let's suppose you want to offer up several NFS shares, hosted perhaps on different drives.  All shares have to be served from the same node.  The reason for this is the NFS client handles: the NFS server keeps track of them, and consequently you have to share that state between the nodes for fail-over to work (otherwise the clients end up with stale handles and the connections drop).  There does not appear to be an easy way to merge two lists of handles, and it sounds like a kludge anyway.
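
For what it's worth, a common pattern for DRBD-backed NFS fail-over, and roughly the shape of the problem above, is to export from a floating service address and keep the server's state directory on the replicated volume so the handles move with the active node.  A minimal sketch, with paths and networks as placeholders:

    # /etc/exports on whichever node currently holds the DRBD primary
    /srv/vmstore  10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)

    # the NFS server keeps its client state under /var/lib/nfs; hosting that
    # directory on the replicated volume lets handles survive a fail-over
    mount /dev/drbd0 /srv/vmstore
    mount --bind /srv/vmstore/nfs-state /var/lib/nfs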

For a detriment, that isn't so bad.  Let's face it, if you want to offer fast cluster storage without the OCFS2 baggage, NFS is pretty much as good as it gets, so long as you are sticking with an active/passive storage node setup.  Again, extremely fast underlying storage is key to making it work.

The Present: Ceph, and ZFS

A long-time demand has been data integrity.  Ever since I had some RAID adapters that shredded my data, and a batch of hard drives that habitually and silently corrupted it, I have been extremely cautious.  This has not come without a price, either: reliability trades off against performance and capacity.  The more reliability you have, generally the less performance and capacity you end up with.  Think of RAID-5, RAID-6, RAID-Z3, etc.  Balancing performance, capacity and reliability is an ongoing struggle.

Ceph brings to the picture some very interesting attributes.  First, the ability to remove all single points of failure.  Multiple monitor nodes can be addressed by the hypervisors, and the OSDs are redundant by design.  The only questionable feature is the metadata server, which is only needed for CephFS and is not required for my environment.

When writing, Ceph writes to all the appropriate disks and waits for those writes to finish.  Not every disk in the cluster is involved with a given write.  When reading, it reads from the "primary" source for a given piece of data.  Data is replicated, and the replicas are secondary to a single primary.  Primaries are spread out across the cluster, but the point is that only one disk is going to give you any one piece of data.  On the other hand, for many pieces of data, you may get them from many disks at once, and consequently very quickly.
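
You can see this placement directly: ceph osd map reports which placement group an object maps to and which OSDs hold it, with the primary listed first (the pool and object names here are just examples):

    # ask the cluster where a given object would live
    ceph osd map rbd vm-disk-001
    # the output names the placement group and its "up"/"acting" OSD sets;
    # reads are served by the first (primary) OSD listed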

Ceph is also a big "more is more" fan.  The more nodes you have, the better for resiliency.  The more drives you have, the better for performance and redundancy.  Best of all, it digs commodity hardware.  My kind of tech.

Now, I'm not here to sing praises, and its initial setup and subsequent maintenance can be onerous enough.  But let's consider a configuration and discuss options.  For starters, I wanted not just fast storage, but very fast.  My goal therefore was to let Ceph manage as much of the redundancy as possible, and have ZFS provide basic data integrity and performance.  Since ZFS performs per-block checksums, that would ensure integrity.  By striping across two or three drives, I can get double-to-triple the performance on reads and writes.  This comes at a cost: a single drive failure in that stripe takes out the whole stripe.  I need more drives, or more stripes, to spread out the risk.  I'd like to think the risk is no greater than having the same number of single drives as I would have stripes.

Each stripe is an OSD, and by keeping the number of participating drives small, I can have more OSDs and more resiliency against catastrophic data loss.  My basic setup is three nodes: two of them with two OSDs each, and one with a single OSD (for the moment).  Each OSD is in fact two drives striped in ZFS.  Ceph recommends keeping the journals on an SSD, so that's what I did.  I created a single XFS file system on part of an SSD, and configured Ceph to put all the journals there.
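
Roughly, each OSD's pool and the journal file system look like this; the pool name, disk devices, OSD number, and the SSD slice are placeholders (I carve the SSD up with LVM, as described below), and ashift=12 simply matches 4K-sector drives:

    # two spinners striped into one pool, mounted where the OSD expects its data
    zpool create -o ashift=12 osd0 /dev/sdb /dev/sdc
    zfs set mountpoint=/var/lib/ceph/osd/ceph-0 osd0

    # a single XFS file system on part of the SSD holds all of the OSD journals
    mkfs.xfs /dev/ssd/journals
    mkdir -p /var/lib/ceph/osd/journals
    mount /dev/ssd/journals /var/lib/ceph/osd/journals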

The zpools also benefit from some SSD-love, so I gave part of each SSD over to each zpool as a ZIL device.  I split the SSD up with LVM, incidentally, to make resizing and moving easier.  Having the SSDs do double-duty seems sub-optimal to me, but I have limited hardware at the moment.  In future incarnations I would like to have multiple SSDs, for not only the ZIL and the OSD journals, but also the Ceph monitors.
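
That split is nothing fancy.  A sketch with made-up sizes and volume names:

    # carve the SSD into logical volumes: one journal area plus a ZIL per pool
    pvcreate /dev/sdg
    vgcreate ssd /dev/sdg
    lvcreate -L 20G -n journals ssd
    lvcreate -L 8G  -n zil-osd0 ssd
    lvcreate -L 8G  -n zil-osd1 ssd

    # attach each ZIL volume to its pool as a dedicated log device
    zpool add osd0 log /dev/ssd/zil-osd0
    zpool add osd1 log /dev/ssd/zil-osd1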

When using XFS for the journal, the requirement to disable direct I/O for the OSD journal goes away.  I remap each zpool's default mount location (zfs set mountpoint=/var/lib/ceph/osd/ceph-2, for instance) instead of creating additional ZFS datasets.  The journals are kept in /var/lib/ceph/osd/journals.
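
In ceph.conf that amounts to pointing the OSDs at journal files under that XFS mount.  A sketch; the path and size are my choices, not defaults:

    [osd]
        osd journal = /var/lib/ceph/osd/journals/$cluster-$id/journal
        osd journal size = 10240    # in MB
        # journal dio = false       # only needed when the journal file sits
        #                           # on ZFS, which lacks O_DIRECT support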

Standing up the OSD is then straightforward, and can easily be done manually.  Ceph-deploy would be nice to use, but I'm still gaining experience, and have yet to really explore the power of centralizing the configuration file.  Ultimately I should be able to use it for OSD replacement, but I'm just not there yet.  Luckily, the manual process is well documented, and once you've done it a few times it ceases to seem tricky.
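
For the record, the manual procedure (as documented in the Hammer-era Ceph docs) boils down to a handful of commands; the OSD id, weight, and hostname below are placeholders, and the host bucket is assumed to already exist in the CRUSH map:

    # allocate an OSD id, then initialize its data directory and key
    ceph osd create                      # prints the new id, e.g. 2
    ceph-osd -i 2 --mkfs --mkkey

    # register the key and place the OSD in the CRUSH map
    ceph auth add osd.2 osd 'allow *' mon 'allow profile osd' \
        -i /var/lib/ceph/osd/ceph-2/keyring
    ceph osd crush add osd.2 1.0 host=storage1

    # start it (Upstart on Ubuntu; use whatever init your release provides)
    start ceph-osd id=2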

I ran several tests, but my metric gathering was not really all that stellar.  I won't bother reporting many numbers here, other than to say that I noticed significant performance increases from the striping.  This was most evident when I transitioned one node from being RAID-Z to being a stripe.  My read throughput can reach upwards of 100 Mbytes/sec, and the write throughput tends to reach between 30 and 60 Mbytes/sec, depending on which OSDs are participating.  The servers have a heterogeneous collection of old SATA-2 drives, so I expect performance to improve further with a homogeneous collection of SATA-3 drives and more SSDs.

Latency varies, but usually doesn't go beyond a few hundred milliseconds.  However, I have yet to really load down the cluster with a lot of VM images.

Options and Builds

I am glad I haven't purchased production hardware yet, as I really didn't understand Ceph's needs until I dug into it.  As it stands, running a minimum of three nodes appears to be a requirement.  Four would be better, and is therefore now my target minimal deployment.

Since writing the above, I have had the opportunity to experience both intentional and natural failure scenarios, complete with rebuilds.  One of my OSDs died and I could have replaced it outright.  Instead I chose to augment the cluster with another server and two OSDs.  After watching the rebuild and observing some metrics via atop, I concluded that my single SSD was the bottleneck and that the Ceph journal was the most likely culprit.  While the ZIL partition also garnered significant activity, the journal partition was exceptionally busy, such that the spinners never really went above 20% utilization.  The recovery took about 6 hours, and 400G of data (1.2T of replicated data) was reallocated across the nodes once the newest node came online.  That yields an average recovery rate of about 18MB/sec of unique data, if I did the math right.
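
Checking that figure, using the 400G of unique data as the numerator (counting the 1.2T replicated total instead, the cluster-wide rate was roughly three times this):

    400 GB / 6 h  =  400,000 MB / 21,600 s  ≈  18.5 MB/s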

As such, I am now considering the following for my full production build-out:

  • 4 servers with 24 2.5" hot-swap bays each, outfitted initially with 4 or 6 spinners each, and two SSDs per 6 spinners.
  • Each pair of spinners will be striped together and serve as a single OSD.
  • For each pair of SSDs, one will be a ZIL volume shared among the 3 OSDs, and the other a journal volume shared likewise.
    • Alternatively, a very fast spinner may serve as a better journal volume.
    • Another alternative: striped mirrors
  • The backplane, drives, and HBA will be SAS-12Gb/sec
    • Finding an affordable SAS-12 SSD is actually very challenging.

This configuration can support a maximum of 8 OSDs per server, with 8 disks dedicated to ZIL and journal.  By revising the plan to use a striped-mirror set for the journal, it might be possible to reduce the ZIL/journal to 6 disks and gain a 9th OSD per server.  Two benefits of the striped-mirror set: less chance of catastrophic ZIL or journal failure/loss, and potentially a 3x speed-up on reads and writes to the underlying medium.  The question then becomes whether or not it makes sense to use SSDs.  Given the amount of work required of the Ceph journal, I would not think it wise to reduce the physical size of the journal array below 6 disks.
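
If I go the striped-mirror route, the journal pool would be built something like this (device names are placeholders; three mirrored pairs striped together).  Note that if the journal files end up on ZFS rather than XFS, the direct-I/O caveat mentioned earlier applies again; the same shape could also be built as an md RAID-10 under XFS:

    # three 2-disk mirrors striped into one pool for the OSD journals
    zpool create journals mirror /dev/sdq /dev/sdr \
                          mirror /dev/sds /dev/sdt \
                          mirror /dev/sdu /dev/sdv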

Using 900G drives and considering 8 OSDs per server, this cluster can support approximately 57.6T of data (or 19.2T after 3x replication), and can be easily expanded with additional servers.
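
The arithmetic behind that:

    4 servers x 8 OSDs x 2 drives x 0.9T  =  57.6T raw
    57.6T raw / 3 replicas                =  19.2T usable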

There is still some considerable testing to be done, to tune for optimum performance.  Also, having now gone through the exercise of adding an additional node to a live and evolving cluster, I can appreciate having a centralized configuration scheme that is consistent across all nodes.  I will try to post again with the steps I took and packages I installed to add storage to my cluster.  By and large the steps for adding storage are merely the final steps of a new cluster install, with tweaks.

One of my only concerns is how ZFS will perform once the cluster data has aged a bit.  The frequency of writes seems to impact long-term performance.  I am also concerned about my choice to employ XFS on the SSDs for the current cluster nodes; it may not have been an optimal choice, though I have not yet done any additional tuning.

All things being equal, I am very, very, very much enjoying Ceph!