20160321

Don't Run Ceph+ZFS on Shitty Hardware

Isn't that just the way it goes, when things go to hell at 2 in the morning?

And that's what's happened.  But it wasn't the first time.

I have (or, soon enough, "had") an experimental Ceph cluster running on some spare hardware.  As is proper form for experimental things, it quickly became production when the need arose to free up some other production hardware.  Some notes about this wonder-cluster:

  • It runs a fairly recent, but not the latest, Ceph.
  • It started with two nodes, and grew to four, then shrunk to three.
  • It has disks of many sizes.
  • It uses ZFS.
  • The hardware is workstation-class.
  • The drives are old and several have died.

Sounds like production-quality to me!  Ha... but that was my choice, and I'm now reaping what I have sown.

Long story short, some of the OSDs occasionally get all bound up in some part of ZFS, as near as I can tell.  It could be that the drives are not responding fast enough, or that there's a race condition in ZFS that these systems are hitting with unfortunate frequency.  Whatever the reason, what ends up happening is that the kernel log fills up with hung-task warnings, and since only 5 of the 8 OSDs are still living, the cluster instantly freezes I/O due to insufficient replication.  Note that the data, at least, is still safe.
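For what it's worth, this is roughly how I poke at things when the freeze hits; the pool name "rbd" below is just an example, not necessarily what your pools are called:

    # Overall cluster state; down OSDs and stuck/inactive PGs show up here
    ceph -s
    ceph health detail

    # How many copies the pool wants, and the minimum it will serve I/O with
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # On the wedged host: the hung-task warnings piling up in the kernel log
    dmesg | grep -i "blocked for more than"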

Typically I've had to hard-reboot machines when this happens.  My last attempt - this very evening - took place from my home office by way of command-line SYSRQ rebooting (thanks, Major Hayden!  I love your page on that topic!).  Unfortunately, graceful shutdowns don't tend to work when the kernel gets into whatever state I find it in at times like these.  One morning, I had to have my tech hard-cycle a machine that wasn't even accessible via SSH.
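For reference, the "ritual" is nothing exotic; run as root on the stuck box, it's roughly this (the final 'b' reboots immediately, with no sync or clean unmount, so it's strictly a last resort):

    # Make sure the magic SysRq interface is enabled
    echo 1 > /proc/sys/kernel/sysrq

    # Best effort: sync dirty data and remount filesystems read-only
    echo s > /proc/sysrq-trigger
    echo u > /proc/sysrq-trigger

    # Immediate reboot, no graceful shutdown
    echo b > /proc/sysrq-trigger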

Generally what happens next is that the machine in question comes back online, I turn the Ceph-related services back on, the cluster recovers, and everything goes on its merry way...for the most part.  If the hypervisors have been starved for IO for several hours, I end up rebooting most of the VMs to get them moving.  Unfortunately, tonight was not going to be that awesome.
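Turning Ceph back on afterward isn't fancy; depending on which init system the node runs, it's along these lines (the OSD ID here is just an example):

    # sysvinit-style Ceph nodes:
    service ceph start osd.3

    # systemd-based nodes:
    systemctl start ceph-osd@3

    # then sit back and watch the recovery
    ceph -w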

I had been in the process of migrating VMs off my old Ceph+Proxmox cluster and onto a new Ceph+Proxmox cluster.  This had been going well, but during one particular transfer something peculiar happened... I suddenly couldn't back up VMs any longer.  I also noticed several telltale kernel alerts on the console of the VM in question, the usual business of "I can't read or write disk!  AAAHHH!!"  I logged into one of the old Ceph boxes and sure enough, an OSD had gone down.  The OSD processes on that machine were pretty pegged, stuck waiting for disk I/O.  But the disks?  Not doing anything interesting, ironically.  atop reported zero usage.  So I figured a hard reset was in order, and did my command-line ritual to force the reboot.  But it never came back...
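The diagnosis itself only takes a minute, something like:

    # Which OSDs dropped out?
    ceph osd tree | grep -w down

    # Are the underlying disks actually busy?  (They weren't.)
    iostat -x 2
    atop 2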

Now, at the risk of jinxing myself (since my transfers are not yet complete), I'm going to say right now that fate was on my side.  I had transferred all but a very small handful of VMs to the new cluster, and this was the set I had been saving for last anyway.  But they were also important, and I decided it would be much better just to get them transferred before people started ringing my phone at 6:30 in the morning.  The only problem was how to access the images with a frozen Ceph cluster.

I'm sure a kitten somewhere will die every time someone reads this, but I reconfigured the old Ceph cluster to operate with 1 replica.  Since I wouldn't be doing much writing, and I just had to get the last few VMs off the old storage, I felt (and hoped) it would be an OK thing to do.  Needless to say, I am feverishly transferring while the remaining OSDs are yet living.
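For the morbidly curious, dropping to a single replica is just a couple of pool settings, applied per pool (again, "rbd" is a stand-in for my actual pool names):

    # Keep only one copy of each object, and serve I/O with only one copy
    ceph osd pool set rbd size 1
    ceph osd pool set rbd min_size 1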

Probably the main limiting factor on the transfer rate is the NFS intermediary I have to go through to get the VMs from one cluster to the other.  But I must credit Proxmox: their backup and restore functionality has made this infinitely easier than the last time I migrated VMs.  That time, I was transferring from a virsh/pacemaker (yes, completely command-line) configuration.  Nothing wrong with virsh or pacemaker (both of which are very powerful packages), but I have to say I'm sold on Proxmox for large-scale hypervisor management...especially for the price!
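The move itself boils down to a vzdump on the old cluster and a qmrestore on the new one, with the NFS export mounted as a storage on both sides.  The storage names, VMID, and path below are placeholders, not my actual setup:

    # On the old cluster: dump the VM to the shared NFS storage
    vzdump 101 --storage nfs-migrate --mode stop --compress lzo

    # On the new cluster: restore it onto the Ceph-backed storage
    qmrestore /mnt/pve/nfs-migrate/dump/vzdump-qemu-101-*.vma.lzo 101 --storage ceph-rbd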

Between my two new production hypervisors, I have just under 80 VMs running.  I'd like a third hypervisor, but I'm not sure I can sell my boss on that just yet.  My new production Ceph store has about 4.5T in use out of 12.9T of space, and I haven't installed all the hard drives yet.  When they came in, I noticed that they had all basically been made in the same factory, on the same day, so I decided we'd stagger their installation, hopefully giving us some buffer for when they start dying.
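Those usage figures are easy enough to keep an eye on:

    # Cluster-wide and per-pool usage
    ceph df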

Transfer rates on the new Ceph cluster can reach up to 120MB/sec writing.  I was hoping for more, but a large part of that may be the fact that I'm using ZFS for the OSDs, and for the journals, and the journals are not on super-expensive ultra-fast DRAM-SSDs.  The journals are, for what it's worth, on SSDs, but unfortunately several of the SSDs keep removing themselves from operation.  So far I haven't lost any journals, but I'm sure it will happen sooner or later.  Sigh...
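If you want a rough idea of what a pool can do without the NFS hop in the way, a quick benchmark from one of the Ceph nodes is good enough; the pool name here is a placeholder, and the test writes real objects, so aim it at something scratch-able.  On a stock filestore install you can also see at a glance where each OSD's journal lives:

    # 30-second write benchmark with 4MB objects (the default)
    rados bench -p scratch 30 write

    # Where does each OSD's journal actually point?
    ls -l /var/lib/ceph/osd/ceph-*/journal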

And the VM transfers are.....almost done....

