20160321

Port Isolation, VLANs, and Ceph: Lessons Learned

While configuring a new production Ceph cluster, I ran into problems that could only be described as asinine.  We built all the hardware, installed all the operating systems, deployed the latest stable Ceph release, and angels sang out from above....

....and then the cluster said, "Hm...I can't find some of my OSDs....degrade degrade degrade OH WAIT!  There they are..."  And a little later it said, again, "Hm...I can't find some of my OSDs...degrade degrade..."  just like before.  And it kept doing it.

I hit Google with search after search, and of course the Ceph documentation said something to the effect of: "If you have problems, make sure your networking is functioning properly."  So I tried to validate it.  I tested it.  I tested everything I could think of.  I even started disabling ports on the switches in an attempt to isolate which host was causing the issues.

But as I did this, I noticed something strange.  First, some background: I had configured all my machines with quad-port NICs, and split the ports so that two would serve client traffic and two would serve cluster traffic.  I had also set the bonding mode to balance-alb.  I have two gigabit switches, so as to remove single points of failure.  And when I disabled some of the ports on one of the switches, the problems went away.

I tested and retested, and couldn't find a reason why this should be.  I tried other ports on the switch, other ports on the servers, all with similar results.  Finally I started up CSSH and began running ping tests between the machines.  I then started shutting down ports in a divide-and-conquer search for the truth.  Eventually, I found that two of the ports on one of the switches (the ports for the two servers that seemed to be having most of the problems) were acting very peculiarly.  I verified all their settings again and again, and still got the same results.  Finally I started going through every single fucking category of settings on my network switches, until I came to "Port Isolation Group".  Inside that, the ports in question were in fact being isolated from the rest of the switch.  I realized I had done this a very long time ago to keep our wifi traffic separate from our LAN traffic.  Turning off port isolation fixed the problem.
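The ping sweep described above is easy to script so that it can be rerun after every port change.  What follows is only a minimal sketch, not the procedure actually used here: the hostnames are hypothetical, and the flags assume the Linux iputils version of ping (-c for the packet count, -W for the reply timeout in seconds).  Run on each machine in turn (for example through CSSH), it lists the peers that machine cannot reach.

```python
import subprocess


def ping_command(host, count=1, timeout_s=1):
    """Build a Linux (iputils) ping command line for one probe."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def can_reach(host):
    """Return True if the host answers the probe (ping exit status 0)."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_peers(peers, probe=can_reach):
    """Return the peers that did not answer, in the order given."""
    return [host for host in peers if not probe(host)]


# Hypothetical usage on one node; the hostnames are made up:
#   print(unreachable_peers(["ceph1", "ceph2", "ceph3", "ceph4"]))
```

Comparing the output from every node before and after disabling a switch port narrows down which links are misbehaving.  The probe argument exists only so the logic can be exercised without touching the network.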

And my head slammed the table.

But the fun wasn't over!  In the fight to determine why I had such strange problems, I decided a switch-reboot would be a fun thing to do.  Having two nearly-identically utilized switches meant one could go offline while the other stayed online.  Or so I thought.

I had been working over the past couple of years to make the best use of VLANs.  I hate VLANs, by the way, from a security point of view.  Anyway, circumstances being what they were, I had to use them.  And use them I did!  Unfortunately, I was also trying to be very secure and not allow out-of-scope traffic to hit the other switches.  Add in the fact that I have several redundant links between the switches, and that I rely on MSTP to do The Right Thing (tm), and you have a recipe for additional annoying headaches.

After yet more reading and analysis, I determined that MSTP had configured the spanning tree to put the root somewhere other than my two core switches.  Now, my two core switches are joined with a 6-port aggregation between them.  I figured that was plenty of bandwidth, except that the switches wouldn't use it.  And since the cluster and client traffic were only permitted to go over those ports, this became a very big problem.  Manually plotting out the spanning tree allowed me to understand this, and moreover gave me at least an interim answer.  I reconfigured the two core switches to be the preferred roots, and from there all other links fell into place.  To make sure of this, I also configured the non-preferred links between those switches and the others in the network to have a greater cost.
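Why the root landed on the wrong switch is easier to see with the election rule written out.  In spanning tree, the bridge with the numerically lowest bridge ID wins, where the ID is the configured priority followed by the MAC address as a tie-breaker.  If every switch is left at the default priority of 32768, the lowest MAC wins, which may well be some random edge switch.  The sketch below models only that comparison; the switch names, MAC addresses, and priority values are hypothetical, and it ignores MSTP regions and per-instance settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int  # configurable in steps of 4096; 32768 is the usual default
    mac: str       # tie-breaker when priorities are equal


def bridge_id(bridge):
    """Bridge ID as a sortable tuple: priority first, then MAC as an integer."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges):
    """The bridge with the lowest bridge ID becomes the root."""
    return min(bridges, key=bridge_id)


# Hypothetical switches, all left at the default priority.
defaults = [
    Bridge("core1", 32768, "00:1a:2b:00:00:30"),
    Bridge("core2", 32768, "00:1a:2b:00:00:20"),
    Bridge("edge1", 32768, "00:1a:2b:00:00:10"),
]

# The same switches after lowering the priority on the two cores.
tuned = [
    Bridge("core1", 4096, "00:1a:2b:00:00:30"),
    Bridge("core2", 8192, "00:1a:2b:00:00:20"),
    Bridge("edge1", 32768, "00:1a:2b:00:00:10"),
]

print(elect_root(defaults).name)   # the edge switch wins on MAC alone
print(elect_root(tuned).name)      # core1 wins on priority
print(elect_root(tuned[1:]).name)  # with core1 gone, core2 takes over
```

Giving the second core switch the next-lowest priority is what makes it the fallback root when the first one is rebooted.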

To be certain that a switch failure still did not nuke the network, I ended up configuring the trunks to allow all valid VLANs.  I may eventually pull this back, once I get a better handle on MSTP.  Ideally, MSTP should figure out how to use the appropriate links, but unfortunately I have more to learn there.

So lesson learned: make sure your networking isn't fucked up.

Don't Run Ceph+ZFS on Shitty Hardware

Isn't that just the way it goes, when things go to hell at 2 in the morning?

And that's what's happened.  But it wasn't the first time.

I have (or soon to be: "had") an experimental Ceph cluster running on some spare hardware.  As is proper form for experimental things, it quickly became production when the need to free up some other production hardware arrived.  Some notes about this wonder-cluster:

  • It runs a fairly recent, but not the latest, Ceph.
  • It started with two nodes, and grew to four, then shrunk to three.
  • It has disks of many sizes.
  • It uses ZFS.
  • The hardware is workstation-class.
  • The drives are old and several have died.

Sounds like production-quality to me!  Ha...  but, that was my choice, and I'm now reaping what I have sown.

Long story short, some of the OSDs occasionally get all bound up in some part of ZFS, as near as I can tell.  It could be that the drives are not responding fast enough, or that there's a race condition in ZFS that these systems are hitting with unfortunate frequency.  Whatever the reason, what ends up happening is that the kernel log gets loaded up with task-waiting notifications, and since there are only 5 out of 8 OSDs still living, the cluster instantly freezes operation due to insufficient replication.  Note that the data, at least, is still safe.

Typically I've had to hard-reboot machines when this happens.  My last attempt - this very evening - took place from my home office by way of command-line SYSRQ rebooting (thanks, Major Hayden!  I love your page on that topic!).  Unfortunately, graceful shutdowns don't tend to work when the kernel gets in whatever state I find it at times like these.  One morning, I had to have my tech hard-cycle a machine that was even inaccessible via SSH.
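For anyone who hasn't seen it, the general mechanism behind a command-line SysRq reboot on Linux is writing single-letter commands to /proc/sysrq-trigger as root: "s" requests an emergency sync, "u" remounts filesystems read-only, and "b" reboots immediately without any further cleanup.  The sketch below shows that mechanism only; it is not necessarily the exact sequence used here or the one given on the page mentioned above.  The function names and the pause between keys are my own assumptions.  Pointed at the real trigger file, it WILL reboot the machine.

```python
import time

SYNC, REMOUNT_READ_ONLY, REBOOT = "s", "u", "b"


def emergency_reboot_keys(sync_first=True):
    """SysRq keys to send: optionally sync and remount read-only before rebooting."""
    if sync_first:
        return [SYNC, REMOUNT_READ_ONLY, REBOOT]
    return [REBOOT]


def send_sysrq(keys, trigger_path="/proc/sysrq-trigger", pause_s=2.0):
    """Write each key to the trigger file, one write per key, pausing in between.

    Needs root when aimed at the real trigger file.  The pause is an
    assumption, meant to give the sync a moment before the reboot.
    """
    for key in keys:
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        time.sleep(pause_s)


# Hypothetical usage on a wedged box (this reboots it immediately):
#   send_sysrq(emergency_reboot_keys())
```

On a kernel that is stuck waiting on disk I/O, the sync may never complete, which is why the reboot key does not wait for it.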

Generally what happens next is that the machine in question comes back online, I turn the Ceph-related services back on, the cluster recovers, and everything goes on its merry way...for the most part.  If the hypervisors have been starved for IO for several hours, I end up rebooting most of the VMs to get them moving.  Unfortunately, tonight was not going to be that awesome.

I had been in the process of migrating VMs off my old Ceph+Proxmox cluster, and on to a new Ceph+Proxmox cluster.  This had been going well, but during one particular transfer something peculiar happened... I suddenly couldn't back up VMs any longer.  I also noticed on the VM console for the VM in question several telltale kernel alerts, the usual business of "I can't read or write disk! AAAHHH!!"  I logged into one of the old Ceph boxes and sure enough, an OSD had gone down.  The OSDs on the machine in question were pretty pegged, stuck waiting for disk I/O.  But the disks?  Not doing anything interesting, ironically.  atop reported zero usage.  So, I figured a hard-reset was in order, and did my command-line ritual to force the reboot.  But it never came back...

Now, at the risk of jinxing myself (since my transfers are not yet complete), I'm going to say right now that fate was on my side.  I had transferred all but a very small handful of VMs to the new cluster, and I was saving this set for last anyway.  But they were also important, and I decided it would be much better just to get them transferred before people started ringing my phone at 6:30 in the morning.  The only problem was how to access the images with a frozen Ceph cluster.

I'm sure a kitten somewhere will die every time someone reads this, but I reconfigured the old Ceph cluster to operate with 1 replica.  Since I wouldn't be doing much writing, and I just had to get the last few VMs off the old storage, I felt (and hoped) it would be an OK thing to do.  Needless to say, I am feverishly transferring while the remaining OSDs are yet living.
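The reason this unfreezes things: Ceph stops serving I/O for a placement group once the number of live replicas drops below the pool's min_size, and a client that touches one blocked placement group stalls.  Lowering the threshold to 1 lets any placement group with a single surviving copy serve I/O again, at the obvious cost of having no redundancy.  The sketch below is a simplified model of that rule, assuming the change amounts to lowering min_size (and possibly size) to 1.  The placement-group IDs and OSD layout are hypothetical, and it ignores CRUSH remapping and backfill entirely.

```python
def blocked_pgs(pg_to_osds, up_osds, min_size):
    """Return the IDs of placement groups with fewer live replicas than min_size."""
    blocked = []
    for pg_id, osds in pg_to_osds.items():
        live = [osd for osd in osds if osd in up_osds]
        if len(live) < min_size:
            blocked.append(pg_id)
    return sorted(blocked)


# Hypothetical layout: 8 OSDs (0-7), three replicas per placement group.
pg_to_osds = {
    "1.0": [0, 5, 6],  # one replica left
    "1.1": [1, 2, 7],  # two replicas left
    "1.2": [3, 4, 0],  # all three replicas alive
}
up_osds = {0, 1, 2, 3, 4}  # 5 of 8 OSDs still living

print(blocked_pgs(pg_to_osds, up_osds, min_size=2))  # "1.0" is blocked
print(blocked_pgs(pg_to_osds, up_osds, min_size=1))  # nothing is blocked
```

A placement group with no surviving replicas stays blocked no matter how low min_size goes, so this only works while at least one copy of everything is still alive.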

Probably the main limiting factor to the transfer rate is the NFS intermediary I have to go through, to get the VMs from one cluster to another.  But I must credit Proxmox: their backup and restore functionality has made this infinitely easier than the last time I migrated VMs.  The last time, I was transferring from a virsh/pacemaker (yes, completely command-line) configuration.  Nothing wrong with virsh or pacemaker (both of which are very powerful packages), but I have to say I'm sold on Proxmox for large-scale hypervisor management...especially for the price!

Between my two new production hypervisors, I have just under 80 VMs running.  I'd like a third hypervisor, but I'm not sure I can sell my boss on that just yet.  My new production Ceph store has about 4.5T in use, out of 12.9T of space, and I haven't installed all the hard drives yet.  When they came in, I noticed that they were all basically made in the same factory, on the same day, so I decided that we'd stagger their installation to hopefully give us some buffer for when they start dying.

Transfer rates on the new Ceph cluster can reach up to 120MB/sec writing.  I was hoping for more, but a large part of that may be the fact that I'm using ZFS for the OSDs, and for the journals, and the journals are not on super-expensive ultra-fast DRAM-SSDs.  The journals are, for what it's worth, on SSDs, but unfortunately several of the SSDs keep removing themselves from operation.  So far I haven't lost any journals, but I'm sure it will happen sooner or later.  Sigh...

And the VM transfers are.....almost done....