20160321

Port Isolation, VLANs, and Ceph: Lessons Learned

While configuring a new production Ceph cluster, I ran into problems that could only be described as asinine.  We built all the hardware, installed all the operating systems, deployed the latest stable Ceph release, and angels sang out from above....

....and then the cluster said, "Hm...I can't find some of my OSDs....degrade degrade degrade OH WAIT!  There they are..."  And a little later it said, again, "Hm...I can't find some of my OSDs...degrade degrade..."  just like before.  And it kept doing it.

I hit Google with search after search, and of course the Ceph documentation said something to the effect of: "If you have problems, make sure your networking is functioning properly."  So I tried to validate it.  I tested it.  I tested everything I could think of.  I even started disabling ports on the switches in an attempt to isolate which host was causing the issues.

But as I did this, I noticed something strange.  Some background first: I had configured all my machines with quad-port NICs, and split the ports so that two would serve client traffic and two would serve cluster traffic.  I had also set the bonding mode to balance-alb.  I have two gigabit switches, so as to remove single points of failure.  And when I disabled some of the ports on one of the switches, the problems went away.

I tested and retested, and couldn't find a reason why this should be.  I tried other ports on the switch, other ports on the servers, all with similar results.  Finally I started up CSSH and began running ping tests between the machines.  I then started shutting down ports in a divide-and-conquer search for the truth.  Eventually, I found that two of the ports on one of the switches (the ones for the two servers that seemed to be having most of the problems) were acting very peculiarly.  I verified all their settings again and again, and still got the same results.  Finally I started going through every single fucking category of settings on my network switches, until I came to "Port Isolation Group."  Inside that, the ports in question were in fact being isolated from the rest of the switch.  I realized I had done this a very long time ago to keep our wifi traffic separate from our LAN traffic.  Turning off port isolation fixed the problem.

And my head slammed the table.
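For anyone chasing something similar, the ping testing can be scripted instead of eyeballed across a wall of CSSH windows.  What follows is only a minimal sketch, not the procedure actually used here: the host names are made up, `ping_from` assumes passwordless ssh to each node and the Linux `ping` flags `-c` and `-W`, and the demo at the bottom uses a simulated probe so it runs without touching a network.

```python
import subprocess
from itertools import permutations


def ping_from(src, dst):
    """Assumed probe: ssh to src and send one ping to dst (Linux ping flags)."""
    result = subprocess.run(
        ["ssh", src, "ping", "-c", "1", "-W", "1", dst],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def reachability(hosts, probe=ping_from):
    """Probe every ordered pair of hosts and record success or failure."""
    return {(src, dst): probe(src, dst) for src, dst in permutations(hosts, 2)}


def suspects(hosts, matrix):
    """Hosts involved in at least one failed probe, worst offenders first."""
    failures = {h: 0 for h in hosts}
    for (src, dst), ok in matrix.items():
        if not ok:
            failures[src] += 1
            failures[dst] += 1
    return sorted((h for h in hosts if failures[h]), key=lambda h: -failures[h])


# Offline demo with hypothetical hosts: pretend ceph3 and ceph4 cannot
# reach each other, and everything else works.
hosts = ["ceph1", "ceph2", "ceph3", "ceph4"]


def simulated_probe(src, dst):
    return {src, dst} != {"ceph3", "ceph4"}


matrix = reachability(hosts, probe=simulated_probe)
print(suspects(hosts, matrix))  # ['ceph3', 'ceph4']
```

Run against the real probe, the hosts that float to the top of the list are the ones whose switch ports deserve a hard look first.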

But the fun wasn't over!  In the fight to determine why I had such strange problems, I decided a switch reboot would be a fun thing to do.  Having two nearly identically utilized switches meant one could go offline while the other stayed online.  Or so I thought.

I had been working over the past couple of years to make the best use of VLANs.  I hate VLANs, by the way, from a security point of view.  Anyway, circumstances being what they were, I had to use them.  And use them I did!  Unfortunately, I was also trying to be very secure and not allow out-of-scope traffic to hit the other switches.  Add in the fact that I have several redundant links between the switches, and that I rely on MSTP to do The Right Thing (tm), and you have a recipe for additional annoying headaches.

After yet more reading and analysis, I determined that MSTP had built the spanning tree with the root somewhere other than my two core switches.  Now, my two core switches are joined by a 6-port aggregation.  I figured that was plenty of bandwidth, except that the switches wouldn't use it.  And since the cluster and client traffic were only permitted to go over those ports, this became a very big problem.  Manually plotting out the spanning tree allowed me to understand this, and moreover gave me at least an interim answer.  I reconfigured the two core switches to be the preferred roots, and from there all other links fell into place.  To make sure of this, I also configured the non-preferred links between those switches and the others in the network to have a greater cost.
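Why the root ended up in the wrong place is easier to see once you remember how spanning tree picks a root: the bridge with the lowest bridge ID wins, where the ID is the configured priority followed by the MAC address as a tie-breaker.  If every switch is left at the usual default priority of 32768, the lowest MAC wins, which has nothing to do with where the switch sits in the network.  Here is a rough sketch of that rule; the switch names, MACs, and priority values are invented for illustration, and it ignores the per-instance details of MSTP.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int  # multiples of 4096; 32768 is the common default
    mac: str


def bridge_id(bridge):
    """Bridge ID ordering: priority first, MAC address breaks ties."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges):
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=bridge_id)


# Hypothetical network with everything at the default priority:
# the edge switch happens to have the lowest MAC, so it becomes root.
defaults = [
    Bridge("core1", 32768, "00:11:22:00:00:aa"),
    Bridge("core2", 32768, "00:11:22:00:00:bb"),
    Bridge("edge1", 32768, "00:11:22:00:00:01"),
]
print(elect_root(defaults).name)  # edge1

# Lowering the priority on the core switches makes core1 the root,
# with core2 next in line if core1 goes away.
tuned = [
    Bridge("core1", 4096, "00:11:22:00:00:aa"),
    Bridge("core2", 8192, "00:11:22:00:00:bb"),
    Bridge("edge1", 32768, "00:11:22:00:00:01"),
]
print(elect_root(tuned).name)  # core1
print(elect_root([b for b in tuned if b.name != "core1"]).name)  # core2
```

Giving the second core switch the next-lowest priority is what makes it the fallback root rather than leaving that choice to MAC addresses again.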

To be certain that a switch failure still did not nuke the network, I ended up configuring the trunks to allow all valid VLANs.  I may eventually pull this back, once I get a better handle on MSTP.  Ideally, MSTP should figure out how to use the appropriate links, but unfortunately I have more to learn there.
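The trade-off between tight VLAN filtering and surviving a switch failure can be checked on paper before pulling anything back.  The sketch below is a simplification under stated assumptions: the topology, switch names, and VLAN ID are hypothetical, and it only asks whether a path that carries the VLAN still physically exists once a switch is gone, on the assumption that spanning tree will eventually unblock whatever alternate path is available.

```python
from collections import deque


def vlan_survives(links, vlan, failed, members):
    """True if all surviving member switches can still reach each other
    over trunks that allow `vlan` once the `failed` switch is removed.

    links: list of (switch_a, switch_b, set_of_allowed_vlans)
    """
    adjacency = {}
    for a, b, allowed in links:
        if vlan in allowed and failed not in (a, b):
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    alive = [m for m in members if m != failed]
    if not alive:
        return True

    # Breadth-first search from one surviving member.
    seen = {alive[0]}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return all(m in seen for m in alive)


CLUSTER_VLAN = 20  # hypothetical VLAN ID

# Tight filtering: the cluster VLAN is only allowed on links through core1.
restricted = [
    ("core1", "core2", {CLUSTER_VLAN}),
    ("edge1", "core1", {CLUSTER_VLAN}),
    ("edge2", "core1", {CLUSTER_VLAN}),
    ("edge1", "core2", set()),
    ("edge2", "core2", set()),
]

# Same physical links, but every trunk allows the cluster VLAN.
permissive = [(a, b, {CLUSTER_VLAN}) for a, b, _ in restricted]

members = ["edge1", "edge2"]
print(vlan_survives(restricted, CLUSTER_VLAN, "core1", members))  # False
print(vlan_survives(permissive, CLUSTER_VLAN, "core1", members))  # True
```

Running the same check for each VLAN and each switch that might fail gives a quick list of which trunks actually need a given VLAN, which is useful when narrowing the allowed lists back down.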

So lesson learned: make sure your networking isn't fucked up.
