20121031

Bonded Bonds - A Brief Follow-up

I am reconsidering my original plan to bond two mode-4 bonds over a mode-6 super-bond.

To be fair, it works.  BUT, link recovery time after a total, catastrophic failure of all links is not good.  I would guess this is an edge case that few people have really thought about, and perhaps this sort of functionality is not yet mature.  OR I have horribly misconfigured the super-bond device.  More tuning and testing would be in order.

Basically, I have my two bonds connected to two switches with 802.3ad running.  Everything looked smooth, and aside from the fact that I can't get any iperf bandwidth tests to show rates above 1 Gbit/s, I felt it was fairly stable.  With the mode-6 super-bond joining the two mode-4 bonds, I could easily disconnect any three wires and still have connectivity, albeit with a slight (1 to 3 seconds) delay if the currently active slave was not the last mode-4 bond to be connected.  I suspect that delay is ARP-related, but that's just a hunch.
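
One possible explanation for the 1 Gbit/s ceiling, which has not been verified on this setup: 802.3ad balances traffic per flow, not per packet, so a single iperf stream between two hosts always lands on the same slave.  The sketch below is a simplified model of a layer2-style transmit hash (XOR of the two MAC addresses, modulo the slave count); the real driver's formula differs in detail, and the MAC addresses and function name are made up for illustration.

```python
def layer2_slave_index(src_mac: str, dst_mac: str, slave_count: int) -> int:
    """Pick a slave index from a simplified layer2-style hash.

    Simplified model only: XOR the two MAC addresses and take the
    result modulo the number of slaves in the bond.
    """
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return (src ^ dst) % slave_count


# Hypothetical MAC addresses for the two iperf endpoints.
host_a = "00:11:22:33:44:55"
host_b = "00:11:22:33:44:66"

# Every frame between the same pair of hosts hashes to the same slave,
# so the set of chosen slaves for one stream has exactly one member.
choices = {layer2_slave_index(host_a, host_b, 2) for _ in range(1000)}
print(choices)
```

If this is what is happening, running several iperf streams between different host pairs should show aggregate throughput above a single link, even though no individual stream does.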

The trouble began when I simulated a multi-switch failure.  Ordinarily I would expect that once either switch is restored, the super-bond will come back up.  That was either (a) not the case, or (b) taking so long to occur that it was unacceptable.  Of course, if a multi-switch failure really occurred during live operation, chances are I would not be able to react in time to prevent further catastrophic consequences anyway, so maybe this is an edge case that I should worry less about.

For the record, to test the above failure case, I pulled the cables first from switch A, then from switch B, one at a time.  After a few seconds I plugged a cable back into switch A.  No results.  I'll have to test again to be certain, but I either had to plug both A cables back into switch A, and/or both B cables back into switch B before I saw connectivity restored.  I want to say it was the latter case - the last switch to disconnect being the first one that would need to reconnect.
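
To make a retest less dependent on memory, the link state the bonding driver reports under /proc/net/bonding/ can be recorded while the cables are pulled and replaced.  Here is a minimal sketch of a parser for that file.  The sample text is an illustrative excerpt written for this example, not output captured from the machines above, and the interface names are assumptions.

```python
def parse_bond_status(text):
    """Return (bond_mii_status, {slave_name: mii_status}).

    The first "MII Status" line belongs to the bond itself; each later
    one belongs to the most recent "Slave Interface" line.
    """
    bond_status = None
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status":
            if current is None:
                bond_status = value
            else:
                slaves[current] = value
    return bond_status, slaves


# Illustrative excerpt in the style of /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

status, slave_states = parse_bond_status(SAMPLE)
print(status, slave_states)
```

Reading the file once a second for each of the three bonds, with a timestamp, would show which bond stays down after a cable goes back in and for how long.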

Using mode-6 on all links does allow me to span both switches and to fail both switches, the last switch to fail being the last to be brought back up, with no appreciable delay in restoration of connectivity.  I suspect that the super-bond and its subordinate bonds are not communicating link state to each other properly, but without digging through the code or asking people a lot more knowledgeable than myself, I will never know for certain.

Of course, on that note, Ubuntu 12.04 hasn't been exactly friendly about bringing my links back up after reboots.  If it can't detect a carrier on one of the lines, it won't bring the interface back.  Worse yet, it won't (or didn't) add it to the mode-6 bond on boot.  If it's a deterministic failure on boot, fine.  If it's non-deterministic, so much the worse, but the outcome would be the same:  here again I may need to fall back on my manual script for initial configuration, as I want to guarantee that no matter the circumstances, the system is configured to expect connectivity someday, if not immediately.
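
As a rough illustration of what such a manual configuration could look like (this is not the script mentioned above), the bonding driver exposes a sysfs interface that can create a bond and enslave interfaces.  The sketch below only builds the ordered list of steps and does not execute anything.  The bond name, interface names, and miimon value are assumptions, and the ordering (set the mode while the bond is down, bring the bond up, then enslave interfaces that are themselves down) follows my understanding of the kernel bonding documentation.

```python
def bond_setup_steps(bond, mode, slaves, miimon_ms=100):
    """Build an ordered plan of sysfs writes and commands for one bond.

    Each step is ("write", path, value) or ("run", command).  Nothing
    is executed here; a caller would apply the steps as root.
    """
    base = "/sys/class/net"
    steps = [
        ("write", f"{base}/bonding_masters", f"+{bond}"),       # create the bond
        ("write", f"{base}/{bond}/bonding/mode", mode),         # must precede "up"
        ("write", f"{base}/{bond}/bonding/miimon", str(miimon_ms)),
        ("run", f"ip link set {bond} up"),
    ]
    for slave in slaves:
        # Take the slave down first, then enslave it.
        steps.append(("run", f"ip link set {slave} down"))
        steps.append(("write", f"{base}/{bond}/bonding/slaves", f"+{slave}"))
    return steps


# Hypothetical names: a mode-6 (balance-alb) bond over two NICs.
plan = bond_setup_steps("bond0", "balance-alb", ["eth0", "eth1"])
for step in plan:
    print(step)
```

The appeal of this route is that the bond membership is stated explicitly, so a port with no carrier at boot should still join the bond and simply be reported as down until a cable appears.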

I plan on doing additional bandwidth testing, and will probably try the stacked bonding one more time for kicks.  I really want the maximum possible bandwidth along with the maximum possible reliability.  We shall see how things turn out.
