20130429

AoE, you big tease...

I did some more testing with AoE today.  I'll try to detail here what it does and doesn't appear to be capable of.

Using Multiple Ethernet Ports

The aoe driver you modprobe will give you the option of using multiple ethernet ports, or at the very least selecting which port to use.  I'm not sure what the intended functionality of this feature is, because if your vblade server is not able to communicate across more than one port at a time, you're really not going to find this very useful.  The only way I've been able to see multi-gigabit speeds is to create RR bonds on both the server and the client.  This requires either direct-connect or some VLAN magic on a managed switch, since many/most switches don't dig RR traffic on their own.
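
If you want to poke at the interface-selection part yourself, this is roughly the shape of it - just a sketch, with eth0/eth1 standing in for whatever ports you actually have:

    # Tell the aoe module which interfaces it may use for AoE traffic.
    modprobe aoe aoe_iflist="eth0 eth1"

    # aoetools can also change the list at runtime, then rescan for targets:
    aoe-interfaces eth0 eth1
    aoe-discover
    aoe-stat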

I could see where this feature would work out well if you have multiple segments or multiple servers, and want to spread the load across multiple ports that way.  Otherwise I don't see much usefulness here.

How did I manage RR on my switch?

So, to do this on a managed switch, I created two VLANs for my two bond channels, and assigned one port from each machine to each channel.  Four switch ports, two VLANs, and upwards of 2Gb/sec bandwidth.  This is thus expandable to any number of machines if you can handle the caveat that should a machine lose one port, it will lose all ability to communicate effectively with the rest of its network over this bond.  This is because the RR scheduler on both sides expects all paths to be connected.  A sending port cannot see that the recipient has left the party when both are attached to a switch, because its own link to the switch stays up.  ARP monitoring might take care of this issue, maybe, but then I don't think it will necessarily tell you not to send to a client on a particular channel, and you'll need all your servers ARPing each other all the time.  Sounds nasty.
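
For reference, the bond itself looked something like this on each box (a sysfs sketch; eth0/eth1 are placeholders, and the switch-side VLAN assignments obviously happen on the switch, not here):

    # Load bonding in round-robin mode; bond0 appears automatically.
    modprobe bonding mode=balance-rr miimon=100
    ip link set bond0 up
    # Slaves must be down before they can be enslaved.
    ip link set eth0 down
    ip link set eth1 down
    echo +eth0 > /sys/class/net/bond0/bonding/slaves
    echo +eth1 > /sys/class/net/bond0/bonding/slaves
    # AoE isn't IP, so an address on bond0 is optional for this use.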

AoE did handle RR traffic extremely well.  Anyone familiar with RR will note that packet ordering is not guaranteed, and you will most definitely have some of your later packets arriving before some of your earlier packets.  In UDP tests the reordering is usually minor at low bandwidth; the higher the transmission rate, the more out-of-order delivery you see.
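
If you want to see the reordering for yourself, a plain UDP blast over the bond will show it.  iperf is one way to do it (my suggestion here, not what AoE itself uses); the server-side report counts out-of-order datagrams, and that count climbs with the rate:

    # On the receiving box:
    iperf -s -u
    # On the sender, push UDP as hard as the bond allows:
    iperf -c 10.0.0.2 -u -b 1800M -t 30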

The Best Possible Speed

To test the effectiveness of AoE, with explicit attention to the E part, I created a ramdrive on the server, seeded it with a 30G file (I have lots of RAM), and then served that up over vblade.  I ran some tests using dbench and dd.  To ensure that no local caching effects skewed the results, I had to set the various /proc/sys/vm/dirty_* fields to zero - specifically, ratio and background_ratio.  Without doing that, you'll see fantastic rates of 900MB/sec, which is a moonshot above any networking gear I have to work with.
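
The server-side rigging looked roughly like this - a sketch, with the tmpfs path, shelf/slot numbers, and bond name as stand-ins for whatever you use:

    # On the server: back a vblade export with a file sitting in RAM.
    mount -t tmpfs -o size=32g tmpfs /mnt/ram
    dd if=/dev/zero of=/mnt/ram/export.img bs=1M count=30720
    vblade 0 0 bond0 /mnt/ram/export.img &

    # On the client: keep the page cache from faking the numbers.
    echo 0 > /proc/sys/vm/dirty_ratio
    echo 0 > /proc/sys/vm/dirty_background_ratio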

With a direct connection between my two machines, and RR bonds in place, I could obtain rates of around 130MB/sec.  The same appeared true for my VLAN'd switch.  Average latency was very low.  In dbench, the WriteX call had the highest average latency, at 267ms.  Even flushes ran extremely fast.  That makes me happy, but the compromise is that there is no fault-tolerance beyond what we'd get if a whole switch dies - and that is, by the way, assuming you have your connections and VLANs spread across multiple switches.
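
On the client side, the test runs themselves were along these lines (the e0.0 device name follows from the shelf/slot I exported above; mount point and run lengths are just examples):

    modprobe aoe
    aoe-discover
    aoe-stat                       # should list e0.0 for shelf 0, slot 0
    mkfs.ext4 /dev/etherd/e0.0
    mount /dev/etherd/e0.0 /mnt/aoe
    dbench -D /mnt/aoe -t 60 4
    dd if=/dev/zero of=/mnt/aoe/big bs=1M count=8192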

Without all of that rigging, the next best thing is balance-alb, and then you're back to standard gigabit with the added benefit of fault-tolerance.  As far as AoE natively using multiple interfaces, the reality seems to be that this feature either doesn't exist like it's purported to, or it requires additional hardware (read: Coraid cards).  Since vblade itself requires a single interface to bind to, the best hope is a bond, and no bond mode except RR will utilize all available slaves for everything.  That's the blunt truth of it.  As far as the aoe module itself goes, I really don't know what its story is.  Even with the machines directly connected and the server configured with a RR bond, the client machine did not seem to actively make use of the two adapters.
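
Falling back to balance-alb on an existing bond is straightforward; mode changes want an empty, downed bond, so it goes something like this (same placeholder interface names as before):

    ip link set bond0 down
    echo -eth0 > /sys/class/net/bond0/bonding/slaves
    echo -eth1 > /sys/class/net/bond0/bonding/slaves
    echo balance-alb > /sys/class/net/bond0/bonding/mode
    ip link set bond0 up
    echo +eth0 > /sys/class/net/bond0/bonding/slaves
    echo +eth1 > /sys/class/net/bond0/bonding/slaves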

Dealing with Failures

One thing I like about AoE is that it is fairly die-hard.  Even when I forcefully caused a networking fault, the driver recovered once the connectivity returned and things returned to normal.  I guess as long as you don't actively look to kill the connection with an aoe-flush, you should be in a good state no matter what goes wrong.  
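
The aoetools side of that story looks roughly like this after a hiccup (device name is from my test setup):

    aoe-stat                   # shows each target and whether it's up or down
    aoe-revalidate e0.0        # re-probe a specific target
    aoe-flush e0.0             # forcibly forget it - don't do this if it's mounted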

That being said, if you're not pushing everything straight to disk and something bad happens on your client, you're looking at some quantity of data now missing from your backing store.  How much will depend on those dirty_* parameters I mentioned earlier.  And catastrophic faults rarely happen predictably.

Of course, setting the dirty_* parameters to something sensible and greater than zero may not be an entirely bad thing.  Allowing some pages to get cached seems to improve latency and throughput significantly.  How to measure the risk?  Well, informally, I'm watching network traffic via ethstatus.  The only traffic on the selected adapter is AoE.  As such, it's pretty easy to see when big accesses start and stop.  In my tests against the ramdrive store, traffic started flowing immediately and stopped a few seconds after the dbench test completed.  Using dd without the oflag=direct option left me with a run that finished very quickly, but that did not appear to be actually committed to disk until about 30 seconds later.  Again, kernel parameters should help this.
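
A middle ground might look like the following - the numbers here are just a starting point I'd try, not a recommendation - and the two dd runs make the cached-versus-direct difference easy to see on ethstatus:

    echo 10 > /proc/sys/vm/dirty_ratio
    echo 5 > /proc/sys/vm/dirty_background_ratio
    echo 500 > /proc/sys/vm/dirty_expire_centisecs    # start flushing dirty pages after ~5s

    dd if=/dev/zero of=/mnt/aoe/test bs=1M count=4096                 # returns fast, commits later
    dd if=/dev/zero of=/mnt/aoe/test bs=1M count=4096 oflag=direct    # paced by the wire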

Hitting an actual disk store yielded similar results, with only occasional hiccups.  For the most part the latency numbers stayed around 10ms, with transfer speeds reaching over 1100MB/sec (however dbench calculates this, especially considering that observed network speeds never reached beyond an aggregate 70MB/sec).

Security Options

I'm honestly still not a fan of the lack of security features in AoE.  All the same, I'm drawn to it, and now want to perform a multi-machine test.  Having multiple clients with multiple adapters on balance-alb will mean that I have to reconfigure vblade to MAC-filter for all those MACs, or not use MAC filtering at all.  That might be an option, and to that end perhaps putting it in a VLAN (for segregation's sake, not for bandwidth) wouldn't be so bad.  Of course, that's all really just to keep honest people honest.
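
The MAC filtering itself is just vblade's allow list, so with balance-alb clients you end up listing every slave MAC of every client - something like this sketch (addresses are placeholders):

    vblade -m 00:11:22:33:44:55,00:11:22:33:44:66 0 0 eth0 /mnt/ram/export.img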

Deploying Targets

If we keep the number of targets to a minimum, this should work out OK.  I still don't like that you have to be mindful of your target numbers - deploying identical numbers from multiple machines might spell certain doom.  For instance, I deployed the same device numbers on another machine, and of course there is no way to distinguish between the two.  vblade doesn't even complain that there are identical numbers in use.  Whether or not this will affect targets that are already mounted and in-use, I know not.  The protocol does not seem to concern itself with this edge-case.  As far as I am concerned, the whole thing could be more easily resolved by using a runtime-generated UUID instead of just the device ID numbers.  I guess we'll see how actively this remains developed.
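
Until something like that exists, the only defense is discipline: give each server its own shelf number so the e<shelf>.<slot> names clients see can never collide.  Paths here are hypothetical:

    # Server A:
    vblade 0 0 eth0 /exports/volA.img     # clients see this as e0.0
    # Server B:
    vblade 1 0 eth0 /exports/volB.img     # clients see this as e1.0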

Comparison with iSCSI

I haven't done this yet, but plan to very soon.

Further Testing

I'll be doing some multi-machine testing, and also looking into the applicable kernel parameters more closely.  I want to see how AoE responds to the kind of hammering only a full complement of virtual machines can provide.  I also want to make sure my data is safe - outages happen far more than they should, and I want to be prepared.




