20130424

AoE - An Initial Look

I was recently investigating ways to make my SAN work faster, harder, better.  Along the way, I looked at some alternatives.

Enter AoE, or ATA-over-Ethernet.  If you're reading this you've probably already read some of the documentation and/or used it a bit.  It's pretty cool, but I'm a little concerned about, well, several things. I read some scathing reviews from what appear to be mud-slingers, and I don't like mud.  I have also read several documents that almost look like copy-and-paste evangelism.  Having given it a shot, I'm going to summarize my immediate impressions of AoE here.

Protocol Comparison: the AoE spec is only 12 pages; the iSCSI spec is 257.
That's nice, and simplicity is often one with elegance.  But small doesn't mean flawless, and iSCSI has a long history of development behind it.  With such a small protocol you also lose, from what I can tell, a lot of the fine-tuning knobs that might allow for more stable operation under less-than-ideal conditions.  That being said, a protocol this small should be hard to screw up in implementation, whether in software or hardware.

I like that it's its own layer-2 protocol.  It feels foreign, but it's also very, very fast.  I think it would be awesome to overlay the options of iSCSI onto the lightweight framework of AoE.

Security At Its Finest: As a non-routeable protocol, it is inherently secure.
OK, I'm gonna have to say, WTF?  Seriously?  OK, seriously, let's talk about security.  First, secure from who or what?  It's not secure on a LAN in an office with dozens of other potentially compromised systems.  It's spitting out data over the wire unencrypted, available for any sniffer to snag.

Second, it can be made routeable (I've seen the HOW-TOs), and that's cool, but I've never heard of a router being a sufficient security mechanism.  Use a VLAN, you say?  VLAN-jumping is now an old trick in the big book of exploits.  Keep your AoE traffic on a dedicated switch and the door to that room barred tightly.  MAC filtering to control access is cute, but stupid.  Sniff the packets, spoof a MAC and you're done.  Switches will not necessarily protect your data from promiscuous adapters, so don't take that for granted.  Of course, we may as well concede that a sufficiently-motivated individual WILL eventually gain access or compromise a system, whether it's AoE-based or iSCSI-based.  But I find the sheer openness of AoE disturbing.  If I could wrap it up with IPsec or at least have some assurance that the target will be extra-finicky about who/what it lets in, I'd be a little happier, even with degraded performance (within reason).
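
(For what it's worth, vblade does appear to have a MAC access list - the -m flag, if I'm reading the man page right - but here's roughly how little that buys you against anyone else on the same segment.  The interface names and MACs below are made up for illustration.)

    # On the target: export only to one "trusted" initiator MAC
    # (assumes vblade's -m access-list option; check your vblade man page)
    vblade -m 00:1b:21:aa:bb:cc 1 0 eth1 /dev/vg0/lv_test

    # On any other box plugged into the same segment: clone that MAC and help yourself
    ip link set eth0 down
    ip link set eth0 address 00:1b:21:aa:bb:cc
    ip link set eth0 up
    modprobe aoe
    aoe-discover        # e1.0 now shows up under /dev/etherd/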

Then there's the fact that I like to keep human error to a bare minimum, especially when it's my fingers on the keyboard.  Keeping targets out of reach means I won't accidentally format a volume that is live and important.  Volumes are exported by e-numbers (shelf.slot addresses), so running multiple target servers on your network means you have to manage your device exports very carefully.  Of course, none of this is mentioned in any of the documentation, and everyone's just out there vblading their file images as e0.0 on eth0.
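
To make that concrete, here's the kind of shelf-numbering discipline I have in mind.  vblade addresses every export as shelf.slot, so if every target box defaults to shelf 0 you're begging for a collision.  This is just a sketch - the shelf numbers, interfaces, and paths are placeholders, and vbladed is the backgrounding wrapper that ships alongside vblade:

    # server A gets shelf 1; keep a map of these somewhere you'll actually find later
    vbladed 1 0 eth1 /srv/aoe/vm-alpha.img     # appears on the wire as e1.0
    vbladed 1 1 eth1 /srv/aoe/vm-beta.img      # e1.1

    # server B gets shelf 2, so nothing can shadow server A's exports
    vbladed 2 0 eth1 /dev/vg0/lv_gamma         # e2.0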

Sorry, a little disdain there as the real world crashes in.  I'll try to curb that.

Multiple Interfaces?  Maybe.
If you happen to have several NIC ports that are not busy being used for, say, bonding or bridging, then you can let AoE stream traffic over them all!  Bitchin'.  This is the kind of throughput I've been waiting for...except I can't seem to use it without some sacrifices.

For me, the problem starts with running virtual machines that sometimes need access to iSCSI targets.  These targets are "only available" (not strictly, but let's say they are) over adapters configured for jumbo frames.  What's more, the adapters are bonded because, well, network cables get unplugged sometimes and switches sometimes die.  The book said: "no single points of failure," so there.  But maybe it's not such a big issue, and I just need to hand over a few ports to AoE and be done with it?

The documentation makes it clear how to do this on the client.  On the server, it's not so clear.  I think you bond some interfaces with RR-scheduling, and then let AoE do the rest.  How this will work on a managed gigabit switch that generally hates RR bonding, I do not yet know.  I also have not (yet) been able to use anything except the top-most adapter of any given stack.  For example, I have 4 ports bonded (in balance-alb) and the bond bridged for my VMs.  I can't publish vblades to the 4 ports directly, nor to the bond, but I can to the bridge.  So I'm stuck with the compromise of having to stream AoE data across the wire at basically the same max rate as iSCSI.  Sadness.
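
For the record, here's roughly what that compromise looks like, along with the client-side interface restriction the docs do cover.  The names (br0, eth2, eth3) are from my setup and are obviously placeholders; aoe_iflist is the aoe driver's module parameter for pinning the initiator to particular ports:

    # server side: the bond and bridge themselves live in the distro's network config;
    # the vblade gets published on the bridge, the only place it would stick for me
    vbladed 1 0 br0 /dev/vg0/lv_images

    # client side: tell the aoe initiator which ports it may use
    modprobe aoe aoe_iflist="eth2 eth3"
    aoe-discover
    aoe-stat        # should list e1.0, its size, and the interface(s) it was seen on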

Control and Reporting
I'm not intimately familiar with the vblade program, but so far it's not exactly blowing my skirt up.  My chief complaints to date:

  • I want to be able to daemonize it in a way that's more intelligent than just running it in the background.
  • I would like to get info about who's using which resources, how many computing/networking resources they're consuming, and so on.
  • I had to hack up a resource agent script so that Pacemaker could reliably start and stop vblade - the issue seemed to involve stdin and stdout handling, where vblade kept crashing.  There's a sketch of the workaround below.
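
For what it's worth, the fix amounted to keeping vblade's standard descriptors pointed somewhere harmless.  A stripped-down sketch of the idea (not my actual resource agent; the shelf, slot, interface, and paths are placeholders):

    #!/bin/sh
    # crude wrapper so Pacemaker (or anything else) can start/stop one vblade export
    # without vblade tripping over a closed stdin/stdout
    PIDFILE=/var/run/vblade-e1.0.pid

    case "$1" in
      start)
        vblade 1 0 br0 /dev/vg0/lv_images \
          </dev/null >/var/log/vblade-e1.0.log 2>&1 &
        echo $! > "$PIDFILE"
        ;;
      stop)
        [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
        ;;
      status)
        [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
        ;;
    esac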

Since it's not nice to say only negatives, here are some positives: it starts really fast, it's lightweight, fail-over should work flawlessly, and configuration is as easy as naming a device and an adapter to publish it on.  It does one thing really, really well: it provides AoE services.  And that's it.  It will hopefully not crash my hosts.

aoetools is another jewel - in more than one sense of the word.  Again I find myself pining for documentation, reporting, statistics, load information, and a scheme that is a little more controllable and feels less haphazard than "modprobe aoe and your devices just appear."  Believe me, I think it's cool that it's so simple.  I just somehow miss the fine-grained and ordered control of iSCSI.  Maybe this is just alien to me and I need to get used to it.  I fear there are gotchas I have not yet encountered.
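
For anyone who hasn't gotten that far yet, the entire initiator-side workflow really is about this small (the device node assumes the default /dev/etherd layout, and the filesystem choice is just an example):

    modprobe aoe                   # load the initiator; discovery kicks off on its own
    aoe-discover                   # ...or poke it by hand
    aoe-stat                       # something like:  e1.0  1024.000GB  eth2,eth3  up
    mkfs.xfs /dev/etherd/e1.0      # from here it's just another block device
    mount /dev/etherd/e1.0 /mnt/aoe-test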

It's FAST!
There's a catch to that.  The catch is that AoE caches a great deal of data on the initiator and backgrounds a lot of the real writing to the target.  So you know that guy who ran that 1000-client test with dbench?  He probably wasn't watching his storage server wigging out ten minutes after the test completed.  My tests were too good to be true, and after tuning to ensure writes hit the store as quickly as possible, the real rates presented themselves.

I can imagine that where reading is the primary activity, such as when a VM boots, this is no biggie.  But if a VM host suddenly fails, I don't want a lot of dirty data disappearing with it.  That would be disastrous.

Luckily, they give some hints on tuneables in /proc/sys/vm.  At one point I cranked the dirty-pages and dirty-ratio all the way to zero, just to see how the system responded.  dbench was my tool of choice, and I ran it with a variety of different client sizes.  I think 50 was about the max my systems could handle without huge (50-second) latencies.  A lot of that is probably my store servers, which are both somewhat slow in the hardware and extremely safe (in terms of data corruption protection and total RAID failures).  I'll be dealing with them soon.
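
In case it saves somebody a search, these are the vm.dirty_* knobs I was fiddling with - the values below are just where I happened to land while testing, not a recommendation:

    # push dirty pages toward the AoE device early and often
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=2
    sysctl -w vm.dirty_expire_centisecs=500
    sysctl -w vm.dirty_writeback_centisecs=100

    # then beat on it and watch latency, not just throughput
    dbench -D /mnt/aoe-test -t 300 50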

Other than that, I think it'd be hard to beat this protocol over the wire, and it's so low-level that overhead really should be at a minimum.  I do wish the kernel-gotchas were not so ominous; since this protocol is so low-level, your tuning controls become kernel tuning controls, and that bothers me a little.  Subtle breakage in the kernel would not be a fun thing to debug.  Read carefully the tuning documentation that is barely referenced in the tutorials, or not referenced at all.  (Did I mention I would like to see better docs?  Maybe I'll write some here after I get better at using this stuff.)

Vendor Lock-in
I read that, and thought: "Gimme a break!"  Seriously guys, if you're using Microsoft, or VMware, you're already locked in.  Don't go shitting yourself about the fact there's only one hardware vendor right now for AoE cards.  Double-standards are bad, man.

Overall Impressions
So to summarize...

I would like more real documentation, less "it's so AWESOME" bullshit, and some concrete examples of various implementations along with their related tripping-hazards and performance bottlenecks.  (Again, I might write some as I go.)

I feel the system as a whole is still a little immature, but has amazing potential.  I'd love to see more development of it, some work on more robust and effective security against local threats, and some tuning controls to help those of us who throw 10 or 15 Windows virtuals at it.  (Yeah, I know, but I have no choice.)  If anyone is using AoE for running gobs of VMs on cluster storage, I'd love to hear from you!!

If iSCSI and AoE had a child, it would be the holy grail of network storage protocols.  It would look something like this:

  • a daemon to manage vblades, query and control their usage, and distribute workload.
  • the low-and-tight AoE protocol, with at least authentication if not full encryption of the data envelope (options are nice - we like options; some may not want or need security, but some of us do).
  • target identification, potentially, or at least something to help partition out the vblade-space a little better.  I think of iSCSI target IDs and their LUNs, and though they're painful, they're also explicit.  I like explicitness.
  • some tuning parameters outside the kernel, so we don't feel like we're sticking our hands in the middle of a gnashing, chomping, chortling machine.

Although billed as competition to iSCSI, I think AoE actually serves a slightly different audience.  Whereas iSCSI provides a great deal of control and flexibility in managing SAN access for a wide variety of clients, AoE offers unbridled power and throughput on a highly controlled and protected network.  I really could never see using AoE to offer targets to coworkers or clients, since a single slip-up out on the floor could spell disaster.  But I'm thinking iSCSI may be too slow for my virtualization clusters.

iSCSI can be locked down.  AoE can offer near-full-speed data access.

Time will tell which is right for me.


2 comments:

  1. What kind of switch infrastructure is recommended? I can only imagine big iron Cisco GB.
    However... I had a thought on that. I have seen multi-port PCI Express gigabit network cards. I wonder if that might provide even tighter integration as well as security. Of course, then there are the I/O bottlenecks of the host system...

    Replies
    1. I'm using some relatively "inexpensive" HP V1910 gigabit switches, and so far my only bottleneck appears to be the drive and replication I/O on my storage cluster. My hypervisors each use two or three of these: http://ark.intel.com/products/49186/Intel-Ethernet-Server-Adapter-I340-T4. But only two ports per machine are used for SAN comm, currently. Jumbo packets are definitely a plus. Bandwidth testing with iperf shows full-duplex transmission rates around 980Mbit/sec (balance-alb).

      On host-to-host direct connections with those cards, I have been able to push out around 3Gb/sec with four ports in use (round-robin), over CAT-5e. 2-port round-robin bonds with CAT-6 reach about 1.95Gb/sec, and that's with TCP in the stack. So the cards are really quite awesome IMO. Even over the switch I am seeing around 96% maximum theoretical bandwidth. I just wish it was easier to do reliable RR over the switch.

      I'll be doing some AoE testing using ramdrives, just to exercise the networking, and will publish the results in another post.
