20130429

AoE, you big tease...

I did some more testing with AoE today.  I'll try to detail here what it does and doesn't appear to be.

Using Multiple Ethernet Ports

The aoe driver you modprobe will give you the option of using multiple ethernet ports, or at the very least selecting which port to use.  I'm not sure what the intended functionality of this feature is, because if your vblade server is not able to communicate across more than one port at a time, you're really not going to find this very useful.  The only way I've been able to see multi-gigabit speeds is to create RR bonds on both the server and the client.  This requires either direct-connect or some VLAN magic on a managed switch, since many (if not most) switches don't dig RR traffic on their own.

I could see where this feature would work out well if you have multiple segments or multiple servers, and want to spread the load across multiple ports that way.  Otherwise I don't see much usefulness here.

How did I manage RR on my switch?

So, to do this on a managed switch, I created two VLANs for my two bond channels, and assigned one port from each machine to each channel.  Four switch ports, two VLANs, and upwards of 2Gb/sec bandwidth.  This is expandable to any number of machines if you can live with the caveat that should a machine lose one port, it loses all ability to communicate effectively with the rest of its network over this bond.  This is because the RR scheduler on both sides expects all paths to be connected.  A sending port cannot see that the recipient has left the party if both are attached to a switch (which should always be online).  ARP monitoring might take care of this issue, maybe, but I don't think it will necessarily tell a server not to send to a client on a particular channel, and you'd need all your servers ARPing each other all the time.  Sounds nasty.
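
For reference, the RR bonds themselves are just standard Linux bonding.  A minimal sketch of what I mean, assuming the ifenslave package and with interface names and addresses invented for illustration, would go in /etc/network/interfaces something like this (repeat on the other machine with its own address):
auto eth2
iface eth2 inet manual
    bond-master bond0
auto eth3
iface eth3 inet manual
    bond-master bond0
auto bond0
iface bond0 inet static
    address 10.0.50.1
    netmask 255.255.255.0
    bond-mode balance-rr
    bond-miimon 100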

AoE did handle RR traffic extremely well.  Anyone familiar with RR bonding will note that packet ordering is not guaranteed, and you will most definitely have some of your later packets arriving before some of your earlier packets.  In UDP tests the reorder counts are usually small at low bandwidth; the higher the transmission rate, the more out-of-order packets you see.

The Best Possible Speed

To test the effectiveness of AoE, with explicit attention to the E part, I created a ramdrive on the server, seeded it with a 30G file (I have lots of RAM), and then served that up over vblade.  I ran some tests using dbench and dd.  To ensure that no local caching effects skewed the results, I had to set the relevant /proc/sys/vm/dirty_* fields to zero - specifically, dirty_ratio and dirty_background_ratio.  Without doing that, you'll see fantastic rates of 900MB/sec, which is a moonshot above any networking gear I have to work with.
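
If you want to reproduce that rig, here's roughly what I mean - a sketch only, with the mount point, file name, shelf/slot numbers and sizes all made up for illustration, and the interface being whatever vblade will actually bind to in your setup:
# ramdrive on the server, seeded with a big file
mkdir -p /mnt/ram
mount -t tmpfs -o size=32g tmpfs /mnt/ram
dd if=/dev/zero of=/mnt/ram/aoe.img bs=1M count=30720
# serve it up as shelf 0, slot 1
vblade 0 1 bond0 /mnt/ram/aoe.img &
# on the client, squash write caching so the numbers are honest
echo 0 > /proc/sys/vm/dirty_ratio
echo 0 > /proc/sys/vm/dirty_background_ratio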

With a direct connection between my two machines, and RR bonds in place, I could obtain rates of around 130MB/sec.  The same appeared true for my VLAN'd switch.  Average latency was very low; in dbench, the WriteX call had the highest average latency at 267ms.  Even flushes ran extremely fast.  That makes me happy, but the compromise is that there is no fault-tolerance beyond surviving the death of a whole switch - and that, by the way, assumes you have your connections and VLANs spread across multiple switches.

Without all of that rigging, the next best thing is balance-alb, and then you're back to standard gigabit with the added benefit of fault-tolerance.  As far as AoE natively using multiple interfaces goes, the reality seems to be that this feature either doesn't exist like it's purported to, or it requires additional hardware (read: Coraid cards).  Since vblade itself binds to a single interface, the best hope is a bond, and no bond mode except RR will utilize all available slaves for everything.  That's the blunt truth of it.  As for the aoe module itself, I really don't know what its story is.  Even with the machines directly connected and the server configured with a RR bond, the client machine did not seem to actively make use of the two adapters.

Dealing with Failures

One thing I like about AoE is that it is fairly die-hard.  Even when I forcefully caused a networking fault, the driver recovered once connectivity returned and things went back to normal.  I guess as long as you don't actively look to kill the connection with an aoe-flush, you should be in a good state no matter what goes wrong.

That being said, if you're not pushing everything straight to disk and something bad happens on your client, you're looking at some quantity of data now missing from your backing store.  How much will depend on those dirty_* parameters I mentioned earlier.  And catastrophic faults rarely happen predictably.

Of course, setting the dirty_* parameters to something sensible and greater than zero may not be an entirely bad thing.  Allowing some pages to get cached seems to lend itself to significantly better latency and throughput.  How to measure the risk?  Well, informally, I'm watching the network via ethstatus.  The only traffic on the selected adapter is AoE, so it's pretty easy to see when big accesses start and stop.  In my tests against the ramdrive store, traffic started flowing immediately and stopped a few seconds after the dbench test completed.  Using dd without the oflag=direct option left me with a run that finished very quickly, but whose data did not appear to actually be committed to disk until about 30 seconds later.  Again, the kernel parameters should help with this.
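
The dd difference is easy to see for yourself.  Something along these lines - mount point and sizes are just examples - shows the cached run "finishing" long before the wire goes quiet, while the direct run tracks the actual network traffic:
# cached: dd returns quickly, writeback trickles out for a while afterward
dd if=/dev/zero of=/mnt/aoe/test.img bs=1M count=4096
# direct: slower to return, but what you see is what actually hit the target
dd if=/dev/zero of=/mnt/aoe/test.img bs=1M count=4096 oflag=direct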

Hitting an actual disk store yielded similar results, with only occasional hiccups.  For the most part the latency numbers stayed around 10ms, with transfer speeds reaching over 1100MB/sec (however dbench calculates this, especially considering that observed network speeds never reached beyond an aggregate 70MB/sec).

Security Options

I'm honestly still not a fan of the lack of security features in AoE.  All the same, I'm drawn to it, and now want to perform a multi-machine test.  Having multiple clients with multiple adapters on balance-alb will mean that I have to reconfigure vblade to MAC-filter for all those MACs, or not use MAC filtering at all.  That might be an option, and to that end perhaps putting it in a VLAN (for segregation's sake, not for bandwidth) wouldn't be so bad.  Of course, that's all really just to keep honest people honest.
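
If memory serves, vblade's MAC filtering is just a comma-separated allow-list on the command line, so per-client filtering would look roughly like this (MACs, shelf/slot and path invented for illustration):
vblade -m 00:11:22:33:44:55,00:11:22:33:44:66 0 1 eth1 /srv/aoe/store.img
The pain with balance-alb clients is that any of their slave MACs might show up on the wire, so that list has to include every adapter on every client.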

Deploying Targets

If we keep the number of targets to a minimum, this should work out OK.  I still don't like that you have to be mindful of your target numbers - deploying identical numbers from multiple machines might spell certain doom.  For instance, I deployed the same device numbers on another machine, and of course there is no way to distinguish between the two.  vblade doesn't even complain that there are identical numbers in use.  Whether or not this will affect targets that are already mounted and in-use, I know not.  The protocol does not seem to concern itself with this edge-case.  As far as I am concerned, the whole thing could be more easily resolved by using a runtime-generated UUID instead of just the device ID numbers.  I guess we'll see how actively this remains developed.
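
To make the hazard concrete, suppose two servers both export with the same shelf/slot (numbers and paths here are invented):
# on server A
vblade 0 0 eth1 /srv/aoe/volume-a.img
# on server B
vblade 0 0 eth1 /srv/aoe/volume-b.img
On the client, aoe-discover and aoe-stat will still only show a single e0.0, with no indication of whose blocks /dev/etherd/e0.0 actually belongs to.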

Comparison with iSCSI

I haven't done this yet, but plan to very soon.

Further Testing

I'll be doing some multi-machine testing, and also looking into the applicable kernel parameters more closely.  I want to see how AoE responds to the kind of hammering only a full complement of virtual machines can provide.  I also want to make sure my data is safe - outages happen far more than they should, and I want to be prepared.





20130424

AoE - An Initial Look

I was recently investigating ways to make my SAN work faster, harder, better.  Along the way, I looked at some alternatives.

Enter AoE, or ATA-over-Ethernet.  If you're reading this you've probably already read some of the documentation and/or used it a bit.  It's pretty cool, but I'm a little concerned about, well, several things. I read some scathing reviews from what appear to be mud-slingers, and I don't like mud.  I have also read several documents that almost look like copy-and-paste evangelism.  Having given it a shot, I'm going to summarize my immediate impressions of AoE here.

Protocol Compare: AoE is only 12 pages, iSCSI is 257.
That's nice, and simplicity is often one with elegance.  But that doesn't mean it's flawless, and iSCSI has a long history of development.  With such a small protocol you also lose, from what I can tell, a lot of the fine-tuning knobs that might allow for more stable operation under less-ideal conditions.  That being said, with such a small protocol it would hopefully be hard to screw up its implementation, either in software or hardware.

I like that it's its own layer-2 protocol.  It feels foreign, but it's also very, very fast.  I think it would be awesome to overlay options of iSCSI on the lightweight framework of AoE.

Security At Its Finest: As a non-routeable protocol, it is inherently secure.
OK, I'm gonna have to say: WTF?  Seriously, let's talk about security.  First, secure from whom or what?  It's not secure on a LAN in an office with dozens of other potentially compromised systems.  It's spitting out data over the wire unencrypted, available for any sniffer to snag.

Second, it can be made routeable (I've seen the HOW-TOs), and that's cool, but I've never heard of a router being a sufficient security mechanism.  Use a VLAN, you say?  VLAN-jumping is now an old trick in the big book of exploits.  Keep your AoE traffic on a dedicated switch and the door to that room barred tightly.  MAC filtering to control access is cute, but stupid.  Sniff the packets, spoof a MAC and you're done.  Switches will not necessarily protect your data from promiscuous adapters, so don't take that for granted.  Of course, we may as well concede that a sufficiently-motivated individual WILL eventually gain access or compromise a system, whether it's AoE-based or iSCSI-based.  But I find the sheer openness of AoE disturbing.  If I could wrap it up with IPsec or at least have some assurance that the target will be extra-finicky about who/what it lets in, I'd be a little happier, even with degraded performance (within reason).

Then there's the notion that I just like to make sure human error is kept to a bare minimum, especially when it's my fingers on the keyboard.  Keeping targets out of reach means I won't accidentally format a volume that is live and important.  Volumes are exported by e-numbers, so running multiple target servers on your network means you have to manage your device exports very carefully.  Of course, none of this is mentioned in any of the documentation, and everyone's just out there vblading their file images as e0.0 on eth0.

Sorry, a little disdain there as the real world crashes in.  I'll try to curb that.

Multiple Interfaces?  Maybe.
If you happen to have several ports on your NICs that are not busy being used for, say, bonding or bridging, then you can let AoE stream traffic over them all!  Bitchin'.  This is the kind of throughput I've been waiting for...except I can't seem to use it without some sacrifices.

For me, the problem starts with running virtual machines that sometimes need access to iSCSI targets.  These targets are "only available" (not strictly true, but let's say they are) over adapters configured for jumbo frames.  What's more, the adapters are bonded because, well, network cables get unplugged sometimes and switches sometimes die.  The book said "no single points of failure," so there.  But maybe it's not so much of an issue and I just need to hand over a few ports to AoE and be done with it?

The documentation makes it clear how to do this on the client.  On the server, it's not so clear.  I think you bond some interfaces with RR-scheduling, and then let AoE do the rest.  How this will work on a managed gigabit switch that generally hates RR bonding, I do not yet know.  I also have not (yet) been able to use anything except the top-most adapter of any given stack.  For example, I have 4 ports bonded (in balance-alb) and the bond bridged for my VMs.  I can't publish vblades to the 4 ports directly, nor to the bond, but I can to the bridge.  So I'm stuck with the compromise of having to stream AoE data across the wire at basically the same max rate as iSCSI.  Sadness.
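
For the record, the client-side knob the documentation describes is just a module parameter (or the aoe-interfaces tool from aoetools); something like the following, with the interface names being mine:
modprobe aoe aoe_iflist="eth2 eth3"
# or, with the module already loaded:
aoe-interfaces eth2 eth3
As far as I can tell it just restricts which interfaces the aoe driver will use.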

Control and Reporting
I'm not intimately familiar with the vblade program, but so far it's not exactly blowing my skirt up.  My chief complaints to-date:
  • I want to be able to daemonize it in a way that's more intelligent than just running it in the background.
  • I would like to get info about who's using what resources, how much computing/networking capacity they're consuming, etc.
  • I had to hack up a resource agent script so that Pacemaker could reliably start and stop vblade - the issue seemed to involve stdin and stdout handling, where vblade kept crashing.
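
The sort of thing I mean by smarter daemonizing is really just keeping vblade away from the terminal entirely.  A bare-bones sketch - the paths, shelf/slot and file names are mine, nothing official - would be:
#!/bin/sh
# keep vblade off the terminal: stdin from /dev/null, output to a log
vblade 0 1 eth1 /srv/aoe/store.img </dev/null >>/var/log/vblade-e0.1.log 2>&1 &
echo $! > /var/run/vblade-e0.1.pid
It's not real daemonization, but it should at least sidestep the stdin/stdout weirdness.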

Since it's not nice to say only negatives, here are some positives: it starts really fast; it's lightweight; fail-over should work flawlessly; and configuration is as easy as naming a device and an adapter to publish it on.  It does one thing really, really well: it provides AoE services.  And that's it.  It will hopefully not crash my hosts.

aoetools is another jewel - in mixed senses of that word.  Again I find myself pining for documentation, reporting, statistics, load information, and a scheme that feels a little more controllable and less haphazard than "modprobe aoe and your devices just appear."  Believe me, I think it's cool that it's so simple.  I just somehow miss the fine-grained and ordered control of iSCSI.  Maybe this is just alien to me and I need to get used to it.  I fear there are gotchas I have not yet encountered.

It's FAST!
There's a catch to that.  The catch is that AoE caches a great deal of data on the initiator and backgrounds a lot of the real writing to the target.  So you know that guy that did that 1000 client test with dbench?  He probably wasn't watching his storage server wigging out ten minutes after the test completed.  My tests were too good to be true, and after tuning to ensure writes hit the store as quickly as possible, the real rates presented themselves.

I can imagine that where reading is the primary activity, such as when a VM boots, this is no biggie.  But if a VM host suddenly fails, I don't want a lot of dirty data disappearing with the host.  That would be disastrous.

Luckily, they give some hints on tuneables in /proc/sys/vm.  At one point I cranked the dirty-page ratios all the way down to zero, just to see how the system responded.  dbench was my tool of choice, and I ran it with a variety of client counts.  I think 50 was about the max my systems could handle without huge (50-second) latencies.  A lot of that is probably my store servers, which are both somewhat slow hardware-wise and tuned to be extremely safe (in terms of protection against data corruption and total RAID failures).  I'll be dealing with them soon.
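
For what it's worth, the dbench invocations were nothing fancy - something along these lines (mount point and time limit are just examples), varying the client count at the end:
dbench -D /mnt/aoe -t 120 50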

Other than that, I think it'd be hard to beat this protocol over the wire, and it's so low-level that overhead really should be at a minimum.  I do wish the kernel gotchas were not so ominous; since this protocol is so low-level, your tuning controls become kernel tuning controls, and that bothers me a little.  Subtle breakage in the kernel would not be a fun thing to debug.  Read carefully the tuning documentation that is barely referenced in the tutorials, or not referenced at all.  (Did I mention I would like to see better docs?  Maybe I'll write some here after I get better at using this stuff.)

Vendor Lock-in
I read that, and thought: "Gimme a break!"  Seriously guys, if you're using Microsoft, or VMware, you're already locked in.  Don't go shitting yourself about the fact there's only one hardware vendor right now for AoE cards.  Double-standards are bad, man.

Overall Impressions
So to summarize...

I would like more real documentation, less "it's so AWESOME" bullshit, and some concrete examples of various implementations along with their related tripping-hazards and performance bottlenecks.  (Again, I might write some as I go.)

I feel the system as a whole is still a little immature, but has amazing potential.  I'd love to see more development of it, some work on more robust and effective security against local threats, and some tuning controls to help those of us who throw 10 or 15 Windows virtuals at it.  (Yeah, I know, but I have no choice.)  If anyone is using AoE for running gobs of VMs on cluster storage, I'd love to hear from you!!

If iSCSI and AoE had a child, it would be the holy grail of network storage protocols.  It would look something like this:

  • a daemon to manage vblades, query and control their usage, and distribute workload.
  • the low-and-tight AoE protocol, with at least authentication security if not also full data-envelope encryption (options are nice - we like options.  Some may not want or need security; some of us do).
  • target identification, potentially, or at least something to help partition out the vblade-space a little better.  I think of iSCSI target IDs and their LUNs, and though they're painful, they're also explicit.  I like explicitness.
  • Some tuning parameters outside the kernel, so we don't feel like we're sticking our hands in the middle of a gnashing, chomping, chortling machine.
Although billed as competition to iSCSI, I think AoE actually serves a slightly different audience.  Whereas iSCSI provides a great deal of control and flexibility in managing SAN access for a wide variety of clients, AoE offers unbridled power and throughput on a highly controlled and protected network.  I really could never see using AoE to offer targets to coworkers or clients, since a single slip-up out on the floor could spell disaster.  But I'm thinking iSCSI may be too slow for my virtualization clusters.

iSCSI can be locked down.  AoE can offer near-full-speed data access.

Time will tell which is right for me.


20130417

Sustainable HA MySQL/MariaDB

I ran into this problem just yesterday, and thought I'd write about what I'm trying to do to fix it.  Use at your own risk; hopefully this will work well.

I needed to run updates on my DB cluster.  It's a two-node cluster, and generally stable on Ubuntu 12.04.2 LTS.  Unfortunately, the way I had configured my HA databases meant that when I ran updates, one of the nodes completely broke.  The update process failed because it was unable to start MariaDB, and MariaDB couldn't start because the database files were nowhere to be found on that node at that time.

Not liking the idea of having to update the database server "hot," then migrating over to the second node and updating it "hot" again, I thought perhaps this would be a good time for some manual package management.  This would mean the following:

  • I'd have to get the packages manually and configure the essentials accordingly - factory-default paths be damned!
  • No more automatic updates - a mixed bag: they're awesome when they work and terrible when they don't.  Luckily they usually "Just Work" (tm)
  • I'd have the latest and greatest that MariaDB has to offer.
  • I would have to be more mindful in the future about updates and making sure things don't break en-route to a new version.
OK, so originally I had installed MariaDB via apt-get, and put the database files themselves on an iSCSI target.  I used bind-mounts to place everything (from configuration files to the actual db files) where MySQL/MariaDB expected everything to be.  For this fix, my first thought was to put the binaries (well, the whole MariaDB install) on the iSCSI target.  This would mean one upgrade, one copy of binaries, and only one server capable of starting said database.

That didn't work - Pacemaker needs access to the binaries to make sure the database isn't started elsewhere on the cluster.  So, I set up a directory structure as follows:
  • /opt/mariadb
    • .../versions/ (put your untarred-gzipped deployments here)
    • .../current --> links to versions/(current version you want to use)
    • .../var   --> this is where the iSCSI target will now be mounted
    • .../config  --> my.cnf and conf.d/... will be here
MariaDB offers precompiled tar.gz deployments, which is really nice.  I can put these wherever I want.  In this case I'm building in an escape route for future upgrades by putting the fresh deployment files in a versions/ directory and linking to the version that I want to use.  No changes to configuration files or Pacemaker should be necessary, and upgrades won't stomp existing deployments this way.  Of course, back up your databases frequently and before each upgrade.
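
Setting that layout up is only a handful of commands.  A sketch, with the version number being whatever MariaDB happens to be shipping when you read this:
mkdir -p /opt/mariadb/versions /opt/mariadb/config /opt/mariadb/var
tar xzf mariadb-5.5.30-linux-x86_64.tar.gz -C /opt/mariadb/versions/
ln -s /opt/mariadb/versions/mariadb-5.5.30-linux-x86_64 /opt/mariadb/current
# a later upgrade is just: untar the new version, then
# ln -sfn /opt/mariadb/versions/<new-version> /opt/mariadb/current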

Inside /opt/mariadb/var, I've placed a log and db directory.  log originally came from /var/log, and has a variety of transaction logs in it.  The db folder contains the actual database files, what would normally be found in /var/lib/mysql.  

The configuration files MIGHT work under the /opt/mariadb/var folder, in which case it ought to be named something more appropriate.  I left them out of there for the sake of having them always available on both nodes.  I felt this was a safer route, and I don't have time to experiment much.

The my.cnf file has to be properly configured.  I snagged the my.cnf file that the original MariaDB apt-get install provided, and changed paths accordingly.  Now there are no bind-mounts, and for all intents and purposes I could simply duplicate the entire /opt/mariadb directory on a new node and be up and running in no time.  (New node deployment is technically untested as of this writing.)
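
For concreteness, the path-related settings I mean look something like this (the binlog base name is a guess on my part; the socket and pid paths match the Pacemaker primitive further down, and everything else stays as the stock file had it):
[mysqld]
basedir  = /opt/mariadb/current
datadir  = /opt/mariadb/var/db
socket   = /var/run/mysqld/mysqld.sock
pid-file = /var/run/mysqld/mysqld.pid
log_bin  = /opt/mariadb/var/log/mariadb-bin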

Note that if you happen to be moving existing log files (especially a .index file), the .index file will contain file paths that need to be updated.  sed will be your friend here - a one-liner like the one sketched after the launch and shutdown commands below - and you can cat the file to check the contents.  Once everything is done, you should be able to run the following command and see a successful MariaDB launch:
/opt/mariadb/current/bin/mysqld --defaults-file=/opt/mariadb/config/my.cnf
In case you don't know, here's how you shut down your successful launch:
/opt/mariadb/current/bin/mysqladmin --defaults-file=/opt/mariadb/config/my.cnf shutdown -p
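
About that .index fix mentioned above: it boils down to a search-and-replace on the old paths, roughly like this (the old and new paths are examples - match them to your own layout):
sed -i 's|/var/log/mysql|/opt/mariadb/var/log|g' /opt/mariadb/var/log/*.index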

The MySQL primitive in Pacemaker needs to be properly configured.  Here is what mine looks like:

primitive p_db-mysql0 ocf:heartbeat:mysql \
    params binary="/opt/mariadb/current/bin/mysqld" \
           config="/opt/mariadb/config/my.cnf" \
           datadir="/opt/mariadb/var/db" \
           pid="/var/run/mysqld/mysqld.pid" \
           socket="/var/run/mysqld/mysqld.sock" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" timeout="30s"


So far, this new configuration seems to work.  Comments and suggestions are welcome.