iSCSI Target and Initiator Control
(Happily, I came to the realization that such a thing was probably not possible before I ever came to that article - now at least I feel like less of a dumb-ass.)
Suppose you want to offer up an HA iSCSI target to initiators (clients). Suppose your initiators and your target are both governed by Pacemaker - in fact, are in the same cluster. Here's a block diagram. I'm using DRBD to sync two "data-stores."
The current DRBD primary, ds-1, is operating the iSCSI target. For simplicity, ds-2 is acting as the only initiator here. What does this solution give us? If ds-1 dies for any reason, ideally we'd like the file system on ds-2 to not know anything about the service interruption. That is, of course, our HA goal. Now, we can easily clone both the initiator resource and the filesystem resource, and have a copy of those running on every acceptable node in the cluster, including ds-1. My goal is to have two storage nodes and several other "worker" nodes. This method totally masks the true location of the backing storage. The downside? Everything has to go through iSCSI, even for local access. No getting around that, it's the cost of this benefit.
The good news is that you can seamlessly migrate the backing storage from node to node (well, between the two here) without any interruption. I formatted my test case with ext4, mounted it on ds-1, and ran dbench for 60 seconds with 3 simulated clients while I migrated and unmigrated the target from machine to machine about 10 times. dbench was oblivious.
(Note that for a production system where multiple cluster members would actually mount the iSCSI-based store, you need a cluster-aware file system like GFS or OCFS2. I only used ext4 here for brevity and testing, and because I was manually mounting the store on exactly one system.)
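As an aside, here is roughly what the client side could look like in crm syntax: a cloned Filesystem resource layered on top of the cloned initiator (cl_iscsiclient-store0, which appears in the configuration further down). This is only a sketch; the resource names, mount point, and by-path device name are hypothetical (the latter is just udev's usual naming for an iSCSI LUN), it assumes an OCFS2 file system as per the note above, and it presumes the DLM/o2cb plumbing is already in place:
-------------------
primitive p_fs_store0 ocf:heartbeat:Filesystem \
params device="/dev/disk/by-path/ip-10.32.16.1:3260-iscsi-iqn.2012-05.datastore:store0-lun-0" \
directory="/mnt/store0" fstype="ocfs2" \
op monitor interval="20" timeout="40"
clone cl_fs-store0 p_fs_store0 \
meta interleave="true"
colocation colo_fs-store0 inf: cl_fs-store0 cl_iscsiclient-store0
order o_fs-store0 inf: cl_iscsiclient-store0:start cl_fs-store0:start
-------------------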
Some keys to making this work:
- You obviously must colocate your iSCSITarget, iSCSILogicalDevice, and IPaddr resources together.
- The IPaddr resource should be started last (and therefore stopped first). If it isn't, a deliberate migration of the group will shut down the iSCSITarget/LUN first, which cleanly severs the connection with the initiator. To keep the initiator in the dark, we instead steal the communication pathway out from under it and give it back once the resources are online on the new node. This may not work for everyone, but it worked for me.
- The iSCSITarget will need the portals parameter set to the virtual IP. Strictly speaking, it's the iscsi (initiator) resource that requires this, as it gets upset when it thinks it sees multihomed targets.
- Pick exactly one iSCSI target implementation - don't install both ietd and tgt, or evil will befall you.
- To ensure that the iscsi initiator resource isn't stopped during migration, you must use a score of 0 in the order statement. Here are the pertinent sections of my configuration:
-------------------
primitive p_drbd_store0 ocf:linbit:drbd \
params drbd_resource="store0" \
op monitor interval="15s" role="Master" timeout="20" \
op monitor interval="20s" role="Slave" timeout="20" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="100"
primitive p_ipaddr-store0 ocf:heartbeat:IPaddr2 \
params ip="10.32.16.1" cidr_netmask="12" \
op monitor interval="30s"
primitive p_iscsiclient-store0 ocf:heartbeat:iscsi \
params portal="10.32.16.1:3260" target="iqn.2012-05.datastore:store0" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
op monitor interval="120" timeout="30"
primitive p_iscsilun_store0 ocf:heartbeat:iSCSILogicalUnit \
params target_iqn="iqn.2012-05.datastore:store0" lun="0" path="/dev/drbd/by-res/store0"
primitive p_iscsitarget_store0 ocf:heartbeat:iSCSITarget \
params iqn="iqn.2012-05.datastore:store0" portals="10.32.16.1:3260"
group g_iscsisrv-store0 p_iscsitarget_store0 p_iscsilun_store0 p_ipaddr-store0 \
meta target-role="Started"
ms ms_drbd_store0 p_drbd_store0 \
meta master-max="1" notify="true" interleave="true" clone-max="2" target-role="Started"
clone cl_iscsiclient-store0 p_iscsiclient-store0 \
meta interleave="true" globally-unique="false" target-role="Started"
colocation colo_iscsisrv-store0 inf: g_iscsisrv-store0 ms_drbd_store0:Master
order o_iscsiclient-store0 0: g_iscsisrv-store0:start cl_iscsiclient-store0:start
order o_iscsisrv-store0 inf: ms_drbd_store0:promote g_iscsisrv-store0:start
-------------------
One final note... To achieve "load balancing," I set up a second DRBD resource between the two servers and configured a second set of Pacemaker resources to manage it. In the above configuration snippet I call the first one store0; the second one is store1. I also configured preferential location statements to keep store0 on ds-1 and store1 on ds-2. Yeah, I know, unfortunate names. The truth is the stores can go on either box, and either box can fail or be gracefully pulled down. The initiators should never know.
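For reference, the preferential location statements could be expressed with something like the following. The g_iscsisrv-store1 group name is my assumption mirroring the store0 snippet above, and the score of 100 is just a preference; anything short of inf keeps it from becoming a hard requirement:
-------------------
location loc_store0_on_ds-1 g_iscsisrv-store0 100: ds-1
location loc_store1_on_ds-2 g_iscsisrv-store1 100: ds-2
-------------------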
Why Fencing is REQUIRED...
I'll probably be dropping about $450 for this little gem, or something like it, very very soon. It's the APC AP7900, a rack-mountable, Ethernet-controllable power distribution unit (PDU). Why? I'm glad you asked!
While testing the aforementioned iSCSI configuration, and pumping copious amounts of data through it, I decided to see how a simulated failure would affect the cluster. To "simulate", I killed Pacemaker on ds-2. Ideally, the cluster should have realized something was amiss, and migrated services. It did, in fact, realize something went bust, but migration failed - because I have no fencing. The DRBD resource, primary on ds-2, wouldn't demote because Pacemaker was not there to tell it to do so. We can do some things with DRBD to help this, but the fact is the iSCSITarget and IP were still assigned to ds-2, and there was no killing them off without STONITH. Without killing them, reassignment to the new server would have resulted in an IP conflict. Happy thoughts about what would've happened to our initiators next!
You now see the gremlins crawling forth from the server cage.
During the "failure," the dbench transfer continued like nothing had changed, because, for all intents and purposes, nothing had. DRBD was still replicating, iSCSI was still working, and everything was as it should have been had the node not turned inside-out. Realize that even killing the corosync process would have had no effect here. If ds-2 had actually been driven batshit crazy, it would have had plenty of time to corrupt our entire datastore before any human noticed. So much for HA! The only reasonable recourse would have been to reboot or power off the node as soon as the total failure in communication/control was detected.
This was a simulated failure, at least, but one I could very readily see happening. Do yourself a favor: fence your nodes.
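For what it's worth, a fencing setup for two nodes plugged into a PDU like the AP7900 could look roughly like this. Treat it as a sketch only: it assumes the fence_apc agent from the fence-agents package (cluster-glue's external/rackpdu plugin is another option on Ubuntu), and the address, credentials, and outlet mapping are invented. Check the agent's metadata for the real parameter names before relying on it:
-------------------
primitive p_stonith_apc stonith:fence_apc \
params ipaddr="10.32.0.50" login="apc" passwd="secret" \
pcmk_host_map="ds-1:1;ds-2:2" \
op monitor interval="3600"
property stonith-enabled="true"
-------------------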
Oh yeah, and before you say anything, I'm doing this on desktop-class hardware, so IPMI isn't available here. My other server boxen have it, and I love it, and want very much to use it more. Still, some would advocate that it's a sub-standard fencing mechanism, and more drastic measures are warranted. I have no opinions there. FWIW, I'm ready to have a daemon listening on a port for a special command, so that a couple of echos can tell the kernel to kill itself.
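For the curious, the "couple of echos" I have in mind are the magic SysRq triggers. A minimal sketch of what such a daemon would run (as root) when it receives the kill command:
-------------------
# Make sure the magic SysRq interface is enabled, then reboot immediately.
# "b" reboots without syncing or unmounting; "o" powers the machine off instead.
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
-------------------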
Install All Your Resource Agents
I ran across an interesting problem. On two cluster members, I had iSCSITargets defined. On a third, I did not. Running as an asymmetric cluster (symmetric-cluster="false" in the cluster options), I expected that Pacemaker would not try starting an iSCSITarget resource on that third machine without explicit permission. Unfortunately, when it found it could not run a monitor (probe) for that resource on the third machine, the resource itself failed completely and shut itself down on the other two machines.
Thanks to a handy reply from the mailing list, I now understand that Pacemaker will check that a resource isn't already running anywhere else in the cluster if it's meant to run in only one place. (This could be logically extended.) Really, the case in point is: make sure you install all your resource agents on all your machines. This will keep your cluster sane.
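A quick sanity check worth running on every node before it joins: make sure the agents the CIB references actually exist locally. The iSCSI example here matches my setup; substitute whatever agents your configuration uses:
-------------------
# Confirm the OCF agents referenced by the CIB are installed on this node.
ls /usr/lib/ocf/resource.d/heartbeat/ | grep -i iscsi
crm ra info ocf:heartbeat:iSCSITarget
-------------------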
Monitor, Monitor, Monitor
I'm not sure whether this qualifies as a best practice yet. While trying to determine the source of some DLM strangeness, I realized I had not defined any monitors for either the pacemaker:controld RA or the pacemaker:o2cb RA. I posited that, as a result, the DLM was, for whatever reason, not correctly loading on the target machine, and consequently the o2cb RA failed terribly; this left me unable to mount my OCFS2 file system on that particular machine.
Pacemaker documentation states that it does not, by default, keep an eye on your resources. You must tell it explicitly to monitor by defining the monitor operation. My current word of advice: do this for everything you can, setting reasonable values. I expect to do some tweaking therein, but having the monitor configured to recommended settings certainly seems less harmful than not having it at all.
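For the record, this is roughly what those primitives look like with monitors attached. The intervals and timeouts are my guesses at reasonable values rather than gospel, and the group/clone arrangement is the common pattern for running the DLM and o2cb on every node that mounts OCFS2:
-------------------
primitive p_controld ocf:pacemaker:controld \
op monitor interval="60" timeout="60"
primitive p_o2cb ocf:pacemaker:o2cb \
op monitor interval="60" timeout="60"
group g_ocfs2mgmt p_controld p_o2cb
clone cl_ocfs2mgmt g_ocfs2mgmt \
meta interleave="true"
-------------------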
iSCSI - Don't Mount Your Own Targets
This applies only if your targets happen to be block devices. Surely, if you use a file as a backing store for a target, life will be easier (albeit a little slower). The most recent meltdown occurred during a little node reconfiguration. Simply put, I wanted to convert one node to use a bridge instead of the straight bond, which would thereby allow it to host virtuals as well as provide storage. The standby was fine, the upgrade went great, but the restart was disastrous! Long story short, the logs and the mtab revealed that the two OCFS2 stores which were intended for iSCSI were already mounted! You can't share out something that is already hooked up, so the iSCSITarget resource agent failed, which also failed out the one initiator that was actively connected to it. The initiator machine is now in la-la land, and the VMs that were hosted on the affected store are nuked.
If you build your targets as files instead of block devices, this is a non-issue. The kernel will not sift through files looking for file system identifiers, and you will be safe from errant UUID-based mounts to the wrong place. Otherwise, don't mount your target on your server, unless you're doing it yourself and have prepared very carefully to ensure there is NO WAY you or the OS could possibly mount the wrong version of it.
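If you do go the file-backed route, creating the store is trivial. A hypothetical example (path and size made up); the iSCSILogicalUnit's path parameter would then point at the image file instead of the DRBD device:
-------------------
# Create a 100 GB sparse image file to act as the LUN's backing store.
truncate -s 100G /srv/iscsi/store0.img
-------------------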
Adding a New Machine to the Cluster
Some handy bits of useful stuff for Ubuntu 11.10:
- apt-get install openais ocfs2-tools ocfs2-tools-pacemaker pacemaker corosync resource-agents iscsitarget open-iscsi iscsitarget-dkms drbd8-utils dlm-pcmk
- for X in {o2cb,drbd,pacemaker,corosync,ocfs2}; do update-rc.d ${X} disable; done
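A quick way to double-check that the disable step took on Ubuntu's SysV layout (runlevel 2 is the default): disabled scripts show up as K links, while anything still listed as an S link will start itself at boot instead of waiting to be started deliberately.
-------------------
# Disabled init scripts appear as K##name; S##name entries will still auto-start.
ls /etc/rc2.d/ | grep -E 'o2cb|drbd|pacemaker|corosync|ocfs2'
-------------------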
Idea: Have a secondary testing cluster, if feasible, with an identical cluster configuration (other than maybe running in a contained environment, on a different subnet). Make sure your new machine plays nice with the testing cluster before deployment. This way you can make sure you have all the necessary packages installed. The goal here is to avoid contaminating your live cluster with a new, not-yet-configured machine. Even if your back-end resources (such as the size of your DRBD stores) are different (much smaller), the point is to make sure the cluster configuration is good and stable. I am finding that this very powerful tool can be rather unforgiving when prodded appropriately. Luckily, some of my live iSCSI initiators were able to reconnect, as I caught a minor meltdown and averted disaster thanks to some recently-gained experience.
In the above commands, I install more things than I need on a given cluster machine, because Pacemaker doesn't seem to do its thing 100% right unless they are on every machine. (I am finding this out the hard way. And no, the Ubuntu resource-agents package alone does not seem to be enough.) So, things like iscsitarget and DRBD are both unwanted but required.
Test Before Deployment
In the above section on adding a new machine to a cluster, I mention an "idea" that isn't really mine, and is a matter of good practice. Actually, it's an absolutely necessary practice. Do yourself a favor and find some scrap machines you are not otherwise using. Get some reclaims from a local junk store if you have to. Configure them the same way as your production cluster (from Pacemaker's vantage point), and use them as a test cluster. It's important - here's why:
Today a fresh Ubuntu 11.10 install went on a new machine that I needed to add to my VM cluster. I thought I had installed all the necessary resources, but as I wasn't near a browser I didn't check my own blog for the list of commonly "required" packages. As a result, I installed pretty much everything except the dlm-pcmk and openais packages. After I brought Pacemaker up, I realized it wasn't working right, and then realized (with subdued horror) that, thanks to those missing packages, my production cluster was now annihilating itself. Only one machine remained alive: the one that successfully STONITHed every other machine. Thankfully, there were only two other machines. Not so thankfully, about 12 virtual machines between them were killed simultaneously.
Your test cluster should mirror your production cluster in everything except whatever is the minimal amount of change necessary to segregate it from the production cluster: at least a different multicast address, and maybe a different auth-key. A separate, off-network switch would probably be advisable. Once you've vetted your new machine, remove it from the test cluster, delete its CIB, and let the machine join the production cluster.
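As an example of the kind of segregation I mean, only the totem interface section of /etc/corosync/corosync.conf (and the authkey, if you use one) needs to differ between the two clusters. These addresses are placeholders:
-------------------
# Inside the totem { } block of /etc/corosync/corosync.conf on the test cluster:
interface {
        ringnumber: 0
        bindnetaddr: 10.99.0.0
        mcastaddr: 226.94.99.1
        mcastport: 5405
}
-------------------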
A word of warning - I haven't tried this whole method yet, but I plan to...very soon.