20160321

Port Isolation, VLANs, and Ceph: Lessons Learned

While configuring a new production Ceph cluster, I ran into problems that could only be described as asinine.  We built all the hardware, installed all the operating systems, deployed the latest stable Ceph release, and angels sang out from above....

....and then the cluster said, "Hm...I can't find some of my OSDs....degrade degrade degrade OH WAIT!  There they are..."  And a little later it said, again, "Hm...I can't find some of my OSDs...degrade degrade..."  just like before.  And it kept doing it.

I hit Google with search after search, and of course the Ceph documentation said something to the effect of: "If you have problems, make sure your networking is functioning properly."  So I tried to validate it.  I tested it.  I tested everything I could think of.  I even started disabling ports on the switches in an attempt to isolate which host was causing the issues.

But as I did this, I noticed something strange.  Some background first: I had configured all my machines with quad-port NICs, and split the ports so that two would serve client traffic and two would serve cluster traffic.  I had also set the bonding mode to balance-alb.  I have two gigabit switches, so as to remove single points of failure.  And when I disabled some of the ports on one of the switches, the problems went away.

I tested and retested, and couldn't find a reason why this should be.  I tried other ports on the switch, other ports on the servers, all with similar results.  Finally I started up CSSH and began running ping tests between the machines.  I then started shutting down ports in a divide-and-conquer search for the truth.  Eventually, I found that two of the ports on one of the switches (the ports for the two servers that seemed to be having most of the problems) were acting very peculiarly.  I verified all their settings again and again, and still got the same results.  Finally I started going through every single fucking category of settings on my network switches, until I came to "Port Isolation Group."  Inside that, the ports in question were in fact being isolated from the rest of the switch.  I realized I had done this a very long time ago to keep our wifi traffic separate from our LAN traffic.  Turning off port isolation fixed the problem.
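For what it's worth, the ping sweep part of that hunt can be sketched in a few lines of shell.  This is a minimal sketch under assumptions, not the exact thing I ran: the function name probe_hosts and the host names in the usage note are made up, and the probe command is passed in as an argument so it can be swapped for something other than ping.

```shell
#!/bin/sh
# probe_hosts PROBE HOST...
# Runs PROBE against every HOST and prints one "ok"/"FAIL" line per host.
# PROBE is left unquoted on purpose so it can carry its own options.
probe_hosts() {
    probe=$1
    shift
    for host in "$@"; do
        if $probe "$host" >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

Run from each machine in turn (for example: probe_hosts "ping -c 1 -W 1" ceph1 ceph2 ceph3) while a switch port is shut down, the FAIL lines show which pairs of hosts lose sight of each other.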

And my head slammed the table.

But the fun wasn't over!  In the fight to determine why I had such strange problems, I decided a switch-reboot would be a fun thing to do.  Having two nearly-identically utilized switches meant one could go offline while the other stayed online.  Or so I thought.

I had been working over the past couple of years to make the best use of VLANs.  I hate VLANs, by the way, from a security point of view.  Anyway, circumstances being what they were, I had to use them.  And use them I did!  Unfortunately, I was also trying to be very secure and not allow out-of-scope traffic to hit the other switches.  Add in the fact that I have several redundant links between the switches, and that I rely on MSTP to do The Right Thing (tm), and you have a recipe for additional annoying headaches.

After yet more reading and analysis, I determined that MSTP had configured the spanning tree to put the root somewhere other than my two core switches.  Now, my two core switches are joined with a 6-port aggregation between them.  I figure that's plenty of bandwidth, but for the fact that the switches wouldn't use it.  And since the cluster and client traffic were only permitted to go over those ports, this became a very big problem.  Manually plotting out the spanning tree allowed me to understand this, and moreover gave me at least an interim answer.  I reconfigured the two core switches to be the preferred roots, and from there all other links fell into place.  To make sure of this, I also configured the non-preferred links between those switches and the others in the network to be of greater cost.
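The reason the root landed in the wrong place is easier to see with the election rule written down: spanning tree picks the bridge with the lowest bridge ID, comparing priority first and MAC address second.  Here is a minimal sketch of that comparison.  The function name elect_root, the switch names, and the MAC addresses are all invented for illustration, and on MSTP the same comparison happens per instance.

```shell
#!/bin/sh
# elect_root: reads "PRIORITY MAC NAME" lines on stdin and prints the NAME of
# the bridge with the lowest bridge ID (priority first, MAC as tie-breaker).
# MACs must share the same case and format for the tie-break to sort properly.
elect_root() {
    LC_ALL=C sort -k1,1n -k2,2 | head -n 1 | awk '{ print $3 }'
}

# Everything at the default priority: the lowest MAC wins, wherever it lives.
printf '%s\n' \
    '32768 00:1a:2b:00:00:10 edge1' \
    '32768 00:1a:2b:00:00:20 core1' \
    '32768 00:1a:2b:00:00:30 core2' | elect_root    # prints edge1

# Lower priorities on the core switches make them root and backup root.
printf '%s\n' \
    '32768 00:1a:2b:00:00:10 edge1' \
    '4096 00:1a:2b:00:00:20 core1' \
    '8192 00:1a:2b:00:00:30 core2' | elect_root     # prints core1
```

Bridge priorities are set in steps of 4096, which is why the example uses 4096 and 8192 rather than arbitrary small numbers.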

To be certain that a switch failure still did not nuke the network, I ended up configuring the trunks to allow all valid VLANs.  I may eventually pull this back, once I get a better handle on MSTP.  Ideally, MSTP should figure out how to use the appropriate links, but unfortunately I have more to learn there.

So lesson learned: make sure your networking isn't fucked up.

Don't Run Ceph+ZFS on Shitty Hardware

Isn't that just the way it goes, when things go to hell at 2 in the morning?

And that's what's happened.  But it wasn't the first time.

I have (or soon to be: "had") an experimental Ceph cluster running on some spare hardware.  As is proper form for experimental things, it quickly became production when the need to free up some other production hardware arrived.  Some notes about this wonder-cluster:

  • It runs a fairly recent, but not the latest, Ceph.
  • It started with two nodes, and grew to four, then shrunk to three.
  • It has disks of many sizes.
  • It uses ZFS.
  • The hardware is workstation-class.
  • The drives are old and several have died.
Sounds like production-quality to me!  Ha...  but, that was my choice, and I'm now reaping what I have sown.

Long story short, some of the OSDs occasionally get all bound up in some part of ZFS, as near as I can tell.  It could be that the drives are not responding fast enough, or that there's a race condition in ZFS that these systems are hitting with unfortunate frequency.  Whatever the reason, what ends up happening is that the kernel log gets loaded up with task-waiting notifications, and since there are only 5 out of 8 OSDs still living, the cluster instantly freezes operation due to insufficient replication.  Note that the data, at least, is still safe.

Typically I've had to hard-reboot machines when this happens.  My last attempt - this very evening - took place from my home office by way of command-line SYSRQ rebooting (thanks, Major Hayden!  I love your page on that topic!).  Unfortunately, graceful shutdowns don't tend to work when the kernel gets in whatever state I find it at times like these.  One morning, I had to have my tech hard-cycle a machine that was even inaccessible via SSH.
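For reference, the command-line SysRq reboot boils down to two writes into /proc.  The sketch below wraps them in a function (the name sysrq_reboot and the CONFIRM variable are my own inventions for this example) that only prints the steps unless explicitly told otherwise, because the real thing reboots the box on the spot.

```shell
#!/bin/sh
# sysrq_reboot: force a reboot through the kernel's magic SysRq interface.
# Prints the steps unless CONFIRM=yes is set, because the real thing reboots
# the machine immediately. The real writes must be done as root.
sysrq_reboot() {
    if [ "${CONFIRM:-no}" != "yes" ]; then
        echo "would run: echo 1 > /proc/sys/kernel/sysrq"
        echo "would run: echo b > /proc/sysrq-trigger"
        return 0
    fi
    echo 1 > /proc/sys/kernel/sysrq    # enable all SysRq functions
    echo b > /proc/sysrq-trigger       # reboot now, without syncing or unmounting
}
```

If the kernel is still responsive enough, writing s (sync) and then u (remount read-only) to /proc/sysrq-trigger before the b is the gentler sequence.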

Generally what happens next is that the machine in question comes back online, I turn the Ceph-related services back on, the cluster recovers, and everything goes on its merry way...for the most part.  If the hypervisors have been starved for IO for several hours, I end up rebooting most of the VMs to get them moving.  Unfortunately, tonight was not going to be that awesome.

I had been in the process of migrating VMs off my old Ceph+Proxmox cluster, and on to a new Ceph+Proxmox cluster.  This had been going well, but during one particular transfer something peculiar happened... I suddenly couldn't back up VMs any longer.  I also noticed on the VM console for the VM in question several telltale kernel alerts, the usual business of "I can't read or write disk! AAAHHH!!"  I logged into one of the old Ceph boxes and sure enough, an OSD had gone down.  The OSDs on the machine in question were pretty pegged, stuck waiting for disk I/O.  But the disks?  Not doing anything interesting, ironically.  atop reported zero usage.  So, I figured a hard-reset was in order, and did my command-line ritual to force the reboot.  But it never came back...

Now, at the risk of jinxing myself (since my transfers are not yet complete), I'm going to say right now that fate was on my side.  I had transferred all but a very small handful of VMs to the new cluster, and this last set I was saving for last anyway.  But they were also important, and I decided it would be much better just to get them transferred before people started ringing my phone at 6:30 in the morning.  The only problem was how to access the images with a frozen Ceph cluster.

I'm sure a kitten somewhere will die every time someone reads this, but I reconfigured the old Ceph cluster to operate with 1 replica.  Since I wouldn't be doing much writing, and I just had to get the last few VMs off the old storage, I felt (and hoped) it would be an OK thing to do.  Needless to say, I am feverishly transferring while the remaining OSDs are yet living.

Probably the main limiting factor to the transfer rate is the NFS intermediary I have to go through, to get the VMs from one cluster to another.  But I must credit Proxmox: their backup and restore functionality has made this infinitely easier than the last time I migrated VMs.  The last time, I was transferring from a virsh/pacemaker (yes, completely command-line) configuration.  Nothing wrong with virsh or pacemaker (both of which are very powerful packages), but I have to say I'm sold on Proxmox for large-scale hypervisor management...especially for the price!

Between my two new production hypervisors, I have just under 80 VMs running.  I'd like a third hypervisor, but I'm not sure I can sell my boss on that just yet.  My new production Ceph store has about 4.5T in use, out of 12.9T of space, and I haven't installed all the hard drives yet.  When they came in, I noticed that they were all basically made in the same factory, on the same day, so I decided that we'd stagger their install so as to hopefully give us some buffer for when they start dying.

Transfer rates on the new Ceph cluster can reach up to 120MB/sec writing.  I was hoping for more, but a large part of that may be the fact that I'm using ZFS for the OSDs, and for the journals, and the journals are not on super-expensive ultra-fast DRAM-SSDs.  The journals are, for what it's worth, on SSDs, but unfortunately several of the SSDs keep removing themselves from operation.  So far I haven't lost any journals, but I'm sure it will happen sooner or later.  Sigh...

And the VM transfers are.....almost done....


20151107

High-Availability, High Performance Data Stores for Virtualization: A Recap, and an Experiment

This post recaps most of the routes I have taken along the way toward a cluster-appropriate, highly-available virtualization backing-store solution.  It will also explore my latest endeavor: Ceph with ZFS.

Goals

My initial goal had been to have a cluster-aware backing store on which to keep virtual machine images.  This store would serve multiple hypervisors.  The store needed to serve the same images to all hypervisors (to support rapid live migration), and needed to be highly-available.  It also needed to provide sufficient performance.  Throughout my experiments and production systems, I have used gigabit connections, often bonded using various kinds of Linux bonding modes.  I won't recap them all here; suffice it to say that my current standard is balance-alb.  That doesn't mean it's the best, only what I have settled on for the sake of having bonding that actually uses more than one slave interface at a time.

As my solutions evolved, so too did my goals.  Eventually I wanted to make the maximum use of my hardware.  I wanted every storage server to actually serve data, to maximize both drive I/O and network utilization.  Despite my best efforts, I have never been able to max out my network.

The Past

MooseFS

This was my first foray into a cluster file system.  It was interesting to me because it balanced blocks across all the participating nodes, and was very easy to set up and get running.  Any machine running the MooseFS client could gain access to the data, and adding additional storage was as trivial as adding more nodes or more hard drives.  It worked on block-counts: as long as a sufficient number of replicas were present in the cluster at large, the system was satisfied.

At the time, I wasn't very familiar with utilities such as Pacemaker, and so my chief complaint with MooseFS was the single "mfsmaster" node.  If this node went down, all clients hung until it was brought back up.  And clients only knew about one node.  Not sure if this was corrected in later versions, but at the time it became part of the deal-breaker.  The other part was performance: I couldn't get the latency and throughput I wanted from the cluster.

DRBD + OCFS2

The next route I took was toward OCFS2 on top of DRBD.  Eventually this would also turn into OCFS2 on top of DRBD, on top of ZFS.  Two nodes can be mirrored easily with DRBD, but to turn their mirrored data into a file system that could be read by either node at the same time required a cluster-aware file system.  Enter OCFS2.

This solution had several moving parts: the cluster file system, the mirroring, the cluster management software to keep everything running, and of course the KVM hypervisors.  Everything was manually configured and maintained.  While it ran for a very long time, it also did not perform well.  Part of that was probably the hardware: slow hard drives and too few of them, such that ZFS was configured to double up on its writes for at least a modicum of data redundancy per node.  It was also very limited in terms of growth: you can't really have more than three or maybe four DRBD nodes, and after that you are out of luck.  I want to say you could run them in active/active, but every write had to be picked up by both DRBD nodes, so your throughput was only as fast as your slowest drive (or your network link).  I dedicated links to DRBD, but it didn't matter.  The drive access was so abysmal that even on a data sync between the nodes, the ethernet bond chuckled and went back to idling.

DRBD + OCFS2 + iSCSI

The next logical step was to remove the data store from the hypervisors and have it be stand-alone.  I shared the block devices via iSCSI, and brought the hypervisors (who were the iSCSI initiators) into agreement with OCFS2.  This, sadly, was another mixed bag.  Again, everything begins at your physical storage.  If it isn't fast, nothing else is going to be fast, and every step along the way is an additional bit of latency and a loss of throughput.

One thing this did achieve was the ability to have more than two or three hypervisors.  I could theoretically have as many as my network and storage could manage.  Sadly, this setup would never see more than three.  Storage was still tenuous, though now I could swap out a storage node without interrupting the hypervisors or their VMs.

There was an elephant in the room, however, and it was OCFS2.  Long story short, OCFS2 can easily bring an entire cluster to its knees.  Fencing is absolutely required.  Without it, a single node failure and OCFS2 holds everyone up pending fencing.  This is the way it was meant to be, and unfortunately my fencing wasn't all that...

As for iSCSI, I was still stuck with one node being the active beast, while the other sat idly by replicating every write.  I did split my storage into multiple targets, so that I could get both nodes involved, but since both still had to participate in the DRBD link, it was as if they were a single machine.

What could I have done differently?  I could have thrown storage at it.  At the time, I was mainly interested in redundancy, and so performance suffered.  I later learned how to stripe redundant arrays inside ZFS and get better performance, but that requires an abundance of drives I simply didn't have.

DRBD + NFS

Seeing as how one of the chief problems with OCFS2 was how it could lock up an entire virtualization cluster with ease, I decided to investigate an alternative.  NFS had received relatively high marks for performance, and was a native cluster file system.  This, coupled with the fact that it was becoming increasingly difficult to integrate OCFS2 + Pacemaker + CMAN on my nodes with every new release of Ubuntu, made for a compelling reason to test.

The benefits: it's all server-side, so if a hypervisor goes belly-up, the share does not go down with it.  It's relatively fast and clean.  Fail-over is fairly painless.

The detriments: as with iSCSI, you can only use one node to serve your data.  But actually, it's a little worse than that.  Let's suppose you want to offer up several NFS shares, hosted perhaps on different drives.  All shares have to be shared from the same node.  The reason for this is the NFS client handles: the NFS server keeps track of them, and consequently you also have to share them between the nodes for fail-over to work (otherwise the clients end up with stale handles and the connections drop).  There does not appear to be an easy way to merge two lists of handles, and it sounds like a kludge anyway.

For a detriment, that isn't so bad.  Let's face it, if you want to offer fast cluster storage without the OCFS2 baggage, NFS is pretty much as good as it gets, so long as you are sticking with an active/passive storage node setup.  Again, extremely fast underlying storage is key to making it work.

The Present: Ceph, and ZFS

A long-time demand has been data integrity.  Ever since I had some RAID adapters that shredded my data, and a batch of hard drives that habitually silently corrupted it, I have been extremely cautious.  This has not come without a price, either: the inverse of reliability is performance and capacity.  The more reliability you have, generally the less performance and less capacity you end up with.  Think of RAID-5, RAID-6, RAID-Z3, etc.  Balancing performance, capacity and reliability is an ongoing struggle.

Ceph brings to the picture some very interesting attributes.  First, the ability to remove all single points of failure.  Multiple monitor nodes can be addressed by the hypervisors, and the OSDs are redundant by design.  The only questionable feature is the metadata server, which is not required for my environment.

When writing, Ceph writes to all the appropriate disks and waits for those writes to finish.  Not every disk in the cluster is involved with a given write.  When reading, it reads from the "primary" source for a given piece of data.  Data is replicated, and the replicas are secondary to a single primary.  Primaries are spread out across the cluster, but the point is that only one disk is going to give you that piece of data.  On the other hand, for many pieces of data, you may get them from many disks - and consequently very quickly.

Ceph is also a big "more is more" fan.  The more nodes you have, the better for resiliency.  The more drives you have, the better for performance and redundancy.  Best of all, it digs commodity hardware.  My kind of tech.

Now, I'm not here to sing praises, and its initial setup and subsequent maintenance can be onerous enough.  But let's consider a configuration and discuss options.  For starters, I wanted not just fast storage, but very fast.  My goal therefore was to let Ceph manage as much of the redundancy as possible, and have ZFS provide basic data integrity and performance.  Since ZFS performs per-block checksums, that would ensure integrity.  By striping across two or three drives, I can get double-to-triple the performance on reads and writes.  This comes at a cost: a single drive failure in that stripe takes out the whole stripe.  I need more drives, or more stripes, to spread out the risk.  I'd like to think the risk is no greater than having the same number of single drives as I would have stripes.

Each stripe is an OSD, and by keeping the number of participating drives small, I can have more OSDs and more resiliency against catastrophic data-loss.  My basic setup is three nodes, two of which have two OSDs each, and one of which has one OSD (for the moment).  Each OSD is in fact two drives striped in ZFS.  Ceph recommends running the journals to an SSD, so that's what I did.  I created a single XFS file system on part of an SSD, and configured Ceph to put all the journals there.

The zpools also benefit from some SSD-love, so I gave part of each SSD over to each zpool as a ZIL device.  I split the SSD up with LVM, incidentally, to make resizing and moving easier.  Having the SSDs do double-duty seems sub-optimal to me, but I have limited hardware at the moment.  In future incarnations I would like to have multiple SSDs, for not only the ZIL and the OSD journals, but also the Ceph monitors.

When using XFS for the journal, the requirement to disable DIO for the OSD goes away.  I remap my zpools' default mount location (zfs set mountpoint=/var/lib/ceph/osd/ceph-2 for instance) instead of creating additional ZFS devices.  The journals are kept in /var/lib/ceph/osd/journals.
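To make the per-OSD layout concrete, here is a sketch that prints the two ZFS commands involved for one OSD: a two-disk stripe with a separate log (ZIL) device, and the mountpoint remap.  It only prints them for review, it doesn't run anything.  The function name osd_pool_cmds, the pool naming scheme "osd<ID>", and every device path in the usage note are assumptions made for the example, not my actual names.

```shell
#!/bin/sh
# osd_pool_cmds ID DISK1 DISK2 LOGDEV
# Prints (does not run) the commands for one striped, ZIL-backed OSD pool.
# The pool name "osd<ID>" is an assumption made for this example.
osd_pool_cmds() {
    id=$1
    disk1=$2
    disk2=$3
    logdev=$4
    pool="osd$id"
    # Two plain vdevs make a stripe; "log" attaches the separate ZIL device.
    echo "zpool create $pool $disk1 $disk2 log $logdev"
    # Mount the pool's root dataset where Ceph expects the OSD data directory.
    echo "zfs set mountpoint=/var/lib/ceph/osd/ceph-$id $pool"
}
```

Something like osd_pool_cmds 2 /dev/sdb /dev/sdc /dev/ssdvg/zil2 would print the pair of commands for OSD 2, with the log device being one of the LVM volumes carved out of the SSD.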

Standing up the OSD is then straightforward, and can easily be done manually.  Ceph-deploy would be nice to use, but I'm still gaining experience, and have yet to really explore the power of centralizing the configuration file.  Ultimately I should be able to use it for OSD replacement, but I'm just not there yet.  Luckily, the manual process is well documented, and once you've done it a few times it ceases to seem tricky.

I ran several tests, but my metric gathering was not really all that stellar.  I won't bother reporting many numbers here, other than to say that I noticed significant performance increases from the striping.  This was most evident when I transitioned one node from being RAID-Z to being a stripe.  My read throughput can reach upwards of 100 Mbytes/sec, and the writing throughput tends to reach between 30 and 60 Mbytes/sec, depending on which OSDs are participating.  The servers have a heterogeneous collection of old SATA-2 drives, so I expect performance to improve further with a homogeneous collection of SATA-3 drives and more SSDs.

Latency varies, but usually doesn't go beyond a few hundred milliseconds.  However, I have yet to really load down the cluster with a lot of VM images.

Options and Builds

I am glad I haven't purchased production hardware yet, as I really didn't understand Ceph's needs until I dug into it.  As it stands, running a minimum of three nodes appears to be a requirement.  Four would be better, and is therefore now my target minimal deployment.

Since writing the above, I have had the opportunity to experience both intentional and natural failure scenarios, complete with rebuilds.  One of my OSDs died and I could have replaced it outright.  Instead I chose to augment the cluster with another server and two OSDs.  After watching the rebuild and observing some metrics via atop, I concluded that my single SSD was the acting bottleneck and that the Ceph journal was the most likely culprit.  While the ZIL partition also garnered significant activity, the journal partition was exceptionally busy, such that the spinners never really went above 20% utilization.  The recovery took about 6 hours, and 400G of data (1.2T of replicated data) was reallocated across the nodes once the newest node came online.  That yields an average recovery rate of about 18MB/sec, if I did the math right.
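The math is easy enough to check.  A quick sketch, assuming decimal units (1G = 1000M) and taking the 6 hours as exact; the function name recovery_rate is just for this example:

```shell
#!/bin/sh
# recovery_rate GIGABYTES HOURS
# Prints the average rate in MB/sec, using decimal units (1G = 1000M).
recovery_rate() {
    awk -v gb="$1" -v hours="$2" \
        'BEGIN { printf "%.1f\n", gb * 1000 / (hours * 3600) }'
}

recovery_rate 400 6     # logical data: prints 18.5
recovery_rate 1200 6    # counting all three replicas: prints 55.6
```

So the figure holds for the logical data.  If all 1.2T of replicas really did move, the disks and journals were collectively absorbing roughly three times that.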

As such, I am now considering the following for my full production build-out:

  • 4 servers with 24 2.5" hot-swap bays each, outfitted initially with 4 or 6 spinners each, and two SSDs per 6 spinners.
  • Each pair of spinners will be striped together and serve as a single OSD.
  • For each pair of SSDs, one will be a ZIL volume shared among the 3 OSDs, and the other a journal volume shared likewise.
    • Alternatively, a very fast spinner may serve as a better journal volume.
    • Another alternative: striped mirrors
  • The backplane, drives, and HBA will be SAS-12Gb/sec
    • Finding an affordable SAS-12 SSD is actually very challenging.

This configuration can support a maximum of 8 OSDs per server, with 8 disks dedicated to ZIL and journal.  By revising the plan to use a striped-mirror set for the journal, it might be possible to reduce the ZIL/journal to 6 disks and gain a 9th OSD per server.  Two benefits of the striped-mirror set: less chance for catastrophic ZIL or journal failure/loss, and potentially a 3x speed-up on reads and writes to the underlying medium.  The question then becomes whether or not it makes sense to use SSDs.  Given the amount of work required of the Ceph journal, I would not think it wise to reduce the physical size of the journal array below 6 disks.

Using 900G drives and considering 8 OSDs per server, this cluster can support approximately 57.6T of raw capacity (or 19.2T of data after 3x replication), and can be easily expanded with additional servers.
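Those capacity figures come straight from multiplying out the build: servers x OSDs per server x drives per OSD x drive size, then dividing by the replica count.  A small sketch of that arithmetic, with a made-up function name and decimal units (1T = 1000G):

```shell
#!/bin/sh
# cluster_capacity SERVERS OSDS_PER_SERVER DRIVES_PER_OSD DRIVE_GB REPLICAS
# Prints raw and usable capacity in T (decimal, 1T = 1000G).
cluster_capacity() {
    awk -v s="$1" -v o="$2" -v d="$3" -v g="$4" -v r="$5" 'BEGIN {
        raw = s * o * d * g / 1000
        printf "raw %.1fT usable %.1fT\n", raw, raw / r
    }'
}

cluster_capacity 4 8 2 900 3    # prints: raw 57.6T usable 19.2T
cluster_capacity 4 9 2 900 3    # prints: raw 64.8T usable 21.6T
```

The second line is the 9-OSD variant from the striped-mirror idea above.  Neither number leaves any headroom for rebalancing after a failure, so the practical ceiling is lower.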

There is still some considerable testing to be done, to tune for optimum performance.  Also, having now gone through the exercise of adding an additional node to a live and evolving cluster, I can appreciate having a centralized configuration scheme that is consistent across all nodes.  I will try to post again with the steps I took and packages I installed to add storage to my cluster.  By and large the steps for adding storage are merely the final steps of a new cluster install, with tweaks.

One of my only concerns is how ZFS will perform once the cluster data has aged a bit.  The frequency of writes seems to impact long-term performance.  I am also concerned about my choice to employ XFS on the SSDs for the current cluster nodes; this may not have been an optimal choice, though I did not yet perform any additional tuning.

All things being equal, I am very, very, very much enjoying Ceph!

20150121

NFS Stale File Handles... on a node standby migration... with multiple NFS shares

So I guess I'm doing something I shouldn't be.  That's not unusual.

I have a pair of data-store servers that are mirrored with DRBD.  Pacemaker is providing the cluster management, and for sharing up a cluster-acceptable file system I chose NFS because, well, I could.  That and OCFS2 has just been too much of a headache lately and with a move to Proxmox, I think its days in my network are numbered.  This NFS deployment was a trial, and thank goodness that it was!

I set up my Pacemaker along the lines of what you'll commonly find in most tutorials, including those from LINBIT themselves.  Great stuff!

So, I had initially configured a single NFS export, vm0, and it worked well.  It seemed to transition well from server to server, and I was happy.  Then I set up a second NFS export, vm1.  Lately I've been moving some of my inactive VM images over to it (I have a pair of Proxmox hypervisors running against this data store cluster).  I was feeling a little dangerous and decided to put one of the nodes into standby.

And disaster struck.

I won't go into the gory details, but the problem itself and its solution are worth noting here.  Firstly, the initial symptom was that the Proxmox NFS clients couldn't talk to the export that got migrated.  In this case, it was vm0 migrating from ds03 to ds04.  The reported error was that the NFS handle was stale.  That didn't make much sense, because there shouldn't have been any handles left on ds04 for the vm0 share.  However, rmtab doesn't really get cleaned up - unless you are using NFSv4 and TCP, and even then I'm only saying that because that's what preliminary research has turned up and I have yet to test it myself.

So, the rmtabs on the two data servers were out of sync with each other.  The documentation I've been able to find on setting up HA NFS only ever discusses having one share, or at least one server being the active server.  Spreading the load across multiple servers via multiple NFS exports is novel, but this is the gotcha: if you don't have the rmtab sync'd up, you will pay the price when your export moves from one data server to another.

In the case of an active/passive cluster, this shouldn't be an issue: the only rmtab you have to worry about is the one on the active node.  Make sure it's actively replicated to the passive node, and your handles will be fine.  That doesn't work with active/active: both nodes are modifying their rmtabs at the same time.

There doesn't appear to be any clean, official way to clean up rmtab.  One way could be to use sed to delete any entries that don't belong on the given server.  So, in ds03's case, it shouldn't have any entries for vm1.  That's easy enough, and we can even probably bash-script that and run it as a cron job every minute.  However, I'm not sure how that will work out if both nodes don't have equivalent rmtabs... I get the feeling, from what I saw online, that not having any appropriate entries will yield a stale handle error.  Example: we clean ds03's rmtab of all vm1 entries, then put ds04 into standby.  vm1 must now be served by ds03.  My guess is that this will result in stale handles (very bad).

As an alternative to just nuking unwanted entries, we can combine the rmtabs from the two nodes so that their respective handle records are valid between both machines.  In this way, each one is active, and each one is passive to its peer.  I am not sure yet how exactly I'd script this out, since both servers need to have the same rmtab at the end of the day, but both are modifying it.  My first instinct would be to make a cron-driven script on one of the nodes that nabs the rmtab from the other node, cleans both the local and remote rmtabs, and then combines them and distributes them out.  This would keep both nodes in sync.  This might even work well, though maybe only where the rmtab doesn't change frequently and your NFS client membership is limited to a set of servers that are supposed to be online all the time.
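Here is roughly what the two approaches look like as shell functions.  This is a minimal sketch, untested against a live NFS server.  It assumes the usual rmtab line layout of client:/export/path:refcount, the function names merge_rmtab and drop_export are invented, and the fetch and distribute steps are left out entirely.

```shell
#!/bin/sh
# merge_rmtab FILE_A FILE_B
# Prints the union of two rmtab files with exact duplicate lines removed.
merge_rmtab() {
    LC_ALL=C sort -u "$1" "$2"
}

# drop_export EXPORT_PATH
# Filters stdin, removing entries for one export (the "nuke" approach).
# Assumes "client:/export/path:refcount" lines and no colons in client names.
drop_export() {
    awk -F: -v path="$1" '$2 != path'
}
```

A cron job along these lines would copy the peer's rmtab to a temporary file, run merge_rmtab over the local and remote copies, and push the result back to both nodes.  Two caveats I can already see: entries for the same client with different refcounts won't collapse into one, and anything the NFS server writes between the copy and the push gets lost.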

Anyway, there you have it: beware the rmtab, it will be a small nightmare if you're not careful!!






20141225

Behind The Firewall: Connecting Service Despite Restrictions

Here's the problem:

You have spent a lot of money for some internet service, and being the good administrator you are you'd like to share this service among your very small collection of machines.  You also want to put a firewall in the way of unnecessary and/or unwanted network traffic, because after all, the more hoops people have to jump through to hit your systems, the better.

Now, you've got your firewall connected, and you can ping the world and even browse to places on the firewall.  But when you connect a system behind the firewall, every request out seems to die.  You've checked that the firewall was properly configured, and you had even tested it back at the office.  You can still reach the world from the firewall, and yet from your client machine you can't go anywhere.

After a Wireshark capture of the traffic, you see your client machine traffic going out, and some interesting ICMP messages that say "Time to live exceeded in transit" or "TTL expired in transit."  What's more, you notice that your firewall is sending these messages back out to the servers you're trying to talk to!  It is as though no matter what comes in, its TTL is exceeded.

And that is exactly what is happening.  Let's create this situation on purpose.  If we have a NATing firewall that is to be the only NAT device in the network, we don't want to allow any other NAT devices to deliver their traffic to their clients.  It's endpoints or nothing at all.  In iptables, the following rule should accomplish this:

iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-set 1

(Refer to this page for more information: http://www.linuxtopia.org/Linux_Firewall_iptables/x4799.html)

The above command should mangle the TTL such that there is only one more hop it can go before it runs out of life.  I haven't tested that - if it doesn't work, try 2 instead of 1.  I actually worked that command out backwards, from the fix for the very problem it causes.  Basically, if you're experiencing an issue with the TTL expiring too soon, just rewrite it!

iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-set 255

Now we can do another 255 hops before we run out of TTL.  As this is done in the prerouting, the packet is fixed before any checks are made as to whether or not it should be dropped for TTL expiration.  A similar fix can be done with pfSense (search in google for the appropriate terms and make sure to tread carefully when mucking around with the filter code - note that there IS NO GUI OPTION FOR CHANGING THE TTL IN PFSENSE!!).  An alternative to the above is to increment the TTL by 1 or more.
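To see why rewriting in PREROUTING works, here is a toy model of a single forwarding hop.  It is not iptables, just the arithmetic: a router drops a packet that arrives with a TTL of 1 or less and otherwise forwards it with the TTL reduced by one, and a mangle rule gets to change the TTL before that check happens.  The function name forward_ttl and the "set:N"/"inc:N" notation are made up for this sketch.

```shell
#!/bin/sh
# forward_ttl TTL [RULE]
# Models one forwarding hop. RULE is applied first, the way a mangle
# PREROUTING rule would be: "set:N" overwrites the TTL, "inc:N" adds to it.
# Prints the TTL the packet leaves with, or "time-exceeded" if it is dropped.
forward_ttl() {
    ttl=$1
    rule=${2:-none}
    case $rule in
        set:*) ttl=${rule#set:} ;;
        inc:*) ttl=$((ttl + ${rule#inc:})) ;;
    esac
    if [ "$ttl" -gt 255 ]; then
        ttl=255                      # TTL is a single byte
    fi
    if [ "$ttl" -le 1 ]; then
        echo "time-exceeded"
    else
        echo $((ttl - 1))
    fi
}

forward_ttl 1            # prints time-exceeded
forward_ttl 1 set:255    # prints 254
forward_ttl 1 inc:1      # prints 1
```

The increment variant corresponds to the TTL target's --ttl-inc option in place of --ttl-set.  Incrementing by exactly 1 gets the packet through this one hop only, which is enough if the client sits directly behind the firewall.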

If you have a pfSense VM running on a Linux hypervisor, you can make the fix right in Linux (for the bridge adapter, of course).  iptables in this way really saves the day.  The fix in pfSense is not terrible, but it is also not convenient and is not officially supported by their community.

Now that I think about it, this would be a great way to stop people from using unauthorized wireless access points around the office....

20141216

Never run updates from recovery...

Concerning Ubuntu Server 14.04 (and possibly others)....

It's probably not something you should do.  I don't know what Ubuntu's official stance would be, but my experience has been that it's not the best thing to run apt-get dist-upgrade from a recovery kernel.  There are a few things missing, even if you mount the root partition:

  • /dev, /proc, /sys and the gang are all there, but only sort of.  And /var/run?  Well, not really this go-around, if you're using 14.04 LTS.
  • Upstart didn't run from this root, so there is no Upstart.  Anything that needs to deal with Upstart (like, evidently, libpam-systemd) will fail.
  • If something is broken, it might get more broken if you run updates in recovery.

Still, it may become necessary.  There might even be some failures that look semi-legit but are certainly not going to help you back to a working system, because said failures are standing in the way of completing your updates.  One such issue is when one of the update scripts for one of the libraries being updated decides - suddenly - that it really needs to run a script out of /etc/init.d/.  Let's suppose, whatever the reason, no such script exists at the time of this update.  Apt fails, reporting that one of its packages wasn't able to complete all its steps.

One possible solution to this is to trick Apt into thinking there really IS a script available.  In my case, I simply:
  1. touch /etc/init.d/systemd-logind
  2. chmod +x /etc/init.d/systemd-logind
Voilà!  Apt was able to finish processing the updates and now I might have a working system again.  I stress might because it shouldn't have broken in the first place and I have no earthly idea why it did break.  But, obviously, missing or unconfigured packages might be in the recipe.

The important thing, when you perform something abominable like what I mentioned above, is that you remove it before you reboot.  If the service does depend on Upstart, having the fake init.d script available may cause much pain.
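If I had to do it again, I'd script both halves so the cleanup doesn't get forgotten.  This is only a sketch: the function names and the INIT_DIR override are my own invention, there so you can try it against a scratch directory before pointing it at the real /etc/init.d.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers. INIT_DIR defaults to /etc/init.d but
# can be overridden for a dry run in a scratch directory.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

make_placeholder() {
    # Create an empty, executable stand-in for the missing init script.
    touch "$INIT_DIR/$1" && chmod +x "$INIT_DIR/$1"
}

remove_placeholder() {
    # Delete the file only if it is still empty, so a real script that a
    # package installed in the meantime is left alone.
    if [ -f "$INIT_DIR/$1" ] && [ ! -s "$INIT_DIR/$1" ]; then
        rm -- "$INIT_DIR/$1"
    fi
}
```

Usage would be make_placeholder systemd-logind before the update and remove_placeholder systemd-logind before the reboot.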

And that's that.

20140809

Getting the LeapReader to work in Linux...sort of

I'm going to document briefly what I did, and what the problems appear to be.

TL;DR:  Remove usb_storage, blacklist it, reboot, remove it again if necessary, use VirtualBox and connect the device to a Windows VM.  Profit.

Firstly, LeapReader is a Windows-compatible thing, along with its godforsaken software.  It's NOT Linux-compatible, in any sense of the word, and let's just assume it will never be.  The developers Just Don't F-ing Care.

Not a problem, though, if you have VirtualBox to install a Windows virtual machine.  You might even be able to find some VM images from Microsoft themselves - I seem to remember them publishing images for the sake of testing software against older versions.  In my case, I have a (legit) licensed copy, because unfortunately LeapFrog is not the only company that hates developing for Linux.

On the metal I'm running Linux Mint 16.  It's stocked with 3.11 for a kernel and version 204 of UDEV.  That last number will become relevant shortly.  Now, you can get LeapReader talking to the Windows VM, and VirtualBox makes that easy.  The hard part is getting LeapFrog Connect to talk to it.  And, in all honesty, I am forced to conclude it's not Connect's fault.

The problems start when UDEV and the kernel decide to try to investigate the device as a USB storage device.  Bad News.  My educated guess is that the kernel's attempts (and verbose failures, from the looks of dmesg) drive the LeapReader into a sort of temporary insanity...after which it won't talk successfully to the LeapFrog Connect program.

I tried, and tried, and tried in vain to get UDEV to stop messing around with the device.  BUT, someone somewhere removed pretty much all the commonly-known features for the commonly-published solutions.  In other words, ignore_device doesn't work, and last_rule is worthless.  At least, so far as my testing has gone.  Someone else might have better luck, and frankly I'm out of time to mess with it.

What I DID get working was to remove the usb_storage module.  I also had to blacklist it.  Strangely, Mint will load this module on boot even though it's blacklisted in modprobe.d.  At least doing an rmmod on the command line will take care of that for the current session.

It is vitally important that the device be disconnected when you remove the usb_storage module.  If usb_storage twiddles with the LeapReader, game over.  Once you are able to connect LeapReader and see nothing but basic USB interactions (no drive-style access) in the kern.log, you're good to connect the device through VirtualBox to Windows.
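For the record, here's a sketch of the module juggling.  The helper names, the MODULES_FILE override and the blacklist file name are all my own choices, not anything Mint ships.  The update-initramfs step is an untested guess at why the module still loads at boot (a copy of it may be living in the initramfs).

```shell
#!/bin/sh
# Sketch only: hypothetical helpers. MODULES_FILE can be pointed at a copy
# of /proc/modules for a dry run.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

module_loaded() {
    # /proc/modules lists one loaded module per line, name first.
    grep -q "^$1 " "$MODULES_FILE"
}

blacklist_lines() {
    # "blacklist" only stops automatic loading by alias; the "install" line
    # also makes an explicit modprobe of the module fail.
    printf 'blacklist %s\ninstall %s /bin/false\n' "$1" "$1"
}

# As root, with the LeapReader unplugged:
#   blacklist_lines usb_storage > /etc/modprobe.d/blacklist-usb-storage.conf
#   update-initramfs -u    # assumption: the initramfs is what loads it at boot
#   module_loaded usb_storage && rmmod usb_storage
```

Keep in mind that with usb_storage gone, ordinary USB sticks and drives won't mount either until you undo this.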

In the future I would very much like to make this work correctly.  Ideally, UDEV should be instructed not to fiddle with the device, and/or to represent it as anything BUT a USB storage device.  If I could even achieve the latter, I would be happy.  Unfortunately I don't know enough about the relevant subsystems to make it happen.

But, you too can now make the LeapReader work.  All you need is Windows and a whole lot of patience...and possibly an indestructible keyboard.