20161122

pfSense Minimum TTL Adjustment / Fix

For dealing with situations where your pfSense firewall receives packets with a TTL of 1, you can mangle the TTL in the filter-generation code on the firewall itself.  The key file, for version 2.2.5 and similar, is /etc/inc/filter.inc:

filter.inc:            $scrubrules .= "scrub on \${$scrubcfg['descr']} all min-ttl 255 {$scrubnodf} {$scrubrnid} {$mssclamp} fragment reassemble\n"; // reassemble all directions

The addition is the "min-ttl 255" portion (originally highlighted in bold, which doesn't survive here).  This is in the function filter_generate_scrubing().
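With that change in place, the generated ruleset (you can inspect it in /tmp/rules.debug on the firewall) should end up with a scrub line along these lines; the interface macro here is illustrative, not the exact output:

```
scrub on $WAN all min-ttl 255 fragment reassemble
```

pf's min-ttl option raises the TTL of matching packets to at least the given value, so packets arriving with TTL 1 no longer expire at, or just past, the firewall.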

I don't know if this works in newer versions of pfSense.  Tread with care!

Source: https://forum.pfsense.org/index.php?topic=27206.0

20161003

Bind9 and the Mysterious PTR resolution failure

I run a sort-of cagey DNS setup.  I have a hidden primary that isn't exactly hidden, but is certainly not well-advertised.  Couple this with the fact that I run both internal and external views (so that my internal clients can resolve the same names the external clients resolve, but end up with IPs that are actually valid and usable), and the glorious RFC-2317 (which is really damn cool, if you think about it), and there is a recipe for a wonderfully ornery problem.

So, here's the deal.  I contact my ISP and say I want to get PTR records.  They say it's money or they can delegate.  I think, "Cool!  I get my first PTR delegation!"  But not so fast, because I don't have a /24, and anything smaller than a /24 is either managed by them, or can be RFC-2317'd.  The RFC basically recommends how to set up CNAMEs for all the in-addr.arpa records, and point them to a customer's DNS.  In this case, they're pointing to my DNS server, which isn't great, but is at least a start.
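As a sketch of what that looks like on each side (the 192.0.2.0/26 block, the zone-label style, and the host names below are all illustrative, in the spirit of RFC 2317's own examples):

```
; ISP side, in the 2.0.192.in-addr.arpa zone: delegate a sub-zone
; to the customer and CNAME each address into it
0/26     NS      ns1.example.net.
1        CNAME   1.0/26.2.0.192.in-addr.arpa.
2        CNAME   2.0/26.2.0.192.in-addr.arpa.
; ...one CNAME per address in the block

; customer side, in the 0/26.2.0.192.in-addr.arpa zone: plain PTRs
1        PTR     mail.example.net.
2        PTR     www.example.net.
```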

Once my DNS server and the ISP's DNS were prepped, everything seemed fine...from the outside.  From the inside, no results.  Why?  Well, it turns out the dual-view I have in place was doing its job.  When my internal client did an nslookup against the IP, the ISP returned (as it should) the CNAME that pointed to my domain.  nslookup did what any good resolver does, and recursed.  I had assumed that the DNS server would recurse for me, which was totally wrong.  I know, dumbass.

The recursion led my local resolver (nslookup) to seek out my DNS server to resolve its own domain.  The DNS server said, "Sure, I actually handle that domain - OH, you're an internal view client, and I have no DNS names that match that query in my internal view."  And then failure.

So, the short lesson here is: when you run internal and external views, make sure they match, especially where the PTR records are concerned!!
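In named.conf terms, that boils down to carrying the RFC-2317 zone in both views, even if the zone files differ (the zone label, networks, and file names below are illustrative):

```
view "internal" {
        match-clients { 10.0.0.0/8; localhost; };
        zone "0/26.2.0.192.in-addr.arpa" {
                type master;
                file "db.0-26.2.0.192.internal";
        };
};

view "external" {
        match-clients { any; };
        zone "0/26.2.0.192.in-addr.arpa" {
                type master;
                file "db.0-26.2.0.192.external";
        };
};
```

The point is simply that a query landing in the internal view must find an answer there; data in the external view is invisible to internal clients.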

20160808

How to Find an EXT4 Partition...

So you have a partition table that's a little unorthodox, and maybe is a mix between some NTFS partitions and an extended partition with Linux parts in it.  And one day you get the bright idea to nuke that extended partition because it's a user workstation and they only need Windows.  (And yes, in your heart of hearts you want nothing more than for them to be in LinuxLand, but alas too many software vendors have still not joined the ranks...)

Then the shit hits the fan, and you realize you need that partition back.

And you have no idea where the hell it was, other than that it was on the latter half of the disk.

This happened to me, and I fiddled around until I found it, roughly.  This isn't a perfect guide, but might help someone on their way down a similar adventure.  Let's do it constructively.  First, we'll create an image to test with:

dd if=/dev/zero of=test.img bs=1M count=500
There we now have a 500 meg image file to play with.  We can fdisk it directly if we want:
fdisk test.img
I created two partitions, just for kicks, but one would be enough.  By default, fdisk creates the first partition 1M into the disk, leaving plenty of space for boot loaders and the like.  In my case, the first partition started at sector 2048 and spanned 204800 blocks.

Now, to create the file system so that we have something to find, you need to be a little careful with losetup.  Simply attaching the whole file with losetup won't give you the partitions, or at least it didn't for me.  So I did this (using the values fdisk reported for the first partition, meaning your precise command line may vary):
losetup -o $[ 2048 * 512 ] --sizelimit $[ 204800 * 1024 ] -f test.img

Note that the sector locations (start and end) are in 512-byte sectors, whereas the Blocks are listed in Kbytes.  Thus the 1024 multiplier for the size limit.  Subtracting the start sector from the end sector, adding one (fdisk's end sector is inclusive), and multiplying by 512 should give the same answer.
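To make that concrete, here are both routes to the byte count; the end-sector value is inferred from the start sector and block count, and may of course differ on your disk:

```shell
# Both routes to the partition's byte size should agree.
start=2048        # start sector, as reported by fdisk
end=411647        # end sector -- inferred here from start + size
blocks=204800     # size in 1K blocks, as fdisk lists it
echo $(( (end - start + 1) * 512 ))   # sectors * 512
echo $(( blocks * 1024 ))             # blocks * 1024 -- same answer
```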

Create the file system:
mkfs.ext4 /dev/loop0
(Or whatever device was assigned, remember to use losetup -a when in doubt.)

Let's detach the device and re-attach the whole file now:
losetup -d /dev/loop0
losetup -f test.img
We now have a device, with a mystery partition on it.  Actually, since the partition table is intact, we could easily access it...so let's really screw the goose:
dd if=/dev/zero of=test.img bs=512 count=2
That takes care of two sectors, one of which was holding the original partition information.  We can now begin an attempt to mount:
for X in {0..2049}; do echo $X; mount -oloop,ro,offset=$[ 512 * $X ] \
   /dev/loop0 /mnt 2>/dev/null && break; done
Now, in this example we have a very good idea where to look.  In the real case where this effort was required, I had to recreate a primary partition to take the place of the extended partition (since I couldn't scan through the extended partition directly), and hope to run across it within the first few megs.  To suppress the endless output of errors, and to give myself some feedback on how far into the disk I was, I added the echo and the 2>/dev/null pieces.  I am mounting as a loop device, since loop supports the offset option, and am mounting read-only.  This way nothing bad should happen even if we run across other data that suggests another file system where it shouldn't be.  The loop breaks as soon as a successful mount occurs.

The loop breaks after spitting out sector # 2048, as it should.  The mount succeeds, and issuing a mount command reports that the device is in fact mounted and available.  Looking in /mnt shows us that there is a file system there.  If we want, we can remount with the correct offset, in read-write, and dump data into the test image.

If you are worried about mounting the wrong kind of file system (maybe it was an old, dirty disk that used to house other things, like an NTFS file system that never got zeroed out), you can limit mount's scope by just adding -t ext4, or even specifying multiples like -t ext2,ext3,ext4 ...


If you have the rough sector location, you don't necessarily need to create a partition; you could just scan through the raw device (like I'm doing with the loop0 device above).  By changing the start and end sector numbers in the for loop, you can look on any location of the disk.

So, why go through all this trouble?  Well, unlike NTFS, EXT gives you almost nothing to eyeball.  You can tell if you're looking at NTFS because the first sector has a huge NTFS label in the byte data, and xxd will show you this.  EXT has no such banner; its only signature is a two-byte magic number, 0xEF53, tucked 56 bytes into the superblock.

The least you can definitively know about it is that its first important data is 1024 bytes into the partition, and that data will be the first superblock with all its superblock-ness.  Apart from that little magic, it's a cryptic mash of bits, and does not suggest anything remotely about it being a file system of its own.

The most you might be fortunate enough to locate is a volume label, in that same superblock, if the volume was even equipped with one.  But luckily, thanks to this tutorial, you need neither.  You can scan away and find each and every buried and fossilized file system that mount can mount, no partition table required.
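That said, the two-byte magic (0xEF53, little-endian, 56 bytes into the superblock, so 1080 bytes past the start of the partition) is just barely enough to scan for by hand.  Here is a rough sketch of such a scan: it builds a scratch image and plants a fake magic at sector 2048 so there is something to find.  On a real disk you'd point it at the device instead, and expect extra hits from ext's backup superblocks and from random data that happens to contain those two bytes:

```shell
# A scratch demonstration: plant an ext magic and find it again.
img=scan-test.img
dd if=/dev/zero of="$img" bs=512 count=4096 2>/dev/null
# Plant the magic (0xEF53, little-endian on disk) where a partition
# starting at sector 2048 would keep its first superblock:
# 1024 + 56 bytes past the partition start.
printf '\x53\xef' | dd of="$img" bs=1 seek=$(( 2048 * 512 + 1080 )) conv=notrunc 2>/dev/null
# One pass with grep: report every match that sits exactly 1080
# bytes past a 512-byte sector boundary.
grep -F -abo $'\x53\xef' "$img" | while IFS=: read -r off rest; do
    if [ "$off" -ge 1080 ] && [ $(( (off - 1080) % 512 )) -eq 0 ]; then
        echo "possible ext superblock at sector $(( (off - 1080) / 512 ))"
    fi
done
```

Anything the scan reports is only a candidate; the mount loop above remains the real test.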

So there ya go!  Now when you need to "Find ext4 partition" that you lost, and you are tired of all the search results telling you how to find out sizes, and inode counts, and tuning options, and you just want the damned file system, now you know.

20160321

Port Isolation, VLANs, and Ceph: Lessons Learned

While configuring a new production Ceph cluster, I ran into problems that could only be described as asinine.  We built all the hardware, installed all the operating systems, deployed the latest stable Ceph release, and angels sang out from above....

....and then the cluster said, "Hm...I can't find some of my OSDs....degrade degrade degrade OH WAIT!  There they are..."  And a little later it said, again, "Hm...I can't find some of my OSDs...degrade degrade..."  just like before.  And it kept doing it.

I hit Google with search after search, and of course the Ceph documentation said something to the effect of: "If you have problems, make sure your networking is functioning properly."  So I tried to validate it.  I tested it.  I tested everything I could think of.  I started even disabling ports on the switches, in an attempt to isolate which host was causing the issues.

But as I did this, I noticed something strange.  Well, for starters, I configured all my machines with quad-NIC cards, and split the NICs so that two would serve client traffic, and two would serve cluster traffic.  I had also set the bonding to balance-alb.  I have two gigabit switches, so as to remove single points of failure.  And when I disabled some of the ports on one of the switches, the problems went away.
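For reference, the cluster-side bond on each node looked something like the following Debian-style snippet; the NIC names and addressing are illustrative, not my actual values:

```
auto bond1
iface bond1 inet static
        address 10.10.20.11
        netmask 255.255.255.0
        bond-slaves eth2 eth3
        bond-mode balance-alb
        bond-miimon 100
```

balance-alb does receive and transmit load balancing without any switch-side configuration, which made the switches themselves an easy suspect to overlook.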

I tested and retested, and couldn't find a reason why this should be.  I tried other ports on the switch, other ports on the servers, all with similar results.  Finally I started up CSSH and began running ping tests between the machines.  I then started shutting down ports in a divide-and-conquer search for the truth.  Eventually, I found that two of the ports (which were for the two servers that seemed to be having most of the problems) on one of the switches were acting very peculiar.  I verified all their settings again and again, and still the same results.  Finally I started going through every single fucking category of settings on my network switches, until I came to "Port Isolation Group".  Inside that group, the ports in question were in fact being isolated from the rest of the switch.  I realized I had done this a very long time ago to keep our wifi traffic separate from our LAN traffic.  Turning off port isolation fixed the problem.

And my head slammed the table.

But the fun wasn't over!  In the fight to determine why I had such strange problems, I decided a switch-reboot would be a fun thing to do.  Having two nearly-identically utilized switches meant one could go offline while the other stayed online.  Or so I thought.

I had been working over the past couple of years to make the best use of VLANs.  I hate VLANs, by the way, from a security point of view.  Anyway, circumstances being what they were, I had to use them.  And use them I did!  Unfortunately, I was also trying to be very secure and not allow out-of-scope traffic to hit the other switches.  Add the fact that I have several redundant links between the switches, and that I rely on MSTP to do The Right Thing (tm), and you have a recipe for additional annoying headaches.

After yet more reading and analysis, I determined that MSTP had configured the spanning-tree to put the root somewhere other than my two core switches.  Now, my two core switches are joined with a 6-port aggregation between them.  I figure that's plenty of bandwidth, but for the fact that the switches wouldn't use it.  And since the cluster and client traffic was only permitted to go over those ports, this became a very big problem.  Manually plotting out the spanning tree allowed me to understand this, and moreover gave me at least an interim answer.  I reconfigured the two core switches to be the preferred roots, and from there all other links fell into place.  To make sure of this, I also configured the non-preferred links between those switches and the others in the network to be of greater cost.
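Forcing the root onto the core switches is usually a one-liner per switch.  The syntax below is Cisco-style and purely illustrative, since every vendor spells this differently; lower priority wins the root election:

```
! on core switch 1 (preferred root)
spanning-tree mst 0 priority 4096
! on core switch 2 (backup root)
spanning-tree mst 0 priority 8192
! on a non-preferred inter-switch link, make it look expensive
interface GigabitEthernet0/24
 spanning-tree mst 0 cost 200000
```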

To be certain that a switch failure still did not nuke the network, I ended up configuring the trunks to allow all valid VLANs.  I may eventually pull this back, once I get a better handle on MSTP.  Ideally, MSTP should figure out how to use the appropriate links, but unfortunately I have more to learn there.

So lesson learned: make sure your networking isn't fucked up.

Don't Run Ceph+ZFS on Shitty Hardware

Isn't it just the way it goes, when things go to hell at 2 in the morning?

And that's what's happened.  But it wasn't the first time.

I have (or soon to be: "had") an experimental Ceph cluster running on some spare hardware.  As is proper form for experimental things, it quickly became production when the need to free up some other production hardware arrived.  Some notes about this wonder-cluster:

  • It runs a fairly recent, but not the latest, Ceph.
  • It started with two nodes, and grew to four, then shrunk to three.
  • It has disks of many sizes.
  • It uses ZFS.
  • The hardware is workstation-class.
  • The drives are old and several have died.
Sounds like production-quality to me!  Ha...  but, that was my choice, and I'm now reaping what I have sown.

Long story short, some of the OSDs occasionally get all bound up in some part of ZFS, as near as I can tell.  It could be that the drives are not responding fast enough, or that there's a race condition in ZFS that these systems are hitting with unfortunate frequency.  Whatever the reason, what ends up happening is that the kernel log gets loaded up with task-waiting notifications, and since there are only 5 out of 8 OSDs still living, the cluster instantly freezes operation due to insufficient replication.  Note that the data, at least, is still safe.

Typically I've had to hard-reboot machines when this happens.  My last attempt - this very evening - took place from my home office by way of command-line SYSRQ rebooting (thanks, Major Hayden!  I love your page on that topic!).  Unfortunately, graceful shutdowns don't tend to work when the kernel gets in whatever state I find it at times like these.  One morning, I had to have my tech hard-cycle a machine that was even inaccessible via SSH.
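For anyone who hasn't seen the trick, the command-line ritual is the magic SysRq interface.  This is the usual sequence, assuming the kernel has sysrq support; note that the last line reboots the box immediately, with no shutdown niceties whatsoever:

```shell
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
echo s > /proc/sysrq-trigger       # s: sync all file systems
echo u > /proc/sysrq-trigger       # u: remount file systems read-only
echo b > /proc/sysrq-trigger       # b: reboot immediately
```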

Generally what happens next is that the machine in question comes back online, I turn the Ceph-related services back on, the cluster recovers, and everything goes on its merry way...for the most part.  If the hypervisors have been starved for IO for several hours, I end up rebooting most of the VMs to get them moving.  Unfortunately, tonight was not going to be that awesome.

I had been in the process of migrating VMs off my old Ceph+Proxmox cluster, and on to a new Ceph+Proxmox cluster.  This had been going well, but during one particular transfer something peculiar happened... I suddenly couldn't back up VMs any longer.  I also noticed on the VM console for the VM in question several telltale kernel alerts, the usual business of "I can't read or write disk! AAAHHH!!"  I logged into one of the old Ceph boxes and sure enough, an OSD had gone down.  The OSDs on the machine in question were pretty pegged, stuck waiting for disk I/O.  But the disks?  Not doing anything interesting, ironically.  atop reported zero usage.  So, I figured a hard-reset was in order, and did my command-line ritual to force the reboot.  But it never came back...

Now, at the risk of jinxing myself (since my transfers are not yet complete), I'm going to say right now that fate was on my side.  I had transferred all but a very small handful of VMs to the new cluster, and this last set I was saving for last anyway.  But they were also important, and I decided it would be much better just to get them transferred before people started ringing my phone at 6:30 in the morning.  The only problem was how to access the images with a frozen Ceph cluster.

I'm sure a kitten somewhere will die every time someone reads this, but I reconfigured the old Ceph cluster to operate with 1 replica.  Since I wouldn't be doing much writing, and I just had to get the last few VMs off the old storage, I felt (and hoped) it would be an OK thing to do.  Needless to say, I am feverishly transferring while the remaining OSDs are yet living.
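For the record, the kitten-killing knob itself is just a pair of pool settings (the pool name here is illustrative); min_size is the one that actually un-freezes I/O, since Ceph halts writes to any placement group that falls below it:

```
ceph osd pool set rbd size 1
ceph osd pool set rbd min_size 1
```

Needless to say, at size 1 a single additional drive failure means real data loss, so this is strictly a get-the-data-off maneuver.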

Probably the main limiting factor to the transfer rate is the NFS intermediary I have to go through, to get the VMs from one cluster to another.  But I must credit Proxmox: their backup and restore functionality has made this infinitely easier than the last time I migrated VMs.  The last time, I was transferring from a virsh/pacemaker (yes, completely command-line) configuration.  Nothing wrong with virsh or pacemaker (both of which are very powerful packages), but I have to say I'm sold on Proxmox for large-scale hypervisor management...especially for the price!

Between my two new production hypervisors, I have just under 80 VMs running.  I'd like a third hypervisor, but I'm not sure I can sell my boss on that just yet.  My new production Ceph store has about 4.5T in use, out of 12.9T of space, and I haven't installed all the hard drives yet.  When they came in, I noticed that they were all basically made from the same factory, on the same day, so I decided that we'd stagger their install so as to give us hopefully some buffer for when they start dying.

Transfer rates on the new Ceph cluster can reach up to 120MB/sec writing.  I was hoping for more, but a large part of that may be the fact that I'm using ZFS for the OSDs, and for the journals, and the journals are not on super-expensive ultra-fast DRAM-SSDs.  The journals are, for what it's worth, on SSDs, but unfortunately several of the SSDs keep removing themselves from operation.  So far I haven't lost any journals, but I'm sure it will happen sooner or later.  Sigh...

And the VM transfers are.....almost done....