I have a multi-terabyte, multi-server Ceph store that deadlocks itself every so often, always during high-write activity. It has been harrowing. I have added RAM, removed slow drives from the cluster, and even disabled one of the two machines. That last configuration ran the longest without deadlocking, but offered the least redundancy and, of course, the least space. I had to re-add the first (primary) node once the extra space became a necessity, and with the primary back in, the deadlocks returned.
It should be noted that I have CephFS mounted on the primary Ceph node. Nothing but Ceph itself is running on the secondary node. I am writing huge amounts of data onto the primary. I have another 4-server Ceph cluster that has never experienced any deadlocks, but also does NOT locally mount any Ceph device (CephFS or RBD).
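For completeness, the CephFS mount on the primary is just the standard kernel-client mount, roughly like this (the monitor address, mount point and secret path here are placeholders, not my actual values):

    # Kernel CephFS mount on the primary node (addresses/paths are placeholders)
    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret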
The primary server had never had issues like this before; it originally ran a pair of ZFS RAID-Z arrays, on the very same drives no less. Whatever the issue is, it is likely related to Ceph (though not necessarily Ceph's fault). I thought the problem might be CephFS, since it is not always considered production-ready, so I created and mapped a native RBD block device and formatted it with XFS, yet the deadlocks remained. I saw some interesting hung kernel tasks, and searching on them brought me to where I am now.
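For the record, the RBD test looked roughly like this (the pool, image name and size are just examples, and the device node depends on what is already mapped):

    # Create, map and format a native RBD image (names/sizes are examples only)
    rbd create test-img --size 102400 --pool rbd   # size is in MB, so ~100 GiB
    rbd map test-img --pool rbd                    # typically appears as /dev/rbd0
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/rbdtest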
This tracker issue describes the deadlock and is marked Won't Fix, since the problem is believed to lie with the kernel (though not necessarily be a kernel bug): http://tracker.ceph.com/issues/3076
A very lengthy thread (https://lkml.org/lkml/2004/7/26/68) discusses how a deadlock between two or more processes can and does occur when file systems are layered on top of one another through the VFS. In my case, CephFS sits on top of Ceph, which sits on top of local ZFS drives. The gist is that a not-quite-OOM condition can arise in which memory needs to be allocated in order for more memory to be freed, specifically to flush data to disk, but no memory is available to do it. When this happens between two processes, they deadlock without an explicit OOM, and to be fair, my servers never appear to run out of memory. On the contrary, they remain almost entirely usable, except for their file systems. Issuing a sync (s) to sysrq-trigger before a reboot (b) tends to preserve the logs, which are not on Ceph but are otherwise lost without the sync.
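Concretely, the sequence I use when it locks up is just (as root):

    # Emergency sync, then immediate reboot, via the magic SysRq trigger
    echo s > /proc/sysrq-trigger   # flush dirty data to disk where possible
    echo b > /proc/sysrq-trigger   # reboot immediately, without clean unmounts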
I tried experimenting with limiting the number of dirty pages in the system, wondering if that was part of the issue; the kernel reference page is https://www.kernel.org/doc/Documentation/sysctl/vm.txt. It didn't help, but what does appear to help is echoing a "3" into drop_caches, and right now I am not sure why.

Basically, this is what happens: the memory stats do not indicate OOM and the buffers/caches never run dry, yet several OSDs on the primary node stop responding even though their processes are not dead. They all appear stuck waiting for an opportunity to write, and they consume massive amounts of CPU. Ceph eventually marks the OSDs DOWN, but I have noout set so they are not marked out. This state persists forever unless I hard-reboot... or, as I recently discovered, echo a 3 into drop_caches. Doing the latter clears the allocated buffers/caches, freeing a huge chunk of memory; the OSDs almost instantly begin responding again, recovery kicks off immediately, and within a couple of minutes the cluster is functioning normally.
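The exact dirty-page values I tried varied; it was along these lines (the numbers are examples, not recommendations), followed by the drop_caches workaround that actually gets things moving again:

    # Dirty-page tuning I experimented with (example values only - it did not help)
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10

    # noout is set so the DOWN OSDs are not marked out during the hang
    ceph osd set noout

    # The workaround that unsticks the OSDs:
    sync
    echo 3 > /proc/sys/vm/drop_caches   # drop pagecache, dentries and inodes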
I suspect there is something amiss in the kernel's page cache / writeback handling.
I have put a cron job in place to drop the caches every 2 hours, just to see how well that holds up. I have done it manually a couple of times now with good results, but this is by no means a proper fix.
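The cron entry itself is nothing fancy, roughly this (the file path is just where I happened to put it):

    # /etc/cron.d/drop-caches -- temporary workaround, not a fix
    0 */2 * * * root sync && echo 3 > /proc/sys/vm/drop_caches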