Reducing ceph filesystem overhead on virtual systems
We're running several virtualized ceph clusters for internal and customer use. This is quite useful if you're providing container workloads and need shared storage that is available on all nodes, resilient and reasonably fast. So yeah - ceph is a good choice here.
Compared to alternatives like glusterfs, ceph is a more complex piece of software and sometimes requires you to adjust its parameters. This post covers one of those situations.
Environment
We're using cephfs to provide a distributed filesystem that is accessible on each of the worker nodes. The underlying virtualization stack is based on Hyper-V. The filesystems hold all sorts of data, mostly smaller files (configuration files, web pages and content), and we saw a huge waste of disk space that we couldn't explain for quite a while. At first we were fixated on the idea that cephfs doesn't free allocations correctly and that the disks would simply fill up over time.
One issue with virtualized environments in this scenario is: the operating system (debian / ubuntu) isn't aware of the actual hardware and declares the virtual disks as rotational (HDD) - which leads bluestore to treat them as HDDs as well.
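You can check how the kernel classifies a disk by looking at its rotational flag (the device names here are just examples):

```shell
# 1 = rotational (HDD), 0 = non-rotational (SSD/NVMe)
cat /sys/block/sda/queue/rotational

# or for all block devices at once
lsblk -d -o NAME,ROTA
```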
This is why we manually set the crush device class on each osd after creation - so the disks are at least classified correctly. The device class just doesn't change the way bluestore treats the disks :-) And if bluestore thinks it's dealing with an HDD, it avoids allocating many small objects and focuses on larger allocations instead. That's a sensible decision on rotational media, because you want to avoid disk seeks wherever you can.
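For reference, reassigning the device class looks roughly like this (osd.0 and the class name are placeholders; an auto-detected class has to be removed before a new one can be set):

```shell
# drop the auto-detected class, then set the one matching the real hardware
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class ssd osd.0
```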
Allocation
So, bluestore assumes that our disks are rotational and treats them as such. And this is where our wasted allocation comes from: bluestore has a minimum allocation size per object, which for HDD media defaults to 64K.
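You can look up the values your cluster is currently using, assuming a release recent enough to have the centralized `ceph config` interface:

```shell
# minimum allocation size used for OSDs on rotational media
ceph config get osd bluestore_min_alloc_size_hdd

# minimum allocation size used for OSDs on flash
ceph config get osd bluestore_min_alloc_size_ssd
```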
Cephfs allocates at least one object per file. With a filesystem full of many small files (more than 50% of the files <= 8K in our case), you can quickly estimate that this adds up to a huge overhead.
The idea of allocating larger chunks works fine for rados block devices (RBD), where there's no direct relation between files and objects. On cephfs it adds up quickly and eats your disk space.
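As a rough illustration with made-up numbers: one million files of 4K each hold about 4 GB of data, but with a 64K minimum allocation they occupy around 64 GB on disk - and that's before replication multiplies it further.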
Solution
To adjust this behavior, you can set the minimum allocation size in the configuration and reduce the unused allocation considerably.
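A minimal sketch of how this can be done via the centralized configuration (values are in bytes, and the setting only applies to OSDs created afterwards):

```shell
# force a 4K minimum allocation, regardless of the detected media type
ceph config set osd bluestore_min_alloc_size_hdd 4096
ceph config set osd bluestore_min_alloc_size_ssd 4096
```

Alternatively, the same options can be placed in the `[osd]` section of `ceph.conf` before the OSDs are created.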
Going through all the layers, this sets the minimum allocation size of an object on cephfs to 4K, just like on a disk with a 4K physical sector size. Larger objects are still allocated in one piece, so it doesn't mean that each object is split into 4K chunks!
It's good to pick a size that matches the lower bound of a large share of your files. Having a small subset of files below that size is still fine - those files will just keep some overhead. The majority (approx. 70-80%+) of your files should be at least as large as the minimum allocation size.
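To get a feeling for your file size distribution, a quick histogram like this helps (the path is a placeholder; run it against the data you plan to store):

```shell
# bucket file sizes into powers of two starting at 4K and count them
find /mnt/cephfs -type f -printf '%s\n' \
  | awk '{ b=4096; while ($1 > b) b*=2; count[b]++ } END { for (b in count) print b, count[b] }' \
  | sort -n
```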
OSD rebuild required
If you have a running cluster and adjust this value, you'll need to rebuild your OSDs (remove, zap and add them again), because the allocation size is fixed when an OSD is created and existing data cannot be migrated in the background.
Having rebuilt your OSDs, you can enjoy the reclaimed space!
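The rebuild itself is the usual remove/zap/re-add dance, one OSD at a time; roughly like this (the osd id and device are placeholders, and if you manage OSDs through an orchestrator such as cephadm the corresponding `ceph orch` commands apply instead):

```shell
# drain the OSD and wait for rebalancing to finish
ceph osd out 0

# stop the daemon and remove the OSD from the cluster
systemctl stop ceph-osd@0
ceph osd purge 0 --yes-i-really-mean-it

# wipe the disk and create a fresh OSD, which picks up the new allocation size
ceph-volume lvm zap --destroy /dev/sdb
ceph-volume lvm create --data /dev/sdb
```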