On one of our systems we run a very active database that ingests data at a high rate 24/7. For a while the system ran into trouble once a week, and we assumed it was related to Postgres (TimescaleDB) freeing large portions of the database.
The problems occurred in the night from Sunday to Monday at 01:00 am: the whole I/O subsystem stalled for 30-45 minutes, and the system became so slow that data ingestion ran into timeouts.
The problem is that we were so focused on the database as the root cause that we didn't look closely at the base system (which we should have done right away).
TL;DR - fstrim
After a while we discovered the following logs:
This shows that fstrim trimmed around 360 GiB on the disk. On the one hand this is nice, because it tells the disk which blocks are no longer in use. On the other hand we're running on a virtual machine that (by default) uses a fixed-size disk on a hybrid storage subsystem, so there is no need to trim at all.
Since we disabled the fstrim timer, everything has been fine.
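For reference, here is a minimal sketch of how the timer can be inspected and switched off on a systemd-based system. It assumes the stock unit names `fstrim.timer` and `fstrim.service` as shipped by Ubuntu; the two function names are made up for illustration, and nothing runs until a function is called.

```shell
#!/bin/sh
# Sketch only: assumes systemd and the default fstrim.timer / fstrim.service units.

# Show when the timer fires next and what the most recent trim runs logged.
show_fstrim_schedule() {
    systemctl list-timers fstrim.timer
    journalctl -u fstrim.service --no-pager -n 20
}

# Stop the timer immediately and keep it from starting at boot (needs root).
disable_fstrim_timer() {
    systemctl disable --now fstrim.timer
}
```

Calling `show_fstrim_schedule` first makes it easy to check whether the trim runs line up with the stalls before running `disable_fstrim_timer` as root.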
fstrim has been enabled by default on Ubuntu since version 18.04.
On a recent project I stumbled on a case where Kerberos tickets were inadvertently shared across containers on a node, which obviously caught my attention, as I'm not keen on sharing such secrets across workloads. This post describes why this happens and how to prevent it.
If you run Kubernetes on your own, you need to provide a storage solution for it. We are using Ceph (operated through Rook). This article gives a short overview of its benefits and some pros and cons.