When fstrim stalls your I/O subsystem
On one of our systems, the I/O subsystem stalled once a week and disrupted database operations.
On one of our systems we run a fairly busy database that ingests data at a high rate 24/7. For a while the system ran into trouble once a week, and we assumed it was related to Postgres (TimescaleDB) freeing large portions of the database.
The problems arose in the night from Sunday to Monday at around 01:00 am: the whole I/O subsystem stalled for 30-45 minutes and the system became so slow that data ingestion ran into timeouts.
The problem was that we were so focused on the database as the root cause that we didn't look closely at the base system (which we should have done right away).
TL;DR - fstrim
After a while we discovered the following logs:
This shows that fstrim trimmed around 360 GiB on the disk. On the one hand this is quite nice, because the information about freed blocks gets passed on to the disk. On the other hand, we're running on a virtual machine that (by default) uses a fixed-size disk on a hybrid storage subsystem, so there is no need to trim at all.
Since we disabled the fstrim timer, everything has been fine.
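For reference, here is a minimal sketch of how to check for and disable the timer. It assumes a systemd-based distribution where the stock units are named fstrim.timer and fstrim.service (as shipped by util-linux); unit names may differ on your system. The two function names are made up for this post, and the commands need to run as root.

```shell
# Assumption: systemd with the stock fstrim.timer / fstrim.service units.
# The function names below are hypothetical helpers, not part of any tool.

# Show when the timer fires and what previous fstrim runs logged.
show_fstrim_history() {
    systemctl list-timers --all fstrim.timer    # next and last scheduled run
    journalctl --no-pager -u fstrim.service     # log entries of earlier runs
}

# Stop the timer now and keep it from being started at boot.
disable_fstrim_timer() {
    systemctl disable --now fstrim.timer
    # is-enabled exits non-zero for a disabled unit, so ignore its exit status
    systemctl is-enabled fstrim.timer || true
}
```

Call show_fstrim_history first to confirm that the stalls line up with the fstrim runs, then disable_fstrim_timer. Whether disabling is the right call depends on your storage: on thin-provisioned disks or physical SSDs, trimming is usually still worth keeping.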