
ceph reports "n hosts fail cephadm check"
One of our Ceph clusters entered the HEALTH_WARN state with the reason "1 hosts fail cephadm check". This guide gives a quick tip on how to find out more about the issue.
One of our Ceph clusters entered the HEALTH_WARN state although everything seemed to be running normally. The cluster status showed:
cluster:
id: 2b9ccc20-2b33-11eb-8d8f-00155d51f07c
health: HEALTH_WARN
1 hosts fail cephadm check
All daemons were running and everything worked as expected. So how do you find out what is wrong?
The ceph cephadm check-host command verifies that a host is reachable and meets the requirements for running Ceph. Let's try it:
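The status summary only counts the failing hosts; it does not name them. The ceph health detail command expands each warning, so a small wrapper like the one below can narrow the report down to the cephadm entry. This is a sketch: the function name host_check_detail is made up for this post, and the assumption that the per-host lines follow within three lines of the match may not hold on every release.

```shell
# Show the part of the detailed health report that belongs to the
# failed cephadm host check.
host_check_detail() {
    # "cephadm check" is the wording from the status summary above.
    # -A 3 also prints the three lines after each match, where the
    # per-host detail is assumed to appear.
    ceph health detail | grep -A 3 'cephadm check'
}

# Usage (on a node with an admin keyring):
#   host_check_detail
```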
ceph cephadm check-host nuv-dc-apphost2
check-host failed:
INFO:cephadm:podman|docker (/usr/bin/docker) is present
INFO:cephadm:systemctl is present
INFO:cephadm:lvcreate is present
WARNING:cephadm:No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service']
INFO:cephadm:Hostname "nuv-dc-apphost2" matches what is expected.
ERROR: No time synchronization is active
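Most of that report is INFO lines for checks that passed. If you only care about what failed, you can filter the output. The helper below is a minimal sketch; failed_checks is a name invented here, and it assumes the severity is the first word on each line, as in the output above.

```shell
# Read a check-host report on stdin and keep only WARNING and ERROR lines.
failed_checks() {
    grep -E '^[[:space:]]*(WARNING|ERROR)'
}

# Usage (2>&1 in case part of the report is written to stderr):
#   ceph cephadm check-host nuv-dc-apphost2 2>&1 | failed_checks
```

With the report shown above, this would leave just the time sync warning and the final error line.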
Checking the NTP daemon on the affected host showed that it was indeed down. I started the daemon again, and the cluster returned to a healthy state right afterwards:
ceph cephadm check-host nuv-dc-apphost3
nuv-dc-apphost3 (None) ok
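To see for yourself which time sync service (if any) is running on a host, you can walk through the same unit names that cephadm listed in its warning. The following is a sketch under that assumption; the function and variable names are made up for illustration, and the unit you actually need to start depends on what is installed on your host.

```shell
# Unit names copied from the cephadm warning shown earlier.
TIME_SYNC_UNITS="chrony.service chronyd.service systemd-timesyncd.service ntpd.service ntp.service"

# Print the first time sync unit that systemd reports as active.
# Returns non-zero if none of them is running.
active_time_sync_unit() {
    for unit in $TIME_SYNC_UNITS; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit"
            return 0
        fi
    done
    return 1
}

# Usage on the affected host:
#   active_time_sync_unit || echo "no time sync service is running"
#   sudo systemctl start ntpd.service   # assumption: replace with the unit your host uses
```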
So if you run into a similar issue, run ceph cephadm check-host to see which check failed, so that you can resolve it.