As we're working - just like everyone else :-) - with AI tooling, we're using ollama to host our LLMs. After updating to the recent NVIDIA driver (555.85), we noticed that ollama no longer uses our GPU.
Testing the GPU mapping into the container shows that the GPU is still there:
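A check along these lines makes the mapping visible from inside a container (a minimal sketch; the CUDA image tag and the `--gpus=all` flag are illustrative assumptions, not taken from our setup):

```bash
# Start a throwaway CUDA container with the GPU mapped in and run nvidia-smi.
# If the GPU shows up here, the container-side mapping itself is working.
docker run --rm --gpus=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```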
Long story short: the culprit seems to be an incompatibility between NVIDIA driver 555.85 and ollama. Downgrade the driver (for example to 552.44) and all is fine again :-)
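After the downgrade, the active driver version is easy to double-check (a small sketch, assuming nvidia-smi is on the PATH):

```bash
# Print only the driver version; should report 552.44 after the downgrade.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```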
On a recent project I stumbled upon a case where Kerberos tickets were inadvertently shared across containers on a node - which obviously caught my attention, as I'm not keen on sharing such secrets across workloads. This post describes why this happens and what to do to prevent it.
If you run Kubernetes on your own, you need to provide a storage solution alongside it. We are using Ceph (operated through Rook). This article gives a short overview of its benefits and some of its pros and cons.