One type of incident that arose in several customer environments over the last few weeks has been the result of an issue with Azure pod-managed identities.
What is managed pod identity?
The idea behind pod identity is that you can assign an identity (usually a managed service identity [MSI]) to a pod running in an Azure Kubernetes Service (AKS) cluster.
Using some magic in the network stack (under the hood: iptables & NAT), the pod can then request an access token from a well-known OAuth2 endpoint (namely: http://169.254.169.254/metadata/identity/oauth2/token), providing just the clientId of the previously assigned MSI and the desired scope. Services in the background inject the authentication for you.
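To make the shape of that token request concrete, here is a small sketch that just builds the URL a pod would call. The endpoint and query parameters come from the Azure Instance Metadata Service documentation; the clientId and resource values below are made-up placeholders.

```python
# Sketch of the token request a pod sends to the Azure IMDS endpoint.
# With pod identity, the NMI component intercepts this call via iptables/NAT.
from urllib.parse import urlencode

IMDS_TOKEN_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"

def build_token_request(client_id: str, resource: str,
                        api_version: str = "2018-02-01") -> str:
    """Return the full URL for requesting a token for a user-assigned MSI."""
    query = urlencode({
        "api-version": api_version,
        "resource": resource,    # the desired scope/audience, e.g. an Azure service
        "client_id": client_id,  # clientId of the MSI assigned to the pod
    })
    return f"{IMDS_TOKEN_ENDPOINT}?{query}"

# Placeholder values for illustration only.
url = build_token_request("00000000-0000-0000-0000-000000000000",
                          "https://management.azure.com/")
print(url)
```

Note that the real request must also carry the header `Metadata: true`; the sketch only shows which parameters identify the MSI and the scope.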
Basically, that's how user-assigned identities on Azure work in general; the only difference in this scenario is that, when running in managed mode, a controller pod dynamically attaches and detaches the MSIs on the nodepool on which the requesting pod runs.
For more detail, see: https://learn.microsoft.com/en-us/azure/aks/use-azure-ad-pod-identity
Receiving "identity not found"
Now that you have a brief overview of pod identity: you might have received an error message similar to this:
We saw this issue "randomly" on the affected environments. These environments mostly run many jobs that process data in batches (Spark applications), so many pods run only for a specific purpose and then stop again.
After some diagnosis we found the precise condition to reproduce it: the "identity not found" error occurs when a pod is the first one to request a managed identity on a given nodepool and the workload in that pod starts within a short timeframe.
Why does this happen?
The origin of the issue is how the overall procedure works:
- A pod starts and requests a specific identity
- The pod identity components check whether the MSI is already attached to the VMSS of the nodepool on which the pod is scheduled
- They detect that the MSI needs to be attached and perform this step
- Meanwhile the pod itself starts and might already have sent a request to the OAuth2 endpoint
- Although the MSI is now attached to the VMSS, it is not yet effective: this propagation takes a varying amount of time (we saw it in the range between 10 and 90 seconds)
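The timing race described in the steps above can be sketched as a toy simulation. All names and timings here are illustrative (we only observed the 10-90s propagation window; 45 seconds is an arbitrary pick within it):

```python
# Toy simulation of the race: the identity only becomes effective some time
# after attachment is triggered; a pod that asks for a token before that
# point gets "identity not found".

ATTACH_PROPAGATION_SECONDS = 45.0  # illustrative; observed range was 10-90s

def token_request(now: float, attach_started_at: float) -> str:
    """Return a fake token if the MSI is effective on the VMSS, else raise."""
    if now >= attach_started_at + ATTACH_PROPAGATION_SECONDS:
        return "eyJ-fake-access-token"
    raise RuntimeError("identity not found")

attach_started_at = 0.0

# First pod: its workload starts 5s after the attachment was triggered -> fails.
try:
    token_request(now=5.0, attach_started_at=attach_started_at)
except RuntimeError as err:
    print(f"pod 1: {err}")

# A later pod on the same nodepool: the identity is already effective -> succeeds.
print("pod 2:", token_request(now=120.0, attach_started_at=attach_started_at))
```

This also hints at why simple retries in the workload (waiting out the propagation window) can mask the problem, even though they don't remove the race itself.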
I hope this makes it a bit clearer now: in general, everything works. The problem is that the process is asynchronous. The workload starts while the MSI is not yet fully effective on the nodepool (or VMSS, to be more precise).
Why does it only happen sometimes?
We were able to fully reproduce this, but only when attaching an MSI from a "cold" state (where no pod had requested it before). From the customer's perspective it appeared random because multiple jobs often used the same identity (as they worked on the same data sources): the first worker might hit the error, but all subsequent ones started in an environment where the identity had already been attached.
TL;DR - How to fix it?
In our case we didn't find a fix for it - we found a workaround.
The workaround is to run a dummy pod (such as the Kubernetes pause image) and bind to it all the MSIs that need to stay attached to a nodepool.
As an example: if you always need three MSIs (sql-reader, sa-reader, kv-reader) available on a nodepool, create the following resources:
- Create three AzureIdentity resources, mapping the MSIs to objects in Kubernetes
- Create three AzureIdentityBinding resources, mapping the AzureIdentity resources from the first step to a common selector (all three use the same one)
- Create a deployment (per nodepool) that has the selector specified in the AzureIdentityBinding applied
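A sketch of these resources for one of the three MSIs (sql-reader) could look like this, assuming the aad-pod-identity CRDs are installed. The subscription/resource group placeholders, the selector name `identity-keepalive`, and the nodepool name `nodepool1` are made up for illustration; repeat the first two manifests for sa-reader and kv-reader with the same selector:

```yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  name: sql-reader
spec:
  type: 0  # 0 = user-assigned MSI
  resourceID: /subscriptions/<sub-id>/resourcegroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/sql-reader
  clientID: <client-id-of-sql-reader>
---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: sql-reader-binding
spec:
  azureIdentity: sql-reader
  selector: identity-keepalive  # all three bindings share this selector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: identity-keepalive-nodepool1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: identity-keepalive
  template:
    metadata:
      labels:
        app: identity-keepalive
        aadpodidbinding: identity-keepalive  # matches the binding selector
    spec:
      nodeSelector:
        agentpool: nodepool1  # pin the dummy pod to the target nodepool
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
```

The dummy deployment keeps all three MSIs permanently attached to the nodepool's VMSS, so real workloads never hit the cold-attach race.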
In our case, I've written a Helm chart to do the job for us. Here's an example output you can use in your case.
Hopefully this helps you get rid of this issue at some point, until you can migrate to workload identity.