"Identity not found" when using Azure pod-managed identities

27 Dec 2022

One type of incident that arose in several customer environments over the last few weeks was the result of an issue with Azure pod-managed identities.

What is managed podidentity?

The idea behind podidentity is that you can assign an identity (usually a managed service identity [MSI]) to a pod running in an Azure Kubernetes Service (AKS) cluster.

Using some magic on the network stack (iptables and NAT under the hood), you can then request an access token from a well-known OAuth2 endpoint (namely: http://169.254.169.254/metadata/identity/oauth2/token), providing just the clientId of the previously assigned MSI and the desired scope. Services running in the background inject the authentication for you.

Basically, that's how user-assigned identities on Azure work. The only difference is that in this scenario, when running in managed mode, a controller pod dynamically attaches and detaches the MSIs on the nodepool on which the requesting pod runs.
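The token request described above can be sketched in a few lines of Python. This is a minimal sketch, not taken from our environments: the function names are hypothetical, and the `api-version` value and the `Metadata: true` header are what the instance metadata endpoint expects to my knowledge. Note that the endpoint's query parameter for the target is called `resource`.

```python
import json
import urllib.parse
import urllib.request

# Well-known token endpoint mentioned in the text.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(client_id: str, resource: str) -> urllib.request.Request:
    """Build the GET request for a token of a user-assigned identity.

    client_id: clientId of the MSI assigned to the pod.
    resource:  the service the token is for, e.g. "https://vault.azure.net".
    """
    query = urllib.parse.urlencode({
        "api-version": "2018-02-01",  # assumed API version
        "resource": resource,
        "client_id": client_id,
    })
    return urllib.request.Request(
        f"{IMDS_TOKEN_URL}?{query}",
        headers={"Metadata": "true"},  # required by the metadata endpoint
    )


def fetch_token(client_id: str, resource: str, timeout: float = 5.0) -> str:
    """Send the request and return the access token from the JSON response."""
    request = build_token_request(client_id, resource)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["access_token"]
```

Calling `fetch_token(...)` only works from inside a pod (or VM) that actually has the identity assigned; anywhere else the request will time out or be rejected.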

For more detail, see: https://learn.microsoft.com/en-us/azure/aks/use-azure-ad-pod-identity

Deprecated: Microsoft deprecated pod-managed identities in favor of Azure AD workload identity.

Receiving "identity not found"

Now that you have a brief overview of podidentity: you might have received an error message similar to this:

{"error":"invalid_request","error_description":"Identity not found"}

error message

On the affected environments, the issue seemed to appear "randomly". These environments mostly run many jobs that process data in batches (with Spark applications), so many pods run only for a specific purpose and then stop again.

After some diagnosis, we found a more precise description (one that lets you reproduce it): the "identity not found" error occurs when a pod is the first one to request a given managed identity on a nodepool and the workload in that pod starts within a certain timeframe, before the identity has become effective.

Why does this happen?

The origin of the issue is how the overall procedure works:

A pod starts and requests a specific identity
The MSI pods check if the MSI is already attached on the VMSS of the nodepool on which the pod is scheduled
The MSI pods detect that the MSI needs to be attached and perform this step
The pod itself is now starting and might already have sent a request to the oauth2 endpoint
Although the MSI is now attached to the VMSS, it is not yet effective, as this takes a varying amount of time to apply (we saw anything between 10 and 90 seconds)

I hope this makes things clearer: in general, everything works. The problem is that the process is asynchronous. The workload starts while the MSI is not yet fully enabled on the nodepool (or on the VMSS, to be more precise).
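Because the attachment is asynchronous, a workload that tolerates the delay would have to keep retrying until the identity becomes effective. The following is only a sketch of that idea and not the workaround we ended up using: `IdentityNotFoundError` and `fetch_with_retry` are hypothetical names, `fetch` stands for whatever function requests the token, and the default of 120 seconds is simply chosen to cover the 10 to 90 seconds we observed.

```python
import time
from typing import Callable


class IdentityNotFoundError(Exception):
    """Hypothetical error raised by `fetch` on an "Identity not found" response."""


def fetch_with_retry(
    fetch: Callable[[], str],
    max_wait_seconds: float = 120.0,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call `fetch` until it succeeds or the time budget is used up.

    Only IdentityNotFoundError is retried; any other error is raised at once.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return fetch()
        except IdentityNotFoundError:
            # Give up if another wait would take us past the deadline.
            if clock() + delay_seconds > deadline:
                raise
            sleep(delay_seconds)
```

The drawback is that every workload has to implement such a loop itself, which is often not possible for third-party applications.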

Why does it only happen sometimes?

We were able to reproduce this reliably, but only when attaching an MSI from a "cold" state (where no pod had requested it before). From a customer perspective it looked random, because multiple jobs often used the same identity (as they worked on the same data sources). The first worker might hit the error, while all subsequent ones started in an environment where the identity had already been attached.

TL;DR - How to fix it?

In our case we didn't find a fix for it - we found a workaround.

The workaround is to run a dummy pod (for example, using the Kubernetes pause image) and attach to it all the MSIs that you need to have attached to a nodepool.

As an example: if you have three MSIs (sql-reader, sa-reader, kv-reader) that always need to be available on a nodepool, create the following resources:

Create three AzureIdentity resources, mapping the MSIs to objects in Kubernetes
Create three AzureIdentityBinding resources, mapping the AzureIdentity resources from the first step to a common selector (all three use the same one)
Create a deployment (per nodepool) whose pods carry the selector specified in the AzureIdentityBinding resources (as the aadpodidbinding label)

In our case, I've written a Helm chart to do the job for us. Here's an example of its output that you can adapt to your case.

---
# Source: aad-pod-identity-alwayson/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-msialwayson-idset1-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-msialwayson-idset1-r1
  template:
    metadata:
      labels:
        app: example-msialwayson-idset1-r1
        aadpodidbinding: example-msialwayson-idset1
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: batchprocessing
      containers:
        - name: pause
          image: google/pause
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: 25m
              memory: 32M
            requests:
              cpu: 0
              memory: 0
      tolerations:
        - effect: NoExecute
          key: CriticalAddonsOnly
          operator: Equal
          value: "true"
---
# Source: aad-pod-identity-alwayson/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-msialwayson-idset1-r2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-msialwayson-idset1-r2
  template:
    metadata:
      labels:
        app: example-msialwayson-idset1-r2
        aadpodidbinding: example-msialwayson-idset1
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: compute
      containers:
        - name: pause
          image: google/pause
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: 25m
              memory: 32M
            requests:
              cpu: 0
              memory: 0
---
# Source: aad-pod-identity-alwayson/templates/azureidentity.yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  annotations:
    aadpodidentity.k8s.io/Behavior: namespaced
  name: example-msialwayson-idset1-msi1
spec:
  clientID: 0a8a1d7c-3ef1-45a2-89bd-285e4fbf8cc0
  resourceID: /subscriptions/34f99505-91b1-4189-9b16-53bc868fa2cb/resourceGroups/my-resource-group/providers/Microsoft.ManagedIdentity/userAssignedIdentities/sql-reader
  type: 0
---
# Source: aad-pod-identity-alwayson/templates/azureidentity.yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  annotations:
    aadpodidentity.k8s.io/Behavior: namespaced
  name: example-msialwayson-idset1-msi2
spec:
  clientID: 6dd65566-ab49-1236-8838-1556da5e7f1c
  resourceID: /subscriptions/34f99505-91b1-4189-9b16-53bc868fa2cb/resourceGroups/my-resource-group/providers/Microsoft.ManagedIdentity/userAssignedIdentities/sa-reader
  type: 0
---
# Source: aad-pod-identity-alwayson/templates/azureidentity.yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  annotations:
    aadpodidentity.k8s.io/Behavior: namespaced
  name: example-msialwayson-idset1-msi3
spec:
  clientID: 9d165566-b6a1-6541-8838-1556da5e7d0a
  resourceID: /subscriptions/34f99505-91b1-4189-9b16-53bc868fa2cb/resourceGroups/my-resource-group/providers/Microsoft.ManagedIdentity/userAssignedIdentities/kv-reader
  type: 0
---
# Source: aad-pod-identity-alwayson/templates/azureidentitybinding.yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: example-msialwayson-idset1-msi1
spec:
  azureIdentity: example-msialwayson-idset1-msi1
  selector: example-msialwayson-idset1
---
# Source: aad-pod-identity-alwayson/templates/azureidentitybinding.yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: example-msialwayson-idset1-msi2
spec:
  azureIdentity: example-msialwayson-idset1-msi2
  selector: example-msialwayson-idset1
---
# Source: aad-pod-identity-alwayson/templates/azureidentitybinding.yaml
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: example-msialwayson-idset1-msi3
spec:
  azureIdentity: example-msialwayson-idset1-msi3
  selector: example-msialwayson-idset1

fix for the podidentity issue
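If you don't want to use Helm, the naming pattern of the output above is simple enough to generate yourself. The following Python sketch is not part of our chart: `always_on_manifests` is a hypothetical helper that builds one AzureIdentity and one AzureIdentityBinding per MSI, all sharing the same selector, following the structure of the manifests shown above.

```python
from typing import Dict, List, Tuple

API_VERSION = "aadpodidentity.k8s.io/v1"


def always_on_manifests(prefix: str, identities: List[Tuple[str, str]]) -> List[Dict]:
    """Build AzureIdentity / AzureIdentityBinding pairs for a set of MSIs.

    prefix:     common name prefix, also used as the shared selector.
    identities: list of (clientID, resourceID) tuples, one per MSI.
    """
    manifests: List[Dict] = []
    for index, (client_id, resource_id) in enumerate(identities, start=1):
        name = f"{prefix}-msi{index}"
        manifests.append({
            "apiVersion": API_VERSION,
            "kind": "AzureIdentity",
            "metadata": {
                "name": name,
                "annotations": {"aadpodidentity.k8s.io/Behavior": "namespaced"},
            },
            "spec": {"type": 0, "clientID": client_id, "resourceID": resource_id},
        })
        manifests.append({
            "apiVersion": API_VERSION,
            "kind": "AzureIdentityBinding",
            "metadata": {"name": name},
            # Every binding points at the same selector, so one dummy pod
            # labelled with it keeps all identities attached.
            "spec": {"azureIdentity": name, "selector": prefix},
        })
    return manifests
```

The returned dictionaries can be serialized (for example with `json.dumps`) and applied next to a deployment whose pods carry `aadpodidbinding: <prefix>` as a label.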

Hopefully this helps you at some point to get rid of this issue until you can migrate to workload identity.

Daniel Nachtrub

Kind of likes computers. Linux foundation certified: LFCS / CKA / CKAD / CKS. Microsoft certified: Cybersecurity Architect Expert & Azure Solutions Architect Expert.
