Azure AKS fluxconfig-agent 401 causing unhealthy status - kubernetes

I have an AKS environment based on the AKS-Construction templates.
At some point fluxconfig-agent started reporting unhealthy. I checked the logs, and it looks like there is a 401 when it tries to fetch config from https://eastus.dp.kubernetesconfiguration.azure.com:
{"Message":"2022/10/03 17:09:01 URL:\u003e https://eastus.dp.kubernetesconfiguration.azure.com/subscriptions/xxx/resourceGroups/my-aks/provider/Microsoft.ContainerService-managedclusters/clusters/my-aks/configurations/getPendingConfigs?api-version=2021-11-01","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"eastus","ArmId":"/subscriptions/xxx/resourceGroups/my-aks/providers/Microsoft.ContainerService/managedclusters/my-aks","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.6.0","AgentTimestamp":"2022/10/03 17:09:01"}
{"Message":"2022/10/03 17:09:01 GET configurations returned response code {401}","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"eastus","ArmId":"/subscriptions/xxx/resourceGroups/my-aks/providers/Microsoft.ContainerService/managedclusters/my-aks","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.6.0","AgentTimestamp":"2022/10/03 17:09:01"}
{"Message":"2022/10/03 17:09:01 Failed to GET configurations with ResponseCode : {401}","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"eastus","ArmId":"/subscriptions/xxx/resourceGroups/my-aks/providers/Microsoft.ContainerService/managedclusters/my-aks","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.6.0","AgentTimestamp":"2022/10/03 17:09:01"}
{"Message":"Error in the getting the Configurations: error {%!s(\u003cnil\u003e)}","LogType":"ConfigAgentTrace","LogLevel":"Error","Environment":"prod","Role":"ClusterConfigAgent","Location":"eastus","ArmId":"/subscriptions/xxx/resourceGroups/my-aks/providers/Microsoft.ContainerService/managedclusters/my-aks","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.6.0","AgentTimestamp":"2022/10/03 17:09:01"}
{"Message":"2022/10/03 17:09:01 \"Errorcode: 401, Message Unauthorized client credentials., Target /subscriptions/xxx/resourceGroups/my-aks/provider/Microsoft.ContainerService-managedclusters/clusters/my-aks/configurations/getPendingConfigs\"","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"eastus","ArmId":"/subscriptions/xxx/resourceGroups/my-aks/providers/Microsoft.ContainerService/managedclusters/my-aks","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"1.6.0","AgentTimestamp":"2022/10/03 17:09:01"}
Is anyone here familiar with how fluxconfig-agent authenticates and what might cause a 401 here?

The problem seems to have gone away for now after upgrading my AKS cluster and nodes to the latest Kubernetes version.
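For reference, a minimal sketch of that upgrade via the Azure CLI, assuming the resource group and cluster name from the ArmId in the logs (the target version below is just an example):

# List the versions the cluster can move to
az aks get-upgrades --resource-group my-aks --name my-aks --output table
# Upgrade the control plane and, by default, the node pools with it
az aks upgrade --resource-group my-aks --name my-aks --kubernetes-version 1.24.6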

Related

EC2 metadata upgrade from IMDSv1 to IMDSv2 causes 403 and 401 errors - kube2iam

I recently updated my EC2 instances to use IMDSv2 but had to roll back because of the following issue:
It looks like after I did the upgrade my init containers started failing, and I saw the following in the logs:
time="2022-01-11T14:25:01Z" level=info msg="PUT /latest/api/token (403) took 0.753220 ms" req.method=PUT req.path=/latest/api/token req.remote=XXXXX res.duration=0.75322 res.status=403 time="2022-01-11T14:25:37Z" level=error msg="Error getting instance id, got status: 401 Unauthorized"
We are using kube2iam for this. Any advice on what changes need to be made on the kube2iam side to support IMDSv2? Below is some info from my kube2iam DaemonSet:
EKS = 1.21
image = "jtblin/kube2iam:0.10.9"

How to optimize Keycloak configuration

I have deployed Keycloak on Kubernetes and am having a performance issue with it, as follows:
I run 6 Keycloak pods in standalone-HA mode using KUBE_PING on Kubernetes, with an HPA that scales out when CPU exceeds 80%. When I load-test logins, the threshold is only about 150 concurrent users (CCU), and above that threshold errors occur. The Keycloak pod logs show timeouts like these:
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command RemoveCommand on Cache 'authenticationSessions', writing keys [f85ac151-6196-48e9-977c-048fc8bcd975]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2312955 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command ReplaceCommand on Cache 'loginFailures', writing keys [LoginFailureKey [ realmId=etc-internal. userId=a76c3578-32fa-42cb-88d7-fcfdccc5f5c6 ]]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2201111 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p20-t1) ISPN000136: Error executing command PutKeyValueCommand on Cache 'sessions', writing keys [6a0e8cde-82b7-4824-8082-109ccfc984b4]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2296440 from keycloak-9b6486f7-9l9lq"
Keycloak's RAM and CPU usage stays very low, under 20%, so the HPA never scales out. So I think the current Keycloak configuration is not optimized, in particular the number of CACHE_OWNERS, Access Token Lifespan, SSO Session Idle, SSO Session Max, etc.
I want to know which settings to tune so that Keycloak can handle 500 CCU with a response time of ~3 s. Please help if you know about this!
In the standalone-ha.xml config, I have only changed the datasource settings, as in the image below (image omitted).
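On the CACHE_OWNERS point: in the default standalone-ha.xml the distributed session caches have owners="1", so each entry lives on a single pod and any slow or departing member surfaces as the ISPN000476 timeouts above. A minimal sketch of raising the owner count, assuming the jboss/keycloak image (which reads these environment variables) and a Deployment named keycloak:

# Keep a second replica of each session/auth-session cache entry, then restart
kubectl set env deployment/keycloak CACHE_OWNERS_COUNT=2 CACHE_OWNERS_AUTH_SESSIONS_COUNT=2
kubectl rollout restart deployment/keycloak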

InvalidIdentityToken: Couldn't retrieve verification key from your identity provider

I am new to AWS and kubectl, and I need to deploy an app to AWS. After deploying it to an EKS cluster, I edited the ingress with kubectl, but unfortunately it returned 404 Not Found. (I am pretty sure the new service container works fine.)
After running kubectl describe ingress, here are some of the event reports:
Warning FailedBuildModel 40m ingress Failed build model due to WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Couldn't retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements
status code: 400, request id: xxxxxxxx-4a93-4e27-9d6b-xxxxxxxx
Warning FailedBuildModel 22m ingress Failed build model due to WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Couldn't retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements
status code: 400, request id: xxxxxxxx-5368-41e1-8a4d-xxxxxxxx
Warning FailedBuildModel 5m8s ingress Failed build model due to WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Couldn't retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements
status code: 400, request id: xxxxxxxx-20ea-4bd0-b1cb-xxxxxxxx
Does anyone have any ideas about this issue?
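This error usually means IAM cannot fetch the cluster's OIDC signing keys, so the controller's AssumeRoleWithWebIdentity call fails. A sketch of the usual checks, assuming a cluster named my-cluster (the real name is not in the question):

# Find the cluster's OIDC issuer URL
aws eks describe-cluster --name my-cluster \
  --query "cluster.identity.oidc.issuer" --output text
# Check that a matching IAM OIDC provider is registered
aws iam list-open-id-connect-providers
# If it is missing, associate one
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve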

fluxcd not applying anything with err="running kubectl: error: unable to recognize \"STDIN\": ..."

I recently installed FluxCD 1.19.0 on an Azure AKS Kubernetes cluster using fluxctl install. We use a private Git server (self-hosted Bitbucket), which Flux is able to reach and check out.
Now Flux is not applying anything, failing with the error message:
ts=2020-06-10T09:07:42.7589883Z caller=loop.go:133 component=sync-loop event=refreshed url=ssh://git@bitbucket.some-private-server.com:7999/infra/k8s-gitops.git branch=master HEAD=7bb83d1753a814c510b1583da6867408a5f7e21b
ts=2020-06-10T09:09:00.631764Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=7bb83d1753a814c510b1583da6867408a5f7e21b new=7bb83d1753a814c510b1583da6867408a5f7e21b
ts=2020-06-10T09:09:01.6130559Z caller=sync.go:539 method=Sync cmd=apply args= count=3
ts=2020-06-10T09:09:20.2097034Z caller=sync.go:605 method=Sync cmd="kubectl apply -f -" took=18.5965923s err="running kubectl: error: unable to recognize \"STDIN\": an error on the server (\"\") has prevented the request from succeeding" output=
ts=2020-06-10T09:09:38.7432182Z caller=sync.go:605 method=Sync cmd="kubectl apply -f -" took=18.5334244s err="running kubectl: error: unable to recognize \"STDIN\": an error on the server (\"\") has prevented the request from succeeding" output=
ts=2020-06-10T09:09:57.277918Z caller=sync.go:605 method=Sync cmd="kubectl apply -f -" took=18.5346491s err="running kubectl: error: unable to recognize \"STDIN\": an error on the server (\"\") has prevented the request from succeeding" output=
ts=2020-06-10T09:09:57.2779965Z caller=sync.go:167 component=daemon err="<cluster>:namespace/dev: running kubectl: error: unable to recognize \"STDIN\": an error on the server (\"\") has prevented the request from succeeding; <cluster>:namespace/prod: running kubectl: error: unable to recognize \"STDIN\": an error on the server (\"\") has prevented the request from succeeding; dev:service/hello-world: running kubectl: error: unable to recognize \"STDIN\": an error on the server (\"\") has prevented the request from succeeding"
ts=2020-06-10T09:09:57.2879489Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-06-10T09:09:57.3002208Z caller=images.go:27 component=sync-loop msg="no automated workloads"
From what I understand, Flux passes the resource definitions to kubectl, which then applies them?
The way I interpret the error, kubectl isn't being passed anything. However, I opened a shell in the container and made sure Flux was in fact checking something out - which it was.
I tried raising the verbosity to 9, but it didn't return anything that I deemed relevant (only detailed outputs of the HTTP requests and responses against the Kubernetes API).
So what is happening here?
The problem was with the version of kubectl bundled in the Flux 1.19 release, so I fixed it by using a prerelease image: https://hub.docker.com/r/fluxcd/flux-prerelease/tags
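For reference, swapping the image in place looks roughly like this; a sketch assuming Flux runs as a Deployment named flux in the flux namespace (your fluxctl install options may differ), with the tag taken from the prerelease repository linked above:

# Point the Deployment at a prerelease image and wait for the rollout
kubectl -n flux set image deployment/flux flux=fluxcd/flux-prerelease:<tag>
kubectl -n flux rollout status deployment/flux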

Azure CD Issue: Failed to fetch App Service 'myAppServiceName' publishing credentials

I'm trying to deploy my release to an Azure Web App. It's not working and I don't know what to do. Maybe I'm missing something in the configuration of my App Service or my release pipeline. I get the following error:
Failed to fetch App Service 'myAppServiceName' publishing credentials. Error: Could not fetch access token for Managed Service Principal.
And here is a block of my debug log:
2019-04-11T08:25:35.4761242Z ##[debug]Predeployment Step Started
2019-04-11T08:25:35.4776374Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data subscriptionid = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2019-04-11T08:25:35.4776793Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data subscriptionname = Paiement à l’utilisation
2019-04-11T08:25:35.4777798Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e auth param serviceprincipalid = null
2019-04-11T08:25:35.4778094Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data environmentAuthorityUrl = https://login.windows.net/
2019-04-11T08:25:35.4781237Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e auth param tenantid = ***
2019-04-11T08:25:35.4782509Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e=https://management.azure.com/
2019-04-11T08:25:35.4782769Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data environment = AzureCloud
2019-04-11T08:25:35.4785012Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e auth scheme = ManagedServiceIdentity
2019-04-11T08:25:35.4785626Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data msiclientId = undefined
2019-04-11T08:25:35.4785882Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data activeDirectoryServiceEndpointResourceId = https://management.core.windows.net/
2019-04-11T08:25:35.4786107Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data AzureKeyVaultServiceEndpointResourceId = https://vault.azure.net
2019-04-11T08:25:35.4786348Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data AzureKeyVaultDnsSuffix = vault.azure.net
2019-04-11T08:25:35.4786525Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e auth param authenticationType = null
2019-04-11T08:25:35.4786735Z ##[debug]33ddf4aa-03c4-4031-95fa-e2083d49cc9e data EnableAdfsAuthentication = false
2019-04-11T08:25:35.4792324Z ##[debug]{"subscriptionID":"mysubscriptionID","subscriptionName":"Paiement à l’utilisation","servicePrincipalClientID":null,"environmentAuthorityUrl":"https://login.windows.net/","tenantID":"***","url":"https://management.azure.com/","environment":"AzureCloud","scheme":"ManagedServiceIdentity","activeDirectoryResourceID":"https://management.azure.com/","azureKeyVaultServiceEndpointResourceId":"https://vault.azure.net","azureKeyVaultDnsSuffix":"vault.azure.net","authenticationType":null,"isADFSEnabled":false,"applicationTokenCredentials":{"clientId":null,"domain":"***","baseUrl":"https://management.azure.com/","authorityUrl":"https://login.windows.net/","activeDirectoryResourceId":"https://management.azure.com/","isAzureStackEnvironment":false,"scheme":0,"isADFSEnabled":false}}
2019-04-11T08:25:35.4809400Z Got service connection details for Azure App Service:'myAppServiceName'
2019-04-11T08:25:35.4846967Z ##[debug][GET]http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/
2019-04-11T08:25:35.5443632Z ##[debug]Deployment Failed with Error: Error: Failed to fetch App Service 'myAppServiceName' publishing credentials. Error: Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine 'https://aka.ms/azure-msi-docs'. Status code: 400, status message: Bad Request
2019-04-11T08:25:35.5444488Z ##[debug]task result: Failed
2019-04-11T08:25:35.5501745Z ##[error]Error: Failed to fetch App Service 'myAppServiceName' publishing credentials. Error: Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine 'https://aka.ms/azure-msi-docs'. Status code: 400, status message: Bad Request
2019-04-11T08:25:35.5511780Z ##[debug]Processed: ##vso[task.issue type=error;]Error: Failed to fetch App Service 'myAppServiceName' publishing credentials. Error: Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine 'https://aka.ms/azure-msi-docs'. Status code: 400, status message: Bad Request
2019-04-11T08:25:35.5512729Z ##[debug]Processed: ##vso[task.complete result=Failed;]Error: Failed to fetch App Service 'myAppServiceName' publishing credentials. Error: Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine 'https://aka.ms/azure-msi-docs'. Status code: 400, status message: Bad Request
2019-04-11T08:25:35.5512828Z Failed to add release annotation. Error: Failed to get App service 'myAppServiceName' application settings. Error: Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine 'https://aka.ms/azure-msi-docs'. Status code: 400, status message: Bad Request
2019-04-11T08:25:35.5645194Z (node:5004) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: Failed to fetch App Service 'myAppServiceName' publishing profile. Error: Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine 'https://aka.ms/azure-msi-docs'. Status code: 400, status message: Bad Request
2019-04-11T08:25:35.5759915Z ##[section]Finishing: Deploy Azure App Service
And some screenshots (images omitted): the Azure configuration that may be missing something, and release pipeline configs 1-3.
Let me know if you need more information. I'm new to this, so maybe I'm missing simple things... Best regards
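As a quick check, the token request the task makes (the [GET] line in the debug log above) can be reproduced directly on the build agent VM; Azure IMDS requires the Metadata header:

# A 400 "identity not found" here typically confirms MSI is not set up on the agent VM
curl -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/"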
Do you have the identity Status set to On?
Like below (screenshot omitted).
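If the answer above means the App Service's system-assigned identity, the CLI equivalent is roughly this (names are placeholders):

# Show the current managed identity state of the Web App
az webapp identity show --name myAppServiceName --resource-group myResourceGroup
# Enable the system-assigned identity if it is missing
az webapp identity assign --name myAppServiceName --resource-group myResourceGroup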
In my case, we had just moved our App Service to a new resource group, but the pipeline was still referencing the old resource group. Correcting the resource group fixed the issue.
A simple typo can also be the reason for this error message.
You will get this error message even if it's just a typo or a wrong value in your "slotName".
Please do ensure that the "slotName" you've given is the actual slot name (the default is 'production'). So if you've added a slot called 'stage', then inside the portal it will appear as '/stage' or '-stage', but it's still just called 'stage'.
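To double-check the exact slot names, you can list them; a sketch with placeholder names (the default production slot is not included in this list):

az webapp deployment slot list --name myAppServiceName --resource-group myResourceGroup --output table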
I know several people have seen this error message and none of the above helped them (I faced the same issue the first time).
My research indicated this to be an intermittent problem.
I redeployed twice and it worked.
The first redeploy just seemed to wait for ages to connect to an available agent, so I cancelled it and redeployed again, which worked without any issue.
If someone still has this issue, all I did was rerun the release and it went well. Hopefully this saves someone time by just re-releasing; if that doesn't work, then try something else.