I'm running a Kubernetes cluster in AWS via EKS. Everything appears to be working as expected, but I'm checking through all the logs to verify. I hopped onto one of the worker nodes and noticed a bunch of errors in the kubelet service logs:
Oct 09 09:42:52 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 09:42:52.335445 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
Oct 09 10:03:54 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 10:03:54.831820 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
The nodes all show as Ready, but I'm not sure why those errors are appearing. I have 3 worker nodes and all 3 show the same kubelet errors (with different hostnames, obviously).
Additional information: the error appears to come from this line in kubelet_node_status.go:
node, err := kl.heartbeatClient.CoreV1().Nodes().Get(string(kl.nodeName), opts)
if err != nil {
    return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
}
From the workers I can execute get nodes using kubectl just fine:
kubectl get --kubeconfig=/var/lib/kubelet/kubeconfig nodes
NAME STATUS ROLES AGE VERSION
ip-172-26-0-58.ec2.internal Ready <none> 1h v1.10.3
ip-172-26-1-193.ec2.internal Ready <none> 1h v1.10.3
Turns out this is not an issue. Official reply from AWS regarding these errors:
The kubelet will regularly report node status to the Kubernetes API. When it does so it needs an authentication token generated by the aws-iam-authenticator. The kubelet will invoke the aws-iam-authenticator and store the token in its global cache. In EKS this authentication token expires after 21 minutes.
The kubelet doesn't understand token expiry times, so it will attempt to reach the API using the token in its cache. When the API returns the Unauthorized response, there is a retry mechanism to fetch a new token from aws-iam-authenticator and retry the request.
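The behaviour described in that reply (reuse a cached token, and only fetch a new one after the server rejects it) can be sketched as follows. This is a minimal illustration of the pattern, not the kubelet's actual code: `CachedTokenClient`, `fetch_token`, and `send_request` are hypothetical names, and the in-memory "server" only stands in for the API and aws-iam-authenticator.

```python
class CachedTokenClient:
    """Reuses a cached token and refreshes it only after a 401 response."""

    def __init__(self, fetch_token):
        self._fetch_token = fetch_token  # stand-in for invoking the authenticator
        self._token = None
        self.unauthorized_seen = 0  # how many 401s were hit (what shows up in logs)

    def call(self, send_request):
        if self._token is None:
            self._token = self._fetch_token()
        status = send_request(self._token)
        if status == 401:
            # The client has no notion of expiry, so it only learns the token
            # is stale from the rejection. Fetch a fresh token and retry once.
            self.unauthorized_seen += 1
            self._token = self._fetch_token()
            status = send_request(self._token)
        return status


# Hypothetical stand-ins for the authenticator and the API server.
issued_tokens = []
valid_tokens = set()


def fetch_token():
    token = f"token-{len(issued_tokens) + 1}"
    issued_tokens.append(token)
    valid_tokens.add(token)
    return token


def send_request(token):
    return 200 if token in valid_tokens else 401


client = CachedTokenClient(fetch_token)
first = client.call(send_request)   # uses a freshly fetched token
valid_tokens.clear()                # simulate the cached token expiring
second = client.call(send_request)  # 401 on the stale token, then a retry
print(first, second, client.unauthorized_seen)
```

Under these assumptions every request eventually succeeds, but each expiry produces one Unauthorized response first, which matches the periodic errors in the kubelet log.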
I recently updated my EC2 instances to use IMDSv2 but had to roll back because of the following issue.
After the upgrade my init containers started failing, and I saw the following in the logs:
time="2022-01-11T14:25:01Z" level=info msg="PUT /latest/api/token (403) took 0.753220 ms" req.method=PUT req.path=/latest/api/token req.remote=XXXXX res.duration=0.75322 res.status=403
time="2022-01-11T14:25:37Z" level=error msg="Error getting instance id, got status: 401 Unauthorized"
We are using kube2iam. Any advice on what changes are needed on the kube2iam side to support IMDSv2? Below is some info from my kube2iam daemonset:
EKS = 1.21
image = "jtblin/kube2iam:0.10.9"
Below is the exception I am facing while implementing AGIC in AKS.
The readiness probe is failing for the ingress-azure pod.
Events:
Type Reason Age From Message
Normal Scheduled 5m22s default-scheduler Successfully assigned default/ingress-azure-fc5dcbcd8-bsgt8 to aks-agentpool-22890870-vmss000002
Normal Pulling 5m22s kubelet Pulling image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.4.0"
Normal Pulled 5m22s kubelet Successfully pulled image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.4.0" in 121.018102ms
Normal Created 5m22s kubelet Created container ingress-azure
Normal Started 5m22s kubelet Started container ingress-azure
Warning Unhealthy 21s (x30 over 5m11s) kubelet Readiness probe failed: Get "http://10.240.xx.xxx:8123/health/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
kubectl logs -f mic_xxxx:
failed to update user-assigned identities on node aks-agentpool-2xxxxx-vmss (add [1], del [0], update[0]), error: failed to get identity resource, error: failed to get vmss aks-agentpool-2xxxx-vmss in resource group MC_Axx-xx_axxx-ak8_koreacentral, error: failed to get vmss aks-agentpool-2xxxxx-vmss in resource group MC_Axx-axxx_agw-ak8_koreacentral, error: compute.VirtualMachineScaleSetsClient#Get: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client '4xxxxxx-xxxxxxx-7xxx-xxxxxxx' with object id '4xxxxxx-xxxxxxx-7xxx-xxxxxxx' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/read' over scope '/subscriptions/{subscription_id}/resourceGroups/{MC_rg_name}/providers/Microsoft.Compute/virtualMachineScaleSets/aks-agentpool-2xxxxx-vmss' or the scope is invalid. If access was recently granted, please refresh your credentials."
Steps Implemented:
AKS cluster with RBAC enabled and Azure CNI
2 subnets in the same VNet, in the same resource group (not the RG that starts with MC_)
Granted Contributor and Reader access on the AGW after deploying it.
Applied:
kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/v1.8.8/deploy/infra/deployment-rbac.yaml
Made the corresponding changes in helm-config.yaml and authenticated using identityResourceID.
Please advise on this exception. Thanks.
Node IP      Role      OS
192.x.x.11   Master 1  RHEL8
192.x.x.12   Master 2  RHEL8
192.x.x.13   Master 3  RHEL8
192.x.x.16   VIP
Use-Cases
No. of Masters Ready or Running   Expected                                                  Actual
3 Masters                         Ingress created with VIP IP and ping to VIP should work   VIP is working
2 Masters                         Ingress created with VIP IP and ping to VIP should work   VIP is working
1 Master                          Ingress created with VIP IP and ping to VIP should work   VIP is not working, kubectl is not responding
I have created an RKE2 HA cluster with kube-vip, and the cluster works fine only when at least 2 masters are running. I want to test a use case where only 1 master is available: the VIP should still respond to ping, and any ingress created with the VIP address should work.
In my case, when 2 masters are down I'm facing an issue with the kube-vip-ds pod. When I check the logs using the crictl command I get the error below. Can someone suggest how to resolve this issue?
E0412 12:32:20.733320 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: etcdserver: request timed out
E0412 12:32:20.733715 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: etcdserver: request timed out
E0412 12:32:25.812202 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
E0412 12:32:25.830219 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
E0412 12:33:27.204128 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-cp-lock)
E0412 12:33:27.504957 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-svcs-lock)
E0412 12:34:29.346104 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-cp-lock)
E0412 12:34:29.354454 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-svcs-lock)
Thanks.
Check whether you are running a stacked etcd datastore as part of your k8s cluster.
etcd needs a majority of its members, floor(n/2)+1, to keep quorum, so a cluster of n members tolerates floor((n-1)/2) failures. For 3 nodes that means at least 2 masters must be running and only one failure is tolerated. Since 2 of your masters are down, your cluster is non-operational.
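The majority rule behind etcd quorum can be worked through numerically. The helper names below are illustrative only and are not part of any etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={failure_tolerance(n)}")
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

With 3 masters and 2 of them down, only 1 member remains, which is below the quorum of 2, so reads and writes against etcd time out.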
I am trying to instantiate the "sacc" chaincode (which comes with fabric-samples) in a Hyperledger Fabric network deployed on Kubernetes in AKS. After hours of trying different adjustments, I have not been able to finish the task. I always get this error:
Error: could not assemble transaction, err proposal response was not successful, error code 500, msg timeout expired while starting chaincode sacc:0.1 for transaction
Please note that there is no transaction ID in the error. I've googled some similar cases, but in all of them the error included a transaction ID; mine does not, even though the error is otherwise the same.
The message in the orderer:
2019-07-23 12:40:13.649 UTC [orderer.common.broadcast] Handle -> WARN 047 Error reading from 10.1.0.45:52550: rpc error: code = Canceled desc = context canceled
2019-07-23 12:40:13.649 UTC [comm.grpc.server] 1 -> INFO 048 streaming call completed {"grpc.start_time": "2019-07-23T12:34:13.591Z", "grpc.service": "orderer.AtomicBroadcast", "grpc.method": "Broadcast", "grpc.peer_address": "10.1.0.45:52550", "error": "rpc error: code = Canceled desc = context canceled", "grpc.code": "Canceled", "grpc.call_duration": "6m0.057953469s"}
I am calling for instantiation from inside a cli peer, defining variables CORE_PEER_LOCALMSPID, CORE_PEER_TLS_ROOTCERT_FILE, CORE_PEER_MSPCONFIGPATH, CORE_PEER_ADDRESS and ORDERER_CA with appropriate values before issuing the instantiation call:
peer chaincode instantiate -o -n sacc -v 0.1 -c '{"Args":["init","hi","1"]}' -C mychannelname --tls 'true' --cafile $ORDERER_CA
All the peers have declared a dockersocket volume pointing to /run/docker.sock
All the peers have the variable CORE_VM_ENDPOINT set to unix:///host/var/run/docker.sock
All the orgs were joined to the channel
All the peers have the chaincode installed
I can't see any further message/error in the cli, nor in the orderer, nor in the peers involved in the channel.
Any ideas on what could be going wrong, or how I could continue troubleshooting the problem? Is it possible to see logs from the Docker container that the peers create, and if so, how?
Thanks
I am seeing the following Kubernetes integration tests fail pretty consistently, about 90% of the time, on RHEL 7.2, Fedora 24, and CentOS 7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure. My online searches lead me to believe an apiserver issue may also be involved. My setup is simple: I install and start docker, install go, clone the kubernetes repo from GitHub, run hack/install-etcd.sh from the repo and add etcd to my PATH, get ginkgo, gomega and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files or configs. Has anyone run into these issues and found a solution? The only mention of this issue I have seen online was deemed a flake and has no listed solution, but I run into it on almost every single test run. Pieces of the error are below; I can give more if needed:
Garbage Collector:
*many lines from garbagecollector.go that look good*
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
*repeats over and over, each time with a different resource that failed to list*
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limits on the number of file descriptors and haven't seen this issue since, so I'm going to go ahead and call this solved.
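The post does not say which limit was raised or how, so the following is only a sketch of one way to inspect (and, within the hard limit, raise) the per-process open-file limit on a Unix system, using Python's standard `resource` module. The helper names are hypothetical.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Raise the soft open-file limit toward target, capped at the hard limit.

    Returns the soft limit in effect afterwards. Raising the hard limit
    itself normally needs root or a system-level configuration change.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft, hard = nofile_limits()
print(f"open files: soft={soft} hard={hard}")
```

A change made this way applies only to the current process and its children, so it would have to happen in the shell or process that launches the test run.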