We are rebooting (sudo systemctl reboot) a worker node running Kubernetes v1.22.4.
We observe that the pods on the rebooted node are shown with status Terminated.
kubectl get pod loggingx-7c879bf5c8-nvxls -n footprint
NAME READY STATUS RESTARTS AGE
loggingx-7c879bf5c8-nvxls 3/3 Terminated 10 (44m ago) 29
Question:
All containers are up (3/3). Why is the status still Terminated?
kubectl describe pod loggingx-7c879bf5c8-nvxls -n footprint
Name: loggingx-7c879bf5c8-nvxls
Namespace: footprint
Priority: 0
Node: node-10-63-134-154/10.63.134.154
Start Time: Mon, 08 Aug 2022 07:07:15 +0000
.
.
.
Status: Running
Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
Question: The status reported by kubectl get pod .. and kubectl describe pod is different. Why?
We used the Lens tool and could confirm that the pods are actually running after the reboot!
This behavior applies to both deployments and statefulsets.
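One way to compare the raw fields behind the two views is to query them directly; a minimal sketch using the pod from the output above (kubectl get appears to print .status.reason when it is set, while describe shows .status.phase as Status):
kubectl get pod loggingx-7c879bf5c8-nvxls -n footprint -o jsonpath='{.status.phase} {.status.reason}{"\n"}'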
We ran the same test in a cluster with Kubernetes v1.20.4:
After the reboot completes, the node becomes Ready again and the pods are recreated on the same or a new worker node.
It looks to us as though the Graceful Node Shutdown feature, introduced as beta in v1.21, has a strange impact on the node reboot use case.
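One way to check whether Graceful Node Shutdown is enabled on the worker is to look at the kubelet configuration; a sketch, assuming a default kubeadm-style config path (it may differ on your distribution):
grep -i shutdownGracePeriod /var/lib/kubelet/config.yaml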
Have you had any similar experiences?
BR,
Thomas
I am trying to install the open-source version of OpenShift Origin, i.e. OKD v3.11, without internet access, using Ansible.
Internet access is disabled in the environment during the entire installation. After the installation completes successfully, I observe that two pods in the kube-service-catalog namespace, namely apiserver and controller-manager, aren't running. While investigating the playbooks, I discovered that they generate API server keys.
Does the generation of the API server keys require an active internet connection? Is there any internet dependency for the apiserver and controller-manager pods to reach the Running state?
I tried:-
Enabling the internet and redeploying the pods in the kube-service-catalog namespace.
They came up in the Running state without any restarts, as expected.
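For reference, redeploying only that namespace's pods can be done roughly like this (a sketch; the controllers in the namespace recreate the pods):
oc delete pods --all -n kube-service-catalog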
Expected behaviour:-
The two pods in the kube-service-catalog namespace should be stable and remain in the Running state with the internet disabled.
Actual behaviour:-
The two pods in the kube-service-catalog namespace are in CrashLoopBackOff state.
Version:-
OKD- 3.11, ansible- 2.9
Logs of apiserver pod:-
I0512 04:53:30.258151 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true NamespacedServiceBroker:true]
I0512 04:53:30.258177 1 hyperkube.go:192] Service Catalog version v3.11.0-0.1.35+8d4f895-2;Upstream:v0.1.35 (built 2019-01-08T23:12:26Z)
W0512 04:53:31.020172 1 util.go:112] OpenAPI spec will not be served
I0512 04:53:31.021577 1 util.go:182] Admission control plugin names: [NamespaceLifecycle MutatingAdmissionWebhook ValidatingAdmissionWebhook ServicePlanChangeValidator BrokerAuthSarCheck DefaultServicePlan ServiceBindingsLifecycle]
I0512 04:53:31.021949 1 plugins.go:158] Loaded 6 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook,ServicePlanChangeValidator,BrokerAuthSarCheck,DefaultServicePlan,ServiceBindingsLifecycle.
I0512 04:53:31.021971 1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0512 04:53:31.023932 1 storage_factory.go:285] storing {servicecatalog.k8s.io clusterservicebrokers} in servicecatalog.k8s.io/v1beta1, reading as servicecatalog.k8s.io/__internal from storagebackend.Config{Type:"", Prefix:"/registry", ServerList:[]string{"https://cic-90-master.novalocal:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
I0512 04:53:31.023978 1 storage_factory.go:285] storing {servicecatalog.k8s.io clusterserviceclasses} in servicecatalog.k8s.io/v1beta1, reading as servicecatalog.k8s.io/__internal from storagebackend.Config{Type:"", Prefix:"/registry", ServerList:[]string{"https://cic-90-master.novalocal:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
I0512 04:53:31.023998 1 storage_factory.go:285] storing {servicecatalog.k8s.io clusterserviceplans} in servicecatalog.k8s.io/v1beta1, reading as servicecatalog.k8s.io/__internal from storagebackend.Config{Type:"", Prefix:"/registry", ServerList:[]string{"https://cic-90-master.novalocal:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
I0512 04:53:31.024031 1 storage_factory.go:285] storing {servicecatalog.k8s.io serviceinstances} in servicecatalog.k8s.io/v1beta1, reading as servicecatalog.k8s.io/__internal from storagebackend.Config{Type:"", Prefix:"/registry", ServerList:[]string{"https://cic-90-master.novalocal:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
I0512 04:53:31.024055 1 storage_factory.go:285] storing {servicecatalog.k8s.io servicebindings} in servicecatalog.k8s.io/v1beta1, reading as servicecatalog.k8s.io/__internal from storagebackend.Config{Type:"", Prefix:"/registry", ServerList:[]string{"https://cic-90-master.novalocal:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0512 04:53:51.025999 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://cic-90-master.novalocal:2379] /etc/origin/master/master.etcd-client.key /etc/origin/master/master.etcd-client.crt /etc/origin/master/master.etcd-ca.crt true true 0 {0xc420345080 0xc420345100} <nil> 5m0s 1m0s}), err (context deadline exceeded)
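The final fatal line points at etcd connectivity ("context deadline exceeded") rather than internet access; a quick reachability check from the master could look like this (a sketch reusing the endpoint and certificate paths shown in the log above):
curl --cacert /etc/origin/master/master.etcd-ca.crt \
     --cert /etc/origin/master/master.etcd-client.crt \
     --key /etc/origin/master/master.etcd-client.key \
     https://cic-90-master.novalocal:2379/health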
Logs of controller-manager pods:-
I0512 05:05:01.273888 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true]
I0512 05:05:01.274109 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true AsyncBindingOperations:true]
I0512 05:05:01.274128 1 feature_gate.go:194] feature gates: map[NamespacedServiceBroker:true OriginatingIdentity:true AsyncBindingOperations:true]
I0512 05:05:01.274155 1 hyperkube.go:192] Service Catalog version v3.11.0-0.1.35+8d4f895-2;Upstream:v0.1.35 (built 2019-01-08T23:12:26Z)
I0512 05:05:01.276689 1 leaderelection.go:185] attempting to acquire leader lease kube-service-catalog/service-catalog-controller-manager...
I0512 05:05:01.303464 1 leaderelection.go:194] successfully acquired lease kube-service-catalog/service-catalog-controller-manager
I0512 05:05:01.303609 1 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-service-catalog", Name:"service-catalog-controller-manager", UID:"724069a9-9362-11ea-b5c1-fa163e86d97a", APIVersion:"v1", ResourceVersion:"126373", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' controller-manager-jvx4f-external-service-catalog-controller became leader
F0512 05:05:01.332950 1 controller_manager.go:237] error running controllers: failed to get api versions from server: failed to get supported resources from server: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
Output of kubectl get events:-
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
2h 2h 1 service-catalog-controller-manager.160e29595b5f2ac8 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
1h 1h 1 service-catalog-controller-manager.160e29a1c8d44d5f ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
1h 1h 1 service-catalog-controller-manager.160e29e88bcdabf4 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
1h 1h 1 service-catalog-controller-manager.160e2a2ea2d553cf ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
1h 1h 1 service-catalog-controller-manager.160e2abce844b1a6 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
1h 1h 1 service-catalog-controller-manager.160e2bd884a3fd98 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
1h 17h 183 apiserver-28mjt.160df6e8ab679328 Pod spec.containers{apiserver} Normal Pulled kubelet, cic-90-master.novalocal Container image "docker.io/openshift/origin-service-catalog:v3.11.0" already present on machine
1h 1h 1 service-catalog-controller-manager.160e2c1f807c24b0 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
59m 59m 1 service-catalog-controller-manager.160e2cac5f27eb61 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
48m 48m 1 service-catalog-controller-manager.160e2d3d315161ed ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
43m 43m 1 service-catalog-controller-manager.160e2d84348e29c6 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
38m 38m 1 service-catalog-controller-manager.160e2dcbb5d88e66 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
33m 33m 1 service-catalog-controller-manager.160e2e13307a6011 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
23m 23m 1 service-catalog-controller-manager.160e2ea16c9db85d ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
8m 8m 1 service-catalog-controller-manager.160e2f75c0f6468a ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
4m 17h 4491 apiserver-28mjt.160df6f2fa5c8d45 Pod spec.containers{apiserver} Warning BackOff kubelet, cic-90-master.novalocal Back-off restarting failed container
2m 2m 1 service-catalog-controller-manager.160e2fbf5d9a2418 ConfigMap Normal LeaderElection service-catalog-controller-manager controller-manager-jvx4f-external-service-catalog-controller became leader
2m 20h 5739 controller-manager-jvx4f.160dec6599cd8b00 Pod spec.containers{controller-manager} Warning BackOff kubelet, cic-90-master.novalocal Back-off restarting failed container
Here are the logs from the autoscaler:
I0922 17:08:33.857348 1 auto_scaling_groups.go:102] Updating ASG terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
I0922 17:08:33.857380 1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-09-22 17:08:43.857375311 +0000 UTC m=+259.289807511
I0922 17:08:33.857465 1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0922 17:08:33.857482 1 static_autoscaler.go:261] Filtering out schedulables
I0922 17:08:33.857532 1 static_autoscaler.go:271] No schedulable pods
I0922 17:08:33.857545 1 static_autoscaler.go:279] No unschedulable pods
I0922 17:08:33.857557 1 static_autoscaler.go:333] Calculating unneeded nodes
I0922 17:08:33.857601 1 scale_down.go:376] Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s
I0922 17:08:33.857621 1 scale_down.go:407] Node ip-10-0-1-135.us-west-2.compute.internal - utilization 0.055000
I0922 17:08:33.857688 1 static_autoscaler.go:349] ip-10-0-1-135.us-west-2.compute.internal is unneeded since 2019-09-22 17:05:07.299351571 +0000 UTC m=+42.731783882 duration 3m26.405144434s
I0922 17:08:33.857703 1 static_autoscaler.go:360] Scale down status: unneededOnly=true lastScaleUpTime=2019-09-22 17:04:42.29864432 +0000 UTC m=+17.731076395 lastScaleDownDeleteTime=2019-09-22 17:04:42.298645611 +0000 UTC m=+17.731077680 lastScaleDownFailTime=2019-09-22 17:04:42.298646962 +0000 UTC m=+17.731079033 scaleDownForbidden=false isDeleteInProgress=false
If it's unneeded, then what is the next step? What is it waiting for?
I've drained one node:
kubectl get nodes -o=wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-0-118.us-west-2.compute.internal Ready <none> 46m v1.13.10-eks-d6460e 10.0.0.118 52.40.115.132 Amazon Linux 2 4.14.138-114.102.amzn2.x86_64 docker://18.6.1
ip-10-0-0-211.us-west-2.compute.internal Ready <none> 44m v1.13.10-eks-d6460e 10.0.0.211 35.166.57.203 Amazon Linux 2 4.14.138-114.102.amzn2.x86_64 docker://18.6.1
ip-10-0-1-135.us-west-2.compute.internal Ready,SchedulingDisabled <none> 46m v1.13.10-eks-d6460e 10.0.1.135 18.237.253.134 Amazon Linux 2 4.14.138-114.102.amzn2.x86_64 docker://18.6.1
Why is it not terminating the instance?
These are the parameters I'm using:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=default
- --scan-interval=25s
- --scale-down-unneeded-time=30s
- --nodes=1:20:terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-job-runner
- --logtostderr=true
- --stderrthreshold=info
- --v=4
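The autoscaler also writes a status ConfigMap named cluster-autoscaler-status into the namespace given by --namespace; inspecting it can show why a node is still considered unremovable (a sketch based on the --namespace=default flag above):
kubectl get configmap cluster-autoscaler-status -n default -o yaml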
Have you got any of the following?
Pods running on that node without a controller object (i.e. a Deployment / ReplicaSet)?
Any kube-system pods that don't have a pod disruption budget
Pods with local storage or any custom affinity/anti-affinity/nodeSelectors
An annotation set on that node that prevents cluster-autoscaler from scaling it down
Your config/start-up options for CA look good to me though.
I can only imagine it might be something to do with a specific pod running on that node. Maybe run through the kube-system pods running on the nodes that are not scaling down and check them against the above list.
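For example, commands along these lines can help check those items (a sketch; the node name is taken from the question):
kubectl get node ip-10-0-1-135.us-west-2.compute.internal -o yaml | grep scale-down-disabled
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-1-135.us-west-2.compute.internal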
These two page sections have some good items to check that might be causing CA not to scale down nodes.
low utilization nodes but not scaling down, why?
what types of pods can prevent CA from removing a node?
Here's what I did to solve this issue:
Tail logs for cluster-autoscaler (I used kubetail since cluster-autoscaler had multiple replicas)
From the AWS console, I found the autoscaling group related to my cluster
Reduced the number of desired nodes of the autoscaling group from the AWS console
Waited until the cluster-autoscaler scaled the cluster down
Waited again until the cluster-autoscaler scaled the cluster up
Found the reason for scaling up in the logs and handled it accordingly
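Steps 1 and 3 above can also be done from the command line; a rough sketch (the ASG name is the one from the question, the desired capacity of 2 is only an example, and kubetail matches pods by name prefix, so add -n <namespace> if the autoscaler does not run in your current namespace):
kubetail cluster-autoscaler
aws autoscaling set-desired-capacity --auto-scaling-group-name terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008 --desired-capacity 2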
Being new to K8s, I am trying to clean up a whole namespace after running some tests on a Windows 10 machine. In short, I thought it would be as easy as running kubectl.exe delete deployment, but the deployments are recreated after a second and I don't know how to get rid of them. See the following for the details of what I did:
1. kubectl get deployments,rs (to see what we already have)
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.extensions/postgresql 1 1 1 1 18m
deployment.extensions/redis 1 1 1 1 16m
NAME DESIRED CURRENT READY AGE
replicaset.extensions/postgresql-c8cb9fff6 1 1 1 18m
replicaset.extensions/redis-5678477b7c 1 1 1 16m
2. kubectl scale deployment redis --replicas=0 (Scale down the deployment)
deployment.extensions "redis" scaled
3. kubectl get deployments,rs (Check again how it looks)
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.extensions/postgresql 1 1 1 1 21m
deployment.extensions/redis 0 0 0 0 19m
NAME DESIRED CURRENT READY AGE
replicaset.extensions/postgresql-c8cb9fff6 1 1 1 21m
replicaset.extensions/redis-5678477b7c 0 0 0 19m
4. kubectl delete deployment.extensions/redis (Delete the deployment)
deployment.extensions "redis" deleted
5. kubectl get deployments,rs (Check again and see that it is back!)
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.extensions/postgresql 1 1 1 1 23m
deployment.extensions/redis 1 1 1 1 27s
NAME DESIRED CURRENT READY AGE
replicaset.extensions/postgresql-c8cb9fff6 1 1 1 23m
replicaset.extensions/redis-5678477b7c 1 1 1 27s
6. kubectl.exe get events (Looking into the events):
Among other things I can see "Scaled down replica set redis-5678477b7c to 0" and then "Scaled up replica set redis-5678477b7c to 1", which suggests the deployment was never actually deleted but was immediately scaled up again after the delete command was executed.
I'm not sure what I'm missing, but I have already checked a couple of other posts, such as "Kubernetes pod gets recreated when deleted" and "How to delete all resources from Kubernetes one time?", and neither one worked for me.
I forgot to say that the K8s cluster is managed by Docker Desktop.
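One way to check whether something else owns the Deployment and keeps recreating it is to look at its ownerReferences (a sketch; redis is the deployment from the output above, and an empty result means nothing owns it):
kubectl get deployment redis -o jsonpath='{.metadata.ownerReferences}{"\n"}'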
Use kubectl delete deployment <deployment-name>.
If you need to clean up the whole namespace, use kubectl delete namespace <namespace-name>.
Then re-create the namespace with the kubectl create ns command, if you need the same namespace.
You can also clean up the namespace by using the --all option with each object type:
e.g.
kubectl delete deployment --all
kubectl delete statefulset --all
kubectl delete pvc --all
kubectl delete secrets --all
kubectl delete service --all
and so on.
As pointed out by @David Maze, you're deleting the ReplicaSet instead of the Deployment that's managing it.
From the documentation:
You can define Deployments to create new ReplicaSets
The Deployment will automatically create and manage a ReplicaSet to control the pods. You need to delete the Deployment to erase its managed resources.
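To confirm which Deployment manages a given ReplicaSet, you can look at the ReplicaSet's ownerReferences (a sketch using the ReplicaSet name from the question):
kubectl get rs redis-5678477b7c -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'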
I recently upgraded my GKE cluster from 1.10.x to 1.11.x and since then my calico-node pods fail to connect to the etcd cluster and end up in a CrashLoopBackOff due to livenessProbe error.
I saw that the calico-etcd DaemonSet has a desired count of 0 and was wondering about that. Its nodeSelector is node-role.kubernetes.io/master=.
From the logs of such calico-nodes:
2018-12-19 19:18:28.989 [INFO][7] etcd.go 373: Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
2018-12-19 19:18:28.989 [INFO][7] startup.go 254: Unable to query node configuration Name="gke-brokerme-ubuntu-pool-852d0318-j5ft" error=client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
State of the DaemonSets:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-etcd 0 0 0 0 0 node-role.kubernetes.io/master= 3d
calico-node 2 2 0 2 0 <none> 3d
k get nodes --show-labels:
NAME STATUS ROLES AGE VERSION LABELS
gke-brokerme-ubuntu-pool-852d0318-7v4m Ready <none> 4d v1.11.5-gke.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=ubuntu-pool,cloud.google.com/gke-os-distribution=ubuntu,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/hostname=gke-brokerme-ubuntu-pool-852d0318-7v4m,os=ubuntu
gke-brokerme-ubuntu-pool-852d0318-j5ft Ready <none> 1h v1.11.5-gke.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=ubuntu-pool,cloud.google.com/gke-os-distribution=ubuntu,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/hostname=gke-brokerme-ubuntu-pool-852d0318-j5ft,os=ubuntu
I did not modify any Calico manifests; they should be exactly as provisioned by GKE.
I would expect the calico-node pods to connect either to the etcd of my Kubernetes cluster or to a calico-etcd provisioned by the DaemonSet. As there is no master node that I can control in GKE, I kind of get why calico-etcd has a desired count of 0, but then which etcd are the calico-node pods supposed to connect to? What's wrong with my small and basic setup?
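Assuming the standard Calico manifests, the etcd endpoint the calico-node pods use comes from the calico-config ConfigMap via the ETCD_ENDPOINTS environment variable; a sketch for checking both:
kubectl -n kube-system get configmap calico-config -o yaml | grep etcd_endpoints
kubectl -n kube-system get daemonset calico-node -o yaml | grep -A 2 ETCD_ENDPOINTS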
We are aware of the issue of Calico crash-looping in GKE 1.11.x. You can fix this issue by upgrading to a newer version; I would recommend upgrading to '1.11.4-gke.12' or '1.11.3-gke.23', which do not have this issue.
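A sketch of the corresponding upgrade commands (the cluster name is a placeholder; the zone and node pool come from the node labels above):
gcloud container clusters upgrade my-cluster --zone europe-west1-b --master --cluster-version 1.11.4-gke.12
gcloud container clusters upgrade my-cluster --zone europe-west1-b --node-pool ubuntu-pool --cluster-version 1.11.4-gke.12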