Sudden pod restart of Kubernetes deployment, reason?

I've got microservices deployed on GKE with Helm v3; all apps/Helm releases ran nicely for months, but yesterday, for some reason, the pods were re-created:
kubectl get pods -l app=myapp
NAME READY STATUS RESTARTS AGE
myapp-75cb966746-grjkj 1/1 Running 1 14h
myapp-75cb966746-gz7g7 1/1 Running 0 14h
myapp-75cb966746-nmzzx 1/1 Running 1 14h
The helm3 history myapp output shows the release was last updated 2 days ago (40+ hours), not yesterday, so I exclude the possibility that someone simply ran helm3 upgrade ...; it seems more like someone ran kubectl rollout restart deployment/myapp. Any thoughts on how I can check why the pods were restarted? I'm not sure how to verify it. PS: the logs from kubectl logs deployment/myapp only go back to 3 hours ago.
Just for reference, I'm not asking about kubectl logs -p myapp-75cb966746-grjkj; there is no problem with that. I want to know what happened to the 3 pods that were there 14 hours ago and were simply deleted/replaced, and how to check that.
There are also no events on the cluster:
MacBook-Pro% kubectl get events
No resources found in myns namespace.
As for describing the deployment, all it shows is that the deployment was created a few months ago:
CreationTimestamp: Thu, 22 Oct 2020 09:19:39 +0200
and that the last update was >40 hours ago:
lastUpdate: 2021-04-07 07:10:09.715630534 +0200 CEST m=+1.867748121
Here is the full describe output if anyone wants it:
MacBook-Pro% kubectl describe deployment myapp
Name: myapp
Namespace: myns
CreationTimestamp: Thu, 22 Oct 2020 09:19:39 +0200
Labels: app=myapp
Annotations: deployment.kubernetes.io/revision: 42
lastUpdate: 2021-04-07 07:10:09.715630534 +0200 CEST m=+1.867748121
meta.helm.sh/release-name: myapp
meta.helm.sh/release-namespace: myns
Selector: app=myapp,env=myns
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 5
RollingUpdateStrategy: 25% max unavailable, 1 max surge
Pod Template:
Labels: app=myapp
Annotations: kubectl.kubernetes.io/restartedAt: 2020-10-23T11:21:11+02:00
Containers:
myapp:
Image: xxx
Port: 8080/TCP
Host Port: 0/TCP
Limits:
cpu: 1
memory: 1G
Requests:
cpu: 1
memory: 1G
Liveness: http-get http://:myappport/status delay=45s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:myappport/status delay=45s timeout=5s period=10s #success=1 #failure=3
Environment Variables from:
myapp-myns Secret Optional: false
Environment:
myenv: myval
Mounts:
/some/path from myvol (ro)
Volumes:
myvol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: myvol
Optional: false
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: myapp-75cb966746 (3/3 replicas created)
Events: <none>

First things first, I would check the nodes on which the Pods were running.
If a Pod is restarted (which means the RESTARTS count is incremented), it usually means the Pod had an error and that error caused the Pod to crash.
In your case though, the Pods were completely re-created; this means (as you said) that someone could have used a rollout restart, or that the deployment was scaled down and then up again (both manual operations).
The most common case for Pods being re-created automatically is that the node(s) the Pods were running on had a problem. If a node becomes NotReady, even for a short amount of time, the Kubernetes scheduler will schedule new Pods on other nodes in order to match the desired state (number of replicas and so on).
Old Pods on a NotReady node go into Terminating state and are forcibly terminated as soon as the node becomes Ready again (if they are still up and running).
This is described in detail in the documentation (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifetime):
If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period. Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.
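To check whether a node went NotReady around the time the Pods were replaced, you can inspect the node conditions and node-scoped events. These are standard kubectl commands (note that events expire quickly, so they may already be gone, as in your case):
kubectl get nodes -o wide
kubectl describe node <node-name>   # look at the Conditions section and the Events at the bottom
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node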

I suggest you run kubectl describe deployment <deployment-name> and kubectl describe pod <pod-name>.
In addition, kubectl get events will show cluster-level events and may help you understand what happened. Note that events are only retained for a limited time (roughly an hour by default), which is likely why you no longer see any.

You can use
kubectl describe pod your_pod_name
where, under Containers.<your_container_name>.Last State, you get the time and reason why the container was last terminated (for example, due to an application error or due to being OOMKilled)
doc reference:
kubectl explain pod.status.containerStatuses.lastState
KIND: Pod
VERSION: v1
RESOURCE: lastState <Object>
DESCRIPTION:
Details about the container's last termination condition.
ContainerState holds a possible state of container. Only one of its members
may be specified. If none of them is specified, the default one is
ContainerStateWaiting.
FIELDS:
running <Object>
Details about a running container
terminated <Object>
Details about a terminated container
waiting <Object>
Details about a waiting container
An example from one of my containers, which terminated due to an error in the application:
Containers:
my_container:
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 06 Apr 2021 16:28:57 +0300
Finished: Tue, 06 Apr 2021 16:32:07 +0300
To get the previous logs of your container (the restarted one), you can use the --previous flag, like this:
kubectl logs your_pod_name --previous
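If you only need the termination reason and timestamps, a JSONPath query can pull the lastState fields directly; for example, using one of the pod names from the question (a sketch with standard kubectl):
kubectl get pod myapp-75cb966746-grjkj -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'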

Related

Metrics server is currently unable to handle the request

I am new to Kubernetes and was trying to apply horizontal pod autoscaling to my existing application. After following other Stack Overflow answers, I learned that I need to install metrics-server, and I was able to, but somehow it's not working and is unable to handle requests.
I then followed a few more suggestions but was unable to resolve the issue; I would really appreciate any help here.
Please let me know if you need any further details. Thanks in advance.
Steps followed:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
kubectl get deploy,svc -n kube-system | egrep metrics-server
deployment.apps/metrics-server 1/1 1 1 2m6s
service/metrics-server ClusterIP 10.32.0.32 <none> 443/TCP 2m6s
kubectl get pods -n kube-system | grep metrics-server
metrics-server-64cf6869bd-6gx88 1/1 Running 0 2m39s
vi ana_hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: ana-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: common-services-auth
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 160
k apply -f ana_hpa.yaml
horizontalpodautoscaler.autoscaling/ana-hpa created
k get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
ana-hpa StatefulSet/common-services-auth <unknown>/160%, <unknown>/80% 1 10 0 4s
k describe hpa ana-hpa
Name: ana-hpa
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Tue, 12 Apr 2022 17:01:25 +0530
Reference: StatefulSet/common-services-auth
Metrics: ( current / target )
resource memory on pods (as a percentage of request): <unknown> / 160%
resource cpu on pods (as a percentage of request): <unknown> / 80%
Min replicas: 1
Max replicas: 10
StatefulSet pods: 3 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 38s (x8 over 2m23s) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedComputeMetricsReplicas 38s (x8 over 2m23s) horizontal-pod-autoscaler invalid metrics (2 invalid out of 2), first error is: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedGetResourceMetric 23s (x9 over 2m23s) horizontal-pod-autoscaler failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl edit deployments.apps -n kube-system metrics-server
Add hostNetwork: true
deployment.apps/metrics-server edited
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
k describe pod metrics-server-5dc6dbdb8-42hw9 -n kube-system
Name: metrics-server-5dc6dbdb8-42hw9
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: pusntyn196.apac.avaya.com/10.133.85.196
Start Time: Tue, 12 Apr 2022 17:08:25 +0530
Labels: k8s-app=metrics-server
pod-template-hash=5dc6dbdb8
Annotations: <none>
Status: Running
IP: 10.133.85.196
IPs:
IP: 10.133.85.196
Controlled By: ReplicaSet/metrics-server-5dc6dbdb8
Containers:
metrics-server:
Container ID: containerd://024afb1998dce4c0bd5f4e58f996068ea37982bd501b54fda2ef8d5c1098b4f4
Image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
Image ID: k8s.gcr.io/metrics-server/metrics-server#sha256:5ddc6458eb95f5c70bd13fdab90cbd7d6ad1066e5b528ad1dcb28b76c5fb2f00
Port: 4443/TCP
Host Port: 4443/TCP
Args:
--cert-dir=/tmp
--secure-port=4443
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
--kubelet-use-node-status-port
--metric-resolution=15s
State: Running
Started: Tue, 12 Apr 2022 17:08:26 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/tmp from tmp-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g6p4g (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmp-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-g6p4g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 2s
node.kubernetes.io/unreachable:NoExecute op=Exists for 2s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m31s default-scheduler Successfully assigned kube-system/metrics-server-5dc6dbdb8-42hw9 to pusntyn196.apac.avaya.com
Normal Pulled 2m32s kubelet Container image "k8s.gcr.io/metrics-server/metrics-server:v0.6.1" already present on machine
Normal Created 2m31s kubelet Created container metrics-server
Normal Started 2m31s kubelet Started container metrics-server
kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
kubectl logs -f metrics-server-5dc6dbdb8-42hw9 -n kube-system
E0412 11:43:54.684784 1 configmap_cafile_content.go:242] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:44:27.001010 1 configmap_cafile_content.go:242] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
k logs -f metrics-server-5dc6dbdb8-42hw9 -n kube-system
I0412 11:38:26.447305 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0412 11:38:26.899459 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0412 11:38:26.899477 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0412 11:38:26.899518 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0412 11:38:26.899545 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0412 11:38:26.899546 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0412 11:38:26.899567 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:38:26.900480 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0412 11:38:26.900811 1 secure_serving.go:266] Serving securely on [::]:4443
I0412 11:38:26.900854 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0412 11:38:26.900965 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0412 11:38:26.999960 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:38:26.999989 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0412 11:38:26.999970 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E0412 11:38:27.000087 1 configmap_cafile_content.go:242] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:38:27.000118 1 configmap_cafile_content.go:242] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Edit metrics server deployment yaml
Add - --kubelet-insecure-tls
k apply -f metric-server-deployment.yaml
serviceaccount/metrics-server unchanged
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader unchanged
clusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged
service/metrics-server unchanged
deployment.apps/metrics-server configured
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
I also tried adding the following to the metrics-server deployment:
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
This can easily be resolved by editing the deployment YAML and adding hostNetwork: true after dnsPolicy: ClusterFirst.
kubectl edit deployments.apps -n kube-system metrics-server
insert:
hostNetwork: true
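If you prefer a non-interactive approach, a strategic-merge patch along these lines should achieve the same result (a sketch; it assumes the deployment is named metrics-server in the kube-system namespace, as in the output above):
kubectl -n kube-system patch deployment metrics-server -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'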
I hope this helps somebody with a bare-metal cluster:
$ helm --repo https://kubernetes-sigs.github.io/metrics-server/ --kubeconfig=$HOME/.kube/loc-cluster.config -n kube-system --set args='{--kubelet-insecure-tls}' upgrade --install metrics-server metrics-server
$ helm --kubeconfig=$HOME/.kube/loc-cluster.config -n kube-system uninstall metrics-server
Update: I deployed the metrics-server using the same command. Perhaps you can start fresh by removing existing resources and running:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
=======================================================================
It appears the --kubelet-insecure-tls flag was not configured correctly for the pod template in the deployment. The following should fix this:
Edit the existing deployment in the cluster with kubectl edit deployment/metrics-server -nkube-system.
Add the flag to the spec.containers[].args list, so that the deployment looks like this:
...
spec:
  containers:
  - args:
    - --cert-dir=/tmp
    - --secure-port=4443
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
    - --kubelet-insecure-tls   # <======= ADD IT HERE
    image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
...
Simply save your changes and let the deployment roll out the updated pods. You can use watch -n1 kubectl get deployment/metrics-server -n kube-system and wait for the UP-TO-DATE column to show 1.
Like this:
NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 1/1 1 1 16m
Verify with kubectl top nodes. It will show something like
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
docker-desktop 222m 5% 1600Mi 41%
I've just verified this to work on a local setup. Let me know if this helps :)
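As an alternative to editing the deployment interactively, a JSON patch can append the flag to the container args (a sketch; it assumes metrics-server is the first container in the pod template):
kubectl -n kube-system patch deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
The deployment will then roll out new pods on its own, just as with kubectl edit.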
Please configure the aggregation layer correctly and carefully; you can use this link for help: https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: <name of the registration object>
spec:
  group: <API group name this extension apiserver hosts>
  version: <API version this extension apiserver hosts>
  groupPriorityMinimum: <priority this APIService for this group, see API documentation>
  versionPriority: <prioritizes ordering of this version within a group, see API documentation>
  service:
    namespace: <namespace of the extension apiserver service>
    name: <name of the extension apiserver service>
  caBundle: <pem encoded ca cert that signs the server cert used by the webhook>
It would also help to provide the output of kubectl version.
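To see how the metrics APIService is registered with the aggregation layer, and why it is reported as unavailable, you can inspect it directly (standard kubectl; the APIService name is taken from the output earlier in this question):
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
The status.conditions block at the bottom usually contains a reason and message (for example FailedDiscoveryCheck or MissingEndpoints) that narrows down the problem.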
For me, on EKS with helmfile, I had to set the following in the values.yaml of the metrics-server chart:
containerPort: 10250
The value defaulted to 4443 for an unknown reason when I first deployed the chart.
See doc:
https://github.com/kubernetes-sigs/metrics-server/blob/master/charts/metrics-server/values.yaml#L62
https://aws.amazon.com/premiumsupport/knowledge-center/eks-metrics-server/#:~:text=confirm%20that%20your%20security%20groups
Then kubectl top nodes and kubectl describe apiservice v1beta1.metrics.k8s.io were working.
First of all, execute the following command:
kubectl get apiservices
And check the availability (status) of the kube-system/metrics-server API service.
In case the availability is True:
Add hostNetwork: true to the spec of your metrics-server deployment by executing the following command:
kubectl edit deployment -n kube-system metrics-server
It should look like the following:
...
spec:
  hostNetwork: true
...
Setting hostNetwork to true means that the Pod will use the network namespace of the host it is running on.
In case the availability is False (MissingEndpoints):
Download metrics-server:
wget https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.5.0/components.yaml
Remove (legacy) metrics server:
kubectl delete -f components.yaml
Edit the downloaded file and add --kubelet-insecure-tls to the args list:
...
  labels:
    k8s-app: metrics-server
spec:
  containers:
  - args:
    - --cert-dir=/tmp
    - --secure-port=443
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
    - --kubelet-insecure-tls # add this line
...
Create the metrics-server resources once again:
kubectl apply -f components.yaml
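Once the new metrics-server pod is running, you can verify that the metrics API is being served again with the same commands used earlier in this thread:
kubectl get apiservices | grep metrics
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl top nodes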

Why does my Kubernetes Cronjob pod get killed while executing?

Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
I have a Kubernetes CronJob that serves the purpose of running some Azure CLI commands on a time-based schedule.
Running the container locally works fine; however, manually triggering the CronJob through Lens, or letting it run per the schedule, results in weird behaviour (running it in the cloud as a Job yields unexpected results).
Here is the cronjob definition:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
I ran the cronjob manually and it created job development-scale-down-manual-xwp1k. Describing this job after it completed, we can see the following:
$ kubectl describe job development-scale-down-manual-xwp1k
Name: development-scale-down-manual-xwp1k
Namespace: development
Selector: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Annotations: <none>
Parallelism: 1
Completions: 1
Start Time: Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds: 360s
Pods Statuses: 0 Running / 0 Succeeded / 1 Failed
Pod Template:
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Containers:
scaler:
Image: myimage:latest
Port: <none>
Host Port: <none>
Environment:
CLUSTER_NAME: ...
NODEPOOL_NAME: ...
NODEPOOL_SIZE: ...
RESOURCE_GROUP: ...
SP_APP_ID: <set to the key 'application_id' in secret 'scaler-secrets'> Optional: false
SP_PASSWORD: <set to the key 'application_pass' in secret 'scaler-secrets'> Optional: false
SP_TENANT: <set to the key 'application_tenant' in secret 'scaler-secrets'> Optional: false
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 24m job-controller Created pod: development-scale-down-manual-xwp1k-b858c
Normal SuccessfulCreate 23m job-controller Created pod: development-scale-down-manual-xwp1k-xkkw9
Warning BackoffLimitExceeded 23m job-controller Job has reached the specified backoff limit
This differs from other issues I have read, where it does not mention a "SuccessfulDelete" event.
The events received from kubectl get events tell an interesting story:
$ ktl get events | grep xwp1k
3m19s Normal Scheduled pod/development-scale-down-manual-xwp1k-b858c Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s Normal Pulling pod/development-scale-down-manual-xwp1k-b858c Pulling image "myimage:latest"
2m38s Normal Pulled pod/development-scale-down-manual-xwp1k-b858c Successfully pulled image "myimage:latest" in 40.365655229s
2m23s Normal Created pod/development-scale-down-manual-xwp1k-b858c Created container myimage
2m23s Normal Started pod/development-scale-down-manual-xwp1k-b858c Started container myimage
2m12s Normal Killing pod/development-scale-down-manual-xwp1k-b858c Stopping container myimage
2m12s Normal Scheduled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s Normal Pulling pod/development-scale-down-manual-xwp1k-xkkw9 Pulling image "myimage:latest"
2m11s Normal Pulled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully pulled image "myimage:latest" in 751.93652ms
2m10s Normal Created pod/development-scale-down-manual-xwp1k-xkkw9 Created container myimage
2m10s Normal Started pod/development-scale-down-manual-xwp1k-xkkw9 Started container myimage
3m19s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-b858c
2m12s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s Warning BackoffLimitExceeded job/development-scale-down-manual-xwp1k Job has reached the specified backoff limit
I can't figure out why the container was killed; the logs all seem fine and there are no resource constraints. The container is removed very quickly, meaning I have very little time to debug. The more verbose event line reads as follows:
3m54s Normal Killing pod/development-scale-down-manual-xwp1k-b858c spec.containers{myimage} kubelet, aks-burst-37275452-vmss00000d Stopping container myimage 3m54s 1 development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
I note that the image pull initially takes a good few seconds (40); might this contribute to exceeding the startingDeadlineSeconds or another CronJob setting?
Any thoughts or help appreciated, thank you
Reading logs! Always helpful.
Context
For context, the job itself scales an AKS nodepool. We have two: the default system pool, and a new user-controlled one. The cronjob is meant to scale the new user pool (not the system pool).
Investigating
I noticed that the scale-down job always takes longer than the scale-up job; this is because the image pull always happens when the scale-down job runs.
I also noticed that the Killing event mentioned above originates from the kubelet. (kubectl get events -o wide)
I went to check the kubelet logs on the host and realised that the host name was a little atypical (aks-burst-XXXXXXXX-vmss00000d), in the sense that most hosts in our small development cluster usually have numbers on the end, not a d.
That's when I realised the naming was different because this node was not part of the default nodepool, and I could not check the kubelet logs because the host had already been removed.
Cause
The job scales down compute resources. The scale-down would fail because it was always preceded by a scale-up, at which point a new node was in the cluster. This node had nothing running on it, so the next Job was scheduled on it. The Job started on the new node, told Azure to scale the new node down to 0, and subsequently the kubelet killed the Job while it was running.
Always being scheduled on the new node explains why the image pull happened each time as well.
Fix
I changed the spec and added a nodeSelector so that the Job always runs on the system pool, which is more stable than the user pool:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
          nodeSelector:
            agentpool: default

How to check when "kubectl delete" failed with "timeout waiting for ... to be synced"

I have a Kubernetes v1.10.2 cluster and a cronjob on it.
The job config is set to:
failedJobsHistoryLimit: 1
successfulJobsHistoryLimit: 3
But it has created more than ten jobs, which are all successful and not removed automatically.
Now I am trying to delete them manually with kubectl delete job XXX, but the command times out:
$ kubectl delete job XXX
error: timed out waiting for "XXX" to be synced
I want to know how I can debug such a situation. Is there a log file for the command execution?
I only know about the kubectl logs command, but that is not for this kind of situation.
"kubectl get" shows the job has already finished:
status:
  active: 1
  completionTime: 2018-08-27T21:20:21Z
  conditions:
  - lastProbeTime: 2018-08-27T21:20:21Z
    lastTransitionTime: 2018-08-27T21:20:21Z
    status: "True"
    type: Complete
  failed: 3
  startTime: 2018-08-27T01:00:00Z
  succeeded: 1
and "kubectl describe" output as:
$ kubectl describe job test-elk-xxx-1535331600 -ntest
Name: test-elk-xxx-1535331600
Namespace: test
Selector: controller-uid=863a14e3-a994-11e8-8bd7-fa163e23632f
Labels: controller-uid=863a14e3-a994-11e8-8bd7-fa163e23632f
job-name=test-elk-xxx-1535331600
Annotations: <none>
Controlled By: CronJob/test-elk-xxx
Parallelism: 0
Completions: 1
Start Time: Mon, 27 Aug 2018 01:00:00 +0000
Pods Statuses: 1 Running / 1 Succeeded / 3 Failed
Pod Template:
Labels: controller-uid=863a14e3-a994-11e8-8bd7-fa163e23632f
job-name=test-elk-xxx-1535331600
Containers:
xxx:
Image: test-elk-xxx:18.03-3
Port: <none>
Host Port: <none>
Args:
--config
/etc/elasticsearch-xxx/xxx.yml
/etc/elasticsearch-xxx/actions.yml
Limits:
cpu: 100m
memory: 100Mi
Requests:
cpu: 100m
memory: 100Mi
Environment: <none>
Mounts:
/etc/elasticsearch-xxx from xxx-configs (ro)
Volumes:
xxx-configs:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: test-elk-xxx
Optional: false
Events: <none>
It indicates that one pod is still running, but I don't know how to figure out the pod name.
Check if kubectl describe pod <pod name> (associated pod of the job) still returns something, which would:
mean the node is still there
include the pod condition
In that state, you can then consider a force deletion.
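To find the pod name in the first place, you can use the job-name label that the Job controller puts on its pods (visible in the describe output above):
kubectl get pods -n test -l job-name=test-elk-xxx-1535331600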
I think this is the same as the problem reported in github:
Cannot delete jobs when their associated pods are gone
This has been reported by several people, and it is still not fixed.
You can also use the -v=X option (e.g. -v=8) with the kubectl command; it will give more detailed debug info.
As taken from https://github.com/kubernetes/kubernetes/issues/43168#issuecomment-375700293
Try using --cascade=false in your delete job command.
It worked for me
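For the job in the question, that would look like the following; --cascade=false deletes the Job object itself but leaves its pods alone (on newer kubectl versions the equivalent flag value is --cascade=orphan):
kubectl delete job test-elk-xxx-1535331600 -n test --cascade=false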

Kubernetes Keeps Restarting Pods of StatefulSet in Minikube With "Need to kill pod"

Minikube version v0.24.1
kubernetes version 1.8.0
The problem that I am facing is that I have several statefulsets created in minikube each with one pod.
Sometimes when I start up minikube, my pods will start up initially and then keep being restarted by Kubernetes. They go from the ContainerCreating state, to Running, to Terminating, over and over.
Now I've seen kubernetes kill and restart things before if kubernetes detects disk pressure, memory pressure, or some other condition like that, but that's not the case here as these flags are not raised and the only message in the pod's event log is "Need to kill pod".
What's most confusing is that this issue doesn't happen all the time, and I'm not sure how to trigger it. My minikube setup will work for a week or more without this happening then one day I'll start minikube up and the pods for my statefulsets just keep restarting. So far the only workaround I've found is to delete my minikube instance and set it up again from scratch, but obviously this is not ideal.
Below is a sample of one of the statefulsets whose pod keeps getting restarted. As seen in the events, Kubernetes deletes the pod and starts it again, repeatedly. I'm unable to figure out why it keeps doing that, and why it only gets into this state sometimes.
$ kubectl describe statefulsets mongo --namespace=storage
Name: mongo
Namespace: storage
CreationTimestamp: Mon, 08 Jan 2018 16:11:39 -0600
Selector: environment=test,role=mongo
Labels: name=mongo
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"apps/v1beta1","kind":"StatefulSet","metadata":{"annotations":{},"labels":{"name":"mongo"},"name":"mongo","namespace":"storage"},"...
Replicas: 1 desired | 1 total
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: environment=test
role=mongo
Containers:
mongo:
Image: mongo:3.4.10-jessie
Port: 27017/TCP
Command:
mongod
--replSet
rs0
--smallfiles
--noprealloc
Environment: <none>
Mounts:
/data/db from mongo-persistent-storage (rw)
mongo-sidecar:
Image: cvallance/mongo-k8s-sidecar
Port: <none>
Environment:
MONGO_SIDECAR_POD_LABELS: role=mongo,environment=test
KUBERNETES_MONGO_SERVICE_NAME: mongo
Mounts: <none>
Volumes: <none>
Volume Claims:
Name: mongo-persistent-storage
StorageClass:
Labels: <none>
Annotations: volume.alpha.kubernetes.io/storage-class=default
Capacity: 5Gi
Access Modes: [ReadWriteOnce]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulDelete 23m (x46 over 1h) statefulset delete Pod mongo-0 in StatefulSet mongo successful
Normal SuccessfulCreate 3m (x62 over 1h) statefulset create Pod mongo-0 in StatefulSet mongo successful
After some more digging, there seems to have been a bug which can affect statefulsets by creating multiple controller revisions for the same statefulset:
https://github.com/kubernetes/kubernetes/issues/56355
This issue seems to have been fixed and the fix seems to have been backported to version 1.8 of kubernetes and included in version 1.9, but minikube doesn't yet have the fixed version. A workaround if your system enters this state is to list the controller revisions like so:
$ kubectl get controllerrevisions --namespace=storage
NAME CONTROLLER REVISION AGE
mongo-68bd5cbcc6 StatefulSet/mongo 1 19h
mongo-68bd5cbcc7 StatefulSet/mongo 1 7d
and delete the duplicate controller revisions for each statefulset:
$ kubectl delete controllerrevisions mongo-68bd5cbcc6 --namespace=storage
or simply use version 1.9 of Kubernetes or above, which includes this bug fix.
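Before deleting anything, it may help to check which revision the StatefulSet is currently using, so you remove the stale one rather than the active one (a sketch; it assumes your Kubernetes version exposes status.currentRevision on the StatefulSet):
kubectl get statefulset mongo --namespace=storage -o jsonpath='{.status.currentRevision}'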

No nodes available to schedule pods, using google container engine

I'm having an issue where a container I'd like to run doesn't appear to be getting started on my cluster.
I've tried searching around for possible solutions, but there's a surprising lack of information out there to assist with this issue or anything of its nature.
Here's the most I could gather:
$ kubectl describe pods/elasticsearch
Name: elasticsearch
Namespace: default
Image(s): my.image.host/my-project/elasticsearch
Node: /
Labels: <none>
Status: Pending
Reason:
Message:
IP:
Replication Controllers: <none>
Containers:
elasticsearch:
Image: my.image.host/my-project/elasticsearch
Limits:
cpu: 100m
State: Waiting
Ready: False
Restart Count: 0
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Mon, 19 Oct 2015 10:28:44 -0500 Mon, 19 Oct 2015 10:34:09 -0500 12 {scheduler } failedScheduling no nodes available to schedule pods
I also see this:
$ kubectl get pod elasticsearch -o wide
NAME READY STATUS RESTARTS AGE NODE
elasticsearch 0/1 Pending 0 5s
I guess I'd like to know: What prerequisites exist so that I can be confident that my container is going to run in container engine? What do I need to do in this scenario to get it running?
Here's my yml file:
apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch
spec:
  containers:
  - name: elasticsearch
    image: my.image.host/my-project/elasticsearch
    ports:
    - containerPort: 9200
    resources:
    volumeMounts:
    - name: elasticsearch-data
      mountPath: /usr/share/elasticsearch
  volumes:
  - name: elasticsearch-data
    gcePersistentDisk:
      pdName: elasticsearch-staging
      fsType: ext4
Here's some more output about my node:
$ kubectl get nodes
NAME LABELS STATUS
gke-elasticsearch-staging-00000000-node-yma3 kubernetes.io/hostname=gke-elasticsearch-staging-00000000-node-yma3 NotReady
You only have one node in your cluster and its status is NotReady, so you won't be able to schedule any pods. You can try to determine why your node isn't ready by looking in /var/log/kubelet.log. You can also add new nodes to your cluster (scale the cluster size up to 2) or delete the node (it will be automatically replaced by the instance group manager) to see if either of those options gets you a working node.
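To dig further you can inspect the NotReady node's conditions, and if you want more capacity you can resize the node pool. The commands below are a sketch: the cluster name elasticsearch-staging is inferred from the node name, and the --zone placeholder is yours to fill in:
kubectl describe node gke-elasticsearch-staging-00000000-node-yma3
gcloud container clusters resize elasticsearch-staging --num-nodes=2 --zone <your-zone>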
It appears that the scheduler couldn't see any nodes in your cluster. You can run kubectl get nodes and gcloud compute instances list to confirm whether you have any nodes in the cluster. Did you correctly specify the number of nodes (--num-nodes) when creating the cluster?