Why does my Kubernetes CronJob pod get killed while executing?

Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
I have a Kubernetes CronJob that runs some Azure CLI commands on a time-based schedule.
Running the container locally works fine; however, manually triggering the CronJob through Lens, or letting it run per the schedule, results in weird behaviour (running in the cloud as a Job yields unexpected results).
Here is the cronjob definition:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
I ran the cronjob manually and it created job development-scale-down-manual-xwp1k. Describing this job after it completed, we can see the following:
$ kubectl describe job development-scale-down-manual-xwp1k
Name: development-scale-down-manual-xwp1k
Namespace: development
Selector: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Annotations: <none>
Parallelism: 1
Completions: 1
Start Time: Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds: 360s
Pods Statuses: 0 Running / 0 Succeeded / 1 Failed
Pod Template:
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Containers:
scaler:
Image: myimage:latest
Port: <none>
Host Port: <none>
Environment:
CLUSTER_NAME: ...
NODEPOOL_NAME: ...
NODEPOOL_SIZE: ...
RESOURCE_GROUP: ...
SP_APP_ID: <set to the key 'application_id' in secret 'scaler-secrets'> Optional: false
SP_PASSWORD: <set to the key 'application_pass' in secret 'scaler-secrets'> Optional: false
SP_TENANT: <set to the key 'application_tenant' in secret 'scaler-secrets'> Optional: false
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 24m job-controller Created pod: development-scale-down-manual-xwp1k-b858c
Normal SuccessfulCreate 23m job-controller Created pod: development-scale-down-manual-xwp1k-xkkw9
Warning BackoffLimitExceeded 23m job-controller Job has reached the specified backoff limit
This differs from other issues I have read about, in that no "SuccessfulDelete" event is mentioned here.
The events received from kubectl get events tell an interesting story
$ ktl get events | grep xwp1k
3m19s Normal Scheduled pod/development-scale-down-manual-xwp1k-b858c Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s Normal Pulling pod/development-scale-down-manual-xwp1k-b858c Pulling image "myimage:latest"
2m38s Normal Pulled pod/development-scale-down-manual-xwp1k-b858c Successfully pulled image "myimage:latest" in 40.365655229s
2m23s Normal Created pod/development-scale-down-manual-xwp1k-b858c Created container myimage
2m23s Normal Started pod/development-scale-down-manual-xwp1k-b858c Started container myimage
2m12s Normal Killing pod/development-scale-down-manual-xwp1k-b858c Stopping container myimage
2m12s Normal Scheduled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s Normal Pulling pod/development-scale-down-manual-xwp1k-xkkw9 Pulling image "myimage:latest"
2m11s Normal Pulled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully pulled image "myimage:latest" in 751.93652ms
2m10s Normal Created pod/development-scale-down-manual-xwp1k-xkkw9 Created container myimage
2m10s Normal Started pod/development-scale-down-manual-xwp1k-xkkw9 Started container myimage
3m19s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-b858c
2m12s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s Warning BackoffLimitExceeded job/development-scale-down-manual-xwp1k Job has reached the specified backoff limit
I can't figure out why the container was killed; the logs all seem fine and there are no resource constraints. The container is removed very quickly, meaning I have very little time to debug. The more verbose event line reads as follows:
3m54s Normal Killing pod/development-scale-down-manual-xwp1k-b858c spec.containers{myimage} kubelet, aks-burst-37275452-vmss00000d Stopping container myimage 3m54s 1 development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
I note that the image pull initially takes a good few seconds (40); might this contribute to exceeding the startingDeadlineSeconds or another part of the CronJob spec?
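For what it's worth, here is a sketch of the commands I can use to try to capture more detail before the pod disappears (using the pod name from the events above):
# Dump the last recorded state and exit reason of the killed container
kubectl get pod development-scale-down-manual-xwp1k-b858c -n development \
  -o jsonpath='{.status.containerStatuses[0].lastState}{"\n"}'
# Fetch logs from the previous (killed) container instance, if they are still available
kubectl logs development-scale-down-manual-xwp1k-b858c -n development --previous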
Any thoughts or help appreciated, thank you

Reading logs! Always helpful.
Context
For context, the job itself scales an AKS nodepool. We have two: the default system pool, and a new user-controlled pool. The CronJob is meant to scale the new user pool (not the system pool).
Investigating
I noticed that the scale-down job always takes longer than the scale-up job; this is because the image pull always happens when the scale-down job runs.
I also noticed that the Killing event mentioned above originates from the kubelet. (kubectl get events -o wide)
I went to check the kubelet logs on the host and realised that the host name was a little atypical (aks-burst-XXXXXXXX-vmss00000d), in the sense that most hosts in our small development cluster usually have numbers at the end, not a d.
Then I realised the naming was different because this node was not part of the default nodepool, and I could not check the kubelet logs because the host had already been removed.
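Had the node still been around, a rough sketch of how I would have checked the kubelet logs on it (assuming a systemd-based AKS node, reached over SSH or a debug session):
# On the node itself; kubelet runs as a systemd unit on AKS nodes
journalctl -u kubelet --since "30 min ago" | grep development-scale-down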
Cause
The job scales down compute resources. The scale-down would fail because it was always preceded by a scale-up, at which point a new node was in the cluster. This node had nothing running on it, so the next Job was scheduled on it. The Job started on the new node, told Azure to scale the new node down to 0, and subsequently the kubelet killed the Job as it was running.
Always being scheduled on the new node explains why the image pull happened each time as well.
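For illustration, the scale-down call the job makes is roughly of this shape (a hypothetical sketch using az aks nodepool scale and the env vars from the spec, not necessarily the exact command baked into myimage):
# Scale the user nodepool down to zero nodes
az aks nodepool scale \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --name "$NODEPOOL_NAME" \
  --node-count 0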
Fix
I changed the spec and added a nodeSelector so that the Job always runs on the system pool, which is more stable than the user pool:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
          nodeSelector:
            agentpool: default
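Before relying on the selector, it is worth confirming the label value on the system pool nodes (a quick check, assuming the system pool is named default as above):
# Show the agentpool label for every node
kubectl get nodes -L agentpool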

Related

Back-off restarting failed container In Azure AKS

A Linux container pod, using Docker images from Azure Container Registry, keeps restarting with restartPolicy set to Always. The pod description is below.
kubectl describe pod example-pod
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jun 2020 03:27:11 +0000
Finished: Thu, 11 Jun 2020 03:27:12 +0000
...
Back-off restarting failed container
This pod is created with a secret to access the ACR registry repository.
The reason is that the pod completes execution successfully with exit code 0, whereas it should keep listening on a particular port number. The relevant Microsoft documentation is the Container Group Runtime page, under the heading "Container continually exits and restarts".
The content of the deployment-example.yml file is below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  namespace: development
  labels:
    app: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: contentocr.azurecr.io/example:latest
          #command: ["ping -t localhost"]
          imagePullPolicy: Always
          ports:
            - name: http-port
              containerPort: 3000
      imagePullSecrets:
        - name: regpass
      restartPolicy: Always
      nodeSelector:
        agent: linux
---
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: development
  labels:
    app: example
spec:
  ports:
    - name: http-port
      port: 3000
      targetPort: 3000
  selector:
    app: example
  type: LoadBalancer
The output of kubectl get events is below.
3m39s Normal Scheduled pod/example-deployment-5dc964fcf8-gbm5t Successfully assigned development/example-deployment-5dc964fcf8-gbm5t to aks-agentpool-18342716-vmss000000
2m6s Normal Pulling pod/example-deployment-5dc964fcf8-gbm5t Pulling image "contentocr.azurecr.io/example:latest"
2m5s Normal Pulled pod/example-deployment-5dc964fcf8-gbm5t Successfully pulled image "contentocr.azurecr.io/example:latest"
2m5s Normal Created pod/example-deployment-5dc964fcf8-gbm5t Created container example
2m49s Normal Started pod/example-deployment-5dc964fcf8-gbm5t Started container example
2m20s Warning BackOff pod/example-deployment-5dc964fcf8-gbm5t Back-off restarting failed container
6m6s Normal SuccessfulCreate replicaset/example-deployment-5dc964fcf8 Created pod: example-deployment-5dc964fcf8-2fdt5
3m39s Normal SuccessfulCreate replicaset/example-deployment-5dc964fcf8 Created pod: example-deployment-5dc964fcf8-gbm5t
6m6s Normal ScalingReplicaSet deployment/example-deployment Scaled up replica set example-deployment-5dc964fcf8 to 1
3m39s Normal ScalingReplicaSet deployment/example-deployment Scaled up replica set example-deployment-5dc964fcf8 to 1
3m38s Normal EnsuringLoadBalancer service/example Ensuring load balancer
3m34s Normal EnsuredLoadBalancer service/example Ensured load balancer
The Dockerfile entry point is ENTRYPOINT ["npm", "start"] with CMD ["tail -f /dev/null/"].
It runs locally because, implicitly, the CI="true" flag is assigned. However, in docker-compose, stdin_open: true or tty: true must be set, and in the Kubernetes deployment file an environment variable named CI must be set to "true".
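A minimal sketch of what that could look like in the container spec of the deployment above (assuming the CI variable is all that is needed to keep the process in the foreground):
      containers:
        - name: example
          image: contentocr.azurecr.io/example:latest
          imagePullPolicy: Always
          env:
            - name: CI
              value: "true"
          ports:
            - name: http-port
              containerPort: 3000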
The command below solved my problem:
az aks update -n aks-nks-k8s-cluster -g aks-nks-k8s-rg --attach-acr aksnksk8s
After executing the above command, the following is displayed:
Add ROLE Propagation done [###############] 100.0000%
and then Running.., followed by a response trail after some time.
Here,
aks-nks-k8s-cluster : the cluster name I created and am using
aks-nks-k8s-rg : the resource group I created and am using
aksnksk8s : the container registry I created and am using
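To double-check the link afterwards, newer Azure CLI versions also provide a diagnostic command (a sketch; availability and flags depend on your CLI version):
az aks check-acr --name aks-nks-k8s-cluster --resource-group aks-nks-k8s-rg --acr aksnksk8s.azurecr.io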

Trying to create a Kubernetes deployment but it shows 0 pods available

I'm new to k8s, so some of my terminology might be off. But basically, I'm trying to deploy a simple web api: one load balancer in front of n pods (where right now, n=1).
However, when I try to visit the load balancer's IP address it doesn't show my web application. When I run kubectl get deployments, I get this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
tl-api 1 1 1 0 4m
Here's my YAML file. Let me know if anything looks off--I'm very new to this!
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tl-api
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tl-api
    spec:
      containers:
        - name: tl-api
          image: tlk8s.azurecr.io/devicecloudwebapi:v1
          ports:
            - containerPort: 80
      imagePullSecrets:
        - name: acr-auth
      nodeSelector:
        beta.kubernetes.io/os: windows
---
apiVersion: v1
kind: Service
metadata:
  name: tl-api
spec:
  type: LoadBalancer
  ports:
    - port: 80
  selector:
    app: tl-api
Edit 2: When I try using ACS (which supports Windows), I get this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned tl-api-3466491809-vd5kg to dc9ebacs9000
Normal SuccessfulMountVolume 11m kubelet, dc9ebacs9000 MountVolume.SetUp succeeded for volume "default-token-v3wz9"
Normal Pulling 4m (x6 over 10m) kubelet, dc9ebacs9000 pulling image "tlk8s.azurecr.io/devicecloudwebapi:v1"
Warning FailedSync 1s (x50 over 10m) kubelet, dc9ebacs9000 Error syncing pod
Normal BackOff 1s (x44 over 10m) kubelet, dc9ebacs9000 Back-off pulling image "tlk8s.azurecr.io/devicecloudwebapi:v1"
I then try examining the failed pod:
PS C:\users\<me>\source\repos\DeviceCloud\DeviceCloud\1- Presentation\DeviceCloud.Web.API> kubectl logs tl-api-3466491809-vd5kg
Error from server (BadRequest): container "tl-api" in pod "tl-api-3466491809-vd5kg" is waiting to start: trying and failing to pull image
When I run docker images I see the following:
REPOSITORY TAG IMAGE ID CREATED SIZE
devicecloudwebapi latest ee3d9c3e231d 24 hours ago 7.85GB
tlk8s.azurecr.io/devicecloudwebapi v1 ee3d9c3e231d 24 hours ago 7.85GB
devicecloudwebapi dev bb33ab221910 25 hours ago 7.76GB
Your problem is that the container image tlk8s.azurecr.io/devicecloudwebapi:v1 is in a private container registry. See the events at the bottom of the output of the following command:
$ kubectl describe po -l=app=tl-api
The official Kubernetes docs describe how to resolve this issue; see Pull an Image from a Private Registry. Essentially:
Create a secret with kubectl create secret docker-registry
Use it in your deployment, under the spec.imagePullSecrets key, as sketched below
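Roughly, reusing the acr-auth secret name already referenced in your spec (the credential values here are placeholders, not real ones):
kubectl create secret docker-registry acr-auth \
  --docker-server=tlk8s.azurecr.io \
  --docker-username=<service-principal-id> \
  --docker-password=<service-principal-password> \
  --docker-email=<any-valid-email>
Your deployment already points at this secret via imagePullSecrets: acr-auth, so once the secret exists with valid credentials the pull should succeed.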

Kubernetes Keeps Restarting Pods of StatefulSet in Minikube With "Need to kill pod"

Minikube version v0.24.1
kubernetes version 1.8.0
The problem I am facing is that I have several StatefulSets created in minikube, each with one pod.
Sometimes when I start up minikube, my pods start up initially and then keep being restarted by Kubernetes. They go from the creating-container state, to running, to terminating, over and over.
Now, I've seen Kubernetes kill and restart things before when it detects disk pressure, memory pressure, or some other condition like that, but that's not the case here: those flags are not raised, and the only message in the pod's event log is "Need to kill pod".
What's most confusing is that this issue doesn't happen all the time, and I'm not sure how to trigger it. My minikube setup will work for a week or more without this happening, then one day I'll start minikube up and the pods for my StatefulSets just keep restarting. So far the only workaround I've found is to delete my minikube instance and set it up again from scratch, but obviously this is not ideal.
Below is a sample of one of the StatefulSets whose pod keeps getting restarted. As seen in the events, Kubernetes is deleting the pod and starting it again, over and over. I'm unable to figure out why it keeps doing that, and why it only gets into this state sometimes.
$ kubectl describe statefulsets mongo --namespace=storage
Name: mongo
Namespace: storage
CreationTimestamp: Mon, 08 Jan 2018 16:11:39 -0600
Selector: environment=test,role=mongo
Labels: name=mongo
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"apps/v1beta1","kind":"StatefulSet","metadata":{"annotations":{},"labels":{"name":"mongo"},"name":"mongo","namespace":"storage"},"...
Replicas: 1 desired | 1 total
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: environment=test
role=mongo
Containers:
mongo:
Image: mongo:3.4.10-jessie
Port: 27017/TCP
Command:
mongod
--replSet
rs0
--smallfiles
--noprealloc
Environment: <none>
Mounts:
/data/db from mongo-persistent-storage (rw)
mongo-sidecar:
Image: cvallance/mongo-k8s-sidecar
Port: <none>
Environment:
MONGO_SIDECAR_POD_LABELS: role=mongo,environment=test
KUBERNETES_MONGO_SERVICE_NAME: mongo
Mounts: <none>
Volumes: <none>
Volume Claims:
Name: mongo-persistent-storage
StorageClass:
Labels: <none>
Annotations: volume.alpha.kubernetes.io/storage-class=default
Capacity: 5Gi
Access Modes: [ReadWriteOnce]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulDelete 23m (x46 over 1h) statefulset delete Pod mongo-0 in StatefulSet mongo successful
Normal SuccessfulCreate 3m (x62 over 1h) statefulset create Pod mongo-0 in StatefulSet mongo successful
After some more digging, there seems to have been a bug that can affect StatefulSets by creating multiple controller revisions for the same StatefulSet:
https://github.com/kubernetes/kubernetes/issues/56355
This issue seems to have been fixed, with the fix backported to Kubernetes 1.8 and included in 1.9, but minikube doesn't yet ship the fixed version. A workaround, if your system enters this state, is to list the controller revisions like so:
$ kubectl get controllerrevisions --namespace=storage
NAME CONTROLLER REVISION AGE
mongo-68bd5cbcc6 StatefulSet/mongo 1 19h
mongo-68bd5cbcc7 StatefulSet/mongo 1 7d
and delete the duplicate controller revisions for each StatefulSet:
$ kubectl delete controllerrevisions mongo-68bd5cbcc6 --namespace=storage
Alternatively, simply use Kubernetes version 1.9 or above, which includes this bug fix.

Pulling private image from docker hub using minikube

I'm using minikube on macOS 10.12 and trying to use a private image hosted on Docker Hub. I know that minikube launches a VM which, as far as I know, will be the single node of my local Kubernetes cluster and will host all my pods.
I read that I could use the VM's Docker runtime by running eval $(minikube docker-env), so I used those variables to switch from my local Docker runtime to the VM's. Running docker images, I could see that the change had taken effect.
My next step was to log in to Docker Hub using docker login and then pull my image manually, which finished without error. After that I thought the image would be ready to be used by any pod in the cluster, but I always get ImagePullBackOff. I also tried to ssh into the VM via minikube ssh and the result is the same: the image is there to be used, but for some reason I don't know, it refuses to use it.
In case it helps, this is my deployment description file:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: godraude/nginx
          imagePullPolicy: Always
          ports:
            - containerPort: 80
            - containerPort: 443
And this is the output of kubectl describe pod <podname>:
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {default-scheduler } Normal Scheduled Successfully assigned web-deployment-2451628605-vtbl8 to minikube
1m 23s 4 {kubelet minikube} spec.containers{nginx} Normal Pulling pulling image "godraude/nginx"
1m 20s 4 {kubelet minikube} spec.containers{nginx} Warning Failed Failed to pull image "godraude/nginx": Error: image godraude/nginx not found
1m 20s 4 {kubelet minikube} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "nginx" with ErrImagePull: "Error: image godraude/nginx not found"
1m 4s 5 {kubelet minikube} spec.containers{nginx} Normal BackOff Back-off pulling image "godraude/nginx"
1m 4s 5 {kubelet minikube} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "nginx" with ImagePullBackOff: "Back-off pulling image \"godraude/nginx\""
I think what you need is to create a secret, which will tell Kubernetes from where it can pull your private image and with which credentials:
kubectl create secret docker-registry my-secret --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
The command below lists your secrets:
kubectl get secret
NAME TYPE DATA AGE
my-secret kubernetes.io/dockercfg 1 100d
Now, in the deployment definition, you need to specify which secret to use:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: godraude/nginx
          imagePullPolicy: Always
          ports:
            - containerPort: 80
            - containerPort: 443
      imagePullSecrets:
        - name: my-secret
The problem was the image pull policy. It was set to Always, so Docker was trying to pull the image even though it was already present. Setting imagePullPolicy: Never solved the issue.
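For reference, the relevant fragment then looks roughly like this; imagePullPolicy: IfNotPresent would also work here, since the image is already in the VM's Docker cache:
      containers:
        - name: nginx
          image: godraude/nginx
          imagePullPolicy: Never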

No nodes available to schedule pods, using google container engine

I'm having an issue where a container I'd like to run doesn't appear to be getting started on my cluster.
I've tried searching around for possible solutions, but there's a surprising lack of information out there to assist with this issue or anything of its nature.
Here's the most I could gather:
$ kubectl describe pods/elasticsearch
Name: elasticsearch
Namespace: default
Image(s): my.image.host/my-project/elasticsearch
Node: /
Labels: <none>
Status: Pending
Reason:
Message:
IP:
Replication Controllers: <none>
Containers:
elasticsearch:
Image: my.image.host/my-project/elasticsearch
Limits:
cpu: 100m
State: Waiting
Ready: False
Restart Count: 0
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Mon, 19 Oct 2015 10:28:44 -0500 Mon, 19 Oct 2015 10:34:09 -0500 12 {scheduler } failedScheduling no nodes available to schedule pods
I also see this:
$ kubectl get pod elasticsearch -o wide
NAME READY STATUS RESTARTS AGE NODE
elasticsearch 0/1 Pending 0 5s
I guess I'd like to know: What prerequisites exist so that I can be confident that my container is going to run in container engine? What do I need to do in this scenario to get it running?
Here's my yml file:
apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch
spec:
  containers:
    - name: elasticsearch
      image: my.image.host/my-project/elasticsearch
      ports:
        - containerPort: 9200
      resources:
      volumeMounts:
        - name: elasticsearch-data
          mountPath: /usr/share/elasticsearch
  volumes:
    - name: elasticsearch-data
      gcePersistentDisk:
        pdName: elasticsearch-staging
        fsType: ext4
Here's some more output about my node:
$ kubectl get nodes
NAME LABELS STATUS
gke-elasticsearch-staging-00000000-node-yma3 kubernetes.io/hostname=gke-elasticsearch-staging-00000000-node-yma3 NotReady
You only have one node in your cluster and its status is NotReady, so you won't be able to schedule any pods. You can try to determine why your node isn't ready by looking in /var/log/kubelet.log. You can also add new nodes to your cluster (scale the cluster size up to 2), or delete the node (it will be automatically replaced by the instance group manager) to see if either of those options gets you a working node.
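For example, scaling the cluster up by one node looks roughly like this (assuming the cluster is named elasticsearch-staging, as the node name suggests; the exact flag varies across gcloud releases, newer ones use --num-nodes, older ones --size):
gcloud container clusters resize elasticsearch-staging --num-nodes=2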
It appears that the scheduler couldn't see any nodes in your cluster. You can run kubectl get nodes and gcloud compute instances list to confirm whether you have any nodes in the cluster. Did you correctly specify the number of nodes (--num-nodes) when creating the cluster?