Kubernetes terminating pods without a clear reason in the logs

Is there a way to know why Kubernetes is terminating pods?
If I go to Logging in the Google console, the only message I can find related to this event is:
shutting down, got signal: Terminated
Also, pods in the Terminating status never actually finish terminating; a few of them have been in this status for more than 24 hours now.
I'm not using livenessProbes or readinessProbes.
I am using terminationGracePeriodSeconds: 30
EDIT: added the result of kubectl describe pod <podname> for a pod that has been in the Terminating status for 9 hours as of now:
Name: storeassets-5383k
Namespace: default
Node: gke-recommendation-engin-default-pool-c9b136a8-0qms/10.132.0.85
Start Time: Sat, 11 Mar 2017 06:27:32 +0000
Labels: app=storeassets
deployment=ab08dc44070ffbbceb69ff6a5d99ae61
version=v1
Status: Terminating (expires Tue, 14 Mar 2017 01:30:48 +0000)
Termination Grace Period: 30s
Reason: NodeLost
Message: Node gke-recommendation-engin-default-pool-c9b136a8-0qms which was running pod storeassets-5383k is unresponsive
IP: 10.60.3.7
Controllers: ReplicationController/storeassets
Containers:
storeassets:
Container ID: docker://7b38f1de0321de4a5f2b484f5e2263164a32e9019b275d25d8823de93fb52c30
Image: eu.gcr.io/<project-name>/recommendation-content-realtime
Image ID: docker://sha256:9e8cf1b743f94f365745a011702a4ae1c2e636ceaaec4dd8d36fef6f787aefe7
Port:
Command:
python
-m
realtimecontent.storeassets
Requests:
cpu: 100m
State: Running
Started: Sat, 11 Mar 2017 06:27:33 +0000
Ready: True
Restart Count: 0
Volume Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qwfs4 (ro)
Environment Variables:
RECOMMENDATION_PROJECT: <project-name>
RECOMMENDATION_BIGTABLE_ID: recommendation-engine
GOOGLE_APPLICATION_CREDENTIALS: recommendation-engine-credentials.json
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-qwfs4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qwfs4
QoS Class: Burstable
Tolerations: <none>
No events.

As for why the pods are getting terminated, it is most likely because your image/container is exiting with a successful status.
Try following your pod's logs until it exits. You might be able to see the reason there.
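For example, a minimal set of commands for that (a sketch; <podname> is whichever pod is being killed, and --previous only applies once the container has restarted in place):
kubectl logs -f <podname>
kubectl logs <podname> --previous
kubectl describe pod <podname>
The first streams logs until the container exits, the second shows logs from the last terminated instance, and the third shows the Last State, Reason and recent events.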

Related

Getting CrashLoopBackOff error when deploying a pod

I am new to Kubernetes and am trying to deploy a pod using an image from a private registry. Whenever I deploy this YAML it goes into a crash loop. I added a sleep with a large value thinking that might help, but it still hasn't worked.
apiVersion: v1
kind: Pod
metadata:
  name: privetae-image-testing
spec:
  containers:
  - name: private-image-test
    image: buildforjenkin.azurecr.io/nginx:latest
    imagePullPolicy: IfNotPresent
    command: ['echo','success','sleep 1000000']
Here are the logs:
Name: privetae-image-testing
Namespace: default
Priority: 0
Node: docker-desktop/192.168.65.4
Start Time: Sun, 24 Oct 2021 15:52:25 +0530
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.1.1.49
IPs:
IP: 10.1.1.49
Containers:
private-image-test:
Container ID: docker://46520936762f17b70d1ec92a121269e90aef2549390a14184e6c838e1e6bafec
Image: buildforjenkin.azurecr.io/nginx:latest
Image ID: docker-pullable://buildforjenkin.azurecr.io/nginx#sha256:7250923ba3543110040462388756ef099331822c6172a050b12c7a38361ea46f
Port: <none>
Host Port: <none>
Command:
echo
success
sleep 1000000
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
Ready: False
Restart Count: 2
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ld6zz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ld6zz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/privetae-image-testing to docker-desktop
Normal Pulled 17s (x3 over 33s) kubelet Container image "buildforjenkin.azurecr.io/nginx:latest" already present on machine
Normal Created 17s (x3 over 33s) kubelet Created container private-image-test
Normal Started 17s (x3 over 33s) kubelet Started container private-image-test
Warning BackOff 2s (x5 over 31s) kubelet Back-off restarting failed container
I am running the cluster on docker-desktop on Windows. TIA
Notice you are using the standard nginx image? Try deleting your pod and re-applying with:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-testing
  labels:
    run: my-nginx
spec:
  restartPolicy: Always
  containers:
  - name: private-image-test
    image: buildforjenkin.azurecr.io/nginx:latest
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
      name: http
If your pod runs, you should be able to shell into it with kubectl exec -it private-image-testing -- sh; after that, wget -O- localhost should print a welcome message. If it still fails, paste the output of kubectl logs -f -l run=my-nginx into your question.
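Those checks spelled out as commands (using the pod name and label from the manifest above):
kubectl exec -it private-image-testing -- sh
# then, inside the container:
wget -O- localhost
# or, from outside, follow the logs of pods with the run=my-nginx label:
kubectl logs -f -l run=my-nginx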
Check my previous answer to understand step-by-step what's going on after you launch the container.
You are launching an nginx:latest container whose main process is meant to run forever, which is exactly what keeps the container alive. Then you override its command with one that (to quote David) will print the words success and sleep 1000000, and having printed those words, then exit.
So instead of keeping your container running to serve requests, you are shooting yourself in the foot: the overridden command finishes immediately and takes the container with it.
And sure enough, your command is executed and the container exits. Check the describe output: it exited cleanly with status 0 and has already done so twice. It will keep doing so.
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
You need to think carefully about whether you really need command: ['echo','success','sleep 1000000'].
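If the intention was to print success and then keep the container alive, one way (a sketch; note this still replaces nginx's own entrypoint, so nginx itself will not be serving) is to run the whole thing through a shell so the sleep is actually executed:
command: ['sh', '-c', 'echo success && sleep 1000000']
With the original form, success and sleep 1000000 are simply two extra arguments passed to echo, so everything is printed and the process exits immediately.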

How to see Pod logs: a container name must be specified for pod... choose one of: [wait main]

I am running an Argo workflow and getting the following error in the pod's log:
error: a container name must be specified for pod <name>, choose one of: [wait main]
This error only happens some of the time and only with some of my templates, but when it does, it is a template that is run later in the workflow (i.e. not the first template run). I have not yet been able to identify the parameters that will run successfully, so I will be happy with tips for debugging. I have pasted the output of describe below.
Based on searches, I think the solution is simply that I need to attach "-c main" somewhere, but I do not know where and cannot find information in the Argo docs.
Describe:
Name: message-passing-1-q8jgn-607612432
Namespace: argo
Priority: 0
Node: REDACTED
Start Time: Wed, 17 Mar 2021 17:16:37 +0000
Labels: workflows.argoproj.io/completed=false
workflows.argoproj.io/workflow=message-passing-1-q8jgn
Annotations: cni.projectcalico.org/podIP: 192.168.40.140/32
cni.projectcalico.org/podIPs: 192.168.40.140/32
workflows.argoproj.io/node-name: message-passing-1-q8jgn.e
workflows.argoproj.io/outputs: {"exitCode":"6"}
workflows.argoproj.io/template:
{"name":"egress","arguments":{},"inputs":{...
Status: Failed
IP: 192.168.40.140
IPs:
IP: 192.168.40.140
Controlled By: Workflow/message-passing-1-q8jgn
Containers:
wait:
Container ID: docker://26d6c30440777add2af7ef3a55474d9ff36b8c562d7aecfb911ce62911e5fda3
Image: argoproj/argoexec:v2.12.10
Image ID: docker-pullable://argoproj/argoexec#sha256:6edb85a84d3e54881404d1113256a70fcc456ad49c6d168ab9dfc35e4d316a60
Port: <none>
Host Port: <none>
Command:
argoexec
wait
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 17 Mar 2021 17:16:43 +0000
Finished: Wed, 17 Mar 2021 17:17:03 +0000
Ready: False
Restart Count: 0
Environment:
ARGO_POD_NAME: message-passing-1-q8jgn-607612432 (v1:metadata.name)
Mounts:
/argo/podmetadata from podmetadata (rw)
/mainctrfs/mnt/logs from log-p1-vol (rw)
/mainctrfs/mnt/processed from processed-p1-vol (rw)
/var/run/docker.sock from docker-sock (ro)
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-v2w56 (ro)
main:
Container ID: docker://67e6d6d3717ab1080f14cac6655c90d990f95525edba639a2d2c7b3170a7576e
Image: REDACTED
Image ID: REDACTED
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
State: Terminated
Reason: Error
Exit Code: 6
Started: Wed, 17 Mar 2021 17:16:43 +0000
Finished: Wed, 17 Mar 2021 17:17:03 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mnt/logs/ from log-p1-vol (rw)
/mnt/processed/ from processed-p1-vol (rw)
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-v2w56 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
podmetadata:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
docker-sock:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType: Socket
processed-p1-vol:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: message-passing-1-q8jgn-processed-p1-vol
ReadOnly: false
log-p1-vol:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: message-passing-1-q8jgn-log-p1-vol
ReadOnly: false
argo-token-v2w56:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-v2w56
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m35s default-scheduler Successfully assigned argo/message-passing-1-q8jgn-607612432 to ack1
Normal Pulled 7m31s kubelet Container image "argoproj/argoexec:v2.12.10" already present on machine
Normal Created 7m31s kubelet Created container wait
Normal Started 7m30s kubelet Started container wait
Normal Pulled 7m30s kubelet Container image already present on machine
Normal Created 7m30s kubelet Created container main
Normal Started 7m30s kubelet Started container main
This happens when you try to view logs for a pod with multiple containers without specifying which container you want the logs from. The typical command to see logs is:
kubectl logs <podname>
But your Pod has two containers, one named "wait" and one named "main". You can see the logs from the container named "main" with:
kubectl logs <podname> -c main
or you can see the logs from all containers with:
kubectl logs <podname> --all-containers
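If you are not sure what the containers in a pod are called, one way to list them (a sketch using standard kubectl; substitute your pod name and namespace) is to read the names out of the pod spec:
kubectl get pod <podname> -n argo -o jsonpath='{.spec.containers[*].name}'
For the pod described above this should print: wait main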

How to fix gke-metrics-agent stuck in Terminating state on GKE

GKE had an outage about 2 days ago in their London datacentre (https://status.cloud.google.com/incident/compute/20013), and since then one of my nodes has been acting up. I've had to manually terminate a number of pods running on it, and I'm having issues with a couple of sites, which I assume is due to their liveness checks failing temporarily. Might that have something to do with the error below in gke-metrics-agent?
Looking at the system pods I can see one instance of gke-metrics-agent is stuck in a terminating state and has been since last night:
kubectl get pods -n kube-system
reports:
...
gke-metrics-agent-k47g8 0/1 Terminating 0 32d
gke-metrics-agent-knr9h 1/1 Running 0 31h
gke-metrics-agent-vqkpw 1/1 Running 0 32d
...
I've looked at the describe output for the pod but can't see anything that helps me understand what needs to be done:
kubectl describe pod gke-metrics-agent-k47g8 -n kube-system
Name: gke-metrics-agent-k47g8
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <node-name>/<IP>
Start Time: Mon, 09 Nov 2020 03:41:14 +0000
Labels: component=gke-metrics-agent
controller-revision-hash=f8c5b8bfb
k8s-app=gke-metrics-agent
pod-template-generation=4
Annotations: components.gke.io/component-name: gke-metrics-agent
components.gke.io/component-version: 0.27.1
configHash: <config-hash>
Status: Terminating (lasts 15h)
Termination Grace Period: 30s
IP: <IP>
IPs:
IP: <IP>
Controlled By: DaemonSet/gke-metrics-agent
Containers:
gke-metrics-agent:
Container ID: docker://<id>
Image: gcr.io/gke-release/gke-metrics-agent:0.1.3-gke.0
Image ID: docker-pullable://gcr.io/gke-release/gke-metrics-agent#sha256:<hash>
Port: <none>
Host Port: <none>
Command:
/otelsvc
--config=/conf/gke-metrics-agent-config.yaml
--metrics-level=NONE
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Nov 2020 03:41:17 +0000
Finished: Thu, 10 Dec 2020 21:16:50 +0000
Ready: False
Restart Count: 0
Limits:
memory: 50Mi
Requests:
cpu: 3m
memory: 50Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
POD_NAME: gke-metrics-agent-k47g8 (v1:metadata.name)
POD_NAMESPACE: kube-system (v1:metadata.namespace)
KUBELET_HOST: 127.0.0.1
ARG1: ${1}
ARG2: ${2}
Mounts:
/conf from gke-metrics-agent-config-vol (rw)
/etc/ssl/certs from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from gke-metrics-agent-token-cn6ss (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
gke-metrics-agent-config-vol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gke-metrics-agent-conf
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
gke-metrics-agent-token-cn6ss:
Type: Secret (a volume populated by a Secret)
SecretName: gke-metrics-agent-token-cn6ss
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoExecute
:NoSchedule
components.gke.io/gke-managed-components
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events: <none>
I'm not used to having to work on system pods; in the past my troubleshooting has often fallen back on force-deleting pods when all else fails:
kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force
My concern is that I don't fully understand what this might do to a system pod, so I was hoping someone with expertise could advise on a sensible way forward.
I'm also looking at draining this node so Kubernetes can rebuild a new one. Would this potentially be the easiest way to go?
Following up on this: I found that the node experiencing the issues with gke-metrics-agent became even less stable as the day went on.
I therefore had to drain it. The resources it was running are now on new nodes, which are working as expected, and all system pods are running as expected (including gke-metrics-agent).
Prior to draining the node I made sure Pod Disruption Budgets were in place, as a number of services run on only 1 or 2 instances:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
This meant I could run:
kubectl drain <node-name>
The deployments then ensured they had enough live pods before the bad node was taken offline, which seems to have avoided any downtime.
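For reference, a minimal PodDisruptionBudget sketch along the lines of the linked docs (the names and labels here are illustrative, not from the cluster above; use policy/v1beta1 on clusters older than 1.21):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
Note also that kubectl drain will refuse to proceed while DaemonSet-managed pods (such as gke-metrics-agent) are present unless you pass --ignore-daemonsets.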

How to keep my pod running (not get terminate/restart) in EKS

I am trying to use AWS EKS (Fargate) to run automation cases, but some pods get terminated (9 out of 10 runs), which makes the automation fail.
I have a bunch of automation cases written in Robot Framework. The cases themselves run fine, but they are time-consuming, usually needing 6 hours for a round. So I figured I could use K8S to run the cases in parallel and save time: I use Jenkins to configure how many 'automations' run in parallel, and once they are all done, merge and present the test results.
But some pods often get terminated.
The command "kubectl get pod" returns something like this (I set "restartPolicy: Never" to keep the failed pod around so I can 'describe' it; otherwise the pod is just gone):
box6 0/1 Error 0 9m39s
command "kubectl describe pod box6" get output like following (masked some private information).
Name: box6
Namespace: default
Priority: 2000001000
Priority Class Name: system-node-critical
Node: XXXXXXXX
Start Time: Mon, 21 Dec 2020 15:29:37 +0800
Labels: eks.amazonaws.com/fargate-profile=eksautomation-profile
name=box6
purpose=demonstrate-command
Annotations: CapacityProvisioned: 0.25vCPU 0.5GB
Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"name":"box6","purpose":"demonstrate-command"},"name":"box6","names...
kubernetes.io/psp: eks.privileged
Status: Failed
IP: 192.168.183.226
IPs:
IP: 192.168.183.226
Containers:
box6:
Container ID: XXXXXXXX
Image: XXXXXXXX
Image ID: XXXXXXXX
Port: <none>
Host Port: <none>
Command:
/bin/initMock.sh
State: Terminated
Reason: Error
Exit Code: 143
Started: Mon, 21 Dec 2020 15:32:12 +0800
Finished: Mon, 21 Dec 2020 15:34:09 +0800
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tsk7j (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-tsk7j:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-tsk7j
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning LoggingDisabled <unknown> fargate-scheduler Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Normal Scheduled <unknown> fargate-scheduler Successfully assigned default/box6 to XXXXXXXX
Normal Pulling 5m7s kubelet, XXXXXXXX Pulling image "XXXXXXXX"
Normal Pulled 2m38s kubelet, XXXXXXXX Successfully pulled image "XXXXXXXX"
Normal Created 2m38s kubelet, XXXXXXXX Created container box6
Normal Started 2m37s kubelet, XXXXXXXX Started container box6
I did some searching on the error: exit code 143 is 128 + SIGTERM, so I suspect the pod is being killed by EKS intentionally (see the shell sketch below).
I cannot configure the pod to restart, because if it restarts the automation case cannot resume, which defeats the purpose (no automation running time would be saved).
I have tried enabling CloudWatch, hoping to get a clue as to why the pod gets terminated, but found nothing.
Why is my pod being terminated by EKS? How should I troubleshoot it, and how can I avoid it?
Thanks for your help.
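For reference, the exit-code arithmetic mentioned above can be reproduced locally (a standalone sketch assuming bash, unrelated to EKS itself): a process killed with SIGTERM (signal 15) is reported as exiting with 128 + 15 = 143.
sleep 600 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo $?   # prints 143
So an exit code of 143 in the describe output means something sent the container's main process a SIGTERM.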

Kubernetes init containers run every hour

I have recently set up Redis via https://github.com/tarosky/k8s-redis-ha. This repo includes an init container, and I have added an extra init container in order to get passwords etc. set up.
I am seeing some strange (and seemingly undocumented) behavior: the init containers run as expected before the redis container starts, but they then re-run roughly every hour. I have tested this using a busybox init container (which does nothing) on Deployments and StatefulSets and see the same behavior (a sketch of such a test pod is below), so it is not specific to this redis pod.
I have tested this on bare metal with k8s 1.6 and 1.8 with the same results; however, when applying init containers on GKE (k8s 1.7) this behavior does not happen. I can't see any flags for GKE's kubelet that would dictate this behavior.
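A minimal sketch of that kind of test pod (the names and commands here are illustrative, not the exact manifest used):
apiVersion: v1
kind: Pod
metadata:
  name: init-rerun-test
spec:
  initContainers:
  - name: noop-init
    image: busybox
    command: ['sh', '-c', 'echo init ran at $(date)']
  containers:
  - name: main
    image: busybox
    command: ['sh', '-c', 'sleep 360000']
Watching kubectl describe pod init-rerun-test over time shows whether the init container's Started timestamp keeps moving while the main container never restarts.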
See below for kubectl describe pod showing that the init containers are run when the main pod has not exited/crashed.
Name: redis-sentinel-1
Namespace: (redacted)
Node: (redacted)/(redacted)
Start Time: Mon, 12 Mar 2018 06:20:55 +0000
Labels: app=redis-sentinel
controller-revision-hash=redis-sentinel-7cc557cf7c
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"(redacted)","name":"redis-sentinel","uid":"759a3a3b-25bd-11e8-a8ce-0242ac110...
security.alpha.kubernetes.io/unsafe-sysctls=net.core.somaxconn=1024
Status: Running
IP: (redacted)
Controllers: StatefulSet/redis-sentinel
Init Containers:
redis-ha-server:
Container ID: docker://557d777a7c660b062662426ebe9bbf6f9725fb9d88f89615a8881346587c1835
Image: tarosky/k8s-redis-ha:sentinel-3.0.1
Image ID: docker-pullable://tarosky/k8s-redis-ha#sha256:98e09ef5fbea5bfd2eb1858775c967fa86a92df48e2ec5d0b405f7ca3f5ada1c
Port:
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 13 Mar 2018 03:01:12 +0000
Finished: Tue, 13 Mar 2018 03:01:12 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/opt from opt (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hkj6d (ro)
-redis-init:
Container ID: docker://18c4e353233a6827999ae4a16adf1f408754a21d80a8e3374750fdf9b54f9b1a
Image: gcr.io/(redacted)/redis-init
Image ID: docker-pullable://gcr.io/(redacted)/redis-init#sha256:42042093d58aa597cce4397148a2f1c7967db689256ed4cc8d9f42b34d53aca2
Port:
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 13 Mar 2018 03:01:25 +0000
Finished: Tue, 13 Mar 2018 03:01:25 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/opt from opt (rw)
/secrets/redis-password from redis-password (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hkj6d (ro)
Containers:
redis-sentinel:
Container ID: docker://a54048cbb7ec535c841022c543a0d566c9327f37ede3a6232516721f0e37404d
Image: redis:3.2
Image ID: docker-pullable://redis#sha256:474fb41b08bcebc933c6337a7db1dc7131380ee29b7a1b64a7ab71dad03ad718
Port: 26379/TCP
Command:
/opt/bin/k8s-redis-ha-sentinel
Args:
/opt/sentinel.conf
State: Running
Started: Mon, 12 Mar 2018 06:21:02 +0000
Ready: True
Restart Count: 0
Readiness: exec [redis-cli -p 26379 info server] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SERVICE: redis-server
SERVICE_PORT: redis-server
Mounts:
/opt from opt (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hkj6d (ro)
redis-sword:
Container ID: docker://50279448bbbf175b6f56f96dab59061c4652c2117452ed15b3a5380681c7176f
Image: tarosky/k8s-redis-ha:sword-3.0.1
Image ID: docker-pullable://tarosky/k8s-redis-ha#sha256:2315c7a47d9e47043d030da270c9a1252c2cfe29c6e381c8f50ca41d3065db6d
Port:
State: Running
Started: Mon, 12 Mar 2018 06:21:03 +0000
Ready: True
Restart Count: 0
Environment:
SERVICE: redis-server
SERVICE_PORT: redis-server
SENTINEL: redis-sentinel
SENTINEL_PORT: redis-sentinel
Mounts:
/opt from opt (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hkj6d (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
opt:
Type: HostPath (bare host directory volume)
Path: /store/redis-sentinel/opt
redis-password:
Type: Secret (a volume populated by a Secret)
SecretName: redis-password
Optional: false
default-token-hkj6d:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hkj6d
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
20h 30m 21 kubelet, 10.1.3.102 spec.initContainers{redis-ha-server} Normal Pulling pulling image "tarosky/k8s-redis-ha:sentinel-3.0.1"
21h 30m 22 kubelet, 10.1.3.102 spec.initContainers{redis-ha-server} Normal Started Started container
21h 30m 22 kubelet, 10.1.3.102 spec.initContainers{redis-ha-server} Normal Created Created container
20h 30m 21 kubelet, 10.1.3.102 spec.initContainers{redis-ha-server} Normal Pulled Successfully pulled image "tarosky/k8s-redis-ha:sentinel-3.0.1"
21h 30m 22 kubelet, 10.1.3.102 spec.initContainers{redis-init} Normal Pulling pulling image "gcr.io/(redacted)/redis-init"
21h 30m 22 kubelet, 10.1.3.102 spec.initContainers{redis-init} Normal Pulled Successfully pulled image "gcr.io/(redacted)/redis-init"
21h 30m 22 kubelet, 10.1.3.102 spec.initContainers{redis-init} Normal Created Created container
21h 30m 22 kubelet, 10.1.3.102 spec.initContainers{redis-init} Normal Started Started container
Note the Containers in the pod, which started at Mon, 12 Mar 2018 06:21:02 +0000 (with 0 restarts), and the Init Containers, which started from Tue, 13 Mar 2018 03:01:12 +0000. These seem to re-run roughly every hour.
Is our bare-metal setup misconfigured for init containers somewhere? Can anyone shed any light on this strange behavior?
If you are pruning away exited containers, then that pruning/removal is a likely cause. In my testing, it appears that removing exited init containers from the Docker Engine (hourly or otherwise), for example with "docker system prune -f", causes Kubernetes to re-launch the init containers. Is this the issue in your case, if it is still persisting?
Also, see https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/ for the kubelet garbage collection documentation, which appears to support these kinds of cleanup tasks (rather than needing to implement them yourself).
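For reference, a hedged sketch of the kubelet flags that the linked garbage-collection docs describe (flag names as documented there; the values here are only examples):
--minimum-container-ttl-duration=1h        # keep exited containers at least this long before GC
--maximum-dead-containers-per-container=2  # dead containers retained per container
--maximum-dead-containers=100              # total dead containers retained on the node
Letting the kubelet do this cleanup, instead of an hourly docker system prune, avoids deleting the exited init containers that Kubernetes still expects to find.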