InfluxDB container dies over time, and can't restart - kubernetes

I'm running a kube-upped Kubernetes 1.2.3 cluster on AWS, on two m4.large nodes, and I'm using the auto-installed influx-grafana pod for cluster monitoring.
My problem is that after a week or two, the influx-container dies and will not come up again. I'm a bit unsure what logs to check for relevant error messages, but the syslog on the minion running the container contained the following information:
Jun 16 05:57:41 ip-172-22-29-244 kubelet[4434]: E0616 05:57:41.382751 4434 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"407635", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"monitoring-influxdb-grafana-v3-dlx9o", UID:"07c2a623-2b57-11e6-b7a9-068c6a09a769", APIVersion:"v1", ResourceVersion:"850776", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"influxdb\" with CrashLoopBackOff: \"Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)\"\n", Source:api.EventSource{Component:"kubelet", Host:""}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63600960004, nsec:0, loc:(*time.Location)(0x2e38da0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63601653461, nsec:379098581, loc:(*time.Location)(0x2e38da0)}}, Count:11023, Type:"Warning"}': 'events "monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade" not found' (will not retry!)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: I0616 05:57:54.378491 4434 manager.go:2050] Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: E0616 05:57:54.378545 4434 pod_workers.go:138] Error syncing pod 07c2a623-2b57-11e6-b7a9-068c6a09a769, skipping: failed to "StartContainer" for "influxdb" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)"
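The FirstTimestamp and LastTimestamp fields in the rejected event above are printed as raw Go time.Time structs, so they are hard to read. A minimal sketch for decoding them, assuming the older Go layout in which `sec` counts seconds from January 1 of year 1 (UTC); the function name is made up for illustration:

```python
from datetime import datetime, timezone

# Seconds between Go's internal epoch (0001-01-01 UTC) and the Unix epoch (1970-01-01 UTC).
GO_INTERNAL_TO_UNIX = 62135596800

def go_sec_to_utc(sec: int) -> datetime:
    """Convert the `sec` field of a printed Go time.Time to an aware UTC datetime."""
    return datetime.fromtimestamp(sec - GO_INTERNAL_TO_UNIX, tz=timezone.utc)

first = go_sec_to_utc(63600960004)  # FirstTimestamp from the event above
last = go_sec_to_utc(63601653461)   # LastTimestamp from the event above
print(first.isoformat(), last.isoformat(), last - first)
```

Under that assumption the last timestamp decodes to 2016-06-16 05:57:41 UTC, which matches the syslog line, and the first one falls about eight days earlier, so the crash loop had apparently been running for roughly a week (Count:11023) when this was logged.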
I've also seen indications that the container was originally OOM-killed.
My assumption is that the influx-index just grows too large over time since there is no automatic cleanup, is killed by Kubernetes once the 500MB memory limit from the manifest is breached, and then fails to restart for the same reason or because it times out while reading the index.
Once this happens, the only way I've been able to get it up and running again is to kill the pod entirely to have Kubernetes re-create it from scratch, which basically means losing all existing data.
But what do I do about it? Changing the memory limits on kube-system pods seems to be non-trivial, and may only buy me a few more days anyway.
I could set up my own watchdog to clean up data, but only being able to keep 1-2 weeks of monitoring data kind of limits its value.
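One way to confirm the OOM-kill suspicion, rather than relying on syslog, is to look at the container statuses in the pod object (for example from `kubectl get pod <pod> -n kube-system -o json`). The sketch below is only an illustration: the function name and the sample status are hypothetical, and after many restarts the last termination reason may no longer be the original one.

```python
import json

def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose current or last termination reason is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names

# Hypothetical, trimmed-down pod status for illustration only (not taken from the cluster above).
sample = json.loads("""
{"status": {"containerStatuses": [
  {"name": "influxdb", "state": {"waiting": {"reason": "CrashLoopBackOff"}},
   "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
  {"name": "grafana", "state": {"running": {}}, "lastState": {}}
]}}
""")
print(oom_killed_containers(sample))
```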


gitlab job pod exits unexpectedly

I currently have gitlab runner deployed in my kubernetes cluster with 2 replicas.
When I run a job in gitlab, the runners are successful in spawning pods that run the pipeline. But in some cases, after the pipeline runs the job, I suddenly get the error
Running after_script
Uploading artifacts for failed job
Cleaning up project directory and file based variables
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found
When I have a look at the runner logs, all I see is
gitlab-runners-exchange-587cdbf898-pkgt2 | grep "runner-hzfiusrx-project-37057717-concurrent-21gs8vm"
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: command terminated with exit code 137. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error while executing file based variables removal script error=couldn't get pod details: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found job=2986450184 project=37057717 runner=hzFiusRx
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found duration_s=2067.525269137 job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Failed to process runner builds=32 error=pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found executor=kubernetes runner=hzFiusRx
I'm trying to understand the issue here.
My kubernetes runner config is
host = ""
bearer_token_overwrite_allowed = true
image = "ubuntu:20.04"
namespace = "exchange"
namespace_overwrite_allowed = ""
privileged = true
cpu_request = "5"
memory_request = "25Gi"
The nodes on which the job pods get scheduled have the following capacity
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32523380Ki
pods: 58
So what exactly might be the issue here? The CPU and memory dimensioning for the nodes seems correct.
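As a rough check of that dimensioning, the node capacity can be compared with the runner's requests. This is only a sketch using the numbers quoted above; it uses raw capacity (allocatable is normally somewhat lower) and says nothing about limits, which the config shown does not set.

```python
def ki_to_gi(ki: int) -> float:
    """Convert kibibytes (Ki) to gibibytes (Gi)."""
    return ki / (1024 * 1024)

node_memory_gi = ki_to_gi(32523380)   # "memory: 32523380Ki" from the node capacity
node_cpu = 8                          # "cpu: 8" from the node capacity
request_memory_gi = 25                # memory_request = "25Gi"
request_cpu = 5                       # cpu_request = "5"

# How many job pods fit on one node, judged by requests alone.
pods_by_memory = int(node_memory_gi // request_memory_gi)
pods_by_cpu = node_cpu // request_cpu
print(round(node_memory_gi, 2), min(pods_by_memory, pods_by_cpu))
```

By requests alone, one job pod fits per node, with about 6 Gi of capacity left over. A job that grows well past its 25Gi request can therefore push the node into memory pressure even though scheduling looked fine.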
Looking at the utilization, everything seems good too.
So what might be the issue here? Is it that Kubernetes/GitLab is not gracefully killing the job pod? Or does it need more memory?
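The "exit code 137" in the runner log is a useful hint. By shell convention, exit codes above 128 mean the process was killed by signal number (code - 128). A small sketch, assuming a POSIX system (the helper name is made up):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit code; values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"

print(describe_exit_code(137))
```

137 maps to SIGKILL. That is the signal the kernel OOM killer sends, but it is also what a container gets when it is force-stopped during eviction or deletion, so the exit code alone does not tell which of the two happened here.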

kubelet won't start after kubernetes/manifest update

We are seeing some strange behavior in our K8s cluster.
When we try to deploy a new version of our applications we get:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "<container-id>" network for pod "application-6647b7cbdb-4tp2v": networkPlugin cni failed to set up pod "application-6647b7cbdb-4tp2v_default" network: Get "https://[]:443/api/v1/namespaces/default": dial tcp connect: connection refused
I ran kubectl get cs and found the controller and scheduler in an Unhealthy state.
As described here, I updated /etc/kubernetes/manifests/kube-scheduler.yaml and /etc/kubernetes/manifests/kube-controller-manager.yaml by commenting out --port=0.
When I checked systemctl status kubelet it was working.
Active: active (running) since Mon 2020-10-26 13:18:46 +0530; 1 years 0 months ago
I restarted the kubelet service, and the controller and scheduler were then shown as healthy.
But systemctl status kubelet now shows the following (right after the restart it showed a running state):
Active: activating (auto-restart) (Result: exit-code) since Thu 2021-11-11 10:50:49 +0530; 3s ago
I tried adding Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf as described here, but it's still not working properly.
I also uncommented --port=0 in the manifests mentioned above and tried restarting, with the same result.
Edit: This issue was caused by an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the certificate and key values from /var/lib/kubelet/pki/kubelet-client-current.pem are base64 encoded when placing them in /etc/kubernetes/kubelet.conf
Many others suggested running kubeadm init again, but this cluster was created using kubespray, with no manually added nodes.
We have bare-metal K8s running on Ubuntu 18.04.
K8s: v1.18.8
We would appreciate any suggestions for debugging and fixing this.
When we try to telnet to port 443 from any node, the first attempt fails and the second attempt succeeds.
Edit: Found this in kubelet service logs
Nov 10 17:35:05 node1 kubelet[1951]: W1110 17:35:05.380982 1951 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "app-7b54557dd4-bzjd9_default": unexpected command output nsenter: cannot open /proc/12311/ns/net: No such file or directory
Posting comment as the community wiki answer for better visibility
This issue was caused by an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the certificate and key values from /var/lib/kubelet/pki/kubelet-client-current.pem are base64 encoded when placing them in /etc/kubernetes/kubelet.conf
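For the base64 step, here is a minimal sketch of turning PEM content into the single-line value that embedded kubeconfig fields expect. It assumes the values go into the `client-certificate-data` and `client-key-data` fields of kubelet.conf, which the answer does not name explicitly. The function name and the placeholder PEM are illustrative only.

```python
import base64

def pem_to_kubeconfig_value(pem_bytes: bytes) -> str:
    """Base64-encode PEM bytes into the single-line form used by *-data kubeconfig fields."""
    return base64.b64encode(pem_bytes).decode("ascii")

# Placeholder PEM content for illustration; not a real certificate.
example_pem = b"-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
encoded = pem_to_kubeconfig_value(example_pem)
assert base64.b64decode(encoded) == example_pem  # round-trips losslessly
print(encoded)
```

In practice the bytes would be read from the certificate and key sections of the .pem file instead of the placeholder. The whole PEM block is encoded, including the BEGIN/END lines.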

Cassandra pod fails after kubernetes node restart

I have successfully installed dse in my kubernetes environment using the Kubernetes Operator instructions:
With nodetool I checked that all pods successfully joined the ring.
The problem is that when I reboot one of the Kubernetes nodes, the Cassandra pod that was running on that node never recovers:
[root#node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root#node1 ~]# kubectl get pods -n cassandra
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs but I can't figure out what the problem is.
The "kubectl logs" command returns the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 - address=/ url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 - [s6501] Error connecting to /tmp/dse.sock, trying next node null
at io.netty.bootstrap.Bootstrap$
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
at io.netty.util.concurrent.SingleThreadEventExecutor$
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 - address=/ url=/api/v0/probes/readiness status=500 Internal Server Error
The error null appears also when cassandra starts successfully.
So what remains is the error:
address=/ url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the cassandra container only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api- --dse-socket /tmp/dse.sock --host tcp://
And in the /var/log/cassandra/system.log I can't point out any error
Andrea, the error " null" is a harmless message about a transient error during the Cassandra pod starting up and healthcheck.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under the "READY" column. This means the Cassandra container was not brought up in the auto-restarted pod, and only the management API container is running. I suspect this is a bug in the operator and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
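To spot affected pods quickly, the READY column can be parsed from the plain `kubectl get pods` output. This is just a sketch with a made-up helper name, fed with the listing from the question:

```python
def not_fully_ready(kubectl_output: str) -> list:
    """Return pod names whose READY column (ready/total) shows fewer ready containers than total."""
    pods = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # skip the header row or malformed lines
        ready, total = fields[1].split("/", 1)
        if ready.isdigit() and total.isdigit() and int(ready) < int(total):
            pods.append(fields[0])
    return pods

output = """
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
"""
print(not_fully_ready(output))
```

For that listing it reports only cluster1-dc1-r1-sts-0, the pod the workaround above deletes.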
I got this error when CoreDNS pods were not running on the node, on which I had started Cassandra. The DNS resolutions were not working properly. So, debugging network connectivity may help.

How can I run multiple network interfaces on a k8s node?

Running Openshift 4.1 on K8s v1.13.4. I'm trying to add a second network (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See below logs from kubelet. Completely lost.. How can I add another interface to my nodes?
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address: type:ExternalIP] map[type:ExternalIP address:] map[address: type:InternalIP] map[address: type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:
The resolution was to delete the node from the cluster (kubectl delete node compute-0), reboot it, and let Ignition rejoin it to the cluster.
This is a known bug
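The kubelet log in the question lists ExternalIP and InternalIP twice each in `$setElementOrder/addresses`. My reading, which the truncated log does not confirm, is that the status patch fails because of those repeated address types. A hedged sketch for checking a node object (for example from `kubectl get node compute-0 -o json`) for repeated types follows; the sample node and its addresses are hypothetical placeholders.

```python
from collections import Counter

def duplicate_address_types(node: dict) -> dict:
    """Return address types that appear more than once in a node's status.addresses."""
    counts = Counter(a.get("type") for a in node.get("status", {}).get("addresses", []))
    return {t: n for t, n in counts.items() if n > 1}

# Hypothetical node object; the IPs are documentation-range placeholders, not real values.
node = {"status": {"addresses": [
    {"type": "ExternalIP", "address": "192.0.2.10"},
    {"type": "InternalIP", "address": "192.0.2.10"},
    {"type": "ExternalIP", "address": "198.51.100.10"},
    {"type": "InternalIP", "address": "198.51.100.10"},
    {"type": "Hostname", "address": "compute-0"},
]}}
print(duplicate_address_types(node))
```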

kubelet saying node "master01" not found

I'm trying to set up a kubeadm cluster with three masters. I get this error from my init command:
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I am not using cgroupfs; I use systemd.
And my kubelet complains that it cannot find its own node name.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
The issue may be caused by the Docker version, as only Docker versions up to 18.06 are supported by the latest Kubernetes version, i.e. v1.13.x.
I had the same issue, and it was resolved after downgrading Docker from 18.09 to 18.06.
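A quick way to check an installed Docker version against that ceiling is to compare the major and minor numbers numerically. This is a sketch only: the function names are made up, and the default ceiling of 18.06 comes from this answer, so adjust it for your Kubernetes release.

```python
def parse_version(text: str) -> tuple:
    """Turn '18.09' or '18.09.1-ce' into (major, minor) ints so versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)

def is_newer_than_validated(installed: str, max_validated: str = "18.06") -> bool:
    # The default of 18.06 is an assumption taken from the answer above.
    return parse_version(installed) > parse_version(max_validated)

print(is_newer_than_validated("18.09"), is_newer_than_validated("18.06"))
```

The installed version can be read with `docker version --format '{{.Server.Version}}'`.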
If the problem is not related to Docker, it might be because the kubelet service failed to establish a connection to the API server.
I would first check the status of the kubelet with systemctl status kubelet and consider restarting it with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
In my case, my k8s version is 1.21.1 and my Docker version was 19.03. I solved this by upgrading Docker to version 20.7.