How can I run multiple network interfaces on a k8s node?

Running OpenShift 4.1 on K8s v1.13.4. I'm trying to add a second network interface (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See the kubelet logs below. I'm completely lost. How can I add another interface to my nodes?
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
...
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address:10.90.49.111 type:ExternalIP] map[type:ExternalIP address:10.90.51.94] map[address:10.90.49.111 type:InternalIP] map[address:10.90.51.94 type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:

The resolution was to delete the node from the cluster (kubectl delete node compute-0), reboot it, and let ignition rejoin it to the cluster.
This is a known bug.
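For reference, a minimal sketch of that workaround, reusing the node name from the logs above (how you reboot the host depends on your environment):
# Remove the stuck node object so it can re-register cleanly
kubectl delete node compute-0
# Reboot compute-0 out-of-band or via ssh, then watch for ignition to rejoin it
kubectl get nodes -w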

Related

kubelet won't start after kubernetes/manifest update

This is sort of strange behavior in our K8s cluster.
When we try to deploy a new version of our applications, we get:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "<container-id>" network for pod "application-6647b7cbdb-4tp2v": networkPlugin cni failed to set up pod "application-6647b7cbdb-4tp2v_default" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/default": dial tcp 10.233.0.1:443: connect: connection refused
I used kubectl get cs and found the controller-manager and scheduler in an Unhealthy state.
As described here, I updated /etc/kubernetes/manifests/kube-scheduler.yaml and
/etc/kubernetes/manifests/kube-controller-manager.yaml by commenting out --port=0.
When I checked systemctl status kubelet it was working.
Active: active (running) since Mon 2020-10-26 13:18:46 +0530; 1 years 0 months ago
I restarted the kubelet service, and the controller-manager and scheduler were then shown as healthy.
But systemctl status kubelet now shows the following (right after the restart it briefly showed a running state):
Active: activating (auto-restart) (Result: exit-code) since Thu 2021-11-11 10:50:49 +0530; 3s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 21234 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET
I tried adding Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf as described here, but it's still not working properly.
I also removed the --port=0 comment in the above-mentioned manifests and tried restarting; still the same result.
Edit: This issue was due to an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64-encoded when placing them in /etc/kubernetes/kubelet.conf.
Many others suggested running kubeadm init again, but this cluster was created using Kubespray, with no manually added nodes.
We have bare-metal K8s running on Ubuntu 18.04.
K8s: v1.18.8
We would appreciate any debugging and fixing suggestions.
PS:
When we try to telnet 10.233.0.1 443 from any node, the first attempt fails and the second succeeds.
Edit: Found this in kubelet service logs
Nov 10 17:35:05 node1 kubelet[1951]: W1110 17:35:05.380982 1951 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "app-7b54557dd4-bzjd9_default": unexpected command output nsenter: cannot open /proc/12311/ns/net: No such file or directory
Posting the comment as a community wiki answer for better visibility:
This issue was due to an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64-encoded when placing them in /etc/kubernetes/kubelet.conf.
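For anyone hitting the same symptom, a hedged sketch of the checks involved (paths assume a kubeadm-style layout; the same PEM file typically contains both the certificate and the key):
# Check when the kubelet's client certificate expires
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate
# Produce the base64-encoded data expected by the client-certificate-data /
# client-key-data fields of /etc/kubernetes/kubelet.conf
base64 -w0 /var/lib/kubelet/pki/kubelet-client-current.pem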

Cassandra pod fails after kubernetes node restart

I have successfully installed DSE in my Kubernetes environment using the Kubernetes Operator instructions.
With nodetool I checked that all pods successfully joined the ring.
The problem is that when I reboot one of the Kubernetes nodes, the Cassandra pod that was running on that node never recovers:
[root@node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.244.166.132 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 10.244.104.1 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 10.244.135.2 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root@node1 ~]# kubectl get pods -n cassandra
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs but I can't figure out what the problem is.
The kubectl logs command returns the output below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null also appears when Cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the cassandra container only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0
And in /var/log/cassandra/system.log I can't find any errors.
Andrea, the error "java.io.FileNotFoundException: null" is a harmless message about a transient error during Cassandra pod startup and health checking.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under "READY" column, this means the Cassandra container was not brought up in the auto-restarted pod. Only the management API container is running. I suspect this is a bug in the operator and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
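A minimal sketch of that workaround, using the pod name from the output above:
# Delete only the pod; the StatefulSet controller recreates it and remounts its volume
kubectl -n cassandra delete pod cluster1-dc1-r1-sts-0
# Wait until the pod reports 2/2 under READY again
kubectl -n cassandra get pods -w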
I got this error when the CoreDNS pods were not running on the node on which I had started Cassandra; DNS resolution was not working properly. So debugging network connectivity may help.

kubelet saying node "master01" not found

I'm trying to set up a stacked kubeadm cluster with three masters. I receive this error from my init command:
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use cgroupfs; I use systemd.
And my kubelet complains that it does not know its node name:
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
The issue can be caused by the Docker version, as only Docker versions up to 18.6 are supported by the latest Kubernetes version, i.e. v1.13.x.
I actually hit the same issue, and it got resolved after downgrading the Docker version from 18.9 to 18.6.
If the problem is not related to Docker, it might be because the kubelet service failed to establish a connection to the API server.
I would first of all check the status of the kubelet with systemctl status kubelet and consider restarting it with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
In my case, my k8s version was 1.21.1 and my Docker version was 19.03. I solved this bug by upgrading Docker to version 20.7.
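Since both the Docker version and the cgroup driver come up in this thread, here is a hedged check/fix sketch. It assumes a Docker-based runtime; merge the daemon.json snippet into any existing file rather than overwriting it, and note that restarting Docker disrupts running containers:
# Check the installed Docker version and its cgroup driver
docker version --format '{{.Server.Version}}'
docker info --format '{{.CgroupDriver}}'
# If Docker reports cgroupfs while the kubelet expects systemd, align them
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker && sudo systemctl restart kubelet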

How do I fix Kubernetes NodeUnderDiskPressure errors?

After creating a simple hello world deployment, my pod status shows as "PENDING". When I run kubectl describe pod on the pod, I get the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14s (x6 over 29s) default-scheduler 0/1 nodes are available: 1 NodeUnderDiskPressure.
If I check on my node health, I get:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:33 -0700 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:33 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:43 -0700 KubeletHasDiskPressure kubelet has disk pressure
Ready True Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:43 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
So it seems the issue is that "kubelet has disk pressure", but I can't really figure out what that means. I can't SSH into minikube and check its disk space because I'm using VMware Workstation with --vm-driver=none.
This is an old question, but I just saw it, and since it doesn't have an answer yet I will write mine.
I was facing this problem: my pods were getting evicted many times because of disk pressure, and commands such as df or du were not helpful.
With the help of the answer that I wrote here, I found out that the main problem was the pods' log files: because K8s was not rotating them, they could grow to hundreds of gigabytes.
There are different log rotation methods available, but I am still searching for the best practice for K8s, so I can't suggest a specific one yet (a minimal Docker-level sketch follows below).
I hope this can be helpful.
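As one concrete, Docker-level option (not a Kubernetes feature), log size can be capped via the json-file logging driver. A hedged sketch with example values; merge it into any existing /etc/docker/daemon.json, and note that restarting Docker disrupts running containers:
# Cap each container log at 3 files of 50 MB
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}
EOF
sudo systemctl restart docker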
Personally, I couldn't solve the problem using kube commands because ...
It was said to be due to an antivirus (McAfee).
Reinstalling the company-endorsed Docker Desktop version solved the problem.
I had a similar issue.
My error log:
Warning FailedScheduling 3m23s default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate
For me, the / partition was filled to 82%. Cleaning up some unwanted folders resolved the issue.
Commands used:
ssh username@IP_or_hostname (log in to the worker node)
df -h (check the disk usage)
rm -rf folder_name (delete the unwanted folder; you are forcefully deleting it, so make sure you really want to delete it)
I hope this can save someone's time.
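If the space turns out to be consumed by unused container images and stopped containers rather than stray folders, a hedged alternative on a Docker-based node (review what will be removed before confirming):
# Show what is using Docker's disk space
docker system df
# Remove stopped containers, dangling images and unused networks (asks for confirmation)
docker system prune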
The community hinted at this in the comments above; I will try to consolidate it here.
The kubelet maps one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet reports a condition that reflects the node is under pressure.
DiskPressure
Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold.
So the problem might be that there is not enough disk space, or that the filesystem has run out of inodes. You have to learn the conditions of your environment and then apply them in your kubelet configuration (see the sketch at the end of this answer).
You do not need to ssh into the minikube since you are running it inside of your host:
--vm-driver=none - option that runs the Kubernetes components on the host and not in a VM. Docker is required to use this driver but no hypervisor. If you use --vm-driver=none, be sure to specify a bridge network for Docker. Otherwise it might change between network restarts, causing loss of connectivity to your cluster.
You might try to check if there are some issues related to the mentioned topics:
kubectl describe nodes
Look at df reports:
df -i
df -h
Some further reading so you can grasp the topic:
Configure Out Of Resource Handling - section Node Conditions.
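To make the last point concrete, a hedged sketch for inspecting and tuning the thresholds; the --eviction-hard flag and its signal names are the documented ones, but the percentage values below are examples, not recommendations:
# See which eviction settings the running kubelet was started with, if any
ps -o args= -C kubelet | tr ' ' '\n' | grep -i eviction
# Example flag form (can also be set as evictionHard in the kubelet config file):
# --eviction-hard=nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%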

InfluxDB container dies over time, and can't restart

I'm running a Kubernetes 1.2.3 cluster brought up with kube-up on AWS, on two m4.large nodes, and I'm using the auto-installed influx-grafana pod for cluster monitoring.
My problem is that after a week or two, the influx container dies and will not come up again. I'm a bit unsure which logs to check for relevant error messages, but the syslog on the minion running the container contained the following:
Jun 16 05:57:41 ip-172-22-29-244 kubelet[4434]: E0616 05:57:41.382751 4434 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"407635", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"monitoring-influxdb-grafana-v3-dlx9o", UID:"07c2a623-2b57-11e6-b7a9-068c6a09a769", APIVersion:"v1", ResourceVersion:"850776", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"influxdb\" with CrashLoopBackOff: \"Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)\"\n", Source:api.EventSource{Component:"kubelet", Host:"ip-172-22-29-244.eu-west-1.compute.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63600960004, nsec:0, loc:(*time.Location)(0x2e38da0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63601653461, nsec:379098581, loc:(*time.Location)(0x2e38da0)}}, Count:11023, Type:"Warning"}': 'events "monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade" not found' (will not retry!)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: I0616 05:57:54.378491 4434 manager.go:2050] Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: E0616 05:57:54.378545 4434 pod_workers.go:138] Error syncing pod 07c2a623-2b57-11e6-b7a9-068c6a09a769, skipping: failed to "StartContainer" for "influxdb" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)"
I've also seen indications that the container was originally OOM-killed.
My assumption is that the InfluxDB index just grows too large over time since there is no automatic cleanup, the container is killed by Kubernetes once the 500MB memory limit from the manifest is breached, and it fails to restart for the same reason or because it times out while reading the index.
Once this happens, the only way I've been able to get it up and running again is to kill the pod entirely to have Kubernetes re-create it from scratch, which basically means losing all existing data.
But what do I do about it? Changing the memory limits on kube-system pods seems to be non-trivial, and may only buy me a few more days anyway.
I could set up my own watchdog to clean up data, but only being able to keep 1-2 weeks of monitoring data kind of limits its value.
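One hedged option for that cleanup idea, assuming the bundled InfluxDB accepts InfluxQL retention-policy statements and ships the influx CLI; the pod and container names come from the logs above, while the database and retention-policy names ("k8s", "default") follow the usual Heapster setup and may differ in yours:
# Bound how much data InfluxDB keeps so the index cannot grow without limit
kubectl --namespace=kube-system exec monitoring-influxdb-grafana-v3-dlx9o -c influxdb -- \
  influx -execute 'ALTER RETENTION POLICY "default" ON "k8s" DURATION 2w DEFAULT'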