What is the default behaviour on GCP Kubernetes for Spring Boot applications when
System.exit(1)
is called? Will Kubernetes recreate the container?
If not, how can I force recreation if my application crashes?
-Alex
On a non-zero exit code, Kubernetes will restart the container inside the Pod (the default restartPolicy is Always), and the restart will be reflected in the Pod's READY and RESTARTS columns.
If you want to terminate the pod gracefully, you can have a look at this: https://dzone.com/articles/gracefully-shutting-down-java-in-containers
Terminated: Indicates that the container completed its execution and has stopped running. A container enters into this when it has successfully completed execution or when it has failed for some reason. Regardless, a reason and exit code is displayed, as well as the container’s start and finish time. Before a container enters into Terminated, preStop hook (if any) is executed.
...
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 30 Jan 2019 11:45:26 +0530
Finished: Wed, 30 Jan 2019 11:45:26 +0530
...
Please have a look at the official Kubernetes docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
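If the app is deployed through a Deployment, the default restartPolicy: Always already makes Kubernetes restart the container whenever the JVM exits (with exponential back-off, shown as CrashLoopBackOff). Below is a minimal sketch of such a manifest; the names, image, and Actuator health path are placeholders, not taken from the question:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-boot-app              # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spring-boot-app
  template:
    metadata:
      labels:
        app: spring-boot-app
    spec:
      restartPolicy: Always          # default; the container is restarted after System.exit(1)
      containers:
      - name: app
        image: gcr.io/my-project/spring-boot-app:latest   # placeholder image
        ports:
        - containerPort: 8080
        livenessProbe:               # optional: also restart the container if the app hangs without exiting
          httpGet:
            path: /actuator/health   # assumes Spring Boot Actuator is enabled
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10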
My apps are deployed as Debian packages and started via systemd services. The apps are crashing randomly, and I am unable to find the reason for the crashes.
I have 4 applications running (built using Java and Scala), out of which two are getting killed (named op and common). All are started via systemd services.
The errors in syslog are:
Jul 22 11:45:44 misqa mosquitto[2930]: Socket error on client 005056b76983-Common, disconnecting
Jul 22 11:45:44 misqa systemd[1]: commonmod.service: Main process exited, code=exited, status=143/n/a
Jul 22 11:45:44 misqa systemd[1]: commonmod.service: Unit entered failed state
Jul 22 11:45:44 misqa systemd[1]: commonmod.service: Failed with result 'exit-code'
Jul 22 11:45:44 misqa systemd[1]: opmod.service: Main process exited, code=exited, status=143/n/a
Jul 22 11:45:44 misqa systemd[1]: opmod.service: Unit entered failed state
Jul 22 11:45:44 misqa systemd[1]: opmod.service: Failed with result 'exit-code'
But I am not getting any errors in the application log files for either op or common.
When I read more, I understood that the crash is caused by a SIGTERM signal, but I am unable to find out what is sending it. None of these applications executes anything like killall.
Is there any way to identify which process is killing my applications?
My systemd service is like this:
[Unit]
Description=common Module
After=common-api
Requires=common-api
[Service]
TimeoutStartSec=0
ExecStart=/usr/bin/common-api
[Install]
WantedBy=multi-user.target
Basically, Java programs sometimes don't send back the expected exit status when shutting down in response to SIGTERM: the JVM exits with status 143, which is 128 + 15 (the SIGTERM signal number), and systemd treats any non-zero status as a failure.
You should be able to suppress this by declaring that exit code as a "success" status in the systemd service file:
[Service]
SuccessExitStatus=143
This solution was successfully applied here (Server Fault) and here (Stack Overflow), both with Java apps.
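Applied to the unit file from the question, the change would look like this (a sketch; only the SuccessExitStatus line is new):
[Unit]
Description=common Module
After=common-api
Requires=common-api

[Service]
TimeoutStartSec=0
ExecStart=/usr/bin/common-api
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target
After editing, run systemctl daemon-reload and restart the service so the new setting takes effect.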
Running OpenShift 4.1 on Kubernetes v1.13.4. I'm trying to add a second network (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See the kubelet logs below. I'm completely lost. How can I add another interface to my nodes?
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
...
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address:10.90.49.111 type:ExternalIP] map[type:ExternalIP address:10.90.51.94] map[address:10.90.49.111 type:InternalIP] map[address:10.90.51.94 type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:
The resolution was to delete the node from the cluster (kubectl delete node compute-0), reboot it, and let Ignition rejoin it to the cluster.
This is a known bug.
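For reference, the recovery described above amounts to something like this (a sketch; kubectl is run from a machine with cluster-admin access, and the reboot is triggered from the node's own console or an SSH session on it):
kubectl delete node compute-0    # remove the stale Node object from the API server
# reboot compute-0 (console, BMC, or SSH on the node itself)
kubectl get nodes -w             # watch the node re-register and return to Ready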
I am trying to stand up a stacked kubeadm cluster with three masters. I get this error from my kubeadm init command:
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use cgroupfs; I use the systemd cgroup driver.
And my kubelet complains that it does not know its node name:
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
The issue can be caused by the Docker version: the latest Kubernetes release at the time (v1.13.x) only validates Docker versions up to 18.06.
I actually had the same issue, and it was resolved after downgrading Docker from 18.09 to 18.06.
If the problem is not related to Docker, it might be because the kubelet service failed to establish a connection to the API server.
I would first check the status of the kubelet with systemctl status kubelet and consider restarting it with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
In my case, my Kubernetes version was 1.21.1 and my Docker version was 19.03. I solved this by upgrading Docker to version 20.7.
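Since the question mentions using the systemd cgroup driver, another common cause of this exact failure is Docker and the kubelet disagreeing on the cgroup driver. A minimal /etc/docker/daemon.json that puts Docker on the systemd driver (per the Kubernetes container-runtime documentation) looks like this:
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
Then restart Docker and the kubelet (systemctl restart docker, systemctl restart kubelet) before retrying kubeadm init (after a kubeadm reset).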
After creating a simple hello world deployment, my pod status shows "Pending". When I run kubectl describe pod on the pod, I get the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14s (x6 over 29s) default-scheduler 0/1 nodes are available: 1 NodeUnderDiskPressure.
If I check on my node health, I get:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:33 -0700 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:33 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:43 -0700 KubeletHasDiskPressure kubelet has disk pressure
Ready True Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:43 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
So it seems the issue is that "kubelet has disk pressure", but I can't really figure out what that means. I can't SSH into minikube to check its disk space because I'm using VMware Workstation with --vm-driver=none.
This is an old question, but I just saw it, and because it doesn't have an answer yet I will write my own.
I was facing this problem and my pods were getting evicted many times because of disk pressure, and commands such as df or du were not helpful.
With the help of the answer that I wrote here, I found out that the main problem is the pods' log files: because Kubernetes does not rotate container logs itself, they can grow to hundreds of gigabytes.
There are different log rotation methods available, but I am still searching for the best practice for Kubernetes, so I can't suggest a specific one yet.
I hope this can be helpful.
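One concrete option, if the nodes run Docker with the json-file logging driver: Docker itself can cap and rotate container logs. A hedged sketch of /etc/docker/daemon.json (the size and file count are arbitrary examples, not a recommendation):
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Restart Docker afterwards; the limits apply to containers created after the change.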
Personally, I couldn't solve the problem using kube commands because ...
It was said to be due to an antivirus (McAfee).
Reinstalling the company-endorsed Docker Desktop version solved the problem.
I had a similar issue.
My error log:
Warning FailedScheduling 3m23s default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate
For me, the / partition was filled to 82%. Cleaning up some unwanted folders resolved the issue.
Commands used:
ssh username@IP_or_hostname (log in to the worker node)
df -h (check the disk usage)
rm -rf folder_name (delete the unwanted folder; this deletes it forcefully, so make sure you really want it gone).
I hope this can save someone's time.
The community hinted at this in the comments above; I will try to consolidate it.
The kubelet maps one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet reports a condition that reflects the node is under pressure.
DiskPressure
Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold
So the problem might be insufficient disk space, or a filesystem that has run out of inodes. You have to find out which condition applies in your environment and then reflect it in your kubelet configuration.
You do not need to SSH into minikube, since you are running it directly on your host:
--vm-driver=none - option that runs the Kubernetes components on the host and not in a VM. Docker is required to use this driver but no hypervisor. If you use --vm-driver=none, be sure to specify a bridge network for docker. Otherwise it might change between network restarts, causing loss of connectivity to your cluster.
You might try to check whether there are issues related to the topics mentioned above:
kubectl describe nodes
Look at df reports:
df -i
df -h
Some further reading so you can grasp the topic:
Configure Out Of Resource Handling - section Node Conditions.
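If you then need to tune the thresholds themselves, the kubelet takes them in its KubeletConfiguration. A sketch; the values below mirror the documented defaults and are placeholders to adjust, not a recommendation:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                  # hard thresholds; crossing the nodefs/imagefs ones sets DiskPressure
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"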
I have deployed a 4 node Couchbase cluster using Google GKE.
The master node exposes ports 8091 and 8093 to the load balancer.
When connecting to the load balancer's external IP from a Java app to insert data, I get a timeout error with this stack trace:
Apr 04, 2017 3:32:15 PM com.couchbase.client.core.endpoint.AbstractEndpoint$2 operationComplete
WARNING: [null][ViewEndpoint]: Socket connect took longer than specified timeout.
Apr 04, 2017 3:32:15 PM com.couchbase.client.core.endpoint.AbstractEndpoint$2 operationComplete
WARNING: [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
Apr 04, 2017 3:32:15 PM com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise notifyListener0
WARNING: An exception was thrown by com.couchbase.client.core.endpoint.AbstractEndpoint$2.operationComplete()
rx.exceptions.OnErrorNotImplementedException: connection timed out: /10.4.0.3:8093
at rx.Observable$26.onError(Observable.java:7955)
at rx.observers.SafeSubscriber._onError(SafeSubscriber.java:159)
at rx.observers.SafeSubscriber.onError(SafeSubscriber.java:120)
at rx.internal.operators.OperatorMap$1.onError(OperatorMap.java:48)
What's puzzling is that the stack trace shows 10.4.0.3:8093, which is actually the IP of the Docker container.
Appreciate all suggestions.
Have you checked the firewall rules for the master node and the workers? You need to allow ingress for the ports you have set up.
See this answer
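On GKE that usually means adding a VPC firewall rule for the Couchbase ports. A hedged sketch with gcloud; the rule name, network, source range, and target tag are placeholders for your own values, and note that the Java SDK typically also needs the key-value port 11210 in addition to 8091/8093:
gcloud compute firewall-rules create allow-couchbase \
  --network=default \
  --allow=tcp:8091-8094,tcp:11210 \
  --source-ranges=203.0.113.0/24 \
  --target-tags=couchbase-nodes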