Why pods aren't rescheduled when the remote kubelet is unreachable - kubernetes

I'm currently doing some tests on a kubernetes cluster.
I was wondering why the pods aren't rescheduled in some cases :
When the node is unreachable
When the remote kubelet doesn't answer
Actually the only case when a pod got rescheduled is when the kubelet notify the master.
Is it on purpose ? Why ?
If i shut down a server where there's a rc with a unique pod running, my service is down.
Maybe there's something i misunderstood.
Regards,
Smana

There is a quite long default timeout for detecting unreachable nodes and for re-scheduling pods, maybe you did not wait long enough?
You can adjust the timeouts with several flags:
node-status-update-frequency on the kubelet (http://kubernetes.io/v1.0/docs/admin/kubelet.html)
node-monitor-grace-period and pod_eviction_timeout on the kube-controller-manager (http://kubernetes.io/v1.0/docs/admin/kube-controller-manager.html)

Related

Kubernetes worker node went down, what will happen to the pod?

I have setup a EKS in AWS, setup 2 worker node and configured the autosclaing on those nodes with 3 as desired capacity.
Sometime my worker node goes down due to "an EC2 health check indicating it has been terminated or stopped." which results my pod get restarted. I have not enabled any replicas for the pods. It is one now.
Just wanted to know, how can my services (pod) will be highly available despite of any worked node goes down or restart?
If you have only one pod for your service, then your service is NOT highly available. It is a single point of failure. If that pod dies or is restarted, as has happened here, then during the time the pod is being restarted, your service is dead.
You need a bare minimum, TWO pods for a service to be highly available, they they should be on different nodes (you can force Kuberentes to schedule the pods on different nodes using pod antiaffinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) so that if one node goes down as in your example, it takes out only pod, leaving the other pod(s) to handle the requests until the other pod can be rescheduled.

Kubernetes scaling pods by number of active connections

I have a kubernetes cluster that runs some legacy containers (windows containers) .
To simplify , let's say that the container can handle max 5 requests at a time something like
handleRequest(){
requestLock(semaphore_Of_5)
sleep(2s)
return "result"
}
So the cpu is not spiked . I need to scale based on nr of active connections
I can see from the documentation https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables
You can use Pod readiness probes to verify that backend Pods are working OK, so that kube-proxy in iptables mode only sees backends that test out as healthy. Doing this means you avoid having traffic sent via kube-proxy to a Pod that’s known to have failed.
So there is a mechanism to make pods available for routing new requests but it is the livenessProbe that actually mark the pod as unhealthy and subject to restart policy. But my pods are just busy. They don't need restarting.
How can I increase the nr of pods in this case ?
You can enable HPA for the deployment.
You can autoscale on the no of requests metrics and perform autoscaling on this metric.
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-metrics-not-related-to-kubernetes-objects
I would also recommend to configure liveness probe failureThreshold and timeoutSeconds, check if it helps.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

Does Liveness Probes restart or kill the Pod?

I had read on the documentation that liveness probes make a new pod and stop the other one. But in the kubernetes dashboard it shows me only restarts with my tcp livness probe. I was wondering what kubernetes does during a liveness probe. Can i control it?
The kubelet uses liveness probes to know when to restart a Container, not recreate the pods.
Probes have a number of fields that you can use to more precisely control the behavior of the checks (initialDelaySeconds,periodSeconds, timeoutSeconds, successThreshold and failureThreshold). You can find details about them here.
For container restart, SIGTERM is first sent with waits for a parameterized grace period, and then Kubernetes sends SIGKILL. You can control some of this behavior by tweaking the terminationGracePeriodSeconds value and/or Attaching Handlers to Container Lifecycle Events.

Kubernetes pod eviction schedules evicted pod to node already under DiskPressure

We are running a kubernetes (1.9.4) cluster with 5 masters and 20 worker nodes. We are running one statefulset pod with replication 3 among other pods in this cluster. Initially the statefulset pods are distributed to 3 nodes. However the pod-2 on node-2 got evicted due to the disk pressure on node-2. However, when the pod-2 is evicted it went to node-1 where pod-1 was already running and node-1 was already experiencing node pressure. As per our understanding, the kubernetes-scheduler should not have scheduled a pod (non critical) to a node where there is already disk pressure. Is this the default behavior to not schedule the pods to a node under disk pressure or is it allowed. The reason is, at the same time we do observe, node-0 without any disk issue. So we were hoping that evicted pod on node-2 should have ideally come on node-0 instead of node-1 which is under disk pressure.
Another observation we had was, when the pod-2 on node-2 was evicted, we see that same pod is successfully scheduled and spawned and moved to running state in node-1. However we still see "Failed to admit pod" error in node-2 for many times for the same pod-2 that was evicted. Is this any issue with the kube-scheduler.
Yes, Scheduler should not assign a new pod to a node with a DiskPressure Condition.
However, I think you can approach this problem from few different angles.
Look into configuration of your scheduler:
./kube-scheduler --write-config-to kube-config.yaml
and check it needs any adjustments. You can find info about additional options for kube-scheduler here:
You can also configure aditional scheduler(s) depending on your needs. Tutorial for that can be found here
Check the logs:
kubeclt logs: kube-scheduler events logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log (on the master)
Look more closely at Kubelet's Eviction Thresholds (soft and hard) and how much node memory capacity is set.
Bear in mind that:
Kubelet may not observe resources pressure fast enough
or
Kubelet may evict more Pods than needed due to stats collection timing gap
Please check out my suggestions and let me know if they helped.

How to kickoff the dead replicas of Kubernetes Deployment

Now we have deployed services as Kubernetes Deployments with multiple replicas. Once the server crashes, Kubernetes will migrate its containers to another available server which tasks about 3~5 minutes.
While migrating, the client can access the the Deployment service because we still have other running replicas. But sometimes the requests fail because the load balancer redirect to the dead or migrating containers.
It would be great if Kubernetes could kickoff the dead replicas automatically and add them once they run in other servers. Otherwise, we need to setup LB like haproxy to do the same job with multiple Deployment instances.
You need to configure health checking to have properly working load balancing for a Service. Please have a read of:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
1、kubelet
--node-status-update-frequency duration
Specifies how often kubelet posts node status to master. Note: be cautious when changing the constant, it must work with nodeMonitorGracePeriod in nodecontroller. Default: 10s (default 10s)
2、controller-manager
--node-monitor-grace-period duration
Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status. (default 40s)
--pod-eviction-timeout duration
The grace period for deleting pods on failed nodes. (default 5m0s)