I have a Kubernetes cluster on AWS with Cluster Autoscaler (a component that automatically adjusts the desired number of nodes based on usage). The cluster previously had node A in AZ-1 and node B in AZ-2. When I deploy my StatefulSet with a dynamic PVC, the PVC and PV are created in AZ-2, and the pods are created on node B.
I deleted the StatefulSet to perform some testing. The Cluster Autoscaler decided that one node is now enough and adjusted the desired number down to 1. Now that node B is deleted, when I redeploy my StatefulSet, the pods are stuck in a Pending state and can't be created on node A, with the following error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m8s (x997 over 18h) default-scheduler 0/1 nodes are available: 1 node(s) had volume node affinity conflict.
Normal NotTriggerScaleUp 95s (x6511 over 18h) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
I know this is because the PVs were created in AZ-2 and can't be attached to pods scheduled in AZ-1, but how do I overcome this issue?
Use multiple node groups, each scoped to a single Availability Zone. In addition, you should enable the --balance-similar-node-groups feature.
https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html
Important
If you are running a stateful application across multiple Availability Zones that is backed by Amazon EBS volumes and using the Kubernetes Cluster Autoscaler, you should configure multiple node groups, each scoped to a single Availability Zone. In addition, you should enable the --balance-similar-node-groups feature.
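As a rough sketch of what that can look like with eksctl (the cluster name, region, AZs and sizes below are placeholders, not taken from the question):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: eu-west-1       # placeholder
managedNodeGroups:
  - name: ng-az-1
    availabilityZones: ["eu-west-1a"]   # one node group pinned to a single AZ
    minSize: 1
    maxSize: 3
  - name: ng-az-2
    availabilityZones: ["eu-west-1b"]
    minSize: 1
    maxSize: 3

The --balance-similar-node-groups flag itself is passed as an argument to the cluster-autoscaler container in its Deployment, e.g. --balance-similar-node-groups=true.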
Related
I am going to implement PDB on AKS. Can someone please tell me why we need it when we can use the node autoscaler?
Also, does a PDB allow zero unavailability by creating a node when one of the nodes fails?
A PDB lets you set rules that must be respected before your pods are evicted from a node.
Let's say you have a two-node cluster and a Deployment with 1 replica, and you want to update your nodes.
kubectl drain will cordon node 1 so no pods can be scheduled on that node
kubectl drain will evict the pod scheduled on node 1
the scheduler will then place your pod on node 2
Now if you set a PDB with minAvailable: 50%, that drain command would fail, as it would violate the rule.
Without the PDB, the pod is simply killed first and the scheduler then tries to place it somewhere else.
A PDB helps you prevent downtime by budgeting how many pods can be evicted at a time.
Scenario without PDB
You update node 1 and node 2 cannot host the evicted pod:
pod is killed on node 1
the scheduler cannot place the pod anywhere
autoscaling provisions a third node
pod is scheduled on that new node
During that whole time your evicted pod was not running anywhere and your application was down.
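For reference, a minimal PodDisruptionBudget for the scenario above could look like this (the name and app label are hypothetical and must match your own Deployment's pods; use policy/v1beta1 on clusters older than 1.21):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  minAvailable: "50%"         # a drain is refused if it would drop availability below this
  selector:
    matchLabels:
      app: my-app             # hypothetical label selecting the Deployment's pods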
I have a pod that won't start due to a volume node affinity conflict. This is a bare-metal cluster, so it's unrelated to regions. The pod has 4 persistent volume claims which are all reporting Bound, so I'm assuming it's not one of those. There are 4 nodes; one of them is tainted so that the pod will not start on it, and one of them is tainted specifically so that the pod WILL start on it. That's the only affinity I have set up, to my knowledge. The message looks like this:
0/4 nodes are available: 1 node(s) had taint {XXXXXXX},
that the pod didn't tolerate, 1 node(s) had volume node
affinity conflict, 2 Insufficient cpu, 2 Insufficient memory.
This is what I would have expected, apart from the volume affinity conflict. There are no other affinities set other than the one pointing it at this node. I'm really not sure why it's doing this or where to even begin; the message isn't super helpful. It does NOT say which node or which volume there is a problem with. The one thing I don't really understand is how binding works: one of the PVCs is mapped to a PV on another node, yet it is reporting as Bound, so I'm not completely certain whether that's the problem. I am using local-storage as the storage class. I'm wondering if that's the problem, but I'm fairly new to Kubernetes and I'm not sure where to look.
You have 4 nodes, but none of them is available for scheduling, each due to a different set of conditions. Note that each node can be affected by multiple issues, so the numbers can add up to more than your total number of nodes. Let's try to address these issues one by one:
Insufficient memory: Execute kubectl describe node <node-name> to check how much free memory is available there. Check the requests and limits of your pods. Note that Kubernetes reserves the full amount of memory a pod requests regardless of how much the pod actually uses.
Insufficient cpu: Analogous to the above.
node(s) had volume node affinity conflict: Check whether the nodeAffinity of your PersistentVolume (kubectl describe pv) matches the node labels (kubectl get nodes), and whether the nodeSelector in your pod also matches. Make sure you set up the Affinity and/or AntiAffinity rules correctly (a quick way to compare the PV and node sides is sketched after this list). More details on that can be found here.
node(s) had taint {XXXXXXX}, that the pod didn't tolerate: You can use kubectl describe node to check taints and kubectl taint nodes <node-name> <taint-name>- to remove them. Check the Taints and Tolerations documentation for more details.
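To compare the two sides mentioned in the volume node affinity item (the PV name and hostname below are illustrative):

# What the PV requires:
kubectl describe pv <pv-name> | grep -A 6 "Node Affinity"
# What the nodes actually offer:
kubectl get nodes --show-labels

A local-storage PV typically carries a nodeAffinity section like the following, and the hostname in it must match the node that really holds the disk:

nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - node-1        # illustrative; must be the node with the local disk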
We are seeing an issue with the GKE Kubernetes scheduler being unable or unwilling to schedule DaemonSet pods on nodes in an autoscaling node pool.
We have three node pools in the cluster; however, the pool-x pool is used exclusively to schedule a single Deployment backed by an HPA, with the nodes in this pool having the taint "node-use=pool-x:NoSchedule" applied to them. We have also deployed a filebeat DaemonSet on which we have set a very lenient tolerations spec of operator: Exists (hopefully this is correct) to ensure the DaemonSet is scheduled on every node in the cluster.
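For context, that tolerations spec sits in the DaemonSet's pod template and is roughly the following (a sketch of the standard field, not our exact manifest):

spec:
  template:
    spec:
      tolerations:
        - operator: Exists    # tolerates every taint, including node-use=pool-x:NoSchedule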
Our assumption was that, as pool-x is auto-scaled up, the filebeat DaemonSet pod would be scheduled on a new node before any of the other pods assigned to that node. However, we are noticing that as new nodes are added to the pool, the filebeat pods fail to be placed on the node and stay in a perpetual "Pending" state. Here is an example of the describe output (truncated) of the filebeat DaemonSet:
Number of Nodes Scheduled with Up-to-date Pods: 108
Number of Nodes Scheduled with Available Pods: 103
Number of Nodes Misscheduled: 0
Pods Status: 103 Running / 5 Waiting / 0 Succeeded / 0 Failed
And the events for one of the "Pending" filebeat pods:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18m (x96 over 68m) default-scheduler 0/106 nodes are available: 105 node(s) didn't match node selector, 5 Insufficient cpu.
Normal NotTriggerScaleUp 3m56s (x594 over 119m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 6 node(s) didn't match node selector
Warning FailedScheduling 3m14s (x23 over 15m) default-scheduler 0/108 nodes are available: 107 node(s) didn't match node selector, 5 Insufficient cpu.
As you can see, the node does not have enough resources to schedule the filebeat pod: the CPU requests are exhausted by the other pods running on the node. However, why is the DaemonSet pod not placed on the node before any other pods are scheduled? It seems like the very definition of a DaemonSet necessitates priority scheduling.
Also of note, if I delete a pod on a node where filebeat is stuck "Pending" because its CPU requests cannot be satisfied, filebeat is immediately scheduled on that node, indicating that there is some scheduling precedence being observed.
Ultimately, we just want to ensure the filebeat Daemonset is able to schedule a pod on every single node in the cluster and have that priority work nicely with our cluster autoscaling and HPAs. Any ideas on how we can achieve this?
We'd like to avoid having to use Pod Priority, as it's apparently an alpha feature in GKE and we are unable to make use of it at this time.
The behavior you are expecting, with DaemonSet pods being scheduled first on a node, is no longer the reality (since 1.12). Since 1.12, DaemonSet pods are handled by the default scheduler and rely on pod priority to determine the order in which pods are scheduled. You may want to consider creating a PriorityClass specific to DaemonSets with a relatively high value to ensure they are scheduled ahead of most of your other pods.
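A minimal sketch of that approach (the class name and value are placeholders; use scheduling.k8s.io/v1beta1 on older clusters):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-priority        # placeholder name
value: 1000000                    # relatively high so DaemonSet pods are scheduled ahead of most workloads
globalDefault: false
description: "Priority for cluster-wide DaemonSets such as filebeat"

The DaemonSet then references it in its pod template via priorityClassName: daemonset-priority.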
Before Kubernetes 1.12, DaemonSets were scheduled by their own controller; from that version on, DaemonSet pods are scheduled by the default scheduler, in the hope that priority, preemption and tolerations cover all the cases.
If you want DaemonSet scheduling to be handled by the DaemonSet controller instead, check the ScheduleDaemonSetPods feature gate.
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
I have a Kubernetes cluster deployed on GCP with a single node, 4 CPUs and 15 GB of memory. There are a few pods, and all of them are bound to a persistent volume through a persistent volume claim. I have observed that the pods have restarted automatically and the data in the persistent volume is lost.
After some research, I suspect that this could be because of the pod eviction policy. When I used kubectl describe pod, I noticed the error below.
0/1 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
The restart policy of my pods is "always". So I think that the pods have restarted after being resource deprived.
How do I identify the pod eviction policy of my cluster and change it, so that this does not happen in the future?
pod eviction policy of my cluster and change it
These thresholds (pod eviction) are kubelet flags; you can tune these values according to your requirements by editing the kubelet config file, see the config-file documentation for details.
Dynamic Kubelet Configuration allows you to edit these values in the live cluster
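For example, the eviction thresholds live in the kubelet configuration file and look roughly like this (the values shown are the common defaults, given only as an illustration):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"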
The restart policy of my pods is "always". So I think that the pods have restarted after being resource deprived.
Your pod has been rescheduled due to a node issue (not enough disk space).
The restart policy of my pods is "always".
It means that if the pod is not up and running, kubelet will try to restart it.
I am trying to create a pod but getting the following error:
0/3 nodes are available: 1 node(s) had no available volume zone.
I tried to attach more volumes but the error is still the same.
Warning FailedScheduling 2s (x14 over 42s) default-scheduler 0/3 nodes are available: 1 node(s) had no available volume zone, 2 node(s) didn't have free ports for the requested pod ports.
My problem was that the AWS EC2 Volume and Kubernetes PersistentVolume (PV) state got somehow out of sync / corrupted. Kubernetes believed there was a bound PV while the EC2 Volume showed as "available", not mounted to a worker node.
Update: The volume was in a different availability zone than either of the 2 EC2 nodes and thus could not be attached to them.
The solution was to delete all relevant resources - StatefulSet, PVC (crucial!), PV. Then I was able to apply them again and Kubernetes succeeded in creating a new EC2 Volume and attaching it to the instance.
As you can see in my configuration, I have a StatefulSet with a "volumeClaimTemplate" (=> PersistentVolumeClaim, PVC) (and a matching StorageClass definition) so Kubernetes should dynamically provision an EC2 Volume, attach it to a worker and expose it as a PersistentVolume.
See kubectl get pvc, kubectl get pv and in the AWS Console - EC2 - Volumes.
NOTE: "Bound" = the PV is bound to the PVC.
Here is a description of a laborious way to restore a StatefulSet on AWS if you have a snapshot of the EBS volume (5/2018): https://medium.com/@joatmon08/kubernetes-statefulset-recovery-from-aws-snapshots-8a6159cda6f1