How to make sure the Kubernetes autoscaler does not delete nodes that run a specific pod

I am running a Kubernetes cluster (an AWS EKS cluster) with the Cluster Autoscaler pod, so that the cluster autoscales according to the resource requests within the cluster.
The cluster also shrinks the number of nodes when the load is reduced. As I observed, the autoscaler can delete any node in this process.
I want to control this behavior, for example by asking the autoscaler to stop deleting nodes that run a specific pod.
For example, if a node runs the Jenkins pod, the autoscaler should skip that node and delete another matching node from the cluster.
Is there a way to achieve this requirement? Please share your thoughts.

You can use the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation on the pod template:
...
template:
  metadata:
    labels:
      app: jenkins
    annotations:
      "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
  spec:
    nodeSelector:
      failure-domain.beta.kubernetes.io/zone: us-west-2b
...

You should set a pod disruption budget that references specific pods by label. If you want to ensure that there is at least one Jenkins worker pod running at all times, for example, you could create a PDB like
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: jenkins-worker-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: jenkins
      component: worker
(adapted from the basic example in Specifying a Disruption Budget in the Kubernetes docs).
Doing this won't prevent nodes from being destroyed; the cluster autoscaler is still free to scale things down. What it will do is temporarily delay destroying a node until the disruption budget can be met again.
For example, say you've configured your Jenkins setup so that there are three workers. Two get scheduled on the same node, and the autoscaler takes that node offline. The ordinary Kubernetes Deployment system will create two new replicas on nodes that still exist. If the autoscaler also decides it wants to destroy the node that has the last worker, the pod disruption budget above will prevent it from doing so until at least one other worker is running.
When you say "the Jenkins pod" in the question, there are two other important implications to this. One is that you should almost always configure your applications using higher-level objects like Deployments or StatefulSets and not bare Pods. The other is that it is generally useful to run multiple copies of things for redundancy if nothing else. Even absent the cluster autoscaler, disks fail, Amazon occasionally arbitrarily decommissions EC2 instances, and nodes otherwise can go offline outside of your control; you often don't want just one copy of something running in your cluster, especially if you're considering it a critical service.
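Pulling those two points together, a rough (hypothetical) sketch of a Jenkins worker Deployment with two replicas and the labels the PDB above selects on could look like this; the image is a placeholder for whatever agent image you actually run:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins-worker
spec:
  replicas: 2                      # at least two copies so the PDB above can always be satisfied
  selector:
    matchLabels:
      app: jenkins
      component: worker
  template:
    metadata:
      labels:
        app: jenkins
        component: worker
    spec:
      containers:
      - name: worker
        image: jenkins/inbound-agent   # placeholder image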

In the Cluster Autoscaler FAQ on GitHub you can read the following:
What types of pods can prevent CA from removing a node?
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
  are not run on the node by default, *
  don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
Pods with local storage. *
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
*Unless the pod has the following annotation (supported in CA 1.0.3 or later): "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
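If you instead want the autoscaler to be allowed to move one of the otherwise-blocking pods, that last annotation can be applied the other way around. A minimal sketch (the pod name is a placeholder); note that an annotation added to a live pod is lost when the pod is recreated, so for anything permanent put it in the controller's pod template instead:
kubectl -n kube-system annotate pod metrics-server-abc123 \
  cluster-autoscaler.kubernetes.io/safe-to-evict="true"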

Related

Can a node autoscaler automatically start an extra pod when replica count is 1 & minAvailable is also 1?

Our autoscaling (horizontal and vertical) works pretty well, except that downscaling is somehow not working (yes, we checked the usual suspects like https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why ).
Since we want to save resources and have pods which are not ultra-sensitive, we are setting the following:
Deployment
  replicas: 1
PodDisruptionBudget
  minAvailable: 1
HorizontalPodAutoscaler
  minReplicas: 1
  maxReplicas: 10
But it now seems that this is why the autoscaler is not scaling down the nodes (even though the node is only at about 30% CPU + memory utilization and we have other nodes with more than enough memory + CPU to take these pods).
Is it possible in general that the auto scaler starts an extra pod on the free node and removes the old pod from the old node?
Is it possible in general that the auto scaler starts an extra pod on the free node and removes the old pod from the old node?
Yes, that should be possible in general, but in order for the cluster autoscaler to remove a node, it must be possible to move all pods running on the node somewhere else.
According to the docs there are a few types of pods that are not movable:
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
  are not run on the node by default
  don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc).
Pods with local storage.
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
Pods that have the following annotation set:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
You could check the cluster autoscaler logs; they may provide a hint as to why no scale-in happens:
kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
Without more information about your setup it is hard to guess what is going wrong, but unless you are using local storage, node selectors, affinity/anti-affinity rules, etc., Pod disruption budgets are a likely candidate. Even if you are not using them explicitly, they can still prevent node scale-in if there are pods in the kube-system namespace that are missing pod disruption budgets (see this answer for an example of such a scenario in GKE).
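In the setup described in the question, the combination of replicas: 1 and minAvailable: 1 is itself the most restrictive PDB possible: the single pod may never be voluntarily evicted, so the autoscaler can never drain its node. Two common ways to loosen this, sketched here with a hypothetical app: my-app label:
# Option A: keep one replica but allow it to be disrupted briefly
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1        # or minAvailable: 0
  selector:
    matchLabels:
      app: my-app
Option B is to raise the Deployment to replicas: 2 and keep minAvailable: 1, so there is always one pod that can be moved.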

Google Kubernetes: worker pool not scaling down to zero

I'm setting up a cluster on Google Kubernetes Engine (GKE) to run some heavy jobs. I have a render-pool of big machines that I want to autoscale from 0 to N (using the cluster autoscaler). My default-pool is a cheap g1-small to run the system pods (those never go away, so the default pool can't autoscale to 0, too bad).
My problem is that the render-pool doesn't want to scale down to 0. It has some system pods running on it; are those the problem? The default pool has plenty of resources to run all of them as far as I can tell. I've read the autoscaler FAQ, and it looks like it should delete my node after 10 min of inactivity. I've waited an hour though.
I created the render pool like this:
gcloud container node-pools create render-pool-1 --cluster=test-zero-cluster-2 \
  --disk-size=60 --machine-type=n2-standard-8 --image-type=COS \
  --disk-type=pd-standard --preemptible --num-nodes=1 --max-nodes=3 --min-nodes=0 \
  --enable-autoscaling
The cluster-autoscaler-status configmap says ScaleDown: NoCandidates and it is probing the pool frequently, as it should.
What am I doing wrong, and how do I debug it? Can I see why the autoscaler doesn't think it can delete the node?
As pointed out in the comments, some pods, under specific circumstances, will prevent the CA from downscaling.
In GKE, you have logging pods (fluentd), kube-dns, monitoring, etc., all considered system pods. This means that any node where they're scheduled will not be a candidate for downscaling.
Considering this, it all boils down to creating a scenario where all the previous conditions for downscaling are met.
Since you only want to scale down a specific node pool, I'd use taints and tolerations to keep system pods in the default pool.
For GKE specifically, you can pick each app by their k8s-app label, for instance:
$ kubectl taint nodes GPU-NODE k8s-app=heapster:NoSchedule
This will prevent the tainted nodes from scheduling Heapster.
Not recommended, but you can go broader and try to get all the GKE system pods using kubernetes.io/cluster-service instead:
$ kubectl taint nodes GPU-NODE kubernetes.io/cluster-service=true:NoSchedule
Just be careful, as the scope of this label is broader and you'll have to keep track of upcoming changes, as this label is possibly going to be deprecated someday.
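The flip side of the taint is that your own render workload then needs a matching toleration, otherwise it will also be repelled from the render pool. A sketch, assuming the taint from the first command above, added to the render job's pod spec:
tolerations:
- key: "k8s-app"
  operator: "Equal"
  value: "heapster"
  effect: "NoSchedule"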
Another thing that you might want to consider is using Pod Disruption Budgets. This might be more effective for stateless workloads, but setting it very tight is likely to cause instability.
The idea of a PDB is to tell GKE the minimum number of pods that must be running at any given time, allowing the CA to evict the others. It can be applied to system pods like below:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dns-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
This tells GKE that, although there are usually 3 replicas of kube-dns, the application can tolerate 2 disruptions and sustain itself temporarily with only 1 replica, allowing the CA to evict these pods and reschedule them on other nodes.
As you probably noticed, this will put stress on DNS resolution in the cluster (in this particular example), so be careful.
Finally, regarding how to debug the CA: for now, consider that GKE is a managed version of Kubernetes where you don't really have direct access to tweak some features (for better or worse). You cannot set flags on the CA, and access to its logs could be through GCP support. The idea is to protect the workloads running in the cluster rather than to optimize for cost.
Downscaling in GKE is more about using different features in Kubernetes together until the CA conditions for downscaling are met.

Does ReplicaSets replace Pods?

I have a conceptual question: do ReplicaSets use Pod settings?
Before I applied my ReplicaSet I deleted my Pods, so there is no information about my old Pods. If I apply the ReplicaSet now, does it reference the Pod settings, with all settings like readinessProbe/livenessProbe ... ?
My question came up because my replicaset.yml has a container section where I specified my Docker image. Why does it need that information? Isn't it redundant, since this information is already in my pods.yml?
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: test1
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: test1
        image: test/test
Pods are the smallest deployable units of computing that can be created and managed in Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers.
See https://kubernetes.io/docs/concepts/workloads/pods/pod/.
So, you can specify how your Pod will be scheduled (one or more containers, ports, probes, volumes, etc.).
But in case of a node failure or anything bad that can harm the Pod, that Pod won't be rescheduled (you would have to reschedule it manually). So, in that case, you need a controller. Kubernetes provides several controllers (each one for a different purpose). They are -
ReplicaSet
ReplicationController
Deployment
StatefulSet
DaemonSet
Job
CronJob
All of the above controllers, together with the Pod, are called Workloads, because they all have a podTemplate section and they all create some number of identical Pods, as specified by the spec.replicas field (if this field exists in the corresponding workload manifest). They are all higher-level concepts than the Pod.
Though the Deployment is more suitable than the ReplicaSet, this answer focuses on the ReplicaSet over the Pod, because the question is about the Pod versus the ReplicaSet.
In addition, each of the above controllers has its own purpose. For example, a ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
A ReplicaSet contains a pod template and a selector to identify and acquire Pod(s). The pod template specifies the configuration of the new Pods it should create to meet the replica count. The ReplicaSet creates and deletes Pod(s) as needed to reach the desired number; when it needs to create new Pod(s), it uses its pod template.
The Pod(s) maintained by a ReplicaSet have a metadata.ownerReferences field, which tells which resource owns the current Pod(s).
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the OwnerReference is not a controller and it matches a ReplicaSet’s selector, it will be immediately acquired by said ReplicaSet.
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/
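If you want to see that ownership on a live object, one way (a sketch; the pod name is hypothetical) is to print the ownerReferences field:
kubectl get pod test1-abc12 -o jsonpath='{.metadata.ownerReferences[*].kind} {.metadata.ownerReferences[*].name}'
# prints something like: ReplicaSet test1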
Now, it's time to answer your questions.
Since a ReplicaSet is one of the Pod controllers (listed above), it obviously needs a podTemplate (using this template, your Pods will be scheduled). All of the Pods the ReplicaSet creates will have the same Pod configuration (same containers, same ports, same readiness/liveness probes, volumes, etc.). Having this podTemplate is not redundant info; it's needed. So, if you have a Pod controller like a ReplicaSet (or another one, as you need), you don't need the Pod itself anymore, because the ReplicaSet (or the other controller) will create the Pod(s) for you.
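For example, a ReplicaSet that carries the probe settings inside its pod template could look roughly like the sketch below (using the current apps/v1 API; the image and probe endpoint are placeholders):
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: test1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: test1
        image: test/test              # placeholder image
        readinessProbe:
          httpGet:
            path: /healthz            # hypothetical endpoint
            port: 8080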
I guess you got the answer.
Does ReplicaSets replace Pods?
Yes, if you have replicaset.yml you don't need pods.yml.
I have a conceptual question: do ReplicaSets use Pod settings?
Before I applied my ReplicaSet I deleted my Pods, so there is no information about my old Pods. If I apply the ReplicaSet now, does it reference the Pod settings, with all settings like readinessProbe/livenessProbe ... ?
No, the ReplicaSet manifest has to contain the Pod specification in order to determine what is the configuration of the pods that should be deployed.
With the labels, you link the ReplicaSet to running Pods.
You don't link the ReplicaSet.yml manifest to the Pods.yml manifest.
Don’t use naked Pods (that is, Pods not bound to a ReplicaSet or Deployment) if you can avoid it. Naked Pods will not be rescheduled in the event of a node failure.
In 99% of the cases, there isn't a separate pods.yml manifest.
The pods + the ReplicaSet are defined in a single manifest, hence the containers section in the replicaset.yml.

How to fix "pods is not balanced" in Kubernetes cluster

Pods don't balance across the node pool. Why don't they spread to each node?
I have 9 instances in 1 node pool. In the past, I tried adding up to 12 instances. The pods still didn't balance.
I would like to know if there is any solution that can help solve this problem while using 9 instances in 1 node pool.
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in the Kubernetes incubator that solves exactly this problem:
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or can not be scheduled, are guided by its configurable policy, which comprises a set of rules called predicates and priorities. The scheduler's decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod first appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, it may be desirable to move already running pods to some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
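If you try the descheduler, its behaviour is driven by a policy file. A rough sketch of a v1alpha1 policy that removes duplicates and rebalances underutilized nodes could look like the following; the threshold numbers are made up, and you should check the README of the release you deploy for the exact format:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:            # nodes below these values count as underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:      # nodes above these values are candidates to evict from
          "cpu": 50
          "memory": 50
          "pods": 50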
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
  selector:
    matchLabels:
      label-key: label-value
  template:
    metadata:
      labels:
        label-key: label-value
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: label-key
                operator: In
                values:
                - label-value
            topologyKey: "kubernetes.io/hostname"
      ...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes anti-affinity a preference rather than an obligation.
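For reference, the preferred variant replaces the required block with a weighted term, roughly like this sketch (the weight and labels are placeholders):
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: label-key
                  operator: In
                  values:
                  - label-value
              topologyKey: "kubernetes.io/hostname"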

What is the difference between a pod and a deployment?

I have been creating pods with type:deployment but I see that some documentation uses type:pod, more specifically the documentation for multi-container pods:
apiVersion: v1
kind: Pod
metadata:
  name: ""
  labels:
    name: ""
  namespace: ""
  annotations: []
  generateName: ""
spec:
  ? "// See 'The spec schema' for details."
  : ~
But to create pods I can just use a deployment type:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ""
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: ""
    spec:
      containers:
        etc
I noticed the pod documentation says:
The create command can be used to create a pod directly, or it can create a pod or pods through a Deployment. It is highly recommended that you use a Deployment to create your pods. It watches for failed pods and will start up new pods as required to maintain the specified number. If you don't want a Deployment to monitor your pod (e.g. your pod is writing non-persistent data which won't survive a restart, or your pod is intended to be very short-lived), you can create a pod directly with the create command.
Note: We recommend using a Deployment to create pods. You should use the instructions below only if you don't want to create a Deployment.
But this raises the question of what kind: Pod is good for. Can you somehow reference pods in a deployment? I didn't see a way. It looks like what you get with pods is some extra metadata but none of the deployment options such as replicas or a restart policy. What good is a pod that doesn't persist data and doesn't survive a restart? I think I'd be able to create a multi-container pod with a deployment as well.
Radek's answer is very good, but I would like to pitch in from my experience: you will almost never use an object of kind Pod, because that doesn't make much sense in practice.
You need a Deployment object - or another Kubernetes API object like a ReplicationController or ReplicaSet - to keep the replicas (pods) alive (that's kind of the point of using Kubernetes).
What you will use in practice for a typical application are:
Deployment object (where you will specify your apps container/containers) that will host your app's container with some other specifications.
Service object (a grouping object, which provides a so-called virtual IP (cluster IP) for the pods that have a certain label - and those pods are basically the app containers that you deployed with the former Deployment object).
You need to have the service object because the pods from the deployment object can be killed, scaled up and down, and you can't rely on their IP addresses because they will not be persistent.
So you need an object like a service, that gives those pods a stable IP.
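As a rough sketch (the app: my-app label and the port numbers are placeholders), such a Service simply selects the Deployment's pods by label:
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app          # must match the labels in the Deployment's pod template
  ports:
  - port: 80             # the stable, cluster-internal port
    targetPort: 8080     # the port the container actually listens on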
Just wanted to give you some context around pods, so you know how things work together.
Hope that clears a few things for you, not long ago I was in your shoes :)
Both Pod and Deployment are full-fledged objects in the Kubernetes API. Deployment manages creating Pods by means of ReplicaSets. What it boils down to is that Deployment will create Pods with spec taken from the template. It is rather unlikely that you will ever need to create Pods directly for a production use-case.
Kubernetes has three Object Types you should know about:
Pods - runs one or more closely related containers
Services - sets up networking in a Kubernetes cluster
Deployment - Maintains a set of identical pods, ensuring that they have the correct config and that the right number of them exist.
Pods:
Runs a single set of containers
Good for one-off dev purposes
Rarely used directly in production
Deployment:
Runs a set of identical pods
Monitors the state of each pod, updating as necessary
Good for dev
Good for production
And I would agree with the other answers: forget about Pods and just use a Deployment. Why? Look at the second bullet point: it monitors the state of each pod, updating as necessary.
So, instead of struggling with error messages such as this one:
Forbidden: pod updates may not change fields other than spec.containers[*].image
just refactor or completely recreate your Pod as a Deployment that creates a pod to do what you need. With a Deployment you can change any piece of configuration you want and not worry about seeing that error message.
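For instance, rolling out a new image through a Deployment is a one-liner that a bare Pod would reject with the error above (a sketch; the deployment name, container name and tag are hypothetical):
kubectl set image deployment/my-app my-app=my-app:1.2.0   # triggers a rolling update of the pods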
A Pod is a container instance.
That is the output of replicas: 3.
Think of it this way: one deployment can have many running instances (replicas).
# deployment.yaml
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: tomcat-deployment222
spec:
  selector:
    matchLabels:
      app: tomcat
  replicas: 3
  template:
    metadata:
      labels:
        app: tomcat
    spec:
      containers:
      - name: tomcat
        image: tomcat:9.0
        ports:
        - containerPort: 8080
I want to add some information from the Kubernetes in Action book, so you can see the whole picture and the relation between Kubernetes resources like Pod, Deployment and ReplicationController (ReplicaSet).
Pods
are the basic deployable unit in Kubernetes. But in real-world use cases, you want your deployments to stay up and running automatically and remain healthy without any manual intervention. For this, the recommended approach is to use a Deployment, which under the hood creates a ReplicaSet.
A ReplicaSet, as the name implies, is a set of replicas (Pods) maintained with their Revision history.
(ReplicaSet extends an older object called ReplicationController -- which is exactly the same but without the Revision history.)
A ReplicaSet constantly monitors the list of running pods and makes sure the running number of pods matching a certain specification always matches the desired number.
Removing a pod from the scope of the ReplicationController comes in handy when you want to perform actions on a specific pod. For example, you might have a bug that causes your pod to start behaving badly after a specific amount of time or a specific event.
A Deployment
is a higher-level resource meant for deploying applications and updating them declaratively.
When you create a Deployment, a ReplicaSet resource is created underneath (eventually more of them). ReplicaSets replicate and manage pods, as well. When using a Deployment, the actual pods are created and managed by the Deployment’s ReplicaSets, not by the Deployment directly
Let’s think about what has happened. By changing the pod template in your Deployment resource, you’ve updated your app to a newer version—by changing a single field!
Finally, rolling back a Deployment, either to the previous revision or to any earlier revision, is easy with the Deployment resource.
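A sketch of what that looks like in practice (the deployment name is a placeholder):
kubectl rollout history deployment/my-app                  # list recorded revisions
kubectl rollout undo deployment/my-app                     # roll back to the previous revision
kubectl rollout undo deployment/my-app --to-revision=2     # or to a specific earlier one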
A Pod is a collection of containers and the basic object of Kubernetes. All containers of a pod run on the same node.
Not suitable for production
No rolling updates
Deployment is a kind of controller in Kubernetes.
Controllers use a Pod Template that you provide to create the Pods for which it is responsible.
A Deployment creates a ReplicaSet, which in turn makes sure that currentReplicas is always the same as desiredReplicas.
Advantages :
You can rollout and rollback your changes using deployment
Monitors the state of each pod
Best suitable for production
Supports rolling updates
In Kubernetes we can deploy our workloads using different types of API objects like Pods, Deployments, ReplicaSets, ReplicationControllers and StatefulSets.
Of those, Pods are the smallest deployable unit in Kubernetes. Any workload/application that runs in Kubernetes has to run inside a container that is part of a Pod. A Pod can run multiple containers (meaning multiple applications) within it. A Pod is a wrapper on top of one or many running containers. Using a Pod, Kubernetes can control, monitor, and operate the containers.
Now, using standalone Pods we can't do a lot of things. We can't change configurations or volumes inside Pods. We can't restart the Pod if it goes down.
So another API object called Deployment comes into the picture, which maintains the desired state of the application (how many instances, how much compute resource the application uses). The Deployment maintains multiple instances of the same application by running multiple Pods. Deployments, unlike Pods, are mutable. Deployments use another API object called ReplicaSet to maintain the desired state. Deployments, through the ReplicaSet, spawn another Pod if one goes down.
So Pods run applications in containers. Deployments run Pods and maintain the desired state of the application.
Try to avoid Pods and use Deployments instead for managing containers, as objects of kind Pod will not be rescheduled (or self-healed) in the event of a node failure or pod termination.
A Deployment is generally preferable because it defines a ReplicaSet to ensure that the desired number of Pods is always available and specifies a strategy to replace Pods, such as RollingUpdate.
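A sketch of how such a strategy is declared in a Deployment spec (the numbers are placeholders):
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during an update
      maxUnavailable: 0    # never drop below the desired replica count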
Maybe this example will be helpful for beginners!
1) Listing Pods
controlplane $ kubectl -n my-namespace get pods
NAME                            READY   STATUS    RESTARTS   AGE
mysql                           1/1     Running   0          92s
webapp-mysql-75dfdf859f-9c54j   1/1     Running   0          92s
2) Deleting the web-app pod, which was created using a Deployment
controlplane $ kubectl -n my-namespace delete pod webapp-mysql-75dfdf859f-9c54j
pod "webapp-mysql-75dfdf859f-9c54j" deleted
3) Listing Pods (you can see it is recreated automatically)
controlplane $ kubectl -n my-namespace get pods
NAME                            READY   STATUS    RESTARTS   AGE
mysql                           1/1     Running   0          2m42s
webapp-mysql-75dfdf859f-mqrcx   1/1     Running   0          45s
4) Deleting the mysql pod, which was created directly (without a Deployment)
controlplane $ kubectl -n my-namespace delete pod mysql
pod "mysql" deleted
5) Listing Pods (you can see the mysql pod is gone for good)
controlplane $ kubectl -n my-namespace get pods
NAME                            READY   STATUS    RESTARTS   AGE
webapp-mysql-75dfdf859f-mqrcx   1/1     Running   0          76s
In Kubernetes, Pods are the smallest deployable units. Every time we create a Kubernetes object like a Deployment, ReplicaSet, StatefulSet or DaemonSet, it creates pods.
As mentioned above, Deployments create pods based on the desired state described in your Deployment object. So, for example, if you want 5 replicas of an application, you specify replicas: 5 in your Deployment manifest. The Deployment controller is then responsible for creating 5 identical replicas (no fewer, no more) of the given application, with all metadata like RBAC policies, network policies, labels, annotations, health checks, resource quotas, taints/tolerations and so on, and for associating them with each pod it creates.
There are some cases when you want to create just a pod, for example if you are running a test sidecar where you don't need to run the application forever and don't need multiple replicas, and you only run it when you want to execute something; in that case a pod is suitable. An example is helm test, which is a pod definition that specifies a container with a given command to run.
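A rough sketch of such a helm test pod template (the name, image and command are placeholders; the helm.sh/hook annotation is what marks it as a test):
apiVersion: v1
kind: Pod
metadata:
  name: my-chart-test-connection
  annotations:
    "helm.sh/hook": test               # run only when you invoke helm test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: busybox                     # placeholder image
    command: ["wget", "-qO-", "http://my-app:80"]   # hypothetical connectivity check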
I am also a beginner in k8s so correct me if I am wrong.
We know that pods are created when we create a deployment. What I observed is that if you look at the YAML file of the deployment, you can see kind: Deployment, but if you look at the YAML file of the pod, you see kind: Pod.