Re Scheduling pods from one node to another - kubernetes

So, I am writing a custom auto-rescheduler for my clusters and I am using Python Client library to do so. As the rescheduler is still in proposal and nothing has been done for it, the only known way is to delete the pod from overused node and let the replication controller and scheduler take care of the rest (make a new pod and assign it to an appropriate node). What I want to know is can I use the client library to move the pods from one node to another without deleting the pod. Basically, I want to create a pod in an appropriate node first and then delete the pod in the over-used node. Is that possible?

Using node labels, you can start the containers on matching nodes. To do this, first set the node label, then update the deployment file and apply it.
Here is the sample YAML file I used for a blue-green deployment; see if it helps.
Web server running on nodes labeled web:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: webserver-blue
spec:
  replicas: 2
  template:
    metadata:
      labels:
        type: webserver
        color: blue
    spec:
      containers:
      - image: nginx:1.12.0
        name: webserver-container
        ports:
        - containerPort: 80
          name: http-server
      nodeSelector:
        svrtype: web
Set another node label, newweb, and create a second Deployment with a different name and that node label in its config:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: webserver-green
spec:
  replicas: 2
  template:
    metadata:
      labels:
        type: webserver
        color: green
    spec:
      containers:
      - image: nginx:1.13.0
        name: webserver-container
        ports:
        - containerPort: 80
          name: http-server
      nodeSelector:
        svrtype: newweb
After testing, you can remove the old one. The limitation here is that you can direct traffic to only one Deployment at a time.
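For the create-first, delete-second idea in the question, a minimal Python sketch of the manifest-preparation step might look like this. It works on a pod manifest as a plain dict; the helper name clone_pod_for_node, the pod name web-copy, and the node name node-b are hypothetical. It assumes a bare pod: a pod owned by a replication controller would be counted by its controller, which may then remove a surplus replica. Submitting the result through the client library and deleting the old pod afterwards is left out.

```python
import copy

# Metadata fields populated by the API server; a new object should not carry them.
SERVER_SET_METADATA = ("resourceVersion", "uid", "creationTimestamp", "selfLink")


def clone_pod_for_node(pod, new_name, target_node):
    """Return a copy of a pod manifest (plain dict) pinned to target_node."""
    clone = copy.deepcopy(pod)
    clone.pop("status", None)  # status belongs to the old, running pod
    metadata = clone.setdefault("metadata", {})
    for field in SERVER_SET_METADATA:
        metadata.pop(field, None)
    metadata["name"] = new_name
    # Assumption: pinning via spec.nodeName, which bypasses the scheduler.
    clone.setdefault("spec", {})["nodeName"] = target_node
    return clone


# Hypothetical pod currently running on an overused node.
old_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web", "uid": "1234", "resourceVersion": "42"},
    "spec": {"nodeName": "node-a", "containers": [{"name": "web", "image": "nginx"}]},
    "status": {"phase": "Running"},
}

new_pod = clone_pod_for_node(old_pod, "web-copy", "node-b")
print(new_pod["metadata"], new_pod["spec"]["nodeName"])
```

The original dict is left untouched, so the old pod can still be identified and deleted once the copy is running.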

Related

Kubernetes non specific spec.selector does not prevent Kubernetes from working correctly

I've experienced a surprising behavior when playing around with Kubernetes and I wanted to know if there is any good explanation behind it.
I've noticed that when two Kubernetes deployments are created with the same labels and the same spec.selector, the deployments still function correctly, even though using the same selector "should" cause them to be confused about which pods belong to each one.
Example configurations which present this -
example_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
    extra_label: one
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
example_deployment_2.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-2
  labels:
    app: nginx
    extra_label: two
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
I expected the deployments not to work correctly, since they would select each other's pods and assume they were their own.
The actual result is that the deployments seem to be created correctly, but entering the deployment from k9s returns all of the pods. This is true for both deployments.
Can anyone please shed light on why this is happening? Is there additional internal filtering in Kubernetes to prevent pods that were not really created by the deployment from being associated with it?
I'll note that I've seen this behavior in AWS and have reproduced it in Minikube.
When you create a K8S Deployment, K8S creates a ReplicaSet to manage the pods, and this ReplicaSet creates the pods based on the number of replicas you provided (or that the HPA patched). In addition to the labels and annotations you provide, the ReplicaSet adds ownerReferences containing its own name and uid. So even if you have 4 pods with the same labels, each pair of pods will carry a different ownerReferences entry, which its ReplicaSet uses to manage them:
apiVersion: v1
kind: Pod
metadata:
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: <replicaset name>
    uid: <replicaset uid>
...
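The ownership check described above can be sketched in Python. This is a simplified model on plain dicts, not the controller's actual code; the function name pods_controlled_by and the uid values rs-uid-1 and rs-uid-2 are hypothetical.

```python
def pods_controlled_by(pods, owner_uid):
    """Return the pods whose controlling ownerReference has the given uid."""
    owned = []
    for pod in pods:
        refs = pod.get("metadata", {}).get("ownerReferences", [])
        if any(ref.get("controller") and ref.get("uid") == owner_uid for ref in refs):
            owned.append(pod)
    return owned


def make_pod(name, owner_uid):
    """Build a minimal pod dict; every pod carries the same app: nginx label."""
    return {
        "metadata": {
            "name": name,
            "labels": {"app": "nginx"},
            "ownerReferences": [
                {"kind": "ReplicaSet", "controller": True, "uid": owner_uid}
            ],
        }
    }


# Five pods with identical labels, owned by two different ReplicaSets.
pods = [make_pod(f"a-{i}", "rs-uid-1") for i in range(2)]
pods += [make_pod(f"b-{i}", "rs-uid-2") for i in range(3)]

first = [p["metadata"]["name"] for p in pods_controlled_by(pods, "rs-uid-1")]
second = [p["metadata"]["name"] for p in pods_controlled_by(pods, "rs-uid-2")]
print(first, second)
```

A label-only query, such as the k9s view mentioned in the question, would list all five pods, while the owner uid separates them cleanly.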

Which names should be same in this k8s yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
  labels:
    run: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      run: my-app
  template:
    metadata:
      labels:
        run: my-app
    spec:
      containers:
      - image: gcr.io/google-samples/hello-app:1.0
        name: my-app
        ports:
        - containerPort: 8080
This is a sample YAML from the Kubernetes site. There are so many my-app entries; do they all have to be the same? What is their purpose?
No, they don't all have to be the same. The name field can be different. The my-app references in the metadata and selector sections are labels that can be used to glue the different Kubernetes objects together, or simply to select a subset of objects when querying Kubernetes. They will sometimes be the same.
Depending on how you created the Deployment, you may have run: my-app throughout the Deployment and in the objects derived from it. With older kubectl versions (before 1.18, when kubectl run still generated Deployments), kubectl run my-app --image=gcr.io/google-samples/hello-app:1.0 --replicas=3 would create a Deployment identical to the one you're referring to.
Here's a picture showing how the different run: my-app labels are used, using the Deployment above as an inspiration:
The picture above shows the Deployment and how the template box (blue) is used to create the specified number of replicas (Pods). Each Pod gets a run: my-app label in its metadata section; from the Deployment's point of view, this is how it selects the Pods it's responsible for.
A similar selection of a subset of Pods using kubectl would be:
kubectl get pods -l run=my-app
This will give you all Pods labeled run: my-app.
To sum up a bit, labels can be used to select a subset of resources when querying using e.g. kubectl or by other Kubernetes resources to do selections. You can create your own labels and they don't necessarily have to be the same throughout your specific Deployment but if they are it would be pretty easy to query for any resource with a specific label.
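As a rough illustration of what an equality-based selector such as run=my-app does, here is a small Python sketch. It is a simplified model on plain dicts, not kubectl's implementation; the function names and the sample resource names are hypothetical.

```python
def parse_equality_selector(selector):
    """Turn 'k1=v1,k2=v2' into a dict of required labels."""
    pairs = (item.split("=", 1) for item in selector.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}


def select(resources, selector):
    """Return the resources whose labels contain every required key/value pair."""
    wanted = parse_equality_selector(selector)
    return [r for r in resources if wanted.items() <= r.get("labels", {}).items()]


# Hypothetical resources; only the labels matter for selection.
resources = [
    {"name": "my-app-1", "labels": {"run": "my-app"}},
    {"name": "my-app-2", "labels": {"run": "my-app", "tier": "web"}},
    {"name": "other-1", "labels": {"run": "other"}},
    {"name": "unlabeled"},
]

matched = [r["name"] for r in select(resources, "run=my-app")]
print(matched)
```

Extra labels on a resource do not prevent a match; only the labels named in the selector are compared.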
Personally, I think it can be helpful for checking pods grouping information.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app      <--- Deployment object name; you can change it.
  labels:
    run: my-app     <--- Helpful for management, e.g. deleting everything that carries the same label.
spec:
  replicas: 3
  selector:
    matchLabels:
      run: my-app   <--- Which pod labels this Deployment object controls.
  template:
    metadata:
      labels:
        run: my-app <--- The pod's label. It can be used for grouping with other objects.
    spec:
      containers:
      - image: gcr.io/google-samples/hello-app:1.0
        name: my-app
        ports:

Is it possible to move the running pods from ReplicationController to a Deployment?

We are using RC to run our workload and want to migrate to Deployment. Is there a way to do that with out causing any impact to the running workload. I mean, can we move these running pods under Deployment?
As #matthew-l-daniel answered, the answer is yes. But I am more than 80% certain about it, because I have tested it.
Now, what's the process we need to follow?
Let's say I have a ReplicationController.
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
Question: can we move these running pods under Deployment?
Let's follow these steps to see if we can.
Step 1:
Delete this RC with --cascade=false. This will leave the Pods running. (Newer kubectl releases spell this option --cascade=orphan.)
Step 2:
Create a ReplicaSet first, with the same labels as the ReplicationController.
apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
---
So, now these Pods are under the ReplicaSet.
Step 3:
Create the Deployment now, with the same labels.
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
----
The Deployment will find that a matching ReplicaSet already exists, and our job is done.
Now we can try increasing replicas to see if it works.
And it works.
What doesn't work
After deleting the ReplicationController, do not create the Deployment directly. This will not work, because the Deployment will find no ReplicaSet and will create a new one with an additional label, which will not match your existing Pods.
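Why the extra label breaks adoption can be shown with a small Python sketch of matchLabels semantics. It is a simplified model, and the hash value abc123 is made up for illustration; pod-template-hash is the label a Deployment adds to the ReplicaSets it generates.

```python
def selector_matches(match_labels, pod_labels):
    """A matchLabels selector matches only if every listed pair is on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Labels on the pods left behind by the deleted ReplicationController.
orphaned_pod_labels = {"app": "nginx"}

# Selector of the hand-written ReplicaSet from Step 2.
manual_rs_selector = {"app": "nginx"}

# Selector of a ReplicaSet generated by a Deployment (hash value is hypothetical).
generated_rs_selector = {"app": "nginx", "pod-template-hash": "abc123"}

adopted_by_manual = selector_matches(manual_rs_selector, orphaned_pod_labels)
adopted_by_generated = selector_matches(generated_rs_selector, orphaned_pod_labels)
print(adopted_by_manual, adopted_by_generated)
```

The orphaned pods satisfy the hand-written selector but lack the hash label, so a generated ReplicaSet would start fresh pods instead of taking them over.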
I'm about 80% certain the answer is yes, since they both use Pod selectors to determine whether new instances should be created. The key trick is to use the --cascade=false (the default is true) in kubectl delete, whose help even speaks to your very question:
--cascade=true: If true, cascade the deletion of the resources managed by this resource (e.g. Pods created by a ReplicationController). Default true.
By deleting the ReplicationController but not its subordinate Pods, they will continue to just hang out (although be careful, if a reboot or other hazard kills one or all of them, no one is there to rescue them). Creating the Deployment with the same selector criteria and a replicas count equal to the number of currently running Pods should cause a "no action" situation.
I regret that I don't have my cluster in front of me to test it, but I would think a small nginx RC with replicas=3 should be a simple enough test to prove that it behaves as you wish.

In Kubernetes, how to set pods' names when using replication controllers?

I have a simple replication controller yaml file which looks like this:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    spec:
      containers:
      - image: library/nginx:3.2
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
    metadata:
      labels:
        app: nginx
And after running this replication controller, I will get 3 different pods whose names are "nginx-xxx", where "xxx" represents a random string of letters and digits.
What I want is to specify names for the pods created by the replication controller, so that the pods are named "nginx-01", "nginx-02", "nginx-03". Furthermore, if pod "nginx-02" goes down for some reason and the replication controller automatically creates another nginx pod, I want this new pod's name to remain "nginx-02".
I wonder if this is possible? Thanks in advance.
You should use a StatefulSet instead of a replication controller. Moreover, replication controllers have been superseded by ReplicaSets.
StatefulSet Pods have a unique identity that is comprised of an ordinal. For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, from 0 up through N-1, that is unique over the Set. Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod.
A StatefulSet matches your requirements, so use one in your deployment.
Try the deployment files below:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
      volumes:
      - name: www
        emptyDir: {}
This can be implemented using StatefulSets, which have been out of beta since version 1.9. Quoting the documentation: When using kind: StatefulSet,
Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it’s (re)scheduled on.
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal).
So for a StatefulSet named nginx, you would get nginx-0, nginx-1, nginx-2.
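The naming pattern quoted above is easy to sketch in Python. The function names are hypothetical, and the DNS form assumes a headless Service, the default namespace, and the cluster.local cluster domain. Note that the manifest earlier in this thread names its StatefulSet web and its Service nginx, so those values are used here.

```python
def statefulset_pod_names(name, replicas):
    """Pod names follow $(statefulset name)-$(ordinal), with ordinals 0..N-1."""
    return [f"{name}-{ordinal}" for ordinal in range(replicas)]


def statefulset_dns_names(name, replicas, service,
                          namespace="default", cluster_domain="cluster.local"):
    """Per-pod DNS names behind a headless Service (assumed naming scheme)."""
    return [
        f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
        for pod in statefulset_pod_names(name, replicas)
    ]


pod_names = statefulset_pod_names("web", 3)
dns_names = statefulset_dns_names("web", 3, service="nginx")
print(pod_names)
print(dns_names[0])
```

Because the ordinal is part of the identity, a replacement for a failed pod keeps the same name, which is the behavior the question asks for (with ordinals starting at 0 rather than 01).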
If you're running stateless workloads, I cannot imagine why you would want to have fixed identities associated with each object if your intention is to run N replicas of a particular pod.
There is no way to do this using a ReplicaSet/ReplicationController. When the controller creates new pods, they get a generated suffix appended to the pod name.
If that is what you really want (fixed identity/ordinal index), the property is satisfied by the StatefulSet resource, which has been stable since Kubernetes v1.9. However, it also comes with additional guarantees that you probably do not need.

How can I distribute a deployment across nodes?

I have a Kubernetes deployment that looks something like this (replaced names and other things with '....'):
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubernetes.io/change-cause: kubectl replace deployment ....
      -f - --record
  creationTimestamp: 2016-08-20T03:46:28Z
  generation: 8
  labels:
    app: ....
  name: ....
  namespace: default
  resourceVersion: "369219"
  selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/....
  uid: aceb2a9e-6688-11e6-b5fc-42010af000c1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ....
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ....
    spec:
      containers:
      - image: gcr.io/..../....:0.2.1
        imagePullPolicy: IfNotPresent
        name: ....
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: "0"
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 2
  observedGeneration: 8
  replicas: 2
  updatedReplicas: 2
The problem I'm observing is that Kubernetes places both replicas (in the deployment I've asked for two) on the same node. If that node goes down, I lose both containers and the service goes offline.
What I want Kubernetes to do is to ensure that it doesn't double up containers on the same node where the containers are the same type - this only consumes resources and doesn't provide any redundancy. I've looked through the documentation on deployments, replica sets, nodes etc. but I couldn't find any options that would let me tell Kubernetes to do this.
Is there a way to tell Kubernetes how much redundancy across nodes I want for a container?
EDIT: I'm not sure labels will work; labels constrain where a pod will run so that it has access to local resources (SSDs) etc. All I want to do is ensure no downtime if a node goes offline.
There is now a proper way of doing this.
You can use the "kubernetes.io/hostname" label as the topology key if you just want to spread the pods out across all nodes. That means if you have two replicas of a pod and two nodes, each node should get one, provided the nodes have distinct hostnames.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
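The effect of maxSkew: 1 with DoNotSchedule can be read as a feasibility check per node. The Python sketch below is a simplified model, not the scheduler's code: it assumes every node is an eligible topology domain and ignores resources, taints, and every other filter. The function and node names are hypothetical.

```python
def feasible_nodes(matching_pods_per_node, max_skew=1):
    """Nodes where adding one more matching pod keeps the skew within max_skew.

    Skew is modeled as (matching pods on the node after placement) minus
    (the lowest matching-pod count on any node).
    """
    lowest = min(matching_pods_per_node.values())
    return [
        node
        for node, count in matching_pods_per_node.items()
        if (count + 1) - lowest <= max_skew
    ]


# First replica already runs on node-a; where may the second one go?
after_first_replica = feasible_nodes({"node-a": 1, "node-b": 0})

# Both nodes hold one replica; a third may land on either.
after_second_replica = feasible_nodes({"node-a": 1, "node-b": 1})
print(after_first_replica, after_second_replica)
```

With two replicas and two nodes, the second replica is only allowed on the empty node, which gives the one-per-node spread the question asks for.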
I think you're looking for the Affinity/Anti-Affinity Selectors.
Affinity is for co-locating pods: for example, I want my website to try to schedule on the same host as my cache. Anti-affinity is the opposite: don't schedule on a host, according to a set of rules.
So for what you're doing, I would take a closer look at these two links:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#never-co-located-in-the-same-node
https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#tolerating-node-failure
If you create a Service for that Deployment, before creating the said Deployment, Kubernetes will spread your pods across nodes. This behavior comes from the Scheduler, it is provided on a best-effort basis, providing that you have enough resources available on both nodes.
From the Kubernetes documentation (Managing Resources):
it’s best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
Also related: Configuration best practices - Service.
I agree with Antoine Cotten about using a Service for your Deployment. Note that it is the Deployment's ReplicaSet, rather than the Service, that keeps the workload up by creating a new pod if one dies on a node; the Service routes traffic to the pods that are ready. However, if you just want to distribute a Deployment among all nodes, you can use pod anti-affinity in your pod manifest file. I put an example on my gitlab page that you can also find in the Kubernetes Blog. For your convenience, I'm providing the example here as well.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
In this example, each pod created by the Deployment carries the label app with the value nginx. In the pod spec, podAntiAffinity prevents two pods with that label (app: nginx) from being scheduled on the same node. You can also use podAffinity if you would like to place multiple Deployments on one node.
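The required anti-affinity rule above can be modeled in a few lines of Python. This is a simplified sketch, not scheduler code: it treats each node as its own topology domain (as kubernetes.io/hostname does) and only handles the In operator. The function name and the node and pod data are hypothetical.

```python
def nodes_allowed_by_anti_affinity(nodes, running_pods, key, values):
    """Nodes that host no pod whose label `key` has a value in `values`."""
    blocked = {
        pod["node"]
        for pod in running_pods
        if pod.get("labels", {}).get(key) in values
    }
    return [node for node in nodes if node not in blocked]


nodes = ["node-a", "node-b", "node-c"]
running_pods = [
    {"node": "node-a", "labels": {"app": "nginx"}},   # first replica
    {"node": "node-b", "labels": {"app": "redis"}},   # unrelated pod
]

allowed = nodes_allowed_by_anti_affinity(nodes, running_pods, "app", ["nginx"])
print(allowed)
```

Because the rule is required rather than preferred, a replica has nowhere to go once every node already hosts a matching pod, so replicas beyond the node count would stay pending.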
If a node goes down, any pods running on it would be restarted automatically on another node.
If you start specifying exactly where you want them to run, then you actually lose the capability of Kubernetes to reschedule them on a different node.
The usual practice therefore is to simply let Kubernetes do its thing.
If however you do have valid requirements to run a pod on a specific node, due to requirements for certain local volume type etc, have a read of:
http://kubernetes.io/docs/user-guide/node-selection/
Maybe a DaemonSet will work better. I'm using DaemonSets with nodeSelector to run pods on specific nodes and avoid duplication.
http://kubernetes.io/docs/admin/daemons/