How to force Kubernetes to update a deployment with a pod on every node

I would like to know if there is a way to force Kubernetes, during a deploy, to use every node in the cluster.
The question comes from some attempts I have made, where I noticed a situation like this:
a cluster of 3 nodes
I update a deployment with a command like: kubectl set image deployment/deployment_name my_repo:v2.1.2
Kubernetes updates the cluster
At the end I execute kubectl get pod and notice that 2 pods have been deployed on the same node.
So after the update, the cluster has this configuration:
one node with 2 pods
one node with 1 pod
one node without any pod (no workload at all)

The scheduler will try to figure out the most reasonable way of scheduling at a given point in time, which can change later on and result in situations like the one you described. Two simple ways to manage this are:
use a DaemonSet instead of a Deployment: it will make sure you have exactly one pod per node (matching nodeSelector / tolerations, etc.); a minimal sketch follows after this list
use PodAntiAffinity: you can make sure that two pods of the same deployment in the same version are never scheduled on the same node. This is what I personally prefer for many apps (unless I want more than one scheduled per node). Note that it will run into trouble if you scale your deployment to more replicas than you have nodes.
Example of the versioned PodAntiAffinity I use:
metadata:
  labels:
    app: {{ template "fullname" . }}
    version: {{ .Values.image.tag }}
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["{{ template "fullname" . }}"]
          - key: version
            operator: In
            values: ["{{ .Values.image.tag }}"]
        topologyKey: kubernetes.io/hostname
consider experimenting with the Descheduler, which is something of an evil twin of the Kubernetes Scheduler: it deletes pods so that they get rescheduled differently
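For the DaemonSet option mentioned above, a minimal sketch could look like the following (the names and the app: webpod label are illustrative placeholders; the image tag is taken from the question's example):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemonset          # hypothetical name
spec:
  selector:
    matchLabels:
      app: webpod
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: webpod
    spec:
      containers:
      - name: my-container     # placeholder container name
        image: my_repo:v2.1.2  # image from the question's example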

I tried some solutions, and what works at the moment is simply based on changing the version inside my YAML and using a DaemonSet controller.
I mean:
1) I deploy my application for the first time, based on a pod with some containers. These pods should be deployed on every cluster node (I have 3 nodes). I set up the deployment in the YAML file with the option replicas equal to 3:
apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    app: webpod
spec:
  replicas: 3
  ....
I set up the DaemonSet (or ds) in its YAML file with the option updateStrategy equal to RollingUpdate:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: my-daemonset
spec:
  updateStrategy:
    type: RollingUpdate
  ...
The version used for one of my containers is, for example, 2.1.
2) I execute the deployment with the command: kubectl apply -f my-deployment.yaml
I execute the DaemonSet with the command: kubectl apply -f my-daemonset.yaml
3) I get one pod on every node without problems.
4) Now I want to update the deployment, changing the version of the image used by one of my containers. So I simply edit the YAML file, replacing 2.1 with 2.2, and then re-launch the command: kubectl apply -f my-deployment.yaml
Alternatively, I can simply change the version of the image (2.1 -> 2.2) with this command:
kubectl set image ds/my-daemonset my-container=my-repository:v2.2
5) Again, I obtain one pod on every node without problems.
The behavior is very different if I instead use the command:
kubectl set image deployment/my-deployment my-container=xxxx:v2.2
In this case I get the wrong result: one node has 2 pods, one node has 1 pod, and the last node has no pod at all...
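To check where each pod actually landed, the wide output of kubectl get pod adds a NODE column:
kubectl get pod -o wide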
To see how the rollout evolves, I can launch the command:
kubectl rollout status ds/my-daemonset
getting something like this:
Waiting for rollout to finish: 0 out of 3 new pods have been updated...
Waiting for rollout to finish: 0 out of 3 new pods have been updated...
Waiting for rollout to finish: 1 out of 3 new pods have been updated...
Waiting for rollout to finish: 1 out of 3 new pods have been updated...
Waiting for rollout to finish: 1 out of 3 new pods have been updated...
Waiting for rollout to finish: 2 out of 3 new pods have been updated...
Waiting for rollout to finish: 2 out of 3 new pods have been updated...
Waiting for rollout to finish: 2 out of 3 new pods have been updated...
Waiting for rollout to finish: 2 of 3 updated pods are available...
daemon set "my-daemonset" successfully rolled out


How to scale up all OpenShift pods before scaling down old ones

I have a basic OpenShift deployment configuration:
kind: DeploymentConfig
spec:
  replicas: 3
  strategy:
    type: Rolling
Additionally, I've put:
maxSurge: 3
maxUnavailable: 0%
because I want to scale up all new pods first and only after that scale down the old pods (so there will be 6 pods running during the deployment; that's why I decided to set maxSurge).
I want to keep all old pods running until all new pods are up, but with this set of parameters something is wrong. During deployment:
all 3 new pods are initialized at once and try to start, while the old pods keep running (as expected)
if the first new pod starts successfully, then one old pod is terminated
if the second new pod is ready, then another old pod is terminated
I want to terminate all old pods ONLY when all new pods are ready to handle requests; until then, all the old pods should handle requests.
What did I miss in this configuration?
The behavior you document is expected for a deployment rollout (OpenShift will shut down each old pod as a new pod becomes ready). It will also start routing traffic to the new pods as they become available, which you say you don't want either.
A service is pretty much by definition going to route to pods as they become available. And a deployment pretty much handles pods independently, so I don't believe anything will really give you the behavior you are looking for there either.
If you want a blue-green style deployment like you describe, you are essentially going to have to deploy the new pods as a separate deployment. Then, once the new deployment is completely up, you can change the corresponding service to point at the new pods. After that you can shut down the old deployment.
A service mesh can help with some of that. So could an operator. Or you could do it manually.
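A minimal sketch of that manual switch, assuming the Service selects pods by a version label and the new pods are labelled version: v2 (the names myapp and myapp-v1 are hypothetical):
# flip the Service selector from the old pods to the new ones (hypothetical names)
oc patch service myapp -p '{"spec":{"selector":{"app":"myapp","version":"v2"}}}'
# once traffic is confirmed on v2, remove the old DeploymentConfig
oc delete deploymentconfig myapp-v1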
You can combine the rolling strategy with readiness checks that have an initial delay, to ensure that all the new pods have time to start up before the old ones are shut down.
In the case below, the 3 new pods will be spun up (for a total of 6 pods), and only after 60 seconds will the readiness check occur and the old pods be shut down. You would just want to set your readiness delay to a timeframe large enough to give all of your new pods time to start up.
apiVersion: v1
kind: DeploymentConfig
spec:
  replicas: 3
  strategy:
    rollingParams:
      maxSurge: 3
      maxUnavailable: 0
    type: Rolling
  template:
    spec:
      containers:
      - readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8099
          initialDelaySeconds: 60

How to do controlled rollout using Kubernetes deployment

We have 1000 store nodes and need to deploy an application image on every Kubernetes node, rolling out in the order below, and we would like to specify the target node details during the deployment. Is there a way to specify node details on the command line when we execute kubectl create or kubectl apply?
This application image would be configured with store/node-specific details during container/pod creation.
1 node on day 1,
10 nodes on day 2,
100 nodes on day 3, etc.
Answering on the question from the title:
How to do controlled rollout using Kubernetes deployment
You can create a Deployment that will have specific fields in its manifest that will configure the way Kubernetes handles it.
With fields like podAntiAffinity and requiredDuringSchedulingIgnoredDuringExecution you can ensure that Kubernetes distributes the Pods evenly across the cluster Nodes. You can read more about it in the documentation below:
Kubernetes.io: Docs: Concepts: Scheduling eviction: Assign Pod to Node
Having in mind following rollout schedule:
DAY    REPLICAS_COUNT
1      1
2      10
3      100
4      1000
You could use CI/CD tools (for example, Jenkins) to roll out (change) the number of replicas of your Deployment on a specific schedule.
You could create a Jenkins pipeline with a deploy stage where you put your own command with its scheduler (or delay).
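A minimal sketch of what such a deploy stage could run on a given day, assuming the nginx Deployment below has already been created on day 1 and that REPLICAS_COUNT is supplied by the pipeline:
# hypothetical deploy step driven by the schedule above
REPLICAS_COUNT=10                                 # e.g. the day-2 value from the table
kubectl scale deployment nginx --replicas="${REPLICAS_COUNT}"
kubectl rollout status deployment/nginx           # wait until the new replicas are available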
An example of such a Deployment that could be used with Jenkins follows:
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: ${REPLICAS_COUNT}
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: nginx
EOF
This Deployment will assign Pods only to Nodes that aren't already running a replica of this Deployment (i.e. 1 Pod = 1 Node). If the number of Pods exceeds the number of Nodes, the extra Pods will remain in the Pending state.
Additional resources:
Jenkins.io: Doc: Pipeline: Tour: Environment
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Deployment

upgrading to bigger node-pool in GKE

I have a node-pool (default-pool) in a GKE cluster with 3 nodes, machine type n1-standard-1. They host 6 pods with a Redis cluster (3 masters and 3 slaves) and 3 pods with a Node.js example app.
I want to upgrade to a bigger machine type (n1-standard-2), also with 3 nodes.
In the documentation, Google gives an example of upgrading to a different machine type (in a new node pool).
I tested this in development, and my node pool was unreachable for a while when executing the following command:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node";
done
In my terminal, I got a message that my connection with the server was lost (I could not execute kubectl commands). After a few minutes, I could reconnect and I got the desired output as shown in the documentation.
The second time, I tried leaving out the cordon command and skipped straight to the following command:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done
This is because, if I interpret the Kubernetes documentation correctly, nodes are automatically cordoned when using the drain command. But I got the same result as with the cordon command: I lost connection to the cluster for a few minutes, and I could not reach the Node.js example app that was hosted on those nodes. After a few minutes, it recovered.
I found a workaround to upgrade to a new node pool with bigger machine types: I edited the deployment/statefulset YAML files and changed the nodeSelector. Node pools in GKE are labeled with:
cloud.google.com/gke-nodepool=NODE_POOL_NAME
so I added the correct nodeSelector to the deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  labels:
    app: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: new-default-pool
      containers:
      - name: example
        image: IMAGE
        ports:
        - containerPort: 3000
This works without downtime, but I'm not sure this is the right way to do it in a production environment.
What is wrong with the cordon/drain commands, or am I not using them correctly?
Cordoning a node will cause it to be removed from the load balancer's backend list, and so will a drain. The correct way to do it is to set up anti-affinity rules on the deployment so the pods are not deployed on the same node (or, for that matter, in the same region). That will give you an even distribution of pods throughout your node pool.
Then you have to disable autoscaling on the old node pool if you have it enabled, slowly drain 1-2 nodes at a time, and wait for the replacement pods to appear on the new node pool, making sure at all times to keep at least one pod of the deployment alive so it can handle traffic.
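A rough sketch of that node-by-node drain, assuming the old pool is still called default-pool and the app is the example-deployment shown above:
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  # evict this node's pods, then wait for the deployment to be fully available again
  kubectl drain --ignore-daemonsets --delete-local-data "$node"
  kubectl rollout status deployment/example-deployment
done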

What is the difference between kubectl autoscale vs kubectl scale

I am new to Kubernetes and am trying to understand when to use the kubectl autoscale and kubectl scale commands.
The scale (replica count) of a deployment tells how many pods should always be running to ensure the application works properly. You have to specify it manually.
In YAML you define it in spec.replicas, as in the example below:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
A second way to specify the scale (replicas) of a deployment is to use a command:
$ kubectl run nginx --image=nginx --replicas=3
deployment.apps/nginx created
$ kubectl get deployment
NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx   3         3         3            3           11s
It means that the deployment will have 3 pods running, and Kubernetes will always try to maintain this number of pods (if any of the pods crashes, K8s will recreate it). You can always change it in spec.replicas and run kubectl apply -f <name-of-deployment>, or via a command:
$ kubectl scale deployment nginx --replicas=10
deployment.extensions/nginx scaled
$ kubectl get deployment
NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx   10        10        10           10          4m48s
Please read the documentation about scaling and ReplicaSets.
Horizontal Pod Autoscaling (HPA) was invented to scale a deployment based on metrics produced by the pods. For example, if your application receives about 300 HTTP requests per minute and each of your pods can handle 100 HTTP requests per minute, everything is fine. However, if you receive a huge number of requests, say ~1000, 3 pods will not be enough and 70% of requests will fail. With HPA, the deployment will autoscale to run 10 pods to handle all the requests. After some time, when the number of requests drops to 500/minute, it will scale down to 5 pods. Later it might go up or down again, depending on the request volume and your HPA configuration.
The easiest way to apply autoscaling is:
$ kubectl autoscale deployment <your-deployment> --<metrics>=value --min=3 --max=10
It means that the autoscaler will automatically scale based on the metric up to a maximum of 10 pods, and later scale back down to a minimum of 3.
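For instance, filling in the placeholders for the nginx deployment above, with CPU utilization as the metric:
$ kubectl autoscale deployment nginx --cpu-percent=50 --min=3 --max=10
This tells the HPA to keep average CPU utilization around 50% by adding or removing pods between 3 and 10 replicas.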
A very good example using CPU usage is shown in the HPA documentation.
Please keep in mind that Kubernetes can use many types of metrics exposed via APIs (HTTP requests, CPU/memory load, number of threads, etc.).
I hope this helps you understand the difference between scaling and autoscaling.

Kubernetes keeps spawning Pods after deletion

The following is the file used to create the Deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kloud-php7
  namespace: kloud-hosting
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kloud-php7
    spec:
      containers:
      - name: kloud-php7
        image: 192.168.1.1:5000/kloud-php7
      - name: kloud-nginx
        image: 192.168.1.1:5000/kloud-nginx
        ports:
        - containerPort: 80
The Deployment and the Pod worked fine, but after deleting the Deployment and the generated ReplicaSet, I cannot delete the spawned Pods permanently. New Pods are created whenever the old ones are deleted.
The Kubernetes cluster was created with Kargo and contains 4 nodes running CentOS 7.3, Kubernetes version 1.5.6.
Any idea how to solve this problem?
This is working as intended. The Deployment creates (and recreates) a ReplicaSet and the ReplicaSet creates (and recreates!) Pods. You need to delete the Deployment, not the Pods or the ReplicaSet:
kubectl delete deploy -n kloud-hosting kloud-php7
This is because the ReplicaSet always recreates the pods to match the count specified in the deployment file (say 3; Kubernetes always makes sure that 3 pods are up and running), so here we need to delete the ReplicaSet first to get rid of the pods.
kubectl get rs
and delete the ReplicaSet; this will in turn delete the pods.
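For example (the ReplicaSet name is a placeholder; use the one reported by kubectl get rs, together with the namespace from the manifest above):
kubectl delete rs <replicaset-name> -n kloud-hosting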
It could also be that DaemonSets need to be deleted.
For example:
$ kubectl get DaemonSets
NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
elasticsearch-operator-sysctl   5         5         5       5            5           <none>          6d
$ kubectl delete daemonsets elasticsearch-operator-sysctl
Now running kubectl get pods should not list elasticsearch* pods.