I'm working on deploying the Thanos monitoring system and one of its components, the metric compactor, warns that there should never be more than one compactor running at the same time. If this constraint is violated it will likely lead to corruption of metric data.
Is there any way to codify "Exactly One" pod via Deployment/StatefulSet/etc, aside from "just set replicas: 1 and never scale"? We're using Rancher as an orchestration layer and it's real easy to hit that + button without thinking about it.

Limit the replicas in your Deployment to 1:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1

Be careful with Deployments, because they can be configured with one of two update strategies:
RollingUpdate: new pods are added while old pods are terminated. This means that, depending on the maxSurge option, even with replicas set to 1 you may briefly have 2 pods running.
Recreate: all the previous pods are terminated before any new pods are created.
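As a rough illustration of the Recreate option, here is a minimal sketch of a Deployment that does not run the old and the new pod side by side during a rollout. The name compactor, the app label and the image reference are placeholders that do not come from the question, and apps/v1 is assumed because Kubernetes 1.16 and later no longer serve Deployments under extensions/v1beta1.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: compactor                 # hypothetical name
spec:
  replicas: 1
  strategy:
    type: Recreate                # terminate the old pod before creating the new one
  selector:
    matchLabels:
      app: compactor              # hypothetical label
  template:
    metadata:
      labels:
        app: compactor
    spec:
      containers:
      - name: compactor
        image: registry.example.com/compactor:1.0   # placeholder image
```

This only changes how rollouts behave; it does not stop anyone from raising replicas by hand.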
StatefulSets, by contrast, guarantee that at most one pod exists for each pod identity at any given time, so with replicas: 1 there is never a second instance.
apiVersion: apps/v1beta1
kind: StatefulSet
spec:
  replicas: 1
Unlike a Deployment, a StatefulSet does not replace a pod until the previous one has been terminated.
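For reference, a fuller single-replica StatefulSet might look like the sketch below. It assumes a cluster that serves apps/v1 and a headless Service called compactor that already exists; the names, label and image are placeholders rather than anything from the Thanos documentation.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: compactor                 # hypothetical name
spec:
  serviceName: compactor          # governing headless Service, assumed to exist
  replicas: 1
  selector:
    matchLabels:
      app: compactor
  template:
    metadata:
      labels:
        app: compactor
    spec:
      containers:
      - name: compactor
        image: registry.example.com/compactor:1.0   # placeholder image
```

The single pod is always named compactor-0, and the controller waits for it to terminate before creating a replacement with the same name.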


Cluster Autoscaler and Horizontal Pod Autoscaler working together

I have a cluster with Cluster Autoscaler activated and HPA for one of my deployments.
This is the HPA definition:
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: ReplicationController
    name: hello-hpa-cpu
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 50
Now in a situation where my cluster is being used very lightly, that means this deployment will only have 1 available replica.
And since the cluster is not under high usage, it could be the case that the node containing that replica is scheduled for deletion (downscaling).
In that case, it would make my deployment have a downtime (when the cluster node is deleted, the only replica for the deployment is deleted as well, so it needs to be rescheduled in a new pod). I don't want that to happen (the downtime).
From this issue: https://github.com/kubernetes/kubernetes/issues/48307, it seems that Pod Disruption Budgets are not applicable to deployments with only 1 replica.
So the only solution to my problem would be to have minReplicas set to 2?
Or is there something else I could do to prevent this downtime and still leave minReplicas at 1?
Kubernetes has the notion of a disruption. The cluster autoscaler (or an administrator) taking a node offline is a "voluntary" disruption (as distinct from, say, the node losing power) and so you have some control over it. If you create a pod disruption budget:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hello-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: hello
you have specified that there should never be fewer than one available pod with the label app: hello when the cluster tries to perform a voluntary disruption.
Doing this can prevent the cluster autoscaler from actually deleting the node. The examples in the PDB documentation generally have multiple replicas and can tolerate some of them being offline, so it's possible to delete 1 replica of 3 and recreate it on a different node. There is an extended example where there's not capacity in the cluster to start a rescheduled pod, and this blocks destroying a node. You might set the HPA to minReplicas: 3 to avoid this case, even if it means your system will be overprovisioned at the quietest times.
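Combining the two suggestions above could look roughly like this. It is a sketch under assumptions: the cluster is assumed to serve autoscaling/v2 and policy/v1, and the target Deployment name hello and the HPA name hello-hpa are made up for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-hpa                 # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello                   # hypothetical target
  minReplicas: 3                  # keep spare replicas so one can be evicted
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-pdb
spec:
  minAvailable: 1                 # evictions must leave at least one pod available
  selector:
    matchLabels:
      app: hello
```

With three replicas the budget still leaves room to evict pods, so a node can be drained without the application dropping to zero available pods.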

How to create deployments with more than one pod?

I'm hosting an application on the Google Cloud Platform via Kubernetes, and I've managed to set up this continuous deployment pipeline:
Application code is updated
New Docker image is automatically generated
K8s Deployment is automatically updated to use the new image
This works great, except for one issue - the deployment always seems to have only one pod. Because of this, when the next update cycle comes around, the entire application goes down, which is unacceptable.
I've tried modifying the YAML of the deployment to increase the number of replicas, and it works... until the next image update, where it gets reset back to one pod again.
This is the command I use to update the image deployment:
kubectl set image deployment foo-server gcp-cd-foo-server-sha256=gcr.io/project-name/gcp-cd-foo-server:$REVISION_ID
You can use this command if you don't want to edit the Deployment YAML file:
kubectl scale deployment foo-server --replicas=2
Also, look at the update strategy and its maxUnavailable and maxSurge properties.
In your original deployment.yml file, keep replicas at 2 or more; otherwise you can't avoid downtime when only one pod is running and you re-deploy or upgrade.
Deployment with 3 replicas (example):
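apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels: {app: nginx}
spec:
  replicas: 3
  selector:
    matchLabels: {app: nginx}
  template:
    metadata:
      labels: {app: nginx}
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80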
Deployment can ensure that only a certain number of Pods may be down
while they are being updated. By default, it ensures that at least 25%
less than the desired number of Pods are up (25% max unavailable).
Deployment can also ensure that only a certain number of Pods may be
created above the desired number of Pods. By default, it ensures that
at most 25% more than the desired number of Pods are up (25% max surge).
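To make the two knobs concrete, the sketch below sets them explicitly instead of relying on the 25% defaults. The container name and image path are taken from the set image command in the question; the app label is an assumption, and REVISION_ID stands for whatever tag the pipeline substitutes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-server
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0           # never go below the desired replica count
      maxSurge: 1                 # allow one extra pod while the new version starts
  selector:
    matchLabels:
      app: foo-server             # assumed label
  template:
    metadata:
      labels:
        app: foo-server
    spec:
      containers:
      - name: gcp-cd-foo-server-sha256
        image: gcr.io/project-name/gcp-cd-foo-server:REVISION_ID   # tag filled in by the pipeline
```

With these values a rollout briefly runs three pods, and old pods are removed only as new ones become ready.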
Nevermind, I had just set up my deployments wrong - had something to do with using the GCP user interface to create the deployments rather than console commands. I created the deployments with kubectl run app --image ... instead and it works now.

Is it possible for running pods on kubernetes to share the same PVC

I've currently set up a PVC with the name minio-pvc and created a deployment based on the stable/minio chart with the values
mode: standalone
replicas: 1
persistence:
  enabled: true
  existingClaim: minio-pvc
What happens if I increase the number of replicas? Do I run the risk of corrupting data if more than one pod tries to write to the PVC at the same time?
Don't use deployment for stateful containers. Instead use StatefulSets.
StatefulSets are specifically designed for running stateful containers like databases. They are used to persist the state of the container.
Note that each pod binds a separate persistent volume through its own PVC, so multiple pods never write to the same PV. Hope I answered your question.
If you stick with a Deployment instead of a StatefulSet, multiple replicas generally cannot write to the same PVC, because there is no guarantee that the replicas are scheduled on the same node; with a volume that can only be attached to one node at a time, a pod on another node stays pending while it waits for the volume and eventually fails. The solution is to choose a specific node and have all your replicas run on that node.
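The per-pod volumes mentioned above come from the volumeClaimTemplates field. A minimal sketch follows, in which the names store and data, the image, the mount path and the 10Gi size are all illustrative assumptions rather than values from the chart.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: store                     # hypothetical name
spec:
  serviceName: store              # governing headless Service, assumed to exist
  replicas: 2
  selector:
    matchLabels:
      app: store
  template:
    metadata:
      labels:
        app: store
    spec:
      containers:
      - name: store
        image: registry.example.com/store:1.0   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:           # one PVC is created per pod from this template
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi           # illustrative size
```

The controller creates the claims data-store-0 and data-store-1, one for each pod, so the replicas do not share a volume.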
Run the following and assign a label to one of your nodes:
kubectl label nodes <node-name> <label-key>=<label-value>
Say we choose label-key to be labelKey and label-value to be node1. Then you can go ahead and add the following to your YAML file and have the pods scheduled on the same node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  template:
    spec:
      nodeSelector:
        labelKey: node1
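Spelled out in full, such a Deployment might look like the sketch below. The container name, image and mount path are placeholders; the claim name minio-pvc comes from the question and the labelKey: node1 label from the command above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        labelKey: node1           # schedule every replica on the labelled node
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data        # placeholder path
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: minio-pvc    # all replicas mount the same claim
```

Pinning the pods to one node only solves the attachment problem; whether concurrent writes to the shared files are safe still depends on the application.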

How do I make Kubernetes scale my deployment based on the "ready"/ "not ready" status of my Pods?

I have a deployment with a defined number of replicas. I use readiness probe to communicate if my Pod is ready/ not ready to handle new connections – my Pods toggle between ready/ not ready state during their lifetime.
I want Kubernetes to scale the deployment up/ down to ensure that there is always the desired number of pods in a ready state.
If replicas is 4 and there are 4 Pods in ready state, then Kubernetes should keep the current replica count.
If replicas is 4 and there are 2 ready pods and 2 not ready pods, then Kubernetes should add 2 more pods.
How do I make Kubernetes scale my deployment based on the "ready"/ "not ready" status of my Pods?
I don't think this is possible. If a pod is not ready, Kubernetes will not make it ready, because readiness depends on your application. Even if it created a new pod, there would be no guarantee that the new pod becomes ready. You have to resolve the reasons behind the not-ready status yourself. The only thing Kubernetes does is keep not-ready pods from receiving traffic, so that requests don't fail.
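For context, this is roughly what a readiness probe looks like on a pod. The sketch assumes an HTTP endpoint at /ready on port 8080, which is purely hypothetical, as are the names and the image.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                # hypothetical name
  labels:
    app: probe-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /ready              # hypothetical endpoint
        port: 8080
      periodSeconds: 5            # probe every 5 seconds
      failureThreshold: 3         # mark not ready after 3 consecutive failures
```

When the probe fails, the pod is removed from the endpoints of any Service that selects it, but it still counts toward the replica total, so no replacement pod is started.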
Ensuring you always have 4 pods running can be done by specifying the replicas property in your deployment definition:
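apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels: {app: nginx}
spec:
  replicas: 4 # here we define a requirement for 4 replicas
  selector:
    matchLabels: {app: nginx}
  template:
    metadata:
      labels: {app: nginx}
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80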
Kubernetes will ensure that if any pods crash, replacement pods will be created so that a total of 4 are always available.
Pods cannot be scheduled on unhealthy nodes in the cluster. The control plane only places pods on nodes that are healthy, schedulable, and have enough free capacity for the additional pods.
Moreover, what you describe is the self-healing behavior of Kubernetes, which in basic terms is taken care of for you.

Have kube jobs start on waiting pods

I am working on a scenario where I want to be able to maintain some X number of pods in waiting (and managed by kube) and then upon user request (via some external system) have a kube job start on one of those waiting pods. So now the waiting pods count is X-1 and kube starts another pod to bring this number back to X.
This way I'll be able to cut down on the time taken to create a pod, start a container and get it ready to start actual processing. The processing data can be sent to those pods via some sort of messaging (akka or rabbitmq).
I think ReplicationControllers are the best place to keep idle pods, but when I create a job, how can I specify that I want to use one of the pods that are waiting and are managed by the ReplicationController?
I think I got this to work up to a point on which I can build this solution.
So what I am doing is starting a RC with replicas: X (X is the number of idle pods I wish to maintain, usually not a very large number). The pods that it starts have custom label status: idle or something like that. The RC spec.selector has the same custom label value to match with the pods that it manages, so spec.selector.status: idle. When creating this RC, kube ensures that it creates X pods with their status=idle. Somewhat like below:
apiVersion: v1
kind: ReplicationController
metadata:
  name: testrc
spec:
  replicas: 3
  selector:
    status: idle
  template:
    metadata:
      name: idlepod
      labels:
        status: idle
On the other hand I have a job yaml that has spec.manualSelector: true (and yes I have taken into account that the label set has to be unique). With manualSelector enabled, I can now define selectors on the job like below.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: testjob-
spec:
  manualSelector: true
  selector:
    matchLabels:
      status: active
So clearly, RC creates pods with status=idle and job expects to use pods with status=active because of the selector.
So now whenever I have a request to start a new job, I'll update the label on one of the pods managed by the RC so that its status=active. Because the pod no longer matches the RC's selector, the RC releases it from its control and starts another pod to satisfy replicas: X. The released pod is no longer controlled by the RC and is now an orphan. Finally, when I create a job, the selector on this job will match the label of the orphaned pod, and this pod will then be controlled by the new job. I'll send messages to this pod that will start the actual processing and finally bring it to completion.
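A fuller version of the job manifest might look like the sketch below. The pod template, including the container name and image, is an assumption added only to make the manifest complete; the template labels have to match the manual selector.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: testjob-
spec:
  manualSelector: true            # skip the automatically generated selector
  selector:
    matchLabels:
      status: active
  template:
    metadata:
      labels:
        status: active            # must match the selector above
    spec:
      restartPolicy: Never
      containers:
      - name: worker              # hypothetical container
        image: registry.example.com/worker:1.0   # placeholder image
```

Because the manifest uses generateName, it has to be submitted with kubectl create -f rather than kubectl apply -f.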
P.S.: Pardon my formatting. I am new here.