Kubernetes: customizing Pod scheduling and Volume scheduling - kubernetes

I'm trying to use Kubernetes to manage a scenario where I need to run several instances of an application (that is, several Pods). These are my requirements:
When I need to scale up my application, I want to deploy one single Pod on a specific Node (not a random one).
When I need to scale down my application, I want to remove a specific Pod from a specific Node (not a random one).
When a new Pod is deployed, I want it to mount a specific PersistentVolume (not a random one) that I have manually provisioned.
After a Pod has been deleted, I want its PersistentVolume to be re-usable by a different Pod.
So far, I used this naive solution to do all of the above: every time I needed to create a new instance of my application, I created one new Deployment (with exactly one replica) and one PersistentVolumeClaim. So for example, if I need five instances of my application, then I need five Deployments. Though, this solution is not very scalable and it doesn't exploit the full potential of Kubernetes.
I think it would be a lot smarter to use one single template for all the Pods, but I'm not sure whether I should use a Deployment or a Statefulset.
I've been experimenting with Labels and Node Affinity, and I found out that I can satisfy requirement 1, but I cannot satisfy requirement 2 this way. In order to satisfy requirement 2, would it be possible to delete a specific Pod by writing my own custom scheduler?
I don't understand how Kubernetes decides to tie a specific PersistentVolume to a specific PersistentVolumeClaim. Is there a sort of volume scheduler? Can I customize it somehow? This way, every time a new Pod is created, I'd be able to tie it to a specific volume.

There may be a good reason for these requirements so I'm not going to try to convince you that it may not be a good idea to use Kubernetes for this...
Yes - with nodeSelector using labels, node affinity, and anti-affinity rules, pods can be scheduled on "appropriate" nodes.
Static Pods may be something close to what you are looking for. I've never used static pods/bare pods on Kubernetes...they kind of don't (to quote something from the question) "...exploit the full potential of Kubernetes" ;-)
Otherwise, here is what I think will work with out-of-the-box constructs for the four requirements:
Use Deployment like you have - this will give you requirements #1 and #2. I don't believe requirement #2 (nor #1, actually) can be satisfied with StatefulSet. Neither with a ReplicaSet.
Use statically provisioned PVs and selector(s) to (quote) "...tie a specific PersistentVolume to a specific PersistentVolumeClaim" for requirement #3.
Then requirement #4 will be possible - just make sure the PVs use the proper reclaim policy.

Related

Best method to keep client-server traffic in the same region in Kubernetes/Openshift?

We run a Kubernetes-compatible (OKD 3.11) on-prem / private cloud cluster with backend apps communicating with low-latency Redis databases used as caches and K/V stores. The new architecture design is about to divide worker nodes equally between two geographically distributed data centers ("regions"). We can assume static pairing between node names and regions, an now we have added labeling of nodes with region names as well.
What would be the recommended approach to protect low-latency communication with the in-memory databases, making client apps stick to the same region as the database they are allowed to use? Spinning up additional replicas of the databases is feasible, but does not prevent round-robin routing between the two regions...
Related: Kubernetes node different region in single cluster
Posting this out of comments as community wiki for better visibility, feel free to edit and expand.
Best option to solve this question is to use istio - Locality Load Balancing. Major points from the link:
A locality defines the geographic location of a workload instance
within your mesh. The following triplet defines a locality:
Region: Represents a large geographic area, such as us-east. A region typically contains a number of availability zones. In
Kubernetes, the label topology.kubernetes.io/region determines a
node’s region.
Zone: A set of compute resources within a region. By running services in multiple zones within a region, failover can occur between
zones within the region while maintaining data locality with the
end-user. In Kubernetes, the label topology.kubernetes.io/zone
determines a node’s zone.
Sub-zone: Allows administrators to further subdivide zones for more fine-grained control, such as “same rack”. The sub-zone concept
doesn’t exist in Kubernetes. As a result, Istio introduced the custom
node label topology.istio.io/subzone to define a sub-zone.
That means that a pod running in zone bar of region foo is not
considered to be local to a pod running in zone bar of region baz.
Another option that can be considered with traffic balancing adjusting is suggested in comments:
use nodeAffinity to achieve consistency between scheduling pods and nodes in specific "regions".
There are currently two types of node affinity, called
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution. You can think of
them as "hard" and "soft" respectively, in the sense that the former
specifies rules that must be met for a pod to be scheduled onto a node
(similar to nodeSelector but using a more expressive syntax), while
the latter specifies preferences that the scheduler will try to
enforce but will not guarantee. The "IgnoredDuringExecution" part of
the names means that, similar to how nodeSelector works, if labels on
a node change at runtime such that the affinity rules on a pod are no
longer met, the pod continues to run on the node. In the future we
plan to offer requiredDuringSchedulingRequiredDuringExecution which
will be identical to requiredDuringSchedulingIgnoredDuringExecution
except that it will evict pods from nodes that cease to satisfy the
pods' node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution
would be "only run the pod on nodes with Intel CPUs" and an example
preferredDuringSchedulingIgnoredDuringExecution would be "try to run
this set of pods in failure zone XYZ, but if it's not possible, then
allow some to run elsewhere".
Update: based on #mirekphd comment, it will still not be fully functioning in a way it was asked to:
It turns out that in practice Kubernetes does not really let us switch
off secondary zone, as soon as we spin up a realistic number of pod
replicas (just a few is enough to see it)... they keep at least some
pods in the other zone/DC/region by design (which is clever when you
realize that it removes the dependency on the docker registry
survival, at least under default imagePullPolicy for tagged images),
GibHub issue #99630 - NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well
Please refer to #mirekphd's answer
So effective region-pinning solution is more complex than just using nodeAffinity in the "preferred" version. This alone will cause you a lot of unpredictable surprises due to the opinionated character of Kubernetes that has zone spreading hard-coded, as seen in this Github issue, where they clearly try to put at least some eggs in another basket and see zone selection as an antipattern.
In practice the usefulness of nodeAffinity alone is restricted to a scenario with a very limited number of pod replicas, because when the pods number exceeds the number of nodes in a region (i.e. typically for the 3rd replica in a 2-nodes / 2-regions setup), scheduler will start "correcting" or "fighting" with user preference weights (even as unbalanced of 100:1) very much in favor of spreading, placing at least one "representative" pod on every node in every region (including the non-preferred ones with minimum possible weights of 1).
But this default zone spreading issue can be overcome if you create a single-replica container that will act as a "master" or "anchor" (a natural example being a database). For this single-pod "master" nodeAffinity will still work correctly - of course in the HA variant, i.e. "preferred" not "required" version. As for the remaining multi-pod apps, you use something else - podAffinity (this time in the "required" version), which will make the "slave" pods follow their "master" between zones, because setting any pod-based spreading disables the default zone spreading. You can have as many replicas of the "slave" pods as you want and never run into a single misplaced pod (at least at schedule time), because of the "required" affinity used for "slaves". Note that the known limitation of nodeAffinity applies here as well: the number of "master" pod replicas must not exceed the number of nodes in a region, or else "zone spreading" will kick in.
And here's an example of how to label the "master" pod correctly for the benefit of podAffinity and using a deployment config YAML file: https://stackoverflow.com/a/70041308/9962007

Require that k8s pods have a certain label before they run on certain nodes

Say I have some special nodes in my cluster, and I want to be able to identify with a pod label all pods running on these nodes.
Taints & tolerations are close to this, but afaik tolerations aren’t labels, they’re their own thing, and can’t be referenced in places where you expect a label selector, right? Is there any way to require all pods running on these nodes to identify themselves in a way that can then be queried via label selectors?
There are two parts of problem to solve:
1. Ensure that pods with specific labels are scheduled to specific set of nodes.
This can be done by using custom scheduler or a plugin for default scheduler (more here, here, here, here and here).
2. Prevent other schedulers to schedule pods to that nodes
This can by achieved by using nodeAffinity or node taints (more here and here)
A complete solution may require additional apiserver admission controller to set all those properties to a pod that suppose to have a specific label only.
Unfortunately, there is no easy built-in solution as of yet. Hopefuly the above will point you in the right direction.

Kubernetes scale up pod with dynamic environment variable

I am wondering if there is a way to set up dynamically environment variables on a scale depending on the high load.
Let's imagine that we have
Kubernetes with service called Generic Consumer which at the beginning have 4 pods. First of all
I would like to set that 75% of pods should have env variable Gold and 25% Platinium. Is that possible? (% can be changed to static number for example 3 nodes Gold, 1 Platinium)
Second question:
If Platinium pod is having a high load is there a way to configure Kubernetes/charts to scale only the Platinium and then decrease it after higher load subsided
So far I came up with creating 2 separate YAML files with diff env variables and replicas numbers.
Obviously, the whole purpose of this is to prioritize some topics
I have used this as a reference https://www.confluent.io/blog/prioritize-messages-in-kafka.
So in the above example, Generic Consumer would be the Kafka consumer which would use env variable to get bucket config
configs.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
BucketPriorityAssignor.class.getName());
configs.put(BucketPriorityConfig.TOPIC_CONFIG, "orders-per-bucket");
configs.put(BucketPriorityConfig.BUCKETS_CONFIG, "Platinum, Gold");
configs.put(BucketPriorityConfig.ALLOCATION_CONFIG, "70%, 30%");
configs.put(BucketPriorityConfig.BUCKET_CONFIG, "Platinum");
consumer = new KafkaConsumer<>(configs);
If you have any alternatives, please let me know!
As was mention in comment section, the most versitale option (and probably the best for your scenario with prioritization) is to keep two separate deployments with gold and platinium labels.
Regarding first question, like #David Maze pointed, pods from Deployment are identical and you cannot have few pods with one label and few others with different. Even if you would create manually (3 pods with gold and 1 with platiunuim) you won't be able to use HPA.
This option allows you to adjustments depends on the situation. For example you would be able to scale one deployment using HPA and another with VPA or both with HPA. Would help you maintain budget, i.e for gold users you might limit to have maximum 5 pods to run simultaneously and for platinium you could set this maximum to 10.
You could consider Istio Traffic Management to routing request, but in my opinion, method with two separated deployments is more suitable.

How to assign resource limits dynamically?

I would like to define a policy to dynamically assigns resource limits to pods and containers. For example, if there are 4 number of pods scheduled in a specific node, and the memory capacity is 100mi, each pod to be assigned with 25mi memory limit. In other words, the fair share of the node capacity.
So, is it necessary to change the codes in scheduler.go or I need to change other objects as well?
I do agree with Arslanbekov answer, it's contrary to the ideology of scalability used by kubernetes.
The principle is that you define what resources is needed by your application and the cluster do all it can to give this resource to the pod, scalling the resources (pod, nodes) depending on the global consumption of all apps.
What you are asking is the reverse, give resources to the pod depending on the node resources, this way could prove very difficult to allow automatic scallability of the nodes as it would be the resource aim to attain (I may be confusing in my explanation but that shows how difficult it could be).
One way to do what you want would be to size all your pod to the same size to use 80% of the nodes but this would prove wrong if an app need more resources.
I think this is contrary to the ideology of the kubernetes. In this approach, the new application will not be able to get to the node.
At each point in time for the scheduler will be the utilization of 100% each node.

What to set .spec.replicas to for autoscaled deployments in kubernetes?

When creating a kubernetes deployment I set .spec.replicas to my minimum desired number of replicas. Then I create a horizontal pod autoscaler with minimum and maximum replicas.
The easiest way to do the next deployment is to use this same lower bound. When combining it with autoscaling should I set replicas to the minimum desired as used before or should I get the current number of replicas and start from there? This would involve an extra roundtrip to the api, so if it's not needed would be preferable.
There are two interpretations of your question:
1. You have an existing Deployment object and you want to update it - "deploy new version of your app".
In this case you don't need to change either replicas in Deployment object (it's managed by horizontal pod autoscaler) or horizontal pod autoscaler configuration. It will work out of the box. It's enough to change the important bits of deployment spec.
See rolling update documentation for more details.
2. You have an existing Deployment object and you want to create second one with the same application
If you create a separate application it may have a different load characteristics, so it's probably that the desired number of replicas will be different. In any case HPA will adjust it relatively quickly, so IMO setting initial number of replicas to the same number is not needed.