Schedule Kubernetes pods on different physical servers - Kubernetes

In my cluster there are 30 VMs located on 3 different physical servers. I want to deploy the replicas of each workload on different physical servers.
I know I can use podAntiAffinity to place replicas on different VMs, but I can't find any way to guarantee that replicas are spread across different physical servers.
Is there any way to solve this challenge?

I believe you gave the answer ;)
I went to the Kubernetes Patterns book (PDF available for free here) to see if there was something related to that over there, and found exactly that:
To express how Pods should be spread to achieve high availability, or be packed and co-located together to improve latency, Pod affinity and anti-affinity can be used.
Node affinity works at node granularity, but Pod affinity is not limited to nodes and can express rules at multiple topology levels. Using the topologyKey field, and the matching labels, it is possible to enforce more fine-grained rules, which combine rules on domains like node, rack, cloud provider zone, and region [...]
I really like the k8s docs as well; they are very complete and full of examples, so maybe you can get some ideas from there. I think the main idea will be to create your own affinity/anti-affinity rule.
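For example, assuming you add your own node label that identifies the physical server each VM runs on (the key physical-server below is a hypothetical label you would create yourself), a hard podAntiAffinity rule on that key keeps replicas on different physical servers. A minimal sketch:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              # spread over the custom "physical server" domain instead of individual nodes
              topologyKey: physical-server
      containers:
        - name: my-app
          image: nginx

With the "required" variant, a 4th replica would stay Pending on a 3-server setup; switch to preferredDuringSchedulingIgnoredDuringExecution if you would rather co-locate extras than block scheduling.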
----------------------------------- EDIT -----------------------------------
There is a new feature as of k8s 1.18 that may be a better solution, called Pod Topology Spread Constraints:
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
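Assuming the same hypothetical physical-server node label as above, a sketch of the equivalent spread constraint (maxSkew: 1 with DoNotSchedule forces an even spread across the three physical servers) - only the pod template's spec is shown:

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: physical-server   # hypothetical node label identifying the physical host
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
  containers:
    - name: my-app
      image: nginx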

Related

Best method to keep client-server traffic in the same region in Kubernetes/Openshift?

We run a Kubernetes-compatible (OKD 3.11) on-prem / private cloud cluster with backend apps communicating with low-latency Redis databases used as caches and K/V stores. The new architecture design is about to divide worker nodes equally between two geographically distributed data centers ("regions"). We can assume static pairing between node names and regions, and we have now also labeled the nodes with region names.
What would be the recommended approach to protect low-latency communication with the in-memory databases, making client apps stick to the same region as the database they are allowed to use? Spinning up additional replicas of the databases is feasible, but does not prevent round-robin routing between the two regions...
Related: Kubernetes node different region in single cluster
Posting this out of comments as community wiki for better visibility, feel free to edit and expand.
The best option to solve this is to use Istio - Locality Load Balancing. Major points from the link:
A locality defines the geographic location of a workload instance within your mesh. The following triplet defines a locality:
Region: Represents a large geographic area, such as us-east. A region typically contains a number of availability zones. In Kubernetes, the label topology.kubernetes.io/region determines a node's region.
Zone: A set of compute resources within a region. By running services in multiple zones within a region, failover can occur between zones within the region while maintaining data locality with the end-user. In Kubernetes, the label topology.kubernetes.io/zone determines a node's zone.
Sub-zone: Allows administrators to further subdivide zones for more fine-grained control, such as "same rack". The sub-zone concept doesn't exist in Kubernetes. As a result, Istio introduced the custom node label topology.istio.io/subzone to define a sub-zone.
That means that a pod running in zone bar of region foo is not considered to be local to a pod running in zone bar of region baz.
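As a rough sketch of what that could look like for the Redis case (the service name, namespace and region names below are all hypothetical), a DestinationRule can enable locality-aware load balancing with explicit failover; note that outlier detection must be configured for failover to take effect:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: redis-cache
  namespace: db
spec:
  host: redis-cache.db.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:                 # prefer the caller's own region, fail over only when it is unhealthy
          - from: region-a
            to: region-b
          - from: region-b
            to: region-a
    outlierDetection:             # required for locality failover to be applied
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s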
Another option, suggested in the comments, that can be considered along with traffic balancing adjustments: use nodeAffinity to achieve consistency between scheduling pods and nodes in specific "regions".
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (similar to nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The "IgnoredDuringExecution" part of the names means that, similar to how nodeSelector works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node. In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution which will be identical to requiredDuringSchedulingIgnoredDuringExecution except that it will evict pods from nodes that cease to satisfy the pods' node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example of preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
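For illustration, a pod-spec fragment of the "soft" variant, assuming nodes carry the standard topology.kubernetes.io/region label and that region-a is the hypothetical preferred region:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100               # strongest preference, but still only a preference
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/region
                operator: In
                values:
                  - region-a        # hypothetical preferred region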
Update: based on #mirekphd's comment, this will still not fully work in the way that was asked:
It turns out that in practice Kubernetes does not really let us switch off the secondary zone: as soon as we spin up a realistic number of pod replicas (just a few is enough to see it)... they keep at least some pods in the other zone/DC/region by design (which is clever when you realize that it removes the dependency on the docker registry's survival, at least under the default imagePullPolicy for tagged images).
GitHub issue #99630 - NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well
Please refer to #mirekphd's answer
So an effective region-pinning solution is more complex than just using nodeAffinity in the "preferred" version. That alone will cause you a lot of unpredictable surprises due to the opinionated character of Kubernetes, which has zone spreading hard-coded, as seen in this GitHub issue, where they clearly try to put at least some eggs in another basket and see zone selection as an antipattern.
In practice the usefulness of nodeAffinity alone is restricted to scenarios with a very limited number of pod replicas, because when the number of pods exceeds the number of nodes in a region (i.e. typically for the 3rd replica in a 2-node / 2-region setup), the scheduler will start "correcting" or "fighting" with user preference weights (even ones as unbalanced as 100:1) very much in favor of spreading, placing at least one "representative" pod on every node in every region (including the non-preferred ones with the minimum possible weight of 1).
But this default zone spreading issue can be overcome if you create a single-replica container that will act as a "master" or "anchor" (a natural example being a database). For this single-pod "master" nodeAffinity will still work correctly - of course in the HA variant, i.e. "preferred" not "required" version. As for the remaining multi-pod apps, you use something else - podAffinity (this time in the "required" version), which will make the "slave" pods follow their "master" between zones, because setting any pod-based spreading disables the default zone spreading. You can have as many replicas of the "slave" pods as you want and never run into a single misplaced pod (at least at schedule time), because of the "required" affinity used for "slaves". Note that the known limitation of nodeAffinity applies here as well: the number of "master" pod replicas must not exceed the number of nodes in a region, or else "zone spreading" will kick in.
And here's an example of how to label the "master" pod correctly for the benefit of podAffinity and using a deployment config YAML file: https://stackoverflow.com/a/70041308/9962007
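For reference, a minimal sketch of that pattern (labels and names below are hypothetical): the single-replica "master" keeps a preferred nodeAffinity for its region, while the multi-replica "slave" Deployment carries a required podAffinity so its pods follow the "master" into whatever zone it landed in:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cache-master          # hypothetical label carried by the single-replica "master" pod
          topologyKey: topology.kubernetes.io/zone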

How do I make pods spread evenly across nodes when a new host is added?

Overview
Kubernetes scheduling errs on the side of 'not shuffling things around once scheduled and happy', which can lead to quite a level of imbalance in terms of CPU, memory, and container count distribution. It can also mean that Affinity and Topology rules may sometimes not be enforced as the state of affairs changes:
With regards to topology spread constraints introduced in v1.19 (stable)
There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in imbalanced Pods distribution.
Context
We are currently making use of pod topology spread constraints, and they are pretty superb, aside from the fact that they only seem to handle skew during scheduling, and not during execution (unlike the ability to differentiate with Taints and Tolerations).
For features such as Node affinity, we're currently waiting on the ability to add RequiredDuringExecution requirements as opposed to the existing IgnoredDuringExecution ones.
Question
My question is, is there a native way to make Kubernetes re-evaluate and attempt to enforce topology spread skew when a new fault domain (topology) is added, without writing my own scheduler?
Or do I need to wait for Kubernetes to advance a few more releases? ;-) (I'm hoping someone may have a smart answer with regards to combining affinity / topology constraints)
After more research I'm fairly certain that using an outside tool like Descheduler is the best way currently.
There doesn't seem to be a combination of Taints, Affinity rules, or Topology constraints that can work together to achieve the re-evaluation of topology rules during execution.
Descheduler allows you to kill off certain workloads based on user requirements, and let the default kube-scheduler reschedule the killed pods. It can be installed easily with manifests or Helm and run on a schedule. It can even be triggered manually when the topology changes, which is what I think we will implement to suit our needs.
This will be the best means of achieving our goal while waiting for RequiredDuringExecution rules to mature across all feature offerings.
Given that our topology rules mark each node as a topological zone, using the LowNodeUtilization strategy to spread workloads across new hosts as they appear is what we will go with:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"memory": 20
targetThresholds:
"memory": 70

In Kubernetes, how many namespaces can you have?

I want to use the Kubernetes namespace for each user of my application. So potentially I'll need to create thousands of namespaces each with kubernetes resources in them. I want to make sure this is scalable, so I want to ensure that I can have millions of namespaces on a Kubernetes Cluster before I use this construct on a per user basis.
I'm building a web hosting application. So I'm giving resources to each user, but I want them separated by namespaces.
Are there any limitations to the number of Kubernetes namespaces you can create?
"In majority of cases, thresholds are NOT hard limits - crossing the limit results in degraded performance and doesn't mean cluster immediately fails over.
Many of the thresholds (for cluster scope) are given for the largest possible cluster. For smaller clusters, the limits are proportionally lower.
"
#Namespaces = 10000 scope=cluster
source with more data
kube Talk explaining how the data is computed
You'll usually run into limitations with resources and etcd long before you hit a namespace limit.
For scaling, you're probably going to want to scale out your clusters, which most companies treat as cattle, rather than create one giant cluster that becomes a pet - not a scenario you want to be dealing with.

High total CPU request but low total usage (kubernetes resources)

I have a bunch of pods in a cluster that together request almost all (7.35/8) of the available CPU resources on a node, even though their actual total usage is almost nothing (0.34/8).
The pod that currently requests the most only asks for 210m, which I guess is not an outrageous amount - also I would like to enforce some sensible minimum request size for all pods in the cluster. Of course that will add up when there are lots of pods.
It seems I could easily scale down the request by a factor of 10 and leave the limits where they are to begin with.
But is there something else that I should look into instead before doing that - reducing replica count etc.?
Also it looks a bit strange that the pods are not more evenly distributed between the nodes.
Your request values seem overestimated.
You need time and metrics to find the right request/limit for your workload.
Keep in mind that if you change those values, your pods will restart.
Also, it's normal to find some unbalanced nodes in your cluster. Kubernetes will never move a pod unless you ask it to.
For example, if you create a cluster with 3 nodes, fill those 3 nodes with pods and then add another 3 nodes, the new nodes will stay empty.
You can set up a HorizontalPodAutoscaler on your cluster to adapt the number of pods to your workload.
Doing that, your workload will spread among nodes with a correct balance (if you use the default scheduling policy).
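For example, a minimal autoscaling/v2 HorizontalPodAutoscaler targeting a hypothetical my-app Deployment (all values are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU usage exceeds 70% of the request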
I suggest the following:
Resource Allocation: Based on historical values, set your request to a meaningful value with some buffer. To have guaranteed pod resource allocation it may be a good idea to set request and limit to the same value, but that means your pod cannot burst for extra resources. One more thing to note is that scheduling only happens based on the requested value, so if the node has no resources left and your pod tries to burst from its request toward its limit, the pod may be killed and rescheduled. (See the sketch after this list.)
Resource quotas: Check Kubernetes Resource Quotas to have sensible namespace-level quotas to control over-provisioned resources by developers.
Affinity/AntiAffinity: Check the concept of anti-affinity to have your replicas or different pods scheduled across your cluster. You can ensure, for example, that one host or availability zone etc. can have only one replica of your pod (helps with HA), or spread different pods to different nodes (layered scheduling etc.) - check this video.
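A hedged sketch of points 1 and 2 above (names and numbers are hypothetical): a container with request equal to limit for a Guaranteed QoS class, plus a namespace ResourceQuota capping aggregate requests and limits:

apiVersion: v1
kind: Pod
metadata:
  name: worker
  namespace: dev
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.0   # hypothetical image
      resources:
        requests:
          cpu: 100m        # scheduling is based on these values
          memory: 128Mi
        limits:
          cpu: 100m        # request == limit -> Guaranteed QoS, no bursting
          memory: 128Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "4"      # total CPU the namespace may request
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi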
There are good answers already but I would like to add some more info.
It is very important to have a good strategy when calculating how much resources you would need for each container.
Optimally, your pods should be using exactly the amount of resources you requested but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting would depend on trying and adjusting. There is no optimal value that would fit everyone as it depends on many factors related to the application itself, the demand model, the tolerance to errors etc.
Kubernetes best practices: Resource requests and limits is a very good guide explaining the idea behind these mechanisms with a detailed explanation and examples.
Also, Managing Resources for Containers will provide you with the official docs regarding:
Requests and limits
Resource types
Resource requests and limits of Pod and Container
Resource units in Kubernetes
How Pods with resource requests are scheduled
How Pods with resource limits are run, etc
Just in case you'll need a reference.

Kubernetes : Disadvantages of an all Master cluster

Hi!
I was wondering if it would be possible to replicate a VMware-like architecture in Kubernetes.
What I mean by that:
Instead of having the control plane always separated from the worker nodes, I would like to put them all together, so in the end we would obtain a cluster of master nodes on which we can schedule applications. For now I'm using Kata Containers with containerd, so all applications are deployed in 'mini' VMs and there isn't the 'escape from the container' problem. The management of the cluster would be done through a dedicated interface (eth0, 1 Gb). The users would be able to communicate with the apps deployed within the cluster through another interface (eth1, 10 Gb). I would use Keepalived and HAProxy to elect my 'main master' and load balance the traffic.
The question might be 'why would you do that?'. Well, to assure high availability at all times and reduce the management overhead: instead of having two sets of "entities" to manage (the control plane and the worker nodes), simply reduce it to one. That way there won't be problems such as 'I don't have more than 50% of my masters online so there won't be a leader elected', where I would have to eliminate master nodes from my cluster until the percentage of online master nodes is above 50% - that would require technical intervention as fast as possible, which might result in human errors, etc.
Another positive point would be scaling: instead of having two parts of the cluster that I would need to scale (masters and workers), there would be only one; I would just need to add another master/worker to the cluster and that's it. All the management traffic would be redirected to the main master, which uses a Virtual IP (VIP), and in case of an overload the requests would be redirected to another node.
In the end I would have something resembling this:
Photo - Architecture VMWare-like
I'm trying to find disadvantages to this kind of architecture. I know that there would be etcd traffic on each node, but how impactful is it? I know that there would be wasted resources for the control-plane pods on each node, but knowing that these pods (except etcd) won't do much besides waiting, how impactful would that be? With each node being capable of taking the master role, there won't be any downtime. Right now, if my control plane (3 masters) goes down I have to reboot the masters or find a solution as fast as possible before there's a problem with one of the apps that run on the worker nodes.
The topology I'm using right now resembles the following :
Architecture basic Kubernetes
I'm new to Kubernetes so the question might seem stupid, but I would really like to know the advantages/disadvantages between the two and understand why it wouldn't be a good idea.
Thanks a lot for any help! :)
There are two reasons for keeping control planes on their own. The big one is that you only want a small number of etcd nodes, usually 3 or 5, and that's usually the bounding factor on the size of the control plane; you usually want the ability to scale worker nodes independently of that. The second issue is that etcd is very sensitive to IOPS brownouts and can suffer bad cascading failures if the machine runs low on IOPS.
And given that you are doing things on top of VMware anyway, the overhead of managing 3 vs 6 VMs is not generally a difference in kind. This seems like false savings in the long run.