Can I have nodes with different machine specifications (e.g., low performer, medium performer, high performer) in my cluster?
Can I target a partition to run on a specific node?
Can I specify that I may use up to 100 nodes, initially create 10 partitions each running on its own node (so 10 partitions on 10 nodes), but over time drop and add partitions so that hours later I'm using 5 partitions on 5 nodes, and later 96 partitions on 96 nodes (all this ignores replicas)?
Yes, you can use Node Types. A NodeType is the node definition used to create the cluster's virtual machines. It is based on Virtual Machine Scale Sets; the scale set defines the OS, memory, disk, and so on. In your case you would create the node types low performer, medium performer and high performer, and you can define how many instances (VMs) each NodeType will have. For more information, check here
In Service Fabric you have placement constraints, which define conditions that are validated before a service is placed on a particular node. For example, one of the constraints you can create is (NodeType==MediumPerformer); this will make SF place your service only on nodes of type MediumPerformer. The only caveat is that the same rule is used for all replicas and partitions; if you want different behavior, you would have to create new named service instances with different rules. For more information, check here
Service Partitions are immutable, so you won't be able to change the number of partitions after the service is deployed. You can bypass this limitation by creating multiple named services instead. For more information, check here
We run a Kubernetes-compatible (OKD 3.11) on-prem / private-cloud cluster with backend apps communicating with low-latency Redis databases used as caches and K/V stores. The new architecture design is about to divide worker nodes equally between two geographically distributed data centers ("regions"). We can assume static pairing between node names and regions, and we have now added labeling of nodes with region names as well.
What would be the recommended approach to protect low-latency communication with the in-memory databases, making client apps stick to the same region as the database they are allowed to use? Spinning up additional replicas of the databases is feasible, but does not prevent round-robin routing between the two regions...
Related: Kubernetes node different region in single cluster
Posting this out of comments as community wiki for better visibility, feel free to edit and expand.
The best option for this question is to use Istio's Locality Load Balancing. Key points from the link (a minimal configuration sketch follows after the quote):
A locality defines the geographic location of a workload instance
within your mesh. The following triplet defines a locality:
Region: Represents a large geographic area, such as us-east. A region typically contains a number of availability zones. In
Kubernetes, the label topology.kubernetes.io/region determines a
node’s region.
Zone: A set of compute resources within a region. By running services in multiple zones within a region, failover can occur between
zones within the region while maintaining data locality with the
end-user. In Kubernetes, the label topology.kubernetes.io/zone
determines a node’s zone.
Sub-zone: Allows administrators to further subdivide zones for more fine-grained control, such as “same rack”. The sub-zone concept
doesn’t exist in Kubernetes. As a result, Istio introduced the custom
node label topology.istio.io/subzone to define a sub-zone.
That means that a pod running in zone bar of region foo is not
considered to be local to a pod running in zone bar of region baz.
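To make that concrete, here is a minimal, hedged sketch (not from the original answer) of the Istio side of such a setup. It assumes a Redis service reachable as redis-cache in namespace caches and regions labelled region-a / region-b; all of these names are made up for illustration. Note that Istio only activates locality load balancing when outlier detection is also configured on the same DestinationRule:

# Hypothetical DestinationRule: prefer endpoints in the caller's own region,
# fail over to the other region only when the local endpoints are unhealthy.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: redis-cache-locality
  namespace: caches            # assumed namespace
spec:
  host: redis-cache.caches.svc.cluster.local   # assumed service name
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: region-a       # assumed region labels
          to: region-b
    outlierDetection:          # required for locality LB to take effect
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s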
Another option, suggested in the comments, is to adjust scheduling rather than traffic balancing (a minimal manifest sketch follows after the quoted docs):
use nodeAffinity to achieve consistency between scheduling pods and nodes in specific "regions".
There are currently two types of node affinity, called
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution. You can think of
them as "hard" and "soft" respectively, in the sense that the former
specifies rules that must be met for a pod to be scheduled onto a node
(similar to nodeSelector but using a more expressive syntax), while
the latter specifies preferences that the scheduler will try to
enforce but will not guarantee. The "IgnoredDuringExecution" part of
the names means that, similar to how nodeSelector works, if labels on
a node change at runtime such that the affinity rules on a pod are no
longer met, the pod continues to run on the node. In the future we
plan to offer requiredDuringSchedulingRequiredDuringExecution which
will be identical to requiredDuringSchedulingIgnoredDuringExecution
except that it will evict pods from nodes that cease to satisfy the
pods' node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution
would be "only run the pod on nodes with Intel CPUs" and an example
preferredDuringSchedulingIgnoredDuringExecution would be "try to run
this set of pods in failure zone XYZ, but if it's not possible, then
allow some to run elsewhere".
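As a minimal sketch of the "soft" (preferred) variant quoted above, applied to the region question: the deployment below asks the scheduler to prefer nodes whose standard topology.kubernetes.io/region label equals dc-a. The deployment name, app label, image and region value are assumptions, not from the original posts:

# Hedged example: prefer (but do not require) scheduling into region "dc-a".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-client           # assumed name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis-client
  template:
    metadata:
      labels:
        app: redis-client
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/region
                operator: In
                values:
                - dc-a         # assumed region label value
      containers:
      - name: app
        image: redis:7         # placeholder image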
Update: based on #mirekphd's comment, this will still not fully behave the way that was asked for:
It turns out that in practice Kubernetes does not really let us switch
off secondary zone, as soon as we spin up a realistic number of pod
replicas (just a few is enough to see it)... they keep at least some
pods in the other zone/DC/region by design (which is clever when you
realize that it removes the dependency on the docker registry
survival, at least under default imagePullPolicy for tagged images),
GitHub issue #99630 - NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well
Please refer to #mirekphd's answer
So an effective region-pinning solution is more complex than just using nodeAffinity in the "preferred" version. That alone will cause you a lot of unpredictable surprises due to the opinionated character of Kubernetes, which has zone spreading hard-coded, as seen in this GitHub issue, where they clearly try to put at least some eggs in another basket and see zone selection as an antipattern.
In practice, nodeAffinity alone is only useful with a very limited number of pod replicas: once the number of pods exceeds the number of nodes in a region (typically at the 3rd replica in a 2-node / 2-region setup), the scheduler starts "correcting" or "fighting" against the user's preference weights (even ones as unbalanced as 100:1) in favor of spreading, placing at least one "representative" pod on every node in every region (including the non-preferred ones with the minimum possible weight of 1).
But this default zone spreading can be overcome by creating a single-replica workload that acts as a "master" or "anchor" (a natural example being a database). For this single-pod "master", nodeAffinity still works correctly - in the HA variant, i.e. the "preferred" rather than "required" version. For the remaining multi-pod apps, you use something else: podAffinity (this time in the "required" version), which makes the "slave" pods follow their "master" between zones, because setting any pod-based spreading rule disables the default zone spreading. You can have as many replicas of the "slave" pods as you want and never get a single misplaced pod (at least at schedule time), thanks to the "required" affinity used for the "slaves". Note that the known limitation of nodeAffinity applies here as well: the number of "master" pod replicas must not exceed the number of nodes in a region, or else "zone spreading" will kick in.
And here's an example of how to label the "master" pod correctly for the benefit of podAffinity, using a deployment config YAML file: https://stackoverflow.com/a/70041308/9962007
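For completeness, here is a hedged sketch of the "slave" side of that pattern (the linked answer shows the "master" side). It assumes the single-replica "master" pod carries the label role: db-master; that label, the deployment name and the image are illustrative only:

# Hedged example: "slave" pods are required to be scheduled into the same
# region as the pod labelled role=db-master; with any pod-based spreading
# rule in place, the default zone spreading no longer applies.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend                # assumed name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                role: db-master          # assumed label on the "master" pod
            topologyKey: topology.kubernetes.io/region
      containers:
      - name: app
        image: example/backend:1.0       # placeholder image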
In my cluster there are 30 VMs located on 3 different physical servers. I want to deploy the replicas of each workload across different physical servers.
I know I can use podAntiAffinity to deploy replicas on different VMs, but I can't find any way to guarantee that replicas are spread across different physical servers.
Is there any way to solve this challenge?
I believe you gave the answer ;)
I went to the Kubernetes Patterns book (PDF available for free here) to see if there was something related to that over there, and found exactly that:
To express how Pods should be spread to achieve high availability, or be packed and co-located together to improve latency, Pod affinity and antiaffinity can be used.
Node affinity works at node granularity, but Pod affinity is not limited to nodes and
can express rules at multiple topology levels. Using the topologyKey field, and the
matching labels, it is possible to enforce more fine-grained rules, which combine
rules on domains like node, rack, cloud provider zone, and region [...]
I really like the k8s docs as well; they are very complete and full of examples, so maybe you can get some ideas from there. I think the main idea is to create your own affinity/anti-affinity rule, for example something like the sketch below.
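A hedged sketch of such a rule, assuming you first label every VM (node) with the physical server it runs on, e.g. kubectl label node <vm-name> physical-server=server-1; the label key, deployment name and image are assumptions:

# Hedged example: required anti-affinity on the custom node label
# "physical-server" keeps replicas of the same workload on distinct servers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-workload            # assumed name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-workload
  template:
    metadata:
      labels:
        app: my-workload
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-workload
            topologyKey: physical-server   # assumed custom node label
      containers:
      - name: app
        image: nginx:1.25      # placeholder image

With the "required" version and only 3 physical servers, more than 3 replicas would stay unschedulable; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.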
----------------------------------- EDIT -----------------------------------
There is a new feature in k8s version 1.18 that may be a better solution.
It's called Pod Topology Spread Constraints:
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
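A hedged sketch of what that could look like for the 3-physical-server case, again assuming a custom physical-server node label (all names are illustrative):

# Hedged example: spread replicas evenly across the values of the
# "physical-server" node label; maxSkew: 1 means the replica counts on any
# two servers may differ by at most one, and DoNotSchedule makes this a hard
# constraint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-workload            # assumed name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-workload
  template:
    metadata:
      labels:
        app: my-workload
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: physical-server     # assumed custom node label
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-workload
      containers:
      - name: app
        image: nginx:1.25      # placeholder image

Unlike plain anti-affinity, this also works when the number of replicas exceeds the number of servers, because it only bounds the skew rather than forbidding co-location.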
Given a K8s cluster (a managed cluster, for example AKS) with 2 worker nodes, I've read that if one node fails, all the pods will be restarted on the second node.
Why would you need more than 2 worker nodes per cluster in this scenario? You can always choose the number of nodes you want, and the more you select, the more expensive it is.
It depends on the solution that you are deploying in the Kubernetes cluster and the nature of high availability that you want to achieve.
If you want to run in an active-standby mode, where, if one node fails, the pods are moved to other nodes, two nodes work fine (as long as the single surviving node has the capacity to run all the pods).
Some databases / stateful applications, for instance, need a minimum of three replicas, so that you can reconcile if there is a mismatch/conflict in the data due to a network partition (i.e. you can pick the content held by two of the three replicas).
For instance, etcd needs 3 replicas.
If whatever you are building needs only two nodes, then you wouldn't need more than 2. If you are building anything big, where the amount of compute and memory needed is much higher, then instead of opting for expensive nodes with huge CPU and RAM, you could join more and more lower-priced nodes to the cluster. This is called horizontal scaling.
Let's say I have Hello-Service. In Lagom, this service can run across multiple nodes of a single cluster.
So within Cluster 1, we can have multiple "copies" of Hello-Service:
Cluster1: Hello-Service-1, Hello-Service-2, Hello-Service-3
But is it possible to run service Hello-Service across multiple clusters?
Like this:
Cluster1: Hello-Service-1, Hello-Service-2, Hello-Service-3,
Cluster2: Hello-Service-4, Hello-Service-5, Hello-Service-6
What I want to achieve is better scalability of the read-side processors and event consumers:
In Lagom, we need to set up front the number of shards for a given event tag within the cluster.
So I wonder if I can just add another cluster to distribute the load across them.
And, of course, I'd like to shard persistent entities by some key.
(Let's say that I'm building a multi-tenant application; I would shard entities by organization id, so all entities of one set of organizations would go into Cluster 1, and entities of another set of organizations would go into Cluster 2, so that I can have sharded read-side processors per cluster which handle only a subset of events/entities within that cluster (for better scalability).)
With a single cluster approach, as a system grows, a sharded processor within a single cluster may become slower and slower because it needs to handle more and more events.
So as the system grows, I would just add a new cluster (Let's say, Cluster 2, then Cluster 3, which would handle their own subset of events/entities)
If you are using sharded read sides, Lagom will distribute the processing of the shards across all the nodes in the cluster. So, if you have 10 shards and 6 nodes in 1 cluster, then each node will process 1-2 shards. If you deploy two clusters of 3 nodes each, then each node will end up processing 3-4 shards, but every event will be processed twice, once in each cluster. That's not helping scalability; that's doing twice as much work as needs to be done. So I don't see why you would want two clusters: just have one cluster, and Lagom will distribute the shards evenly across it.
If you are not using sharded read sides, then it doesn't matter how many nodes you have in your cluster, all events will be processed by one node. If you deploy a second cluster, it won't share the load, it will also process the same events, so you'll get double processing of each event by each cluster, which is not what you want.
So, just use sharded read sides, and let Lagom distribute the work across your single cluster for you, that's what it's designed to do.
With the understanding that Ubernetes is designed to fully solve this problem, is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate datacenters?
Additionally assuming that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
Has someone implemented something like this as a stop-gap solution before Ubernetes drops, and if so, how has it worked, and what considerations should be taken into account when running like this?
is it currently possible (not necessarily recommended) to span a
single K8s/OpenShift cluster across multiple internal corporate
datacenters?
Yes, it is currently possible. Nodes are given the address of an apiserver and client credentials and then register themselves into the cluster. Nodes don't know (or care) if the apiserver is local or remote, and the apiserver allows any node to register as long as it has valid credentials, regardless of where the node exists on the network.
Additionally assuming that latency between data centers is relatively
low and that infrastructure across the corporate data centers is
relatively consistent.
This is important, as many of the settings in Kubernetes assume (either implicitly or explicitly) a high bandwidth, low-latency network between the apiserver and nodes.
Example: Given 3 corporate DC's, deploy 1..* masters at each
datacenter (as a single cluster) and have 1..* nodes at each DC with
pods/rc's/services/... being spun up across all 3 DC's.
The downside of this approach is that if you have one global cluster you have one global point of failure. Even if you have replicated, HA master components, data corruption can still take your entire cluster offline. And a bad config propagated to all pods in a replication controller can take your entire service offline. A bad node image push can take all of your nodes offline. And so on. This is one of the reasons that we encourage folks to use a cluster per failure domain rather than a single global cluster.