Can Service Fabric autoscaling scale out nodes as well?

Based on this link, Service Fabric provides auto scaling of service instances or partitions.
What is confusing is whether it can also auto-scale the nodes themselves in and out (the VMs, i.e. the actual physical environment), which is not mentioned explicitly.

Yes, you can auto scale the cluster as well, assuming that you are running in Azure. Scaling is driven by performance counter data and works by defining rules on the VM scale set.
Note that in order to scale down gracefully and automatically, it's recommended that you use durability level Gold or Silver; otherwise you are responsible for draining the node before it is taken out of the cluster.
More info here and here.


Is it relevant to set “requests” values if I’m not using HPA?

I was wondering whether it is really relevant to set “requests” (CPU/MEM) values if I’m not using HPA.
If those values are not used to scale pods up or down, what is the point?
It's fine, and it will work, if you don't set requests (CPU/MEM) on your workloads.
But consider this scenario: you have 1-2 nodes with a capacity of 1 GB each, and you have not set any requests.
An application that is already running uses about half of a node, around 0.5 GB. Your new app needs 1 GB to start, but because the scheduler is not aware of that minimum requirement, K8s may still schedule its pods onto that node.
The node then runs out of memory and the application crashes.
If you have spare resources in the cluster, have set affinity rules, and are confident in the application code, you can go without setting requests (though this is not best practice).
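As a minimal sketch of the scenario above, this is how the new app could declare its memory requirement so that the scheduler takes it into account. The pod name, image, and CPU value are placeholders chosen for illustration, not values from the question.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: new-app                                  # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/new-app:1.0    # placeholder image
      resources:
        requests:
          memory: "1Gi"    # the scheduler only considers nodes with 1Gi still unreserved
          cpu: "250m"      # assumed value, for illustration only
        limits:
          memory: "1Gi"    # the container is OOM-killed if it exceeds this
```

With this request in place, the pod stays Pending instead of landing on a node whose remaining allocatable memory is below 1Gi.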

How to provision jobs in Kubernetes with very wide range of memory use

I am fairly new to Kubernetes, and I think I understand the basics of provisioning nodes and setting memory limits for pods. Here's the problem I have: my application can require dramatically different amounts of memory, depending on the input (and there is no fool-proof way to predict it). Some jobs require 50MB, some require 50GB. How can I set up my K8s deployment to handle this situation?
I have one strategy that I'd like to try out, but I don't know how to do it: start with small instances (nodes with not a lot of memory), and if the job fails with out-of-memory, then automatically send it to increasingly bigger instances until it succeeds. How hard would this be to implement in Kubernetes?
Natively, K8s supports horizontal autoscaling, i.e. automatically deploying more replicas of a deployment based on a chosen metric such as CPU or memory usage: Horizontal Pod Autoscaling
What you are describing here though is vertical scaling. It is not supported out of the box, but there is a subproject that seems to be able to fulfill your requirements: vertical-pod-autoscaler
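For orientation, a VerticalPodAutoscaler object might look roughly like the sketch below. It assumes the vertical-pod-autoscaler components are installed in the cluster and that the workload is a Deployment named `worker` (a hypothetical name); the memory bounds simply mirror the 50MB to 50GB range from the question. Note that the autoscaler adjusts requests based on observed usage, which is related to, but not the same as, retrying a failed job on a bigger node.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                # hypothetical workload to resize
  updatePolicy:
    updateMode: "Auto"          # allow the autoscaler to evict pods and apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"      # apply to every container in the pod
        minAllowed:
          memory: 50Mi
        maxAllowed:
          memory: 50Gi
```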

kubernetes / prometheus custom metric for horizontal autoscaling

I'm wondering about the approach to take for our server setup. We have pods that are short lived. They are started with a minimum of 3 pods, and each server waits for a single request that it handles; then the pod is destroyed. I'm not sure of the mechanism by which the pod is destroyed, but my question is not about that part anyway.
There is an "active session count" metric that I am envisioning. Each of these pod resources could make a rest call to some "metrics" pod that we would create for our cluster. The metrics pod would expose a sessionStarted and sessionEnded endpoint - which would increment/decrement the kubernetes activeSessions metric. That metric would be what is used for horizontal autoscaling of the number of pods needed.
Since having a pod as "up" counts as zero active sessions, the custom event that increments the session count would update the metric server session count with a rest call and then decrement again on session end (the pod being up does not indicate whether or not it has an active session).
Is it correct to think that I need this metric server (and write it myself)? Or is there something that Prometheus exposes where this type of metric is supported already - rest clients and all (for various languages), that could modify this metric?
Looking for guidance and confirmation that I'm on the right track. Thanks!
There is no single way to solve this, and your question is rather opinion-based. However, there is a useful similar question on Stack Overflow; please check its comments, which may give you some tips. If nothing there works, you will probably have to write the script yourself. There is no exact solution on the Kubernetes side.
Please also consider Apache Flink. It has a Reactive Mode that works in combination with Kubernetes:
Reactive Mode allows to run Flink in a mode, where the Application Cluster is always adjusting the job parallelism to the available resources. In combination with Kubernetes, the replica count of the TaskManager deployment determines the available resources. Increasing the replica count will scale up the job, reducing it will trigger a scale down. This can also be done automatically by using a Horizontal Pod Autoscaler.
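The last sentence of the quote could be sketched as a Horizontal Pod Autoscaler that targets the TaskManager deployment. The deployment name, replica bounds, and CPU target below are assumptions for illustration; they do not come from the Flink documentation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flink-taskmanager          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flink-taskmanager        # assumed name of the TaskManager deployment
  minReplicas: 1
  maxReplicas: 10                  # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU use exceeds 70% of requests
```

In Reactive Mode, each change to the replica count made by the autoscaler would then be picked up by Flink as a change in available resources.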

How do I make pods spread evenly across nodes when a new host is added?

Kubernetes scheduling errs on the side of 'not shuffling things around once scheduled and happy', which can lead to considerable imbalance in CPU, memory, and container count distribution. It can also mean that affinity and topology rules are sometimes no longer enforced as the state of affairs changes.
With regard to topology spread constraints (stable since v1.19):
There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in imbalanced Pods distribution.
We are currently making use of pod topology spread constraints, and they are pretty superb, aside from the fact that they only seem to handle skew during scheduling and not during execution (unlike Taints and Tolerations, which can differentiate between the two).
For features such as Node affinity, we're currently waiting on the ability to add RequiredDuringExecution requirements as opposed to IgnoredDuringExecution requirements.
My question is, is there a native way to make Kubernetes re-evaluate and attempt to enforce topology spread skew when a new fault domain (topology) is added, without writing my own scheduler?
Or do I need to wait for Kubernetes to advance a few more releases? ;-) (I'm hoping someone may have a smart answer with regards to combining affinity / topology constraints)
After more research I'm fairly certain that using an outside tool like Descheduler is the best way currently.
There doesn't seem to be a combination of Taints, Affinity rules, or Topology constraints that can work together to achieve the re-evaluation of topology rules during execution.
Descheduler allows you to kill off certain workloads based on user requirements and lets the default kube-scheduler reschedule the killed pods. It can be installed easily with manifests or Helm and run on a schedule. It can even be triggered manually when the topology changes, which is what I think we will implement to suit our needs.
This will be the best means of achieving our goal while waiting for RequiredDuringExecution rules to mature across all feature offerings.
Given our topology rules mark each node as a topological zone, using a Low Node Utilization strategy to spread workloads across new hosts as they appear will be what we go with.
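apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70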

schedule kubernetes pods on different physical server

My cluster has 30 VMs spread across 3 different physical servers. I want the replicas of each workload to be deployed on different physical servers.
I know I can use podAntiAffinity to deploy replicas on different VMs, but I can't find any way to guarantee that replicas are spread across different physical servers.
Is there any way to solve this?
I believe you gave the answer ;)
I went to the Kubernetes Patterns book (PDF available for free here) to see if it had something related to this, and found exactly that:
To express how Pods should be spread to achieve high availability, or be packed and co-located together to improve latency, Pod affinity and antiaffinity can be used.
Node affinity works at node granularity, but Pod affinity is not limited to nodes and can express rules at multiple topology levels. Using the topologyKey field, and the matching labels, it is possible to enforce more fine-grained rules, which combine rules on domains like node, rack, cloud provider zone, and region [...]
I really like the k8s docs as well, they are super complete and full of examples, so maybe you can get some ideas from here. I think the main idea will be to create your own affinity/antiaffinity rule.
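A minimal sketch of such a rule, assuming every node (VM) has been labeled with the physical server it runs on. The label key `example.com/physical-server`, the app name `web`, and the image are hypothetical; the label could be applied with something like `kubectl label node <vm-name> example.com/physical-server=server-a`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                  # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              # Two pods matching the selector may not share a value of this label,
              # i.e. they may not run on the same physical server.
              topologyKey: example.com/physical-server
      containers:
        - name: web
          image: registry.example.com/web:1.0   # placeholder image
```

Because the rule is a hard requirement, a fourth replica would stay Pending in a cluster with only 3 physical servers.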
----------------------------------- EDIT -----------------------------------
There is a new feature within k8s version 1.18 that may be a better solution.
It's called: Pod Topology Spread Constraints:
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
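As a rough sketch, a spread constraint over physical servers could look like the pod below. It assumes each node carries a user-defined label naming its physical server; the label key `example.com/physical-server`, the app label, and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-example                          # hypothetical name
  labels:
    app: web
spec:
  topologySpreadConstraints:
    - maxSkew: 1                             # pod counts per physical server may differ by at most 1
      topologyKey: example.com/physical-server
      whenUnsatisfiable: DoNotSchedule       # keep the pod Pending rather than violate the skew
      labelSelector:
        matchLabels:
          app: web
  containers:
    - name: web
      image: registry.example.com/web:1.0    # placeholder image
```

Unlike a required anti-affinity rule, this still allows more replicas than there are physical servers, as long as they stay evenly distributed.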