Load balancing onto replicas of pods - kubernetes

We have an AKS cluster and we want to achieve the following two points in our architecture:
We have replicas of pods and we want each pod to serve only one request at a time; basically a one pod, one request design.
When all pods are busy, the next incoming request should not be queued at the pod level. Instead, it should be queued at the service level, and only once a busy pod becomes idle or available should the queued request be dispatched to that pod.
How can we achieve this?

Generally, this could be achieved by creating a custom proxy that creates pods on demand, but in practice it would be very difficult and performance would be poor. David Maze explained this very well in his comment:
You need to write a custom proxy with access to the Kubernetes API that can create new pods on demand; this is not a standard Kubernetes setup. This is also an extremely heavy-weight setup (if it takes tens of seconds to pull and deploy a new pod you can hit HTTP request timeouts very easily) and every Web framework supports handling multiple requests per process.

Related

In kubernetes, is there a way to make statefulset pods linger to finish requests on rolling update?

In Kubernetes, I have a statefulset with a number of replicas.
I've set the updateStrategy to RollingUpdate.
I've set podManagementPolicy to Parallel.
My statefulset instances do not have a persistent volume claim -- I use the statefulset as a way to allocate ordinals 0..(N-1) to pods in a deterministic manner.
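For reference, a minimal sketch of what such a StatefulSet might look like (the names and image are hypothetical, and there are no volumeClaimTemplates):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: my-app                       # hypothetical name
    spec:
      serviceName: my-app                # headless Service that gives pods stable DNS names
      replicas: 3
      podManagementPolicy: Parallel
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: example/my-app:latest   # hypothetical image
            env:
            - name: POD_NAME               # the pod can derive its ordinal from its own name, e.g. my-app-2
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

Each pod gets a name of the form my-app-0 .. my-app-(N-1), which is how the ordinal stays discoverable from inside the pod.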
The main reason for this setup is to keep availability for new requests while rolling out software updates (freshly built containers), while still allowing each container, and other services in the cluster, to "know" its ordinal.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset (mapped by the ordinal) without a temporary outage.
Unfortunately, I don't see a way of doing this -- what am I missing?
Because I don't use volume claims, you might think I could use Deployments instead, but I really do need each of the pods to have a deterministic ordinal that:
is unique at the point of dispatching new service requests (incoming HTTP requests, including public ingresses)
is discoverable by the pod itself
is persistent for the duration of the pod lifetime
is contiguous from 0 .. (N-1)
The second-best option I can think of is using something like ZooKeeper or etcd to manage this property separately, using some of the traditional long-poll or leader-election mechanisms, but given that Kubernetes already knows (or can know) about all the necessary bits, and Kubernetes service mapping knows how to steer incoming requests from old instances to new instances, that seems more redundant and complicated than necessary, so I'd like to avoid it.
I assume that you need this for a stateful workload, i.e. a workload that, for example, requires writes. Otherwise you can use Deployments with multiple pods online for your shards. A key feature of StatefulSets is that they provide unique, stable network identities for the instances.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset.
This behavior is supported by Kubernetes pods. But you also need to implement support for it in your application.
New traffic will not be sent to your "old" pods.
A SIGTERM signal will be sent to the pod - your application may want to listen to this and do some action.
After a configurable "termination grace period", your pod will get killed.
See Kubernetes best practices: terminating with grace for more info about pod termination.
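In the StatefulSet's pod template, a longer grace period for lingering requests might be sketched roughly like this (the 600-second value is only an assumption; your application must still catch SIGTERM and finish its in-flight requests within that window):

    # In the StatefulSet's pod template (spec.template.spec):
    terminationGracePeriodSeconds: 600   # default is 30s, which may be too short for long requests
    containers:
    - name: my-app
      image: example/my-app:latest       # hypothetical image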
Be aware that you should connect to services instead of directly to pods for this to work. E.g. you need to create headless services for the replicas in a StatefulSet.
If your clients connect to a specific replica through the headless service, e.g. replica N, this means that it will not be available for some time during upgrades. You need to decide whether your clients should retry their connections during this period or connect to another replica if N is not available.
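A headless Service for such a StatefulSet might look roughly like this (names and ports are hypothetical); because clusterIP is None, DNS resolves to the individual pod IPs and each replica gets a stable name such as my-app-0.my-app:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app            # must match spec.serviceName of the StatefulSet
    spec:
      clusterIP: None         # "headless": no virtual IP, DNS returns the pod addresses
      selector:
        app: my-app
      ports:
      - name: http
        port: 80
        targetPort: 8080      # hypothetical container port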
If you are in a case where you need:
a stateful workload (e.g. support for write operations)
high availability for your instances
then you need a form of distributed system that does some form of replication/synchronization, e.g. using Raft, or a product that implements this. Such a system is most easily deployed as a StatefulSet.
You may be able to do this using Container Lifecycle Hooks, specifically the preStop hook.
We use this to drain connections from our Varnish service before it terminates.
However, you would need to implement (or find) a script to do the draining.
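A sketch of such a preStop hook (the drain script path is a placeholder; substitute whatever drains or waits for your own workload):

    containers:
    - name: varnish
      image: varnish:stable            # hypothetical image/tag
      lifecycle:
        preStop:
          exec:
            # The kubelet runs this and waits for it to finish (bounded by the
            # termination grace period) before sending SIGTERM to the container.
            command: ["/bin/sh", "-c", "/scripts/drain.sh"]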

Scaling down video conference software in Kubernetes

I'm planning to deploy a custom WebRTC videoconference application (based on NodeJS, using WebSockets) with Kubernetes, but I have some doubts about scaling down this environment.
I'm planning to use a cloud-hosted Kubernetes service (GKE, EKS, AKS, or any other) to be able to auto-scale nodes in the cluster as demand increases and decreases. Scaling up is not the problem; my concern is scaling down.
As I understand it, the cluster will scale down based on some average CPU usage metric across the cluster, and when it tries to remove a node, it will start to drain connections and stop accepting new ones, right? Now imagine that a videoconference is still running on this "pending deletion" node. There are two problems:
1 - Stopping the node before the videoconference finishes (it will drop the meeting)
2 - Because of the draining behaviour when scaling down begins, the node will stop accepting new connections, so if someone tries to join this running video conference, they will get a timeout, right?
So, what is the best strategy for scaling down nodes for a video conference solution? Any ideas?
Thanks
I would say this is not a matter of solving it at the Kubernetes level with some specific scaling strategy, but rather of the application's ability to handle such situations. It isn't even specific to Kubernetes. Imagine that you deploy it directly on compute instances that are also subject to autoscaling, and you'll end up in exactly the same situation when the load decreases and one of the instances is removed from the set.
You should rather ask yourself whether such an application is suitable to be deployed as a Kubernetes workload. I can imagine that such a videoconference session doesn't have to rely on a backend deployed on a single node only. You can even define affinity or anti-affinity rules to prevent your Pods from being scheduled on the same node. So if the whole application cluster is still up and running (its Pods are running on different nodes), eviction of a limited subset of Pods should not have a big impact.
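As a rough sketch, such an anti-affinity rule on the Deployment's pod template could look like this (the label value is hypothetical); it asks the scheduler to avoid placing two conference pods on the same node:

    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  topologyKey: kubernetes.io/hostname
                  labelSelector:
                    matchLabels:
                      app: videoconference   # hypothetical label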
You can actually face the same issue with any other application, as the vast majority of them are based on some session that needs to be established between the client software and the server side. I would say it's the application's responsibility to handle such scenarios. If a user unexpectedly loses the connection, it should be possible to immediately redirect them to a running instance, e.g. a different Pod that is still able to accept new requests.
So basically, if the application is designed to be highly available, scaling in the underlying VMs (when we talk about horizontal scaling we actually talk about scaling in and scaling out), or more specifically Kubernetes nodes, shouldn't affect its high-availability capabilities. On the other hand, if it is not designed to be highly available, a solution such as Kubernetes probably won't help much.
There is no single best strategy for your use case. When a cloud provider scales down, it will pick a node more or less at random and kill it. It's not going to check which node has the lowest resource consumption and kill that one; it might end up killing the node with the most pods running on it.
I would focus on how you want to schedule your pods. I would try to schedule them, if possible, on nodes that already have running pods (pod inter-affinity), and I would set up a PodDisruptionBudget for all Deployments/StatefulSets/etc. (depending on how you run the pods). As a result, the cluster would only scale down a node when no pods are running on it, because the pods on the other nodes are protected by a PDB.
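A minimal PodDisruptionBudget sketch for that approach (the name, selector, and minAvailable value are assumptions); the autoscaler will not drain a node if doing so would violate the budget:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: videoconference-pdb
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: videoconference     # hypothetical label

The inter-affinity part would use podAffinity in the same way the anti-affinity sketch above uses podAntiAffinity.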

Smooth load rebalancing for Kubernetes HPA

I have configured my ingress controller with nginx-ingress hashing and I have defined an HPA for my deployments. When we do load testing, we hit a problem with newly created pods that aren't warmed up enough: the load balancer immediately shifts a portion of the traffic to them, latency spikes, and the service chokes. Is there a way to define some smooth load rebalancing that would move traffic gradually and thus warm up the service in a more natural way?
At a glance, I see two possible reasons for that behaviour:
I think there is a chance that you are facing the same problem as encountered in this question: Some requests fails during autoscaling in kubernetes. In that case, Nginx was sending requests to Pods that were not completely ready. To solve this you can configure a Readiness Probe. Personally, I configure my Readiness Probes to send an HTTP request to a /health endpoint of my services.
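A sketch of such a Readiness Probe on the container (the port and timings are assumptions; the /health path is the one mentioned above):

    containers:
    - name: my-service
      image: example/my-service:latest   # hypothetical image
      ports:
      - containerPort: 8080
      # The pod only starts receiving traffic from the Service (and from
      # nginx-ingress, which routes to the endpoints) once this probe succeeds.
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3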
There is a chance, however, that your application naturally performs slowly during the first requests, usually because of caching or some other operation that needs to be done at the beginning of its life. I encountered this problem in a Django+Gunicorn app where Gunicorn only fully loaded my app on the first request. To solve this I used a PostStart Container Hook which sends a request to my app right after the container is created. Here is an example of its use. You may also have a look at this question: Kubernetes Pod warm-up for load balancing.
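A sketch of that PostStart approach (the port, sleep, and URL are assumptions); the hook fires one request at the app right after the container starts so the first real user doesn't pay the warm-up cost:

    containers:
    - name: django-app
      image: example/django-app:latest   # hypothetical image
      lifecycle:
        postStart:
          exec:
            # Wait briefly for Gunicorn to bind, then hit the app once so it is
            # loaded before real traffic arrives; "|| true" keeps a failed
            # warm-up from killing the container.
            command: ["/bin/sh", "-c", "sleep 5 && wget -q -O /dev/null http://localhost:8000/ || true"]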

Kubernetes: multiple pods or multiple deployments?

I am using kubernetes to deploy a simple application. The pieces are:
a rabbitMQ instance
a stateless HTTP server
a worker that takes jobs from the message queue and processes them
I want to be able to scale the HTTP server and the worker up and down independently of each other. Would it be more appropriate for me to create a single deployment containing one pod for the HTTP server and one for the worker, or separate deployments for the HTTP server / worker?
You should definitely choose separate Deployments for the HTTP server and the worker (a minimal sketch follows this list), for the following reasons:
Their scaling characteristics are different, so it does not make sense to put them in the same Deployment.
The parameters on which you will scale will be different too. For the HTTP server it might be requests per second, and for the worker it will be the number of items pending/to be processed. You can create an HPA for each and scale them on the parameters that suit them best.
The metrics and logs that you want to collect and measure for each would again be different, so it makes sense to keep them separate.
I think the Single Responsibility Principle fits well here too; keeping them in the same pod/deployment would unnecessarily mix things up.
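A minimal sketch of the two separate Deployments (names and images are hypothetical); each can then get its own replica count and its own HPA:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: http-server
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: http-server
      template:
        metadata:
          labels:
            app: http-server
        spec:
          containers:
          - name: http-server
            image: example/http-server:latest
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: worker
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: worker
      template:
        metadata:
          labels:
            app: worker
        spec:
          containers:
          - name: worker
            image: example/worker:latest

The RabbitMQ instance would typically be a third workload (often a StatefulSet), again deployed and scaled on its own.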

What happens when the Kubernetes master fails?

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?
According to the OpenShift 3 documentation, which is built on top of Kubernetes (https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html), if a master fails, nodes continue to function properly, but the system loses its ability to manage pods. Is this the same for vanilla Kubernetes?
In typical setups, the master nodes run both the API and etcd and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.
In the event that they, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for this period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc., until both:
Enough etcd instances are back online to form a quorum and make progress (for a visual explanation of how this works and what these terms mean, see this page).
At least one API server can service requests
In a partially degraded state, the API server may be able to respond to requests that only read data.
However, in any case, life for applications will continue as normal unless nodes are rebooted or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc. should all continue to function for at least some time. Eventually, these things will all fail on different timescales. In single-master setups or complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable; see the coredns cache plugin documentation). This is a good reason to consider a multi-master setup: DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would break DNS queries, and in fact probably all pod and service networking functionality, until at least one master comes back online. Restarting DNS pods or kube-proxy would also be bad.
If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind, or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during API failure, as that routes traffic through the master(s).
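For example, a kind cluster configuration with multiple control-plane nodes, handy for experimenting with control-plane failure, could look like this (node counts are arbitrary); pass it with kind create cluster --config <file>:

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane   # stop these containers to simulate master failure
    - role: control-plane
    - role: control-plane
    - role: worker
    - role: worker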
A Kubernetes cluster without a master is like a company running without a manager.
No one else can instruct the workers (k8s components) other than the Manager (master node); even you, the owner of the cluster, can only instruct the Manager.
Everything works as usual until the work is finished or something stops them (because the master node died after assigning the work).
As there is no Manager to re-assign any work to them, the workers will wait and wait until the Manager comes back.
The best practice is to assign multiple Managers (masters) to your cluster.
Although your data plane and running applications do not immediately start breaking, there are several scenarios where cluster admins will wish they had a multi-master setup. The key to understanding the impact is understanding which components talk to the master, for what and how, and, more importantly, when they will fail if the master fails.
Although the application pods running on the data plane will not be immediately impacted, imagine a very plausible scenario: your traffic suddenly surges and your Horizontal Pod Autoscaler kicks in. The autoscaling would not work, because the Metrics Server collects resource metrics from the kubelets and exposes them through the Metrics API in the Kubernetes API server for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler, but your API server is already dead. If pod memory shoots up because of the high load, the pod will eventually be killed by the OOM killer. If any pods die, then, since the controller manager and scheduler talk to the API server to watch the current state of pods, they too will fail. In short, no new pod will be scheduled and your application may stop responding.
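For context, a typical HPA of this kind might be defined roughly as below (the target name and thresholds are assumptions); with the API server down, neither the metrics it reads nor the scale updates it writes can flow:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web                # hypothetical Deployment
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70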
One thing to highlight is that Kubernetes system components communicate only with the API server. They don't talk to each other directly, and so their functionality can fail as well. An unavailable control plane can mean several things: failure of any or all of these components (API server, etcd, kube-scheduler, controller manager) or, at worst, the entire node having crashed.
If the API server is unavailable, no one can use kubectl, as generally all commands talk to the API server (meaning you cannot connect to the cluster or log in to any pod to check anything on the container file system, and you will not be able to see application logs unless you have an additional centralized log-management system).
If the etcd database fails or gets corrupted, your entire cluster state data is gone, and the admins will want to restore it from backups as early as possible.
In short, a failed single-master control plane may not immediately impact traffic-serving capability, but it cannot be relied on to keep serving your traffic.