Internal k8s services communication not balanced - kubernetes

I’m running a k8s cluster on AWS EKS.
I have two services A and B.
Service A listens to a RabbitMQ queue and sends an HTTP request to service B (which takes a couple of seconds).
Both services scale based on the number of messages in the queue.
The problem is that the requests from A to B are not balanced.
When scaled to about 100 pods each, I see that the service A pods send requests to only about 60% of the service B pods at any given time.
Meaning, eventually all pods get messages, but some pods are at 100% CPU, receiving 5 messages at a time, while others are at 2% CPU, receiving 1 message every minute or so.
That obviously causes low performance and timeouts.
I’ve read that requests should be distributed round-robin, but when I tried setting 10 fixed replicas of each service (all pods already up and running)
and pushing 10 messages to the queue, I saw that every service A pod pulled a message to send to service B, yet some service B pods never got any request while others got more than one - resulting in one whole process finishing within 4 seconds while another took about 12 seconds.
Any ideas why it works like that, and how to make it more balanced?
Thanks
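
For reference, the kind of distribution check described above can be scripted. Below is a minimal sketch in Python, assuming (purely for illustration) that service B is reachable at the DNS name service-b and exposes a /hostname endpoint that echoes the serving pod's hostname - neither detail comes from the question itself.

    # Minimal sketch: fire N requests at the Service and tally which backend
    # pod answered. Each requests.get() opens a fresh connection, so this
    # shows how kube-proxy spreads new connections across service B pods.
    from collections import Counter
    import requests

    counts = Counter()
    for _ in range(100):
        resp = requests.get("http://service-b/hostname", timeout=5)
        counts[resp.text.strip()] += 1

    for pod, hits in counts.most_common():
        print(f"{pod}: {hits}")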

Related

Is a RabbitMQ queueing system unnecessary in a Kubernetes cluster?

I have just been certified CKAD (Kubernetes Application Developer) by The Linux Foundation.
Since then I have been wondering: is a RabbitMQ queueing system unnecessary in a Kubernetes cluster?
We use workers with a queueing system in order to avoid the 30-second HTTP timeout. Let's say, for example, we have a microservice that generates big PDF documents in an average of 50 seconds each, and you have 20 documents to generate right now. The classical scheme would be a worker that picks the queued documents up one by one (this is the case at the company I have been working for lately).
But in a Kubernetes cluster, by default there is no timeout for HTTP requests going inside the cluster. You can wait 1000 seconds without any issue (20 documents * 50 seconds = 1000 seconds).
Given this last point, is it enough to say that a RabbitMQ queueing system (via the amqplib module) is unnecessary in a Kubernetes cluster? Moreover, Kubernetes handles load balancing across your microservice replicas so well...
But in a Kubernetes cluster, by default there is no timeout for HTTP requests going inside the cluster.
Not sure where you got that idea. Depending on your config there might be no timeouts at the proxy level, but there are still client and server timeouts to consider. Kubernetes doesn't change what you deploy, just how you deploy it. There are certainly other options than RabbitMQ specifically, and other system architectures you could consider, but "queue workers" is still a very common pattern and likely will be forever, even as the tech around it changes.
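
For what it's worth, the "queue workers" pattern stays small even inside Kubernetes. Here is a minimal worker sketch in Python using pika; the queue name, the rabbitmq hostname, and the generate_pdf handler are made up for illustration (the question itself uses the Node.js amqplib module).

    # Minimal queue-worker sketch with pika. Each replica of this worker runs
    # as a pod; RabbitMQ hands every message to exactly one worker, and
    # prefetch_count=1 keeps a slow ~50-second PDF job to one message at a time.
    import pika

    def generate_pdf(body: bytes) -> None:
        ...  # hypothetical long-running work (~50 s per document)

    def on_message(channel, method, properties, body):
        generate_pdf(body)
        # Ack only after the work is done, so a crashed pod's message is redelivered.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="pdf-jobs", durable=True)
    channel.basic_qos(prefetch_count=1)          # one unacked message per worker
    channel.basic_consume(queue="pdf-jobs", on_message_callback=on_message)
    channel.start_consuming()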

How to prevent data inconsistency when one node loses network connectivity in Kubernetes

I have a situation where I have a cluster with a service (let's call it A1) whose data lives on remote storage, CephFS in my case. The number of replicas for the service is 1. Assume I have 5 nodes in my cluster and service A1 resides on node 1. Something happens to node 1's network and it loses connectivity to the CephFS cluster and to my Kubernetes cluster as well (or docker-swarm). The cluster marks it as unreachable and starts a new instance (let's call it A2) on node 2 to keep the replica count at 1. After, say, 15 minutes node 1's network is fixed, node 1 comes back to the cluster, and it still has service A1 running (assume it didn't crash while it had lost connectivity to the remote storage).
I worked with docker-swarm and recently switched to Kubernetes. I see Kubernetes has a feature called StatefulSet, but when I read about it, it doesn't answer my question (or I may have missed something).
Question A: What does the cluster do? Does it keep A2 and shut down A1, or let A1 keep working and shut down A2? (Logically it should shut down A1.)
Question B (and my primary question as well!): Assume the cluster wants to shut down one of these services (for example A1). The service saves some state to storage when it shuts down. In that case A1's stale state would be saved to disk after A2, with a newer state, had already saved something before A1's network got fixed.
There must be some lock when we mount the volume to a container, so that while it is attached to one container, other containers can't write to it (i.e. let A1 fail when it tries to save its old state data to disk).
The way it works - using docker swarm terminology - is this:
You have a service. A service is a description of some image you'd like to run, how many replicas, and so on. Assuming the service specifies that at least 1 replica should be running, it will create a task that schedules a container on a swarm node.
So the service is associated with 0 to many tasks, where each task has 0 containers (if it's still starting) or 1 container (if the task is running or stopped), which lives on a node.
So, when swarm (the orchestrator) detects a node going offline, it principally sees that a number of tasks associated with a service have lost their containers, so the replication (in terms of running tasks) is no longer correct for the service, and it creates new tasks which in turn schedule new containers on the available nodes.
On the disconnected node, the swarm worker notices that it has lost connection to the swarm managers so it cleans up all the tasks it is holding onto as it no longer has current information about them. In the process of cleaning the tasks up, the associated containers get stopped.
This is good because when the node finally reconnects there is no race condition where there are two tasks running. Only "A2" is running and "A1" has been shut down.
This is bad if you have a situation where nodes can lose connectivity to the managers frequently, but you need the services to keep running on those nodes regardless, as they will be shut down each time the workers detach.
The process on K8s is pretty much the same; just swap the terminology.
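
If you want to watch the Kubernetes version of this happen, a minimal sketch with the official Python client is below; the app=a1 label and the default namespace are assumptions for illustration. When a node goes NotReady, you will eventually see its pods terminated (after the eviction timeout) and replacements scheduled on healthy nodes.

    # Minimal sketch: stream pod events for the app and watch the control plane
    # replace pods whose node has gone NotReady.
    from kubernetes import client, config, watch

    config.load_kube_config()   # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod,
                          namespace="default",
                          label_selector="app=a1"):
        pod = event["object"]
        print(event["type"], pod.metadata.name,
              "on", pod.spec.node_name, "-", pod.status.phase)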

Which Kubernetes mode to choose

I have a situation where each message in a message queue has to be processed by a separate instance (one pod can process one message at a time). Many messages can be processed at once, but there is a limit on parallel executions. Once it is reached, no new messages are pulled from the queue. Message processing takes about 30 minutes. No state needs to be stored on the pods between calls (all data is read from a database when a pod starts processing a message). A new message should spawn a new pod; once the processing finishes, the pod should die.
Should I use Deployments, ReplicaSets, StatefulSets, or Services? (We use Kubernetes with Azure.) I guess the main
I've tried ReplicaSets, but in a situation where three messages are being processed and one finishes, scaling down a ReplicaSet can kill a working pod, which is definitely not what I need.
I would say that since you do not need to handle state, you can discard StatefulSets. On the other hand, a Deployment is a higher-level concept built on ReplicaSets, so if you go that route you should use a Deployment, as it takes care of the ReplicaSet for you. Lastly, as your processing is on demand, I would consider using Jobs: once a Job completes its task it frees the resources and dies. This would require extra code to create the Jobs from some helper, but could be very handy.
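
A minimal sketch of such a helper, using the official Kubernetes Python client to create one Job per message; the worker image, namespace, and --message-id argument are assumptions for illustration, not part of the answer.

    # "One Job per message" helper sketch: each message spawns a Job whose pod
    # dies when processing finishes, and the Job itself is garbage-collected
    # after ttl_seconds_after_finished.
    from kubernetes import client, config

    config.load_kube_config()
    batch = client.BatchV1Api()

    def create_worker_job(message_id: str) -> None:
        container = client.V1Container(
            name="worker",
            image="myregistry/worker:latest",        # hypothetical image
            args=["--message-id", message_id],       # hypothetical argument
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(generate_name="process-msg-"),
            spec=client.V1JobSpec(
                backoff_limit=2,                      # retry a failed pod twice
                ttl_seconds_after_finished=300,       # clean up finished Jobs
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        containers=[container],
                        restart_policy="Never",       # required for Jobs
                    )
                ),
            ),
        )
        batch.create_namespaced_job(namespace="default", body=job)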

Is there downtime when a partition is moved to a new node?

Service Fabric offers the capability to rebalance partitions whenever a node is removed or added to the cluster. The Service Fabric Cluster Resource Manager will move one or more partitions to this node so more work can be done.
Imagine a reliable actor service which has thousands of actors running who are distributed across multiple partitions. If the Resource Manager decides to move one or more partitions, will this cause any downtime? Or does rebalancing partitions work the same as upgrading a service?
They act pretty much the same way. The main difference I can point out is that upgrades might affect only the services being updated, while re-balancing might affect multiple services at once. During an upgrade, the cluster might re-balance services as well, to fit the new service instances onto nodes.
Adding or removing nodes I would compare more to node failures. In either case services will be rebalanced because the cluster capacity changed, not because a service metric/load changed.
The main difference between a node failure and cluster scaling (adding/removing a node) is that the rebalance takes the services' state into account during the process: when an infrastructure notification comes in saying that a node is being shut down (for updates, maintenance, or scaling down), SF will ask the infrastructure to wait so it can prepare for this announced 'failure', and then start re-balancing the services.
Even though re-balancing takes service state into account on a scale-down, it should not be considered more reliable than a node failure, because the infrastructure will only wait for a while before shutting down the node (how long it can wait depends on the reliability tier you defined for your cluster) while SF checks whether the services meet health conditions - shutting down services, creating new ones, and checking that they run without errors. If this process takes too long, these services might be killed once the timeout is reached and the infrastructure proceeds with its changes. Also, the new service instances might fail on the new nodes, forcing the services to move again.
When you design your services it is safer to treat re-balancing as a node failure, because in the end it is not much different: your services will move around, data stored in memory will be lost if not persisted, the service address will change, and so on. Services should have replicated data, and clients should always use retry logic and refresh the service location to reduce downtime.
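
As an illustration of that last point, client-side retry with address refresh could look roughly like the sketch below. resolve_endpoint and call_service are hypothetical placeholders, since the actual resolution mechanism depends on how you front your Service Fabric services (reverse proxy, naming service, etc.).

    # Hypothetical retry-with-re-resolve sketch: on failure, refresh the cached
    # service address (the partition may have moved to another node) and retry
    # with exponential back-off.
    import time

    def resolve_endpoint(service_name: str) -> str:
        ...  # placeholder: look up the current address of the service

    def call_service(endpoint: str, request):
        ...  # placeholder: perform the actual call

    def call_with_retry(service_name: str, request, attempts: int = 5):
        endpoint = resolve_endpoint(service_name)
        for attempt in range(attempts):
            try:
                return call_service(endpoint, request)
            except ConnectionError:
                # The replica may have moved during re-balancing; back off,
                # re-resolve the address, and try again.
                time.sleep(2 ** attempt)
                endpoint = resolve_endpoint(service_name)
        raise RuntimeError(f"{service_name} unreachable after {attempts} attempts")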
The main difference between a service upgrade and service rebalancing is that during an upgrade all replicas from all partitions get turned off on a particular node. According to the documentation, balancing is done on a per-replica basis, i.e. only some replicas from some partitions will get moved, so there shouldn't be any outage.

How to use the Python Kubernetes client in a way resilient to GKE Kubernetes Master disruptions?

We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that some automatic scaling of the control plane / Master VM occurs when the node count increases from 0-5 to 6+ and potentially at other times when more nodes are added.
It seems like the control plane can go down at times like this, when many nodes have been brought up. In and around when this happens, our Python scripts that monitor pods via the control plane often crash, seemingly unable to find the KubeApi/Control Plane endpoint triggering some of the following exceptions:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify, what we're doing with the Python client is reading the status of the pod of interest via read_namespaced_pod in a loop every few minutes, and catching exceptions similar to those in the provided example (in addition, we've also tried catching exceptions for the underlying urllib calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after the specified maximum number of retries, even if that number is high (e.g. keep retrying for >5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a node pool's size changes, depending on the size, this can initiate a change in the size of the master. Here are the node pool sizes mapped to the master sizes. In the case where the node pool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process, the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
At this point all API calls to the master will fail, including the ones from the Python API client and kubectl. However, after 1-5 minutes the master should be available again and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 nodes to 20 nodes, and for 1-5 minutes the master wasn't available.
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I got:
“Unable to connect to the server: dial tcp”
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate the kubernetes.client.CoreV1Api object, as it is just a client pointing at the same API endpoint.
According to your description, your master wasn't accessible even after 5 minutes, which signals a potential issue with your master or with the setup of the Python script. To troubleshoot this further while your Python script runs, you can check the availability of the master on the side by running any kubectl command.
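
To make the polling loop described in the question survive such a 1-5 minute window, a back-off that keeps retrying for longer than the outage is usually enough. A minimal sketch follows; the pod name, namespace, and the 10-minute budget are assumptions, not values from the post.

    # Minimal sketch: poll a pod but tolerate a temporary control-plane outage
    # (e.g. a 1-5 minute master resize) by retrying with capped exponential
    # back-off for up to max_wait seconds before giving up.
    import time
    import urllib3
    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def read_pod_with_retry(name, namespace="default", max_wait=600, delay=5):
        deadline = time.monotonic() + max_wait
        while True:
            try:
                return v1.read_namespaced_pod(name=name, namespace=namespace)
            except (ApiException, urllib3.exceptions.HTTPError) as exc:
                if time.monotonic() + delay > deadline:
                    raise  # still failing after the whole budget; give up
                print(f"API unavailable ({type(exc).__name__}), retrying in {delay}s")
                time.sleep(delay)
                delay = min(delay * 2, 60)   # cap the back-off at one minute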