How to spin up/down workers programmatically at run-time on Kubernetes based on new Redis queues and their load?

Suppose I want to implement this architecture deployed on Kubernetes cluster:
Gateway
A simple RESTful HTTP microservice accepting scraping tasks (URLs to scrape along with postback URLs).
Request Queues - Redis (or other message broker) queues created dynamically per unique domain (when a new domain is encountered, the gateway should programmatically create a new queue; if a queue for the domain already exists, it just places the message in it).
Response Queue - a Redis (or other message broker) queue used to post Worker results (scraped HTML pages along with their postback URLs).
Workers - worker processes which should spin up at runtime when a new queue is created and scale down to zero when the queue is emptied.
Response Workers - worker processes consuming the response queue and sending postback results to the scraping client (these should also be able to scale down to zero).
I would like to deploy the whole solution as dockerized containers on a Kubernetes cluster.
So my main concerns/questions would be:
Creating Redis or other message broker queues dynamically at run-time via code. Is it viable? Which broker is best for that purpose? I would prefer Redis if possible, since I've heard it's the easiest to set up and it supports massive throughput. Ideally my scraping tasks will be short-lived, so I think Redis should be fine if possible.
Creating Worker consumers at runtime via code. I need some kind of Kubernetes-compatible technology which would be able to react to a newly created queue and spin up a Worker consumer container that listens to that queue, and later scale it up/down based on the load of that queue. Any suggestions for such technology? I've read a bit about Knative and its Eventing mechanism, so would it be suited for this use case? I don't know if I should continue investing my time in reading its documentation.
Best tools for Redis queue management/Worker management: I would prefer C# and Node.JS tooling. Would something like Bull for Node.JS be sufficient? Ideally, though, I would want to produce queues and messages in the Gateway using C# and consume them in Node.JS (Workers).
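As a rough sketch of what the producer side could look like in Node.JS with Bull (the queue names, Redis URL, and payload shape below are illustrative assumptions, not part of the original design; note also that Bull uses its own Redis key layout, so producing from C# would likely mean either matching Bull's conventions or agreeing on a broker-neutral format):

```typescript
// Hypothetical gateway-side sketch: one Bull queue per scraped domain.
import Queue from 'bull';

const REDIS_URL = process.env.REDIS_URL ?? 'redis://redis:6379';
const queues = new Map<string, Queue.Queue>();

// Bull creates the underlying Redis structures lazily, so "creating a queue
// dynamically" is just instantiating a Queue object with a new name.
function queueForDomain(domain: string): Queue.Queue {
  let q = queues.get(domain);
  if (!q) {
    q = new Queue(`scrape:${domain}`, REDIS_URL);
    queues.set(domain, q);
  }
  return q;
}

// Called by the Gateway's HTTP handler for each incoming scraping task.
export async function enqueueScrapeTask(url: string, postbackUrl: string) {
  const domain = new URL(url).hostname;
  await queueForDomain(domain).add({ url, postbackUrl }, { removeOnComplete: true });
}
```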

If you mean vertical scaling, it definitely won't be a viable solution, since it requires pod restarts. Horizontal scaling is more viable, but you need to consider the fact that even spinning up nodes or pods takes some time, and it is always advisable to have enough capacity in place to serve your upcoming traffic; otherwise this delay will affect some features of your application and there might be a business impact. Just having autoscalers isn't enough; you should also have proper metrics in place for monitoring your application.
This documentation details how to scale your Redis and worker pods using KEDA. KEDA stands for Kubernetes Event-driven Autoscaling; it is a component that sits on top of existing Kubernetes primitives (such as the Horizontal Pod Autoscaler) to scale any number of Kubernetes containers based on the number of events that need to be processed.
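To tie this back to spinning workers up programmatically: with KEDA installed, the Gateway could create a ScaledObject alongside each new queue, so a worker Deployment scales between zero and N replicas based on the Redis list length. The sketch below is only illustrative (the Deployment name, namespace, Redis address, list key, and thresholds are all assumptions, and the call signature is from the 0.x releases of @kubernetes/client-node):

```typescript
// Hypothetical sketch: when the Gateway creates a queue for a new domain, it also
// creates a KEDA ScaledObject so a matching worker Deployment scales 0..N with
// the queue length.
import * as k8s from '@kubernetes/client-node';

export async function createScaledObjectForDomain(domain: string) {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const api = kc.makeApiClient(k8s.CustomObjectsApi);

  const scaledObject = {
    apiVersion: 'keda.sh/v1alpha1',
    kind: 'ScaledObject',
    metadata: { name: `scraper-${domain.replace(/\./g, '-')}` },
    spec: {
      scaleTargetRef: { name: 'scraper-worker' },  // assumed worker Deployment
      minReplicaCount: 0,                          // scale to zero when idle
      maxReplicaCount: 10,
      triggers: [
        {
          type: 'redis',
          metadata: {
            address: 'redis.default.svc.cluster.local:6379', // assumed Redis service
            listName: `bull:scrape:${domain}:wait`,           // assumed list key
            listLength: '5',                                  // pending jobs per replica
          },
        },
      ],
    },
  };

  // Positional signature as in the 0.x @kubernetes/client-node releases.
  await api.createNamespacedCustomObject(
    'keda.sh', 'v1alpha1', 'default', 'scaledobjects', scaledObject,
  );
}
```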

Related

What is a common strategy for synchronized communication between replicas of the same pods?

Let's say we have the following apps:
API app: responsible for serving user requests.
Backend app: responsible for handling user requests that are long-running tasks. It writes progress to a database (Postgres) and a distributed cache (Redis).
Both apps are scalable services. A single Backend app handles multiple tenants (e.g. customers here), but one customer is assigned to a single Backend app only.
I have a use case where I need the API layer to connect to the specific replica which is handling that customer. Is there a common pattern for this?
A few strategies I have in mind:
Pub/Sub (probably using Redis): the problem is we want a synchronous, guaranteed response.
gRPC: using the pod IP to connect to a specific pod is not a standard approach.
Creating a Service at runtime by adding labels to the replicas and using those -- looks promising.
Do let me know if there is a common pattern, example architecture, or standard way of doing this.
Note: [the above is a simulation of a production use case; names and the actual use case have been changed]
You should aim to keep your services stateless; in a Kubernetes environment there is no telling when one pod might be replaced by another due to worker node maintenance.
If you have long-running tasks that cannot be completed within the configured grace period for pods to shut down during a worker node drain/evacuation, you need to implement some kind of persistent work queue, as you are thinking about in option 1. I suggest you look into the saga pattern.
Another pattern we usually employ is to let the worker service write the current state of the job to the database and let the client poll the status every few seconds. This does, however, require some way of handling half-finished jobs that might be abandoned by pods that are forced to shut down.
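A minimal sketch of that second pattern, with hypothetical routes and table/column names (a stale updated_at timestamp is one way to spot half-finished jobs abandoned by a terminated pod):

```typescript
// Minimal sketch of the "write status, let the client poll" pattern.
import express from 'express';
import { Pool } from 'pg';

const app = express();
const db = new Pool(); // connection settings come from the PG* environment variables

// Worker side: update progress as the long-running task advances.
export async function reportProgress(jobId: string, progress: number, state: string) {
  await db.query(
    'UPDATE jobs SET progress = $1, state = $2, updated_at = now() WHERE id = $3',
    [progress, state, jobId],
  );
}

// API side: the client polls this endpoint every few seconds.
app.get('/jobs/:id', async (req, res) => {
  const { rows } = await db.query(
    'SELECT state, progress, updated_at FROM jobs WHERE id = $1',
    [req.params.id],
  );
  if (rows.length === 0) {
    res.status(404).end();
    return;
  }
  res.json(rows[0]);
});

app.listen(8080);
```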

How to scale a Knative service based on custom metrics?

I'm using Knative Serving with the KPA. Autoscaling is available in Knative based on concurrency and RPS, but we need to scale different services based on queue lengths because there are long-running async processes. Is there any way we can achieve this in Knative?
I can't use the Knative HPA class because we need Knative's scale-to-zero feature.
Thanks in advance!
If you have async (background or scheduled) processes, it's likely that Knative is not a good match for your application. There has been some investigation into exposing the HPA v2 custom metrics scaling options (which might preclude scale to zero, as you note), but even with HPA v2 scaling, you'll still run into problems.
The problem with background processes is that Knative and Kubernetes don't have visibility into which Pods are still doing work, so they are just as likely to shut down a Pod that is doing work as one that is idle.
One workaround would be to move the async work to be synchronous with a request (possibly by using eventing to send a "do work" event) and then process those events synchronously -- the eventing Broker won't get upset if your requests take a long time to complete. If you're worried about non-uniform processing times, you can even run a second copy of the Knative Service just for handling the long-running requests.
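A minimal sketch of that workaround, assuming an Express service sitting behind a Knative Eventing Trigger; the payload shape and doLongRunningWork are hypothetical:

```typescript
// Sketch: handle a "do work" event synchronously inside the request, so the
// work is visible to Knative as an in-flight request.
import express from 'express';

const app = express();
app.use(express.json());

async function doLongRunningWork(task: { id: string }) {
  // ... the actual long-running processing goes here ...
}

// A Knative Eventing Broker/Trigger can POST events to this endpoint.
app.post('/', async (req, res) => {
  try {
    await doLongRunningWork(req.body); // finish the work before replying
    res.status(200).end();             // 2xx = event handled
  } catch {
    res.status(500).end();             // non-2xx can be retried, depending on the
                                       // Trigger's delivery settings
  }
});

// Knative injects the PORT environment variable into the container.
app.listen(Number(process.env.PORT ?? 8080));
```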

Design a k8s app which gets data from an external source and sends it to the same destination

I have an app that gets data from a third-party data source; the source sends data to my app automatically and I can't filter it, I can only receive everything. When data arrives, my app transmits it to a RocketMQ topic.
Now I have to containerize this app and deploy it as a k8s Deployment with 3 replicas. But these pods will all get the same data and send it to the same RocketMQ topic.
How do I make this app horizontally scalable without sending duplicate messages to the same RocketMQ topic?
There is no request. My app connects to a server, and that server sends data to the app over TCP. Every pod will connect to that server.
If you want to do this with more than one instance, they need to coordinate in some way.
The Leader Election pattern is a way to run multiple instances where only one is active (e.g. when you read from the same queue). It is a coordination pattern: only one instance is active at a time, so it uses your extra replicas only for higher availability.
If you want all your replicas to actively do work, this can be done with techniques like sharding or partitioning. This is also how e.g. Kafka (which is similar to a queue) enables concurrent consumption from queues.
There are other ways to solve this problem as well, e.g. implementing some form of locking to coordinate, but partitioning or sharding as in Kafka is probably the most "cloud native" solution.
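As an illustration of the partitioning idea, here is a rough sketch assuming the app is run as a StatefulSet (so each pod has a stable ordinal in its hostname) and that each incoming record has some key to hash on; the record shape and all names are assumptions:

```typescript
// Sketch of partitioning work across replicas deployed as a StatefulSet
// (pod hostnames look like "my-app-0", "my-app-1", ...).
import os from 'os';
import crypto from 'crypto';

const REPLICAS = Number(process.env.REPLICAS ?? 3);
// StatefulSet pod hostnames end with the ordinal.
const ORDINAL = Number(os.hostname().split('-').pop());

function partitionOf(key: string): number {
  const hash = crypto.createHash('sha1').update(key).digest();
  return hash.readUInt32BE(0) % REPLICAS;
}

// Every pod receives every record from the upstream TCP feed, but only the pod
// that owns the record's partition forwards it to RocketMQ.
export function shouldForward(record: { key: string }): boolean {
  return partitionOf(record.key) === ORDINAL;
}
```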

How to avoid parallel requests to a pod belonging to a K8s service?

I have an (internal) K8s deployment (Python, TensorFlow, Gunicorn) with around 60 replicas and an attached K8s service to distribute incoming HTTP requests. Each of these 60 pods can only actually process one HTTP request at a time (because of TensorFlow reasons). Processing a request takes between 1 and 4 seconds. If a second request is sent to a pod while it's still processing one, the second request is just queued in the Gunicorn backlog.
Now I'd like to reduce the probability of that queuing happening as much as possible, i.e., have new requests routed to one of the non-occupied pods as long as such a non-occupied one exists.
Round-robin would not do the trick, because not every request takes the same amount of time to answer (see above).
The Python application itself could make the endpoint used for the ReadinessProbe fail while it's processing a normal request, but as far as I understand, readiness probes are not meant for something that dynamic (K8s would need to poll them multiple times per second).
So how could I achieve the goal?
Can't you implement pub/sub or a message broker in between?
Save the data into a queue; workers will fetch messages from the queue as they have capacity, and the request will get processed.
You can use Redis for creating the queues, and pub/sub on top of the queue is also possible using a library. I used one before in Node.js, but it should be possible to implement the same in Python too.
Ideally a worker (or, we can say, a subscriber) will be running in each of the 60 replicas.
As soon as you get a request, one application will publish it, and the subscribers will continuously process those messages.
We also went one step further and scaled the worker count automatically depending on the message count in the queue.
This is the library I am using with Node.js: https://github.com/OptimalBits/bull
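For completeness, a rough sketch of the consumer side with Bull, with illustrative names; concurrency 1 means each replica takes only one job at a time, which matches the one-request-per-pod constraint above:

```typescript
// Sketch of the consumer side with Bull: each replica pulls one job at a time,
// so a busy pod simply doesn't take the next message.
import Queue from 'bull';

const inference = new Queue('inference-requests', process.env.REDIS_URL ?? 'redis://redis:6379');

// Concurrency 1: this process handles a single job at a time.
inference.process(1, async (job) => {
  const result = await runModel(job.data); // hypothetical call into the model
  return result;                           // stored as the job's return value
});

async function runModel(payload: unknown) {
  // ... call the actual TensorFlow model here ...
  return { ok: true };
}
```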
...kubectl get service shows "TYPE ClusterIP" and "EXTERNAL-IP <none>"
Your k8s Service will be routing requests at random in this case... obviously not good for your app. If you would like to stick with kube-proxy, you can switch to IPVS mode with the "sed" (shortest expected delay) scheduler. Here's a good article about it. Otherwise, you can consider using some sort of ingress controller like the one mentioned earlier on: ingress-nginx with its "ewma" load-balancing mode.

Request buffering in Kubernetes clusters

This is a purely theoretical question. A standard Kubernetes cluster is given, with autoscaling in place. If memory goes above a certain targetMemUtilizationPercentage, then a new pod is started and it takes on the flow of requests coming to the contained service. The number of minReplicas is set to 1 and the number of maxReplicas is set to 5.
What happens when the number of pods that are online reaches the maximum (5 in our case) and requests from clients are still coming towards the node? Are these requests buffered somewhere or are they discarded? Can I take any actions to avoid request loss?
Natively, Kubernetes does not support message queue buffering. Depending on the scenario and setup you use, your requests will most likely time out. To manage them efficiently you'll need a custom resource running inside the Kubernetes cluster.
In such situations it is very common to use a message broker, which ensures that communication between microservices is reliable and stable, that messages are managed and monitored within the system, and that messages don't get lost.
RabbitMQ, Kafka, and Redis appear to be the most popular, but choosing the right one will heavily depend on your requirements and the features you need.
Worth noting, since Kubernetes essentially runs on Linux, is that Linux itself also manages/limits incoming requests at the socket level. You may want to read more about it here.
Another thing is that if you have pod limits set, or a lack of resources, pods might be restarted or the cluster may become unstable. You can usually prevent this by configuring some kind of "circuit breaker" to limit the number of requests that can go to the backend without overloading it. If the number of requests goes beyond the circuit breaker threshold, the excess requests will be dropped.
It is better to drop some requests than to have a cascading failure.
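A minimal sketch of such load shedding at the application level (the limit and the choice of Express are illustrative assumptions):

```typescript
// Rough sketch of application-level load shedding: reject with 503 once too many
// requests are in flight instead of letting them pile up and exhaust resources.
import express from 'express';

const MAX_IN_FLIGHT = 100; // arbitrary illustrative limit
let inFlight = 0;

const app = express();

app.use((req, res, next) => {
  if (inFlight >= MAX_IN_FLIGHT) {
    // Shed load early; the client can retry after a short delay.
    res.status(503).set('Retry-After', '1').end();
    return;
  }
  inFlight++;
  // 'close' fires once the response is done (or the connection dropped).
  res.once('close', () => { inFlight--; });
  next();
});

app.listen(8080);
```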
I managed to test this scenario and I get 503 Service Unavailable and 403 Forbidden on my requests that do not get processed.
Knative Serving actually does exactly this. https://github.com/knative/serving/
It buffers requests and informs autoscaling decisions based on in-flight request counts. It can also enforce a per-Pod maximum of in-flight requests and hold onto requests until newly scaled-up Pods come up, at which point Knative proxies the requests to them; it does this via a container named queue-proxy that runs as a sidecar of its workload type called "Service".