What is a common strategy for synchronized communication between replicas of the same pods? - kubernetes

Let's say we have the following apps:
API app: responsible for serving user requests.
Backend app: responsible for handling user requests that are long-running tasks. It writes progress to a database (Postgres) and a distributed cache (Redis).
Both apps are scalable services. A single backend app handles multiple tenants (customers here), but each customer is assigned to exactly one backend replica.
I have a use case where the API layer needs to connect to the specific replica that is handling a given customer. Is there a common pattern for this?
A few strategies I have in mind:
Pub/Sub (probably using Redis): the problem is that we want a synchronous, guaranteed response.
gRPC: using the pod IP to connect to a specific pod is not a standard approach.
Creating a Service at runtime by adding labels to the replicas and using those -- looks promising (a rough sketch follows after the note below).
Please let me know if there is a common pattern, example architecture, or standard way of doing this.
Note: [the above is a simulation of a production use case; names and the actual use case have been changed]
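For reference, a minimal sketch of what option 3 could look like with the official Kubernetes Python client: label the backend pod that owns a customer and create a per-customer Service that selects it. The pod, namespace, and customer names here are made up for illustration.

```python
# pip install kubernetes
from kubernetes import client, config

config.load_incluster_config()  # use config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "default"              # assumed namespace
CUSTOMER = "customer-1234"         # hypothetical tenant id
POD_NAME = "backend-7d9f-abcde"    # hypothetical backend replica that owns this customer

# 1. Label the backend pod that owns the customer.
v1.patch_namespaced_pod(
    name=POD_NAME,
    namespace=NAMESPACE,
    body={"metadata": {"labels": {"customer": CUSTOMER}}},
)

# 2. Create a Service whose selector matches only that label, so the API
#    layer can reach the right replica via a stable DNS name
#    (backend-customer-1234.default.svc).
service = client.V1Service(
    metadata=client.V1ObjectMeta(name=f"backend-{CUSTOMER}"),
    spec=client.V1ServiceSpec(
        selector={"customer": CUSTOMER},
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
v1.create_namespaced_service(namespace=NAMESPACE, body=service)
```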

You should aim to keep your services stateless; in a Kubernetes environment there is no telling when one pod might be replaced by another due to worker node maintenance.
If you have long-running tasks that cannot be completed within the configured grace period for pods to shut down during a worker node drain/evacuation, you need to implement some kind of persistent work queue, as you are considering in option 1. I suggest you look into the saga pattern.
Another pattern we usually employ is to let the worker service write the current state of the job into the database and have the client poll the status every few seconds. This does, however, require some way of handling half-finished jobs that might be abandoned by pods that are forced to shut down. A rough sketch of this polling pattern follows.
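A minimal sketch of that status-polling pattern, assuming Redis is the shared status store and a hypothetical job:<id> key scheme (the step loop merely stands in for the real long-running work):

```python
import time

# pip install redis
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# --- Worker side: persist progress while the long-running task advances ---
def run_job(job_id: str) -> None:
    for step in range(1, 11):          # stand-in for the real work
        time.sleep(1)
        r.hset(f"job:{job_id}", mapping={
            "status": "running",
            "progress": step * 10,      # percent complete
            "updated_at": time.time(),  # lets you spot abandoned, half-finished jobs
        })
    r.hset(f"job:{job_id}", mapping={"status": "done", "progress": 100})

# --- API side: the client polls this every few seconds ---
def get_status(job_id: str) -> dict:
    return r.hgetall(f"job:{job_id}") or {"status": "unknown"}
```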

Related

How to spin up/down workers programmatically at run-time on Kubernetes based on new Redis queues and their load?

Suppose I want to implement this architecture deployed on a Kubernetes cluster:
Gateway - simple RESTful HTTP microservice accepting scraping tasks (URLs to scrape along with postback URLs).
Request Queues - Redis (or other message broker) queues created dynamically per unique domain (when a new domain is encountered, the gateway should programmatically create a new queue; if a queue for the domain already exists, it just places the message in it).
Response Queue - Redis (or other message broker) queue used to post Worker results as scraped HTML pages along with postback URLs.
Workers - worker processes which should spin up at runtime when a new queue is created and scale down to zero when the queue is emptied.
Response Workers - worker processes consuming the response queue and sending postback results to the scraping client (should also be able to scale down to zero).
I would like to deploy the whole solution as dockerized containers on a Kubernetes cluster.
So my main concerns/questions would be:
Creating Redis or other message broker queues dynamically at run-time via code - is it viable? Which broker is best for that purpose? I would prefer Redis if possible since I heard it's the easiest to set up and also supports massive throughput; ideally my scraping tasks should be short-lived, so I think Redis would be okay (see the sketch after this list).
Creating Worker consumers at runtime via code - I need some kind of Kubernetes-compatible technology which would be able to react to a newly created queue and spin up a Worker consumer container that listens to that queue and can later scale up/down based on the load of that queue. Any suggestions for such technology? I've read a bit about Knative and its Eventing mechanism - would it be suited for this use case? I don't know if I should continue investing my time in reading its documentation.
Best tools for Redis queue management/Worker management: I would prefer C# and Node.js tooling. Would something like Bull for Node.js be sufficient? Ideally I would want to produce queues and messages in the Gateway using C# and consume them in Node.js (Workers).
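To illustrate the first point, here is a minimal sketch of per-domain queues in Redis (shown in Python rather than C#/Node.js just to keep it short; the queue naming scheme is an assumption). In Redis a list comes into existence on the first push, so "creating a queue at runtime" needs no separate step:

```python
from urllib.parse import urlparse

# pip install redis
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def enqueue_scrape_task(url: str, postback_url: str) -> str:
    """Route a task to its per-domain queue; the queue is created implicitly."""
    domain = urlparse(url).netloc
    queue = f"scrape:{domain}"
    # RPUSH covers both the "new queue" and "existing queue" cases,
    # since a Redis list exists as soon as it has one element.
    r.rpush(queue, f"{url}|{postback_url}")
    # Track known queues so workers / an autoscaler can discover them.
    r.sadd("scrape:queues", queue)
    return queue

def worker(queue: str) -> None:
    """Blocking consumer loop for a single domain queue."""
    while True:
        _, payload = r.blpop(queue)
        url, postback_url = payload.split("|", 1)
        html = "<html>...</html>"  # ... scrape url here ...
        r.rpush("scrape:responses", f"{postback_url}|{html}")
```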
If you mean vertical scaling, it definitely won't be a viable solution, since it requires pod restarts. Horizontal scaling is somewhat viable by comparison, but you need to consider that even spinning up nodes or pods takes time, and it is always suggested to have proper resources in place for serving your upcoming traffic; otherwise this delay will affect some features of your application and there might be a business impact. Just having autoscalers isn't enough; you should also have proper metrics in place for monitoring your application.
This documentation details how to scale your Redis and worker pods using the KEDA mechanism. KEDA stands for Kubernetes Event-driven Autoscaling; it is a component that sits on top of existing Kubernetes primitives (such as the Horizontal Pod Autoscaler) to scale any number of Kubernetes containers based on the number of events that need to be processed.
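If you want to see the underlying mechanics without adopting KEDA yet, here is a rough do-it-yourself sketch of the same idea (queue backlog drives replica count) using the Python Redis and Kubernetes clients; the queue and deployment names are made up:

```python
import math
import time

# pip install redis kubernetes
import redis
from kubernetes import client, config

config.load_incluster_config()
apps = client.AppsV1Api()
r = redis.Redis(host="redis", port=6379)

QUEUE = "scrape:example.com"     # hypothetical per-domain queue
DEPLOYMENT = "worker-example"    # hypothetical worker Deployment
TASKS_PER_WORKER = 10            # target backlog per replica
MAX_REPLICAS = 20

while True:
    backlog = r.llen(QUEUE)
    desired = min(MAX_REPLICAS, math.ceil(backlog / TASKS_PER_WORKER))
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace="default",
        body={"spec": {"replicas": desired}},  # scales to 0 when the queue is empty
    )
    time.sleep(30)
```

KEDA essentially packages this loop (plus many other event sources) as a controller, which is why it is usually the better choice than rolling your own.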

How to avoid parallel requests to a pod belonging to a K8s service?

I have an (internal) K8s deployment (Python, TensorFlow, Gunicorn) with around 60 replicas and an attached K8s service to distribute incoming HTTP requests. Each of these 60 pods can only actually process one HTTP request at a time (because of TensorFlow reasons). Processing a request takes between 1 and 4 seconds. If a second request is sent to a pod while it's still processing one, the second request is just queued in the Gunicorn backlog.
Now I'd like to reduce the probability of that queuing happening as much as possible, i.e., have new requests routed to one of the non-occupied pods as long as such a non-occupied one exists.
Round-robin would not do the trick, because not every request takes the same amount of time to answer (see above).
The Python application itself could make the endpoint used for the ReadinessProbe fail while it's processing a normal request, but as far as I understand, readiness probes are not meant for something that dynamic (K8s would need to poll them multiple times per second).
So how could I achieve the goal?
Can't you put a pub/sub system or message broker in between?
Save the data into a queue; based on its capacity, a worker will fetch the message or data from the queue, and the request will get processed.
You can use Redis for creating queues, and pub/sub within a queue is also possible using a library. I used one before in Node.js, but it should be possible to implement the same in Python as well.
Ideally, a worker (or, we could say, subscriber) will be running in each of the 60 replicas.
As soon as you get a request, one application publishes it, and the subscribers continuously work on processing those messages.
We also went one step further, scaling the worker count automatically depending on the message count in the queue.
This is the library I am using with Node.js: https://github.com/OptimalBits/bull. A rough sketch of the same idea in Python follows.
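A minimal Python sketch of the queue-based idea, assuming a single Redis list as the request queue; an idle pod pulls the next job only when it is free, so a request never sits behind a busy pod (the model call is a stand-in):

```python
import time

# pip install redis
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# --- API / gateway side: enqueue instead of hitting a pod directly ---
def submit(request_id: str, payload: str) -> None:
    r.rpush("inference:requests", f"{request_id}|{payload}")

# --- Worker side, one loop per replica: a pod only takes a new job when it
# --- is idle, so requests never queue behind a busy pod.
def worker_loop() -> None:
    while True:
        _, item = r.blpop("inference:requests")   # blocks until work is available
        request_id, payload = item.split("|", 1)
        result = run_model(payload)
        r.set(f"inference:result:{request_id}", result, ex=300)

def run_model(payload: str) -> str:
    time.sleep(2)  # placeholder for the 1-4 s TensorFlow inference
    return f"result-for-{payload}"
```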
...kubectl get service shows "TYPE ClusterIP" and "EXTERNAL-IP <none>".
Your k8s Service will be routing requests at random in this case... obviously not good for your app. If you would like to stick with kube-proxy, you can switch to IPVS mode with the "sed" (shortest expected delay) scheduler. Here's a good article about it. Otherwise, you can consider using some sort of ingress controller like the one mentioned earlier on: ingress-nginx with its "ewma" load-balancing mode.

How to notify a pod in a StatefulSet about other pods in Kubernetes

I was reading the tutorial on deploying a Cassandra ring and ZooKeeper with StatefulSets. What I don't understand is: if I decide to add another replica to the StatefulSet, how do I notify the other pods that there is another one? What are the best practices for this? I want one pod to be able to redirect a request to another pod in my custom application in case the request doesn't belong to it (i.e. it doesn't have the data).
Well, it seems like you want to run a clustered application inside Kubernetes. This is not something that Kubernetes is directly responsible for. The cluster coordination for a given solution should be handled within it, and a response to a "how to" question cannot be generic.
Most of the software out there will have some kind of coordination, discovery, and registration mechanism, be it preconfigured members, an external discovery catalog/DB, or some network broadcasting.
A StatefulSet helps a lot here by retaining a stable network identity under the service/pod and by keeping storage, so you can, for example, always point your new replicas to register with the first replica (or preferably one of the first two, because what if your no. 1 is the one that restarted); but as I wrote above, this pretty much depends on the capabilities of the solution you want to deploy. A small sketch of peer discovery via those stable names follows.
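A minimal sketch of how a pod could derive its peers from the stable StatefulSet naming, assuming a hypothetical StatefulSet "myapp" with a headless service "myapp-headless" and the replica count exposed via an environment variable:

```python
import os
import socket

# Inside a StatefulSet each pod gets a stable hostname such as "myapp-0",
# "myapp-1", ... resolvable through the headless service. All names below
# are assumptions for illustration.
STATEFULSET = "myapp"
HEADLESS_SERVICE = "myapp-headless"
NAMESPACE = os.environ.get("POD_NAMESPACE", "default")
REPLICAS = int(os.environ.get("REPLICAS", "3"))  # e.g. injected via env / Downward API

def peer_addresses() -> list:
    """Return the stable DNS names of all other members of the StatefulSet."""
    me = socket.gethostname()  # e.g. "myapp-2"
    return [
        f"{STATEFULSET}-{i}.{HEADLESS_SERVICE}.{NAMESPACE}.svc.cluster.local"
        for i in range(REPLICAS)
        if f"{STATEFULSET}-{i}" != me
    ]

# A new replica could then register itself with peer_addresses()[0]
# (typically myapp-0), i.e. the "point new replicas at the first replica"
# approach described above.
```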

Behaviour when reducing instances of a Bluemix application

I have an orchestrator service which keeps track of the instances that are running and what request they are currently dealing with. If a new instance is required, I make a REST call to increase the instances and wait for the new instance to connect to the orchestrator. It's one request per instance.
The orchestrator tracks whether an instance is doing anything and knows which instances can be stopped; however, there is nothing in the API that allows me to reduce the number of instances by stopping a particular instance, which is what I am trying to achieve.
Is there anything I can do to manipulate the platform into deterministically stopping the instances that I want to stop? Perhaps by having long-running HTTP requests to the instances I require and killing the request when it's no longer required, then making the API call to reduce the number of instances?
Part of the issue here is that I don't know the specifics of the current behavior...
Assuming you're talking about CloudFoundry/Instant Runtime applications, all of the instances of an application run behind a load balancer which uses round-robin to distribute requests across the instances (unless you have a session-affinity cookie set up). Differentiating between individual instances for incoming requests or manual scaling is not recommended and is an anti-pattern. You cannot control which instance the scale-down task will choose.
If you really want that level of control over each instance, maybe you should deploy them as separate applications: MyApp1, MyApp2, MyApp3, etc. All of your applications can have the same route (myapp.mybluemix.net). Each of the applications can then distinguish itself by its name (via VCAP_APPLICATION), allowing you to terminate it individually.
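A tiny sketch of how an instance might identify itself from the environment Cloud Foundry provides (VCAP_APPLICATION and CF_INSTANCE_INDEX are standard; the "MyApp2" naming follows the suggestion above):

```python
import json
import os

# Cloud Foundry injects VCAP_APPLICATION (a JSON document) into every app instance.
vcap = json.loads(os.environ.get("VCAP_APPLICATION", "{}"))

app_name = vcap.get("application_name", "unknown")     # e.g. "MyApp2"
instance_index = os.environ.get("CF_INSTANCE_INDEX", "0")

# With one app per "instance" (MyApp1, MyApp2, ...) sharing a route, the
# orchestrator can address, drain, and stop a specific app by its name.
print(f"I am {app_name} (instance index {instance_index})")
```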

Rolling Over Streaming Connections During Upgrades

I am working on an application that uses Amazon Kinesis, and one of the things I was wondering about is how you can roll over an application during an upgrade without data loss on streams. I have heard about things like blue/green deployments and such, but I was wondering what the best practice is for upgrading a data streaming service so you don't lose data from your streams.
For example, my application has an HTTP endpoint that ingests data as a series of POST operations. If I want to replace the service with a newer version, how do I manage existing application streaming to my endpoint?
One common method is having a software load balancer (LB) with a virtual IP; behind this LB there would be at least two HTTP ingestion endpoints during normal operation. During an upgrade, each endpoint is announced out (taken out of rotation) and upgraded in turn. The LB ensures that no traffic is forwarded to an announced-out endpoint.
(The endpoints themselves can be on separate VMs, Docker containers or physical nodes).
Of course, the stream needs to be finite; the TCP socket/HTTP stream is owned by one of the endpoints. However, as long as the stream can be stopped gracefully, the following flow works, assuming endpoint A owns the current ingestion:
Tell endpoint A not to accept new streams. All new streams will be redirected only to endpoint B by the LB.
Gracefully stop existing streams on endpoint A.
Upgrade A.
Announce A back in.
Rinse and repeat with endpoint B.
As a side point, you would need two endpoints with a load balanced (or master/slave) set-up if you require any reasonable uptime and reliability guarantees.
There are more bespoke methods which allow hot code swapping on the same endpoint, but they rely on specific internal design (e.g. separate processes for the networking and processing stacks connected by IPC).
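A minimal sketch of the "announce out, drain, upgrade" steps on a single endpoint, assuming the LB health-checks a /healthz path; Flask and the route names are chosen here purely for illustration:

```python
import threading

# pip install flask
from flask import Flask, request

app = Flask(__name__)
draining = threading.Event()

@app.route("/healthz")
def healthz():
    # Once draining, fail the LB health check so no new streams are routed here.
    return ("draining", 503) if draining.is_set() else ("ok", 200)

@app.route("/ingest", methods=["POST"])
def ingest():
    if draining.is_set():
        return "endpoint draining, retry via the LB", 503
    # ... forward request.data to the Kinesis stream here ...
    return "accepted", 202

@app.route("/drain", methods=["POST"])
def drain():
    # Step 1 of the flow above: announce this endpoint out before upgrading it.
    draining.set()
    return "draining started", 200
```

Once in-flight streams have finished, the process can be upgraded and announced back in (for example, by restarting it without the drain flag set), and the same procedure is repeated on the other endpoint.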