Has anyone experienced issues with Kafka in blue-green deployments? - apache-kafka

Our services use Kafka to publish and consume messages, and we deploy them using a blue-green deployment strategy.
Consider the following scenario:
- Suppose the App1.0 service is currently in Blue (which is live and taking traffic), consuming from a topic.
- When we start deploying a new version, App1.1, it is deployed to Green first, so now:
Green has: App1.1 (not consuming)
Blue has: App1.0 (consuming messages)
- Once we switch Green to Blue, Blue has App1.1 and Green has App1.0.
Our issue is that after the switch, the Green pods (which now run the older App1.0 code) are still consuming messages from that Kafka topic. Ideally the Green pods should stop consuming from the topic at that point.
We are looking for a solution so that, when we deploy, our Green pods stop consuming.
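One possible direction (not from the original thread) is to make consumption conditional on a "live colour" signal, so that whichever side is not live leaves the consumer group and the live side is assigned all partitions. A minimal Python sketch using kafka-python; the is_live() check, topic name, group id, and broker address are all assumptions:

```python
import os
import time

from kafka import KafkaConsumer  # pip install kafka-python


def is_live() -> bool:
    # Hypothetical flag flipped by whatever performs the blue/green switch
    # (an env var here, but it could be a ConfigMap or a feature-flag service).
    return os.environ.get("DEPLOYMENT_COLOUR_LIVE", "false") == "true"


consumer = None
while True:
    if is_live():
        if consumer is None:
            # Join the consumer group only while this colour is live.
            consumer = KafkaConsumer(
                "app1-topic",                      # example topic name
                bootstrap_servers="kafka:9092",    # example broker address
                group_id="app1-consumers",
            )
        for records in consumer.poll(timeout_ms=500).values():
            for record in records:
                pass  # process the record here
    elif consumer is not None:
        # Leave the group so the other colour's pods are assigned all partitions.
        consumer.close()
        consumer = None
    else:
        time.sleep(1)
```

Closing the consumer (rather than merely pausing it) matters here: a paused consumer still holds its partition assignments, so those partitions would not be processed by the live side.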

Related

Is a RabbitMQ queueing system unnecessary in a Kubernetes cluster?

I have just been certified CKAD (Certified Kubernetes Application Developer) by The Linux Foundation.
And now I am wondering: is a RabbitMQ queueing system unnecessary in a Kubernetes cluster?
We use workers with a queueing system in order to avoid the 30-second HTTP timeout. Say, for example, we have a microservice that generates big PDF documents in about 50 seconds each, and we have 20 documents to generate right now; the classical approach is a worker that queues the documents one by one (this is the case at the company I have been working for lately).
But in a Kubernetes cluster there is, by default, no timeout for HTTP requests going inside the cluster. You can wait 1000 seconds without any issue (20 documents * 50 seconds = 1000 seconds).
Given that last point, is it fair to say that a RabbitMQ queueing system (via the amqplib module) is unnecessary in a Kubernetes cluster? Moreover, Kubernetes handles load balancing across your microservice replicas so well...
But in a Kubernetes cluster there is, by default, no timeout for HTTP requests going inside the cluster.
Not sure where you got that idea. Depending on your configuration there might be no timeouts at the proxy level, but there are still client and server timeouts to consider. Kubernetes doesn't change what you deploy, just how you deploy it. There are certainly options other than RabbitMQ specifically, and other system architectures you could consider, but "queue workers" is still a very common pattern and likely always will be, even as the tech around it changes.
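To illustrate the point about client-side timeouts (the in-cluster service URL and payload below are invented for the example): even if the cluster's proxy never cuts the connection, the HTTP client enforces its own limit, which is exactly the failure mode a queue/worker setup avoids.

```python
import requests

try:
    resp = requests.post(
        "http://pdf-service.default.svc.cluster.local/generate",  # hypothetical in-cluster service
        json={"documents": 20},
        timeout=60,  # client-side timeout: a ~1000 s synchronous render would fail here
    )
    resp.raise_for_status()
except requests.Timeout:
    # With a queue, the request would instead be acknowledged immediately
    # and the PDFs rendered asynchronously by a worker.
    print("request timed out on the client side")
```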

RabbitMQ Shovel plugin - creating duplicate data in case of node failure

I am creating a shovel (via the shovel plugin) in RabbitMQ, and it works fine with one pod. However, we are running on a Kubernetes cluster with multiple pods, and in case of a pod restart it creates multiple instances of the shovel, one on each pod independently, which causes duplicate message replication on the destination.
The detailed steps are below:
We deploy RabbitMQ on the Kubernetes cluster using a Helm chart.
After that we create the shovel using the RabbitMQ Management UI. Once it is created from the UI, the shovel works fine and does not replicate data multiple times to the destination.
When any pod gets restarted, it creates a separate shovel instance, which starts causing duplicate message replication on the destination from the different shovel instances.
When we checked the shovel status in the RabbitMQ UI, we found multiple instances of the same shovel running, one per pod.
When we restart the shovel manually from the RabbitMQ UI, the issue is resolved and only one instance is visible in the UI.
So we concluded that, in case of a pod failure/restart, the shovel is not able to sync with the other nodes/pods if a shovel is already running there. We can work around this by restarting the shovel from the UI, but that is not a valid approach for production.
We do not see this issue with queues and exchanges.
Can anyone help us resolve this issue?
As we have lately seen similar problems, this seems to be an issue introduced in some 3.8.x version - https://github.com/rabbitmq/rabbitmq-server/discussions/3154
As far as I understand, it should be fixed from version 3.8.20 on. See
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.19
and
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.20
and
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.2
We didn't have time yet to check whether this is really fixed in those versions.
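One quick way to check whether the duplicate-shovel symptom is gone after upgrading is to list the running shovels through the management HTTP API (served by the rabbitmq_shovel_management plugin). The host, port, and credentials below are placeholders:

```python
from collections import Counter

import requests

resp = requests.get(
    "http://rabbitmq.example.internal:15672/api/shovels",  # shovel status endpoint
    auth=("guest", "guest"),                                # placeholder credentials
)
resp.raise_for_status()

# Each entry describes one running shovel instance; the same name appearing
# more than once means duplicate instances across nodes/pods.
counts = Counter(shovel["name"] for shovel in resp.json())
for name, count in counts.items():
    if count > 1:
        print(f"shovel {name!r} has {count} running instances")
```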

Deployment handling with Rabbit

My application uses RabbitMQ with multiple consumers [3], and messages are consumed in round-robin fashion.
At the moment I do a canary deployment when releasing a new version, but this puts me in a situation where a message can be consumed by any consumer (not just the canary one), which creates problems.
[Also, I want this behaviour only during deployment; at all other times round robin is needed.]
I already know about the blue-green deployment process, but is there any other way to go about this problem?

Canary release when Queues are involved

Fowler says a small percentage of traffic is routed to the Canary version while the old version is still running.
This assumes that the routing can be controlled at the load balancer/router level.
We have a use case where a microservice consumes off a queue and does some processing. We were wondering how the routing can be controlled to direct a subset of traffic to the canary consumer.
One of the options we considered is to have a separate "canary queue", but the problem is that the producers would then have to be aware of this queue, which sounds like a smell.
This seems like a common problem where queues are involved. Any ideas on how canary releases have been adopted for such applications?
As you wrote, the goal of a canary release is to drive a small fraction of live traffic through a new deployment to minimize the potential impact of flaws in that deployment. When you do not control the routing to the service under deployment, you can adjust the percentage of traffic handled by the new deployment by adjusting the ratio of new-version instances to current-version instances.
For example, suppose your queue is being processed by a pool of 100 service instances at v1. To canary test the next version, deploy 1 to 10 instances of v2 and turn off 1 to 10 instances of v1. This approximates routing 1 to 10% of the traffic to the new service.
If the expected throughput of the new version is significantly different, adjust the ratio of new instances to old accordingly.
If your current deployment is very small, consider temporarily increasing the total number of current-version instances before deploying an instance of the new version. For example, assume your active deployment is 3 instances. Deploying 6 more of your current version before deploying 1 instance of the new version lets you keep the traffic to the canary close to 10% (see the sketch below).
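A back-of-the-envelope helper for that sizing argument: with a shared queue and similar per-message throughput, the fraction of messages the canary handles is roughly its share of the consumer instances.

```python
def canary_fraction(current_instances: int, canary_instances: int) -> float:
    """Approximate share of queue traffic handled by the canary instances."""
    return canary_instances / (current_instances + canary_instances)


print(canary_fraction(99, 1))  # ~0.01 -> about 1% of traffic to the canary
print(canary_fraction(9, 1))   # 0.10  -> about 10%
print(canary_fraction(3, 1))   # 0.25  -> too coarse with only 3 current instances
print(canary_fraction(9, 1))   # after adding 6 more current instances: back to ~10%
```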
There are two approaches to canary deployment of queue workers:
A dedicated canary queue
A common queue
Both approaches have pros and cons, which are covered in detail here: http://www.varlog.co.in/blog/canary-deployment-workers/

How to use kafka and storm on cloudfoundry?

I want to know whether it is possible to run Kafka as a cloud-native application, and whether I can create a Kafka cluster as a service on Pivotal Web Services. I don't want only client integration; I want to run the Kafka cluster/service itself.
Thanks,
Anil
I can point you at a few starting points; there would be some work involved to go from those starting points to something fully functional.
One option is to deploy the Kafka cluster on Cloud Foundry (e.g. Pivotal Web Services) using Docker images. Spotify has Dockerized Kafka and kafka-proxy (including ZooKeeper). One thing to keep in mind is that PWS currently doesn't support apps with persistence (although this work is starting), so if you were to go this route right now, you would lose the data in Kafka whenever the application is rolled. Looking at that Spotify repo, the Docker images are generally run without any mounted volumes, so this persistence-less Kafka may be a valid use case (I don't know enough about Kafka to say).
The other option is to deploy Kafka directly on some IaaS (e.g. AWS) using BOSH. BOSH can be hard if you're seeing it for the first time, but it is the ideal way to deploy any distributed software that you want running on VMs. You will also be able to have persistent volumes attached to your Kafka VMs if necessary. Here is a Kafka BOSH release which may work.
Once you have your cluster running, there are two ways to integrate your Cloud Foundry applications with it. The simplest is to provide it to your applications as a "user-provided service", which lets you pass Kafka cluster access info to your apps. The alternative would be to put a service broker in front of your cluster, which is especially useful if many different people will be pushing apps that need to talk to the Kafka cluster. Rather than you having to tell each of them the access info manually, they can do something simple like cf bind-service SOME_APP YOUR_KAFKA_SERVICE. Here is a Kafka service broker along with more info about service brokers in general.
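For concreteness, the user-provided-service route might look roughly like the following; the service name, broker addresses, and credential keys are all invented for the example:

```python
# Operator side (CLI, not Python): expose the cluster's connection details
# and bind them to an app, e.g.
#   cf create-user-provided-service my-kafka -p '{"brokers": "10.0.0.5:9092,10.0.0.6:9092"}'
#   cf bind-service SOME_APP my-kafka
#
# Application side: a bound app reads the credentials from VCAP_SERVICES at startup.
import json
import os

vcap = json.loads(os.environ["VCAP_SERVICES"])
credentials = vcap["user-provided"][0]["credentials"]  # first user-provided binding
bootstrap_servers = credentials["brokers"].split(",")
print("Kafka bootstrap servers:", bootstrap_servers)
```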
According to the 12-factor app description (https://12factor.net/processes), Kafka should not run as an application on top of Cloud Foundry:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Kafka is often described as a "distributed commit log" and as such carries a large amount of state. Many companies use it to keep all events flowing through their distributed system of microservices for a long (sometimes unlimited) amount of time.
Therefore I would strongly recommend going with the second option in the accepted answer: Kafka topics should be bound to your applications in the form of stateful services.