We have a microservice architecture that communicates over Kafka on Confluent, where each service is placed in its own consumer group so that messages are balanced across its instances.
For example:
SERVICE_A_INSTANCE_1 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_2 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_3 (CONSUMER_GROUP_A)
SERVICE_B_INSTANCE_1 (CONSUMER_GROUP_B)
SERVICE_B_INSTANCE_2 (CONSUMER_GROUP_B)
When a message is emitted it should only be consumed by one instance of each consumer group.
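The expected behaviour can be modelled with a small simulation. This is only a sketch: it uses a simplified round-robin assignment rather than Kafka's real assignors, and the partition count of 6 is an assumption for illustration.

```python
def assign_round_robin(num_partitions, members):
    """Give every partition to exactly one member of a single consumer group."""
    ordered = sorted(members)
    return {p: ordered[p % len(ordered)] for p in range(num_partitions)}


def receivers(partition, num_partitions, groups):
    """Return, per consumer group, the one instance that reads `partition`."""
    return {
        group_id: assign_round_robin(num_partitions, members)[partition]
        for group_id, members in groups.items()
    }


GROUPS = {
    "CONSUMER_GROUP_A": [
        "SERVICE_A_INSTANCE_1",
        "SERVICE_A_INSTANCE_2",
        "SERVICE_A_INSTANCE_3",
    ],
    "CONSUMER_GROUP_B": ["SERVICE_B_INSTANCE_1", "SERVICE_B_INSTANCE_2"],
}

# A message written to partition 4 of a 6-partition topic (assumed numbers)
# is read by one instance of group A and one instance of group B.
print(receivers(4, 6, GROUPS))
```

In this model, every instance receiving every message corresponds to each instance effectively being the only member of its own group, for example if the `group.id` seen by the broker differs per instance.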
This worked fine until two days ago. All of a sudden, each message is being delivered to every instance, so each message is processed multiple times. In effect, the consumer groups stopped working and messages are no longer distributed across instances.
Important points:
We use Kafka PaaS on Confluent, hosted on GCP.
We tested this in a different environment and everything worked as expected.
No changes have been made to our consumers.
No changes have been made on our part to the cluster (we can't know whether Confluent changed something).
We suspect it might be a problem on Confluent's side or an update that is not compatible with our current configuration. Kafka 2.2.0 was recently released and it has some changes to consumer group behavior.
We are currently working on migrating to AWS MSK to see if the issue persists.
Any ideas on what could be causing this?
TL;DR: We solved the issue by moving away from Confluent into our own Kafka cluster on GCP.
I will answer my own question since it's been a while and we have already solved this. Also, my insights might help others make more informed decisions on where to deploy their Kafka infrastructure.
Unfortunately we could not get to the bottom of the problem with Confluent. It is most likely something on their side, because we simply migrated to our own self-managed instances on GCP and everything went back to normal.
Some important clarifications before my final thoughts and warnings about using Confluent as a managed Kafka service:
We think this is related to something that affected Node.js in particular. We tested external libraries in languages other than Node and the behavior was as expected. When testing on multiple of the most popular Node libraries the problem persisted.
We did not have premium support with Confluent.
I cannot confirm that this issue is not our fault.
With all of those points in mind, our conclusion is that for companies that decide to use a managed service with Confluent, it's best to calculate costs with premium support included. Otherwise, Kafka turns into a completely closed black box, making it impossible to diagnose issues. In my personal opinion, the dependency on the Confluent team during a problem is so large that not having them ready to help when needed renders the service non-production-ready.
I have a recent Kafka cluster which uses KRaft. I am facing some problems with it, possibly due to the use of KRaft. I wish to switch to ZooKeeper without losing data. Downtime is okay. How do I go about it?
I'm afraid there isn't a documented process to downgrade a cluster from KRaft to ZooKeeper.
If you've found an issue with KRaft, you should report it to the Kafka project via a Jira ticket so it can get fixed.
Assuming your KRaft cluster is somewhat functional, a way to preserve your data is to create a new cluster (running ZooKeeper) and use a tool like MirrorMaker to migrate your data.
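As a rough illustration of the MirrorMaker route, the sketch below renders a minimal MirrorMaker 2 properties file for a one-way flow from the existing cluster to the new one. The cluster aliases `source` and `target` and the bootstrap addresses are placeholders, not values from this setup.

```python
def render_mm2_properties(source_bootstrap, target_bootstrap, topics=".*"):
    """Build a minimal MirrorMaker 2 config for a one-way source -> target flow."""
    lines = [
        "clusters = source, target",
        f"source.bootstrap.servers = {source_bootstrap}",
        f"target.bootstrap.servers = {target_bootstrap}",
        # Enable replication in one direction only.
        "source->target.enabled = true",
        f"source->target.topics = {topics}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder addresses: the KRaft cluster is the source, the new
# ZooKeeper-based cluster is the target.
print(render_mm2_properties("kraft-broker:9092", "zk-broker:9092"))
```

The resulting file would typically be passed to `bin/connect-mirror-maker.sh`. Note that with the default replication policy the mirrored topics are prefixed with the source alias on the target cluster, so check the naming before pointing clients at the new cluster.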
We have a problem with Apache ActiveMQ Artemis cluster queues. Sometimes messages begin to pile up in particular cluster queues. It usually happens 1-4 times per day, mostly in production (in the last 90 days it has happened only once on one of the test environments).
These messages are not delivered to consumers on other cluster brokers until we restart the cluster connector (or the entire broker).
The problem looks related to ARTEMIS-3809.
Our setup is: 6 servers in one environment (3 pairs of master/backup servers). Operating system is Linux (Red Hat).
We have tried to:
upgrade from 2.22.0 to 2.23.1
increase minLargeMessageSize on the cluster connectors to 1024000
The messages are still being stuck in the cluster queues.
Another problem: I tried to configure min-large-message-size as described in the documentation (in cluster-connection), but it caused errors at startup (broker.xml did not pass XSD validation). The only option was to specify minLargeMessageSize in the URL parameters of the connector for each cluster broker, and I don't know whether this setting has any effect.
So we had to write a script that checks whether messages are stuck in the cluster queues and restarts the cluster connector.
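The detection logic of such a watchdog might look roughly like the sketch below. It is not our actual script: the "stuck" rule (a backlog in every recent sample while the acknowledged counter does not move), the `QueueSample` type, and the `restart` callback are all assumptions, and collecting the samples (for example over JMX or Jolokia) is left out.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    message_count: int   # messages currently sitting in the queue
    messages_acked: int  # cumulative count of acknowledged messages


def is_stuck(samples, min_samples=3):
    """Stuck = backlog in every recent sample and no acknowledgements in between."""
    recent = samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to decide
    has_backlog = all(s.message_count > 0 for s in recent)
    no_progress = recent[-1].messages_acked == recent[0].messages_acked
    return has_backlog and no_progress


def check_and_restart(history, restart, min_samples=3):
    """Call `restart(queue_name)` for every stuck queue; return the names restarted."""
    restarted = []
    for queue_name, samples in history.items():
        if is_stuck(samples, min_samples):
            restart(queue_name)
            restarted.append(queue_name)
    return restarted
```

Requiring several consecutive samples keeps a short burst of traffic from triggering a restart; the threshold of 3 is arbitrary and would need tuning against the sampling interval.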
How can we debug this situation?
When the messages are stuck, nothing wrong is written to the log (no errors, no stacktraces etc.).
Which logging level, and for which classes, should we set to debug or trace to find out what happens with the cluster connectors?
I believe you can remedy the situation by setting this on your cluster-connection:
<producer-window-size>-1</producer-window-size>
See ARTEMIS-3805 for more details.
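With six brokers it is easy to miss one, so a quick read-only check of each broker.xml can confirm the setting is present. The sketch below is only an illustration: the sample XML is trimmed down and hypothetical (it is not meant to validate against the XSD), and the cluster names are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{urn:activemq:core}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def producer_window_sizes(broker_xml):
    """Map each cluster-connection name to its producer-window-size (None if unset)."""
    root = ET.fromstring(broker_xml)
    result = {}
    for element in root.iter():
        if local_name(element.tag) != "cluster-connection":
            continue
        value = None
        for child in element:
            if local_name(child.tag) == "producer-window-size":
                value = (child.text or "").strip()
        result[element.get("name")] = value
    return result


# Hypothetical, heavily trimmed broker.xml used only to exercise the function.
SAMPLE = """
<configuration xmlns="urn:activemq">
  <core xmlns="urn:activemq:core">
    <cluster-connections>
      <cluster-connection name="my-cluster">
        <connector-ref>netty-connector</connector-ref>
        <producer-window-size>-1</producer-window-size>
      </cluster-connection>
      <cluster-connection name="other-cluster">
        <connector-ref>netty-connector</connector-ref>
      </cluster-connection>
    </cluster-connections>
  </core>
</configuration>
"""

print(producer_window_sizes(SAMPLE))
```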
Generally speaking, moving messages around the cluster via the cluster-connection, while convenient, isn't terribly efficient (much less so for "large" messages). Ideally you would have a sufficient number of clients on each node to consume the messages that were originally produced there. If you don't have that many clients then you may want to re-evaluate the size of your cluster, as it may actually decrease overall message throughput rather than increase it.
If you're just using 3 HA pairs in order to establish a quorum for replication then you should investigate the recently added pluggable quorum voting which allows integration with a 3rd party component (e.g. ZooKeeper) for leader election eliminating the need for a quorum of brokers.
I'm looking for a solution that would mirror or replicate one Kafka environment without needing Kafka Connect. I'm having a hard time coming up with any possible solutions or workarounds. I'm very new to Kafka and would appreciate any thoughts and/or guidance!
MirrorMaker2 is based on Kafka Connect. The original MirrorMaker is not; however, it is no longer recommended because it's not very fault tolerant.
Most Kafka replication solutions are built on Kafka Connect (Confluent Replicator as another example)
Uber uReplicator, mentioned in the comments, is built on Apache Helix and requires a ZooKeeper connection, which Kafka Connect does not, so it ultimately depends on what access and infrastructure you have available.
Since Kafka comes with the Connect API and MirrorMaker2 pre-installed, there should be little reason to find alternatives unless it absolutely doesn't work for your use case (which is...?)
We are currently using HDF (Hortonworks DataFlow) 3.3.1, which bundles Kafka 2.0.0. The problem is running multiple connectors with different configurations (Kerberos principals) on the same Kafka Connect cluster.
In this Kafka version, all connectors have to use the same consumer/producer properties, which are set in the worker configuration with the consumer.* or producer.* prefix. But as I said, we have multiple users (apps) running their own connectors, and we can't use a single Kerberos principal to allow reads on all topics.
So I wanted to check with the experts whether there is any way to overcome this security limitation. The option I can think of is to run a separate Kafka Connect cluster for each Kafka user (different principals), but what implications would it have if we run many Kafka Connect clusters on the same nodes? Will it have any impact in terms of resources (Java heap etc.), or is this the only way (the standard procedure) to handle this?
PS: In later releases (2.3+) this problem is fixed via KAFKA-8265 and these settings can be overridden, but even if we upgrade to the latest HDF we will only get Kafka 2.1, which will not solve this issue.
Thanks for your help !!
I think upgrading is your best option to get the linked feature. As I commented, you can get the latest Kafka versions on your own... Hortonworks/Cloudera doesn't offer support for Connect anyway. They'd rather you use Spark/Flink/NiFi (I think Storm is no longer around?)
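For reference, on Kafka 2.3+ the per-connector override looks roughly like the sketch below, which builds the JSON body for a connector that consumes with its own Kerberos principal. It assumes the worker is started with `connector.client.config.override.policy=All`; the connector name, topic, principal, and keytab path are placeholders, and the FileStream sink is used only as a stand-in connector.

```python
import json


def connector_with_overrides(name, topics, principal, keytab):
    """Build a sink connector definition with its own consumer JAAS config."""
    jaas = (
        "com.sun.security.auth.module.Krb5LoginModule required "
        f'useKeyTab=true storeKey=true keyTab="{keytab}" principal="{principal}";'
    )
    return {
        "name": name,
        "config": {
            # Stand-in connector class; replace with the real sink.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "topics": topics,
            "file": f"/tmp/{name}.out",
            # Sink connectors read with a consumer, so the override uses the
            # consumer.override. prefix (producer.override. for sources).
            "consumer.override.sasl.jaas.config": jaas,
        },
    }


# Placeholder values for one app; each app would submit its own definition.
body = connector_with_overrides(
    "app1-sink", "app1-topic", "app1@EXAMPLE.COM", "/etc/security/keytabs/app1.keytab"
)
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent to the Connect REST API when creating the connector, so each app can keep its own principal on a shared worker cluster.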
what implications it could have if we run many KafkaConnect Clusters on same nodes ? Will it cause any impacts in term of resources (Java heap etc.)
Heap is the main one (for batching, sink connectors). Network and CPU load could also come into account, depending on rate of messages.
As long as the advertised ports for each cluster process aren't colliding, you should be able to use the same group ids and internal topics, though
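A quick way to catch port collisions before starting several workers on one node is to compare their settings. The sketch below assumes the port in question is the REST port set via `rest.port` (8083 when unset); the worker names are hypothetical.

```python
from collections import Counter


def colliding_rest_ports(workers):
    """Return the REST ports claimed by more than one worker on the same node."""
    counts = Counter(props.get("rest.port", "8083") for props in workers.values())
    return sorted(port for port, n in counts.items() if n > 1)


# Hypothetical worker property sets for three Connect processes on one host.
WORKERS = {
    "app1-worker": {"rest.port": "8083"},
    "app2-worker": {},  # no rest.port set, so it falls back to 8083
    "app3-worker": {"rest.port": "8084"},
}

print(colliding_rest_ports(WORKERS))
```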
I have read up on ActiveMQ (not fully) and tried to figure out the deployment model for my application.
I am a bit confused about it.
I want to make the system highly available and decided to use the following model. Please correct me if anything is wrong or if the model has disadvantages.
Deployment model:
Deploy brokers on M1 and M2 respectively.
Use a hardware load balancer (either F5 or Zeus) to connect to one of the brokers (M1 or M2) based on load.
Publish messages using the load balancer URL.
I have read about networks of brokers, where we need to maintain a topology. I feel this makes the system more complicated as it grows horizontally, so it seems better to have one load balancer distribute the load.
Questions
1- Will the above model send each message to only one of the brokers?
2- Consumers will be deployed in Tomcat (I think I need to use embedded brokers configured against either M1 or M2). Is it possible to use the load balancer URL instead of M1 or M2?
3- Is it possible to have a single Web Console admin to monitor both M1 and M2?
4- Is there any performance issue with using Spring's features to consume messages?
Sorry to shoot out so many questions. Please help me to correct the deployment model.
I think the best way to get load balancing across several ActiveMQ servers is a network of brokers, with your consumers/producers (in your webapps) using failover.
So if a producer p1 sends a message to a queue on broker 1, the consumer c1 can read the message on broker 2.
[Edit] I have never tried using a hardware load balancer instead of the ActiveMQ failover protocol. It should work: just try it and tell us.
3- I do not think it is possible to have only one Web Console to monitor both of your brokers.
4- As far as I am concerned I do not have any performance issue with my Spring configuration.
There are a lot of questions there.
The first thing to do is start simple. If your application's load is being handled with just one broker, consider setting up high availability through a master-slave setup. For this you do not need a load balancer: the ActiveMQ client library has a failover mechanism where you can define the URLs of a set of brokers that the client should attempt to connect to.
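The client-side failover mentioned here is configured through the broker URL. As a small illustration, the helper below assembles such a URL for the two machines from the question; the helper itself and port 61616 (the usual OpenWire default) are assumptions.

```python
def failover_uri(brokers, randomize=False):
    """Build an ActiveMQ failover transport URI from (host, port) pairs."""
    broker_list = ",".join(f"tcp://{host}:{port}" for host, port in brokers)
    flag = "true" if randomize else "false"
    return f"failover:({broker_list})?randomize={flag}"


# With randomize=false the client tries the brokers in the order listed,
# which suits a master-slave pair.
print(failover_uri([("M1", 61616), ("M2", 61616)]))
# prints failover:(tcp://M1:61616,tcp://M2:61616)?randomize=false
```

The same string is used as the broker URL for both producers and consumers, so no load balancer address is needed.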
If you are looking at setting up an infrastructure where one broker will not be able to deal with the message load (you can test the maximum throughput of your broker using the performance module), I would advise you to read up on how networks of brokers work. If you do go down this path, you really need to understand ActiveMQ.
On monitoring, a web console can only show you the internals of a single broker. To get insight around what is going on around a set of brokers you will need a monitoring tool such as FuseHQ/Hyperic that is able to aggregate JMX information from a number of boxes.
Performance with Spring is not a problem as long as you configure it correctly (see the section on PooledConnectionFactory).
I see that you are a new user, so if this answers your question, please tick it.