How do I send data to multiple Kafka clusters at one time?

I am looking for help with a Kafka producer that sends to multiple clusters in parallel. I have two environments to push data to (cert and dev), and every time I run the producer I send data to cert and dev separately (one topic). Is there a way I can send data to both clusters together?

Tying your application (producers) to a particular environment topology (cert / dev) doesn't sound like the best approach. There is no way to produce from the same producer instance to two clusters, so you would have to have two producer instances and hope that both behave exactly the same when producing. Any problem (e.g. a network glitch) that causes one to fail and not the other means you end up with divergence between your two environments.
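For illustration, here is a minimal sketch (in Java) of that dual-producer workaround, with placeholder broker addresses and topic name. The two send() calls are completely independent, which is exactly where the two environments can start to diverge:

    // Hypothetical example: one producer instance per cluster, sending the same record twice.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DualClusterProducer {

        private static KafkaProducer<String, String> buildProducer(String bootstrapServers) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            return new KafkaProducer<>(props);
        }

        public static void main(String[] args) {
            KafkaProducer<String, String> certProducer = buildProducer("cert-broker:9092"); // placeholder
            KafkaProducer<String, String> devProducer  = buildProducer("dev-broker:9092");  // placeholder

            ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "key", "value");

            // The two sends are independent: if one fails and the other succeeds,
            // cert and dev silently diverge.
            certProducer.send(record);
            devProducer.send(record);

            certProducer.close();
            devProducer.close();
        }
    }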
Instead, use something like Confluent Replicator or MirrorMaker 2 to stream records from one cluster to another. That way you can build your application to produce records to a single target cluster and, decoupled from that, populate additional environments/clusters as desired.

Related

Kafka replicator setup architecture

I have two Kafka clusters, one in London and another in NYC. Each has three ZooKeeper instances and two brokers. There are two topics used in each region: an InputData topic and an OutputData topic. I want each region to replicate the data from the other one, i.e. for them to effectively be using a global InputData and OutputData topic. If NYC adds two messages, they should be replicated to EMEA. If EMEA adds three messages, those should go to NYC.
My question is how do I achieve this? Does two-way replication work, or do you get into an endless loop, or run into concurrency issues? I.e. what happens if NYC writes messages locally at the same time EMEA writes its messages, and the replicator then tries to keep the topics in sync even though they have diverged?
Is this even possible? Or can replication only work one way, i.e. you have to have a source topic that is only written to from the main cluster, and the places it is replicated to are read-only?
My second question is how do I make the replicator fault tolerant? Do I run it in distributed mode with one Connect worker per server, which in this case would make it two Connect workers per cluster?

Apache Flink Kafka Integration Partition Separation

I need to implement the data flow below. I have one Kafka topic with 9 partitions, so I can read this topic with a parallelism of 9. I also have a 3-node Flink cluster; each node of this cluster has 24 task slots.
First of all, I want to spread my Kafka readers so that each server handles 3 partitions, as below. The order does not matter; I only transform the Kafka messages and send them to a DB.
Secondly, I want to increase my degree of parallelism when writing to the NoSQL DB. If I increase my parallelism to 48 (since writing to the DB is an IO operation, it does not consume much CPU), I want to be sure that when Flink rebalances my messages, they stay on the same server.
Is there any advice for me?
If you want to spread your Kafka readers across all 3 nodes, I would recommend starting the TaskManagers with 3 slots each and setting the parallelism of the Kafka source to 9.
The problem is that at the moment it is not possible to control how tasks are placed if there are more slots available than the required parallelism. This means if you have fewer sources than slots, then it might happen that all sources will be deployed to one machine, leaving the other machines empty (source-wise).
Being able to spread out tasks across all available machines is a feature which the community is currently working on.
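As a rough illustration, here is a minimal sketch of such a job, assuming each TaskManager is started with taskmanager.numberOfTaskSlots: 3 in flink-conf.yaml; the broker address and topic name are placeholders, and the exact connector class (FlinkKafkaConsumer here) depends on your Flink version:

    import java.util.Properties;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class KafkaToDbJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
            props.setProperty("group.id", "flink-readers");

            env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
               .setParallelism(9)                    // one reader per Kafka partition
               .map(new MapFunction<String, String>() {
                   @Override
                   public String map(String value) {
                       return value;                 // stand-in for the real transformation
                   }
               })
               .setParallelism(9)
               .print()                              // stand-in for the NoSQL DB sink
               .setParallelism(9);

            env.execute("kafka-to-db");
        }
    }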

Kafka: multiple consumers in the same group

Let's say I have a Kafka cluster with several topics spread over several partitions. Also, I have a cluster of applications that act as clients for Kafka. Each application in that cluster has a client subscribed to the same set of topics, identical across the whole cluster. Also, each of these clients shares the same Kafka group ID.
Now, speaking of commit mode: I really do not want to specify offsets manually, but I do not want to use autocommit either, because I need to do some handling after I receive my data from Kafka.
With this setup, I expect the "same data received by different consumers" problem to occur, because I do not specify an offset before reading (consuming), and I read data concurrently from different clients.
Now, my question: what are the solutions to get rid of these multiple reads? Several options come to mind:
1) Exclusive (sequential) Kafka access. Until one consumer has committed its read, no other consumer accesses Kafka.
2) Somehow specify the offset before each read. I do not even know how to do that under the assumption that a read might fail (and the offset will not be committed) - we would need some complicated distributed offset storage.
I'd like to ask people experienced with Kafka to recommend something to achieve the behavior I need.
Every partition is consumed only by one client - another client with the same group ID won't get access to that partition, so concurrent reads won't occur...
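To make that concrete, here is a minimal sketch, assuming placeholder broker, topic and group names: every application instance uses the same group.id, so each partition is assigned to exactly one of them, and offsets are committed manually only after the records have been handled:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");    // placeholder
            props.put("group.id", "my-app-group");            // same on every instance
            props.put("enable.auto.commit", "false");         // commit only after handling
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        handle(record);                        // your processing logic
                    }
                    consumer.commitSync();                     // offsets advance only after handling succeeds
                }
            }
        }

        private static void handle(ConsumerRecord<String, String> record) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
        }
    }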

Hint about Kafka cluster setup

I have the following scenario:
4 wearable sensors attached to individuals.
Potentially infinite individuals.
A Kafka cluster.
I have to perform real-time processing on the data streams on a cluster with a running instance of Apache Flink.
Kafka is the data hub between the Flink cluster and the sensors.
Moreover, the subjects' streams are totally independent, and different streams belonging to the same subject are also independent of each other.
I imagine this setup in my mind:
I set up a specific topic for each subject, and each topic is split into 4 partitions, one for each sensor on that specific person.
In this way I thought to establish a consumer group for every topic.
Actually, my data volume is not that big, but my interest is in building an easily scalable system. One day I might have hundreds of individuals, for instance...
My questions are:
Is this setup good? What do you think about it?
In this way I will have 4 Kafka brokers and each one handles a partition, right (without considering potential backups)?
Destroy me guys,
and thanks in advance
You can't have an infinite number of topics in a Kafka cluster, so if you plan to scale beyond 10,000 or more topics then you should consider another design. Instead of giving each individual a dedicated topic, you can use an individual's ID as a key and publish data as a key/value pair to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
Also consider more partitions. Each of your 4 brokers can handle many partitions. If you only have 4 partitions in a topic, then you can have at most 4 consumers working together in parallel in a consumer group (in your case, in Flink).
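A minimal sketch of the keyed approach (broker and topic names are placeholders): all subjects share one topic, and the subject's ID is used as the record key, so a given subject's data always lands on the same partition and stays ordered per subject:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SensorProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");    // placeholder
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String subjectId = "subject-42";               // individual's ID as the key
                String reading   = "sensor-3:0.87";            // one sensor measurement
                producer.send(new ProducerRecord<>("sensor-data", subjectId, reading));
            }
        }
    }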

What is the best approach to keep two Kafka clusters in sync?

I have to set up two Kafka clusters in two different data centers (DCs), which have the same topics and configuration. The reason is that the connectivity between the two data centers is poor, so we cannot create a single global cluster.
We have producers and consumers that publish and subscribe to the topics in each DC.
The problem is that I need to keep both clusters in sync.
Let's say all messages written to the first DC should eventually be replicated to the second, and the other way around.
I am evaluating the Kafka MirrorMaker tool, creating the mirror by consuming messages from the first cluster and producing them to the second one. However, it is also required to replicate data from the second to the first, because writing data is allowed in both clusters.
I don't think the Kafka MirrorMaker tool fits our case.
I'd appreciate any suggestions.
Thanks in advance.
Depending on your exact requirements, you can use MirrorMaker for your use case.
One option would be to just have two separate topics, let's call them topic1 on cluster 1 and topic2 on cluster 2. All your producing threads write to the "local" topic, and you use MirrorMaker to replicate this topic to the remote cluster.
For your consumers, you simply subscribe to both topics on whichever cluster is closest to you; that way you will get all records that were written on either cluster.
I have created an illustration that hopefully helps.
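In code terms, here is a minimal sketch of the consumer side of this first option (broker, group and topic names are placeholders): producers only ever write to their local topic, MirrorMaker copies each topic to the remote cluster, and consumers simply subscribe to both topics on the nearest cluster:

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DualTopicConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "local-dc-broker:9092"); // whichever DC is closest
            props.put("group.id", "my-consumers");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // topic1 holds records produced in this DC, topic2 holds records
                // mirrored from the remote DC (or the other way around).
                consumer.subscribe(Arrays.asList("topic1", "topic2"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("%s-%d@%d: %s%n",
                                record.topic(), record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }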
Alternatively, you could create aggregation topics on both clusters and use MirrorMaker to replicate data into these topics; this would enable you to have all the data in one topic for consumption.
You would have duplicate data on the same cluster this way, but you could take care of this with lower retention settings on the input topic.
Again, hopefully the following picture helps to explain my thinking.
In order for this to work, you will need to configure MirrorMaker to replicate a topic into a topic with a different name, which is not a standard thing for it to do. I have written a small blog post on how to do this, if you want to investigate this option further.