Kafka transactions on multiple clusters - apache-kafka

I could not find the answer in the transaction API. I think it's probably not possible looking at the design under the hood but I wanted a confirmation: is it possible to have a Kafka transaction if the consumer and the producer are connected to different Kafka clusters? Meaning the topics I am consuming from and the topics I am pushing to are on 2 separate Kafka clusters.

You are correct: a Kafka transaction cannot span multiple clusters.
Both the consumer and the producer must use the same cluster. The main reason is that the consumer's offsets and the producer's records have to be committed together, atomically, in the same transaction, and the offsets are themselves written to an internal topic on the cluster that coordinates the transaction.

Related

How exactly does a Kafka consumer communicate with the server?

Does the Kafka consumer keep checking the health of the broker (Kafka server), or vice versa?
Assuming consumers and brokers somehow know each other's health, how exactly does the consumer read from a partition?
Say I have 48 partitions for a topic and two consumer groups for that topic: how many threads will be consuming the data from all partitions?
Consumers send heartbeats so that the broker acting as their group coordinator knows whether they are healthy. Broker health is tracked by the Controller, a Kafka service that runs on every broker in a cluster, although only one can be active (elected) at any point in time.
See this video for detailed description. In a nutshell, the first consumer to join the group becomes the group leader and decides on the assignments for the rest of the consumers in the same group. This assignment is sent to the broker and distributed to the consumers.
Thread management is your own responsibility. KafkaConsumer is not thread-safe and is meant to be driven from a single thread; KafkaProducer is thread-safe and can be shared between threads. Kafka itself does not start consumer threads for you.
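To make the 48-partition question concrete, here is a minimal sketch in plain Python (no Kafka client involved). It assumes a simple round-robin spread of partitions within each group; the function name `assign_partitions` and the group and member names are hypothetical, and the real assignment strategy depends on the configured assignor. The point it illustrates is that each consumer group reads all partitions independently, and within one group no more consumers than partitions can have work.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over one group's consumers round-robin (illustrative only)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Two hypothetical consumer groups reading the same 48-partition topic.
groups = {
    "group-a": ["a-0", "a-1", "a-2", "a-3"],
    "group-b": [f"b-{i}" for i in range(60)],
}

# Each group gets its own, independent assignment of all 48 partitions.
plans = {name: assign_partitions(48, members) for name, members in groups.items()}

for name, plan in plans.items():
    busy = sum(1 for parts in plan.values() if parts)
    print(name, "consumers with at least one partition:", busy, "of", len(plan))
```

In this sketch the four members of group-a each own 12 partitions, while 12 of the 60 members of group-b stay idle. How many threads you actually run is up to you, with one consumer per thread being the usual pattern.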

What is the correlation between Kafka stream/table, GlobalKTable, brokers and partitions?

I am studying Kafka Streams, KTable, GlobalKTable, etc., and I am confused about them.
What exactly is GlobalKTable?
More generally, if I have a topic with N partitions and one Kafka stream, how many streams (partitions?) will I have after I send some data to the topic?
I ran a few tests and noticed that the match is 1:1. But what if the topic is replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is used for more static data, typically as a lookup table in joins.
As for a topic with N partitions, if you have one Kafka Streams application instance, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each instance would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application instance, then that single instance processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance would process records from 2 partitions. The workload is split across all running instances with the same application-id.
Replication across brokers is configured per topic. The broker setting default.replication.factor defaults to 1, although 3 is a common choice in production. A replication factor of 3 means the records for a given partition are stored on the leader broker for that partition and on two follower brokers (which requires at least three brokers). Replication does not change the number of partitions, so it does not change how partitions are split across your Streams instances.
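The split described above can be sketched in plain Python. This is a simplification under stated assumptions: the helper `split_partitions` is hypothetical, it hands out contiguous, near-equal shares, and it ignores what Kafka Streams really does (it assigns tasks rather than raw partitions and takes state and standby replicas into account).

```python
def split_partitions(num_partitions, num_instances):
    """Return one list of partition ids per instance; sizes differ by at most one."""
    base, extra = divmod(num_partitions, num_instances)
    shares, start = [], 0
    for i in range(num_instances):
        # The first `extra` instances take one additional partition.
        size = base + (1 if i < extra else 0)
        shares.append(list(range(start, start + size)))
        start += size
    return shares


# Input topic A with four partitions, as in the example above.
one_instance = split_partitions(4, 1)    # a single instance owns everything
two_instances = split_partitions(4, 2)   # two partitions per instance
five_instances = split_partitions(4, 5)  # the fifth instance gets nothing

print(one_instance)
print(two_instances)
print(five_instances)
```

The last call shows the upper bound on parallelism: once there are more instances than partitions, the extra instances have no partitions to work on.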
Hope this clears things up some.
-Bill

Kafka throttle producer based on consumer lag

Is there any way to pause or throttle a Kafka producer based on consumer lag or other consumer issues? Would the producer need to determine itself if there is consumer lag then perform throttling itself?
Kafka is built on a pub/sub design. Producers publish messages to a centralized topic, and multiple consumers can subscribe to that topic. Since multiple consumers are involved, no single one of them can dictate the producer's speed: one consumer can be slow while another is fast. It would also go against the design principle, because both systems would become tightly coupled. If you do need throttling, you should perhaps evaluate another approach, such as direct REST calls.
Producer and consumer are decoupled.
Producers push data to Kafka topics (partitioned topics), which are stored on Kafka brokers. A producer doesn't know who consumes its messages or how often.
Consumers fetch data from brokers. A consumer doesn't know how many producers produce the messages. The same messages can even be consumed by several consumers that are in different groups. For example, some consumers can consume faster than others.
You can read more about producers and consumers on the Apache Kafka webpage.
Kafka has no built-in way to throttle producers based on the performance of consumers.
In my scenario I don't want to lose events if the disk size is
exceeded before a message is consumed
To tackle your issue, you have to rely on the parallelism that Kafka offers. Your Kafka topic should have multiple partitions, and the producers have to use different keys when writing to it. That way your data is distributed across multiple partitions, and by using a consumer group you can spread the load over a group of consumers. All data within a partition is still processed in order, which may be relevant since you are dealing with event processing.
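A minimal sketch of how distinct keys spread records over partitions, in plain Python. The hash here is an explicit stand-in: it uses CRC32 from the standard library, whereas Kafka's default partitioner hashes the serialized key with murmur2, so the partition numbers will not match a real cluster. The function name `partition_for`, the `device-N` keys and the partition count of 6 are all hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (CRC32); Kafka's default partitioner uses murmur2 instead.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 6  # assumed partition count for the example topic
keys = [f"device-{i}" for i in range(1000)]

# Count how many of the keys land on each partition.
load = Counter(partition_for(k, NUM_PARTITIONS) for k in keys)
for partition in sorted(load):
    print("partition", partition, "->", load[partition], "keys")

# The same key always maps to the same partition, which preserves per-key order.
assert partition_for("device-42", NUM_PARTITIONS) == partition_for("device-42", NUM_PARTITIONS)
```

If every record used the same key, all of them would land on one partition and only one consumer in the group could process them, which is why varied keys matter for load distribution.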

Clickhouse kafka table engine with many consumer

I'm planning to run some tests with ClickHouse by ingesting my Kafka topics into a SummingMergeTree using this method: https://clickhouse.yandex/docs/en/table_engines/kafka/
For my tests in a dev environment I'm not worried about the volume, but in production we are already consuming those topics and we need many consumers to be able to read messages as fast as they are pushed. My question is: is there a way in ClickHouse to have many Kafka consumers on one table with the Kafka engine?
Thanks,
Romaric
Reading the documentation it seems that the num_consumers parameter in the Kafka engine is exactly what you need:
num_consumers – The number of consumers per table. Default: 1. Specify
more consumers if the throughput of one consumer is insufficient. The
total number of consumers should not exceed the number of partitions
in the topic, since only one consumer can be assigned per partition.

Kafka Issues on consumer group

I'm a newbie in Kafka. I had a glance at the Kafka documentation. It seems that messages are dispatched to a subscribing consumer group by binding each partition to a consumer instance.
One important thing to remember when working with Apache Kafka is that the number of consumers in the same consumer group should be less than or equal to the number of partitions in the consumed topic. Otherwise, the excess consumers will not receive any messages from the topic.
In a non-prod environment, I didn't configure the topic's partitions. In that case, is there only a single partition in Kafka? And if I start multiple consumers sharing the same group and subscribe them to the topic, will the messages always be dispatched to the same instance in the group? In other words, do I have to partition the topic to get the load-balancing feature of a consumer group?
Thanks!
You are absolutely right. A single partition cannot be processed in parallel by one consumer group. You can treat a partition as atomic: it cannot be split.
If you configure the non-prod and prod environments with the same number of partitions per topic, that should help you find the correct number of consumers and catch problems before moving to prod.