ClickHouse Kafka table engine with many consumers - apache-kafka

I'm planning to do some tests with ClickHouse by ingesting my Kafka topics into a SummingMergeTree using this method: https://clickhouse.yandex/docs/en/table_engines/kafka/
For my tests on a dev environment I'm not worried about the volume, but on the production environment we are already consuming those topics and we need many consumers to be able to read messages as fast as they are pushed. My question is: is there a way in ClickHouse to have many Kafka consumers on one table with the Kafka engine?
Thanks,
Romaric

Reading the documentation, it seems that the num_consumers parameter of the Kafka engine is exactly what you need:
num_consumers – The number of consumers per table. Default: 1. Specify
more consumers if the throughput of one consumer is insufficient. The
total number of consumers should not exceed the number of partitions
in the topic, since only one consumer can be assigned per partition.
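For reference, here is a minimal sketch of what such a table definition might look like, executed here over JDBC; the host, database, table, columns, topic and consumer group are placeholders, and kafka_num_consumers is the SETTINGS-style name of the num_consumers parameter quoted above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateKafkaEngineTable {
    public static void main(String[] args) throws Exception {
        // Assumes the ClickHouse JDBC driver is on the classpath and the server
        // is reachable on its default HTTP port; adjust host/database as needed.
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
             Statement stmt = conn.createStatement()) {

            // kafka_num_consumers = 4 spawns four consumers for this table; the
            // topic should have at least 4 partitions, since only one consumer
            // can be assigned per partition.
            stmt.execute(
                "CREATE TABLE kafka_events (" +
                "  event_date Date, key String, value UInt64" +
                ") ENGINE = Kafka SETTINGS " +
                "  kafka_broker_list = 'broker1:9092,broker2:9092'," +
                "  kafka_topic_list = 'my_topic'," +
                "  kafka_group_name = 'clickhouse_group'," +
                "  kafka_format = 'JSONEachRow'," +
                "  kafka_num_consumers = 4");
        }
    }
}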

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka; below is a sample of what my current set-up is.
Kafka Setup
Multiple topics that each have one partition. 2 consumer groups, each containing one consumer.
The issue I am seeing is that the lag is enormous, sometimes upwards of 8-10 hours before messages are consumed; the load is about 100-200 million messages a day.
What steps should I look at in order to address this? Is it as simple as reassigning partitions or creating new partitions for the 3 topics that are being consumed by the two consumers? I've also looked at compressing the contents of the producer with gzip, but it doesn't really help in terms of the lag. I've looked at network connections and don't feel it has anything to do with this. If anyone could point me in the direction of Kafka and low-latency documents, that would be good too.
Generally the way to parallelize your consumption is to increase the number of partitions and the number of consumers in the consumer groups that subscribe to those topics (Nconsumers <= Npartitions).
And distribute your topics by increasing the number of brokers in your cluster.
So from the topic side:
Fewer partitions per topic result in:
producer and/or consumer lag
starved or overloaded brokers and consumers.
(But take into account that) more partitions per topic result in:
more broker resources (file handles and memory); there is an overhead with each additional partition, and the number of partitions a broker can handle is limited
more replication load.
Then increase the number of consumers in those consumer groups.
Try increasing the partitions per topic, but by itself that will not help! You will also need to increase the number of consumers in your consumer group (a minimal consumer sketch follows below). Are those single consumers or consumer groups in your diagram? How many consumers are in your consumer group versus partitions on the topics they are subscribed to?
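A minimal sketch of such a consumer with the plain Java client (the broker address, topic and group name are placeholders): every additional instance started with the same group.id joins the same consumer group and takes over a share of the partitions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // Every instance using the same group.id shares the partitions of the
        // subscribed topic, so running N instances against a topic with at
        // least N partitions spreads the consumption load.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process the record; keep this fast to avoid building up lag.
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
            }
        }
    }
}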
From this statement in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get the idea that your messages may be huge! Is that so? If yes, try to keep the messages small (for example by excluding BLOBs and keeping only external links to them).
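A sketch of that keep-messages-small idea (sometimes called the claim-check pattern); the topic name, key and storage location below are made up for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClaimCheckProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Instead of embedding the BLOB in the record, store it externally
            // and publish only a small reference to it (hypothetical location).
            String blobUrl = "s3://my-bucket/invoices/12345.pdf";
            String payload = "{\"invoiceId\":\"12345\",\"blobUrl\":\"" + blobUrl + "\"}";
            producer.send(new ProducerRecord<>("invoices", "12345", payload));
        }
    }
}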
Still, the issue may be somewhere else, such as bad configs, how consumers commit messages (acknowledgment handling), etc.
So I highly advise you to read the article Fine-tune Kafka performance with the Kafka optimization theorem.
I also advise you to go through the Apache Kafka courses on the Confluent web page.
This should be added as a comment, but I don't have permission to do so. The provided info is very limited and the diagram is incorrect, which limits the ability to provide an adequate, helpful answer. If possible, please correct your diagram and add more details about your set-up, such as:
broker configuration, file attached;
consumer set-up (how consumers commit messages);
producer set-up;
topic set-up;
Kafka version (the defaults differ between major/minor versions).
The provided diagram is not correct with respect to the topic-partition relationship, so I assume it is a typo and Partition 0 should be substituted with Broker 0, right?
Kafka topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of the records owned by a topic...
Then there is an open question about the number of partitions in each topic and the number of topics on each broker, as well as the number of brokers in your cluster!
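If it helps, those numbers can be checked programmatically with the AdminClient; the broker address below is a placeholder:

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class ClusterInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // How many brokers are in the cluster?
            int brokerCount = admin.describeCluster().nodes().get().size();
            System.out.println("brokers: " + brokerCount);

            // How many partitions does each topic have?
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();
            descriptions.forEach((name, desc) ->
                    System.out.println(name + ": " + desc.partitions().size() + " partitions"));
        }
    }
}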

Kafka topic is not doing load balance for springboot consumer applications

I am new to Kafka and would like some help.
I have a topic XXXX and some applications consuming this topic, all listening on the same group:
spring.cloud.stream.bindings.aaa_bbb.destination=XXXX
spring.cloud.stream.bindings.aaa_bbb.group=XXXX_group
Topic XXXX has only one partition.
When I send 1000 messages to topic XXXX, only one application consumes all the messages.
But when I add a new partition to topic XXXX, the messages are divided between 2 applications, and I still have applications that receive nothing.
I repeat the process and add another partition to topic XXXX;
now the topic has 3 partitions and the messages are divided among 3 applications.
It looks like it's one partition for each consumer,
which doesn't make much sense to me, or I don't understand it.
Is there a way to make this load balancing work without having to create a partition for each consumer?
Can someone explain to me how this relationship works?
That is a fundamental of Kafka: only one consumer in each group can consume from a given partition. A topic partition is a simple log; it is not like a queue in JMS or RabbitMQ.
Kafka maintains only a single current committed offset for each group/topic/partition.
The only way to add concurrency is to increase the number of partitions.
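For example, partitions can be added to an existing topic with the AdminClient; the broker address is a placeholder, and the topic name XXXX and the target of 3 partitions are taken from the question:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow topic XXXX to 3 partitions so that up to 3 consumers in the
            // same group (the Spring applications sharing XXXX_group) can each
            // be assigned one partition and consume in parallel.
            admin.createPartitions(
                    Collections.singletonMap("XXXX", NewPartitions.increaseTo(3)))
                 .all().get();
        }
    }
}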
To increase throughput you have to have more than one partition. When events are written to the log, the ID (key) of the event determines which partition the message is delivered to.
Kafka only guarantees ordering for a given ID, not over the entire log.
I normally recommend having more than one partition even if you have a single node, as this allows the cluster to be scaled in the future for improved performance.
Increasing the number of partitions after the topic has been created changes which partition a given key maps to, so plan the partition count up front.
In your case I'd start with 3 per node, up to a maximum of 9 if you had 3 nodes in the cluster, but please test this yourself.
There's a limit of one consumer per partition, which is the behaviour you're seeing.
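A small sketch of the per-key ordering point (broker address, topic and key are placeholders): records that share a key are hashed to the same partition, so they report the same partition number and keep their relative order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedOrdering {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // All three records use the key "order-42" and therefore land on
                // the same partition; records with other keys may go elsewhere.
                RecordMetadata meta =
                        producer.send(new ProducerRecord<>("XXXX", "order-42", "event-" + i)).get();
                System.out.println("key=order-42 -> partition " + meta.partition());
            }
        }
    }
}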

When you change the number of partitions for a user Kafka topic, will Kafka Streams adjust the number of partitions for the internal topic? [duplicate]

Kafka version: 1.0.0
Let's say the stream application uses the low-level Processor API, which maintains state and reads from a topic with 10 partitions. Please clarify whether the internal topic is expected to be created with the same number of partitions, or per the broker default. If it's the latter, and we need to increase the partitions of the internal topic, is there any option?
Kafka Streams will create the topic for you. And yes, it will create it with the same number of partitions as your input topic. During startup, Kafka Streams also checks whether the topic has the expected number of partitions and fails if not.
The internal topic is basically a regular topic like any other, and you can change its number of partitions via the command-line tools, as for any other topic. However, this should never be required. Also note that dropping/adding partitions will mess up your state.
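As an illustration, here is a minimal Streams sketch (application id, topic and store names are placeholders): the count() below is backed by a state store, and Streams creates its internal changelog topic with the same number of partitions as the input topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;

public class ChangelogPartitionsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateful-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // Counting by key materializes a state store; Streams backs it with an
        // internal changelog topic ("my-stateful-app-counts-store-changelog")
        // that is created with the same partition count as "input-topic".
        input.groupByKey().count(Materialized.as("counts-store"));

        new KafkaStreams(builder.build(), props).start();
    }
}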

What is the correlation between Kafka streams/tables, GlobalKTable, brokers and partitions?

I am studying Kafka streams, tables, GlobalKTable, etc., and I am getting confused.
What exactly is a GlobalKTable?
But overall, if I have a topic with N partitions and one Kafka Streams application, after I send some data to the topic, how many streams (partitions?) will I have?
I did some tests and I noticed that the match is 1:1. But what if I make the topic replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it is used for more static data and mostly for looking up records in joins.
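A rough sketch of that lookup-join usage (the topic names and the mapping from the stream value to the table key are invented for illustration):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class GlobalLookupJoin {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Every instance of the application holds a full copy of this table,
        // which makes it suitable for smallish, fairly static lookup data.
        GlobalKTable<String, String> products = builder.globalTable("products");

        // The orders stream is partitioned as usual; each instance only sees
        // its share, but can look up any product in the GlobalKTable.
        KStream<String, String> orders = builder.stream("orders");
        orders.join(products,
                (orderKey, orderValue) -> orderValue, // hypothetical: the order value is the product id
                (orderValue, productValue) -> orderValue + " / " + productValue)
              .to("enriched-orders");

        // builder.build() would then be passed to a KafkaStreams instance as usual.
    }
}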
As for a topic with N partitions: if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, each application would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, each instance would process records from 2 partitions; the workload is split across all running instances with the same application-id.
Topics are typically replicated across different brokers in Kafka, with 3 being a common replication factor. A replication factor of 3 means the records for a given partition are stored on the lead broker for that partition and on two other follower brokers (assuming a three-node broker cluster).
Hope this clears things up some.
-Bill

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will there be a degradation in performance?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
To be technical, the "maximum" number of topics you could be subscribed to would be constrained by the available memory space of your consumer process (if your topics are listed explicitly, then a very large portion of the Java String pool will be your topic names). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive).
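One way around listing topics explicitly is regex subscription, sketched below; the broker address, group id and pattern are placeholders, and the single-argument subscribe(Pattern) overload is available in recent client versions:

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PatternSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "wide-subscriber");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribing by regex avoids holding an explicit list of hundreds of
            // thousands of topic names; matching topics (including ones created
            // later) are picked up as the consumer's metadata refreshes.
            consumer.subscribe(Pattern.compile("metrics\\..*"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.topic() + " -> " + r.value()));
            }
        }
    }
}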
Another consideration is how the topic-assignment data structures are set up on the group coordinator brokers. They could run out of space to record the topic assignments, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZooKeeper keeps ALL of its data in memory for fast retrieval. It is also not sharded, meaning all data MUST fit on one node. This means there is a limit to the number of topics you can create, constrained by the available memory on a ZooKeeper node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
A consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to Kafka topics, how many to subscribe to, and how to handle that data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub-sub mechanism that Kafka provides through the segregation of messages into various topics is to facilitate handling specific categories of messages with separate consumers. So if you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in Kafka in the first place?