Ordering guarantee in each partition using Confluent Replicator - apache-kafka

There is a requirement in our systems to maintain the sequence and ordering of records inside a Kafka topic partition.
As observed in our test runs, Kafka MirrorMaker does not provide an ordering guarantee within a partition. Records tend to be shuffled between source and target cluster topics.
We are planning to use Confluent Replicator for cross-cluster data replication. In a test run with Confluent community edition 5.3.1, the source and destination topics had exactly the same partitions and the same record count in each partition. (Replicator was run with a single-threaded configuration.)
But does Replicator guarantee exact ordering of records within a partition?
And if I increase the number of replication threads for parallelism and better throughput, does it still guarantee ordering (including when one thread fails)?

MirrorMaker (1.0) will repartition data using the DefaultPartitioner, so you would only see "out of order" data if the source producers override their partitioner. In addition, MirrorMaker does not guarantee that destination topics have the same number of partitions or the same configuration as the source.
Replicator and MirrorMaker 2.0 (available with Kafka 2.4.0) preserve the input partition counts and topic configs. Order within a partition is guaranteed to the same extent as for any other consumer group. However, records might be produced more than once in edge cases such as network transmission errors.
Increasing connector tasks adds more consumers to the group, the same as scaling any other application: each partition is still read by a single task, and input and output partitions should match.
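If you want to check this in your own test runs, here is a minimal sketch in Python. It assumes you can recover the source offset of every record found in one target partition (for example from a sequence number you embed in the payload; that is an assumption of this sketch, not a Replicator feature). Duplicates from redelivery are tolerated, but a record arriving after a later one is reported as an ordering violation. The function name is made up for illustration.

```python
def ordered_modulo_duplicates(source_offsets):
    """Return True if first occurrences appear in increasing source-offset order.

    source_offsets: the source-partition offset of each record, listed in the
    order the records appear in ONE target partition (assumed to be known).
    """
    seen = set()
    highest = -1
    for offset in source_offsets:
        if offset in seen:
            continue  # redelivered duplicate: allowed under at-least-once delivery
        if offset < highest:
            return False  # a new record arrived after a later one: order is broken
        seen.add(offset)
        highest = offset
    return True


# A task restarted and re-sent offsets 2 and 3: duplicates, but order holds.
print(ordered_modulo_duplicates([0, 1, 2, 3, 2, 3, 4, 5]))  # True
# Offset 1 shows up only after offset 2: ordering is violated.
print(ordered_modulo_duplicates([0, 2, 1, 3]))  # False
```

Run it once per target partition; a single False is enough to show that ordering was not preserved in that run.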

Related

Can I use Kafka for multiple independent consumers sequential reads?

I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting with "earliest")?
An alternative is to set an HDFS storage (we use Azure so it's called Storage Account) and simply supply a mount point. However, I do not have control of the throughput in this case.
Throughput calculation:
let's say 25 consumers run concurrently, each reading at 30MB/s -> 750MB/s .
Assuming data is read from disk, and disk rate is 50MB/s, I need to read concurrently from 750/50 = 15 disks.
Does it mean I need to have 15 brokers? I did not see how one broker can allocate partitions to several disks attached to it.
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
"Processing will probably be in Spark, but not mandatory"
"An alternative is to set an HDFS storage (we use Azure)"
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than the timestamp that you can order by, then use that as the message key.
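For reference, the numbers from the question can be put together as a quick back-of-the-envelope calculation. This is only a sketch: it uses decimal units (1 GB = 1,000,000 KB), takes the question's 50 MB/s per-disk figure at face value, and ignores page cache and zero-copy effects, so treat the disk count as a pessimistic bound rather than a sizing recommendation.

```python
import math

MESSAGES = 600_000_000        # messages in the dataset (from the question)
MESSAGE_KB = 1.3              # approximate size of one message in KB
PER_CONSUMER_MB_S = 30        # target read rate per consumer
CONCURRENT_CONSUMERS = 25     # consumers assumed to run at the same time
DISK_MB_S = 50                # per-disk read rate assumed in the question

# Total dataset size in GB (decimal units).
dataset_gb = MESSAGES * MESSAGE_KB / 1_000_000

# Time for one consumer to read everything once at the target rate.
full_read_hours = dataset_gb * 1000 / PER_CONSUMER_MB_S / 3600

# Aggregate read rate and the naive number of disks needed to sustain it.
aggregate_mb_s = CONCURRENT_CONSUMERS * PER_CONSUMER_MB_S
disks_needed = math.ceil(aggregate_mb_s / DISK_MB_S)

print(f"dataset size:      {dataset_gb:.0f} GB")
print(f"one full read:     {full_read_hours:.1f} hours at {PER_CONSUMER_MB_S} MB/s")
print(f"aggregate reads:   {aggregate_mb_s} MB/s")
print(f"disks (naive):     {disks_needed}")
```

So the dataset is roughly 780 GB and a single pass at 30 MB/s takes a bit over 7 hours, which is worth keeping in mind when students are expected to repeat the read many times.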
It's not really clear whether each of the 50 students does the same processing on the data set, or whether some precomputation can be done. If so, Kafka Streams KTables can be set up to aggregate some static statistics of the data as it is streamed through a topic. That way you can distribute the load for those queries and not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale or Influx, or maybe Druid, any of which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, as it can use more executor threads than there are partitions, so it's mostly about how fast your network and disks are.
Kafka relies on the operating system's page cache, so recently written or recently read data is usually served from memory; for your use case, many of the reads will probably come from memory.
minPartitions: Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
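As a rough sketch of how that option might be passed from PySpark: the helper below only builds the options dictionary, so it runs without a Spark installation. The option keys are the ones documented for the Kafka source; the helper name, broker address and topic name are placeholders made up for this example.

```python
def kafka_batch_read_options(bootstrap_servers, topic, min_partitions):
    """Build options for a batch read of a whole topic with the Spark Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Read the fixed dataset from the beginning to the current end.
        "startingOffsets": "earliest",
        "endingOffsets": "latest",
        # Hint: ask Spark to split the topic into at least this many tasks.
        "minPartitions": str(min_partitions),
    }


# Placeholder values; replace with your own cluster and topic.
options = kafka_batch_read_options("broker1:9092", "timeseries", 200)
print(options)

# With a SparkSession available, the dictionary would be used like this:
#   df = spark.read.format("kafka").options(**options).load()
```

Note that once Spark splits a Kafka partition into several tasks, those tasks run in parallel, so this only helps if strict in-order processing within a partition is not required.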

What is the relationship between Kafka stream/table, GlobalKTable, brokers and partitions?

I am studying Kafka Streams, KTable, GlobalKTable etc., and I am confused about them.
What exactly is a GlobalKTable?
But overall, if I have a topic with N partitions and one Kafka stream, how many streams (partitions?) will I have after I send some data to the topic?
I ran some tests and noticed that the match is 1:1. But what if the topic is replicated across different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is used for more static data and for looking up records in joins.
As for a topic with N partitions, if you have one Kafka Streams application instance, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each instance would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application instance, then that single instance processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions. The workload is split across all running instances with the same application-id.
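To make the split concrete, here is a small Python sketch that spreads partitions over instances round-robin. It is only an illustration of the idea: the real Kafka Streams assignor is more involved (it works with tasks and tries to keep assignments sticky), and the function name is made up.

```python
def assign_partitions(num_partitions, num_instances):
    """Spread partition numbers over instances round-robin (illustration only)."""
    assignment = {instance: [] for instance in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment


# One instance owns all four partitions of topic A.
print(assign_partitions(4, 1))  # {0: [0, 1, 2, 3]}
# Two instances with the same application-id get two partitions each.
print(assign_partitions(4, 2))  # {0: [0, 2], 1: [1, 3]}
# A fifth instance would sit idle: there are only four partitions to hand out.
print(assign_partitions(4, 5))  # {0: [0], 1: [1], 2: [2], 3: [3], 4: []}
```

The last case shows why the partition count is the upper bound on parallelism for a single application.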
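Topics can be replicated across different brokers in Kafka. The replication factor is set per topic; the broker default (default.replication.factor) is 1, and 3 is a common choice in production. A replication factor of 3 means the records for a given partition are stored on the leader broker for that partition and on two other follower brokers (assuming at least a three-node broker cluster). Replication does not change the number of partitions your application sees.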
Hope this clears things up some.
-Bill

Increase the Number of partitions

We are working on Confluent Platform and are still getting to know the internals, but we have implemented some generic use cases. We are now trying to optimize our cluster.
In my use case, I need to increase the number of partitions of a topic. What is the impact of doing so? Can you please explain?
Sure, you can increase partitions.
However,
Increasing partitions does not move existing data. If you are using Confluent Enterprise, you could use confluent-rebalancer; if not, use the kafka-reassign-partitions CLI tool. So you'll definitely want to rebalance the topic to "optimize" the cluster.
During the retention period of the topic (read: for the existing data), if a producer was previously sending keyed data to a given partition and the topic now has more partitions, then you lose ordering for those keys. New messages with the same key can land in a different partition, because the DefaultPartitioner derives the partition from the hash of the key modulo the number of partitions. If you don't send keys with messages, then this isn't a problem.
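A small Python sketch of that effect. Kafka's DefaultPartitioner hashes the key bytes with murmur2; this sketch swaps in zlib.crc32 as a stand-in so it runs with the standard library only, which means the exact partition numbers differ from what Kafka would pick, but the behaviour when the partition count changes is the same. The key names and partition counts are made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the key hash (crc32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"order-{i}" for i in range(1000)]

# Compare where each key lands before and after growing the topic from 6 to 7 partitions.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 7)]

print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Every key in `moved` has its older records in one partition and its newer records in another, so a consumer can no longer rely on partition order for that key until the old data has aged out.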

Clickhouse kafka table engine with many consumer

I'm planning to run some tests with ClickHouse by ingesting my Kafka topics into a SummingMergeTree using this method: https://clickhouse.yandex/docs/en/table_engines/kafka/
For my test in a dev environment, I'm not worried about the volume, but in production we are already consuming those topics and need many consumers to read messages as fast as they are pushed. My question is: is there a way in ClickHouse to have many Kafka consumers on one table with the Kafka engine?
Thanks,
Romaric
Reading the documentation it seems that the num_consumers parameter in the Kafka engine is exactly what you need:
num_consumers – The number of consumers per table. Default: 1. Specify
more consumers if the throughput of one consumer is insufficient. The
total number of consumers should not exceed the number of partitions
in the topic, since only one consumer can be assigned per partition.

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will performance degrade?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
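As a sketch of what "keys instead of topics" can look like on the consuming side: records from one shared topic are grouped by key, and each key's records keep their arrival order. The record tuples and key names below are made up for illustration; in a real consumer they would come from the polled records.

```python
from collections import defaultdict


def demultiplex(records):
    """Group (key, value) records by key, keeping arrival order per key."""
    per_key = defaultdict(list)
    for key, value in records:
        per_key[key].append(value)
    return dict(per_key)


# Records as they might arrive from a single shared topic (hypothetical data).
records = [("sensor-1", "a"), ("sensor-2", "b"), ("sensor-1", "c")]
streams = demultiplex(records)
print(streams)  # {'sensor-1': ['a', 'c'], 'sensor-2': ['b']}
```

Each key then plays the role that a dedicated topic would have played, without the broker having to track hundreds of thousands of topics.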
To be technical, the "maximum" number of topics you could be subscribed to would be constrained by the memory available to your consumer process (if your topics are listed explicitly, then a very large portion of the Java String pool will be your topic names). This seems the less likely limiting factor, since listing that many topics explicitly is prohibitive.
Another consideration is how the topic assignment data structures are set up at the Group Coordinator brokers. They could run out of space to record the topic assignment, depending on how they do it.
Lastly, and most plausibly, the limit is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, which is constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
A consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command line consumer that is shipped with Kafka.
That said, the logic of subscribing to a Kafka topic, how many topics to subscribe to and how to handle that data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub-sub mechanism that Kafka provides through the segregation of messages into topics is to let separate consumers handle specific categories of messages. So if you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in the first place?