Kafka Stream partition distribution - apache-kafka

In Kafka Streams, stream tasks are distributed across instances in a multi-instance deployment (and hence so are partitions). On the other hand, one of the differences between KTable and GlobalKTable is that a KTable's partitions are distributed across instances (from Mastering Kafka Streams and ksqlDB).
What I can't understand is whether the distribution is caused by the KTable, by the stream tasks, or by both (and if both, how)?
What happens if we have a KTable in our topology and multiple stream tasks (a source processor on a multi-partition topic) in a multi-instance environment?

Not sure if I fully understand the question. Maybe the docs help to shed some light: https://docs.confluent.io/platform/current/streams/architecture.html
A KTable (and KStream) is a logical abstraction. When you call streamsBuilder.build(), it is compiled into a Topology with Processors. Connected Processors (that may have a state store attached) are grouped into sub-topologies, and sub-topologies are executed by tasks (the number of tasks is based on the number of input topic partitions).
For streamsBuilder.table("topic"), the compiled Topology is:
topic -> source -> processor(state)
For each topic partition, a task will be created processing one partition, and thus the KTable is implicitly partitioned, too.

Related

Consuming from multiple topics using single kafka stream

Which one is recommended:
1. Single Kafka stream consuming from multiple topics
2. Different Kafka streams consuming from different topics (I've used this one already with no issues encountered)
Is it possible to achieve #1? If yes, what are the implications?
And if I use the 'EXACTLY_ONCE' setting, what kind of complexities will it bring?
Kafka version: 2.2.0-cp2
Is it possible to achieve #1 (Single kafka stream consuming from multiple topics)
Yes, you can use StreamsBuilder#stream(Collection<String> topics)
If the data that you want to process is spread across multiple topics and these topics together constitute one single source, then you can use this, but not if you want to process those topics in parallel.
With a single stream thread (num.stream.threads defaults to 1), it is like one consumer subscribing to all these topics, which also means one thread consuming all the topics. When you call poll(), it returns ConsumerRecords from all the subscribed topics and not just one topic.
In Kafka Streams, there is a term called Topology, which is basically an acyclic graph of sources, processors and sinks. A topology can contain sub-topologies.
Sub-topologies can then be executed as independent stream tasks through parallel threads (Reference)
Each sub-topology can have its own source, which can be a topic, so if you want parallel processing of these topics, you have to break up your graph into sub-topologies.
If I use 'EXACTLY_ONCE' settings, what kind of complexities it'll bring?
When messages reach a sink processor in a topology, the offsets of their source must be committed, where a source can be a single topic or a collection of topics.
Whether there are multiple topics or one topic, we need to send offsets to the transaction from the producer. These are basically a Map<TopicPartition, OffsetAndMetadata> that should be committed when the messages are produced.
So, I think it should not introduce any complexities whether it is a single topic with 10 partitions or 10 topics with 1 partition each, because the offset is tracked at the TopicPartition level and not at the topic level.
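To make the last point concrete, here is a minimal plain-Java sketch (no Kafka dependency) of why both layouts look the same to a transactional commit. The class and method names are hypothetical, the String key "topic-partition" is only a stand-in for Kafka's TopicPartition, and the Long value is a stand-in for the offset metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OffsetKeysSketch {

    // Build an "offsets to commit" map with one entry per topic-partition.
    // The key "topic-partition" stands in for Kafka's TopicPartition and the
    // Long value stands in for the offset metadata (both are assumptions
    // made to keep the sketch free of Kafka dependencies).
    static Map<String, Long> offsetsToCommit(Map<String, Integer> partitionsPerTopic, long nextOffset) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> topic : partitionsPerTopic.entrySet()) {
            for (int p = 0; p < topic.getValue(); p++) {
                offsets.put(topic.getKey() + "-" + p, nextOffset);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical topic names: one topic with 10 partitions...
        Map<String, Integer> oneTopic = new LinkedHashMap<>();
        oneTopic.put("events", 10);

        // ...versus 10 topics with 1 partition each.
        Map<String, Integer> tenTopics = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            tenTopics.put("events" + i, 1);
        }

        System.out.println(offsetsToCommit(oneTopic, 42L).size());  // 10
        System.out.println(offsetsToCommit(tenTopics, 42L).size()); // 10
    }
}
```

Either way the commit carries ten entries, one per topic-partition, which is why the number of topics by itself should not add complexity.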

What is the correlation in kafka stream/table, globalktable, brokers and partition?

I am studying Kafka Streams, KTable, GlobalKTable, etc., and I am confused about them.
What exactly is a GlobalKTable?
More generally, if I have a topic with N partitions and one Kafka Streams application, after I send some data to the topic, how many streams (partitions?) will I have?
I made some tries and noticed that the match is 1:1. But what if the topic is replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is used for more static data and mostly for looking up records in joins.
As for a topic with N partitions, if you have one Kafka Streams application instance, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each instance would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application instance, then that single instance processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions; the workload is split across all running instances with the same application-id.
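The four-partition example can be sketched in plain Java as a simple round-robin split. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, and the real Kafka Streams assignor also takes things like state-store locality into account, so the exact partition-to-instance mapping may differ.

```java
import java.util.ArrayList;
import java.util.List;

class AssignmentSketch {

    // Spread partitions 0..numPartitions-1 over the instances round-robin.
    // Simplification: the real assignor is more sophisticated, but the
    // resulting per-instance counts are the point of this sketch.
    static List<List<Integer>> assign(int numPartitions, int numInstances) {
        List<List<Integer>> perInstance = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            perInstance.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            perInstance.get(p % numInstances).add(p);
        }
        return perInstance;
    }

    public static void main(String[] args) {
        System.out.println(assign(4, 1)); // [[0, 1, 2, 3]]
        System.out.println(assign(4, 2)); // [[0, 2], [1, 3]]
    }
}
```

Calling assign(4, 5) leaves the fifth list empty, which mirrors the fact that instances beyond the number of partitions have nothing to process.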
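Topics can be replicated across different brokers in Kafka. A replication factor of 3 is a common production setting (the broker-level default, default.replication.factor, is 1). A replication factor of 3 means the records for a given partition are stored on the leader broker for that partition and on two other follower brokers (assuming a three-node broker cluster). Replication does not change the number of partitions, so it does not affect how partitions are split across your application instances.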
Hope this clears things up some.
-Bill

Ordering guarantee in each partition using Confluent Replicator

There is a requirement in our systems to maintain a proper sequence and ordering guarantee of records inside a Kafka topic partition.
As observed in our test runs, Kafka Mirror does not provide an ordering guarantee within a partition. Records tend to be shuffled between the source and target cluster topics.
We are planning to use Confluent Replicator for cross-cluster data replication. In a test run of Confluent community edition 5.3.1, we observed that the source and destination topics kept exactly the same partitions and respective record counts. (Replicator was run with a single-thread config.)
But does Replicator guarantee exact ordering of records within a partition?
And if I increase the number of replication threads for parallelism and better throughput, does it still guarantee ordering (also in case one thread fails)?
MirrorMaker (1.0) will repartition data using the DefaultPartitioner, so the only way you'd manage to get "out of order data" is by having producers that override their partitioner. In addition, MirrorMaker does not guarantee that destination topics have the same number of partitions or the same configuration as the source.
Replicator and MirrorMaker 2.0 (available with Kafka 2.4.0) preserve the input partition counts and topic configs. Order within a partition is guaranteed to the same extent as for any other consumer group. It is possible, however, that records are produced more than once during delivery due to edge cases such as network transmission errors.
Increasing connector tasks will add more consumers to the group, again the same as for any other application, and input and output partitions should still match.

What purpose do tasks in Kafka streams API serve

I am trying to understand the architecture of Kafka streams API and came across this in the documentation:
An application's processor topology is scaled by breaking it into multiple tasks
What are all the criteria for breaking up the processor topology into tasks? Is it just the number of partitions in the stream/topic, or something more?
Tasks can then instantiate their own processor topology based on the assigned partitions
Can someone explain what the above means with an example? If the tasks are created only with the purpose of scaling, shouldn't they all have the same topology?
Tasks are atomic parallel units of processing.
A topology is divided into sub-topologies (sub-topologies are "connected components" that forward data in-memory; different sub-topologies are connected via topics). For each sub-topology the number of input topic partitions determines the number of tasks that are created. If there are multiple input topics, the maximum number of partitions over all topics determines the number of tasks.
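The task-count rule above can be written down as a tiny plain-Java calculation. The class name, method name, and the topic names and partition counts are all hypothetical; this only restates the rule and does not call any Kafka API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class TaskCountSketch {

    // Number of tasks for ONE sub-topology: the maximum partition count
    // over all of that sub-topology's input topics.
    static int numTasks(Map<String, Integer> partitionsPerInputTopic) {
        if (partitionsPerInputTopic.isEmpty()) {
            return 0;
        }
        return Collections.max(partitionsPerInputTopic.values());
    }

    public static void main(String[] args) {
        // Hypothetical input topics of a single sub-topology.
        Map<String, Integer> inputs = new LinkedHashMap<>();
        inputs.put("orders", 4);
        inputs.put("customers", 6);
        System.out.println("tasks = " + numTasks(inputs)); // tasks = 6
    }
}
```

For a topology with several sub-topologies, the same calculation would be applied to each sub-topology separately.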
If you want to know the sub-topologies of your Kafka Streams application, you can call Topology#describe(): the returned TopologyDescription can either be just printed via toString() or one can traverse sub-topologies and their corresponding DAGs.
A Kafka Streams application has one topology that may have one or more sub-topologies. You can find a topology with 2 sub-topologies in the article Data Reprocessing with the Streams API in Kafka: Resetting a Streams Application.

Kafka consumer reading messages parallel

Can we have multiple consumers to consume from a topic to achieve parallel processing in kafka.
My use case is to read messages from a single partition in parallel.
Yes, you can process messages in parallel using many Kafka consumers, but no, it's not possible if you only have one partition.
Parallelism in Kafka consumption is defined by the number of partitions. You can easily re-partition your topic at any time to create more partitions.
Below is an example of how to process messages in parallel using rapids-kafka-client, a library that makes parallel Kafka consumption easier.
public static void main(String[] args) {
  ConsumerConfig.<String, String>builder()
      .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
      .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
      .prop(GROUP_ID_CONFIG, "stocks")
      .topics("stock_changed")
      .consumers(7)
      .callback((ctx, record) -> {
        System.out.printf("status=consumed, value=%s%n", record.value());
      })
      .build()
      .consume()
      .waitFor();
}
Simply put, we can't achieve parallelism within a single partition for consumers by default.
But you can try Akka Streams Kafka (Reactive Kafka). Have a look through these docs.
The number of partitions defines the level of parallelism when reading from a Kafka topic. But reading is (more or less) only restricted by your network capacity.
A good pattern is to separate the reading and the processing of messages (one thread per topic-partition for reading and multiple threads for processing these messages).
You need multiple partitions to do this, or something like Parallel Consumer (PC) to subdivide the single partition.
However, it's recommended to have at least 3 partitions and at least three consumers running in a group, for high availability. You can again use PC to process all these partitions, subdivided by key, in parallel.
PC directly solves this by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
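The idea of processing a single partition in parallel by key can be illustrated with a plain-Java sketch. This is not how Parallel Consumer is implemented and it does not use its API; it is a hypothetical, simplified model (all names are made up) that routes each key to its own single-threaded lane so that records with the same key keep their order while different keys run concurrently. It also ignores offset tracking and acknowledgement entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class KeyOrderedProcessor {
    // One single-threaded executor per lane keeps per-lane ordering.
    private final List<ExecutorService> lanes = new ArrayList<>();

    KeyOrderedProcessor(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            lanes.add(Executors.newSingleThreadExecutor());
        }
    }

    // The same key always maps to the same lane.
    int laneFor(String key) {
        return Math.floorMod(key.hashCode(), lanes.size());
    }

    // Queue work for a record; records sharing a key run in submission order.
    void submit(String key, Runnable work) {
        lanes.get(laneFor(key)).submit(work);
    }

    // Stop accepting work and wait for queued work to finish.
    void shutdownAndWait() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                lane.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        KeyOrderedProcessor processor = new KeyOrderedProcessor(4);
        // Hypothetical records read from one partition: {key, value}.
        String[][] records = {{"AAPL", "1"}, {"MSFT", "1"}, {"AAPL", "2"}, {"MSFT", "2"}};
        for (String[] r : records) {
            processor.submit(r[0], () -> System.out.println(r[0] + " -> " + r[1]));
        }
        processor.shutdownAndWait();
    }
}
```

The interleaving between different keys is not deterministic, but for any one key the values are always printed in the order they were submitted.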