What purpose do tasks in Kafka streams API serve - apache-kafka

I am trying to understand the architecture of Kafka streams API and came across this in the documentation:
An application's processor topology is scaled by breaking it into multiple tasks
What are the criteria for breaking up the processor topology into tasks? Is it just the number of partitions in the stream/topic, or something more?
Tasks can then instantiate their own processor topology based on the assigned partitions
Can someone explain what the above means with an example? If the tasks are created only with the purpose of scaling, shouldn't they all have the same topology?

Tasks are atomic parallel units of processing.
A topology is divided into sub-topologies (sub-topologies are "connected components" that forward data in-memory; different sub-topologies are connected via topics). For each sub-topology the number of input topic partitions determines the number of tasks that are created. If there are multiple input topics, the maximum number of partitions over all topics determines the number of tasks.
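The rule above can be sketched in plain Java. This is a simplified model rather than Kafka Streams code: the class name, the topic names and the partition counts are made up for illustration. The only convention borrowed from Kafka Streams is the task id format `<subTopologyId>_<partition>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model, not Kafka Streams code: derives the tasks of one sub-topology
// from the partition counts of its input topics.
class TaskCountSketch {

    // Number of tasks = maximum partition count over the sub-topology's input topics.
    static int taskCount(Map<String, Integer> partitionsPerInputTopic) {
        int max = 0;
        for (int partitions : partitionsPerInputTopic.values()) {
            max = Math.max(max, partitions);
        }
        return max;
    }

    // Task ids follow the "<subTopologyId>_<partition>" convention, e.g. "0_3".
    static List<String> taskIds(int subTopologyId, Map<String, Integer> partitionsPerInputTopic) {
        List<String> ids = new ArrayList<>();
        int count = taskCount(partitionsPerInputTopic);
        for (int p = 0; p < count; p++) {
            ids.add(subTopologyId + "_" + p);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical input topics: "orders" has 4 partitions, "customers" has 2.
        Map<String, Integer> inputs = Map.of("orders", 4, "customers", 2);
        System.out.println(taskIds(0, inputs)); // prints [0_0, 0_1, 0_2, 0_3]
    }
}
```

In this model, tasks 0_2 and 0_3 would only receive data from "orders", because "customers" has no partitions 2 and 3.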
If you want to know the sub-topologies of your Kafka Streams application, you can call Topology#describe(): the returned TopologyDescription can either be just printed via toString() or one can traverse sub-topologies and their corresponding DAGs.
A Kafka Streams application has one topology that may have one or more sub-topologies. You can find a topology with 2 sub-topologies in the article Data Reprocessing with the Streams API in Kafka: Resetting a Streams Application.

Related

Kafka Stream partition distribution

In Kafka Streams, stream tasks are distributed across instances in a multi-instance deployment (and hence partitions are distributed too). On the other hand, one of the differences between KTable and GlobalKTable is that a KTable distributes its partitions across instances (from Mastering Kafka Streams and ksqlDB).
What I can't understand is whether the distribution is caused by the KTable, by the stream tasks, or by both (and if both, how).
What happens if we have a KTable in our topology and multiple stream tasks (a source processor on a multi-partition topic) in a multi-instance environment?
Not sure if I fully understand the question. Maybe the docs help to shed some light: https://docs.confluent.io/platform/current/streams/architecture.html
A KTable (and KStream) is a logical abstraction. When you call streamsBuilder.build() it will be compiled into a Topology with Processors. Connected Processors (that may have a state store attached) are grouped into sub-topologies, and sub-topologies are executed by tasks (based on the number of partitions).
For streamsBuilder.table("topic"), the compiled Topology is:
topic -> source -> processor(state)
For each topic partition, a task will be created processing one partition, and thus the KTable is implicitly partitioned, too.
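A minimal sketch of what "implicitly partitioned" means here, as a plain-Java model rather than Kafka Streams code. The class and method names are hypothetical, and `partitionFor` is only a stand-in for the producer's partitioner (Kafka's default hashes the serialized key with murmur2, not `String.hashCode`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of "topic -> source -> processor(state)" running as one task per partition.
class PartitionedTableSketch {

    // Stand-in for the producer's partitioner; an assumption made for illustration only.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // One state store per task; task i only ever sees records from partition i.
    static List<Map<String, String>> materialize(List<String[]> records, int numPartitions) {
        List<Map<String, String>> stores = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            stores.add(new HashMap<>());
        }
        for (String[] kv : records) {
            // Upsert: a later value for the same key replaces the earlier one, as in a table.
            stores.get(partitionFor(kv[0], numPartitions)).put(kv[0], kv[1]);
        }
        return stores;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[] {"alice", "v1"},
            new String[] {"bob", "v1"},
            new String[] {"alice", "v2"});
        List<Map<String, String>> stores = materialize(records, 3);
        for (int task = 0; task < stores.size(); task++) {
            System.out.println("task 0_" + task + " store: " + stores.get(task));
        }
    }
}
```

Each store holds only the keys of its own partition, so the table as a whole is the union of the per-task stores.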

Consuming from multiple topics using single kafka stream

Which one is recommended:
1. A single Kafka stream consuming from multiple topics
2. Different Kafka streams consuming from different topics (I've used this one already with no issues encountered)
Is it possible to achieve #1? If yes, what are the implications?
And if I use the 'EXACTLY_ONCE' setting, what kind of complexities will it bring?
Kafka version: 2.2.0-cp2
Is it possible to achieve #1 (Single kafka stream consuming from multiple topics)
Yes, you can use StreamsBuilder#stream(Collection<String> topics)
If the data that you want to process is spread across multiple topics and these topics together constitute one single source, then you can use this. All of the topics then feed the same downstream processors, so this is not the right choice if each topic needs its own, independent processing logic.
It is like one consumer subscribing to all these topics: when you call poll() it returns ConsumerRecords from all the subscribed topics, not just one topic. Work is still split into tasks by partition, and the number of threads is controlled by num.stream.threads.
In Kafka Streams, there is a term called Topology, which is basically a directed acyclic graph of sources, processors and sinks. A topology can contain sub-topologies.
Sub-topologies can then be executed as independent stream tasks through parallel threads.
Each sub-topology has its own source, which can be a topic, so if you want these topics to be processed independently of each other, you have to break up your graph into sub-topologies.
If I use 'EXACTLY_ONCE' settings, what kind of complexities it'll bring?
When messages reach a sink processor in a topology, the offsets of their source must be committed, where a source can be a single topic or a collection of topics.
Whether it is multiple topics or one topic, the producer has to send the offsets to the transaction. These offsets are basically a Map<TopicPartition, OffsetAndMetadata> that is committed together with the produced messages.
So, I think it should not introduce any complexities whether it is single topic having 10 partitions or 10 topics with 1 partition each, because offset is at the TopicPartition level and not at topic level.
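To make the last point concrete, here is a small plain-Java model of the offsets map. It is not Kafka client code: the real API uses a Map<TopicPartition, OffsetAndMetadata>, while this sketch uses "topic-partition" strings and Long offsets as stand-ins, and the topic names are invented.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified model: offsets committed with a transaction are keyed by topic-partition,
// so the size of the map depends on the total number of partitions, not on the number of topics.
class OffsetMapSketch {

    // Builds "topic-partition" -> offset for `topics` topics with `partitions` partitions each.
    static Map<String, Long> offsetsToCommit(int topics, int partitions, long offset) {
        Map<String, Long> offsets = new TreeMap<>();
        for (int t = 0; t < topics; t++) {
            for (int p = 0; p < partitions; p++) {
                offsets.put("topic" + t + "-" + p, offset);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(offsetsToCommit(1, 10, 42L).size()); // one topic, ten partitions: prints 10
        System.out.println(offsetsToCommit(10, 1, 42L).size()); // ten topics, one partition each: prints 10
    }
}
```

Both layouts produce a map with ten entries, which is why the transaction does not care how the partitions are grouped into topics.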

What is the correlation between Kafka stream/table, GlobalKTable, brokers and partitions?

I am studying Kafka Streams, KTable, GlobalKTable, etc., and I am confused about a few things.
What exactly is a GlobalKTable?
More generally, if I have a topic with N partitions and one Kafka stream, how many streams (partitions?) will I have after I send some data to the topic?
I ran some tests and noticed that the match is 1:1. But what if the topic is replicated across different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application. But a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it is used for more static data, typically as a lookup table in joins.
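The difference can be sketched as a plain-Java model of which keys each instance can see locally. Nothing here is Kafka Streams API: the class, the keys and the partition assignment are invented for illustration, and `partitionFor` is a stand-in for the real partitioner.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model: which keys an application instance holds locally for each table type.
class TableVisibilitySketch {

    // Stand-in for the producer's partitioner; an assumption made for illustration only.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // KTable: an instance only holds keys from the partitions assigned to it.
    static Set<String> kTableKeys(List<String> keys, int numPartitions, Set<Integer> assignedPartitions) {
        Set<String> local = new TreeSet<>();
        for (String key : keys) {
            if (assignedPartitions.contains(partitionFor(key, numPartitions))) {
                local.add(key);
            }
        }
        return local;
    }

    // GlobalKTable: every instance reads all partitions, so it holds every key.
    static Set<String> globalKTableKeys(List<String> keys) {
        return new TreeSet<>(keys);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("alice", "bob", "carol", "dave");
        System.out.println("instance A, KTable:       " + kTableKeys(keys, 4, Set.of(0, 1)));
        System.out.println("instance B, KTable:       " + kTableKeys(keys, 4, Set.of(2, 3)));
        System.out.println("any instance, GlobalKTable: " + globalKTableKeys(keys));
    }
}
```

The two KTable views never overlap and together cover all keys, while the GlobalKTable view is the full set on every instance.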
As for a topic with N-partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each application would process half of the number of partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions: the workload is split across all running instances with the same application-id.
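The four-partition example can be modelled with a simple round-robin assignment. This is only a sketch: the real assignment is done by Kafka Streams during a consumer group rebalance and also takes state and stickiness into account, and the instance names below are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model: spreads the partitions of one topic over the running instances round-robin.
class AssignmentSketch {

    static Map<String, List<Integer>> assign(int numPartitions, List<String> instances) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(instances.get(p % instances.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One instance owns all four partitions.
        System.out.println(assign(4, List.of("instance-1")));
        // prints {instance-1=[0, 1, 2, 3]}

        // Two instances with the same application-id own two partitions each.
        System.out.println(assign(4, List.of("instance-1", "instance-2")));
        // prints {instance-1=[0, 2], instance-2=[1, 3]}
    }
}
```

With more instances than partitions, the extra instances would get an empty list in this model, which matches the fact that the partition count caps the useful parallelism.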
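Whether a topic is replicated across different brokers depends on its replication factor. The broker-side default (default.replication.factor) is 1, but 3 is the commonly recommended value for production. A replication factor of 3 means the records for a given partition are stored on the leader broker for that partition and on two follower brokers (which requires at least a three-node broker cluster). Replication does not change the number of partitions, so it does not affect how the work is split across your application instances.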
Hope this clears things up some.
-Bill

How does Kafka message processing scale in publish-subscribe mode?

All, forgive me, I am a beginner with Kafka. I was reading the Kafka documentation about the difference between traditional messaging systems like ActiveMQ and Kafka.
As the documentation puts it, traditional messaging systems cannot scale message processing, since:
Publish-subscribe allows you broadcast data to multiple processes, but
has no way of scaling processing since every message goes to every
subscriber.
This makes sense to me.
But for Kafka, the documentation says it can scale message processing even in publish-subscribe mode. (Please correct me if I am wrong. Thanks.)
The consumer group concept in Kafka generalizes these two concepts. As
with a queue the consumer group allows you to divide up processing
over a collection of processes (the members of the consumer group). As
with publish-subscribe, Kafka allows you to broadcast messages to
multiple consumer groups.
The advantage of Kafka's model is that every topic has both these
properties—it can scale processing and is also multi-subscriber—there
is no need to choose one or the other.
So my question is: how does Kafka achieve this, i.e. scaling the processing in publish-subscribe mode? Thanks.
The main unique features in Kafka that enable scalable pub/sub are:
1. Partitioning individual topics and spreading the active partitions across multiple brokers in the cluster to take advantage of more machines, disks, and cache memory. Producers and consumers often connect to many or all nodes in the cluster, not just a single master node for a given topic/queue.
2. Storing all messages in a sequential commit log and not deleting them when consumed. This leads to more sequential reads and writes, and relieves the broker of keeping track of different copies of messages, deleting individual messages, handling fragmentation, and tracking which consumer has acknowledged which messages.
3. Enabling smart parallel processing of individual consumers and consumer groups in a way that each parallel message stream can come from the distributed partitions mentioned in #1, while offloading the offset management and partition assignment logic onto the clients themselves. Kafka scales with more consumers because the consumers do some of the work (unlike most other pub/sub brokers, where the bulk of the work is done in the broker).
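How consumer groups combine broadcasting with scaling can be shown with a small plain-Java model. It is not Kafka client code: the group names, sizes and message counts are invented, and partitions are handed out round-robin inside each group purely for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified model: every consumer group receives every message (publish-subscribe),
// while inside a group each partition is read by exactly one member (queue-like scaling).
class ConsumerGroupSketch {

    // Returns "group/consumer-n" -> number of messages that consumer receives.
    static Map<String, Integer> deliver(int numPartitions, int messagesPerPartition,
                                        Map<String, Integer> groupSizes) {
        Map<String, Integer> received = new TreeMap<>();
        for (Map.Entry<String, Integer> group : groupSizes.entrySet()) {
            for (int p = 0; p < numPartitions; p++) {
                String consumer = group.getKey() + "/consumer-" + (p % group.getValue());
                received.merge(consumer, messagesPerPartition, Integer::sum);
            }
        }
        return received;
    }

    public static void main(String[] args) {
        // Hypothetical topic with 4 partitions and 100 messages per partition.
        System.out.println(deliver(4, 100, Map.of("billing", 2, "audit", 1)));
        // prints {audit/consumer-0=400, billing/consumer-0=200, billing/consumer-1=200}
    }
}
```

Both groups see all 400 messages, but the "billing" group splits them between its two members, which is the scaling part.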

Kafka consumer reading messages parallel

Can we have multiple consumers consume from a topic to achieve parallel processing in Kafka?
My use case is to read messages from a single partition in parallel.
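Yes, you can process messages in parallel using many Kafka consumers, but no, it's not possible if you only have one partition.
Parallelism in Kafka consumption is bounded by the number of partitions: within a consumer group, each partition is read by at most one consumer. You can add partitions to an existing topic, but note that this changes which partition a given key maps to, and the partition count cannot be reduced afterwards.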
Below is an example of how to process messages in parallel using rapids-kafka-client, a library that makes parallel consumption from Kafka easier.
    public static void main(String[] args){
        ConsumerConfig.<String, String>builder()
            .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
            .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
            .prop(GROUP_ID_CONFIG, "stocks")
            .topics("stock_changed")
            .consumers(7)
            .callback((ctx, record) -> {
                System.out.printf("status=consumed, value=%s%n", record.value());
            })
            .build()
            .consume()
            .waitFor();
    }
Simply put, consumers cannot achieve parallelism within a single partition by default.
But you can try Akka Streams Kafka (Reactive Kafka). Have a look through its docs.
The number of partitions defines the level of parallelism for reading from a Kafka topic. But reading is (more or less) only restricted by your network capacity.
A good pattern is to separate reading and processing of messages (one thread per topic-partition for reading and multiple threads for processing these messages).
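A minimal sketch of that pattern using only the JDK. The list of strings stands in for the messages polled from one partition, and upper-casing stands in for the real processing; both are assumptions for illustration. Note that with this hand-off the processing order is no longer guaranteed, and a real consumer would have to commit offsets only for messages that have finished processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified model: one sequential "reader" hands messages to a pool of processing threads.
class ReadProcessSplitSketch {

    static List<String> run(List<String> partitionMessages, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>();

        // The "reader": cheap and in order, as a single consumer thread would be.
        for (String message : partitionMessages) {
            // The "processing": runs on the pool, so completion order is not guaranteed.
            pool.execute(() -> {
                processed.add(message.toUpperCase());
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return new ArrayList<>(processed);
    }

    public static void main(String[] args) {
        // All four messages are processed; their order in the result may vary between runs.
        System.out.println(run(List.of("a", "b", "c", "d"), 3));
    }
}
```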
You need multiple partitions to do this, or something like Parallel Consumer (PC) to subdivide the single partition.
However, it's recommended to have at least three partitions and at least three consumers running in a group for high availability. You can again use PC to process all these partitions, subdivided by key, in parallel.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
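The idea of processing by key in parallel while keeping per-key order can be sketched with the JDK alone. This is not how Parallel Consumer is implemented and it does not use its API; it is a hypothetical illustration in which each key is pinned to one single-threaded "lane", and the record keys and values are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified model: records of one partition are spread over lanes by key.
// Different keys run in parallel; records with the same key stay in order.
class KeyOrderedSketch {

    static Map<String, List<Integer>> process(List<Map.Entry<String, Integer>> records, int lanes) {
        ExecutorService[] executors = new ExecutorService[lanes];
        for (int i = 0; i < lanes; i++) {
            executors[i] = Executors.newSingleThreadExecutor();
        }
        Map<String, List<Integer>> seen = new ConcurrentHashMap<>();

        for (Map.Entry<String, Integer> record : records) {
            // The same key always maps to the same single-threaded lane, which preserves its order.
            int lane = Math.floorMod(record.getKey().hashCode(), lanes);
            executors[lane].execute(() ->
                seen.computeIfAbsent(record.getKey(), k -> new CopyOnWriteArrayList<>())
                    .add(record.getValue()));
        }

        for (ExecutorService executor : executors) {
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
            Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 2), Map.entry("a", 3));
        System.out.println(process(records, 4).get("a")); // prints [1, 2, 3]
    }
}
```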