Kafka Streams processors - state store and input topic partitioning - apache-kafka

I would like to fully understand the rules that kafka-streams processors must obey with respect to partitioning of a processor's input and its state(s). Specifically I would like to understand:
Whether or not it is possible and what are the potential consequences of using a key for the state store(s) that is not the same as the key of the input topic
Whether or not state store keys are shared across partitions, i.e. whether or not I will get the same value if I try to access the same key in a processor while it is processing records belonging to two different partitions
I have been doing some research on this and the answers I found seem not to be very clear and sometimes contradictory: e.g. this one seems to suggest that the stores are totally independent and you can use any key while this one says that you should never use a store with a different key than the one in the input topic.
Thanks for any clarification.

You have to distinguish between input topic partitions and store shards/changelog topic partitions to get the complete picture. It also depends on whether you use the DSL or the Processor API, because the DSL does some auto-repartitioning while the Processor API does not. Because the DSL compiles down to the Processor API, I'll start with the latter.
If you have a topic with, say, 4 partitions and you create a stateful processor that consumes this topic, you will get 4 tasks, each task running a processor instance that maintains one shard of the store. Note that the overall state is split into 4 shards and each shard is basically isolated from the others.
From the Processor API runtime's point of view, an input topic partition and its state store shard (including the corresponding changelog topic partition) form one unit of parallelism. Hence, the changelog topic for the store is created with 4 partitions, and changelog-topic-partition-X is mapped to input-topic-partition-X. Note that Kafka Streams does not use hash-based partitioning when writing into a changelog topic, but provides the partition number explicitly, to ensure that "processor instance X", which processes input-topic-partition-X, only reads/writes from/into changelog-topic-partition-X.
Thus, the runtime itself is agnostic to keys, if you will.
If your input topic is not partitioned by key, messages with the same key will be processed by different tasks. Depending on the program, this might be fine (e.g., filtering) or not (e.g., count per key).
The same applies to state: you can put any key into a state store, but this key is "local" to the corresponding shard. Other tasks will never see this key. Thus, if you use the same key in a store on different tasks, the entries are completely independent of each other (as if they were two different keys).
Using the Processor API, it's your responsibility to partition the input data correctly and to use stores correctly, depending on the operator semantics you need.
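To make that concrete, here is a minimal sketch of a Processor API processor that counts records per key in its local store shard (the store name "counts-store" and the processor class name are made up for illustration; the store must be added and connected to the processor in the topology separately). Because each task owns its own shard, the same key counted on two different partitions produces two independent counts:

    import org.apache.kafka.streams.processor.Processor;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;

    // Hypothetical example: counts records per key in this task's local shard.
    public class CountProcessor implements Processor<String, String> {

        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            // "counts-store" must be registered in the topology and connected to this processor
            store = (KeyValueStore<String, Long>) context.getStateStore("counts-store");
        }

        @Override
        public void process(String key, String value) {
            Long current = store.get(key);          // only sees this task's shard
            store.put(key, current == null ? 1L : current + 1L);
        }

        @Override
        public void close() {}
    }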
At the DSL level, Kafka Streams makes sure that data is partitioned correctly to ensure correct operator semantics. First, it assumes that input topics are partitioned by key. If the key is modified, for example via selectKey(), and a downstream operator is an aggregation, Kafka Streams repartitions the data first to ensure that records with the same key end up in the same topic partition. This guarantees that each key is used in a single store shard; the DSL will always partition the data such that one key is never processed on different shards.
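For illustration, a minimal DSL sketch of the selectKey() + aggregation case (the topic name, the Order value type, and getCustomerId() are hypothetical): because the key changes before the count, Kafka Streams inserts a repartition topic, so all records with the same customer id land in the same partition and therefore in the same store shard.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, Order> orders = builder.stream("orders");   // assumed topic and value type

    // selectKey() changes the key; the downstream count() triggers auto-repartitioning
    KTable<String, Long> ordersPerCustomer = orders
            .selectKey((oldKey, order) -> order.getCustomerId())
            .groupByKey()
            .count();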

Related

In Kafka Streams, how do you parallelize complex operations (or sub-topologies) using multiple topics and partitions?

I am currently trying to understand how Kafka Streams achieves parallelism. My main concern boils down to three questions:
Can multiple sub-topologies read from the same partition?
How can you parallelise a complex operation (making up a sub-topology) that uses the processor API and requires reading the entire topic?
Can multiple sub-topologies read from the same topic (such that independent and expensive operations on the same topic can be run in different sub-topologies)?
As the developer, we don't have direct control over how topologies are divided into sub-topologies. Kafka Streams divides the Topology into multiple sub-topologies using topics as a "bridge" where possible. Additionally, multiple stream tasks are created, each reading a subset of data from the input topic, divided by partition. The documentation reads:
Slightly simplified, the maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by maximum number of partitions of the input topic(s) the application is reading from.
Assume there was a sub-topology that reads multiple input topics whose numbers of partitions are not identical. If the above excerpt of the documentation is to be believed, then one or more partitions of the topic that has fewer partitions would need to be assigned to multiple stream tasks (if both topics need to be read for the logic to work). However, this should not be possible, because, as I understand it, multiple instances of the streams application (each sharing the same application id) act as one consumer group, where each partition is only assigned once. In such a case, the number of tasks created for a sub-topology should actually be limited by the minimum number of partitions of its input topics, i.e. a single partition is only assigned to one task.
I am not sure if the initial problem, i.e. a non-co-partitioned sub-topology, would actually occur. If there is an operation that requires reading both input topics, the data would probably need to be co-partitioned (like in joins).
Say there was an expensive operation between two topics (possibly built from multiple custom processors) that requires the data of one topic to always be available in its entirety. You would want to parallelise this operation into multiple tasks.
If the topic had just one partition, and a partition could be read multiple times, this would not be a problem. However, as discussed earlier, I don't believe this to work.
Then there are GlobalKTables. However, there is no way to use GlobalKTables with custom processors (toStream is not available).
Another idea would be to broadcast the data into multiple partitions, essentially duplicating it by the partition count. This way, multiple stream tasks could be created for the topology to read the same data. To do this, a custom partitioner could be specified in the Produced instance given to KStream#to. If this data duplication can be accepted, this seems to be the only way to achieve what I have in mind.
Regarding question number three, because the Streams application is one consumer group, I would also expect this not to be possible. With my current understanding, this would require writing the data into multiple identical topics (again essentially duplicating the data) so that independent sub-topologies can be created. An alternative would be to run separate streaming applications (so that a different consumer group is used).
Without seeing your topology definition, this is a somewhat vague question. You can have repartition and changelog topics; these contain data duplicated from the original input topic.
But stateless operators like map, filter, etc. pass data through from the same (assigned) partitions for each thread.
A "sub topology" is still part of only one application.id, thus one consumer group, so no, it cannot read the same topic partitions more than once. For that, you'd need independent streams/tables via branching operations within the whole topology, for example, filtering numbers by even and odd only consumes the topic once; you don't need to "broadcast" records to all partitions, and I'm not sure that's even possible out of the box (to sends one-to-one, and Produced defines serialization, not multiple partitions). If you need to cross reference different operators, then you can use join / statestores / KTables.
None of this is really related to parallelism. You have num.stream.threads, or you can run multiple instances of the same JVM process to scale.

Ordering guarantees in Kafka in case of multiple threads

As far as I understand, both the Kafka Producer and the Consumer have to use a single thread per topic partition if we want to write / read records in order. Am I right, or do they use multiple threads in such situations?
Ordering can be achieved in Kafka in both single-threaded and multithreaded environments:
Single broker / single partition -> single-threaded consumer model
The ordering of messages in Kafka works well within a single partition. But with a single partition, parallelism and load balancing are difficult to achieve. Note that in this case only one thread accesses the topic partition, so ordering is always guaranteed.
Multiple brokers / multiple partitions -> multithreaded consumer model (consumer groups holding more than one consumer)
In this case, we assume that the topic has multiple partitions and each partition is handled by a single consumer (more precisely, a single thread) in each consumer group, which can fairly be called multithreading.
There are three methods in which we can retain the order of messages within partitions in Kafka. Each method has its own pros and cons.
Method 1: Round Robin or Spraying
Method 2: Hashing Key Partition
Method 3: Custom Partitioner
Round Robin or Spraying (Default)
In this method, the partitioner sends messages to all the partitions in a round-robin fashion, ensuring a balanced server load; no single partition gets overloaded. This achieves parallelism and load balancing, but it fails to maintain the overall ordering, although the order within each partition is still maintained. This is the default method, and it is not suitable for some business scenarios.
In order to overcome the above scenarios and to maintain message ordering, let’s try another approach.
Hashing Key Partition
In this method, we create a ProducerRecord and specify a message key with each message passed to the topic, to ensure per-partition ordering.
The default partitioner uses the hash of the key to ensure that all messages for the same key go to the same partition. This is the easiest and most common approach. It is the same method used for Hive bucketing as well: it uses a modulo operation on the hash.
Hash(Key) % Number of partitions -> Partition number
We can say that the key lets the producer always send a message to the same partition, maintaining the order. The drawback of this method is that hashing maps each key to a fixed partition, so heavily used keys can overload a single partition.
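As a small sketch (the topic name, key, and serializer settings are assumptions), setting a key on the ProducerRecord is all that is needed for the default partitioner to hash it; all records for "customer-42" then end up in the same partition and keep their relative order:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    // The key "customer-42" is hashed, so all its records go to one partition
    producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
    producer.close();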
Custom Partitioner
We can write our own business logic to decide which message should be sent to which partition. With this approach, we can order messages according to our business logic and achieve parallelism at the same time.
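A minimal sketch of such a partitioner (the "priority" routing rule is made up for illustration; any logic works as long as it maps each record to a valid partition number). It is plugged in via the partitioner.class producer config:

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import java.util.Map;

    // Hypothetical rule: "priority" keys always go to partition 0,
    // everything else is spread by hash of the key bytes.
    public class PriorityPartitioner implements Partitioner {

        @Override
        public void configure(Map<String, ?> configs) {}

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return 0;                       // no key: arbitrary fallback for this sketch
            }
            if ("priority".equals(key)) {
                return 0;
            }
            return (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }

        @Override
        public void close() {}
    }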
For more details, you can check the link below:
https://medium.com/latentview-data-services/how-to-use-apache-kafka-to-guarantee-message-ordering-ac2d00da6c22
Also, please note that this information describes partition-level parallelism.
There is also a newer parallelism strategy called consumer-level parallelism. I have not given it a thorough read, but you can find details at the Confluent link below:
https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/

Kafka local state store of multiple partitions

I am using the Kafka Processor API and I create a state store from a topic with 3 partitions (I have 3 brokers), and I have 1 instance of the stream application. I would like to know: when I access the local state store, can I get all keys? Why do certain keys work but others don't? Is this normal?
Thank you
The number of application instances does not matter for this case. Because the input topic has 3 partitions, the state store is created with 3 shards, and processing happens with 3 parallel tasks. Each task instantiates its own copy of your topology, processes one input topic partition, and uses one shard.
Compare: https://kafka.apache.org/21/documentation/streams/architecture
If you want to access different shards, you can use the "Interactive Queries" feature for key/value lookups (and key-range queries) over all shards.
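A minimal sketch of such a lookup (the store name "counts-store" is assumed, and streams is the running KafkaStreams instance; with multiple application instances you would additionally need to discover which instance hosts a given key):

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    // Read-only view over all store shards hosted by this instance
    ReadOnlyKeyValueStore<String, Long> view = streams.store(
            StoreQueryParameters.fromNameAndType("counts-store", QueryableStoreTypes.keyValueStore()));

    Long count = view.get("some-key");   // key/value lookup across the local shards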
Also, there is the notion of a global state store, which loads data from all partitions into a single store (no sharding). However, it provides different semantics compared to "regular" stores, because store updates are not time-synchronized with the other processing.

SnappyData table definitions using partition keys

Reading through the documentation (http://snappydatainc.github.io/snappydata/streamingWithSQL/), I had a question about this item:
"Reduced shuffling through co-partitioning: With SnappyData, the partitioning key used by the input queue (e.g., for Kafka sources), the stream processor and the underlying store can all be the same. This dramatically reduces the need to shuffle records."
If we are using Kafka and partition our data in a topic using a key (a single value), is it possible to map this single key from Kafka to the multiple partition keys identified in the Snappy table?
Is there a hash of some sort to turn multiple keys into a single key?
The benefit of reduced shuffling seems significant and trying to understand the best practice here.
thanks!
With a DirectKafka stream, each partition pulls the data from its own designated topic partition. If no partitioning is specified for the storage table, then each DirectKafka partition will put only into local storage buckets, and everything will line up well without requiring anything extra. The only thing to take care of is having enough topic partitions for good concurrency -- ideally at least as many as the total number of processor cores in the cluster, so all cores are kept busy.
When partitioning storage tables explicitly, SnappyData's store has been adjusted to use the same hashing as Spark's HashPartitioning (for the "PARTITION_BY" option of both column and row tables), since that is the one used at the Catalyst SQL execution layer. So execution and storage are always collocated.
However, aligning that with ingestion from DirectKafka partitions requires some manual work (align the Kafka topic partitioning with HashPartitioning, then have the preferred locations for each DirectKafka partition match the storage). This will be simplified in coming releases.

kafka topics and partitions decisions

I need to understand something about kafka:
When I have a single Kafka broker on a single host, is there any sense in having more than one partition for a topic? I mean, even if my data can be distinguished by some key (say, tenant id), what is the benefit of doing this on a single Kafka broker? Does this give any parallelism, and if so, how?
When a key is used, does this mean that each key is mapped to a given partition? Does the number of partitions for a topic have to equal the number of possible values for the key I specified? Or is this just a hash, so the number of partitions doesn't have to be equal?
From what I read, topics are created based on the types of messages to be placed in Kafka. But in my case, I created 2 topics because I have 2 types of consumption: one for reading messages one by one, and the second for when a bulk of messages comes into the queue (for application reasons), in which case it is written to the second topic. Is that a good design even though the message type is the same? Is there any other practice for such a scenario?
Yes, it definitely makes sense to have more than one partition for a topic even when you have a single Kafka broker. A scenario where you can benefit from this is pretty simple:
you need to guarantee in-order processing by tenant id
processing logic for each message is rather complex and takes some time, especially when the Kafka message itself is simple but the logic behind processing it takes time (a simple example: the message is a URL, and the processing logic downloads the file from there and does some processing)
Given these 2 conditions, you may get into a situation where one consumer is not able to keep up with processing all the messages if all the data goes to a single partition. Remember, a partition can be processed by exactly one consumer (well, you could use 2 consumers in different consumer groups, but that's not your case), so you'll start falling behind over time. But if you have more than one partition, you'll either be able to use one consumer and process data in parallel (this can speed things up in some cases) or just add more consumers.
By default, Kafka uses hash-based partitioning. This is configurable by providing a custom Partitioner, for example you can use random partitioning if you don't care what partition your message ends up in.
It's totally up to you what purposes you create topics for.
UPD, answers to questions in the comment:
Adding more consumers is usually done for adding more computing power, not for achieving desired parallelism. To add parallelism add partitions. Most consumer implementations process different partitions on different threads, so if you have enough computing power, you might just have a single consumer processing multiple partitions in parallel. Then, if you start bumping into situations where one consumer is not enough, you just add more consumers.
When you create a topic, you just specify the number of partitions (and the replication factor for this topic, but that's a different thing). The key and the target partition are completely up to the producer. In fact, you could configure your producer to use a random partitioner and it won't even care about keys; it will just pick a partition randomly. There's no fixed relation between key -> partition; it's just convenient to benefit from having things set up like this.
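For illustration, a small sketch of pointing the producer at a different partitioner via configuration (RoundRobinPartitioner ships with newer Kafka clients; any class implementing the Partitioner interface can be set the same way):

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.RoundRobinPartitioner;
    import java.util.Properties;

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // Ignore keys for placement and spread records round-robin across partitions
    props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RoundRobinPartitioner.class.getName());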
Can you elaborate on this one? I'm not sure I understand it, but I guess your question is whether you can send just a value and Kafka will infer a key somehow by itself. If so, then the answer is no - Kafka does not apply any transformation to messages and stores them as is, so if you want your message to contain a key, the producer must explicitly send the key.