How Kafka Streams handles distributed data

I have tried to go through various tutorials but am not clear on two aspects of Kafka Streams.
Let's take the word count example mentioned in:
https://docs.confluent.io/current/streams/quickstart.html
// Serializers/deserializers (serde) for String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Construct a `KStream` from the input topic "streams-plaintext-input", where message values
// represent lines of text (for the sake of this example, we ignore whatever may be stored
// in the message keys).
KStream<String, String> textLines = builder.stream("streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

KTable<String, Long> wordCounts = textLines
    // Split each text line, by whitespace, into words. The text lines are the message
    // values, i.e. we can ignore whatever data is in the message keys and thus invoke
    // `flatMapValues` instead of the more generic `flatMap`.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    // We use `groupBy` to ensure the words are available as message keys
    .groupBy((key, value) -> value)
    // Count the occurrences of each word (message key).
    .count();

// Convert the `KTable<String, Long>` into a `KStream<String, Long>` and write to the output topic.
wordCounts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));
Couple of questions here:
1.) Since there are no keys in the original stream, occurrences of the same word can end up at two different nodes, as they might fall into different partitions, and hence the true count would be the aggregation from both of them. That does not seem to be done here? Do different nodes serving the same topic's partitions coordinate here to aggregate the count?
2.) As a new stream is generated by each operation (e.g. flatMapValues, groupBy, etc.), are the partitions recalculated for the messages in these substreams so that they end up on different nodes?
Will appreciate any help here!

1.) Since there are no keys in the original stream, occurrences of the same word can end up at two different nodes, as they might fall into different partitions, and hence the true count would be the aggregation from both of them. That does not seem to be done here?
It is done here. This is the relevant code:
// We use `groupBy` to ensure the words are available as message keys
.groupBy((key, value) -> value)
Here, "words" become the new message key, which means that words are re-partitioned so that each word is put into one partition only.
Do different nodes serving the same topic's partitions coordinate here to aggregate the count?
No, they don't. A partition is processed by one node only (more precisely: by one stream task only, see below).
2.) As a new stream is generated by each operation (e.g. flatMapValues, groupBy, etc.), are the partitions recalculated for the messages in these substreams so that they end up on different nodes?
Not sure I understand your question, notably the "recalculated" comment. Operations (like aggregations) are always performed per partition, and Kafka Streams maps partitions to stream tasks (slightly simplified: a partition is always processed by one and only one stream task). Stream tasks are executed by the various instances of your Kafka Streams application, which typically run on different containers/VMs/machines. If need be, data will be re-partitioned (see question #1 and the answer above) for an operation to produce the expected result -- perhaps that's what you mean by "recalculated".
I'd suggest reading Kafka's documentation, such as https://kafka.apache.org/documentation/streams/architecture#streams_architecture_tasks.
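Not part of the original answer, but a minimal, self-contained sketch of how you can see this re-partitioning for yourself: building the word-count topology and printing Topology#describe() reveals the internal repartition topic that Kafka Streams inserts between groupBy() and count() (the exact topic name below is only indicative).
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;

public class DescribeWordCount {
    public static void main(String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("streams-plaintext-input",
                       Consumed.with(Serdes.String(), Serdes.String()))
               .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
               .groupBy((key, word) -> word) // new key => marked for re-partitioning
               .count();
        // Prints the processor topology; look for an internal topic ending in
        // "-repartition", e.g. "KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition".
        System.out.println(builder.build().describe());
    }
}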

Related

How to understand Kafka streams?

I am following the Kafka Streams documentation and I am getting confused about some concepts which I want to clarify here.
https://kafka.apache.org/23/documentation/streams/developer-guide/dsl-api.html
The documentation says that flatMap takes one record and produces zero, one, or more records, and that you can modify the record keys and values too. It also says that flatMap marks the data for re-partitioning.
Questions:
1) What does re-partitioning mean? Will it re-partition data for the new topic where I am going to write the transformed results, or will it re-partition data in the same topic from which I started streaming?
2) If the old topic's data is getting re-partitioned, does that mean the transformed results are written to that topic too?
For example:
KStream<Long, String> stream = ...;
KStream<String, Integer> transformed = stream.flatMap(
    // Here, we generate two output records for each input record.
    // We also change the key and value types.
    // Example: (345L, "Hello") -> ("HELLO", 1000), ("hello", 9000)
    (key, value) -> {
        List<KeyValue<String, Integer>> result = new LinkedList<>();
        result.add(KeyValue.pair(value.toUpperCase(), 1000));
        result.add(KeyValue.pair(value.toLowerCase(), 9000));
        return result;
    }
);
In this example, one input record generates two output records. Does this mean that the topic from which I started streaming will now have three records: one with key 345L and the two new ones? If I write the transformed result to a new topic or a store, what would the state of the old and new topics be? Would both contain all three records? I am a novice.
This is a transformed result, so when you read from a topic, you don't change the source topic. However, when you write to another topic, your output sink topic will have the two output records.
When it says it marks the stream for repartitioning, it marks the result for repartitioning, and the repartitioning happens on the way to the sink topic. It doesn't repartition the source topic. Think about why:
If you're continuously reading from the source topic, would it continuously repartition the source topic? That's not a practical option.
I hope this clarifies your question.
Re-partitioning in Kafka Streams means that the records are sent to an intermediate topic before a processor, and then the processor reads the records from the intermediate topic. By sending the records to an intermediate topic, the records are re-partitioned.
This is needed, for example, with join processors. A join processor in Kafka Streams requires that all records with the same key are processed by the same task to ensure correctness. This would not be guaranteed if an upstream processor could have modified the keys of the records, as flatMap() can in your example. Besides joins, aggregations also require that all records with the same key are processed by the same task. Re-partitioning does not write anything to the input or output topic of your streams application, and you should usually not need to care about intermediate topics.
However, what you can do is avoid re-partitioning where possible by using the *Values() operators, like flatMapValues(), when you do not change the key of the records. If you use flatMap() and do not change the keys of the records, the records will nevertheless be re-partitioned, although it would not be needed: Kafka Streams cannot know that you did not touch the key unless you use flatMapValues().
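To make the contrast concrete, here is a minimal sketch (the topic name and value logic are placeholders, not from the question): the flatMapValues() variant cannot change the key and is therefore not marked for re-partitioning, while the equivalent flatMap() is marked even though this particular lambda never touches the key.
KStream<Long, String> source = builder.stream("input-topic");

// Not marked for re-partitioning: flatMapValues() leaves the key untouched.
KStream<Long, String> valuesOnly = source.flatMapValues(value ->
    Arrays.asList(value.toUpperCase(), value.toLowerCase()));

// Marked for re-partitioning: flatMap() may change the key, and Kafka Streams
// cannot tell that this lambda keeps it, so a downstream join or aggregation
// would read the records back from an intermediate repartition topic.
KStream<Long, String> fullFlatMap = source.flatMap((key, value) ->
    Arrays.asList(KeyValue.pair(key, value.toUpperCase()),
                  KeyValue.pair(key, value.toLowerCase())));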

Kafka Streams: pipe one topic into another

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another with a different name. This topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions of the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
    .stream<Any, Any>(inputTopic)
    .to(outputTopic)
This works, except for the partitioning (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic, keeping the same partition as the input record?
I checked the Processor API, which allows accessing the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.
Produced (that is one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
    .to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(),
        (topicName, key, value, numberOfPartitions) ->
            calculatePartition(topicName, key, value, numberOfPartitions)));
In the above code only ByteArray Serdes are used, so no special serialization or deserialization happens.
Firstly, messages are distributed among partitions based on the key. Messages with the same key always go to the same partition.
So if your messages have keys, you don't need to worry about it at all. As long as you have the same number of partitions as the original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, you should consider using the original topic instead. Kafka has the notion of consumer groups.
For example, you have a topic 'transactions'; then you can have consumer groups, e.g. 'credit card processor', 'mortgage payment processor', 'apple pay processor' and so on. Consumer groups read the same topic, filter out the events that are meaningful to them, and process them.
You could also create 3 topics and achieve the same result, though it's not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.

How to merge multiple Kafka streams in order to do session windowing over all events of the resulting stream

We have multiple input topics with different business events (page views, clicks, scroll events, etc.). As far as I understand Kafka Streams, they all get an event timestamp, which can be used for KStream joins with other streams or tables to align the times.
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
This should be possible by using groupByKey and then aggregate/reduce (specifying the inactivity time here) on a stream containing all events. This combined stream must have all events from the different input topics in order of their event time (or in a way that the above Kafka Streams methods honor these event times).
The only challenge left is to create this combined/merged stream.
When I look at the Kafka Streams API, there is the KStreamBuilder#merge operation, for which the javadoc says: "There is no ordering guarantee for records from different {@link KStream}s." Does this mean the session windowing will produce incorrect results?
If yes, what is the alternative to #merge?
I was also thinking about joining, but in fact it seems to depend on whether you have one event per topic per ID, or potentially multiple events with the same ID within one input topic. For the first case, joining is a good strategy, but not for the latter, as you would get some unnecessary duplication:
stream A: <a,1> <a,2>
stream B: <a,3>
join-output plus session: <a,1-3 + 2-3>
Event number 3 (<a,3>) would be a duplicate.
Also keep in mind that joining slightly modifies the timestamps, and thus your session windows might be different if you apply them to the join result rather than to the raw data.
About merge() and ordering: you can use merge() safely, as the session windows will be built based on record timestamps and not offset order. All window operations in Kafka Streams can handle out-of-order data gracefully.
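For illustration, here is a minimal sketch of that merge-then-session-window approach (the topic names and the 30-minute inactivity gap are assumptions, not from the question, and it uses the newer KStream#merge() and SessionWindows.with(Duration) APIs):
KStream<String, String> pageViews = builder.stream("page-views");
KStream<String, String> clicks = builder.stream("clicks");

// merge() interleaves both streams without re-keying; since both input topics
// are assumed to be keyed by user id, groupByKey() needs no re-partitioning.
KTable<Windowed<String>, Long> eventsPerSession = pageViews
    .merge(clicks)
    .groupByKey()
    .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
    .count();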
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
From what I understand, you'd need to join the streams (and use groupBy to ensure that they can be properly joined by user id), not merge them. You can then follow up with a session-windowed aggregation.

Kafka Streams: one record to multiple records

Given: I have two topics in Kafka, let's say topic A and topic B. Kafka Streams reads a record from topic A, processes it, and produces multiple records (let's say recordA and recordB) corresponding to the consumed record. Now, the question is how I can achieve this using Kafka Streams.
KStream<String, List<Message>> producerStreams[] = recordStream.mapValues(new ValueMapper<Message, List<Message>>() {
    @Override
    public List<Message> apply(final Message message) {
        return consumerRecordHandler.process(message);
    }
}).*someFunction*()
Here, the record read is a Message; after processing, it returns a list of Message. How can I divide this list into two producer streams? Any help will be appreciated.
I am not sure if I understand the question correctly, and I also don't understand the answer from @Abhishek :(
If you have an input stream and you want to get zero, one, or more output records per input record, you would apply a flatMap() or flatMapValues() (depending on whether you want to modify the key or not).
You are also asking "How can I divide this list into two producer streams?" If you mean to split one stream into multiple, you can use branch() (see the sketch below).
For more details, I refer to the docs:
https://docs.confluent.io/platform/current/streams/developer-guide/dsl-api.html#stateless-transformations
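As an illustration of branch() (the predicates, the hypothetical Message#getType() accessor, and the topic names below are placeholders, not from the question): it splits one stream into an array of sub-streams, one per predicate, and each record goes to the first predicate it matches.
KStream<String, Message>[] branches = recordStream.branch(
    (key, message) -> "A".equals(message.getType()), // hypothetical accessor
    (key, message) -> true                           // catch-all for the rest
);
branches[0].to("topic-a");
branches[1].to("topic-b");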
What's your key (type)? I am guessing it's not String. After executing the mapValues you'll have this: KStream<K, List<Message>>. If K is not String, then someFunction() can be a map() which converts K into a String (if it is String, you already have the result) and leaves the List<Message> (the value) untouched, since that's your intended end result.
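A minimal sketch of that suggestion (keeping the question's recordStream and consumerRecordHandler, with a hypothetical non-String key type K): map() converts the key to a String and passes the List<Message> value through unchanged.
KStream<String, List<Message>> producerStream = recordStream
    .mapValues(message -> consumerRecordHandler.process(message))
    .map((key, messages) -> KeyValue.pair(key.toString(), messages));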

Kafka Streams reduceByKey vs. leftJoin

At first glance it seems to me that with KStream#reduceByKey one can achieve the same functionality as with a KStream-to-KTable leftJoin, i.e. combining records with the same key. What is the difference between the two, also in terms of performance?
Short answer: (What is the difference between the two?)
reduceByKey is applied to a single input stream while leftJoin combines two streams/tables.
Long answer:
If I understand your question correctly, it seems that your incoming KTable changelog stream would be empty, and you want to compute a new join result (i.e., update the result KTable) for each incoming KStream record? The result KTable of a join is not available as a materialized view; only the changelog topic will be sent downstream. Thus, your input KTable would always be empty, and your input KStream records would always join with "nothing" (because of the left join), which would not really update the result KTable. You could also just do a KStream#map() -- there is no state you can exploit if your input KTable does not provide a state.
In contrast, if you use reduceByKey, the result KTable is available as a materialized view, and thus for each KStream input record, the previous result value is available to be updated.
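For illustration, a minimal sketch of that reduce path in the current DSL, where the old KStream#reduceByKey has been split into groupByKey() followed by reduce() (topic names are placeholders):
KStream<String, Long> input = builder.stream("counts-input");

// For each incoming record, the previous aggregate for its key is read from
// the materialized state store, combined with the new value, and written back.
KTable<String, Long> totals = input
    .groupByKey()
    .reduce((oldValue, newValue) -> oldValue + newValue);

totals.toStream().to("counts-output");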
Thus, both operations are fundamentally different. If you have a single input KStream, using a join (which requires two inputs) would be quite odd, as there is no KTable...
KStream represents a record stream in which each record is self-contained. For example, if we are to summarize word occurrences, it would hold the count during a certain frame (e.g. a time window or a paragraph).
KTable represents a sort of state; each record coming in would normally hold the total occurrence count.
Therefore, the use cases for the two methods are quite different. While KStream#reduceByKey reduces all records with the same key and summarizes the counts per key, KTable#leftJoin would normally be used in cases where the total count needs to be adjusted according to other incoming information, or to combine more data with the record.
The example given in Kafka Streams' documentation is log compaction: while with a KStream no record can be discarded, in a KTable records that are no longer relevant are removed.
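To round out the contrast, a minimal sketch of the typical KStream-KTable leftJoin use case described above: enriching each stream record with reference data for its key (topic names are placeholders):
KStream<String, Long> counts = builder.stream("word-counts");
KTable<String, String> descriptions = builder.table("word-descriptions");

// For each count record, look up the current table value for the same key;
// with a left join, the joiner also runs when no table entry exists (null).
KStream<String, String> enriched = counts.leftJoin(descriptions,
    (count, description) ->
        count + " (" + (description != null ? description : "n/a") + ")");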