Kafka Streams reduceByKey vs. leftJoin


At first glance it seems to me that with KStream#reduceByKey one can achieve the same functionality as with a KStream-to-KTable leftJoin, i.e., combining records with the same key. What is the difference between the two, also in terms of performance?

Short answer: (What is the difference between the two?)
reduceByKey is applied to a single input stream while leftJoin combines two streams/tables.
Long answer:
If I understand your question correctly, your incoming KTable changelog stream would be empty, and you want to compute a new join result (i.e., update the result KTable) for each incoming KStream record? The result KTable of a join is not available as a materialized view; only its changelog is sent downstream. Thus, your input KTable would always be empty, and each input KStream record would always join with "nothing" (because of the left join), which would not really update the result KTable. You could just as well do a KStream#map(): there is no state you can exploit if your input KTable does not provide one.
In contrast, if you use reduceByKey, the result KTable is available as a materialized view, and thus for each KStream input record, the previous result value is available to be updated.
Thus, the two operations are fundamentally different. If you have a single input KStream, using a join (which requires two inputs) would be quite odd, as there is no KTable...
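To make the contrast concrete, here is a minimal plain-Java model of the two operations. It deliberately does not use the Kafka Streams API; the class and method names are hypothetical, and records are modelled as simple key/value string pairs. The reduce keeps the previous result per key as state, while the left join against an empty table finds nothing to combine with.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of the two operations; this is NOT the Kafka Streams API.
class ReduceVsLeftJoinSketch {

    // Models a reduce by key: the previous result for a key is kept as state
    // and combined with each new record value for that key.
    static Map<String, Integer> reduceByKey(List<String[]> records, BinaryOperator<Integer> reducer) {
        Map<String, Integer> state = new HashMap<>();
        for (String[] r : records) {
            state.merge(r[0], Integer.parseInt(r[1]), reducer);
        }
        return state;
    }

    // Models a stream-table left join: each stream record looks up the table.
    // The join itself never writes to the table it reads from.
    static List<String> leftJoin(List<String[]> records, Map<String, String> table) {
        List<String> out = new ArrayList<>();
        for (String[] r : records) {
            out.add(r[0] + "=" + r[1] + "/" + table.get(r[0])); // "null" when there is no match
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[] {"a", "1"}, new String[] {"a", "2"}, new String[] {"b", "5"});
        System.out.println(reduceByKey(records, Integer::sum));
        System.out.println(leftJoin(records, new HashMap<>()));
    }
}
```

In this model the reduce accumulates across records with the same key, whereas every left-join result against the empty table carries no table-side value, which is why the join cannot stand in for the reduce.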

A KStream represents a record stream in which each record is self-contained. For example, if we are summarizing word occurrences, each record would hold the count during a certain frame (e.g., a time window or a paragraph).
A KTable represents a kind of state: each incoming record would normally hold the total occurrence count.
Therefore, the use cases for the two methods are quite different. While KStream#reduceByKey reduces all records with the same key and summarizes the counts for each key, KTable#leftJoin would normally be used when the total count needs to be adjusted according to other incoming information, or when more data needs to be combined into the record.
The example given in the Kafka Streams documentation is log compaction. While in a KStream no record can be discarded, in a KTable records that are no longer relevant are removed.
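As a rough illustration of the table interpretation, the following plain-Java sketch (not the Kafka Streams API; the class name and sample data are made up) reads the same sequence of records as a changelog: later values replace earlier ones for the same key, and a null value removes the key. Read as a record stream, all four records would be kept.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of changelog (table) semantics; NOT the Kafka Streams API.
class StreamVsTableSketch {

    // Each record is an upsert for its key; a null value deletes the key.
    static Map<String, String> tableView(List<String[]> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] r : records) {
            if (r[1] == null) {
                table.remove(r[0]);
            } else {
                table.put(r[0], r[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"hello", "1"}, new String[] {"hello", "3"},
            new String[] {"bye", "2"}, new String[] {"bye", null});
        System.out.println("records in stream view: " + records.size());
        System.out.println("table view: " + tableView(records));
    }
}
```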

Related

Is it possible to specify Kafka Stream topology starting sequence

Let's say I have Topology A, which streams from Source A to Stream A, and Topology B, which streams from Source B to Stream B (used as Table B).
Then I have a stream/table join that joins Stream A and Table B.
As expected, the join only triggers when something arrives in Stream A and there's a correlating record in Table B.
I have an architecture where the source topics are still populated while the Kafka Streams application is down, and messages always arrive in Source B before Source A.
I am finding that when I restart Kafka Streams (by redeploying the app), the topology that streams records to Stream A can start BEFORE the topology that streams records to Table B.
And as a result, the join won't trigger.
I know this is probably the expected behaviour, there's no coordination between separate topologies.
I was wondering if there is a mechanism, a delay or something that can ORDER/Sequence the start of the topologies?
Once they are up, they are fine, as I can ensure the message arrives in the right order.

I think you want to try setting the max.task.idle.ms to something greater than the default (0), maybe 30 secs? It's tough to give a precise answer, so you'll have to experiment some.
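A minimal sketch of how that setting could be supplied, assuming the configuration is built as a java.util.Properties object and later handed to the KafkaStreams constructor. The application id and bootstrap address are placeholders, and 30000 ms is only the starting value suggested above, not a recommendation.

```java
import java.util.Properties;

class IdleConfigSketch {

    // Builds a Streams configuration with max.task.idle.ms set explicitly.
    static Properties streamsConfig(long maxTaskIdleMs) {
        Properties props = new Properties();
        props.put("application.id", "join-app");          // placeholder value
        props.put("bootstrap.servers", "localhost:9092"); // placeholder value
        // How long a task may wait for data on its empty input partitions
        // before it goes ahead and processes what it already has buffered.
        props.put("max.task.idle.ms", Long.toString(maxTaskIdleMs));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsConfig(30_000L).getProperty("max.task.idle.ms"));
    }
}
```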
HTH,
Bill

If you need to trigger a downstream result from both sides of the join, you have to do a KTable-to-KTable join. From the javadoc:
"The join is computed by (1) updating the internal state of one KTable and (2) performing a lookup for a matching record in the current (i.e., processing time) internal state of the other KTable. This happens in a symmetric way, i.e., for each update of either this or the other input KTable the result gets updated."
EDIT: Even if you do a stream-to-KTable join, which triggers only when a new stream event arrives on the left side of the join (KTable updates do not emit downstream events), Streams will try to synchronise the inputs by the timestamps of the input events when you start the topology, so there should not be a race condition between the rate of consumption of the KTable source and that of the stream topic. BUT, my understanding is that this happens on a best-effort basis. E.g., if two events have exactly the same timestamp, Streams cannot deduce which should be processed first.
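The tie problem can be seen in a small plain-Java model. This is only a sketch of the idea of picking the next record by timestamp; it is not how Kafka Streams is implemented, and the class, field, and flag names are hypothetical. The flag tableFirstOnTie stands for the arbitrary choice that has to be made when both sides carry the same timestamp.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of timestamp-ordered processing of two inputs.
class TimestampSyncSketch {

    static final class Rec {
        final String key;
        final String value;
        final long ts;

        Rec(String key, String value, long ts) {
            this.key = key;
            this.value = value;
            this.ts = ts;
        }
    }

    // Processes buffered table updates and stream records in timestamp order
    // and returns the join results emitted for the stream records.
    static List<String> run(List<Rec> tableInput, List<Rec> streamInput, boolean tableFirstOnTie) {
        Deque<Rec> tableQ = new ArrayDeque<>(tableInput);
        Deque<Rec> streamQ = new ArrayDeque<>(streamInput);
        Map<String, String> table = new HashMap<>();
        List<String> out = new ArrayList<>();
        while (!streamQ.isEmpty()) {
            Rec t = tableQ.peek();
            Rec s = streamQ.peek();
            boolean takeTable = t != null
                && (t.ts < s.ts || (t.ts == s.ts && tableFirstOnTie));
            if (takeTable) {
                tableQ.poll();
                table.put(t.key, t.value);   // table update, emits nothing
            } else {
                streamQ.poll();
                out.add(s.key + ":" + s.value + "+" + table.get(s.key));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> table = List.of(new Rec("k", "ref", 10L));
        List<Rec> stream = List.of(new Rec("k", "evt", 10L));
        System.out.println(run(table, stream, true));
        System.out.println(run(table, stream, false));
    }
}
```

With equal timestamps the result depends entirely on the tie-breaking flag; with a strictly later stream timestamp the table update is always applied first.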

KStream - KTable Join - Patterns to Update Already Enriched Records when Reference Data on KTable Side Gets an Update

Question on the KStream - KTable joins. Usually this kind of join is used for data enrichment purposes where the KTable provides reference data.
So the question is, when the KTable record gets an update, how do we go about updating the older records that we already processed, enriched and probably stored in some data store?
Are there any patterns that we can follow?
(Please assume KTable - KTable wouldn’t be an option as KStream side would emit large volume of changes)

I tend to think of such joins as enriching the stream of data. In that view, records which have come through the join before an update to the KTable are "correct" at the time.
I can see two options to consider:
First, as a Kafka Streams option, would a KStream-KStream join work?
It sounds like that's the processing semantics that you'd like.
(Incidentally, I really like the docs for showing clear examples of when records are and are not emitted: https://kafka.apache.org/31/documentation/streams/developer-guide/dsl-api.html#kstream-kstream-join)
Second, since it sounds like you may be persisting the streaming data, in this case, it may make sense to do query time enrichment. Creating a view/join over the two tables in the data store may provide a sane alternative to reprocessing data in the database.

Applying KTable enrichment on data prior to filling row

I've got sales messages with timestamps, and several messages belonging to the same sale share the same ID. But only one of them contains a field that I want to store in a KTable to enrich the following messages with the corresponding ID.
I cannot be sure that the message with the necessary field will always be sent first.
Is it possible to do a Join including also the messages prior to populating the KTable (let's say timestamps - 5min)?
(What if your data comes in batches with breaks of x min?)
Thank you!

Not 100% sure if I understand the use case, but it seems you only want to store a message if it contains the corresponding field, but you want to drop the message otherwise. For this case, you could read the data as a KStream and apply a filter before you put the records into a table:
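// Key/value types are assumed to be String; containsField(...) is a hypothetical helper.
KStream<String, String> input = builder.stream("table-topic");
KTable<String, String> table = input
    .filter((key, value) -> value == null || containsField(value)) // keep tombstones
    .toTable();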
Note that you might want to ensure that tombstone messages, i.e., messages with value==null, are not filtered out, to preserve delete semantics (this seems to be use-case dependent).
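The effect of such a filter on the table can be modelled in plain Java, without the Kafka Streams API. In this sketch the field name "region", the JSON-like string values, and the class name are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of "filter, then materialize as a table".
class TombstoneFilterSketch {

    // Keeps a record if it is a tombstone (null value) or carries the wanted field.
    static boolean keep(String value) {
        return value == null || value.contains("\"region\""); // "region" is a hypothetical field
    }

    // Applies one record to the table the way a changelog would: null deletes.
    static void apply(Map<String, String> table, String key, String value) {
        if (!keep(value)) {
            return; // filtered out, the table is left unchanged
        }
        if (value == null) {
            table.remove(key);
        } else {
            table.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        apply(table, "sale-1", "{\"amount\":5}");      // dropped by the filter
        apply(table, "sale-1", "{\"region\":\"EU\"}"); // stored
        apply(table, "sale-1", "{\"amount\":7}");      // dropped, entry survives
        System.out.println(table);
        apply(table, "sale-1", null);                  // tombstone removes the entry
        System.out.println(table);
    }
}
```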

In Kafka Streams, how can I dynamically control when KTable/KTable joins yield results?

I have a KTable-to-KTable join. I create the KTables using .aggregate(). The join yields results to the next stream processor when either side receives a new message. I have a use case where I can receive another message on the left KTable, but the message is a "duplicate". It's not an actual duplicate in the technical sense, but it's a duplicate per my business logic (it contains X, Y and Z fields that have identical values to the previous message).
How can I check the previous aggregate value, compare it to the new value and stop that message from causing the join to yield results?
I also don't want to delete that key from the KTable, because I still need the right-side KTable to continue to join when new 'right side' messages come in.
I want to dynamically control when the join yields results. Is there something in the joiner I can do to check the previous state?

Processing records in order in Storm

I'm new to Storm and I'm having trouble figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users which have fulfilled the path I specified (for example, users that went from location A to location B to location C).
I'm using a Kafka producer and reading these records from a file to simulate live data. The data is sorted by date.
So, to check whether my pattern is fulfilled, I need to process records in order. The thing is, due to parallelization (bolt replication), I don't get a user's check-ins in order. Because of that, the pattern matching won't work.
How to overcome this problem? How to process records in order?

There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing, like Apache Flink (disclaimer: I am a committer at Flink), or you need to take care of it yourself in your bolt code.
The only support Storm delivers is through Trident. You can put the tuples of a certain time period (for example, one minute) into a single batch. Thus, you can process all tuples within a minute at once. However, this only works if your use case allows for it, because you cannot relate tuples from different batches to each other. In your case, this would only work if you know that there are points in time at which all users have reached their destination (and no other user has started a new interaction); i.e., you need points in time at which no two users overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system solution, i.e., one based on custom user code, there would be two approaches:
You could, for example, buffer up tuples and sort them by timestamp within a bolt before processing. To make this work properly, you need to inject punctuations/watermarks that ensure that no tuple with a smaller timestamp than the punctuation arrives after the punctuation. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
Another way would be to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits its tuples in order. Furthermore, to avoid blocking (i.e., if no input is available for a substream), punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
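The second approach can be sketched in a few lines of plain Java. This is a simplified illustration written for this page, not the linked TimestampMerger code and not Storm API; the class name is hypothetical, and bare timestamps stand in for whole tuples. A punctuation could be modelled as an extra timestamp added to a substream's buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Merges per-substream buffers in timestamp order without sorting.
class SubstreamMergeSketch {
    private final List<Deque<Long>> buffers = new ArrayList<>();

    SubstreamMergeSketch(int substreams) {
        for (int i = 0; i < substreams; i++) {
            buffers.add(new ArrayDeque<>());
        }
    }

    // Buffers a timestamp from one substream; each substream must deliver in order.
    void add(int substream, long timestamp) {
        buffers.get(substream).addLast(timestamp);
    }

    // Emits timestamps in global order for as long as every buffer is non-empty.
    // With an empty buffer we have to wait: that substream could still deliver
    // a timestamp smaller than the heads we currently see.
    List<Long> drain() {
        List<Long> out = new ArrayList<>();
        while (true) {
            Deque<Long> min = null;
            for (Deque<Long> b : buffers) {
                if (b.isEmpty()) {
                    return out;
                }
                if (min == null || b.peekFirst() < min.peekFirst()) {
                    min = b;
                }
            }
            if (min == null) {
                return out; // no substreams configured
            }
            out.add(min.pollFirst());
        }
    }

    public static void main(String[] args) {
        SubstreamMergeSketch merger = new SubstreamMergeSketch(2);
        merger.add(0, 1L);
        merger.add(0, 4L);
        merger.add(1, 2L);
        merger.add(1, 3L);
        System.out.println(merger.drain()); // stops once substream 1 runs empty
        merger.add(1, 9L);
        System.out.println(merger.drain());
    }
}
```

The early return on an empty buffer is the blocking behaviour mentioned above: without further input or a punctuation on that substream, nothing more can be emitted safely.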

Storm supports this use case. You just have to ensure that order is maintained throughout your flow, in all the involved components. As a first step, in the Kafka producer, all the messages for a particular user id should go to the same partition in Kafka. For this you can implement a custom Partitioner in your KafkaProducer. Please refer to the link here for implementation details.
Since a partition in Kafka can be read by one and only one KafkaSpout instance in Storm, the messages in that partition arrive in order at that spout instance. This ensures that all messages for the same user id arrive at the same spout.
Now comes the tricky part: to maintain order in the bolt, you want to use fields grouping on the bolt, based on a "user_id" field emitted upstream. The provided KafkaSpout does not split the message into fields, so you would either have to override the KafkaSpout to parse the message and emit a "user_id" field, or add an intermediate bolt that reads the message from the KafkaSpout and emits a stream with a "user_id" field.
When you finally specify a bolt with fields grouping on "user_id", all messages with a particular user_id value go to the same instance of the bolt, whatever the degree of parallelism of the bolt.
A sample topology that would work for your case could look as follows:
builder.setSpout("KafkaSpout", kafkaSpout);
builder.setBolt("FieldsEmitterBolt", fieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", calculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); // user_id field emitted by FieldsEmitterBolt
Beware: if you have a limited number of user_ids, all the user_id values could end up at the same CalculatorBolt instance. This in turn would decrease the effective 'parallelism'!
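The routing idea behind fields grouping, and the skew caveat, can be illustrated with a small plain-Java model. The hash-modulo scheme below is an assumption made for illustration and is not taken from Storm's source; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of routing tuples to bolt instances by a field value.
class FieldsGroupingSketch {

    // Maps a user_id to one of numTasks bolt instances via its hash, so the
    // same user_id is always routed to the same instance.
    static int taskFor(String userId, int numTasks) {
        return Math.floorMod(userId.hashCode(), numTasks);
    }

    // Counts how many bolt instances actually receive work for the given ids.
    static int distinctTasks(List<String> userIds, int numTasks) {
        Set<Integer> used = new HashSet<>();
        for (String id : userIds) {
            used.add(taskFor(id, numTasks));
        }
        return used.size();
    }

    public static void main(String[] args) {
        List<String> ids = List.of("alice", "bob", "alice", "alice");
        System.out.println("instances used: " + distinctTasks(ids, 8) + " of 8");
    }
}
```

With only a handful of distinct user_ids, at most that many instances can ever receive tuples, no matter how high the bolt's parallelism is set.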