Is it possible to specify Kafka Streams topology starting sequence?

Let's say I have Topology A that streams from Source A to Stream A, and Topology B which streams from Source B to Stream B (used as Table B).
Then I have a stream/table join that joins Stream A and Table B.
As expected, the join only triggers when something arrives in Stream A and there's a correlating record in Table B.
I have an architecture where the source topics are still populated while the Kafka Streams application is DEAD, and messages always arrive in Source B before Source A.
I am finding that when I restart Kafka Streams (by redeploying the app), the topology that streams records to Stream A can start BEFORE the topology that streams records to Table B.
And as a result, the join won't trigger.
I know this is probably the expected behaviour; there's no coordination between separate topologies.
I was wondering if there is a mechanism, a delay or something, that can ORDER/sequence the start of the topologies?
Once they are up, they are fine, as I can ensure the messages arrive in the right order.

I think you want to try setting max.task.idle.ms to something greater than the default (0), maybe 30 seconds? It's tough to give a precise answer, so you'll have to experiment some.
HTH,
Bill
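
For illustration, a minimal sketch of raising that setting in the Streams configuration (the application id, broker address, and 30-second value are placeholders, not from the original answer):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class IdleConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-app");          // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder brokers
        // Wait up to 30 seconds for data on all input partitions before processing,
        // giving the KTable side a chance to catch up after a restart.
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 30_000L);
        return props;
    }
}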

If you need to trigger a downstream result from both sides of the join, you have to do a KTable-to-KTable join. From the javadoc:
"The join is computed by (1) updating the internal state of one KTable and (2) performing a lookup for a matching record in the current (i.e., processing time) internal state of the other KTable. This happens in a symmetric way, i.e., for each update of either this or the other input KTable the result gets updated."
EDIT: Even if you do a stream-to-KTable join that triggers only when a new stream event arrives on the left side of the join (KTable updates do not emit downstream events), when you start the topology Streams will try to do timestamp synchronisation using the timestamps of the input events, so there should not be a race condition between the rate of consumption of the KTable source and the stream topic. BUT, my understanding is that this works on a best-effort basis, e.g. if two events have exactly the same timestamp then Streams cannot deduce which should be processed first.
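
For reference, a rough sketch of the KTable-to-KTable variant in the Streams DSL (topic names, value types, and the joiner are placeholders; default serdes are assumed):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class TableTableJoinSketch {
    public static void build(StreamsBuilder builder) {
        // Read both topics as tables so an update on either side re-triggers the join.
        KTable<String, String> tableA = builder.table("source-a"); // placeholder topic
        KTable<String, String> tableB = builder.table("source-b"); // placeholder topic

        // Symmetric join: the result is updated on updates from either side.
        KTable<String, String> joined = tableA.join(tableB, (a, b) -> a + "|" + b);

        joined.toStream().to("joined-output"); // placeholder output topic
    }
}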

Related

Kafka Streams join stops processing after a few days

We have a Kafka Streams topology involving an inner join between two streams that are derived from the same topic. This works fine on new data as it comes in and when reprocessing smallish samples. The problem is that if we reprocess the entire input topic (21M messages), processing stops in the joining service after a few days, resulting in far fewer output messages than expected.
Some ASCII art to illustrate the topology:
Input -> A --------- C -> output
      \--- B ---/
Given that A produces a message A1, this is consumed by B which produces a message B1. When C gets message B1, it joins it to A1 and processes the two to produce a new message C1.
Some extra info that might be relevant:
A is very fast - it can process 70K input messages a second
B and C are very slow - they each process about 10 messages a second
Corresponding messages on A and B have the same timestamp and the same key, and the topics have the same number of partitions
It’s a KStream-KStream inner join with a time window of 10 seconds (a rough sketch of the wiring follows this list)
All three are Scala components using org.apache.kafka.kafka-streams-scala 2.6.0, configured with processing-guarantee = "at_least_once" and auto-offset-reset-config = "earliest"
This is a three-node MSK cluster (Kafka 2.2.1); the services run on EKS
No errors occur in any of the components
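
To make the join concrete, here is a rough sketch of that wiring (topic names and value types are placeholders; the real services use the Scala DSL, but the shape is the same):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class WindowedJoinSketch {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> a = builder.stream("topic-a"); // placeholder topics, default serdes assumed
        KStream<String, String> b = builder.stream("topic-b");

        // Inner join: only pairs whose timestamps lie within 10 seconds of each other are emitted.
        KStream<String, String> c =
                b.join(a, (bVal, aVal) -> bVal + "|" + aVal, JoinWindows.of(Duration.ofSeconds(10)));

        c.to("topic-c"); // placeholder output topic
    }
}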
When the problem has occurred, we observe that C’s lag on A is 0 - ie C has consumed all the old messages from A, even though B is still producing new messages based on A messages that should trigger joins. At that point, all the messages from A are present on C’s underlying temporary join topic.
Once B has caught up, new messages that appear on A get processed immediately by B and then C. Since processing resumes when more messages appear, it seems that C hasn’t crashed, it just thinks it has no work to do because it’s prematurely reached the end of A’s topic.
Mitigation we’ve tried:
To avoid retention-related problems, all the topics, including C’s internal KSTREAM-JOIN* topics, are set to compact with infinite retention (retention.bytes and retention.ms set to -1)
We originally joined A to B, but actually, C is only interested if B has produced a message so now we join B to A. The behaviour is the same in both cases.
The simplest solution might be to include in B all the data C needs from A instead of doing a join. But as we use Kafka Streams a great deal, we’d like to understand what’s going on here and how to configure it better. I wondered if a larger time window would help but my understanding is that the window is based on the message timestamp, and the timestamps on the corresponding messages in all the topics are all identical.
Does this sound like a familiar problem anyone can shed light on?

Kafka streams: join on ingestion time

I have two topics with fairly different volumes (something like 1,000 events emitted to the left topic for every event in the right topic).
I'm trying to leftJoin those two topics and I have the impression that the join window is computed over processing time rather than ingestion time, causing the smaller stream to "run out" way too soon.
Is it possible to specify the time semantics of a stream-stream join to ingestion time (or event time)?
I could see why it's not such an easy thing to use ingestion time but it seems to be a necessity when processing historical streams.
A Kafka Streams join is based on event time, i.e., whatever the TimestampExtractor returns (by default, the message timestamp as stored in the topic), and you cannot modify this (you can only use a different timestamp extractor to indirectly modify the result).
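
As an illustration, a minimal sketch of swapping in a different extractor via the Streams configuration (the application id and broker address are placeholders; WallclockTimestampExtractor is just the built-in example, and you could plug in your own TimestampExtractor implementation instead):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.WallclockTimestampExtractor;

public class ExtractorConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-app");          // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder brokers
        // Replace the default extractor; this indirectly changes the time semantics the join sees.
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class);
        return props;
    }
}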
Note though that the join is executed "eagerly", and thus for all left-side records the lookup into the right stream is done immediately, which can lead to additional <key, (left-value, null)> results.
It also depends on the processing order, which follows event time in a best-effort manner. The guarantees were improved in the 2.3 release, and using the config parameter max.task.idle.ms might help to mitigate the issue.
It's on the mid-term roadmap to improve left- and outer-join behavior to avoid those additional result records. As Kafka is an open-source project, if you would like to pick it up, please help to get this fixed sooner :)
The only other alternative would be to implement a custom join operator via the Processor API.

kafka streams + how to expire entries in a state store asynchronously

I have a Kafka Streams topology that reads from an input topic, updates some state, and determines whether the state entry needs to remain in the state store or can be deleted. If it can be deleted it is removed right away; otherwise I have a punctuator that runs every 10s and expires items from the state store.
I recently found out that the punctuators run on the same stream thread and can potentially block processing of the stream.
What are some patterns I can use to execute the logic inside the punctuator in a separate thread pool to avoid blocking stream processing?
Appreciate your help.
Matthias J. Sax has already said that this is not possible with state stores so far, and as he works at Confluent, I believe that's the latest news.
However, what we did in our case was to use a KStream-KTable join instead of a state store. I'm not sure if that's possible in your case, but let me explain what we did; maybe it's of some use for you as well:
We have two topics, A and B. Topic A is consumed with a KStream, Topic B with a KTable. We transform the KTable data so we can join it to the KStream for Topic A. We perform the join and our operations, then "delete" the data from Topic B by writing a null value with the original key back to Topic B, using map and through. So when we get another record in Topic A, there is no longer a value in our KTable to join with (exactly what we wanted).
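
A rough sketch of that pattern (topic names, value types, and the joiner are placeholders, and it uses mapValues/to rather than the exact map/through calls from our code):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class JoinAndTombstoneSketch {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> streamA = builder.stream("topic-a"); // placeholder topics, default serdes assumed
        KTable<String, String> tableB = builder.table("topic-b");

        // Join each A record against the current B state.
        KStream<String, String> joined = streamA.join(tableB, (a, b) -> a + "|" + b);

        // "Delete" the consumed B entry by writing a tombstone (null value) back to topic B
        // with the original key, so the next A record for that key finds nothing to join with.
        joined.mapValues(v -> (String) null).to("topic-b");

        joined.to("output-topic"); // placeholder downstream topic
    }
}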
I hope it helps.

Kafka Streams TimestampExtractor

Hi everybody, I have a question about TimestampExtractor and Kafka Streams...
In our application there is a possibility of receiving out-of-order events, so I'd like to order the events by a business date inside the payload instead of the point in time they were placed in the topic.
For this purpose I programmed a custom TimestampExtractor to pull the timestamp from the payload. Everything I've described so far worked perfectly, but when I built the KTable from this topic, I noticed that an event I receive out-of-order (from a business point of view it is not the last event, but it arrived last) is displayed as the last state of the object, even though the ConsumerRecord carries the timestamp from the payload.
I don't know, maybe it was my mistake to assume Kafka Streams would fix this out-of-order problem via the TimestampExtractor.
Then during debugging I saw that if the TimestampExtractor returns -1, Kafka Streams ignores the message, and the TimestampExtractor also receives the timestamp of the last accepted event. So I built logic that performs the following check: if (payloadTimestamp < previousTimestamp) return -1. This achieves what I want, but I am not sure whether I am sailing in dangerous waters or not.
Am I allowed to use logic like this, or what other ways exist to deal with out-of-order events in Kafka Streams?
Thx for answers..
Currently (Kafka 2.0), KTables don't consider timestamps when they are updated, because the assumption is that there is no out-of-order data in the input topic. The reason for this assumption is the "single writer principle": it's assumed that for a compacted KTable input topic there is only one producer per key, and thus there won't be any out-of-order data with regard to single keys.
It's a known issue: https://issues.apache.org/jira/browse/KAFKA-6521
For your fix: it's not 100% correct or safe to do this "hack":
First, assume you have two different messages with two different keys, <key1, value1, 5> and <key2, value2, 3>. The second record, with timestamp 3, is later compared to the first record, with timestamp 5. However, they have different keys, and thus you actually want to put the second record into the KTable. Only if you have two records with the same key do you want to drop late-arriving data, IMHO.
Second, if you have two records with the same key, the second one is out-of-order, and you crash before processing the second one, the TimestampExtractor loses the timestamp of the first record. Thus, on restart, it would not discard the out-of-order record.
To get this right, you will need to filter "manually" in your application logic instead of in the stateless and key-agnostic TimestampExtractor. Instead of reading the data via builder#table(), you can read it as a stream and apply a .groupByKey().reduce() to build the KTable. In your Reducer logic, you compare the timestamps of the new and old records and return the record with the larger timestamp.
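
A minimal sketch of that reduce-based approach, assuming a hypothetical MyEvent payload type that exposes the business timestamp (serdes for it are assumed to be configured as defaults):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class LatestByPayloadTimestampSketch {

    // Hypothetical payload type; replace with your real event class.
    public interface MyEvent {
        long getBusinessTimestamp();
    }

    public static KTable<String, MyEvent> build(StreamsBuilder builder) {
        return builder.<String, MyEvent>stream("events-topic") // placeholder topic
                .groupByKey()
                .reduce((oldValue, newValue) ->
                        newValue.getBusinessTimestamp() >= oldValue.getBusinessTimestamp()
                                ? newValue   // keep the newer business state
                                : oldValue); // drop the out-of-order update for this key
    }
}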

Processing records in order in Storm

I'm new to Storm and I'm having problems figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users which have fulfilled the path I specified (for example, users that went from location A to location B to location C).
I'm using a Kafka producer and reading these records from a file to simulate live data. The data is sorted by date.
So, to check if my pattern is fulfilled I need to process records in order. The thing is, due to parallelization (bolt replication) I don't get a user's check-ins in order, and because of that the patterns won't work.
How to overcome this problem? How to process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing like Apache Flink (disclaimer: I am a committer at Flink), or you need to take care of it in your bolt code yourself.
The only support Storm delivers is via Trident. You can put tuples of a certain time period (for example one minute) into a single batch, and thus process all tuples within a minute at once. However, this only works if your use case allows for it, because you cannot relate tuples from different batches to each other. In your case, this would only work if you know that there are points in time at which all users have reached their destination (and no other user started a new interaction); i.e., you need points in time at which no two users overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system, i.e., customized user-code-based solution, there would be two approaches:
You could, for example, buffer up tuples and sort on timestamp within a bolt before processing. To make this work properly, you need to inject punctuations/watermarks that ensure that no tuple with a larger timestamp than the punctuation comes after the punctuation. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
Another way would be to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits tuples in order. Furthermore, to avoid blocking (i.e., if no input is available for a substream), punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
Storm supports this use case. For this, you just have to ensure that order is maintained throughout your flow in all the involved components. So, as a first step, in the Kafka producer, all messages for a particular user_id should go to the same partition in Kafka. For this you can implement a custom Partitioner in your KafkaProducer; please refer to the link here for implementation details.
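
Since that link isn't included here, a rough sketch of what such a custom Partitioner could look like (it assumes the user_id is sent as a non-null record key); it would be registered on the producer via the partitioner.class config:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class UserIdPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Hash the user_id key so the same user always lands on the same partition,
        // preserving per-user ordering inside Kafka.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }
}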
Since a partition in Kafka can be read by one and only one KafkaSpout instance in Storm, the messages in that partition arrive in order at that spout instance, thereby ensuring that all messages with the same user_id arrive at the same spout.
Now comes the tricky part: to maintain order in the bolt, you want to ensure that you use fields grouping on the bolt based on a "user_id" field emitted from the Kafka spout. The provided KafkaSpout does not break the message up into fields, so you would have to override the KafkaSpout to read the message and emit a "user_id" field from the spout. One way of doing so is to have an intermediate bolt which reads the message from the KafkaSpout and emits a stream with a "user_id" field.
When you finally specify a bolt with fields grouping on "user_id", all messages with a particular user_id value will go to the same instance of that bolt, whatever the degree of parallelism of the bolt.
A sample topology that would work for your case could be as follows:
builder.setSpout("KafkaSpout", kafkaSpout);
builder.setBolt("FieldsEmitterBolt", fieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", calculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); // user_id field emitted by FieldsEmitterBolt
Beware: there could be a case where all the user_id values go to the same CalculatorBolt instance if you have a limited number of user_ids. This in turn would decrease the effective 'parallelism'!
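
For completeness, a hedged sketch of what the intermediate FieldsEmitterBolt could look like (the "str" input field name and the comma-separated message format are assumptions; adjust them to your actual spout scheme and payload):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FieldsEmitterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Assumed format: "user_id,location_id,time_of_checking" in a field named "str".
        String message = tuple.getStringByField("str");
        String[] parts = message.split(",");
        // Re-emit with an explicit user_id field so the downstream bolt can be wired
        // with fields grouping on "user_id".
        collector.emit(new Values(parts[0], message));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user_id", "message"));
    }
}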