Kafka Streams join stops processing after a few days - scala

We have a Kafka Streams topology involving an inner join between two streams that are derived from the same topic. This works fine on new data as it comes in and when reprocessing smallish samples. The problem is that if we reprocess the entire input topic (21M messages), processing stops in the joining service after a few days, resulting in far fewer output messages than expected.
Some ASCII art to illustrate the topology:
Input -> A --------- C -> output
      \--- B ---/
Given that A produces a message A1, this is consumed by B which produces a message B1. When C gets message B1, it joins it to A1 and processes the two to produce a new message C1.
Some extra info that might be relevant:
A is very fast - it can process 70K input messages a second
B and C are very slow - they each process about 10 messages a second
Corresponding messages on A and B have the same timestamp and the same key, and the topics have the same number of partitions
It’s a KStream-KStream inner join with a time window of 10 seconds
All three are Scala components using org.apache.kafka.kafka-streams-scala 2.6.0, configured with processing-guarantee = "at_least_once" and auto-offset-reset-config = "earliest"
This is a three-node MSK cluster (Kafka 2.2.1); the services run on EKS
No errors occur in any of the components
When the problem has occurred, we observe that C's lag on A is 0, i.e. C has consumed all the old messages from A, even though B is still producing new messages based on A messages that should trigger joins. At that point, all the messages from A are present on C's underlying temporary join topic.
Once B has caught up, new messages that appear on A get processed immediately by B and then C. Since processing resumes when more messages appear, it seems that C hasn’t crashed, it just thinks it has no work to do because it’s prematurely reached the end of A’s topic.
Mitigation we’ve tried:
To avoid retention-related problems, all the topics, including C’s internal KSTREAM-JOIN* topics, are set to compact with infinite retention (retention.bytes and retention.ms set to -1)
We originally joined A to B, but actually, C is only interested if B has produced a message so now we join B to A. The behaviour is the same in both cases.
The simplest solution might be to include in B all the data C needs from A instead of doing a join. But as we use Kafka Streams a great deal, we’d like to understand what’s going on here and how to configure it better. I wondered if a larger time window would help but my understanding is that the window is based on the message timestamp, and the timestamps on the corresponding messages in all the topics are all identical.
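For reference, the join in C is essentially the following (a simplified sketch with placeholder topic names, shown in the plain Java DSL rather than the Scala wrapper we actually use):
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> fromB = builder.stream("topic-b");   // placeholder topic names
KStream<String, String> fromA = builder.stream("topic-a");
// Inner KStream-KStream join over a 10-second window; both sides share key and timestamp.
fromB.join(fromA,
        (bValue, aValue) -> combine(bValue, aValue),          // combine() is a placeholder for our joiner
        JoinWindows.of(Duration.ofSeconds(10)))
     .to("topic-c-output");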
Does this sound like a familiar problem anyone can shed some light on?

Related

Is it possible to specify Kafka Stream topology starting sequence

Let's say I have Topology A that streams from Source A to Stream A, and Topology B that streams from Source B to Stream B (used as Table B).
Then I have a stream/table join that joins Stream A and Table B.
As expected, the join only triggers when something arrives in Stream A and there's a correlating record in Table B.
I have an architecture where the source topics are still being populated while the Kafka Streams application is DEAD, and messages always arrive in Source B before Source A.
I am finding that when I restart Kafka Streams (by redeploying the app), the topology that streams data to Stream A can run BEFORE the topology that streams data to Table B.
And as a result, the join won't trigger.
I know this is probably the expected behaviour, there's no coordination between separate topologies.
I was wondering if there is a mechanism, a delay or something that can ORDER/Sequence the start of the topologies?
Once they are up, they are fine, as I can ensure the messages arrive in the right order.
I think you want to try setting the max.task.idle.ms to something greater than the default (0), maybe 30 secs? It's tough to give a precise answer, so you'll have to experiment some.
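For example (the property name is the Streams config constant; the 30-second value and the other settings are just placeholders to start from):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");                 // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder
// Let a task wait up to 30 seconds for data on all of its input partitions
// before processing, instead of the default of 0 ms.
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 30_000L);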
HTH,
Bill
If you need to trigger a downstream result from both sides of the join, you have to do a KTable-to-KTable join. From the javadoc:
"The join is computed by (1) updating the internal state of one KTable and (2) performing a lookup for a matching record in the current (i.e., processing time) internal state of the other KTable. This happens in a symmetric way, i.e., for each update of either this or the other input KTable the result gets updated."
EDIT: Even if you do a stream-to-KTable join that triggers only when a new stream event is emitted on the left side of the join (KTable updates do not emit a downstream event), when you start the topology, Streams will try to do timestamp re-synchronisation using the timestamps of the input events, so there should not be a race condition between the rate of consumption of the KTable source and the stream topic. BUT, my understanding is that this is done on a best-effort basis. E.g. if two events have exactly the same timestamp, Streams cannot deduce which should be processed first.
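A rough sketch of the table-table variant (topic names are placeholders):
StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> left = builder.table("topic-a");
KTable<String, String> right = builder.table("topic-b");
// An update on either side re-evaluates the join and emits a new result downstream.
left.join(right, (a, b) -> a + "|" + b)
    .toStream()
    .to("joined-output");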

Distribute messages on single Kafka topic to specific consumer

Avro-encoded messages on a single Kafka topic, single partitioned. Each of these messages is to be consumed by a specific consumer only. For example, with messages a1, a2, b1 and c1 on this topic and 3 consumers named A, B and C, each consumer would get all the messages, but ultimately A would consume a1 and a2, B would consume b1, and C would consume c1.
I want to know how this is typically solved when using Avro on Kafka:
leave it for the consumers to deserialize the message then some application logic to decide to consume the message or drop the message
use partition logic to make each of the messages to go to a particular partition, then setup each consumer to listen to only a single partition
setup another 3 topics and a tiny kafka-stream application that would do the filtering + routing from main topic to these 3 specific topics
make use of kafka header to inject identifier for downstream consumers to filter
Looks like each of the options has its pros and cons. I want to know if there is a convention that people follow or if there is some other way of solving this.
It depends...
If you only have a single partitioned topic, the only option is to let each consumer read all data and filter client side which data the consumer is interested in. For this case, each consumer would need to use a different group.id to isolate the consumers from each other.
Option 2 is certainly possible, if you can control the input topic you are reading from. You might still have different group.ids for each consumer, as it seems that the consumers represent different applications that should be isolated from each other. The question is still whether this is a good model, because the idea of partitions is to provide horizontal scale-out and data-parallel processing; if each application reads only from one partition, that doesn't align with this model. You also need to know which data goes into which partition on both the producer side and the consumer side to get the mapping right. Hence, it implies a "coordination" between producer and consumer, which seems undesirable.
Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics? This seems to be a good approach in general, as topics are a logical categorization of data; it would be even better to have 3 topics for the different data to begin with! If you cannot have 3 input topics from the start, Option 3 still gives a good conceptual setup; however, it won't provide much performance benefit, because the Kafka Streams application is required to read and write each record once. The saving is that each application would only consume from one topic, so redundant reads are avoided -- if you had, let's say, 100 applications (each interested in only 1/100 of the data), you would cut the load significantly, from a 99x read overhead to a 1x read plus 1x write overhead. For your case you don't really cut down much, as you go from a 2x read overhead to a 1x read + 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
Option 4 seems to be orthogonal, because it seems to answer the question of how the filtering works, and headers can be used for Option 1 and Option 3 to do the actual filtering/branching.
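A minimal sketch of Option 3, assuming (hypothetically) that the target consumer can be derived from the record key:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, byte[]> input = builder.stream("main-topic", Consumed.with(Serdes.String(), Serdes.ByteArray()));
// Route each record to a per-consumer topic; each application then reads only its own topic.
input.filter((k, v) -> k.startsWith("a")).to("topic-for-A");
input.filter((k, v) -> k.startsWith("b")).to("topic-for-B");
input.filter((k, v) -> k.startsWith("c")).to("topic-for-C");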
The data in the topic is just bytes, Avro shouldn't matter.
Since you only have one partition, only one consumer of a group can be actively reading the data.
If you only want to process certain offsets, you must either seek to them manually or skip over messages in your poll loop and commit those offsets.
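For example, a minimal poll loop that filters client side (consumer config is omitted, and the isForThisConsumer() and handle() methods are hypothetical, e.g. keyed on a header or the record key):
consumer.subscribe(Collections.singletonList("main-topic"));
while (true) {
    for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofMillis(500))) {
        if (isForThisConsumer(record)) {     // hypothetical predicate
            handle(record);                  // this consumer's share of the data
        }
        // records meant for other consumers are simply skipped
    }
    consumer.commitSync();                   // offsets advance past skipped records too
}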

Confused about Kafka exactly-once semantics

So I've been reading about Kafka's exactly-once semantics, and I'm a bit confused about how it works.
I understand how the producer avoids sending duplicate messages (in case the ack from the broker fails), but what I don't understand is how exactly-once works in the scenario where the consumer processes the message but then crashes before committing the offset. Won't Kafka retry in that scenario?
Here's what I think you mean:
consumer X sees record Y and "acts" on it, yet does not commit its offset
consumer X crashes (still without committing its offsets)
consumer X boots back up, is re-assigned the same partition (not guaranteed) and eventually sees record Y again
This is totally possible. However, for Kafka exactly-once to "work", all of your side effects (state, output) must also go into the same Kafka cluster. So here's what's going to happen:
1. consumer X starts a transaction
2. consumer X sees record Y, emits some output record Z (as part of the transaction started in 1)
3. consumer X crashes; shortly after, the broker acting as the transaction coordinator "rolls back" (I'm simplifying) the transaction started in 1, meaning no other Kafka consumer will ever see record Z
4. consumer X boots back up, is assigned the same partition(s) as before, starts a new transaction
5. consumer X sees record Y again, emits record Z2 (as part of the transaction started in 4)
6. some time later consumer X commits its offsets (as part of the transaction from 4) and then commits that transaction
7. record Z2 becomes visible to downstream consumers.
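A bare-bones consume-process-produce loop illustrating steps 1-7 (topic names and transform() are placeholders; config is omitted, but the producer needs a transactional.id and downstream consumers need isolation.level=read_committed):
producer.initTransactions();                                  // once, at startup
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    if (records.isEmpty()) continue;
    producer.beginTransaction();                              // step 1
    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
    for (ConsumerRecord<String, String> y : records) {
        // steps 2/5: emit the output record as part of the open transaction
        producer.send(new ProducerRecord<>("output-topic", y.key(), transform(y.value())));
        offsets.put(new TopicPartition(y.topic(), y.partition()), new OffsetAndMetadata(y.offset() + 1));
    }
    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());   // step 6: offsets inside the same transaction
    producer.commitTransaction();                             // step 7: outputs become visible
    // A crash before commitTransaction() leaves the transaction to be aborted (steps 3/4),
    // so read_committed consumers never see the uncommitted outputs.
}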
If you have side effects outside of the same Kafka cluster (say, instead of record Z you insert a row into MySQL), there's no general way to make Kafka exactly-once work for you; you'd need to rely on old-school dedup and idempotence.
Radal explained it well in their answer, regarding exactly-once within an isolated Kafka cluster.
When dealing with an external database (transactional, at least), one easy way to achieve exactly-once is to UPDATE one row (in a database transaction) with your business value AND the partition/offset it comes from. That way, if your consumer crashes before committing to Kafka, you'll be able to get back the last Kafka offset it processed (by using consumer.seek()).
It can add quite a bit of data overhead in your database (keeping offset/partition for all your rows), but you might be able to optimize it a bit.
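A rough sketch of that pattern (consumer config is omitted and the DB helper methods are made up):
// On startup: recover the last processed offset from the database and seek past it.
TopicPartition partition = new TopicPartition("input-topic", 0);
consumer.assign(Collections.singletonList(partition));
long lastProcessed = readLastOffsetFromDb(partition);          // hypothetical DB lookup
consumer.seek(partition, lastProcessed + 1);
while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
        // Update the business row AND the partition/offset in one DB transaction,
        // so the side effect and the progress marker succeed or fail together.
        updateRowWithValueAndOffset(record.value(), record.partition(), record.offset());
    }
}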
Yannick

Kafka consume from 2 topics and take equal number of messages

I've run into a specific requirement and would like to hear people's views and certainly not re-invent the wheel.
I've got 2 Kafka topics - A and B.
A and B would be filled with messages at different ingest rate.
For example: A could be filled with 10K messages first and then followed by B. Or in some cases A and B would be filled with messages at the same time. The ingest process is something we have no control over; it's like a 3rd-party upstream system for us.
I need to pick up the messages from these 2 topics and mix them at equal proportion.
For example: if the configured size is 50, then I should pick up 50 from A and 50 from B (or wait until I have them) and then send them off to another Kafka topic as 100 (with equal proportions of A and B).
I was wondering what's the best way to solve this? Although I was looking at the join semantics of KStreams and KTables, I'm not quite convinced that this is a valid use case for a join (because there's no key in the message that joins these 2 streams or tables).
Can this be done without Kafka Streams? Vanilla Kafka consumer (perhaps with some batching?) Thoughts?
With Spring, create 2 @KafkaListeners, one for A, one for B; set the container ack mode to MANUAL and add the Acknowledgment to the method signature.
In each listener, accumulate records until you get 50 then pause the listener container (so that Kafka won't send any more, but the consumer stays alive).
You might need to set the max.poll.records to 1 to better control consumption.
When you have 50 in each, combine and send.
Commit the offsets by calling acknowledge() on the last Acknowledgment received in A and B.
Resume the containers.
Repeat.
Deferring the offset commits will avoid record loss in the event of a server crash while you are in the accumulating stage.
When you have lots of messages in both topics, you can skip the pause/resume part.
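A rough Spring Kafka sketch of those steps (listener ids, topic names and the drain() helper are made up; assumes the container factory is configured with AckMode.MANUAL):
@Autowired KafkaListenerEndpointRegistry registry;            // used to pause/resume the containers
@Autowired KafkaTemplate<String, String> template;
private final List<ConsumerRecord<String, String>> bufferA = new ArrayList<>();
private final List<ConsumerRecord<String, String>> bufferB = new ArrayList<>();
private volatile Acknowledgment lastAckA, lastAckB;

@KafkaListener(id = "listenerA", topics = "A")
public void onA(ConsumerRecord<String, String> rec, Acknowledgment ack) {
    bufferA.add(rec);
    lastAckA = ack;
    if (bufferA.size() >= 50) {
        registry.getListenerContainer("listenerA").pause();   // stop fetching from A; the consumer stays alive
        maybeFlush();
    }
}
// onB(...) is identical but uses bufferB, lastAckB and "listenerB"

private synchronized void maybeFlush() {
    if (bufferA.size() < 50 || bufferB.size() < 50) {
        return;                                               // still waiting for the slower topic
    }
    // Combine 50 + 50 and forward them to the output topic.
    // drain(...) is a hypothetical helper that removes and returns the first n buffered records.
    Stream.concat(drain(bufferA, 50).stream(), drain(bufferB, 50).stream())
          .forEach(r -> template.send("mixed", r.key(), r.value()));
    lastAckA.acknowledge();                                   // commit offsets only after the batch is sent
    lastAckB.acknowledge();
    registry.getListenerContainer("listenerA").resume();
    registry.getListenerContainer("listenerB").resume();
}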

Understanding max.task.idle.ms in Kafka Stream for a KStream-KTable join

I need help understanding Kafka stream behavior when max.task.idle.ms is used in Kafka 2.2.
I have a KStream-KTable join where the KStream has been re-keyed:
KStream stream1 = builder.stream("topic1", Consumed.with(myTimeExtractor));
KStream stream2 = builder.stream("topic2", Consumed.with(myTimeExtractor));
KTable table = stream1
.groupByKey()
.aggregate(myInitializer, myAggregator, Materialized.as("myStore"));
stream2.selectKey((k,v)->v)
.through("rekeyedTopic")
.join(table, myValueJoiner)
.to("enrichedTopic");
All topics have 10 partitions and for testing, I've set max.task.idle.ms to 2 minutes. myTimeExtractor updates the event time of messages only if they are labelled "snapshot": Each snapshot message in stream1 gets its event time set to some constant T, messages in stream2 get their event time set to T+1.
There are 200 messages present in each of topic1 and in topic2 when I call KafkaStreams#start, all labelled "snapshot" and no message is added thereafter. I can see that within a second or so both myStore and rekeyedTopic get filled up. Since the event time of the messages in the table is lower than the event time of the messages in the stream my understanding (from reading https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization) is that I should see the result of the join (in enrichedTopic) shortly after myStore and rekeyedTopic are filled up. In fact I should be able to fill up rekeyedTopic first and as long as myStore gets filled up less than 2 minutes after that, the join should still produce the expected result.
This is not what happens. What happens is that myStore and rekeyedTopic get filled up within the first second or so, then nothing happens for 2 minutes and only then enrichedTopic gets filled with the expected messages.
I don't understand why there is a pause of 2 minutes before the enrichedTopic gets filled since everything is "ready" long before. What am I missing?
Based on the documentation, which states:
max.task.idle.ms - Maximum amount of time a stream task will stay idle when not
all of its partition buffers contain records, to avoid potential out-of-order
record processing across multiple input streams.
I would say it's possibly due to some of the partition buffers NOT containing records, so it's basically waiting to avoid out-of-order processing, up to the time you have configured for the property.