Kafka KStream-KTable join race condition - apache-kafka

I have the following:
KTable<Integer, A> tableA = builder.table("A");
KStream<Integer, B> streamB = builder.stream("B");
Messages in streamB need to be enriched with data from tableA.
Example data:
Topic A: (1, {name=john})
Topic B: (1, {type=create,...}), (1, {type=update,...}), (1, {type=update...})
In a perfect world, I would like to do
streamB.join(tableA, (b, a) -> { b.name = a.name; return b; })
.selectKey((k,b) -> b.name)
.to("C");
Unfortunately this does not work for me because my data is such that every time a message is written to topic A, a corresponding message is also written to topic B (the source is a single DB transaction). After this initial 'creation' transaction, topic B will keep receiving more messages. Sometimes several events per second show up on topic B, but it is also possible for consecutive events for a given key to arrive hours apart.
The reason the simple solution does not work is that the original 'creation' transaction causes a race condition: topics A and B get their messages almost simultaneously, and if the B message reaches the 'join' part of the topology first (say a few ms before the A message gets there), tableA will not yet contain a corresponding entry. At this point the event is lost. I can see this happening on topic C: some events show up, some don't (if I use a leftJoin, all events show up, but some have a null key, which is equivalent to being lost). This is only a problem for the initial 'creation' transaction; after that, every time an event arrives on topic B, the corresponding entry already exists in tableA.
So my question is: how do you fix this?
My current solution is ugly. What I did was create a 'collection of B' and read topic B using
B.groupByKey()
 .aggregate(() -> new CollectionOfB(), (id, b, agg) -> { agg.add(b); return agg; })
 .join(tableA, ...);
Now we have a KTable-KTable join, which is not susceptible to this race condition. The reason I consider this 'ugly' is that after each join, I have to send a special message back to topic B that essentially says "remove the event(s) that I just processed from the collection". If this special message is not sent to topic B, the collection keeps growing and every event in the collection is reported on every join.
Currently I'm investigating whether a window join would work (read both A and B into KStreams and use a windowed join). I'm not sure that this will work either because there is no upper bound on the size of the window. I want to say, "window starts 1 second 'before' and ends infinity seconds 'after'". Even if I can somehow make this work, I am a bit concerned with the space requirement of having an unbounded window.
Any suggestion would be greatly appreciated.

Not sure what version you are using, but the latest Kafka 2.1 improves the stream-table join. Even before 2.1, the following holds:
the stream-table join is based on event-time
Kafka Streams processes messages based on event-time (for two input streams, records from the stream with the smaller record timestamps are processed first), though within each stream records are processed in offset order
if you want to ensure that the table is updated first, the table update record should have a smaller timestamp than the stream record
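For illustration (this snippet is not from the answer; it reuses the Integer key and A/B value types from the question, and key, a, b stand for the key and payloads written in the same DB transaction): the producer can set the record timestamps explicitly when the DB transaction is turned into Kafka writes, so that the table-side record always carries the smaller timestamp.
import org.apache.kafka.clients.producer.ProducerRecord;

long txTimestamp = System.currentTimeMillis();   // placeholder for the DB transaction's timestamp
// the A record gets the smaller timestamp, so the stream-table join sees the table update first
ProducerRecord<Integer, A> tableRecord = new ProducerRecord<>("A", null, txTimestamp, key, a);
ProducerRecord<Integer, B> streamRecord = new ProducerRecord<>("B", null, txTimestamp + 1, key, b);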
Since 2.1:
to allow for some delay, you can configure max.task.idle.ms to delay processing for the case that only one input topic has data
The event-time processing order is implemented as best-effort in 2.0 and earlier versions, which can lead to the race condition you describe. In 2.1, the processing order is guaranteed and may only be violated if the max.task.idle.ms timeout expires.
For details, see https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
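As an illustration only (not part of the original answer): with Kafka Streams 2.1+, max.task.idle.ms can be set like any other streams config. The application id, bootstrap server, and the 100 ms value below are placeholders, and builder is the StreamsBuilder from the question.
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");      // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker address
// wait up to 100 ms for data on the other input before processing, so the tableA
// update has a chance to be read before the corresponding streamB record is joined
props.put("max.task.idle.ms", "100");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();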

Related

CQRS + ES Implementation Advice

I'm working on a generic CQRS + ES framework (with Node.js) at my company. Remark: only RDBMS + Redis (without AOF/RDB persistence) is allowed, for various reasons.
I really need some advice on how to implement the CQRS + ES framework....
Ignoring the ES part, I'm struggling with the implementation of the message propagation.
Here are the tables I have in the RDBMS.
EventStore: [aggregateId (varchar), aggregateType (varchar), aggregateVersion (bigint), messageId (varchar), messageData (varchar), messageMetadata (varchar), sequenceNumber (bigint)]
EventDelivery: [messageId (varchar, foreign key to EventStore), sequenceId (equal to aggregateId, varchar), sequenceNumber (equal to the one in EventStore, bigint)]
ConsumerGroup: [consumerGroup (varchar), lastSequenceNumberSeen (bigint)]
And I have multiple EventSubscribers:
// In Application 1
#EventSubscriber("consumerGroup1", AccountOpenedEvent)
...
// In Application 2
#EventSubscriber("consumerGroup2", AccountOpenedEvent)
...
Here is the flow when an AccountOpenedEvent is written to the EventStore table.
For each application (i.e. application 1 and application 2), it will scan the codebase to obtain all the #EventSubscriber declarations, create a consumer group in the ConsumerGroup table with lastSequenceNumberSeen = 0, and then run a scheduler (with a 100ms polling interval) to poll all the events of interest (grouped by consumer group) in EventStore with the condition sequenceNumber >= lastSequenceNumberSeen.
For each event (EventStore) from step 1, calculate the sequenceId (here the sequenceId is equal to the aggregateId); this sequenceId (together with the sequenceNumber) is used to guarantee message delivery ordering. Persist it into the EventDelivery table, and update lastSequenceNumberSeen = sequenceNumber (this is to prevent duplicate events being scanned in the next interval). A rough sketch of this polling pass is shown after these steps.
For each application (i.e. application 1 and application 2), we have another scheduler (also with a 100ms polling interval) to poll the EventDelivery table (grouped by sequenceId and ordered by sequenceNumber ASC).
For each event (EventDelivery) from step 3, call the corresponding message handler; after the message is handled, acknowledge it by deleting the record from EventDelivery.
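Illustration only (this is not from the question, and it is in Java rather than Node.js simply to keep all sketches in this post in one language): one pass of the first polling scheduler from steps 1-2, using plain JDBC. Table and column names follow the question, the per-consumer-group event-type filter is omitted, and > is used instead of >= so the recorded high-water mark is not re-read.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EventStorePoller {
    // One polling pass for a single consumer group (steps 1-2 above).
    public void pollOnce(Connection conn, String consumerGroup, long lastSequenceNumberSeen) throws Exception {
        String query = "SELECT messageId, aggregateId, sequenceNumber FROM EventStore "
                     + "WHERE sequenceNumber > ? ORDER BY sequenceNumber ASC";
        try (PreparedStatement select = conn.prepareStatement(query)) {
            select.setLong(1, lastSequenceNumberSeen);
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    long sequenceNumber = rs.getLong("sequenceNumber");
                    // step 2: record the pending delivery, using aggregateId as the sequenceId
                    try (PreparedStatement insert = conn.prepareStatement(
                            "INSERT INTO EventDelivery (messageId, sequenceId, sequenceNumber) VALUES (?, ?, ?)")) {
                        insert.setString(1, rs.getString("messageId"));
                        insert.setString(2, rs.getString("aggregateId"));
                        insert.setLong(3, sequenceNumber);
                        insert.executeUpdate();
                    }
                    // remember the high-water mark so the next interval does not rescan this event
                    try (PreparedStatement update = conn.prepareStatement(
                            "UPDATE ConsumerGroup SET lastSequenceNumberSeen = ? WHERE consumerGroup = ?")) {
                        update.setLong(1, sequenceNumber);
                        update.setString(2, consumerGroup);
                        update.executeUpdate();
                    }
                }
            }
        }
    }
}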
Since I have 2 applications, I have to separate the handling of the AccountOpenedEvent in EventStore into 2 transactions; since the 2 applications don't know about each other, I can only do this passively. That's why I need the EventDelivery table and the polling scheduler.
Assume I can use Redlock + cron to make sure there is only 1 instance doing the polling jobs, in case application 1 has more than 1 replica.
Application 1 will poll the AccountOpenedEvent, create a record in EventDelivery, and store the lastSequenceNumberSeen in its consumer group.
Application 2 will also poll the AccountOpenedEvent, create a record in EventDelivery, and store the lastSequenceNumberSeen in its consumer group.
Since application 1 and application 2 are in different consumer groups, they treat the event store stream separately.
Here is the first problem: we have 2 schedulers, and we would have more if there were more consumer groups; this will put a heavy load on the database. How do I solve this? One of my ideas is to convert these 2 schedulers into jobs and put the jobs into a queue that processes them per interval (let's say every 100ms), but it seems this would introduce a lot of latency if a job is unfortunately placed at the end of the queue.
Here is the 2nd problem: in the above flow, I introduced the 2nd polling job to guarantee message delivery ordering. But unlike the first one, it has no lastSequenceNumberSeen; instead, the 2nd polling job removes the record from EventDelivery once the message is handled. But it is common for a message to take more than 100ms to handle. If that is the case, the same event in EventDelivery will be scanned again.
I'm not sure what the common practice is, and I'm quite stuck on how to implement this. I did lots of research on the internet. I see that some implementations do the message propagation using Debezium + Kafka (although I cannot use these 2 tools, I still don't fully understand how that works).
I know Debezium uses a CDC approach to tail the transaction logs of the RDBMS and forward the messages to Kafka. And I have seen recommendations that we should not have multiple subscriptions on the same transaction log. Assuming Debezium guarantees that the events are propagated to Kafka, that means application 1 and application 2 need to subscribe to the Kafka topic, each in a different consumer group (and also using the aggregateId as the partition key). Since Kafka guarantees message ordering, everything should work fine. But I don't think Kafka stores all the messages from the very beginning. Let's say it is configured to retain 1,000,000 messages: if the message handler keeps failing for some unexpected reason, the 1,000,000 messages after the failed one cannot be handled, and the 1,000,001st event will be lost... Although this is a rare case, and I'm not sure I understand it correctly, the database table is the most reliable source to trust, as it stores all the events from the very beginning. If the system runs into this case, does that mean I need to manually republish all the events to Kafka to recover the projection model?
And another case: what if I have a new event subscriber that needs the historical events to build its projection model? With Debezium + Kafka, do we need to assign a new consumerGroup and configure it to read the Kafka stream from the very beginning? It has the same problem, in that the consumerGroup can only get the last 1,000,000 events... But this is not an issue if we poll the database table directly instead.
I don't understand why most implementations don't poll the database table but instead make use of a message broker.
And I really need advice on how to implement a CQRS + ES framework.... especially the message propagation part (keep in mind I can only use RDBMS + Redis (without persistence))....

How to do a Kafka Streams Left Join that returns LHS messages that have no corresponding RHS after a fixed period?

I'm new to Kafka Streams. I've just put together a left join between stream A and stream B. In my setup, for every A there is a B, which arrives a few millis after A, but in real life there may be missing B's, or B's that arrive late (after, say, 250ms). I want to be able to find these (missing and late B's).
I thought it would be easy - just do a left join between A and B, specify the window, and job done.
But I found to my surprise that I get 2 rows in the left join stream output.
Thinking about it, this makes sense: when A arrives, there is no B, and a join row that looks like A-[null] is generated. A few milliseconds later, B arrives, and then A-B is generated.
What I want is those A messages that do not have a corresponding B after, say, 100ms: B could be late, or might never arrive, but it did not arrive within 100ms of A.
Is there a standard pattern / idiomatic way to do this? At the moment I'm thinking that maybe I would have to have a consumer that receives the A and then fires a message after a set time (although I'm not exactly sure how that would be done without some clunky synchronous code), and then I would have to join between that (call it Ax) and B.
This is probably quite a common requirement, but it doesn't seem as easy as I first thought....any thoughts/pointers/tips would be much appreciated. Thanks.
OK, I have something that seems to work. All I need to do is, after the left join (which of course has a window), do a .groupByKey().count(); after that I can, e.g., send everything with a count < 2 to one stream ("missing") and the rest to another "good" stream, e.g. for analysis/calculation of metrics etc. (using filter() and branch(), I think, although I haven't done that part yet).
I tried using .windowedBy(TimeWindows.of(ofMillis(250)).grace(ofMillis(10)))
and .suppress(Suppressed.untilWindowCloses(unbounded())), but got nowhere with it, so it's just as well that a groupBy with a count seems to be all that is needed.
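For what it's worth, here is a rough sketch of that pipeline (illustration only, not tested; the topic names and String values are placeholders, and the 100ms figure comes from the question). Note that count() produces a running update per key, so the intermediate count of 1 for an A that is later matched will also flow downstream unless it is windowed/suppressed or filtered out later, which is exactly what the windowedBy/suppress variant was meant to address.
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> streamA = builder.stream("topic-A");   // placeholder topic names
KStream<String, String> streamB = builder.stream("topic-B");

KTable<String, Long> joinCounts = streamA
    .leftJoin(streamB,
        (a, b) -> b == null ? a + "-[null]" : a + "-" + b,     // A-[null] first, A-B if/when B arrives
        JoinWindows.of(Duration.ofMillis(100)))
    .groupByKey()
    .count();

// keys that only ever produced A-[null] within the window
joinCounts.toStream()
    .filter((key, count) -> count < 2)
    .to("missing", Produced.with(Serdes.String(), Serdes.Long()));
// keys for which the A-B row eventually showed up
joinCounts.toStream()
    .filter((key, count) -> count >= 2)
    .to("good", Produced.with(Serdes.String(), Serdes.Long()));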

Window does not assess elements from Kafka Source

I think my perception of Flink windows may be wrong, since they are not evaluated as I would expect from the documentation or the Flink book. The goal is to join a Kafka topic, which has rather static data, with a Kafka topic with constantly incoming data.
env.addSource(createKafkaConsumer())
.join(env.addSource(createKafkaConsumer()))
.where(keySelector())
.equalTo(keySelector())
.window(TumblingProcessingTimeWindows.of(Time.hours(2)))
.apply(new RichJoinFunction<A, B>() { ... });
createKafkaConsumer() returns a FlinkKafkaConsumer
keySelector() is a placeholder for my key selector.
KafkaTopic A has 1 record, KafkaTopic B has 5. My understanding is that the JoinFunction would be triggered 5 times (the join condition is valid each time), resulting in 5 records in the sink. If a new record for topic A came in within the 2 hours, another 5 records would be created (2x5 records). However, what comes through to the sink is rather unpredictable; I could not see a pattern. Sometimes there's nothing, sometimes the initial records, but if I send additional messages, they are not processed by the join with the prior records.
My key question:
What is even happening here? Are the records emitted only after the window is done processing? I would expect real-time output to the sink, but that would explain a lot.
Related to that:
Could I handle this problem with an onElement trigger, or would that make my TimeWindow obsolete? Do those two concepts exist in parallel to each other, i.e. the join window is 2 hours, but the join function + output is triggered per element? What about duplicates in that case?
Subsequently, does processing time mean the point in time when the record is consumed from the topic? So if I, for example, call setStartFromEarliest() on start, would all messages consumed within the next two hours fall into that window?
Additional info:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); is set and I also switched to EventTime in between.
The semantics of a tumbling processing-time window are that it processes all events that fall into the given timespan; in your case, 2 hours. By default, the window will only output results once the 2 hours are over, because it needs to know that no other events will be coming for this window.
If you want to output early results (e.g. for every incoming record), then you could specify a custom Trigger which fires on every element. See the Trigger API docs for more information about this.
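For illustration (this is not from the answer, and it assumes a reasonably recent Flink 1.x API): a trigger along these lines fires the window function for every element and purges the window state once processing time passes the window end. When used with a join, note that every early firing re-evaluates the whole window, so previously emitted join pairs are emitted again.
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class FireOnEveryElementTrigger<T> extends Trigger<T, TimeWindow> {

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window, TriggerContext ctx) {
        // make sure the window is eventually finalized and cleaned up
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        return TriggerResult.FIRE;            // evaluate the window for every incoming element
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE_AND_PURGE;  // final evaluation + drop state at the window end
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;        // this trigger only reacts to processing time
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }
}
Depending on the Flink version, it could be attached right after the .window(...) call in your snippet, e.g. .window(TumblingProcessingTimeWindows.of(Time.hours(2))).trigger(new FireOnEveryElementTrigger<>()).apply(...).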
Update
The window does not start with the first element; instead, windows start at multiples of the window length. For example, if your window size is 2 hours, then you can only have windows [0, 2), [2, 4), ..., but not [1, 3) or [3, 5).

Best practice for stream processing with fuzzy matching dedupe

I'm designing a data pipeline that starts with flat files that are read; each line in a file is a single record.
Once loaded, each record is parsed, transformed, and enriched. This happens independently of other records.
As a final step, I want to dedupe records based on fuzzy matching of several of the records' fields. To do this I would like to get all combinations of 2 records.
Currently I use a SQL table as a buffer. The table contains all records, and I join the table with itself on the condition that the keys are different, with fuzzy matching on the name using SOUNDS LIKE:
CREATE TABLE temp_tblSoundsLikeName AS
SELECT DISTINCT clients1.client_name client_name1,
clients1.client_id client_id1,
clients2.client_name client_name2,
clients2.client_id client_id2
FROM tblClients clients1
JOIN tblClients clients2
ON clients1.client_name != clients2.client_name
AND clients1.ban_id < clients2.ban_id
AND SUBSTRING_INDEX(clients2.client_name,' ',1) SOUNDS LIKE SUBSTRING_INDEX(clients1.client_name,' ',1)
The records in temp_tblSoundsLikeName represent duplicates, and I will merge them in tblClients.
I was thinking of using Kafka Streams, which I haven't used before. When a message M (representing record R) arrives at the dedupe topic, I would like my application to consume it and, as a result, generate a message containing the information from R and from another message R', where R' is any message that arrived in the dedupe stage in the past 5 hours. These messages, containing the combinations of 2 messages, should be sent to another topic, where they can be filtered by matching and fuzzy-matching conditions; the final stage is to merge duplicate records and push the merged records to the RDBMS with the Kafka Connect JDBC sink.
I am not sure however how to create messages for all such R and R' combinations.
Is this possible?
Is this a good use case for Kafka Streams?
A starting point for de-duping with Kafka's Streams API is EventDeduplicationLambdaIntegrationTest.java at https://github.com/confluentinc/kafka-streams-examples (direct link for Confluent Platform 3.3.0 / Apache Kafka 0.11.0: EventDeduplicationLambdaIntegrationTest.java).
The method isDuplicate controls whether or not a new event is considered to be a duplicate:
private boolean isDuplicate(final E eventId) {
long eventTime = context.timestamp();
WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime + rightDurationMs);
boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
The eventIdStore is a so-called "state store", and it allows you to remember information from past events so that you can make "duplicate yes/no?" decisions.
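For reference, here is a sketch (not taken verbatim from the example, which targets an older API; the store name, retention, and window size are placeholders, and builder is the application's StreamsBuilder) of how such a window store is typically registered with a recent Kafka Streams API (2.1+):
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.Stores;

builder.addStateStore(
    Stores.windowStoreBuilder(
        // keep event ids for 10 hours, in 5-hour windows, without duplicate entries per window
        Stores.persistentWindowStore("eventId-store", Duration.ofHours(10), Duration.ofHours(5), false),
        Serdes.String(),
        Serdes.Long()));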
One option you have is to do the "given a new R, find all R' messages, and then de-dupe" work in a single processing step (similar to what the example above does, using a so-called Transformer), rather than creating a bunch of new downstream messages, which leads to write amplification (1 * R => N * "(R/R')" downstream messages). The state store can be used to track all prior messages, including the various R' you are interested in when R arrives.
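As an illustration of that idea (this is not the Confluent example itself; the names are placeholders, and it again assumes a recent Kafka Streams API): a Transformer that keeps a window store of keys seen in the last 5 hours and drops later occurrences. For the fuzzy-matching case, the incoming stream would first be re-keyed to some normalized/phonetic form of the client name (approximating the SOUNDS LIKE condition), so that "same key within 5 hours" means "probable duplicate".
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class DedupeTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private static final long WINDOW_MS = Duration.ofHours(5).toMillis();

    private ProcessorContext context;
    private WindowStore<String, Long> seenStore;   // assumed to be registered under the name "seen-store"

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.seenStore = (WindowStore<String, Long>) context.getStateStore("seen-store");
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        long now = context.timestamp();
        try (WindowStoreIterator<Long> it = seenStore.fetch(key, now - WINDOW_MS, now)) {
            if (it.hasNext()) {
                return null;                       // same (normalized) key within 5 hours: drop as duplicate
            }
        }
        seenStore.put(key, now, now);              // remember this key at the record's timestamp
        return KeyValue.pair(key, value);          // first occurrence: forward downstream
    }

    @Override
    public void close() { }
}
The store would be registered the same way as sketched above (named "seen-store" here) and wired in with stream.transform(DedupeTransformer::new, "seen-store"); merging rather than dropping duplicates would mean storing the earlier record's value (instead of just a timestamp) and emitting a combined value instead of null.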

How can I wait to process tuples from a Kafka Topic/Stream in Storm

Super new to stream processing. Trying to see if this is possible in Core Storm, or possibly Trident. The underlying streams are Kafka topics, if it matters, so they're reliable and even replayable.
Conceptually I have two streams, A and B:
A := (id, timestamp, v)
B := (id, timestamp, w)
I assume A and B are each in timestamp-ascending order.
I assume a Bolt reading both A and B (though I'm open to other solutions).
WLOG, for a given tuple "a" from A, to process it I need to have a corresponding tuple "b" from B, the first tuple encountered from B such that:
a.id == b.id
b.timestamp >= a.timestamp
(As such, assuming we get these events in the Bolt only one at a time, when it is processing "a", the corresponding "b" may have already appeared in the past; or may yet appear in milliseconds, months, or even never.)
For any given id, I do not mind waiting milliseconds, months, or even eternity, to pass on some modified version of the "a" event in my Storm topology. I do however want to emit such modified events as soon as possible after all relevant information appears, and to make this whole setup scalable and redundant in all the usual pragmatic engineering ways.
How can I best implement this in Storm?
Naive thoughts I have had:
Keep failing to ack any "a" that doesn't have a corresponding "b" yet, or vice versa.
Queue up "a" and "b" style tuples in bolt memory until reciprocal events can be found, and save the Bolt state as necessary.
Emit problem "a" and "b" tuples on to some other stream or streams and somehow delay further processing on those tuples until an event from the reciprocal stream for the appropriate id appears.