How to do a Kafka Streams Left Join that returns LHS messages that have no corresponding RHS after a fixed period? - apache-kafka

I'm new to Kafka Streams. I've just put together a left join between Stream A and Stream B. In my setup, for every A there is a B, which arrives a few milliseconds after A, but in real life there may be missing Bs, or Bs that arrive late (after, say, 250 ms). I want to be able to find these (missing and late Bs).
I thought it would be easy - just do a left join between A and B, specify the window, and job done.
But I found to my surprise that I get 2 rows in the left join stream output.
Thinking about it, this makes sense - when A arrives, there is no B and a join row that looks like A-[null] is generated. A few milliseconds later, B arrives, and then A-B is generated.
What I want is to have those A messages that do not have a corresponding B after say 100ms - B could be late; might never arrive; but it did not arrive within 100ms of A.
Is there a standard pattern / idiomatic way to do this? I am thinking at the moment that maybe I would have to have a consumer that receives the A and then fires a message after a set time (although I'm not exactly sure how that would be done without some clunky synchronous code) and then I would have to join between that (call it Ax) and B.
This is probably quite a common requirement, but it doesn't seem as easy as I first thought....any thoughts/pointers/tips would be much appreciated. Thanks.

OK I have something that seems to work. All I need to do is, after the left join (which of course has a window), do a .groupByKey().count() and after that I can e.g. send stuff (using filter() and branch() I think, although I haven't done it yet) with a count < 2 to one stream ("missing"), and the others to another "good" stream eg for analysis/calculation of metrics etc.
I tried using .windowedBy(TimeWindows.of(ofMillis(250)).grace(ofMillis(10))) and .suppress(Suppressed.untilWindowCloses(unbounded())), but got nowhere with it, so it's just as well that a groupByKey with count is all that is needed, by the looks of things.
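The requirement from the question ("A messages that have no corresponding B within 100 ms") can be sketched in plain Java, without any Kafka dependency. This is only an illustration of the bookkeeping, not the Kafka Streams API: the class name MissingBDetector and its methods are hypothetical, and in a real topology something would have to call expire() periodically and keep the pending map in a state store.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch (hypothetical names): report A keys with no B within a timeout. */
class MissingBDetector {
    private final long timeoutMs;
    // Arrival time of each A that is still waiting for its B, in arrival order.
    private final Map<String, Long> pendingA = new LinkedHashMap<>();

    MissingBDetector(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record that A arrived for this key at the given timestamp. */
    void onA(String key, long timestampMs) {
        pendingA.put(key, timestampMs);
    }

    /** A B arrived: it only clears the pending A if it arrived within the timeout. */
    void onB(String key, long timestampMs) {
        Long aTime = pendingA.get(key);
        if (aTime != null && timestampMs - aTime < timeoutMs) {
            pendingA.remove(key);
        }
    }

    /** Called periodically: returns and forgets keys whose A has waited at least the timeout. */
    List<String> expire(long nowMs) {
        List<String> missing = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pendingA.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                missing.add(entry.getKey());
                it.remove();
            }
        }
        return missing;
    }
}
```

Unlike counting join results, this only reports an A once the timeout has actually elapsed, so an A whose B is merely a few milliseconds behind is never flagged.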

Related

How can an event-sourced entity subscribe to state changes in another entity?

I have an event-sourced entity (C) that needs to change its state in response to state changes in another entity of a different type (P). The logic that decides whether the state of C should actually change is quite complex, and the data needed to compute it lives in C. Moreover, many instances of C should listen to one instance of P, and the set of instances grows over time, so I'd rather they pull from a stream knowing the ID of P than have P keep track of the IDs of all the Cs and push to them.
I am thinking of doing something such as:
1. Tag a projection of P's events
2. Have a Subscribe(P.id) command that gets sent to C
3. If C is not already subscribed to a P (it can only subscribe to one, and it shouldn't change), fire an event Subscribed(P.id)
4. In response to the event, use Akka Persistence Query to materialize the stream of events tagged in 1, map them to commands, and run the stream asynchronously with a sink that sends them to my ES entity reference
It seems like a bit of an anti-pattern to have a stream run in the event handler. I am wondering if there's a better/more supported way to do this without the upstream having to know about the downstream. I decided against Akka pub-sub because it does at-most-once delivery, and I'd like to avoid using Kafka if possible.
You definitely don't want to run the stream in the event handler: the event handler should never side effect.
Assuming that you would like a C to get events from times when that C was not running (including before that C had ever run), this suggests that a stream should be run for each C. Since the subscription will be to one particular P, I'd seriously consider not tagging, but instead using the eventsByPersistenceId stream to get all the events of a P and ignore the ones that aren't of interest. In the stream, you translate those to commands in C's API, including the offset in P's event stream with the command, and send it to C (for at-least-once delivery, a mapAsync with an ask is useful; C will persist an event recording that it processed the offset: this allows the command to be idempotent, as C can acknowledge the command if the offset is less-than-or-equal-to the high water offset in its state).
This stream gets kicked off by the command-handler after successfully persisting a Subscribed(P.id) event (in this case starting from offset 0) and then gets kicked off after the persistent actor is rehydrated if the state shows it's subscribed (in this case starting from one plus the high water offset).
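The idempotent handling described above can be sketched in plain Java. This is not Akka code: SubscriberState and its methods are hypothetical names, and the in-memory list stands in for persisting an event in C. It assumes, as in the text, that offsets in P's stream start at 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch (hypothetical names) of C's idempotent handling of commands derived from P. */
class SubscriberState {
    // Highest offset in P's event stream that C has processed; -1 means none yet.
    private long highWaterOffset = -1L;
    private final List<String> applied = new ArrayList<>();

    /**
     * Handles a command carrying the offset of the originating P event.
     * Returns true if the command was newly applied and false if it was a duplicate.
     * Either way the caller acknowledges it, which makes at-least-once redelivery safe.
     */
    boolean handle(long offset, String payload) {
        if (offset <= highWaterOffset) {
            return false; // already processed: acknowledge without changing state
        }
        applied.add(payload);     // stands in for persisting an event in C
        highWaterOffset = offset; // recorded together with that event
        return true;
    }

    /** Offset from which the stream should resume after C is rehydrated. */
    long resumeOffset() {
        return highWaterOffset + 1;
    }

    /** Payloads applied so far, in order. */
    List<String> applied() {
        return applied;
    }
}
```

A freshly subscribed C resumes from offset 0, and a rehydrated one from one plus its high-water offset, matching the two start cases described above.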
The rationale for not using tagging here arises from an assumption that the number of events C isn't interested in is smaller than the number of events with the tag from Ps that C isn't subscribed to (note that for most of the persistence plugins, the more tags there are, the more overhead there is: a tag which is only used by one particular instance of an entity is often not a good idea). If the tag in question is rarely seen, this assumption might not hold and eventsByTag and filtering by id could be useful.
This does of course have the downside of running a discrete stream for every C: depending on how many Cs are subscribed to a given P, the overhead may be substantial, and the streams for subscribers which are caught up will be especially wasteful. In this scenario, responsibility for delivering commands to subscribed Cs for a given P can be moved to an actor. The only real change is that where C would run the stream, it instead confirms that it is subscribed by asking the actor that feeds events from the P. Because this approach is a marked step up in complexity (especially around managing when Cs join and drop out of the shared "caught-up" stream), I'd tend to recommend starting with the stream-per-C approach and moving to the shared stream later. The transition is not difficult: from C's perspective, there's not really a difference whether the adapted commands come from a stream it started or from a stream run by some other actor. It's also worth noting that there can be multiple shared streams: in fact I'd tend to have shared streams be per-ActorSystem (e.g. a "node singleton" per P of interest) so as not to involve remoting.

Join a static and a dynamic Kafka source in Flink

Today, I'd like to address a conceptual topic about Flink, rather than a technical one.
In our case, we have two Kafka topics, A and B, that need to be joined. The join should always include all elements from topic A, as well as all new elements from topic B. There are two ways to achieve this: always create a new consumer and start consuming topic A from the beginning, or keep all elements from topic A in state once consumed.
Right now, the technological approach is going via joining two DataStreams, which quickly shows us its limits for this use case, as there is no possibility to join streams without a window (fair enough). Elements from topic A are eventually lost, if the window moves on and I got the feeling regularly resetting the consumer would bypass the elaborate logic introduced by Flink.
The other approach I am looking towards right now, would be to use the Table API, it sounds like it's the best fit for this job and actually keeps all the elements in its state for an indefinite amount of time.
My question, however: before going into the depths of the Table API only to find out there is a more elegant way, I'd like to know whether this is the optimal solution for this problem or whether there's a better-fitting Flink concept I am not aware of.
Edit: I forgot to mention: We do not make use of POJOs, but rather keep it generic, which means that the incoming data is identified as Tuple2<K,V>, where K and V are each an instance of GenericRecord. The corresponding schema for serialization/deserialization is obtained from the Schema Registry at runtime. I don't know to what extent the SQL constructs could be a bottleneck in this situation.
Additionally, this remark from the documentation, "Both tables must have distinct field names", makes me doubt a little, as we do have the same field names, which we would have to handle somehow without huge workarounds.
If A is truly static, then it will be less expensive if you can somehow fully ingest A, either into Flink state or into memory, and then stream B past A -- thereby producing the join results without having to store B.
There are at least a couple of ways to accomplish this with Flink. One is described in this answer, and the other involves using the State Processor API.
With this second approach you would hold A in key-partitioned Flink state. By using the State Processor API you can bootstrap a savepoint that contains the state you want, so that by starting your job from this savepoint, A is already fully loaded and immediately available.
There's a simple example of bootstrapping keyed state in this gist. Once you have created the savepoint, then you need to implement a streaming job that uses it to compute the join -- which can be done with a RichFlatMapFunction.
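The shape of that join can be sketched in plain Java, independent of Flink. The HashMap below stands in for the key-partitioned state that would hold A, and onB plays the role of the flatMap call (zero or one result per B element). The class and method names are hypothetical, and the string-concatenated output format is only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch (hypothetical names): hold static A in state, stream B past it. */
class StaticSideJoin {
    // Stand-in for keyed state holding the fully ingested topic A.
    private final Map<String, String> aByKey = new HashMap<>();

    /** Loads one element of A; in the real job this would come from the bootstrapped savepoint. */
    void loadA(String key, String value) {
        aByKey.put(key, value);
    }

    /** Joins one B element against A. B itself is never stored. */
    List<String> onB(String key, String bValue) {
        List<String> out = new ArrayList<>();
        String aValue = aByKey.get(key);
        if (aValue != null) {
            out.add(key + ":" + aValue + "|" + bValue);
        }
        return out;
    }
}
```

Because only A is kept, the state size is bounded by the size of A no matter how many B elements flow through.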
The other alternative for implementing joins without using the Table API is to simply roll your own with a RichCoFlatMapFunction or a KeyedCoProcessFunction. You will find examples of this in the Flink training. None of those examples really match your requirements, but they give the general flavor. I don't see any advantage to this, however -- if you are going to do a fully dynamic/dynamic join, might as well use the Table API.

Delay fixed window from triggering for several minutes

Using Fixed Windows in Apache Beam. The watermark is set by the event time.
Some data may arrive out of order and cause the window to close.
How can a trigger be defined in Java to occur say 2 minutes after the last data was seen?
It's not entirely clear what behavior you expect. One question is what you expect to happen if data arrives within the two minutes: do you want to restart the two-minute interval or not, and re-emit the data or not?
Looks like the trigger you are trying to describe is something along these lines:
1. wait until the watermark has passed the end of the window, in event time;
2. wait for an additional 2 minutes in processing time;
3. emit the data.
If in step 2 it was event time, i.e. you wanted to re-emit the window if a late element arrives that fits within window + 2min, then you could use withAllowedLateness(). Though it sounds different from what you want, because it can keep re-emitting the window contents every time a matching late element arrives.
With processing time in step 2, this is not possible in general with the basic triggers available in Beam. You can probably achieve the behavior you want if you manually manage state and timers in your own ParDo, e.g. you can watch for the incoming elements, keep track of them in state, and then emit what you want when the timer fires. This can become very complicated and might still not be flexible enough for your specific use case.
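The state-and-timer bookkeeping could look roughly like the plain-Java sketch below. It is not Beam code: QuietPeriodBuffer and its methods are hypothetical, the buffer stands in for per-window state, and onTick stands in for a processing-time timer callback. It assumes one possible answer to the question above, namely that every new element restarts the two-minute wait.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch (hypothetical names): emit buffered elements after a quiet period. */
class QuietPeriodBuffer {
    private final long quietMs;
    private final List<String> buffered = new ArrayList<>();
    // Processing time at which the buffer should fire; MAX_VALUE means no timer is set.
    private long deadlineMs = Long.MAX_VALUE;

    QuietPeriodBuffer(Duration quietPeriod) {
        this.quietMs = quietPeriod.toMillis();
    }

    /** Buffers the element and restarts the timer (assumption: each element resets the wait). */
    void onElement(String element, long nowMs) {
        buffered.add(element);
        deadlineMs = nowMs + quietMs;
    }

    /** Timer callback: emits and clears the buffer once the quiet period has elapsed. */
    List<String> onTick(long nowMs) {
        if (nowMs < deadlineMs) {
            return Collections.emptyList();
        }
        List<String> out = new ArrayList<>(buffered);
        buffered.clear();
        deadlineMs = Long.MAX_VALUE;
        return out;
    }
}
```

If the wait should not restart on new elements, onElement would set the deadline only when no timer is currently set.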
One of the major problems is that there is no good way to define processing time triggers in Beam in general. It would be complicated to define a general mechanism of working with timers in this manner. For example, when you want to express "wait for 2 minutes", the framework needs to understand in relation to what these two minutes are, when to start the timer, so you need a mechanism to express that as well. And with composition, continuation and other complications this doesn't seem easy to reason about. So it's not in the framework in this general form.
In order to implement only the "wait for 2 minutes after the last element was seen in the window", the framework has to watch for it and set the timer. Technically it is possible to do something like this but doesn't seem like anyone has done it yet.
There seems to be only one meaningful processing time trigger available in Beam but it's not generic enough and doesn't do what you want. You can look at composite triggers like AfterFirst or AfterAll but they likely won't help you without a better general processing time trigger.
I decided against using Beam and implemented the solution in Kafka Streams.
I basically grouped by key, then used fixed windows and then aggregated the result.
The "grace" on the window allows data to arrive late.
KGroupedStream<Long, OxyStreamItem> grouped = input.groupByKey();
TimeWindowedKStream<Long, OxyStreamItem> windowed =
    grouped.windowedBy(
        TimeWindows.of(WIN_SIZE)
            .advanceBy(WIN_SIZE)
            .grace(Duration.ofSeconds(5L)));
return windowed
    .aggregate(
        makeInitializer(),
        makeAggregator(),
        Materialized
            .<Long, Aggregate, WindowStore<Bytes, byte[]>>as("tmp")
            .withValueSerde(new AggregateSerde()))
    .suppress(
        Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .map(calculateAvg());

Kafka KStream-KTable join race condition

I have the following:
KTable<Integer, A> tableA = builder.table("A");
KStream<Integer, B> streamB = builder.stream("B");
Messages in streamB need to be enriched with data from tableA.
Example data:
Topic A: (1, {name=john})
Topic B: (1, {type=create,...}), (1, {type=update,...}), (1, {type=update...})
In a perfect world, I would like to do
streamB.join(tableA, (b, a) -> { b.name = a.name; return b; })
.selectKey((k,b) -> b.name)
.to("C");
Unfortunately this does not work for me because my data is such that every time a message is written to topic A, a corresponding message is also written to topic B (the source is a single DB transaction). After this initial 'creation' transaction, topic B will keep receiving more messages. Sometimes several events per second will show up on topic B, but it is also possible to have consecutive events hours apart for a given key.
The reason the simple solution does not work is that the original 'creation' transaction causes a race condition: Topic A and B get their message almost simultaneously and if the B message reaches the 'join' part of the topology first (say a few ms before the A message gets there) the tableA will not yet contain a corresponding entry. At this point the event is lost. I can see this happening on topic C: some events show up, some don't (if I use a leftJoin, all events show up but some have null key which is equivalent to being lost). This is only a problem for the initial 'creation' transaction. After that every time an event arrives on topic B, the corresponding entry exists in tableA.
So my question is: how do you fix this?
My current solution is ugly. What I do is that I created a 'collection of B' and read topic B using
B.groupByKey()
    .aggregate(() -> new CollectionOfB(), (id, b, agg) -> agg.add(b))
    .join(tableA, ...);
Now we have a KTable-KTable join, which is not susceptible to this race condition. The reason I consider this 'ugly' is because after each join, I have to send a special message back to topic B that essentially says "remove the event(s) that I just processed from the collection". If this special message is not sent to topic B, the collection will keep growing and every event in the collection will be reported on every join.
Currently I'm investigating whether a window join would work (read both A and B into KStreams and use a windowed join). I'm not sure that this will work either because there is no upper bound on the size of the window. I want to say, "window starts 1 second 'before' and ends infinity seconds 'after'". Even if I can somehow make this work, I am a bit concerned with the space requirement of having an unbounded window.
Any suggestion would be greatly appreciated.
Not sure what version you are using, but latest Kafka 2.1 improves the stream-table-join. Even before 2.1, the following holds:
stream-table join is based on event time
Kafka Streams processes messages based on event time, but within a single partition always in offset order (across two input streams, the stream whose next record has the smaller timestamp is processed first)
if you want to ensure that the table is updated first, the table update record should have a smaller timestamp than the stream record
Since 2.1:
to allow for some delay, you can configure max.task.idle.ms configuration to delay processing for the case that only one input topic has input data
The event-time processing order is implemented as best-effort in 2.0 and earlier versions, which can lead to the race condition you describe. In 2.1, processing order is guaranteed and can only be violated if max.task.idle.ms is exceeded.
For details, see https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
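To make the ordering rule concrete, here is a plain-Java simulation of picking the next record by timestamp across a table input and a stream input. It is not Kafka Streams code: the names are hypothetical, and letting the table win on equal timestamps is an assumption of this sketch rather than documented behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation (hypothetical names) of timestamp-ordered stream-table processing. */
class TimestampOrderedJoin {
    /** A keyed record with an event timestamp. */
    static final class Rec {
        final int key;
        final String value;
        final long ts;

        Rec(int key, String value, long ts) {
            this.key = key;
            this.value = value;
            this.ts = ts;
        }
    }

    /**
     * Processes both inputs, always taking the head record with the smaller timestamp.
     * Ties go to the table side; that tie-break is an assumption of this sketch.
     */
    static List<String> run(List<Rec> tableInput, List<Rec> streamInput) {
        Deque<Rec> table = new ArrayDeque<>(tableInput);
        Deque<Rec> stream = new ArrayDeque<>(streamInput);
        Map<Integer, String> tableState = new HashMap<>();
        List<String> joined = new ArrayList<>();
        while (!table.isEmpty() || !stream.isEmpty()) {
            boolean takeTable = stream.isEmpty()
                    || (!table.isEmpty() && table.peek().ts <= stream.peek().ts);
            if (takeTable) {
                Rec r = table.poll();
                tableState.put(r.key, r.value); // table update
            } else {
                Rec r = stream.poll();
                String name = tableState.get(r.key);
                if (name != null) {
                    joined.add(name + ":" + r.value); // inner join: misses are dropped
                }
            }
        }
        return joined;
    }
}
```

With the table record at timestamp 99 and the first stream record at 100, both stream records are enriched. If the table record is stamped 101 instead, the 'create' record is processed against an empty table and dropped, which is the race described in the question.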

Kafka streams reduce after groupby to stream sends partial reduce output on commit [duplicate]

This question already has answers here:
How to send final kafka-streams aggregation result of a time windowed KTable?
We're having an issue where, upon doing a groupby --> reduce --> toStream, partial reduce values are being sent downstream when a commit happens during the reduce. So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
So here is our case in more detail:
msg --> leftJoin
leftJoin --> flatMap //break msg into parts so we can join again downstream
flatMap --> leftJoin
leftJoin --> groupByKey
groupByKey --> reduce
reduce --> toStream
toStream --> to
Currently, we've come up with a very ugly fix for this, which has to do with adding an index and out of values to each message created during the flatMap phase...we filter out any message emitted by the reduce where index != out of. My feeling is we're not doing something right here or looking at it the wrong way. Please advise on the correct way of doing this.
Thanks.
"So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced."
If I understand your description correctly, this is actually intended behavior. For one, it's a tradeoff between processing latency (where you want to see update records as soon as you have a new piece of input data) vs. coalescing multiple update records into fewer or even just a single update record.
The default behavior of Kafka Streams is to favor lower processing latency. That is, it will not wait for "all input data to have arrived" before sending downstream updates. Rather, it will send updates once new data has arrived. Some background information is described at https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/.
Today, you have two main knobs to tune this default behavior: (1) the Kafka Streams record caches (for the DSL) and (2) the configured commit interval (you already mentioned this).
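As a rough sketch, both knobs are plain configuration entries. The example below sets them on a java.util.Properties object using the string keys commit.interval.ms and cache.max.bytes.buffering. The class name and the values are illustrative assumptions, and newer Kafka Streams releases deprecate the cache key in favor of a differently named setting, so check the documentation for your version.

```java
import java.util.Properties;

/** Sketch (hypothetical class name, illustrative values) of the two tuning knobs. */
class StreamsTuning {
    /** Builds settings that trade higher latency for fewer intermediate updates. */
    static Properties fewerUpdates() {
        Properties props = new Properties();
        // Commit (and flush the caches) less often, so more updates per key are coalesced.
        props.setProperty("commit.interval.ms", "30000");
        // Total record cache size in bytes; setting it to 0 disables caching.
        props.setProperty("cache.max.bytes.buffering", String.valueOf(10 * 1024 * 1024));
        return props;
    }
}
```

These properties would be merged into the configuration passed to the Kafka Streams application. Neither setting guarantees a single final result per key; they only reduce how many intermediate updates are forwarded.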
Moving forward, the Kafka community has also been working on a new feature that will allow you to define that you just want a single, final update record to be sent (rather than what you described as "partial" updates). This new feature, in case you are interested, is described in the Kafka Improvement Proposal KIP-328: Ability to suppress updates for KTables. This is actively being worked on, but it is unlikely to be finished in time for the upcoming Kafka v2.1 release in October.
"Currently, we've come up with a very ugly fix for this, which has to do with adding an index and out of values to each message created during the flatMap phase...we filter out any message emitted by the reduce where index != out of. My feeling is we're not doing something right here or looking at it the wrong way. Please advise on the correct way of doing this."
In short, in stream processing you should embrace the nature of how streaming works. In general, you will only have partial/incomplete knowledge of the world, so to speak, or rather: you only know what you observed thus far. So, at any given point in time, you must deal with the situation that more, additional data may arrive that you still have to deal with.
A typical situation is having to deal with late-arriving data, where your application logic must decide whether you want to still integrate and process this data (quite likely) or discard (sometimes the way it needs to be).
Going back to your example:
"So if there are 65 keys to be reduced [...]"
How would one know it's 65, and not 100 or 28, and so on? One can only tell that: "Thus far, at this point in time, I have received 65. So, what do I do? Do I reduce those 65 because I believe that's all the input? Or do I wait some seconds/minutes/hours longer because there might be 35 more to arrive, but this will mean that I will not send an update/answer downstream until this waiting time has elapsed (which results in higher processing latency)?"
In your situation, I would ask: Why do you consider the streaming behavior of how/when updates are being sent a problem? Perhaps it's because you have a downstream system or application that doesn't know how to handle such streaming updates?
Does that make any sense? Again, the above is based on my understanding of what you described as being the issue.