Confused about intervalJoin - scala

I'm trying to come up with a solution that involves applying some logic after the join operation to pick one event from streamB among multiple EventBs. It would be like a reduce function, except it returns a single element rather than producing results incrementally. So the end result would be a single (EventA, EventB) pair instead of a cross product of one EventA and multiple EventBs.
streamA
  .keyBy((a: EventA) => a.common_key)
  .intervalJoin(
    streamB.keyBy((b: EventB) => b.common_key)
  )
  .between(Time.days(-30), Time.days(0))
  .process(new MyJoinFunction)
The data would be ingested like (assuming they have the same key):
EventB ts: 1616686386000
EventB ts: 1616686387000
EventB ts: 1616686388000
EventB ts: 1616686389000
EventA ts: 1616686390000
Each EventA key is guaranteed to arrive only once.
Assume a join operation like the one above, and that it joined 1 EventA with 4 EventBs, all successfully collected in MyJoinFunction. What I want to do now is access these values at once and apply some logic to correctly match the EventA to exactly one EventB.
For example, for the above dataset I need (EventA 1616686390000, EventB 1616686387000).
MyJoinFunction will be invoked for each (EventA, EventB) pair, but I'd like an operation after this that gives me an iterator, so I can look through all EventB events for each EventA.
I am aware that I can apply another windowing operation after the join to group all the pairs, but I want this to happen immediately after the join succeeds. So if possible, I'd like to avoid adding another window since my window is already large (30 days).
Is Flink the correct choice for this use case, or am I completely on the wrong track?

This could be implemented as a KeyedCoProcessFunction. You would key both streams by their common key, connect them, and process both streams together.
You can use ListState to store the events from B (for a given key), and ValueState for A (again, for a given key). You can use an event time timer to know when the time has come to look through the B events in the ListState, and produce your result. Don't forget to clear the state once you are finished with it.
If you're not familiar with this part of the Flink API, the tutorial on Process Functions should be helpful.
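A minimal sketch of what that could look like, assuming a String key, a ts field on EventA holding its event timestamp, and a placeholder pickBest(...) standing in for your "reduce to exactly one EventB" logic:

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class MatchFunction extends KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)] {

  private var eventA: ValueState[EventA] = _
  private var eventBs: ListState[EventB] = _

  override def open(parameters: Configuration): Unit = {
    eventA = getRuntimeContext.getState(new ValueStateDescriptor[EventA]("eventA", classOf[EventA]))
    eventBs = getRuntimeContext.getListState(new ListStateDescriptor[EventB]("eventBs", classOf[EventB]))
  }

  override def processElement1(a: EventA,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#Context,
      out: Collector[(EventA, EventB)]): Unit = {
    eventA.update(a)
    // fire once the watermark passes the EventA timestamp, i.e. once all
    // relevant EventBs for this key should have been seen
    ctx.timerService().registerEventTimeTimer(a.ts)
  }

  override def processElement2(b: EventB,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#Context,
      out: Collector[(EventA, EventB)]): Unit = {
    eventBs.add(b) // buffer every EventB for this key
  }

  override def onTimer(timestamp: Long,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#OnTimerContext,
      out: Collector[(EventA, EventB)]): Unit = {
    val a = eventA.value()
    if (a != null) {
      // all buffered EventBs for this key are available here at once
      val candidates = eventBs.get().asScala.toSeq
      // pickBest is a placeholder for your matching logic over all EventBs
      candidates.reduceOption((x, y) => pickBest(a, x, y)).foreach(b => out.collect((a, b)))
    }
    eventA.clear() // don't forget to clear the state
    eventBs.clear()
  }
}

Wiring would be roughly streamA.keyBy(_.common_key).connect(streamB.keyBy(_.common_key)).process(new MatchFunction).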

Related

Kafka Streams leftJoin - only right null if no join within window [duplicate]

I have a Kafka topic where I expect messages with two different key types: old and new.
i.e. "1-new", "1-old", "2-new", "2-old". Keys are unique, but some might be missing.
Now, using Kotlin and the KafkaStreams API, I can log those messages that have the same key id from new and old.
val windows = JoinWindows.of(Duration.of(2, MINUTES).toMillis())
val newStream = stream.filter({ key, _ -> isNew(key) })
    .map({ key, value -> KeyValue(key.replace(NEW_PREFIX, ""), value) })
val oldStream = stream.filter({ key, _ -> isOld(key) })
    .map({ key, value -> KeyValue(key.replace(OLD_PREFIX, ""), value) })
val joined = newStream.join(oldStream,
    { value1, value2 -> "$value1&$value2" }, windows)
joined.foreach({ key, value ->
    log.info { "JOINED $key : $value" }
})
Now I want to know which new/old keys are missing in the time window for some reason. Is it possible to achieve this with the KafkaStreams API?
In my case, when key "1-old" is received but "1-new" does not arrive within 2 minutes, only then do I want to report id 1 as suspicious.
The DSL might not give you what you want. However, you can use the Processor API. Having said this, the leftJoin can actually be used to do the "heavy lifting". Thus, after the leftJoin you can use .transform(...) with an attached state store to "clean up" the data further.
For each old&null record you receive, put it into the store. If you receive a later old&new, you can remove it from the store. Furthermore, you register a punctuation, and on each punctuation call you scan the store for entries that are "old enough", so you are sure no later old&new join result will be produced. For those entries, you emit old&null and remove them from the store.
As an alternative, you can also omit the join and do everything in a single transform() with state. For this, you would need to KStream#merge() the old and new streams and call transform() on the merged stream.
Note: instead of registering a punctuation, you can also put the "scan logic" into the transform and execute it each time you process a record.
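A rough sketch of that transform step, written in Scala against the Kafka Streams Processor API. The store name "pending-olds", the "&null" string convention (from the "$value1&$value2" joiner above), the punctuation interval, and the builder / leftJoined names are all assumptions:

import java.time.Duration
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.{Transformer, TransformerSupplier}
import org.apache.kafka.streams.processor.{ProcessorContext, PunctuationType, Punctuator}
import org.apache.kafka.streams.state.{KeyValueStore, Stores}

class SuspiciousIdTransformer(storeName: String, timeoutMs: Long)
    extends Transformer[String, String, KeyValue[String, String]] {

  private var context: ProcessorContext = _
  private var pending: KeyValueStore[String, java.lang.Long] = _

  override def init(ctx: ProcessorContext): Unit = {
    context = ctx
    pending = ctx.getStateStore(storeName).asInstanceOf[KeyValueStore[String, java.lang.Long]]
    // scan the store periodically for "old" ids whose "new" never arrived
    ctx.schedule(Duration.ofSeconds(30), PunctuationType.STREAM_TIME, new Punctuator {
      override def punctuate(now: Long): Unit = {
        val iter = pending.all()
        while (iter.hasNext) {
          val entry = iter.next()
          if (now - entry.value >= timeoutMs) {
            context.forward(entry.key, "no matching new within the window") // report as suspicious
            pending.delete(entry.key)
          }
        }
        iter.close()
      }
    })
  }

  // value is the left-join result, e.g. "old&new" or "old&null"
  override def transform(key: String, value: String): KeyValue[String, String] = {
    if (value.endsWith("&null")) pending.put(key, context.timestamp()) // remember the unmatched old
    else pending.delete(key)                                           // matched later, forget it
    null // nothing emitted inline; suspicious ids are emitted from the punctuation
  }

  override def close(): Unit = ()
}

// wiring: register the store and attach it to transform()
builder.addStateStore(Stores.keyValueStoreBuilder(
  Stores.inMemoryKeyValueStore("pending-olds"), Serdes.String(), Serdes.Long()))

leftJoined.transform(new TransformerSupplier[String, String, KeyValue[String, String]] {
  override def get(): Transformer[String, String, KeyValue[String, String]] =
    new SuspiciousIdTransformer("pending-olds", Duration.ofMinutes(2).toMillis)
}, "pending-olds")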
If I understand your question correctly, you only want to report ids as suspicious when there is an "old" without a corresponding "new" within the 2-minute window.
If that's the case, you'll want to use a left join:
val leftJoined = oldStream.leftJoin(newStream,...).filter(condition where value expected from "new" stream is null);
HTH
This looks like what you were looking for: Kafka Streams left outer join on timeout.
It addresses the lack of SQL-like left-join semantics in the Kafka Streams framework: the implementation generates a left-join event only if the full join event didn't happen within the join window interval.

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won’t be able to join with a windowed key, instead of the table above I am mapping the stream to a table with a simple key:
t1_Stream.groupByKey()
    .windowedBy(TimeWindows.of(n * 1000)).count()
    .toStream()
    .map((k, v) -> new KeyValue<>(k.key(), Math.toIntExact(v)))
    .to(frequency_topic);
KTable<Integer, Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up in this table when a new key arrives in my stream, how do I know whether this lookup table will be updated first or the join will occur first (which would cause the stale frequency to be added to the record rather than the current, updated one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic...
When the record is processed, it updates the aggregation result, which emits an update record that is written to the through-topic. Directly afterwards, the record is processed by the join operator. Only later will a new poll() call eventually read the aggregation result from the through-topic and update the table side of the join.
Using the DSL, it does not seem possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join with the semantics you need.
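For illustration, here is one way such a Transformer could look, written in Scala against the Java API. It first records the new occurrence in a window store and only then reads the frequency back, so the count it forwards can never be stale relative to the record being enriched. The store name "freq-store", the Integer value type, and the FrequencyEnricher name are made up; builder, t1_Stream and n are taken from your snippet:

import java.time.{Duration, Instant}
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.{ValueTransformerWithKey, ValueTransformerWithKeySupplier}
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.state.{Stores, WindowStore}

class FrequencyEnricher(storeName: String, windowMs: Long)
    extends ValueTransformerWithKey[Integer, Integer, (Integer, Int)] {

  private var context: ProcessorContext = _
  private var occurrences: WindowStore[Integer, Integer] = _

  override def init(ctx: ProcessorContext): Unit = {
    context = ctx
    occurrences = ctx.getStateStore(storeName).asInstanceOf[WindowStore[Integer, Integer]]
  }

  override def transform(key: Integer, value: Integer): (Integer, Int) = {
    val now = context.timestamp()
    occurrences.put(key, value, now) // record this occurrence first ...
    // ... then count the occurrences of this key in the last windowMs milliseconds
    val iter = occurrences.fetch(key, Instant.ofEpochMilli(now - windowMs), Instant.ofEpochMilli(now))
    var freq = 0
    while (iter.hasNext) { iter.next(); freq += 1 }
    iter.close()
    (value, freq) // the enriched (v, f) pair
  }

  override def close(): Unit = ()
}

// wiring: a window store that retains duplicates, attached to transformValues()
val windowMs = n * 1000L
builder.addStateStore(Stores.windowStoreBuilder(
  Stores.persistentWindowStore("freq-store", Duration.ofMillis(windowMs), Duration.ofMillis(windowMs), true),
  Serdes.Integer(), Serdes.Integer()))

val t1_enriched = t1_Stream.transformValues(
  new ValueTransformerWithKeySupplier[Integer, Integer, (Integer, Int)] {
    override def get(): ValueTransformerWithKey[Integer, Integer, (Integer, Int)] =
      new FrequencyEnricher("freq-store", windowMs)
  }, "freq-store")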

Counting latest state of stateful entities in streaming with Flink

I tried to create my first real-time analytics job in Flink. The approach is kappa-architecture-like, so I have my raw data on Kafka where we receive a message for every change of state of any entity.
So the messages are of the form:
(id,newStatus, timestamp)
We want to compute, for every time window, the count of items in a given status. So the output should be of the form:
(outputTimestamp, state1:count1,state2:count2 ...)
or equivalent. These rows should contain, at any given time, the count of the items in a given status, where the status associated with an id is the one from the most recent message observed for that id. The status for an id should be counted in any case, even if the event is much older than those currently being processed. So the sum of all the counts should be equal to the number of different ids observed in the system. The following step could be forgetting about items in a final state after a while, but this is not a strict requirement right now.
This will be written on elasticsearch and then queried.
I tried many different paths and none of them completely satisfied the requirement. Using a sliding window I could easily achieve the expected behaviour, except that when the beginning of the sliding window surpassed the timestamp of an event, it was lost for the count, as you may expect. Other approaches failed to be consistent when working with a backlog, because I did some tricks with keys and timestamps that failed when the data was processed all at once.
So I would like to know, even at a high level, how I should approach this problem. It looks like a relatively common use case, but the fact that the relevant information for a given id must be retained indefinitely to count the entities correctly creates a lot of problems.
I think I have a solution for your problem:
Given a DataStream of (id, state, time) as:
val stateUpdates: DataStream[(Long, Int, Long)] = ???
You derive the actual state changes as follows:
val stateCntUpdates: DataStream[(Int, Int)] = stateUpdates // (state, cntUpdate)
  .keyBy(_._1) // key by id
  .flatMap(new StateUpdater)
StateUpdater is a stateful FlatMapFunction. It has a keyed state that stores the last state of each id. For each input record it returns two state count update records: (oldState, -1), (newState, +1). The (oldState, -1) record ensures that counts of previous states are reduced.
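Here is one way StateUpdater could be sketched, assuming Long ids, Int states, and Long timestamps as in the stream type above (the null check covers ids that have not been seen yet):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

class StateUpdater extends RichFlatMapFunction[(Long, Int, Long), (Int, Int)] {

  // last observed state per id; null means "id not seen yet"
  private var lastState: ValueState[java.lang.Integer] = _

  override def open(parameters: Configuration): Unit = {
    lastState = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Integer]("lastState", classOf[java.lang.Integer]))
  }

  override def flatMap(update: (Long, Int, Long), out: Collector[(Int, Int)]): Unit = {
    val newState = update._2
    val old = lastState.value()
    if (old == null || old.intValue() != newState) {
      if (old != null) {
        out.collect((old.intValue(), -1)) // reduce the count of the previous state
      }
      out.collect((newState, +1)) // increase the count of the new state
      lastState.update(newState)
    }
  }
}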
Next you aggregate the state count changes per state and window:
val cntUpdatesPerWindow: DataStream[(Int, Int, Long)] = stateCntUpdates // (state, cntUpdate, time)
  .keyBy(_._1) // key by state
  .timeWindow(Time.minutes(10)) // window should be non-overlapping, e.g. tumbling
  .apply(new SumReducer(), new YourWindowFunction())
SumReducer sums the cntUpdates and YourWindowFunction assigns the timestamp of your window. This step aggregates all state changes for each state in a window.
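For illustration, SumReducer and YourWindowFunction could be sketched like this (assuming the Scala DataStream API and using window.getEnd as the output timestamp):

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class SumReducer extends ReduceFunction[(Int, Int)] {
  // incrementally sum the count updates for one state within the window
  override def reduce(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1, a._2 + b._2)
}

class YourWindowFunction extends WindowFunction[(Int, Int), (Int, Int, Long), Int, TimeWindow] {
  override def apply(state: Int, window: TimeWindow,
      input: Iterable[(Int, Int)], out: Collector[(Int, Int, Long)]): Unit = {
    // with the ReduceFunction as pre-aggregator, input holds a single element
    val (_, cntUpdate) = input.head
    out.collect((state, cntUpdate, window.getEnd)) // attach the window's timestamp
  }
}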
Finally, we adjust the current count with the count updates.
val stateCnts: DataStream[(Int, Int, Long)] = cntUpdatesPerWindow // (state, count, time)
  .keyBy(_._1) // key by state again
  .map(new CountUpdater)
CountUpdater is a stateful MapFunction. It has a keyed state that stores the current count for each state. For each incoming record, the count is adjusted and a record (state, newCount, time) is emitted.
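And CountUpdater could be sketched along the same lines (a keyed running count per state; names match the snippet above):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration

class CountUpdater extends RichMapFunction[(Int, Int, Long), (Int, Int, Long)] {

  // current count per state; null means "no count yet", i.e. 0
  private var count: ValueState[java.lang.Integer] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Integer]("count", classOf[java.lang.Integer]))
  }

  override def map(in: (Int, Int, Long)): (Int, Int, Long) = {
    val (state, cntUpdate, time) = in
    val current = if (count.value() == null) 0 else count.value().intValue()
    val newCount = current + cntUpdate
    count.update(newCount)
    (state, newCount, time) // emit the adjusted running count
  }
}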
Now you have a stream with new counts for each state (one record for each state). If possible, you can update your Elasticsearch index using these records. If you need to collect all state counts for a given time you can do that using a window.
Please note: The state size of this program depends on the number of unique ids. That might cause problems if the id space grows very fast.

Spark mapWithState updated states output

After upgrading to Spark 1.6.1, I've started refactoring an application to replace updateStateByKey with mapWithState.
In order to benefit from the performance advantages of the new API, I don't want to call stateSnapshots, which loads all states. I only want the updated states.
The mapWithState API returns a DStream of [key, input, state, output], where each state is the partially updated state after an input is ingested. How can I extract the latest states alone from this DStream (i.e. the state after all corresponding inputs have been ingested / mapped)?
I can do a map (to drop the input and output) and a reduceByKey on the MapWithStateDStream, choosing the state with the newer timestamp (which I set inside the update function), but I have no guarantee there won't be two partial states with the same timestamp, even when using a custom per-key partitioner.
How can I tell which partial state is the latest in the MapWithStateDStream output of mapWithState?
mapWithState will only be invoked for states that are being updated in the current micro batch. One way to achieve what you want is to return a Some[S] in case the state has been updated.
StateSpec.function takes a method with the following signature:
mappingFunction: (Time, KeyType, Option[ValueType], State[StateType]) => Option[MappedType]
What we can do is make sure that our Option[MappedType] is always Some[MappedType] when the value has been updated, otherwise None.
For example:
def updateState(key: Int, value: Option[Int], state: State[Int]): Option[Int] = {
  value match {
    case Some(something) if something > 10 =>
      val updatedVal = something * something
      state.update(updatedVal)
      Some(updatedVal)
    case _ => None
  }
}
And then you can do:
val spec = StateSpec.function(updateState _)
stream.mapWithState(spec).filter(!_.isEmpty).foreachRDD(/* do stuff on updated state */) // stream: your keyed input DStream[(Int, Int)]
This way you filter out any non-updated state and keep only the updated snapshots you're looking for.
One solution that would work, if your update algorithm allows it, is to call reduceByKey on the input stream before calling mapWithState. Then there would be only a single update for each key and no partial states in the output.
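For example, with the Int-valued updates from the snippet above (a sketch; pairStream stands for your keyed input DStream and spec is the StateSpec from above):

// collapse all updates for a key into one record per batch, so that
// mapWithState emits at most one (final) partial state per key
val singleUpdatePerBatch = pairStream.reduceByKey(_ + _)
singleUpdatePerBatch.mapWithState(spec)
  .filter(!_.isEmpty)
  .foreachRDD(rdd => rdd.foreach(println)) // or write the updated states wherever you need them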