I have a Kafka topic where I expect messages with two different key types: old and new.
i.e. "1-new", "1-old", "2-new", "2-old". Keys are unique, but some might be missing.
Using Kotlin and the Kafka Streams API I can log the messages that have the same key id from both the new and the old side:
val windows = JoinWindows.of(Duration.of(2, MINUTES).toMillis())

val newStream = stream.filter { key, _ -> isNew(key) }
    .map { key, value -> KeyValue(key.replace(NEW_PREFIX, ""), value) }
val oldStream = stream.filter { key, _ -> isOld(key) }
    .map { key, value -> KeyValue(key.replace(OLD_PREFIX, ""), value) }

val joined = newStream.join(oldStream,
    { value1, value2 -> "$value1&$value2" }, windows)

joined.foreach { key, value ->
    log.info { "JOINED $key : $value" }
}
Now I want to know which new/old keys are missing within the time window for some reason. Is it possible to achieve this with the Kafka Streams API?
In my case, when key "1-old" is received and "1-new" is not received within 2 minutes, only then do I want to report id 1 as suspicious.
The DSL might not give you what you want. However, you can use the Processor API. Having said this, the leftJoin can actually be used to do the "heavy lifting". Thus, after the leftJoin you can use .transform(...) with an attached state store to "clean up" the data further.
For each old&null record you receive, put it into the store. If you receive a later old&new record, you can remove it from the store. Furthermore, you register a punctuation, and on each punctuation call you scan the store for entries that are "old enough" that you can be sure no later old&new join result will be produced. For those entries, you emit old&null and remove them from the store.
As an alternative, you can also omit the join, and do everything in a single transform() with state. For this, you would need to KStream#merge() the old and new streams and call transform() on the merged stream.
Note: instead of registering a punctuation, you can also put the "scan logic" into the transform and execute it each time you process a record.
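To make this more concrete, here is a minimal Kotlin sketch of such a transform-with-state clean-up step. The class name, store name, punctuation interval, and the "&null"/"SUSPICIOUS" markers are placeholder assumptions, and it presumes the leftJoin's ValueJoiner builds "$oldValue&$newValue":

import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.Transformer
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.processor.PunctuationType
import org.apache.kafka.streams.state.KeyValueStore
import java.time.Duration

// Placeholder store name; register a key-value store under this name with
// StreamsBuilder#addStateStore() and pass it to transform().
const val SUSPICIOUS_STORE = "suspicious-candidates"

class MissingNewDetector(private val windowMs: Long) :
    Transformer<String, String, KeyValue<String, String>?> {

    private lateinit var context: ProcessorContext
    private lateinit var store: KeyValueStore<String, Long>

    override fun init(context: ProcessorContext) {
        this.context = context
        @Suppress("UNCHECKED_CAST")
        store = context.getStateStore(SUSPICIOUS_STORE) as KeyValueStore<String, Long>
        // Scan the store periodically (stream time, so "old enough" is judged in the
        // same time domain as the join window); 30 seconds is an arbitrary interval.
        context.schedule(Duration.ofSeconds(30), PunctuationType.STREAM_TIME) { now ->
            val expired = mutableListOf<String>()
            store.all().use { iter ->
                iter.forEach { entry ->
                    if (now - entry.value > windowMs) {
                        // No old&new result can arrive for this id anymore: report it.
                        context.forward(entry.key, "SUSPICIOUS")
                        expired.add(entry.key)
                    }
                }
            }
            expired.forEach { store.delete(it) }
        }
    }

    override fun transform(key: String, value: String): KeyValue<String, String>? {
        if (value.endsWith("&null")) {
            // old&null candidate: remember it together with its record timestamp
            store.putIfAbsent(key, context.timestamp())
        } else {
            // a real old&new join result arrived within the window: not suspicious
            store.delete(key)
        }
        return null // suspicious ids are emitted from the punctuation only
    }

    override fun close() {}
}

You would register a key-value store under that name via builder.addStateStore(...) and attach the transformer with something like leftJoined.transform({ MissingNewDetector(Duration.ofMinutes(2).toMillis()) }, SUSPICIOUS_STORE).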
If I understand your question correctly, you only want to report ids as suspicious when there is an "old" without a corresponding "new" within the 2-minute window.
If that's the case, you'll want to use a left join:
val leftJoined = oldStream.leftJoin(newStream,
        { oldValue: String, newValue: String? -> Pair(oldValue, newValue) }, windows)
    .filter { _, value -> value.second == null } // the value expected from the "new" stream never arrived
HTH
Looks like what you were looking for: Kafka Streams left outer join on timeout.
It addresses the lack of SQL-like left join semantics in the Kafka Streams framework. This implementation will generate a left join event only if the full join event didn't happen within the join window interval.
I have a stream of events I need to match against a KTable / changelog topic, but the matching is done by pattern matching on a property of the KTable entries. So I cannot join the streams based on a key, since I don't know yet which entry will match.
Example:
KTable X:
{
  [abc]: {id: 'abc', prop: 'some pattern'},
  [efg]: {id: 'efg', prop: 'another pattern'}
}
Stream A:
{ id: 'xyz', match: 'some pattern'}
So stream A should forward something like {match: 'abc'}.
So I basically need to iterate over the KTable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the KTable, access it from the Processor API, and iterate over the entries?
I could also aggregate all the entries of the KTable into one collection and then join on a 'fake' key? But this also seems rather hacky.
Or am I just forcing something that is not really a streams use case, and should I rather put it into a Redis cache with the normal consumer API? That is also kind of awkward, since I would rather have it backed by RocksDB.
Edit: I guess this is kind of related to this question.
A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as the key, and do an aggregation that returns a collection (i.e., a List, Set, etc.). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join result into multiple records (depending on how many records are in the collection of the table).
As long as your join attribute does not have too many duplicates (in the table input topic), and thus the value-side collection in the table does not grow too large, this should work fine. You will need to provide a custom value Serde to (de)serialize the collection data.
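For illustration, a minimal Kotlin sketch of that idea, assuming Kafka Streams 2.x, String values, that the match can be expressed as equality on an extracted attribute, and placeholder topic names, helper functions and a listSerde you would supply yourself:

import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Aggregator
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.Grouped
import org.apache.kafka.streams.kstream.Initializer
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.kstream.Produced

// Placeholder helpers: extract the join attribute from the table record / event.
fun extractPattern(tableValue: String): String = TODO("parse 'prop' from the table record")
fun extractMatchAttribute(event: String): String = TODO("parse 'match' from the event")

fun buildPatternJoin(builder: StreamsBuilder, listSerde: Serde<ArrayList<String>>) {
    // Read the table's input topic as a plain KStream, re-key it by the join
    // attribute, and collect all records per attribute into a list.
    val tableByPattern = builder
        .stream("table-topic", Consumed.with(Serdes.String(), Serdes.String()))
        .groupBy({ _, value -> extractPattern(value) }, Grouped.with(Serdes.String(), Serdes.String()))
        .aggregate(
            Initializer { arrayListOf<String>() },
            Aggregator { _: String, value: String, list: ArrayList<String> ->
                list.add(value)
                list
            },
            Materialized.with(Serdes.String(), listSerde))

    // Re-key the event stream by the same attribute, join on it, and split the
    // collection back into individual records.
    builder.stream("event-topic", Consumed.with(Serdes.String(), Serdes.String()))
        .selectKey { _, value -> extractMatchAttribute(value) }
        .join(tableByPattern) { _, matches -> matches }
        .flatMapValues { matches -> matches }
        .to("output-topic", Produced.with(Serdes.String(), Serdes.String()))
}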
Normally I would map the table data so I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case, the stream key was the first part of the table key, so we could group by that first key part and aggregate the results into a list. In the end it looked something like this:
// Group the table by the first part of its key and collect all values into a list.
final KTable<String, ArrayList<String>> theTable = builder
        .table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
        .groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
        .aggregate(
                ArrayList::new,
                // adder: a record was added or updated for this key
                (key, value, list) -> {
                    list.add(value);
                    return list;
                },
                // subtractor: a record was removed or its old value replaced
                (key, value, list) -> {
                    list.remove(value);
                    return list;
                },
                Materialized.with(Serdes.String(), stringListSerde));

final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);

// Join on the re-grouped key, then split the list back into individual records.
theStream
        .join(theTable, (streamEvent, tableEventList) -> tableEventList)
        .flatMapValues(value -> value)
        .map(this::doStuff)
        .to(TARGET_TOPIC);
I am not sure if this is also possible for you, meaning, maybe it is possible for you to map the table data in some way to fit the join.
I know this does not completely match your case, but I hope it might be of some help anyway. Maybe you can clarify a bit how the matching would look for your case.
Is there a way to pass in, or get access to, the message key from the join section in a Kafka Streams DSL join?
I have something like this right now:
KStream<String, GenericRecord> completedEventsStream = inputStartKStream
        .join(
                inputEndKStream,
                (leftValue, rightValue) -> customLambda((Record) leftValue, (Record) rightValue),
                JoinWindows.of(windowDuration),
                Joined.with(stringSerde, genericAvroSerde, genericAvroSerde)
        );
However, the leftValue and rightValue records passed in to customLambda don't have access to the kafka message key, because that's a separate string. The only content they have is the message itself, not the key.
Is there a way to get access to the key from inside the join lambda? One thing I could do is simply add the message key as part of the message itself, and access it as a regular field there, but I was wondering if the framework provides a way to access it directly?
Most of the time the key is also available in the value of the record; is this not the case for your app?
It looks like the ValueJoiner interface has an improvement filed as part of KIP-149, but it hasn't been completed yet, unlike the other interfaces in that KIP: ValueTransformer and ValueMapper.
You could add a step before your join that extracts the key and includes it in the value of your message, using ValueMapperWithKey.
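A minimal Kotlin sketch of that extra step. The KeyedRecord wrapper, function and parameter names are made up for illustration, and you would also have to provide a Serde for the wrapper (e.g. via Joined) since the join materializes its inputs:

import org.apache.avro.generic.GenericRecord
import org.apache.kafka.streams.kstream.JoinWindows
import org.apache.kafka.streams.kstream.KStream
import org.apache.kafka.streams.kstream.ValueMapperWithKey
import java.time.Duration

// Hypothetical wrapper carrying the message key alongside the original Avro value.
data class KeyedRecord(val key: String, val record: GenericRecord)

fun joinWithKeyAccess(
    inputStartKStream: KStream<String, GenericRecord>,
    inputEndKStream: KStream<String, GenericRecord>,
    windowDuration: Duration
): KStream<String, String> {
    // mapValues() has a ValueMapperWithKey overload, so the key is readable here
    // (read-only: changing it would require a repartition via map()/selectKey()).
    val startWithKey = inputStartKStream
        .mapValues(ValueMapperWithKey { key: String, value: GenericRecord -> KeyedRecord(key, value) })
    val endWithKey = inputEndKStream
        .mapValues(ValueMapperWithKey { key: String, value: GenericRecord -> KeyedRecord(key, value) })

    // The ValueJoiner can now read the key from either wrapped value.
    return startWithKey.join(
        endWithKey,
        { left, right -> "key=${left.key}, start=${left.record}, end=${right.record}" },
        JoinWindows.of(windowDuration))
}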
I am trying to join a KStream with a KTable.

KStream: created from a topic whose value is JSON. I re-key the stream using two attributes from the value; I created a custom POJO class and use a custom Serde. Example value (snippet of the JSON):

{"value":"0","time":1.540753118800291E9,"deviceIp":"111.111.111.111","deviceName":"KYZ1","indicatorName":"ifHCInOctets"}

Keys are mapped as:

map((key, value) -> KeyValue.pair(value.deviceName + value.indicatorName, value))

I do a peek on the KStream and print both the key and the attributes I used. Looks all good.

KTable: I create a KTable from a topic. I am writing to that topic using a Python script, and the key is KYZ1ifHCInOctets, the combination of device name and indicator name (from above). I do a toStream and then a peek on the resulting stream. Keys and values all seem fine.

Now when I do an inner join and peek, or write through/to a topic, I see that keys and values are mismatched. The join doesn't seem to work:
KStream<String, MyPojoClass> joined = datastream.join(table,
        (data, tableValue) -> data,
        Joined.with(Serdes.String(), myCustomSerde, Serdes.String()));
key = XYZ1s1_TotalDiscards
Value = {"deviceName":"ABC2", "indicatorName":"jnxCosQstatTxedBytes"}
I have exactly the same thing working through KSQL, but I wanted to write my own streams app.
Now it sounds so stupid, but the error was that my POJO class had a few of the attributes declared as static :-(, resulting in wrong keys.
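For anyone wondering why static fields cause this: a static field is shared by every instance, so each newly deserialized record silently overwrites the value for all earlier ones. A made-up Kotlin illustration (not the original POJO):

// Made-up class for illustration only; 'deviceName' being static (a companion-object
// property in Kotlin, a static field in Java) is the bug.
class Metric {
    companion object {
        var deviceName: String = ""   // WRONG: one value shared by all records
    }
    var indicatorName: String = ""    // OK: a normal per-instance field
}

fun main() {
    val first = Metric().apply { indicatorName = "ifHCInOctets" }
    Metric.deviceName = "KYZ1"        // "deserializing" record 1 sets the shared field
    val second = Metric().apply { indicatorName = "jnxCosQstatTxedBytes" }
    Metric.deviceName = "ABC2"        // "deserializing" record 2 overwrites it for BOTH
    // Any key built as deviceName + indicatorName now uses "ABC2" for every record,
    // so keys and values end up mismatched after the join.
    println(Metric.deviceName + first.indicatorName)   // ABC2ifHCInOctets
    println(Metric.deviceName + second.indicatorName)  // ABC2jnxCosQstatTxedBytes
}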
We have a Kafka message in a single topic for each event a user does on our platform. Each of the events / Kafka messages has a common field userId. We now want to know from that topic how many unique users we had every hour. So we are not interested in the event types or in individual per-user counts; we just want to know how many unique users were active in every hour.
What is the easiest way to achieve this? My current idea does not seem very simple; see the pseudo code here:
stream
    .selectKey() // userId
    .groupByKey() // group by userId, results in a KGroupedStream[UserId, Value]
    .aggregate( // initializer, merger and accumulator simply deliver a constant value; the message is now just a tick for that userId key
        TimeWindows.of(3600000)
    ) // result of aggregate is KTable[Windowed[UserId], Const]
    .toStream() // convert to a stream to be able to map the key in the next step
    .map() // map the key only: Windowed[UserId] becomes key = startMs of the window, value = UserId
    .groupByKey() // group by startMs of the window, which was selected as key before
    .count() // results in a KTable from startMs of the window to the count of users (== unique userIds)
Is there an easier way? I am probably overlooking something.
There are two things you can do:
Merge selectKey() and groupByKey() into groupBy()
You don't need the toStream().map() step; you can regroup with a new key directly on the first KTable
Something like this:
stream.groupBy(/* put a KeyValueMapper that returns the grouping key */)
      .aggregate(..., TimeWindows.of(TimeUnit.HOURS.toMillis(1)))
      .groupBy(/* put a KeyValueMapper that returns the new grouping key */)
      .count()
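For completeness, a sketch of that shape in Kotlin, assuming Kafka Streams 2.x, String-serialized events, placeholder topic names, and a hypothetical extractUserId() helper; the constant aggregate value is just a "tick" per user and window:

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Aggregator
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.Grouped
import org.apache.kafka.streams.kstream.Initializer
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.kstream.TimeWindows
import java.time.Duration

// Hypothetical helper: pulls the userId field out of the event value.
fun extractUserId(event: String): String = TODO("parse userId from the event")

fun uniqueUsersPerHour(builder: StreamsBuilder) {
    builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
        // selectKey() + groupByKey() merged into a single groupBy()
        .groupBy({ _, event -> extractUserId(event) }, Grouped.with(Serdes.String(), Serdes.String()))
        .windowedBy(TimeWindows.of(Duration.ofHours(1)))
        // the aggregate value is a constant; each user is just a "tick" per window
        .aggregate(
            Initializer { "" },
            Aggregator { _: String, _: String, _: String -> "" },
            Materialized.with(Serdes.String(), Serdes.String()))
        // regroup the windowed KTable directly: key = window start, value = userId
        .groupBy(
            { windowedUserId, _ -> KeyValue.pair(windowedUserId.window().start(), windowedUserId.key()) },
            Grouped.with(Serdes.Long(), Serdes.String()))
        // one table entry per (window, userId) pair, so count() == number of unique users
        .count()
        .toStream()
        .to("unique-users-per-hour", Produced.with(Serdes.Long(), Serdes.Long()))
}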