Kafka Streams join unrelated streams - apache-kafka

I have a stream of events i need to match against a ktable / changelog topic but the matching is done by pattern matching on a property of the ktable entries. so i cannot join the streams based on a key since i dont know yet which one is matching.
example:
ktable X:
{
[abc]: {id: 'abc', prop: 'some pattern'},
[efg]: {id: 'efg', prop: 'another pattern'}
}
stream A:
{ id: 'xyz', match: 'some pattern'}
so stream A should forward something like {match: 'abc'}
So i basically need to iterate over the ktable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the ktable and then access it from the processor API and iterate over the entries?
I could also aggregate all the entries of the ktable into 1 collection and then join on a 'fake' key? But this seems also rather hacky.
Or am i just forcing something which is not really streams and rather just put it into a redis cache with the normal consumer API, which is also kinda awkward since i rather have it backed by rocksDB.
edit: i guess this is kinda related to this question

A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as key, and do an aggregation that returns a collection (ie, List, Set, etc). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join-result into multiple records (depending on how many records are in the collection of the table).
As long as your join attribute has not too many duplicates (for the table input topic), and thus the value side collection in the table does not grow too large, this should work fine. You will need to provide a custom value-Serde to (de)serialize the collection data.

Normally I would map the table data so I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case, the stream key was the first part of the table key, so we could group by that first key part and aggregate the results in a list. At the end it looked something like this.
final KTable<String, ArrayList<String>> theTable = builder
.table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
.groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
.aggregate(
ArrayList::new,
(key, value, list) -> {
list.add(value);
return list;
},
(key, value, list) -> {
list.remove(value);
return list;
},
Materialized.with(Serdes.String(), stringListSerde));
final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);
theStream
.join(theTable, (streamEvent, tableEventList) -> tableEventList)
.flatMapValues(value -> value)
.map(this::doStuff)
.to(TARGET_TOPIC);
I am not sure, if this is also possible for you, meaning, maybe it is possible for you to map the table data in some way to to the join.
I know this does not completely belong to your case, but I hope it might be of some help anyway. Maybe you can clarify a bit, how the matching would look like for your case.

Related

Kafka Streams leftJoin - only right null if no join within window [duplicate]

I have a Kafka topic where I expect messages with two different key types: old and new.
i.e. "1-new", "1-old", "2-new", "2-old". Keys are unique, but some might be missing.
Now using Kotlin and KafkaStreams API I can log those messages with have same key id from new and old.
val windows = JoinWindows.of(Duration.of(2, MINUTES).toMillis())
val newStream = stream.filter({ key, _ -> isNew(key) })
.map({key, value -> KeyValue(key.replace(NEW_PREFIX, ""), value) })
val oldStream = stream.filter({ key, _ -> isOld(key) })
.map({key, value -> KeyValue(key.replace(OLD_PREFIX, ""), value) })
val joined = newStream.join(oldStream,
{ value1, value2 -> "$value1&$value2" }, windows)
joined.foreach({ key, value ->
log.info { "JOINED $key : $value" }
})
Now I want to know new/old keys which are missing in time window for some reason. Is it possible to achieve with KafkaStreams API?
In my case when key "1-old" is received and "1-new" is not within 2 minutes only in this case I want to report id 1 as suspicious.
The DSL might not give you what you want. However, you can use Processor API. Having say this, the leftJoin can actually be used to do the "heavy lifting". Thus, after the leftJoin you can use .transform(...) with an attached state to "clean up" the data further.
For each old&null record you receive, put it into the store. If you receive a later old&new you can remove it from the store. Furthermore, you register a punctuation and on each punctuation call, you scan the store for entries that are "old enough" so you are sure no later old&new join result will be produced. For those entries, you emit old&null and remove from them from the store.
As an alternative, you can also omit the join, and do everything in a single transform() with state. For this, you would need to KStream#merge() old and new stream and call transform() on the merged stream.
Note: instead of registering a punctuation, you can also put the "scan logic" into the transform and execute it each time you process a record.
If I understand your question correctly you only want to report id's as suspicious when there is an "old" without a corresponding "new" within the 2-minute window.
If that's the case you'll want to use a left join :
val leftJoined = oldStream.leftJoin(newStream,...).filter(condition where value expected from "new" stream is null);
HTH
Looks like what you were looking for. Kafka Streams left outer join on timeout
Eliminates the lack of sql-like left join semantic in kafka streams framework. This implementation will generate left join event only if full join event didn't happen in join window duration interval.

kafka stream to ktable join

I am trying to join a
KStream: created from a topic, the topic has JSON value. I re-key the stream using
two attributed from the value. example value (snippet of the json). I created a custom pojo class and use a custom serdes.{"value":"0","time":1.540753118800291E9,,"deviceIp":"111.111.111.111","deviceName":"KYZ1","indicatorName":"ifHCInOctets"}
keys are mapped as:
map((key, value) -> KeyValue.pair(value.deviceName+value.indicatorName, value))
I do a peek on the KStream and prints both key
and the attributes I used.
Looks all good.
KTable: I create a ktable from a topic, I am writing to the topic using a python script and the key for the topic is KYZ1ifHCInOctets, the combination of device name and indicator name (from above).
I do a
toStream and then a peek on the resulting stream. Keys and values all seems
fine.
Now when i do a inner join and do a peek or through/to a topic i see the key and values are mismatched. Join doesn't seems to work,
KStream<String, MyPojoClass> joined= datastream.join(table,
(data,table)->data
,Joined.with(Serdes.String(),myCustomSerde,Serdes.String())
);
key = XYZ1s1_TotalDiscards
Value = {"deviceName":"ABC2", "indicatorName":"jnxCosQstatTxedBytes"}
I have the exactly the same thing working through ksql, but wanted to do my own stream app.
Now it sounds so stupid to what the error was, my PoJo class had few of the attributes as static :-(, resulting in wrong keys.

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Give a topic (t1) if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won’t be able to join with a Windowed key, instead of the table above I am mapping the stream to a table with simple key:
t1_Stream.groupByKey()
.windowedBy(TimeWindows.of( n*1000)).count()
.toStream().map((k,v)->new KeyValue<>(k.key(), Math.toIntExact(v))).to(frequency_topic);
KTable<Integer,Integer> t1_frequency_table = builder.table(frequency_topic);
If I now lookup in this table when a new key arrives in my stream, how do I know if this lookup table will be updated first or the join will occur first (which will cause the stale frequency to be added in the record rather that the current updated one). Will it be better to create a stream instead of table and then do a windowed join ?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this ? Thanks.
For the particular program you shared, the stream side record will be processed first. The reason is, that you pipe the data through a topic...
When the record is processed, it will update the aggregation result that will emit an update record that is written to the through-topic. Directly afterwards, the record will be processed by the join operator. Only afterwards a new poll() call will eventually read the aggregation result from the through-topic and update the table side of the join.
Using the DSL, it seems not to be possible for achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join that provides the semantics you need.

End-of-window outer join with KafkaStreams

I have a Kafka topic where I expect messages with two different key types: old and new.
i.e. "1-new", "1-old", "2-new", "2-old". Keys are unique, but some might be missing.
Now using Kotlin and KafkaStreams API I can log those messages with have same key id from new and old.
val windows = JoinWindows.of(Duration.of(2, MINUTES).toMillis())
val newStream = stream.filter({ key, _ -> isNew(key) })
.map({key, value -> KeyValue(key.replace(NEW_PREFIX, ""), value) })
val oldStream = stream.filter({ key, _ -> isOld(key) })
.map({key, value -> KeyValue(key.replace(OLD_PREFIX, ""), value) })
val joined = newStream.join(oldStream,
{ value1, value2 -> "$value1&$value2" }, windows)
joined.foreach({ key, value ->
log.info { "JOINED $key : $value" }
})
Now I want to know new/old keys which are missing in time window for some reason. Is it possible to achieve with KafkaStreams API?
In my case when key "1-old" is received and "1-new" is not within 2 minutes only in this case I want to report id 1 as suspicious.
The DSL might not give you what you want. However, you can use Processor API. Having say this, the leftJoin can actually be used to do the "heavy lifting". Thus, after the leftJoin you can use .transform(...) with an attached state to "clean up" the data further.
For each old&null record you receive, put it into the store. If you receive a later old&new you can remove it from the store. Furthermore, you register a punctuation and on each punctuation call, you scan the store for entries that are "old enough" so you are sure no later old&new join result will be produced. For those entries, you emit old&null and remove from them from the store.
As an alternative, you can also omit the join, and do everything in a single transform() with state. For this, you would need to KStream#merge() old and new stream and call transform() on the merged stream.
Note: instead of registering a punctuation, you can also put the "scan logic" into the transform and execute it each time you process a record.
If I understand your question correctly you only want to report id's as suspicious when there is an "old" without a corresponding "new" within the 2-minute window.
If that's the case you'll want to use a left join :
val leftJoined = oldStream.leftJoin(newStream,...).filter(condition where value expected from "new" stream is null);
HTH
Looks like what you were looking for. Kafka Streams left outer join on timeout
Eliminates the lack of sql-like left join semantic in kafka streams framework. This implementation will generate left join event only if full join event didn't happen in join window duration interval.

Spark in Scala: How to avoid linear scan for searching a key in each partition?

I have one huge key-value dataset named A, and a set of keys named B as queries. My task is that for each key in B, return the key exists in A or not, if it exists, return the value.
I partition A by HashParitioner(100) first. Currently I can use A.join(B') to solve it, where B' = B.map(x=>(x,null)). Or we can use A.lookup() for each key in B.
However, the problem is that both join and lookup for PairRDD is linear scan for each partition. This is too slow. As I desire, each partition could be a Hashmap, so that we can find the key in each parition in O(1). So the ideal strategy is that when the master machine receives a bunch of keys, the master assigns each key to its corresponding partition, then the partition uses its Hashmap to find the keys and return the result to the master machine.
Is there an easy way to achieve it?
One potential way:
As I searched online, a similar question is here
http://mail-archives.us.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMwrk0kPiHoX6mAiwZTfkGRPxKURHhn9iqvFHfa4aGj3XJUCNg#mail.gmail.com%3E
As it said, I built the Hashmap for each partition using the code as follows
val hashpair = A.mapPartitions(iterator => {
val hashmap = new HashMap[Long, Double]
iterator.foreach { case (key, value) => hashmap.getOrElseUpdate(key,value) }
Iterator(hashmap)
})
Now I get 100 Hashmap (if I have 100 partitions for data A). Here I'm lost. I don't know how to ask query, how to use the hashpair to search keys in B, since hashpair is not a regular RDD. Do I need to implement a new RDD and implement RDD methods for hashpair? If so, what is the easiest way to implement join or lookup methods for hashpair?
Thanks all.
You're probably looking for the IndexedRDD:
https://github.com/amplab/spark-indexedrdd