KSQL Join streams with condition on struct field

I have two streams, each defined over a topic on which JSON messages are published, a bit like this:
{"payload": {"some_id": "123"}}
Their corresponding streams are defined like this:
CREATE STREAM mystream
  (payload STRUCT<some_id VARCHAR>)
  WITH (kafka_topic='mytopic', value_format='JSON');
When I try to JOIN the two streams together:
SELECT
s.payload->some_id,
o.payload->other_id
FROM mystream s
LEFT JOIN otherstream o ON s.payload->some_id = o.payload->other_id;
I get the following error:
Invalid comparison expression 'S.PAYLOAD->SOME_ID'
in join '(S.PAYLOAD->SOME_ID = O.PAYLOAD->OTHER_ID)'.
Joins must only contain a field comparison.
Is it not possible to join two streams based on a struct field? Do I first need to publish a stream that flattens each source stream before I can perform the JOIN?

Correct, this is not currently possible. Feel free to upvote the issue tracking it here: https://github.com/confluentinc/ksql/issues/4051
As you say, the workaround is to flatten it in another stream first and then join it.
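A minimal sketch of that workaround, using the names from the question (the PARTITION BY re-keys each stream by the flattened field, which the join needs, and a stream-stream join also requires a WITHIN window):

CREATE STREAM mystream_flat AS
  SELECT payload->some_id AS some_id
  FROM mystream
  PARTITION BY some_id;

CREATE STREAM otherstream_flat AS
  SELECT payload->other_id AS other_id
  FROM otherstream
  PARTITION BY other_id;

SELECT s.some_id, o.other_id
FROM mystream_flat s
LEFT JOIN otherstream_flat o WITHIN 1 HOURS
  ON s.some_id = o.other_id;

Once the struct fields are plain top-level columns, the join condition is a simple field comparison and the error goes away.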

Related

KSQL JOIN to return only one result with duplicated Keys

I have some KSQL queries running on top of Kafka (currently on version 5.5.2, planning to upgrade to the latest soon).
Imagine you have two streams, stream_A and stream_B, but you may have some duplicate IDs in stream_B.id (for example, retries that the application has logged).
When you do something like:
CREATE STREAM result WITH (KAFKA_TOPIC='some_topic') AS
  SELECT stream_A.field, stream_B.field
  FROM stream_A INNER JOIN stream_B ON stream_A.id = stream_B.id;
Is there a way to tell KSQL that you only want to get the first one of those matches?

Kafka Streams join unrelated streams

I have a stream of events I need to match against a KTable / changelog topic, but the matching is done by pattern matching on a property of the KTable entries, so I cannot join the streams on a key, since I don't know in advance which entry will match.
example:
KTable X:
{
[abc]: {id: 'abc', prop: 'some pattern'},
[efg]: {id: 'efg', prop: 'another pattern'}
}
stream A:
{ id: 'xyz', match: 'some pattern'}
So stream A should forward something like {match: 'abc'}.
So I basically need to iterate over the KTable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the KTable and then access it from the Processor API and iterate over the entries?
I could also aggregate all the entries of the KTable into one collection and then join on a 'fake' key? But that also seems rather hacky.
Or am I just forcing something onto streams that doesn't really fit, and should I rather put it into a Redis cache with the normal consumer API? That is also somewhat awkward, since I would rather have it backed by RocksDB.
Edit: I guess this is somewhat related to this question.
A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as the key, and do an aggregation that returns a collection (i.e., List, Set, etc.). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join result into multiple records (depending on how many records are in the collection of the table).
As long as your join attribute does not have too many duplicates (in the table input topic), and thus the value-side collection in the table does not grow too large, this should work fine. You will need to provide a custom value Serde to (de)serialize the collection data.
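A minimal sketch of that recipe, assuming String values; the topic names, the helpers extractPattern() and extractMatchAttribute(), and listSerde (a custom Serde<List<String>>) are placeholders, not from the question:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();

// Read the table's input topic as a stream, re-key it by the join attribute,
// and collect all entries sharing that attribute into a list.
KTable<String, List<String>> entriesByPattern = builder
    .stream(TABLE_INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
    .selectKey((key, value) -> extractPattern(value))
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .aggregate(
        ArrayList::new,
        (pattern, value, list) -> { list.add(value); return list; },
        Materialized.with(Serdes.String(), listSerde));

// Re-key the event stream by the same attribute, join on it, and fan the
// collection back out into one record per matching table entry.
builder.stream(EVENT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
    .selectKey((key, value) -> extractMatchAttribute(value))
    .join(entriesByPattern, (event, entries) -> entries)
    .flatMapValues(entries -> entries)
    .to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Serdes.String()));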
Normally I would map the table data so that I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case, the stream key was the first part of the table key, so we could group by that first key part and aggregate the results into a list. In the end it looked something like this:
final KTable<String, ArrayList<String>> theTable = builder
    .table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
    .groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
    .aggregate(
        ArrayList::new,
        (key, value, list) -> {   // adder: a new/updated table record arrives
            list.add(value);
            return list;
        },
        (key, value, list) -> {   // subtractor: an old table record is replaced or removed
            list.remove(value);
            return list;
        },
        Materialized.with(Serdes.String(), stringListSerde));

final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);

theStream
    .join(theTable, (streamEvent, tableEventList) -> tableEventList)
    .flatMapValues(value -> value)   // split the list back into individual records
    .map(this::doStuff)
    .to(TARGET_TOPIC);
I am not sure whether this is also possible for you, i.e., maybe you can map the table data in some way to get the join key you need.
I know this does not exactly match your case, but I hope it helps anyway. Maybe you can clarify a bit what the matching would look like in your case.

Get record key with inner join in Kafka Streams DSL

Is there a way to pass in or get access to the message key from the join section in a Kafka Stream DSL join?
I have something like this right now:
KStream<String, GenericRecord> completedEventsStream = inputStartKStream
    .join(
        inputEndKStream,
        (leftValue, rightValue) -> customLambda((Record) leftValue, (Record) rightValue),
        JoinWindows.of(windowDuration),
        Joined.with(stringSerde, genericAvroSerde, genericAvroSerde)
    );
However, the leftValue and rightValue records passed to customLambda don't have access to the Kafka message key, because that is kept separately as a string; the only content they carry is the message value itself, not the key.
Is there a way to get access to the key from inside the join lambda? One thing I could do is simply add the message key as part of the message itself, and access it as a regular field there, but I was wondering if the framework provides a way to access it directly?
Most of the time the key is also available in the value of the record; is this not the case for your app?
It looks like the ValueJoiner interface has an improvement filed as part of KIP-149, but it hasn't been completed yet, unlike the other interfaces in that KIP (ValueTransformer and ValueMapper).
In the meantime, you could add a step before your join that extracts the key and includes it in the value of your message, using mapValues() with a ValueMapperWithKey, as sketched below.
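A minimal sketch of that pre-join step. KeyedRecord, its key()/record() accessors, and keyedRecordSerde are hypothetical placeholders, not part of the Streams API, and customLambda is assumed to be adapted to accept the key as an extra argument:

// Wrap each value together with its key before the join, so the ValueJoiner
// can read the original key out of the value. KeyedRecord is a hypothetical
// wrapper class; keyedRecordSerde is a custom Serde for it.
KStream<String, KeyedRecord> startWithKey =
    inputStartKStream.mapValues((key, value) -> new KeyedRecord(key, value));
KStream<String, KeyedRecord> endWithKey =
    inputEndKStream.mapValues((key, value) -> new KeyedRecord(key, value));

KStream<String, GenericRecord> completedEventsStream = startWithKey.join(
    endWithKey,
    // customLambda now receives the key explicitly (hypothetical signature)
    (left, right) -> customLambda(left.key(), (Record) left.record(), (Record) right.record()),
    JoinWindows.of(windowDuration),
    Joined.with(stringSerde, keyedRecordSerde, keyedRecordSerde));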

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>, Long> t1_velocity_table = t1_Stream.groupByKey().windowedBy(TimeWindows.of(n * 1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won’t be able to join with a Windowed key, instead of the table above I am mapping the stream to a table with simple key:
t1_Stream.groupByKey()
    .windowedBy(TimeWindows.of(n * 1000))
    .count()
    .toStream()
    .map((k, v) -> new KeyValue<>(k.key(), Math.toIntExact(v)))
    .to(frequency_topic);

KTable<Integer, Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up this table when a new key arrives in my stream, how do I know whether the lookup table will be updated first or the join will occur first (the latter would cause the stale frequency to be added to the record rather than the current, updated one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this ? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic:
When the record is processed, it updates the aggregation result, which emits an update record that is written to the through topic. Directly afterwards, the record is processed by the join operator. Only afterwards will a new poll() call eventually read the aggregation result from the through topic and update the table side of the join.
Using the DSL, it does not seem to be possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join and provides the semantics you need.
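A rough sketch of such a Transformer, reusing the Tuple type from the question; it assumes the frequency is maintained in a plain key-value state store, and the store name "frequency-store" and its wiring are placeholders:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Looks up the current frequency for each incoming record directly from the
// state store, so the enrichment always sees the latest count.
class FrequencyEnricher implements Transformer<Integer, Integer, KeyValue<Integer, Tuple<Integer, Integer>>> {
    private KeyValueStore<Integer, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<Integer, Long>) context.getStateStore("frequency-store");
    }

    @Override
    public KeyValue<Integer, Tuple<Integer, Integer>> transform(final Integer key, final Integer value) {
        final Long count = store.get(key); // null if the key has not been seen yet
        return KeyValue.pair(key, new Tuple<>(value, count == null ? 0 : Math.toIntExact(count)));
    }

    @Override
    public void close() {}
}

// Wiring: the store must be attached to the transformer by name, e.g.
// t1_Stream.transform(FrequencyEnricher::new, "frequency-store");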

Kapacitor Join using 'on' Property and Single Dimension

I'm joining streams in Kapacitor using the on property. The join only seems to work if one of the streams has multiple groupBy dimensions, even if only one dimension is actually needed. Why is that?
For example, in the code below, the join won't return anything if floor was removed from .groupBy('building', 'floor'). Why doesn't the join work with building alone?
var building = stream
    |from()
        .measurement('building_power')
        .groupBy('building')

var floor = stream
    |from()
        .measurement('floor_power')
        .groupBy('building', 'floor')

building
    |join(floor)
        .as('building', 'floor')
        .on('building')