I am trying to join a KStream with a KTable.
KStream: created from a topic whose value is JSON. I re-key the stream using two attributes from the value. I created a custom POJO class and use a custom Serde. Example value (a snippet of the JSON):
{"value":"0","time":1.540753118800291E9,"deviceIp":"111.111.111.111","deviceName":"KYZ1","indicatorName":"ifHCInOctets"}
Keys are mapped as:
map((key, value) -> KeyValue.pair(value.deviceName + value.indicatorName, value))
I do a peek on the KStream and print both the key and the attributes I used. It all looks good.
KTable: I create a KTable from a topic. I am writing to that topic using a Python script, and the key is KYZ1ifHCInOctets, the combination of device name and indicator name (from above).
I do a toStream and then a peek on the resulting stream. Keys and values all seem fine.
Now when I do an inner join and peek, or write through/to a topic, I see that the keys and values are mismatched. The join doesn't seem to work:
KStream<String, MyPojoClass> joined = datastream.join(table,
        (data, tableValue) -> data,
        Joined.with(Serdes.String(), myCustomSerde, Serdes.String()));
key = XYZ1s1_TotalDiscards
Value = {"deviceName":"ABC2", "indicatorName":"jnxCosQstatTxedBytes"}
I have exactly the same thing working through KSQL, but wanted to write my own streams app.
It sounds silly now, but the error was that my POJO class had a few of its attributes declared as static :-(, resulting in wrong keys.
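For reference, a stripped-down illustration of the mistake (the class shape here is only a guess at the original POJO; field names are taken from the JSON above):
// Wrong: static fields are shared by every deserialized record, so the map() step
// builds keys from whichever record happened to be deserialized last.
public class MyPojoClass {
    public static String deviceName;     // should be an instance field
    public static String indicatorName;  // should be an instance field
    public String value;
    public double time;
    public String deviceIp;
}
Declaring deviceName and indicatorName as plain instance fields makes value.deviceName + value.indicatorName produce the expected key again.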
Related
I am using KStream and would like to extract and push only a subset of the fields from the key and value from Topic1 to Topic2.
For example, if my value in the message contains fields like id, name, address, phoneno..., I would like to push only id, name, and address to a new topic. This is similar to the ReplaceField (whitelist) transform, but I would like to do it with KStream.
I am able to do this for a single field, but am not sure how to prepare the new value for multiple fields (resulting in a GenericRecord).
You'd use the KStream.map function to get access to both the key and value, then return a new key-value pair.
stream.map((k, v) -> {
    GenericRecord r = new GenericData.Record(outputSchema); // outputSchema: schema containing only the wanted fields
    r.put("name", v.getName()); // for example; repeat for id, address, etc.
    return KeyValue.pair(k, r);
}).to("topic2");
If you're not using an Avro Serde or another generic type, you'd also need to define your own Serde class.
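One way to build the outputSchema referenced above, assuming Avro's SchemaBuilder is on the classpath (the record name "Whitelisted" is just a placeholder):
Schema outputSchema = SchemaBuilder.record("Whitelisted")
        .fields()
        .requiredString("id")
        .requiredString("name")
        .requiredString("address")
        .endRecord();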
I have a stream of events I need to match against a KTable / changelog topic, but the matching is done by pattern matching on a property of the KTable entries. So I cannot join the streams based on a key, since I don't know yet which one matches.
Example:
KTable X:
{
[abc]: {id: 'abc', prop: 'some pattern'},
[efg]: {id: 'efg', prop: 'another pattern'}
}
stream A:
{ id: 'xyz', match: 'some pattern'}
So stream A should forward something like {match: 'abc'}.
I basically need to iterate over the KTable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the KTable and then access it from the Processor API and iterate over the entries?
I could also aggregate all the entries of the KTable into one collection and then join on a 'fake' key, but that also seems rather hacky.
Or am I just forcing something that isn't really a streams use case, and should I rather put it into a Redis cache with the normal consumer API? That also feels awkward, since I would rather have it backed by RocksDB.
Edit: I guess this is somewhat related to this question.
A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as the key, and do an aggregation that returns a collection (i.e., List, Set, etc.). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join result into multiple records (depending on how many records are in the collection from the table).
As long as your join attribute does not have too many duplicates (for the table input topic), and thus the value-side collection in the table does not grow too large, this should work fine. You will need to provide a custom value Serde to (de)serialize the collection data.
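A minimal sketch of that idea for the example above (the topic name "table-x-topic", the xSerde/aSerde/listSerde Serdes, and the getProp()/getId()/getMatch() accessors are assumptions, not code from the question):
// Re-key the table's input topic by the join attribute and collect the matching ids per attribute.
KTable<String, ArrayList<String>> byProp = builder
        .stream("table-x-topic", Consumed.with(Serdes.String(), xSerde))
        .selectKey((id, entry) -> entry.getProp())
        .groupByKey(Grouped.with(Serdes.String(), xSerde))
        .aggregate(ArrayList::new,
                (prop, entry, ids) -> { ids.add(entry.getId()); return ids; },
                Materialized.with(Serdes.String(), listSerde));

// Key stream A by its 'match' attribute, join on it, then split the collection again.
streamA
        .selectKey((id, event) -> event.getMatch())
        .join(byProp, (event, ids) -> ids, Joined.with(Serdes.String(), aSerde, listSerde))
        .flatMapValues(ids -> ids)
        .to("matches");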
Normally I would map the table data so I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case the stream key was the first part of the table key, so we could group by that first key part and aggregate the results into a list. In the end it looked something like this.
final KTable<String, ArrayList<String>> theTable = builder
        .table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
        .groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
        .aggregate(
                ArrayList::new,
                (key, value, list) -> {      // adder
                    list.add(value);
                    return list;
                },
                (key, value, list) -> {      // subtractor
                    list.remove(value);
                    return list;
                },
                Materialized.with(Serdes.String(), stringListSerde));

final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);

theStream
        .join(theTable, (streamEvent, tableEventList) -> tableEventList)
        .flatMapValues(value -> value)   // one output record per list entry
        .map(this::doStuff)
        .to(TARGET_TOPIC);
I am not sure if this is also possible for you; maybe you can map the table data in some way to get the join key you need.
I know this does not completely match your case, but I hope it is of some help anyway. Maybe you can clarify a bit what the matching would look like in your case.
Is there a way to pass in, or get access to, the message key from the join section in a Kafka Streams DSL join?
I have something like this right now:
KStream<String, GenericRecord> completedEventsStream = inputStartKStream.join(
        inputEndKStream,
        (leftValue, rightValue) -> customLambda((Record) leftValue, (Record) rightValue),
        JoinWindows.of(windowDuration),
        Joined.with(stringSerde, genericAvroSerde, genericAvroSerde));
However, the leftValue and rightValue records passed into customLambda don't have access to the Kafka message key, because that's a separate string. The only content they have is the message itself, not the key.
Is there a way to get access to the key from inside the join lambda? One thing I could do is simply add the message key as part of the message itself and access it as a regular field there, but I was wondering whether the framework provides a way to access it directly.
Most of the time the key is also available in the value of the record; is this not the case for your app?
It looks like the ValueJoiner interface has an improvement filed as part of KIP-149, but hasn't been completed like the other methods in that KIP: ValueTransformer and ValueMapper.
You could add a step before your join that extracts the key and includes it in the value of your message, using ValueMapperWithKey, before doing the join.
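For example (KeyedEvent is a hypothetical wrapper class, not something from your code):
// Hypothetical wrapper that carries the key alongside the original value.
public class KeyedEvent {
    public final String key;
    public final GenericRecord event;
    public KeyedEvent(String key, GenericRecord event) { this.key = key; this.event = event; }
}

// Copy the message key into the value on both sides, so the ValueJoiner can see it.
KStream<String, KeyedEvent> startWithKey =
        inputStartKStream.mapValues((readOnlyKey, value) -> new KeyedEvent(readOnlyKey, value));
KStream<String, KeyedEvent> endWithKey =
        inputEndKStream.mapValues((readOnlyKey, value) -> new KeyedEvent(readOnlyKey, value));
You'd then join startWithKey and endWithKey instead, and provide a Serde for the wrapper in Joined.with(...).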
I have a Kafka topic where I expect messages with two different key types: old and new.
i.e. "1-new", "1-old", "2-new", "2-old". Keys are unique, but some might be missing.
Now, using Kotlin and the Kafka Streams API, I can log the messages that have the same key id from both new and old.
val windows = JoinWindows.of(Duration.of(2, MINUTES).toMillis())
val newStream = stream.filter({ key, _ -> isNew(key) })
.map({key, value -> KeyValue(key.replace(NEW_PREFIX, ""), value) })
val oldStream = stream.filter({ key, _ -> isOld(key) })
.map({key, value -> KeyValue(key.replace(OLD_PREFIX, ""), value) })
val joined = newStream.join(oldStream,
{ value1, value2 -> "$value1&$value2" }, windows)
joined.foreach({ key, value ->
log.info { "JOINED $key : $value" }
})
Now I want to know which new/old keys are missing in the time window for some reason. Is it possible to achieve this with the Kafka Streams API?
In my case, when key "1-old" is received and "1-new" is not received within 2 minutes, only then do I want to report id 1 as suspicious.
The DSL might not give you what you want. However, you can use the Processor API. Having said this, the leftJoin can actually be used to do the "heavy lifting". Thus, after the leftJoin you can use .transform(...) with an attached state store to "clean up" the data further.
For each old&null record you receive, put it into the store. If you receive a later old&new record, you can remove it from the store. Furthermore, you register a punctuation, and on each punctuation call you scan the store for entries that are "old enough", so you are sure no later old&new join result will be produced. For those entries, you emit old&null and remove them from the store.
As an alternative, you can also omit the join and do everything in a single transform() with state. For this, you would need to KStream#merge() the old and new streams and call transform() on the merged stream.
Note: instead of registering a punctuation, you can also put the "scan logic" into transform() and execute it each time you process a record.
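A rough Java sketch of the leftJoin-plus-transform idea, assuming String values, a left join whose ValueJoiner produces values ending in "&null" when the "new" side is missing, and a hypothetical state store named "pending-old" (imports from org.apache.kafka.streams.* omitted, as in the other snippets):
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("pending-old"),
        Serdes.String(), Serdes.Long()));

joined.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
    private ProcessorContext context;
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, Long>) context.getStateStore("pending-old");
        // Periodically emit ids whose "old" record is older than the join window.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    final KeyValue<String, Long> entry = it.next();
                    if (timestamp - entry.value > Duration.ofMinutes(2).toMillis()) {
                        context.forward(entry.key, "suspicious");
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        if (value.endsWith("&null")) {
            store.put(key, context.timestamp());   // old without a matching new (yet)
        } else {
            store.delete(key);                     // old&new arrived within the window
        }
        return null;  // nothing forwarded here; the punctuator emits the suspicious ids
    }

    @Override
    public void close() { }
}, "pending-old");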
If I understand your question correctly, you only want to report ids as suspicious when there is an "old" without a corresponding "new" within the 2-minute window.
If that's the case, you'll want to use a left join:
val leftJoined = oldStream.leftJoin(newStream, { oldValue, newValue -> Pair(oldValue, newValue) }, windows)
    .filter { _, value -> value.second == null } // keep only "old" events without a matching "new"
HTH
This looks like what you were looking for: Kafka Streams left outer join on timeout.
It addresses the lack of SQL-like left join semantics in the Kafka Streams framework: that implementation generates a left-join event only if the full-join event did not happen within the join window interval.
Suppose I have a stream "stream-1" consisting of 1 datapoint every second and I'd like to calculate a derived stream "stream-5" which contains the sum using a hopping window of 5 seconds and another stream "stream-10" which is based off "stream-5" containing the sum using a hopping window of 10 seconds. The aggregation needs to be done for each key separately and I'd like to be able to run each step in a different process. It is not a problem in itself if stream-5 and stream-10 contain updates for the same key/timestamp (so I don't necessarily need How to send final kafka-streams aggregation result of a time windowed KTable?) as long as the last values are correct.
Is there an (easy) way to solve this using the high-level Kafka Streams DSL? So far I fail to see an elegant way to deal with the intermediate updates produced on stream-5 due to the aggregation.
I know the intermediate updates can to some extent be controlled with the cache.max.bytes.buffering and commit.interval.ms settings, but I don't think any setting can guarantee in all cases that no intermediate values will be produced. I could also try converting "stream-5" to a KTable on read, with the timestamp as part of the key, but then it seems KTables do not support windowing operations the way KStreams do.
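(For reference, these are the two settings meant above, set on the streams configuration; as said, they only reduce intermediate updates, they don't eliminate them:)
Properties props = new Properties();
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024); // 10 MB record cache
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000);                  // commit/flush every 30 seconds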
This is what I have so far, and it fails due to the intermediate aggregate values on stream-5:
Reducer<DataPoint> sum = new Reducer<DataPoint>() {
    @Override
    public DataPoint apply(DataPoint x, DataPoint y) {
        return new DataPoint(x.timestamp, x.value + y.value);
    }
};

KeyValueMapper<Windowed<String>, DataPoint, String> strip =
        new KeyValueMapper<Windowed<String>, DataPoint, String>() {
            @Override
            public String apply(Windowed<String> wKey, DataPoint arg1) {
                return wKey.key();
            }
        };

KStream<String, DataPoint> s1 = builder.stream("stream-1");

s1.groupByKey()
    .reduce(sum, TimeWindows.of(5000).advanceBy(5000))
    .toStream()
    .selectKey(strip)
    .to("stream-5");

KStream<String, DataPoint> s5 = builder.stream("stream-5");

s5.groupByKey()
    .reduce(sum, TimeWindows.of(10000).advanceBy(10000))
    .toStream()
    .selectKey(strip)
    .to("stream-10");
Now, if stream-1 contains the inputs (the key is just KEY):
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":1000,"value":1.0}
KEY {"timestamp":2000,"value":1.0}
KEY {"timestamp":3000,"value":1.0}
KEY {"timestamp":4000,"value":1.0}
KEY {"timestamp":5000,"value":1.0}
KEY {"timestamp":6000,"value":1.0}
KEY {"timestamp":7000,"value":1.0}
KEY {"timestamp":8000,"value":1.0}
KEY {"timestamp":9000,"value":1.0}
stream-5 contains the correct (final) values:
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":0,"value":2.0}
KEY {"timestamp":0,"value":3.0}
KEY {"timestamp":0,"value":4.0}
KEY {"timestamp":0,"value":5.0}
KEY {"timestamp":5000,"value":1.0}
KEY {"timestamp":5000,"value":2.0}
KEY {"timestamp":5000,"value":3.0}
KEY {"timestamp":5000,"value":4.0}
KEY {"timestamp":5000,"value":5.0}
but stream-10 is wrong (final value should be 10.0) because it also takes into account the intermediate values on stream-5:
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":0,"value":3.0}
KEY {"timestamp":0,"value":6.0}
KEY {"timestamp":0,"value":10.0}
KEY {"timestamp":0,"value":15.0}
KEY {"timestamp":0,"value":21.0}
KEY {"timestamp":0,"value":28.0}
KEY {"timestamp":0,"value":36.0}
KEY {"timestamp":0,"value":45.0}
KEY {"timestamp":0,"value":55.0}
The problem is that the results of all aggregations are KTables, which means the records produced to their output topics represent a changelog. However, when you subsequently load them as streams, the downstream aggregations will double-count the intermediate updates.
Instead, you need to load the intermediate topics as tables, not streams. However, you will then not be able to use windowed aggregations on them, as those are only available on streams.
You can use the following pattern to accomplish a windowed aggregation over tables instead of streams:
https://cwiki.apache.org/confluence/display/KAFKA/Windowed+aggregations+over+successively+increasing+timed+windows
If you want to run each step in a separate process you can adapt that pattern; just remember to load the intermediate tables using builder.table(), not builder.stream().
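A rough sketch of that pattern for the example above, assuming stream-5 is keyed as "<key>@<5s-window-start>" instead of stripping the window (i.e., the first pipeline would select wKey.key() + "@" + wKey.window().start() as the key), and assuming a DataPoint serde is the configured default value serde:
// Each 5-second window is its own table row, so later intermediate updates
// replace the previous value instead of being added on top of it.
KTable<String, DataPoint> t5 = builder.table("stream-5");

KTable<String, DataPoint> t10 = t5
    .groupBy((windowedKey, dp) -> {
        final String key = windowedKey.substring(0, windowedKey.indexOf('@'));
        final long tenSecondWindowStart = (dp.timestamp / 10000) * 10000;
        return KeyValue.pair(key + "@" + tenSecondWindowStart, dp);
    })
    .reduce(
        (agg, newValue) -> new DataPoint(agg.timestamp, agg.value + newValue.value),  // adder
        (agg, oldValue) -> new DataPoint(agg.timestamp, agg.value - oldValue.value)); // subtractor

t10.toStream().to("stream-10");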