2 step windowed aggregation with Kafka Streams DSL - apache-kafka

Suppose I have a stream "stream-1" consisting of one datapoint every second, and I'd like to calculate a derived stream "stream-5" which contains the sum using a hopping window of 5 seconds, and another stream "stream-10" which is based on "stream-5" and contains the sum using a hopping window of 10 seconds. The aggregation needs to be done for each key separately, and I'd like to be able to run each step in a different process. It is not a problem in itself if stream-5 and stream-10 contain updates for the same key/timestamp (so I don't necessarily need "How to send final kafka-streams aggregation result of a time windowed KTable?") as long as the last values are correct.
Is there an (easy) way to solve this using the high-level Kafka Streams DSL? So far I fail to see an elegant way to deal with the intermediate updates produced on stream-5 due to the aggregation.
I know the intermediate updates can be somehow controlled with the cache.max.bytes.buffering and commit.interval.ms settings but I don't think any setting can guarantee in all cases that no intermediate values will be produced. Also I could try converting "stream-5" to a KTable on read with the timestamp part of the key, but then it seems KTable does not support windowing operations like KStreams do.
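(For reference, assuming a props object passed to KafkaStreams, the two settings mentioned above are set like this; as noted, they only influence how often intermediate updates are flushed downstream, they do not suppress them entirely.)
// Example values only; these tune how often cached KTable updates are emitted.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000L);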
This is what I have so far; it fails due to the intermediate aggregate values on stream-5:
Reducer<DataPoint> sum = new Reducer<DataPoint>() {
    @Override
    public DataPoint apply(DataPoint x, DataPoint y) {
        return new DataPoint(x.timestamp, x.value + y.value);
    }
};

KeyValueMapper<Windowed<String>, DataPoint, String> strip = new KeyValueMapper<Windowed<String>, DataPoint, String>() {
    @Override
    public String apply(Windowed<String> wKey, DataPoint arg1) {
        return wKey.key();
    }
};

KStream<String, DataPoint> s1 = builder.stream("stream-1");
s1.groupByKey()
    .reduce(sum, TimeWindows.of(5000).advanceBy(5000))
    .toStream()
    .selectKey(strip)
    .to("stream-5");

KStream<String, DataPoint> s5 = builder.stream("stream-5");
s5.groupByKey()
    .reduce(sum, TimeWindows.of(10000).advanceBy(10000))
    .toStream()
    .selectKey(strip)
    .to("stream-10");
Now if stream-1 contains the inputs (the key is always KEY):
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":1000,"value":1.0}
KEY {"timestamp":2000,"value":1.0}
KEY {"timestamp":3000,"value":1.0}
KEY {"timestamp":4000,"value":1.0}
KEY {"timestamp":5000,"value":1.0}
KEY {"timestamp":6000,"value":1.0}
KEY {"timestamp":7000,"value":1.0}
KEY {"timestamp":8000,"value":1.0}
KEY {"timestamp":9000,"value":1.0}
stream-5 contains the correct (final) values:
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":0,"value":2.0}
KEY {"timestamp":0,"value":3.0}
KEY {"timestamp":0,"value":4.0}
KEY {"timestamp":0,"value":5.0}
KEY {"timestamp":5000,"value":1.0}
KEY {"timestamp":5000,"value":2.0}
KEY {"timestamp":5000,"value":3.0}
KEY {"timestamp":5000,"value":4.0}
KEY {"timestamp":5000,"value":5.0}
but stream-10 is wrong (final value should be 10.0) because it also takes into account the intermediate values on stream-5:
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":0,"value":3.0}
KEY {"timestamp":0,"value":6.0}
KEY {"timestamp":0,"value":10.0}
KEY {"timestamp":0,"value":15.0}
KEY {"timestamp":0,"value":21.0}
KEY {"timestamp":0,"value":28.0}
KEY {"timestamp":0,"value":36.0}
KEY {"timestamp":0,"value":45.0}
KEY {"timestamp":0,"value":55.0}

The problem is that the results of all aggregations are KTables, which means the records produced to their output topics represent a changelog. However, when you subsequently load them as a stream, the downstream aggregation will double-count the intermediate updates.
Instead, you need to load the intermediate topics as tables, not streams. However, you will then not be able to use windowed aggregations on them, as those are only available on streams.
You can use the following pattern to accomplish a windowed aggregation over tables instead of streams:
https://cwiki.apache.org/confluence/display/KAFKA/Windowed+aggregations+over+successively+increasing+timed+windows
If you want to run each step in a separate process you can adapt that approach; just remember to load the intermediate tables using builder.table(), not builder.stream().
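A minimal sketch of the second step in that style, assuming the intermediate topic is keyed by "key@windowStart" (instead of the bare key used above) and that serdes for DataPoint are configured; the names and key format are illustrative, not the wiki's code:
// Aggregate the 5-second results into 10-second results with a table
// aggregation, so an updated 5s value replaces its previous contribution
// instead of being summed again.
KTable<String, DataPoint> t5 = builder.table("stream-5"); // keyed as "KEY@<5s-window-start>"

KTable<String, DataPoint> t10 = t5
    .groupBy((windowedKey, dp) -> {
        String key = windowedKey.split("@")[0];
        long win10Start = (dp.timestamp / 10000L) * 10000L;   // enclosing 10s window
        return KeyValue.pair(key + "@" + win10Start, dp);
    })
    .reduce(
        (agg, newVal) -> new DataPoint(agg.timestamp, agg.value + newVal.value),   // adder
        (agg, oldVal) -> new DataPoint(agg.timestamp, agg.value - oldVal.value));  // subtractor

t10.toStream().to("stream-10");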

Related

Kafka Streams join unrelated streams

I have a stream of events I need to match against a KTable / changelog topic, but the matching is done by pattern matching on a property of the KTable entries. So I cannot join the streams based on a key, since I don't know yet which one is matching.
example:
ktable X:
{
[abc]: {id: 'abc', prop: 'some pattern'},
[efg]: {id: 'efg', prop: 'another pattern'}
}
stream A:
{ id: 'xyz', match: 'some pattern'}
So stream A should forward something like {match: 'abc'}.
So I basically need to iterate over the KTable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the KTable and then access it from the Processor API and iterate over the entries?
I could also aggregate all the entries of the KTable into one collection and then join on a 'fake' key? But this also seems rather hacky.
Or am I just forcing something that is not really a streams problem, and should I rather put it into a Redis cache with the normal consumer API? That is also kind of awkward, since I'd rather have it backed by RocksDB.
Edit: I guess this is kind of related to this question.
A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as the key, and do an aggregation that returns a collection (i.e., a List, Set, etc.). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join result into multiple records (depending on how many records are in the collection on the table side).
As long as your join attribute does not have too many duplicates (on the table input topic), and thus the value-side collection in the table does not grow too large, this should work fine. You will need to provide a custom value Serde to (de)serialize the collection data.
Normally I would map the table data so I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case, the stream key was the first part of the table key, so we could group by that first key part and aggregate the results into a list. In the end it looked something like this:
final KTable<String, ArrayList<String>> theTable = builder
    .table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
    .groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
    .aggregate(
        ArrayList::new,
        (key, value, list) -> {
            list.add(value);
            return list;
        },
        (key, value, list) -> {
            list.remove(value);
            return list;
        },
        Materialized.with(Serdes.String(), stringListSerde));

final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);

theStream
    .join(theTable, (streamEvent, tableEventList) -> tableEventList)
    .flatMapValues(value -> value)
    .map(this::doStuff)
    .to(TARGET_TOPIC);
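The stringListSerde is not shown above; a minimal way to build one, assuming the list values can never contain the chosen delimiter, could look like this (purely illustrative, not part of the original answer):
// Hypothetical delimiter-based serde for ArrayList<String>; requires
// java.nio.charset.StandardCharsets, java.util.ArrayList and java.util.Arrays.
final Serde<ArrayList<String>> stringListSerde = Serdes.serdeFrom(
    (topic, list) -> list == null
        ? null
        : String.join("\u0001", list).getBytes(StandardCharsets.UTF_8),
    (topic, bytes) -> bytes == null
        ? null
        : new ArrayList<>(Arrays.asList(
            new String(bytes, StandardCharsets.UTF_8).split("\u0001", -1))));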
I am not sure if this is also possible for you; meaning, maybe it is possible for you to map the table data in some way to do the join.
I know this does not completely match your case, but I hope it is of some help anyway. Maybe you can clarify a bit what the matching would look like in your case.

Kafka Streams as table Patch log not full Post

Desired functionality: For a given key, key123, numerous services are running in parallel and reporting their results to a single location; once all results are gathered for key123, they are passed to a new downstream consumer.
Original idea: Use AWS DynamoDB to hold all results for a given entry. Every time a result is ready, a microservice does a PATCH operation on the database for key123. An output stream checks each UPDATE to see if the entry is complete; if so, it is forwarded downstream.
New idea: Use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic, the topic forms a changelog KStream, and we query it with KSQL for completed entries. Something like:
CREATE STREAM completed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH operation on the stream: how do I have the output stream show the accumulation of all messages for key123 instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
    KEY INT KEY,   -- example key to group by
    EVENT STRING   -- example event to collect
  ) WITH (
    kafka_topic='source',  -- or whatever your source topic is called.
    value_format='json'    -- or whatever value format you need.
  );

CREATE TABLE agg AS
  SELECT
    key,
    COLLECT_LIST(event) AS events
  FROM source
  GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
KEY INT KEY,
EVENTS ARRAY<STRING>
) WITH (
kafka_topic='AGG',
value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
  SELECT *
  FROM agg_stream
  WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM completed_results AS
  SELECT *
  FROM agg_stream
  WHERE IS_COMPLETE(EVENTS);
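Such an IS_COMPLETE function is not built in; a hypothetical sketch of what it could look like (ksqlDB UDFs are annotated Java classes packaged as a jar in the extensions directory; the "exactly five events" rule is only an example):
import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import java.util.List;

// Hypothetical UDF, not part of the original answer.
@UdfDescription(name = "IS_COMPLETE", description = "True once all expected events have arrived")
public class IsCompleteUdf {

    @Udf(description = "Checks whether the collected events form a complete result set")
    public boolean isComplete(final List<String> events) {
        return events != null && events.size() == 5; // example completeness criterion
    }
}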

kafka stream to ktable join

I am trying to join a KStream with a KTable.
KStream: created from a topic whose value is JSON. I re-key the stream using two attributes from the value; I created a custom POJO class and use a custom serde. Example value (snippet of the JSON):
{"value":"0","time":1.540753118800291E9,"deviceIp":"111.111.111.111","deviceName":"KYZ1","indicatorName":"ifHCInOctets"}
The keys are mapped as:
map((key, value) -> KeyValue.pair(value.deviceName + value.indicatorName, value))
I do a peek on the KStream and print both the key and the attributes I used; it all looks good.
KTable: I create a KTable from a topic. I am writing to that topic using a Python script, and the key is KYZ1ifHCInOctets, the combination of device name and indicator name (from above). I do a toStream and then a peek on the resulting stream; keys and values all seem fine.
Now when I do an inner join and do a peek, or write through/to a topic, I see that the keys and values are mismatched. The join doesn't seem to work:
KStream<String, MyPojoClass> joined = datastream.join(table,
    (data, tableValue) -> data,
    Joined.with(Serdes.String(), myCustomSerde, Serdes.String()));
key = XYZ1s1_TotalDiscards
Value = {"deviceName":"ABC2", "indicatorName":"jnxCosQstatTxedBytes"}
I have exactly the same thing working through KSQL, but I wanted to write my own streams app.
Now it sounds stupid, but the error was that my POJO class had a few of its attributes declared as static :-(, resulting in wrong keys.
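A small illustration of that bug (a hypothetical class, not the asker's code): with static fields, every deserialized record overwrites the same shared values, so a key built from them belongs to whichever record was deserialized last.
// Hypothetical POJO illustrating the bug described above.
public class MyPojoClass {
    public static String deviceName;     // BUG: shared across all records
    public static String indicatorName;  // BUG: shared across all records
    public String value;

    // Fix: declare them as instance fields instead:
    // public String deviceName;
    // public String indicatorName;
}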

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won't be able to join with a windowed key, instead of the table above I am mapping the stream to a table with a simple key:
t1_Stream.groupByKey()
    .windowedBy(TimeWindows.of(n * 1000))
    .count()
    .toStream()
    .map((k, v) -> new KeyValue<>(k.key(), Math.toIntExact(v)))
    .to(frequency_topic);

KTable<Integer, Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up in this table when a new record arrives on my stream, how do I know whether the lookup table will be updated first or the join will occur first (which would cause the stale frequency to be added to the record rather than the current updated one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic.
When the record is processed, it first updates the aggregation result, which emits an update record that is written to the through-topic. Directly afterwards, the same record is processed by the join operator. Only later will a new poll() call eventually read the aggregation result back from the through-topic and update the table side of the join.
Using the DSL, it does not seem possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join and provides the semantics you need.
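A minimal sketch of that idea, assuming the question's Integer keys/values and Tuple class and a 5-second window (illustrative only, not the answer author's code): keep the per-window counts in a manually added state store and enrich the record in the same pass, so the count is always updated before it is read.
StoreBuilder<KeyValueStore<String, Long>> freqStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("freq-store"),
    Serdes.String(), Serdes.Long());
builder.addStateStore(freqStoreBuilder);

final long windowSizeMs = 5_000L; // "n seconds" from the question

KStream<Integer, Tuple<Integer, Integer>> t1_enriched = t1_Stream.transformValues(
    () -> new ValueTransformerWithKey<Integer, Integer, Tuple<Integer, Integer>>() {
        private ProcessorContext context;
        private KeyValueStore<String, Long> store;

        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
            this.store = (KeyValueStore<String, Long>) context.getStateStore("freq-store");
        }

        @Override
        public Tuple<Integer, Integer> transform(final Integer key, final Integer value) {
            long windowStart = (context.timestamp() / windowSizeMs) * windowSizeMs;
            String storeKey = key + "@" + windowStart;          // count is scoped to the current window
            Long previous = store.get(storeKey);
            long count = (previous == null ? 0L : previous) + 1;
            store.put(storeKey, count);                         // update the frequency first...
            return new Tuple<>(value, Math.toIntExact(count));  // ...then enrich with it
        }

        @Override
        public void close() {}
    },
    "freq-store");
A real implementation would also purge expired windows (for example via a punctuator) or use a window store instead of a plain key-value store.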

How to count unique users in a fixed time window in a Kafka stream app?

We have a Kafka message in a single topic for each event a user performs on our platform. Each of the events / Kafka messages has a common field userId. We now want to know from that topic how many unique users we had in every hour. So we are not interested in the event types or the individual counts per user; we just want to know how many unique users were active in every hour.
What is the easiest way to achieve this? My current idea seems not to be very simple, see the pseudo code here:
stream
  .selectKey() // userId
  .groupByKey() // group by userId, results in a KGroupedStream[UserId, Value]
  .aggregate( // initializer, merger and accumulator simply deliver a constant value; the message is now just a tick for that userId key
    TimeWindows.of(3600000)
  ) // result of aggregate is KTable[Windowed[UserId], Const]
  .toStream // convert to a stream to be able to map the key in the next step
  .map() // map the key only: Windowed[UserId] becomes key = startMs of the window, value = userId
  .groupByKey() // group by startMs of the window, which was selected as key before
  .count() // results in a KTable from startMs of the window to the count of users (== unique userIds)
Is there an easier way? I am probably overlooking something.
There are two things you can do:
Merge selectKey() and groupByKey() into groupBy().
You don't need the toStream().map() step; you can do the regrouping with a new key directly on the first KTable.
Something like this:
stream.groupBy(/* put a KeyValueMapper that returns the grouping key, i.e. the userId */)
    .aggregate(..., TimeWindows.of(TimeUnit.HOURS.toMillis(1)))
    .groupBy(/* put a KeyValueMapper that returns the new grouping key, i.e. the window start */)
    .count()
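A fuller sketch of that shape, using the windowedBy variant of the API (the topic name and the extractUserId helper are illustrative assumptions, and serde configuration is omitted):
// Count distinct userIds per hour: first reduce each (userId, window) pair to
// a single "tick", then regroup the resulting table by window start and count.
KStream<String, String> events = builder.stream("user-events");

KTable<Windowed<String>, String> perUserTicks = events
    .groupBy((key, value) -> extractUserId(value))            // hypothetical helper: re-key by userId
    .windowedBy(TimeWindows.of(TimeUnit.HOURS.toMillis(1)))
    .reduce((tick1, tick2) -> tick1);                         // one entry per (userId, window)

KTable<Long, Long> uniqueUsersPerHour = perUserTicks
    .groupBy((windowedUserId, tick) ->
        KeyValue.pair(windowedUserId.window().start(), tick)) // regroup the table by window start
    .count();                                                 // = number of distinct users in that hour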