Kafka Streams - adding message frequency in enriched stream - apache-kafka

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won’t be able to join with a Windowed key, instead of the table above I am mapping the stream to a table with a simple key:
t1_Stream.groupByKey()
    .windowedBy(TimeWindows.of(n * 1000)).count()
    .toStream()
    .map((k, v) -> new KeyValue<>(k.key(), Math.toIntExact(v)))
    .to(frequency_topic);
KTable<Integer, Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up this table when a new record arrives in my stream, how do I know whether the lookup table will be updated first or the join will occur first (which would cause the stale frequency to be added to the record rather than the current, updated one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to look up the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this? Thanks.

For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic:
When a record is processed, it updates the aggregation result, which emits an update record that is written to the through topic. Directly afterwards, the same record is processed by the join operator. Only later will a new poll() call eventually read the aggregation result from the through topic and update the table side of the join.
Using the DSL, it does not seem possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join with the semantics you need.
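A minimal sketch of such a Transformer, assuming the integer keys/values and the Tuple class from the question; the store name is hypothetical and, for brevity, it keeps a simple running count per key rather than expiring entries older than n seconds (a window store or a punctuator would be needed for true windowed semantics):

StoreBuilder<KeyValueStore<Integer, Long>> countStore = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("freq-store"),
        Serdes.Integer(), Serdes.Long());
builder.addStateStore(countStore);

KStream<Integer, Tuple<Integer, Integer>> t1_enriched = t1_Stream.transform(
        () -> new Transformer<Integer, Integer, KeyValue<Integer, Tuple<Integer, Integer>>>() {
            private KeyValueStore<Integer, Long> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                store = (KeyValueStore<Integer, Long>) context.getStateStore("freq-store");
            }

            @Override
            public KeyValue<Integer, Tuple<Integer, Integer>> transform(Integer key, Integer value) {
                // update the count first, then emit the enriched record,
                // so the frequency can never be stale
                Long count = store.get(key);
                long updated = (count == null ? 0L : count) + 1;
                store.put(key, updated);
                return KeyValue.pair(key, new Tuple<>(value, Math.toIntExact(updated)));
            }

            @Override
            public void close() { }
        },
        "freq-store");

This keeps the "update, then look up" ordering inside a single operator, which is exactly what the round-trip through a topic cannot guarantee.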

Related

Kafka Streams join unrelated streams

I have a stream of events I need to match against a KTable / changelog topic, but the matching is done by pattern matching on a property of the KTable entries. So I cannot join the streams based on a key, since I don't know yet which entry is matching.
example:
KTable X:
{
[abc]: {id: 'abc', prop: 'some pattern'},
[efg]: {id: 'efg', prop: 'another pattern'}
}
stream A:
{ id: 'xyz', match: 'some pattern'}
So stream A should forward something like {match: 'abc'}.
So I basically need to iterate over the KTable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the KTable and then access it from the Processor API and iterate over the entries?
I could also aggregate all the entries of the KTable into one collection and then join on a 'fake' key, but this also seems rather hacky.
Or am I just forcing something that isn't really a streams problem, and should I rather put it into a Redis cache with the normal consumer API? That is also somewhat awkward, since I'd rather have it backed by RocksDB.
Edit: I guess this is somewhat related to this question.
A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as the key, and do an aggregation that returns a collection (i.e., List, Set, etc.). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join result into multiple records (depending on how many records are in the collection of the table).
As long as your join attribute does not have too many duplicates (in the table input topic), and thus the value-side collection in the table does not grow too large, this should work fine. You will need to provide a custom value Serde to (de)serialize the collection data.
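A rough sketch of that approach, assuming hypothetical topic names, an Entry POJO with id/prop fields, an event type with a match field, and pre-built Serdes (entrySerde, eventSerde, listSerde); serde construction and error handling are omitted:

KTable<String, List<Entry>> entriesByPattern = builder
        .stream("ktable-input-topic", Consumed.with(Serdes.String(), entrySerde))
        // re-key by the join attribute (the pattern)
        .selectKey((k, entry) -> entry.getProp())
        .groupByKey(Grouped.with(Serdes.String(), entrySerde))
        // collect all table entries that share the same pattern
        .aggregate(
                () -> new ArrayList<Entry>(),
                (pattern, entry, list) -> { list.add(entry); return list; },
                Materialized.with(Serdes.String(), listSerde));

KStream<String, String> matches = builder
        .stream("stream-a", Consumed.with(Serdes.String(), eventSerde))
        // re-key the stream by the attribute we want to match on
        .selectKey((k, event) -> event.getMatch())
        .join(entriesByPattern,
                (event, entries) -> entries,
                Joined.with(Serdes.String(), eventSerde, listSerde))
        // one output record per table entry whose pattern matched
        .flatMapValues(entries -> entries)
        .mapValues(entry -> entry.getId());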
Normally I would map the table data so I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case, the stream key was the first part of the table key, so we could group by that first key part and aggregate the results into a list. In the end it looked something like this:
final KTable<String, ArrayList<String>> theTable = builder
    .table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
    .groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
    .aggregate(
        ArrayList::new,
        (key, value, list) -> {
            list.add(value);
            return list;
        },
        (key, value, list) -> {
            list.remove(value);
            return list;
        },
        Materialized.with(Serdes.String(), stringListSerde));
final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);
theStream
    .join(theTable, (streamEvent, tableEventList) -> tableEventList)
    .flatMapValues(value -> value)
    .map(this::doStuff)
    .to(TARGET_TOPIC);
I am not sure if this is also possible for you; that is, maybe you can map the table data in some way to fit the join.
I know this does not completely match your case, but I hope it is of some help anyway. Maybe you can clarify a bit how the matching would look in your case.

Kafka Streams as table: PATCH log, not full POST

Desired functionality: for a given key, key123, numerous services are running in parallel and reporting their results to a single location; once all results are gathered for key123, they are passed to a new downstream consumer.
Original idea: use AWS DynamoDB to hold all results for a given entry. Every time a result is ready, a micro-service does a PATCH operation on key123 in the database. An output stream checks each UPDATE to see if the entry is complete; if so, it is forwarded downstream.
New idea: use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic, and the topic forms a changelog KStream that we query with KSQL for completed entries. Something like:
CREATE STREAM completed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH operation on the stream: how do I have the output stream show the accumulation of all messages for key123 instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
    KEY INT KEY,    -- example key to group by
    EVENT STRING    -- example event to collect
) WITH (
    kafka_topic='source',   -- or whatever your source topic is called.
    value_format='json'     -- or whatever value format you need.
);
CREATE TABLE agg AS
    SELECT
        key,
        COLLECT_LIST(event) AS events
    FROM source
    GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
    KEY INT KEY,
    EVENTS ARRAY<STRING>
) WITH (
    kafka_topic='AGG',
    value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
    SELECT
        *
    FROM agg_stream
    WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM completed_results AS
    SELECT
        *
    FROM agg_stream
    WHERE IS_COMPLETE(EVENTS);
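A minimal sketch of such a UDF using ksqlDB's UDF annotations; the class name and the "five non-null results" criterion are placeholders for whatever your real completeness check is:

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import io.confluent.ksql.function.udf.UdfParameter;
import java.util.List;

@UdfDescription(name = "is_complete", description = "True once all expected results have arrived.")
public class IsCompleteUdf {

    @Udf(description = "Checks whether the collected events form a complete result set.")
    public boolean isComplete(@UdfParameter("events") final List<String> events) {
        // hypothetical criterion: exactly five non-null results make an entry complete
        return events != null
                && events.size() == 5
                && events.stream().noneMatch(e -> e == null);
    }
}

The UDF jar goes into ksqlDB's extensions directory, after which IS_COMPLETE(EVENTS) can be used in queries like the one above.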

Apache beam: SQL aggregation outputs no results for Unbounded/Bounded join

I am working on an Apache Beam pipeline to run a SQL aggregation function. Reference: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlDslJoinTest.java#L159.
The example there works fine. However, when I replace the source with an actual unbounded source and do an aggregation, I see no results.
Steps in my pipeline:
Read bounded data from a source and convert it to a collection of rows.
Read unbounded JSON data from a WebSocket source.
Assign a timestamp to every element of the source streams via a DoFn.
Convert the unbounded JSON to an unbounded row collection.
Apply a window to the row collection.
Apply a SQL statement.
Output the result of the SQL.
A normal SQL statement executes and outputs the results. However, when I use a group by in the SQL, there is no output.
SELECT
    o1.detectedCount,
    o1.sensor se,
    o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
    ON o1.sensor = o2.sensor
The results are continuous and look like the output shown below.
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":1,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
The results don't show up at all when I change the SQL to:
SELECT
    COUNT(o1.detectedCount),
    o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
    ON o1.sensor = o2.sensor
GROUP BY o2.sensor
Is there anything I am doing wrong in this implementation? Any pointers would be really helpful.
Some suggestions come up when reading your code:
Extend the window to allow lateness and to emit early-arriving data:
.apply("windowing", Window.<Row>into(FixedWindows.of(Duration.standardSeconds(2)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(2))))
.withAllowedLateness(Duration.standardMinutes(10))
.discardingFiredPanes());
Try to remove the join and check whether, without it, the window produces any output.
Try to add more time to the window, because sometimes it is too short to shuffle the data between the workers, and the joined streams aren't emitted at the same time.
outputWithTimestamp will output the rows with a different timestamp, and they can then be dropped when you don't allow lateness.
Read the docs for outputWithTimestamp; this API is a bit risky:
If the input {@link PCollection} elements have timestamps, the output timestamp for each element must not be before the input element's timestamp minus the value of {@link getAllowedTimestampSkew()}. If an output timestamp is before this time, the transform will throw an {@link IllegalArgumentException} when executed. Use {@link #withAllowedTimestampSkew(Duration)} to update the allowed skew.
CAUTION: Use of {@link #withAllowedTimestampSkew(Duration)} permits elements to be emitted behind the watermark. These elements are considered late, and if behind the {@link Window#withAllowedLateness(Duration) allowed lateness} of a downstream {@link PCollection} may be silently dropped.
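Since the question assigns timestamps in a DoFn, here is a minimal sketch of what such a DoFn could look like under those constraints; the field name eventTimeMillis and the 10-minute skew are illustrative, not taken from the question:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class AssignEventTimestamp extends DoFn<Row, Row> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        Row row = c.element();
        // hypothetical field holding the event time in epoch millis
        Instant eventTime = new Instant(row.getInt64("eventTimeMillis"));
        c.outputWithTimestamp(row, eventTime);
    }

    // deprecated, but this is the mechanism the quoted docs refer to
    @Override
    public Duration getAllowedTimestampSkew() {
        // allow timestamps up to 10 minutes behind the incoming element's timestamp
        return Duration.standardMinutes(10);
    }
}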
Finally, the aggregation query itself could be written with an alias for the count and an explicit GROUP BY on the non-aggregated columns:
SELECT
    COUNT(o1.detectedCount) AS number,
    o2.sensor,
    sa
FROM SENSOR o1
LEFT OUTER JOIN AREA o2
    ON o1.sensor = o2.sensor
GROUP BY sa, o1.sensor, o2.sensor

How to do pattern matching using match_recognize in Esper for unknown properties of events in event stream?

I am new to Esper and I am trying to filter event properties from event streams that have many events arriving at high velocity.
I am using Kafka to send CSV rows from a producer to a consumer, and at the consumer I convert those rows to a HashMap and create events at run time in Esper.
For example, I have the events listed below, which arrive every second:
WeatherEvent Stream:
E1 = {hum=51.0, precipi=1, precipm=1, tempi=57.9, icon=clear, pickup_datetime=2016-09-26 02:51:00, tempm=14.4, thunder=0, windchilli=, wgusti=, pressurei=30.18, windchillm=}
E2 = {hum=51.5, precipi=1, precipm=1, tempi=58.9, icon=clear, pickup_datetime=2016-09-26 02:55:00, tempm=14.5, thunder=0, windchilli=, wgusti=, pressurei=31.18, windchillm=}
E3 = {hum=52, precipi=1, precipm=1, tempi=59.9, icon=clear, pickup_datetime=2016-09-26 02:59:00, tempm=14.6, thunder=0, windchilli=, wgusti=, pressurei=32.18, windchillm=}
Where E1, E2...EN are multiple events in WeatherEvent
In the above events I just want to filter out properties like hum, tempi, tempm and pressurei, because they change as time proceeds (within 4 seconds); I don't care about the properties that do not change at all or change only slowly.
Using the EPL query below I am able to filter out properties like tempm and hum:
@Name('Out') select * from weatherEvent.win:time(10 sec)
match_recognize (
    partition by pickup_datetime?
    measures A.tempm? as a_temp, B.tempm? as b_temp
    pattern (A B)
    define
        B as Math.abs(B.tempm? - A.tempm?) > 0
)
The problem is I can only do it when I specify tempm or hum in the query for pattern matching.
But as the data comes from CSV and has many dimensions/features, I don't know the properties of the events beforehand.
I want Esper to automatically detect the features/properties that are changing (at run time) and filter them out, without me specifying the event properties.
Any ideas how to do it? Is that even possible with Esper? If not, can I do it with another CEP engine like Siddhi or Oracle CEP?
You may add a "?" to the event property name to get the value of those properties that are not known at the time the event type is defined. This is called dynamic property see documentation . The type returned is Object so you need to downcast.

2 step windowed aggregation with Kafka Streams DSL

Suppose I have a stream "stream-1" consisting of 1 datapoint every second and I'd like to calculate a derived stream "stream-5" which contains the sum using a hopping window of 5 seconds and another stream "stream-10" which is based off "stream-5" containing the sum using a hopping window of 10 seconds. The aggregation needs to be done for each key separately and I'd like to be able to run each step in a different process. It is not a problem in itself if stream-5 and stream-10 contain updates for the same key/timestamp (so I don't necessarily need How to send final kafka-streams aggregation result of a time windowed KTable?) as long as the last values are correct.
Is there an (easy) way to solve this using the high-level Kafka Streams DSL? So far I fail to see an elegant way to deal with the intermediate updates produced on stream-5 due to the aggregation.
I know the intermediate updates can be somehow controlled with the cache.max.bytes.buffering and commit.interval.ms settings but I don't think any setting can guarantee in all cases that no intermediate values will be produced. Also I could try converting "stream-5" to a KTable on read with the timestamp part of the key, but then it seems KTable does not support windowing operations like KStreams do.
This is what I have so far, and it fails due to the intermediate aggregate values on stream-5:
Reducer<DataPoint> sum = new Reducer<DataPoint>() {
    @Override
    public DataPoint apply(DataPoint x, DataPoint y) {
        return new DataPoint(x.timestamp, x.value + y.value);
    }
};

KeyValueMapper<Windowed<String>, DataPoint, String> strip =
        new KeyValueMapper<Windowed<String>, DataPoint, String>() {
    @Override
    public String apply(Windowed<String> wKey, DataPoint arg1) {
        return wKey.key();
    }
};

KStream<String, DataPoint> s1 = builder.stream("stream-1");
s1.groupByKey()
    .reduce(sum, TimeWindows.of(5000).advanceBy(5000))
    .toStream()
    .selectKey(strip)
    .to("stream-5");

KStream<String, DataPoint> s5 = builder.stream("stream-5");
s5.groupByKey()
    .reduce(sum, TimeWindows.of(10000).advanceBy(10000))
    .toStream()
    .selectKey(strip)
    .to("stream-10");
Now, if stream-1 contains the inputs (the key is just KEY):
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":1000,"value":1.0}
KEY {"timestamp":2000,"value":1.0}
KEY {"timestamp":3000,"value":1.0}
KEY {"timestamp":4000,"value":1.0}
KEY {"timestamp":5000,"value":1.0}
KEY {"timestamp":6000,"value":1.0}
KEY {"timestamp":7000,"value":1.0}
KEY {"timestamp":8000,"value":1.0}
KEY {"timestamp":9000,"value":1.0}
stream-5 contains the correct (final) values:
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":0,"value":2.0}
KEY {"timestamp":0,"value":3.0}
KEY {"timestamp":0,"value":4.0}
KEY {"timestamp":0,"value":5.0}
KEY {"timestamp":5000,"value":1.0}
KEY {"timestamp":5000,"value":2.0}
KEY {"timestamp":5000,"value":3.0}
KEY {"timestamp":5000,"value":4.0}
KEY {"timestamp":5000,"value":5.0}
but stream-10 is wrong (final value should be 10.0) because it also takes into account the intermediate values on stream-5:
KEY {"timestamp":0,"value":1.0}
KEY {"timestamp":0,"value":3.0}
KEY {"timestamp":0,"value":6.0}
KEY {"timestamp":0,"value":10.0}
KEY {"timestamp":0,"value":15.0}
KEY {"timestamp":0,"value":21.0}
KEY {"timestamp":0,"value":28.0}
KEY {"timestamp":0,"value":36.0}
KEY {"timestamp":0,"value":45.0}
KEY {"timestamp":0,"value":55.0}
The problem is that the results of all aggregations are KTables, which means the records produced to their output topics represent a changelog. However, when you subsequently load such a topic as a stream, the downstream aggregation will double-count the intermediate updates.
Instead, you need to load the intermediate topics as tables, not streams. However, you will then not be able to use windowed aggregations on them, as those are only available on streams.
You can use the following pattern to accomplish a windowed aggregation over tables instead of streams:
https://cwiki.apache.org/confluence/display/KAFKA/Windowed+aggregations+over+successively+increasing+timed+windows
If you want to run each step in a separate process you can adapt that approach; just remember to load the intermediate tables using builder.table(), not builder.stream().
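A rough sketch of that pattern for the second step, paraphrased from the linked page rather than copied from it: the first step puts the 5-second window start into the record key (otherwise consecutive windows would overwrite each other in the table), and the second step loads stream-5 with builder.table() and re-groups by the 10-second window derived from the embedded timestamp. It assumes a dataPointSerde and a newer Streams API than the snippet above:

// step 1: key the intermediate topic by "<key>@<5s-window-start>"
s1.groupByKey()
    .reduce(sum, TimeWindows.of(5000).advanceBy(5000))
    .toStream()
    .map((wKey, dp) -> new KeyValue<>(wKey.key() + "@" + wKey.window().start(), dp))
    .to("stream-5");

// step 2: load stream-5 as a table and aggregate into 10-second windows
KTable<String, DataPoint> t5 = builder.table(
        "stream-5", Consumed.with(Serdes.String(), dataPointSerde));

KTable<String, DataPoint> sum10 = t5
    .groupBy(
        (key, dp) -> new KeyValue<>(
            key.split("@")[0] + "@" + (dp.timestamp / 10000) * 10000, dp),
        Grouped.with(Serdes.String(), dataPointSerde))
    .reduce(
        (agg, dp) -> new DataPoint(agg.timestamp, agg.value + dp.value),   // adder
        (agg, dp) -> new DataPoint(agg.timestamp, agg.value - dp.value));  // subtractor

Because the table aggregation applies an adder and a subtractor on every update, the intermediate values emitted by the first window no longer accumulate: each update to a 5-second slot first removes its old contribution and then adds the new one.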