Esper EPL to match 5 pairs of events of distinct field values within 1 minute - complex-event-processing

I have a stream of events defined by:
create schema Event(id string, username string, <additionalFields>)
where the <additionalFields> are used to construct contexts but don't directly participate in the rest of the pattern matching EPL (all the EPL statements will be executed within the context).
The desired behavior is to match:
On five pairs of events for distinct usernames within one minute or less.
If more than two events occur for a given username within the one-minute window, they are to be ignored for the purposes of the match.
If duplicate events (events with the same id field value) occur, they should be ignored for the purposes of the match.
Ideally the match would consume the events so that the same events can't participate in later matches; however, if it makes the EPL much simpler to understand, we can post-process to eliminate these overlaps if needed.
Example input events:
Event={id='e1', username='user1'}
t=t.plus(5 seconds)
Event={id='e2', username='user2'}
t=t.plus(5 seconds)
Event={id='e3', username='user4'}
t=t.plus(5 seconds)
Event={id='e4', username='user3'}
t=t.plus(5 seconds)
Event={id='e5', username='user1'}
t=t.plus(5 seconds)
Event={id='e6', username='user1'}
t=t.plus(5 seconds)
Event={id='e7', username='user1'}
t=t.plus(5 seconds)
Event={id='e8', username='user5'}
t=t.plus(5 seconds)
Event={id='e9', username='user5'}
t=t.plus(5 seconds)
Event={id='e10', username='user4'}
t=t.plus(5 seconds)
Event={id='e11', username='user2'}
t=t.plus(5 seconds)
Event={id='e12', username='user3'}
Ideal output events:
Event={id='e1', username='user1'}
Event={id='e2', username='user2'}
Event={id='e3', username='user4'}
Event={id='e4', username='user3'}
Event={id='e5', username='user1'}
Event={id='e8', username='user5'}
Event={id='e9', username='user5'}
Event={id='e10', username='user4'}
Event={id='e11', username='user2'}
Event={id='e12', username='user3'}
The following would also be acceptable for output events:
Event={id='e1', username='user1'}
Event={id='e5', username='user1'}
Event={id='e8', username='user5'}
Event={id='e9', username='user5'}
Event={id='e3', username='user4'}
Event={id='e10', username='user4'}
Event={id='e2', username='user2'}
Event={id='e11', username='user2'}
Event={id='e4', username='user3'}
Event={id='e12', username='user3'}
I've tried using a named window:
// retain at most the last minute of events
create window AtMostTwoEventsPerUsername#time(1 minute) as Event;

// insert an incoming event unless it is a duplicate id or the username already has two events retained
on Event as e
merge AtMostTwoEventsPerUsername as w
where w.id = e.id
  or (select count(*) from AtMostTwoEventsPerUsername where username = e.username) > 1
when not matched then insert select *;

// emit the retained events once 10 events from usernames with exactly two events each are present
on Event
insert into FivePairsOfTwoEventsPerUsername
select w.* from AtMostTwoEventsPerUsername as w
where w.username in (select username from AtMostTwoEventsPerUsername group by username having count(*) = 2)
having count(*) = 10;

// consume the matched events so they can't participate in later matches
on FivePairsOfTwoEventsPerUsername as m
delete from AtMostTwoEventsPerUsername as w
where w.id = m.id;

@Name("Out") select * from FivePairsOfTwoEventsPerUsername#time(1 minute)#length_batch(10);
and it seems to be close; however, it requires an extra event after the matching events, which is undesirable:
Event={id='e1', username='user1'}
t=t.plus(5 seconds)
Event={id='e2', username='user2'}
t=t.plus(5 seconds)
Event={id='e3', username='user4'}
t=t.plus(5 seconds)
Event={id='e4', username='user3'}
t=t.plus(5 seconds)
Event={id='e5', username='user1'}
t=t.plus(5 seconds)
Event={id='e6', username='user1'}
t=t.plus(5 seconds)
Event={id='e7', username='user1'}
t=t.plus(5 seconds)
Event={id='e8', username='user5'}
t=t.plus(5 seconds)
Event={id='e9', username='user5'}
t=t.plus(5 seconds)
Event={id='e10', username='user4'}
t=t.plus(5 seconds)
Event={id='e11', username='user2'}
t=t.plus(5 seconds)
Event={id='e12', username='user3'}
t=t.plus(1 seconds)
Event={id='e13', username='user999'} // this shouldn't be needed to trigger a match
results in the desired output events:
FivePairsOfTwoEventsPerUsername={id='e1', username='user1'}
FivePairsOfTwoEventsPerUsername={id='e2', username='user2'}
FivePairsOfTwoEventsPerUsername={id='e3', username='user4'}
FivePairsOfTwoEventsPerUsername={id='e4', username='user3'}
FivePairsOfTwoEventsPerUsername={id='e5', username='user1'}
FivePairsOfTwoEventsPerUsername={id='e8', username='user5'}
FivePairsOfTwoEventsPerUsername={id='e9', username='user5'}
FivePairsOfTwoEventsPerUsername={id='e10', username='user4'}
FivePairsOfTwoEventsPerUsername={id='e11', username='user2'}
FivePairsOfTwoEventsPerUsername={id='e12', username='user3'}
If the last event (Event={id='e13', username='user999'}) is removed from the input event stream, the "Out" stream unexpectedly has no matching events.
I'd like to understand why the extra event at the end is needed to trigger the match, and whether there is a simpler set of EPL statements to achieve the desired pattern matching.

To ignore all events with the same id field value would mean remembering every id field value that ever occurred... is that right?
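If, on the other hand, duplicates only need to be suppressed within the one-minute horizon rather than forever, an intersection of the #firstunique(id) and #time(1 minute) data windows could take care of the de-duplication (assuming the default intersection semantics for multiple data windows). A minimal sketch, with DedupedEvent as a hypothetical stream name:
// keep only the first event per id seen within the last minute; later duplicates never enter the window
insert into DedupedEvent select * from Event#firstunique(id)#time(1 minute);
The remaining statements would then read from DedupedEvent instead of Event.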
I would approach this by making an intermediate stream that carries a flag indicating whether the username is "entering" or "leaving" the set of usernames that are currently distinct. I would use that intermediate stream to add and remove usernames (and their additional info) from a named window, the named window being maintained by an on-merge of the intermediate stream. Select the final output with an "insert into resultstream select window(*) from namedwindow having count > x", then use "resultstream" as a trigger to delete the named window contents so that those usernames disappear, avoiding overlaps.
This way your solution becomes a two-step design: the first step produces the intermediate stream, and the second retains the intermediate stream's events and outputs them once five are found.
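A minimal sketch of that two-step shape, using hypothetical names (UserChange, QualifiedUsers, ResultStream) and leaving the derivation of the entering/leaving flag as a placeholder, since that part depends on how the at-most-two-events-per-username rule is tracked:
// Step 1 (placeholder): derive a change stream per username with an entering/leaving flag.
// The logic that decides when a username enters or leaves the distinct set is omitted here.
create schema UserChange(username string, entering boolean);

// Step 2: a named window holding the usernames that currently qualify.
create window QualifiedUsers#keepall as (username string);

on UserChange as c
merge QualifiedUsers as w
where w.username = c.username
when not matched and c.entering = true then insert select c.username as username
when matched and c.entering = false then delete;

// emit all retained rows once five distinct usernames qualify...
insert into ResultStream select window(*) as users from QualifiedUsers having count(*) >= 5;

// ...and use the result as a trigger to clear the window, so the same usernames can't overlap into a later match
on ResultStream delete from QualifiedUsers;
The window could also carry the full pairs of events rather than just usernames if the output needs the original Event rows.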

Related

emit final with tumbling window

Use case: get the messages from the KSQL stream (visitor_topic1) and push them into a new Kafka topic (final_visitor) every 1 minute.
CREATE TABLE final_visitors_per_min
  WITH (KAFKA_TOPIC='final_visitor', KEY_FORMAT='JSON', PARTITIONS = 3, REPLICAS = 3)
AS SELECT
  id,
  visitorName
FROM vister_List_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY id
EMIT CHANGES;
So I created a table that reads messages from the stream (visitor_topic1) with a tumbling window of size 1 minute and EMIT CHANGES.
With EMIT CHANGES, topic 2 receives a message immediately when topic 1 receives one, rather than waiting 1 minute before sending.
With EMIT FINAL, no messages are emitted to topic 2 at all.
Does anyone have suggestions on where the problem is? I want to receive the messages with a 1-minute delay.

How to ensure outer NULL join results output in spark streaming if the future events are delayed

In a scenario of Spark stream-stream outer join:
val left = spark.readStream.format("delta").load("...")
.withWatermark("enqueuedTime", "1 hour")
val right = spark.readStream.format("delta").load("...")
.withWatermark("enqueuedTime", "1 hour")
val res = left.as("left").join(right.as("right"),
expr("left.key = right.key AND (left.enqueuedTime BETWEEN right.enqueuedTime - INTERVAL 1 hour AND right.enqueuedTime + INTERVAL 1 hour)"),
"left_outer")
res.writeStream(....)
Given some data in the left and right streams, how can I ensure that a record like:
2, left_value1, 2022-04-18T12:39:49.370+0000, NULL, NULL, NULL
is output after a given period of time, even if new events aren't flowing through the stream?
I'm only able to get it if new events arrive in both tables, like:
INSERT into left_df VALUES ("004", "left_df_value", current_timestamp() + INTERVAL 5 hours);
INSERT into right_df VALUES ("004", "right_df_value", current_timestamp() + INTERVAL 5 hours);
With those inserts, Spark updates the watermarks and understands that it is now safe to output a null-padded record. But how can I still output it after some kind of timeout, without new records arriving in both streams?

How can I count elements satisfying a condition in a group, with PostgreSQL

with this query:
SELECT date_trunc('minute', ts) ts, instrument
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
I am grouping rows by minute, but I would like to generate a boolean value that tells me whether, in the group, there is at least one row whose timestamp has seconds < 10 and at least one row whose timestamp has seconds > 50.
In short, something like:
lessThan10 = false
moreThan50 = false
for each row in the one minute group:
if row.ts.seconds < 10 then lessThan10 = true
if row.ts.seconds > 50 then moreThan50 = true
return lessThan10 && moreThan50
What I am trying to achieve is to find out whether the records I aggregate cover the beginning and the end of the minute; it's OK if there are holes here and there, but it's possible the data we capture stops and restarts at, say, second 40, and in that case I'd like to be able to discard the whole minute.
As the data rate varies quite a lot, I can't check for a minimum number of rows. There may be a better solution to achieve this, so I'm open to that as well.
Use EXTRACT() to get the seconds of the min and max values of ts:
SELECT date_trunc('minute', ts) ts, instrument,
EXTRACT(SECOND FROM MIN(ts)) < 10 lessThan10,
EXTRACT(SECOND FROM MAX(ts)) > 50 moreThan50
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts

spark.time in parallel execution in Spark for API calls

I am doing the following on my 8 GB laptop, running the code in IntelliJ. I am calling 3 APIs in parallel with a map function and the scalaj library, and calculating the time taken to call each API as follows:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// for each API call, execute it on a different executor and collate the data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
When spark.time is executed, I expected 3 timings but it gives me 6:
Time taken: 14945 ms
Time taken: 21773 ms
Time taken: 22446 ms
Time taken: 6438 ms
Time taken: 6877 ms
Time taken: 7107 ms
What am I missing here, and are the calls to the API really made in parallel?
Actually, that piece of code alone won't execute any spark.time at all: the map function is lazy, so it won't be executed until you perform an action on the RDD. You should also consider that if you don't persist your transformed RDD, it will re-compute all transformations for every action. What this means is that if you are doing something like this:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// for each API call, execute it on a different executor and collate the data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x)))
val c = actual_data.count()
actual_data.collect()
There will be 6 executions of what is defined inside the map (two for each element in the RDD: the first for the count and the second for the collect). To avoid this re-computation you can cache or persist the RDD as follows:
val urls = spark.sparkContext.parallelize(Seq("url1", "url2", "url3"))
// for each API call, execute it on a different executor and collate the data
val actual_data = urls.map(x => spark.time(HTTPRequestParallel.ds(x))).cache()
val c = actual_data.count()
actual_data.collect()
In this second example you will only see 3 logs instead of 6.

Why Spark Structured Streaming window aggregation evaluates after each trigger

With Spark 2.2.0, I am reading data from Kafka with 2 columns, "textcol" and "time". The "time" column has the latest processing time. I want to get the count for each unique value of "textcol" in a fixed window duration of 20 seconds. My trigger duration is 10 seconds.
For example, if in a 20-second window trigger1 has textcol=a and trigger2 has textcol=b, then I am expecting the output below after 20 seconds:
textcol cnt
a 1
b 1
I used the below code for dataset ds:
ds.groupBy(functions.col("textcol"),
functions.window(functions.col("time"), "20 seconds"))
.agg(functions.count("textcol").as("cnt"))
.writeStream().trigger(Trigger.ProcessingTime("10 seconds"))
.outputMode("update")
.format("console").start();
But I am getting output twice, due to the 2 triggers within the 20 seconds:
Trigger1:
textcol cnt
a 1
Trigger2:
textcol cnt
b 1
So why doesn't the window aggregate the results and output them after 20 seconds, instead of outputting on each 10-second trigger?
Is there any other way to achieve it in spark structured streaming?
Change your .outputMode("update") to .outputMode("complete").