Deleting previous events with Esper CEP - complex-event-processing

While sending events to the runtime, I want, when some condition applies, to ignore all past events and start fresh (with the same runtime, patterns, listeners, etc.). This can happen many times, so I need a relatively fast way of doing it. Is there a function that deletes from the runtime all the events that were already sent? (I couldn't find such a function in the documentation.) I know I could create a new runtime, but that is very time consuming.

Yes, use contexts: you define a start condition and an end condition, and when the end condition is reached the runtime throws away all state (see the Esper documentation on context declarations).
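A minimal sketch, assuming the Esper 8 compile-and-deploy API, an existing configuration and runtime, and hypothetical ResetEvent and StockTick event types: a non-overlapping context that starts immediately and ends when a reset event arrives, at which point the runtime discards all statement state and starts a fresh context instance.
// The context starts at deployment time (@now) and ends on the next
// ResetEvent; all state held by statements assigned to it is then discarded
// and a new context instance begins immediately.
String epl =
    "create context ResetCtx start @now end ResetEvent;\n"
    // The aggregation state of this statement is scoped to the context,
    // so it is reset whenever a ResetEvent comes in.
    + "@name('counts') context ResetCtx select symbol, count(*) from StockTick group by symbol;";
EPCompiled compiled = EPCompilerProvider.getCompiler().compile(epl, new CompilerArguments(configuration));
runtime.getDeploymentService().deploy(compiled);
Sending a ResetEvent then achieves what the question asks for: the same runtime, statements, and listeners keep running, but with fresh state.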

Related

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

I'm working with code that uses a tumbling window of one day, and would like to send early results to a different DataStream on an hourly basis.
I understand that triggers are a way to go here, but don't really see how it would work.
The current code is as follows:
myStream
    .keyBy(..)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
In my understanding, I should register a trigger and then, in its onEventTime method, get hold of the TriggerContext so I can send data to a side output from there. But how do I get the current state of MyAggregateFunction from there? Or would I need to do my own computation inside onEventTime()?
Also, the documentation states that "By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner.". Would my one-day window then still fire correctly, or do I need to trigger it differently somehow?
Another way of doing this would be to create two different operators: one that windows by 1 hour and another that windows by 1 day. Would triggers be the preferred approach over that?
Rather than using a custom Trigger, it would be simpler to have two layers of windowing, where the hourly results are further aggregated into daily results. Something like this:
hourlyResults = myStream
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())

dailyResults = hourlyResults
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())

hourlyResults.addSink(...)
dailyResults.addSink(...)
Note that the result of a window is not a KeyedStream, so you will need to use keyBy again, unless you can arrange to leverage reinterpretAsKeyedStream (docs).
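If that applies, a hedged one-liner using Flink's experimental DataStreamUtils (the HourlyResult type and its getKey accessor are hypothetical):
// Declares that hourlyResults is already partitioned by the same key,
// avoiding a second shuffle (experimental API; use only when the
// partitioning really is identical).
KeyedStream<HourlyResult, String> keyed =
    DataStreamUtils.reinterpretAsKeyedStream(hourlyResults, HourlyResult::getKey);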
Normally when I get to more complex behavior like this, I use a KeyedProcessFunction. You can aggregate (and save in state) hourly and daily results, set timers as needed, and use a side output for the hourly results versus the regular output for the daily results.
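A hedged sketch of that approach, assuming hypothetical Event, HourlyResult, and DailyResult types, a simple count in place of the real aggregation, and UTC-aligned hour/day boundaries:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class HourlyAndDaily extends KeyedProcessFunction<String, Event, DailyResult> {

    // Hourly results go to this side output; daily results use the main output.
    public static final OutputTag<HourlyResult> HOURLY = new OutputTag<HourlyResult>("hourly") {};

    private static final long HOUR = 60 * 60 * 1000L;
    private static final long DAY = 24 * HOUR;

    private transient ValueState<Long> hourly;
    private transient ValueState<Long> daily;

    @Override
    public void open(Configuration parameters) {
        hourly = getRuntimeContext().getState(new ValueStateDescriptor<>("hourly", Long.class));
        daily = getRuntimeContext().getState(new ValueStateDescriptor<>("daily", Long.class));
    }

    @Override
    public void processElement(Event e, Context ctx, Collector<DailyResult> out) throws Exception {
        // Count as a stand-in for the real aggregation; assumes event-time
        // timestamps have been assigned upstream.
        hourly.update(hourly.value() == null ? 1L : hourly.value() + 1);
        daily.update(daily.value() == null ? 1L : daily.value() + 1);
        long t = ctx.timestamp();
        // Timers for the end of the current hour and day (registering the
        // same timestamp twice is deduplicated by Flink).
        ctx.timerService().registerEventTimeTimer(t - (t % HOUR) + HOUR);
        ctx.timerService().registerEventTimeTimer(t - (t % DAY) + DAY);
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<DailyResult> out) throws Exception {
        if (hourly.value() != null) {
            // Hourly partial result to the side output.
            ctx.output(HOURLY, new HourlyResult(ctx.getCurrentKey(), hourly.value()));
            hourly.clear();
        }
        if (ts % DAY == 0 && daily.value() != null) {
            // Day boundary: daily result to the regular output.
            out.collect(new DailyResult(ctx.getCurrentKey(), daily.value()));
            daily.clear();
        }
    }
}
The hourly stream is then obtained from the operator's result with getSideOutput(HourlyAndDaily.HOURLY).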
There are quite a few questions here; I will try to answer all of them. First of all, if you specify your own trigger using trigger(), you effectively override the default trigger, and the window may not work the way it would by default. For example, if you create a one-day event-time tumbling window but override the trigger so that it fires for every 20th element, it will never fire based on event time.
Now, after your custom trigger fires, the output of MyAggregateFunction is passed to MyProcessWindowFunction just as with the default trigger, so you don't need to access MyAggregateFunction from inside the trigger.
Finally, while it may be technically possible to implement a trigger that emits partial results every hour, my personal opinion is that you should go with the two separate windows. That solution may create slightly more overhead and a larger state, but it is much clearer, easier to implement, and much more robust.

Is it possible use composite triggers in conjunction with micro-batching with dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or every five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(500000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this fails when the pipeline is built (it compiles fine): "An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified".
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that would essentially render the idea of inserting either every five minutes or every N messages completely void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency when writing an unbounded PCollection.
If you want more real-time results, you can use STREAMING_INSERTS instead.
Reference: https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS
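A hedged sketch of that variant, reusing destination and the dispositions from the question; with STREAMING_INSERTS rows are written as they arrive, so no triggering frequency is needed:
// Streaming inserts instead of periodic file loads; no
// withTriggeringFrequency is required for an unbounded PCollection.
.apply("StreamWriteToBigQuery", BigQueryIO.writeTableRows()
    .to(destination)
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));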

Performance issues: Alternative to conditional event

My model works fine, but since I introduced conditional events (events that trigger once a specified condition is met), computational performance has decreased.
This is especially annoying when I run experiments with many iterations...
Is there an alternative way to trigger an event at a specified condition?
Thank you :)
The answer is to trigger events directly from the mechanisms that influence the condition you are monitoring. Instead of the "permanent" condition check, make sure that each time the condition's inputs change, you trigger a check. If that is not possible, because there are too many influencing factors or they change continuously, consider a cyclic check at discrete time steps, at an interval that suits you. Even this will be much more performant (depending on your chosen check interval) than the current continuous monitoring.
Example of how to do event-triggered monitoring instead of continuous monitoring:
In your Main agent, add a global counter variable of type int (here: nrAgentsState1) and create a function to update that variable (here: updateStateCounter). Also in that function: a check that triggers a further function whenever your threshold is reached.
In the transition leading into the state you want to monitor (here: state1), call the function in Main that updates the global counter. For the transition leading out of the monitored state, add the same call, but with -1 as the parameter.
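A hedged Java-style sketch of that pattern (AnyLogic function bodies are Java; nrAgentsState1 and updateStateCounter are the names used above, while THRESHOLD and onThresholdReached are hypothetical):
// Global counter in Main, replacing the permanently evaluated condition.
int nrAgentsState1 = 0;
int THRESHOLD = 100; // hypothetical threshold value

// Called from the transitions with +1 (entering state1) or -1 (leaving it),
// so the condition is evaluated only when it can actually have changed.
void updateStateCounter(int delta) {
    nrAgentsState1 += delta;
    if (nrAgentsState1 >= THRESHOLD) {
        onThresholdReached(); // hypothetical function holding the event's action
    }
}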

How to mark an object for deletion in x days' time?

I have a regional object store. I would like to be able to mark a particular object for deletion 5 days from now.
How do you suggest I implement this?
I don't really want to keep track of the object in a database and send delete commands from a separate time-based process. Is there any tag that could be set on the object so that deletion occurs at a later time (measured from now, not a specific time in the past)?
There's no functionality built into Google Cloud Storage to do this.
You can configure Lifecycle Management to delete objects according to a number of criteria (including age), but deleting at a particular date in the future isn't one of the supported conditions, and in fact there's no guarantee that a lifecycle rule will run on the same day its condition becomes true. Instead you would have to implement this functionality yourself (e.g., on Compute Engine or App Engine).
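For reference, a hedged sketch with the google-cloud-storage Java client of the closest built-in option, an age-based lifecycle rule ("my-bucket" is hypothetical); note that such a rule applies to every object in the bucket, not to one object at a chosen future date:
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.BucketInfo.LifecycleRule;
import com.google.cloud.storage.BucketInfo.LifecycleRule.LifecycleAction;
import com.google.cloud.storage.BucketInfo.LifecycleRule.LifecycleCondition;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.util.Collections;

Storage storage = StorageOptions.getDefaultInstance().getService();
Bucket bucket = storage.get("my-bucket");
// Delete objects 5 days after their creation time (bucket-wide rule,
// evaluated on the service's own schedule, not at an exact moment).
bucket.toBuilder()
    .setLifecycleRules(Collections.singletonList(new LifecycleRule(
        LifecycleAction.newDeleteAction(),
        LifecycleCondition.newBuilder().setAge(5).build())))
    .build()
    .update();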

Can I use Time as globally unique event version?

I find time to be the best value for an event version.
I can merge perfectly independent events from different event sources on different servers whenever needed, without worrying about read-side event-order synchronization. I know which event (from server 1) happened before another (from server 2) without needing a global sequential event-id generator that all read sides would depend on.
As long as time is a globally ever-increasing event version, different teams in a company can act as distributed event sources or event readers, and everyone can always rely on the contract.
The world's simplest notification from a write side to subscribed read sides, followed by a query pulling the recent changes from the underlying write side, can simplify everything.
Are there any side effects I'm not aware of?
Time is indeed increasing and gives you a deterministic number; however, event versioning does not only serve the purpose of preventing conflicts. When we commit a new event to the event store, we send the new event's version along, and it must match the expected version on the event store side, which must be the previous version plus exactly one. Whether a thousand or three million ticks passed between two events, I do not really care; that does not give me the information I need. What is critical to know is whether I have missed an event along the way. So I would not use anything other than an incremental counter, with events versioned per aggregate/stream.
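A hedged sketch (hypothetical in-memory store) of the per-stream incremental versioning described above. Because versions are dense integers, a reader that sees version n+2 right after n knows immediately that it missed an event, which a timestamp alone can never tell it:
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

final class EventStream {
    private final List<Object> events = new ArrayList<>();

    // Optimistic concurrency: the append succeeds only if the caller has seen
    // the current head of the stream; the new event's version is current + 1.
    synchronized void append(Object event, long expectedVersion) {
        long current = events.size(); // version == number of committed events
        if (expectedVersion != current) {
            throw new ConcurrentModificationException(
                "expected version " + expectedVersion + ", but stream is at " + current);
        }
        events.add(event);
    }
}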