Registering triggers for missing expected events using Esper in real-time

My use case is to identify, in real time rather than with batch jobs, entities from which expected events have not been received after X amount of time. For example:
If we have received a PaymentInitiated event at time T but have not received any of PaymentFailed / PaymentAborted / PaymentSucceeded by T+X, then raise a trigger saying PaymentStuck along with the details of the PaymentInitiated event.
1. Can I capture such triggers using Esper?
In my actual use case X is not constant and varies per record, and I would know its value before the first event has occurred.
2. Can Esper support registering such dynamic queries where X is not constant?
Thanks,
Harish

You could use a pattern such as "pattern [every pi=PaymentInitiated -> timer:interval(pi.amountOfTimeInSeconds) and not (PaymentFailed(id=pi.id) or PaymentAborted(id=pi.id) or PaymentSucceeded(id=pi.id))]". Because the interval length is read from the PaymentInitiated event itself, X can vary per record.
An outer join is also handy to detect absences. The solution patterns page on the Esper web site has more examples.
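For completeness, a minimal sketch of registering that statement and reacting to it, assuming the Esper 7.x-style API (imports from com.espertech.esper.client omitted) and that PaymentInitiated/PaymentFailed/PaymentAborted/PaymentSucceeded are your own POJOs with getId(), PaymentInitiated additionally with getAmountOfTimeInSeconds():
// Register the event types and the absence-detecting pattern.
Configuration config = new Configuration();
config.addEventType("PaymentInitiated", PaymentInitiated.class);
config.addEventType("PaymentFailed", PaymentFailed.class);
config.addEventType("PaymentAborted", PaymentAborted.class);
config.addEventType("PaymentSucceeded", PaymentSucceeded.class);
EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

String epl = "select pi from pattern [every pi=PaymentInitiated -> "
    + "timer:interval(pi.amountOfTimeInSeconds) "
    + "and not (PaymentFailed(id=pi.id) or PaymentAborted(id=pi.id) or PaymentSucceeded(id=pi.id))]";

EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
stmt.addListener((newEvents, oldEvents) -> {
    // Each fired event is a "PaymentStuck" situation: the per-record interval elapsed
    // without any outcome event arriving for the same id.
    PaymentInitiated stuck = (PaymentInitiated) newEvents[0].get("pi");
    System.out.println("PaymentStuck for id " + stuck.getId());
});

// Feed events as they arrive (the constructor is assumed, it is your own class):
engine.getEPRuntime().sendEvent(new PaymentInitiated("p-1", 120));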

AnyLogic ‘how to’ questions

I am using AnyLogic for a simulation-modeling class, and I am not AnyLogic- or coding-smart. My last and only coding class was MATLAB-based, about 16 years ago. I have a few questions about how to implement modeling concepts in a discrete model with AnyLogic.
How can I add/inject agents directly into a queue downstream from a source? I have tried adding an additional source to use the “Calls of inject() function” arrival mode, but I am not sure how to implement it after selecting it (for example: what do I do after selecting “Calls of inject() function”?). I have the new source feeding directly into the queue where I want the injection.
How can I set the release of an agent to a defined schedule instead of a rate? Currently, I have my working model set to interarrival time. But I would like to set the agent release to a defined schedule. (example: agent-1 released at 120 seconds, agent-2 released at 150 seconds, agent-3 released at 270 seconds)
Any help would be greatly appreciated, especially if it can be written in an “explain it to me like I am 5 years old” format.
Question 1:
If you have a source connected directly to a queue, then when you call source.inject() an agent will be created at the source block and go to the queue. If you have one source with multiple possible destinations, then you will have to use select output blocks and some criteria to route agents from the source to the desired queue.
Since you mentioned not being a strong programmer this probably isn't for you, but I often find myself creating agents via add_population and then just adding them to an ArrayList until I am ready to pull them into the DES flow (see the sketch below). Really, there are near-infinite ways to control agent flow within AnyLogic.
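For illustration, a minimal sketch of that idea. It assumes a model with an agent population "patients" of type Patient, a collection "waiting" of type ArrayList<Patient>, and a Process Modeling Library Enter block named "enterQueue" connected directly in front of the queue; all of these names are illustrative, not AnyLogic built-ins:
// Create an agent now and park it until we are ready to pull it into the DES flow.
// AnyLogic generates an add_<populationName>() function for every agent population.
Patient p = add_patients();
waiting.add(p);

// Later, e.g. in an event's or a button's action: push one waiting agent straight into
// the flow just in front of the queue, bypassing the source entirely.
if (!waiting.isEmpty()) {
    enterQueue.take(waiting.remove(0));
}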
Question 2:
Option a: Arrivals by "Arrival table in Database". You can link an AnyLogic database table to Excel, and the source block will then have an agent arrive for each row of that table.
Option b: Arrival schedule. You could set this up manually within the development environment or load your schedule from a database. I prefer option a over option b given your brief description.
Option c: Read the data into a variable and then write code to release agents based on the next arrival time. There are thousands of ways to do this, but one example: keep a list of doubles (your arrival times), set an event to delay until the next arrival, call the inject() function, and remove that arrival from the list (a sketch follows below). I think option a would be best for you, but given that AnyLogic allows you to add Java code, there are no limits to how sophisticated you could make your arrival logic.
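A minimal sketch of option c, assuming the model time unit is seconds and the model contains a Source block named "source" set to "Calls of inject() function", a LinkedList<Double> collection "arrivalTimes" pre-filled with absolute release times (120, 150, 270, ...), and a dynamic event class "ArrivalEvent"; the names are illustrative:
// On model startup, schedule the first release (timeouts are in model time units, assumed seconds):
if (!arrivalTimes.isEmpty()) {
    create_ArrivalEvent(arrivalTimes.removeFirst());
}

// Action code of the dynamic event class ArrivalEvent:
source.inject(1);                                        // release one agent now
if (!arrivalTimes.isEmpty()) {
    // arrival times are absolute, so subtract the current model time to get the delay
    create_ArrivalEvent(arrivalTimes.removeFirst() - time());
}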
For 2) you could also use an event or a dynamic event. The action could be source.inject(1); and you can schedule the events to your preferences with variables. Just be vigilant that you restart the events if necessary.
There is a demo model from AnyLogic for dynamic events.

Is it possible to use composite triggers in conjunction with micro-batching with Dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(500000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
    ).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this yields an error: "An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified".
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that would essentially render the idea of inserting either every five minutes or every N messages completely void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency.
If you want more real-time results, you can use STREAMING_INSERTS.
Reference: https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS
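For illustration, a sketch of the write step from the question with only the method swapped to STREAMING_INSERTS, keeping the destination and dispositions unchanged (withNumFileShards applies only to FILE_LOADS and is dropped). Note that streaming inserts write rows as they arrive and are batched internally by BigQueryIO, so the composite trigger above no longer controls the write cadence:
.apply("StreamWriteToBigQuery", BigQueryIO.writeTableRows()
    .to(destination)
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));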

Streaming data Complex Event Processing over files and a rather long period

my challenge:
we receive files every day with about 200,000 records. We keep the files for approximately 1 year to support re-processing, etc.
For the sake of the discussion assume it is some sort of long lasting fulfilment process, with a provisioning-ID that correlates records.
we need to identify flexible patterns in these files, and trigger events
typical questions are:
if record A is followed by record B, which is followed by record C, and all records occurred within 60 days, then trigger an event
if record D or record E was found, but record F did NOT follow within 30 days, then trigger an event
if both record D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event
some patterns require lookups in a DB/NoSQL store, or joins, for additional information, either to select the record or to put into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []", "field-A match " or "func identify(field-A, field-B)"
"days" might also be "hours" or "in previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within setup phase).
The created events (preferably JSON) need to contain data from all records which were part of the selection process.
We need an approach that allows to flexibly change (add, modify, delete) the pattern, optionally re-processing the input files.
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or does any of the public cloud solutions (AWS, GCP, Azure) address this problem space especially well?
thanks a lot for your help
After some discussions and reading, we'll first try Apache Flink with the FlinkCEP library. From the docs and blog entries it seems able to do the job. It also seems to be AWS's choice, running on their EMR cluster. We didn't find any managed service on GCP or Azure providing this functionality; of course we can always deploy and manage it ourselves. Unfortunately we didn't find a Python framework.
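For illustration, a minimal FlinkCEP sketch of the first rule ("record A followed by record B followed by record C, all within 60 days"), keyed by the correlating provisioning-ID. It assumes a POJO Record with getType() and getProvisioningId(), an already-built DataStream of parsed records, and a toJson() helper of your own; these names are illustrative, not part of Flink:
// Requires the flink-cep dependency; classes come from org.apache.flink.cep.*,
// org.apache.flink.cep.pattern.*, org.apache.flink.streaming.api.* and java.util.*.
DataStream<Record> records = parsedDailyFiles(); // assumption: your own source/parsing step

Pattern<Record, ?> aThenBThenC = Pattern.<Record>begin("a")
    .where(new SimpleCondition<Record>() {
        @Override public boolean filter(Record r) { return "A".equals(r.getType()); }
    })
    .followedBy("b")
    .where(new SimpleCondition<Record>() {
        @Override public boolean filter(Record r) { return "B".equals(r.getType()); }
    })
    .followedBy("c")
    .where(new SimpleCondition<Record>() {
        @Override public boolean filter(Record r) { return "C".equals(r.getType()); }
    })
    .within(Time.days(60));

PatternStream<Record> matches =
    CEP.pattern(records.keyBy(Record::getProvisioningId), aThenBThenC);

DataStream<String> triggeredEvents = matches.select(
    new PatternSelectFunction<Record, String>() {
        @Override public String select(Map<String, List<Record>> match) {
            // match holds the A, B and C records, so the emitted JSON event can carry
            // data from all records that were part of the selection.
            return toJson(match); // assumption: your own JSON serializer
        }
    });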

Real-time data streaming using Wikipedia's RecentChanges API

Lately I've been trying to create a demo of real-time streaming using NiFi -> Kafka -> Druid -> Superset. For the purposes of this demo I chose to use Wikipedia's RecentChanges API in order to get asynchronous data about the most recent changes.
I use this URL in order to get a response of changes. I'm calling the API constantly in order not to miss any changes, and this way I get a lot of duplicates that I do not want. Is there any way to parameterize this API to fix that, for example getting all the changes from the previous second and doing that every second, or something else to tackle this issue? I'm trying to build a configuration for this using NiFi; if someone has something to add on that part, please visit this discussion on Cloudera.
Yes. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges Use rcstart and rcend to define your start and end times. You can use "now" for rcend.
I want to expand on smartse's answer and come up with a solution. You want to scope your API requests to certain time windows by shifting the start and end parameters. Windowing might work like this:
Initialize start, end timestamp parameters
Put those parameters as attributes on the flow
Downstream processors can call the API using those parameters
After doing that, you have to set start = previous_end + 1 second and end = now
When you determine the new window for the next run, you need the parameters from the previous run, which is why you have to remember those values. You can achieve this using NiFi's distributed map cache (a plain-Java sketch of the windowing logic follows the steps below).
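To make the start/end shifting concrete, here is the same windowing logic as a plain-Java sketch outside NiFi. rcstart, rcend and rcdir are real recentchanges parameters; the class, field and method names and the initial one-hour window are illustrative:
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RecentChangesWindow {
    private Instant previousEnd; // the remembered state, analogous to the distributed map cache entry

    public String nextRequestUrl() {
        Instant start = (previousEnd == null)
            ? Instant.now().minus(1, ChronoUnit.HOURS).truncatedTo(ChronoUnit.SECONDS) // first run (assumption)
            : previousEnd.plusSeconds(1);                                              // start = previous end + 1 second
        Instant end = Instant.now().truncatedTo(ChronoUnit.SECONDS);                   // end is always "now"
        previousEnd = end;                                                              // remember it for the next run

        return "https://en.wikipedia.org/w/api.php?action=query&list=recentchanges"
            + "&rcdir=newer&rcstart=" + start + "&rcend=" + end + "&format=json";
    }
}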
I've assembled a flow for you:
Zoom into Get next date range:
The end parameter is always now, so you just have to store the start parameter. FetchDistributedMapCache will fetch that for you and put it into the stored.state attribute:
The Set time range processor initializes the parameters:
Notice that end is always now and start is either an initial date (for the first run) or the last end parameter plus 1 second. At this point the flow is directed into the Time range output, where you can call your API downstream. Additionally, you have to update the stored value. This happens in the ReplaceText processor:
Finally you update the state:
The lifecycle of the parameters is bound to the cache identifier. When you change the identifier, you start from scratch.

Can I use Time as globally unique event version?

I find time to be the best value for an event version.
I can merge perfectly independent events from different event sources on different servers whenever needed, without worrying about read-side event order synchronization. I know which event (from server 1) happened before the other (from server 2) without needing a global sequential event ID generator, which would make all read sides depend on it.
As long as the time is a globally, ever-increasing sequential event version, different teams in companies can act as distributed event sources or event readers, and everyone can always rely on the contract.
The world's simplest notification from a write side to subscribed read sides, followed by a query pulling the recent changes from the underlying write side, can simplify everything.
Are there any side effects I'm not aware of?
Time is indeed increasing and you get a deterministic number; however, event versioning does not only serve the purpose of preventing conflicts. We always say that when we commit a new event to the event store, we send the new event version there as well, and it must match the expected version on the event store side, which must be the previous version plus exactly one. Whether there are a thousand or three million ticks between two events, I do not really care; that does not give me the information I need. But whether I have missed an event along the way is critical to know. So I would not use anything other than an incremental counter, with events versioned per aggregate/stream.
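For illustration, a minimal sketch of the per-stream expected-version check this answer describes; the class and method names are illustrative, not any particular event-store API:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InMemoryEventStore {
    private final Map<String, List<Object>> streams = new HashMap<>();

    // The append is rejected unless expectedVersion equals the stream's current version,
    // i.e. the new event is exactly "previous version plus one". A gap or a concurrent
    // writer is detected immediately, which a timestamp-based version cannot guarantee.
    synchronized void append(String streamId, long expectedVersion, Object event) {
        List<Object> stream = streams.computeIfAbsent(streamId, id -> new ArrayList<>());
        if (stream.size() != expectedVersion) {
            throw new IllegalStateException("Conflict on " + streamId + ": expected version "
                + expectedVersion + " but stream is at version " + stream.size());
        }
        stream.add(event); // the appended event becomes version expectedVersion + 1
    }
}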