Overriding time used for time windows in Esper - complex-event-processing

I am working on a CEP project where I analyze logs from a file in bulk. The file is a compressed csv file that is bulk transferred over to my analytics machine every hour, where each line contains an event with a timestamp for exactly when it happened during that previous hour.
Reading this file into a plain Java object is no problem and I will typically end up with something like this:
class MyEvent {
public Date getTimestamp();
public String getMessage(); // shortened to these fields only for simplicity
public String getSource();
public int getCount();
}
So the problem is that this file may contain events that were written anywhere between 1 hour ago and 1 second ago, and the only way to know is to inspect the timestamp field in the event itself. When loading these events into Esper, they will all be loaded within a few seconds (there will probably be tens of thousands, and they will be loaded as fast as Esper can accept them).
Now, the analysis itself wants to calculate the average "count" per "source" every 5 minutes in Esper (nothing too complex). However, since all events are loaded within a few seconds, the time windows in Esper will be wrong and all events may fall into the same time window regardless of when they were produced. So my question is: is there any way to override what is counted as the event timestamp in Esper time windows?
The problem gets worse when a time window is split across two files that are loaded an hour apart.
Thank you.

This will do it:
select source, sum(count) from MyEvent group by source output all every 5 seconds
Esper also allows an external timer, so application code can control engine time freely.
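To expand on the external-timer option: you disable Esper's internal clock and advance engine time from each event's own timestamp before sending it, so time windows and output rate limiting follow event time rather than load time. Below is a minimal sketch, assuming the Esper 5.x/6.x/7.x client API (Esper 8 renamed these classes; the same idea works through the event service's advanceTime) and assuming the events are fed in timestamp order, since engine time must only move forward:
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.time.CurrentTimeEvent;
import java.util.List;

public class BulkReplay {
    public static void replay(List<MyEvent> eventsSortedByTimestamp) {
        // Turn off the internal wall-clock timer so engine time only moves when we advance it.
        Configuration config = new Configuration();
        config.addEventType(MyEvent.class);
        config.getEngineDefaults().getThreading().setInternalTimerEnabled(false);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // The statement from the answer above; with external time, "every 5 seconds"
        // is measured in replayed event time, not wall-clock time.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select source, sum(count) from MyEvent group by source output all every 5 seconds");
        // Attach an UpdateListener to stmt here to receive the output rows.

        for (MyEvent e : eventsSortedByTimestamp) {
            // Advance engine time to the event's own timestamp, then feed the event.
            engine.getEPRuntime().sendEvent(new CurrentTimeEvent(e.getTimestamp().getTime()));
            engine.getEPRuntime().sendEvent(e);
        }
    }
}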

Related

Apache Beam: Handling of late data

I am using fixed windows in my pipeline with this configuration:
Window.<KV<String, DeviceData>>into(FixedWindows.of(Duration.standardSeconds(options.getWindowSize())))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(10))))
.withAllowedLateness(Duration.standardHours(3))
.accumulatingFiredPanes();
Is my assumption correct that there will be one trigger after the window closes, and then an additional trigger every 10 minutes after the window has closed if there is late data, until 3 hours have passed since the window closed?
How can I handle importing data for periods that are further in the past (for example, months)? In that case I would receive no triggers at all.
It seems that you are missing Repeatedly.forever, which allows the trigger to fire multiple times, as your description requires.
For data further in the past, you need to configure .withAllowedLateness() based on the lateness that you want to handle.
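For illustration, one common way Repeatedly.forever is used is to wrap the repeating trigger directly, for example a processing-time trigger that keeps firing for late data within the allowed lateness. This is only a sketch of the shape of the change (it drops the separate AfterWatermark on-time firing, so check the exact composition you need against the Beam trigger documentation):
// requires org.apache.beam.sdk.transforms.windowing.Repeatedly
Window.<KV<String, DeviceData>>into(
        FixedWindows.of(Duration.standardSeconds(options.getWindowSize())))
    .triggering(
        Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(10))))
    .withAllowedLateness(Duration.standardHours(3))
    .accumulatingFiredPanes();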

Anylogic: Dataset resource_unit_states_log

I have created a simple model (see first attachment) in AnyLogic. Resource unit W1 is seized in service and resource unit W2 is seized in service 1. The delay time of both service and service 1 is 5 minutes. The interarrival time of source is 10 minutes and the interarrival time of source 1 is 6 minutes.
Now I would like to analyse the usage state of both resource units, but in dataset resource_unit_states_log there is only the state "usage_busy" logged. Is there any possibility to also log the usage state "idle" in this dataset? Later in my evaluation I want to know the exact date and time when the resource was in state "idle". Currently I can only read the exact date and time for the state "busy" from the data set (see screenshot in first attachment). Theoretically, I could manually calculate the date and time of the "idle" state based on the existing values, but that would take a long time with thousands of dates.
Another attempt was to track the "idle" state using a time plot. If I use W1.time() as the x-axis value, I get the model time (e.g. 0, 1, 2 ...) in the dataset. But instead I want the exact date, like 27-12-2021 00:06:00, as in the dataset "resource_unit_states_log".
Does anyone have an idea how I can solve either of these problems?
AnyLogic internal tables/logs are not modifiable; they are as they are. If you want the data in any other format, you need to collect it yourself with your own data-collection functions/code. In your case, the second approach is pretty good: you are collecting information every minute and you can export it. I usually do the post-processing in Python. I work with millions of rows and it takes a few minutes; your thousands of rows should take a few seconds. Here is how I would do it:
Export the data (from your second, time-plot approach) into Excel. The data should look like this:
Open a Jupyter notebook (or any IDE).
Read the data into Python. Let's say you have saved your data as data.xlsx.
Input your start_datetime, i.e. the starting date and time of your simulation.
Then just add the minutes from your data to the start_datetime.
Write the modified data to a new Excel file called data_modified.xlsx. It will look like this:
Here is the full code:
import pandas as pd

# Read the exported time-plot data; this assumes you saved it as data.xlsx
# and that column 'x' holds the model time in minutes.
df = pd.read_excel('data.xlsx')

# Input your simulation start date and time below:
start_datetime = pd.Timestamp('2021-12-31 00:00:00')

# Add the model-time minutes to the start time to get the real date and time.
df['datetime'] = start_datetime + pd.to_timedelta(df['x'], unit='m')

# Write the result to a new file.
df.to_excel('data_modified.xlsx')
Another approach:
You can use the On seize unit and On exit fields inside the Service block to log the times when the resource is seized and released, using the function time(), and write this information into a dataset. Then do the calculations; a minimal sketch is shown below.
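A sketch of what those two code fields could contain, assuming you add a Data Set object (here hypothetically named w1StateLog) to the agent, with 1 meaning busy and 0 meaning idle; the dataset can then be exported and converted to dates with the Python post-processing shown above:
// On seize unit:
w1StateLog.add(time(), 1);   // W1 becomes busy at the current model time

// On exit:
w1StateLog.add(time(), 0);   // W1 is released and becomes idle again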
You can also use some of AnyLogic's built-in date/time conversion functions.

Streaming data Complex Event Processing over files and a rather long period

My challenge:
We receive files every day with about 200,000 records. We keep the files for approximately 1 year, to support re-processing, etc.
For the sake of the discussion assume it is some sort of long lasting fulfilment process, with a provisioning-ID that correlates records.
We need to identify flexible patterns in these files and trigger events.
Typical questions are:
if record A is followed by record B which is followed by record C, and all records occurred within 60 days, then trigger an event
if record D or record E was found, but record F did NOT follow within 30 days, then trigger an event
if both records D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event
Some patterns require lookups in a DB/NoSQL store or joins for additional information, either to select the record or to put into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []", "field-A match " or "func identify(field-A, field-B)".
"days" might also be "hours" or "in the previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within setup phase).
The created events (preferably JSON) need to contain data from all records which were part of the selection process.
We need an approach that allows us to flexibly change (add, modify, delete) the patterns, optionally re-processing the input files.
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or do any of the public cloud solutions (AWS, GCP, Azure) address the problem space especially well?
Thanks a lot for your help.
After some discussions and reading, we'll first try Apache Flink with the FlinkCEP library. From the docs and blog entries it seems able to do the job. It also appears to be AWS's choice, running on their EMR cluster. We didn't find any managed service on GCP or Azure providing this functionality; of course, we can always deploy and manage it ourselves. Unfortunately, we didn't find a comparable Python framework.
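To make the fit concrete, here is a rough sketch of how the first pattern ("record A followed by record B followed by record C, all within 60 days") could be expressed with FlinkCEP. The Record POJO, its type field and the keying by provisioning-ID are assumptions about how the file lines would be modelled, not part of the original question:
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

public class AbcPatternSketch {

    // Hypothetical POJO for one parsed line of the daily files.
    public static class Record {
        public String provisioningId;
        public String type;            // e.g. "A", "B", "C"
        public long timestampMillis;
    }

    // "record A followed by record B followed by record C, all within 60 days"
    public static Pattern<Record, ?> abcWithin60Days() {
        return Pattern.<Record>begin("a")
            .where(new SimpleCondition<Record>() {
                @Override public boolean filter(Record r) { return "A".equals(r.type); }
            })
            .followedBy("b")
            .where(new SimpleCondition<Record>() {
                @Override public boolean filter(Record r) { return "B".equals(r.type); }
            })
            .followedBy("c")
            .where(new SimpleCondition<Record>() {
                @Override public boolean filter(Record r) { return "C".equals(r.type); }
            })
            .within(Time.days(60));
    }

    // Applied per correlation key, roughly:
    //   CEP.pattern(records.keyBy(r -> r.provisioningId), abcWithin60Days())
    //      .process(...)   // build the JSON event from the matched records
}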

Google Sheet IMPORTHTML VERY Slow?

I have a spreadsheet that acquires some table data using the IMPORTHTML function, and for the first two days I was using it (refreshing twice daily) things were going fine. As of this morning, it is absolutely crawling: it went from taking ~15 seconds to load 30 rows to taking ~10 minutes. Can somebody lend aid on this?
Example formula:
=IMPORTHTML(
"http://www.muthead.com/16/players/prices/1508-markus-wheaton/playstation-4","table",2
)
As mentioned, for the first couple of days it was able to refresh and process a list of 30 without any pauses. Now I get the 'Executing script' message for about ten minutes before it begins to do anything, and I haven't touched the formulas since I created them. I'm not sure what contributes to the performance of the IMPORTHTML statement...
I've run into similar loading issues when using IMPORTHTML, IMPORTDATA, etc. The best solution I've found is to write a trigger that will edit your formula so it is forced to refresh every hour or so.
Open up the script editor and put this in. Replace 'A1' with the cell your IMPORTHTML formula is in, and replace foo with the URL you're trying to import.
function refreshData() {
  // Replace 'A1' with the cell that holds your IMPORTHTML formula.
  var range = SpreadsheetApp.getActiveSpreadsheet().getRange('A1');
  // Clearing and re-setting the formula forces Sheets to re-fetch the table.
  range.clear();
  // Replace foo with your URL (keep the quotes), plus the query and index arguments.
  range.setFormula('=IMPORTHTML("foo", "table", 2)');
}
Then go to Edit > Current project's triggers, add a trigger, and set a refresh interval.
Hope this helps.

Need to Make a Parse Cloud Code Job to Reset all Objects at a Certain Time of Day

I currently have an app powered by Parse that monitors the wait times for a certain amusement park. On Parse, each ride has its own class, and each class has an object with a string field entitled "waitTime" that holds the most recent wait time submitted. I would like to use Cloud Code to reset the "waitTime" field of every object to "0" at 1:00 AM each morning. I have no experience with Cloud Code or anything like it. How would I go about doing something like this? Thank you in advance for your help!
Have a job that runs every 5 minutes, comparing the current server time to any scheduled times (e.g. 1 AM, taking time zones into account), and if one matches, run that task (in your case, resetting the "waitTime").
It is common to have this single master job trigger multiple different tasks (functions in your cloud code) using different rules such as time or a manual job queue; that's why I suggest this pattern.