Getting actual trigger run start time for Tumbling Window trigger - azure-data-factory

I am interested in getting the actual run start time for a Tumbling Window trigger. I don't want a Schedule trigger. My scenario specifically demands a Tumbling Window trigger, but some logic also requires knowing exactly when a triggered run started. As per the documentation, I tried using @pipeline().TriggerTime: I passed it as the value of one of the pipeline parameters, but it was not converted into a value. I then realized that the scope of this expression is the pipeline, so I can't use it in a trigger. @trigger().outputs.windowStartTime can be used in a trigger, but it doesn't serve my purpose: I am not looking for the window start time, which is fixed no matter when the trigger is executed. I want the actual run start time for a Tumbling Window trigger. Is there any solution to this?

One solution I found is to create an Append Variable activity and reference @pipeline().TriggerTime in the Value field of the activity. Since the activity is part of the pipeline, the expression is evaluated there.
Another solution is to simply call utcnow() in the Append Variable activity.

Related

Trigger Date for reruns

My pipeline's activities need the date of the run as a parameter. Currently I get the current date in the pipeline from the utcnow() function. Ideally this would be something I could set dynamically in the trigger, so that I could rerun a failed day and the parameter would be set correctly. As it stands, a rerun executes my pipeline with today's date, not the date of the failed run.
I am used to Airflow, where such things are pretty easy to do, including scheduling reruns. I am probably thinking too much in Airflow terms, but I can't wrap my head around a better solution.
In ADF, passing the trigger date of a failed pipeline run to the trigger is not directly supported.
You can get the trigger time using @pipeline().TriggerTime.
This system variable gives the time at which the trigger started the pipeline run.
You can store this value for every pipeline run and pass it as a parameter when you rerun a run that failed.
Reference: Microsoft documentation on system variables in ADF
To resolve my problem I had to create a nested structure of pipelines: the top pipeline sets a variable for the date and then calls the other pipelines, passing that variable.
With this I still can't rerun the top pipeline, but rerunning Execute Pipeline1/2/3 reruns them with the right variable set. It is still not perfect, since the top pipeline run remains marked as failed and it is difficult to keep track of what needs to be rerun, but it is a partial solution.

access variables in "Main" from another agent (anylogic)

I have a model, shown below, which I run for two situations.
1) In the first run (situation 1), I call traceln(productDemand) in the "event-generateDemand" event placed in Main. At the end of the simulation, I get the values in the first column below.
2) In the second run (situation 2), I instead call traceln(main.productDemand) in the "event" event placed in the Producer agent. At the end of the second simulation, I get the values in the second column below.
I would expect these two values to be the same at every simulation time, but they are not, as shown in Fig. 1. What is the problem? Why is the productDemand variable different when it is accessed from another agent at the same time?
I hope I was able to explain my problem.
Fig.1- The obtained results as table format
Fig.2- The screenshot of Event placed in Main
Fig.3- The screenshot of Event placed in Producer agent
Fig.4- The obtained results for both traceln functions on the running
Fig.5- Simulation experiment screenshot.
There is no bug in the model; it is simply a matter of timing. Not all events occur at exactly the same "time", even though they all occur at the same time step. One will always execute before the other.
See the simple example below:
I have eventA, which increases the variable value and then traces it (similar to your event on Main).
Then I have another event that traces the variable as well, similar to yours in the agent.
Yet when I run the model, the variable appears to have different values at the same model time, depending on where it is traced.
If you click on the Events tab in the console you will see that event B is scheduled to run before event A
Even though both will run at the same timestep in the model they don't run "at the same time"
If you want to be in total control of what happens at a specific time step, it is advisable to have a single event that runs at the interval you want (e.g. daily), place it on Main, and have it call all the functions in the order you want them executed.
If you don't do this, AnyLogic will schedule the events as they get created, which most of the time is the order in which you placed them on the canvas.
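To make the ordering effect concrete, here is a minimal plain-Java sketch of a generic discrete-event queue. It is not AnyLogic's engine or API: the class EventOrderSketch, the Event type, and the tie-breaking rule (events at the same model time run in creation order) are assumptions made for illustration. The tracing event sees the old or the new value of productDemand depending only on which event was scheduled first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical discrete-event sketch; it does not use or reproduce AnyLogic's engine.
class EventOrderSketch {

    // One scheduled event: model time, creation order (tie-breaker), and the work to do.
    static final class Event {
        final double time;
        final long seq;
        final Runnable action;

        Event(double time, long seq, Runnable action) {
            this.time = time;
            this.seq = seq;
            this.action = action;
        }
    }

    // Stand-in for the productDemand variable on Main.
    static int productDemand;

    // Runs three time steps; at each step one event updates and traces the
    // variable ("main") and another event only traces it ("producer").
    static List<String> run(boolean producerScheduledFirst) {
        productDemand = 0;
        List<String> log = new ArrayList<>();
        // Earlier time first; for equal times, the event created first runs first.
        PriorityQueue<Event> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Event e) -> e.time)
                        .thenComparingLong(e -> e.seq));
        long seq = 0;
        for (int t = 1; t <= 3; t++) {
            final int time = t;
            Runnable mainEvent = () -> {
                productDemand += 10;
                log.add("t=" + time + " main=" + productDemand);
            };
            Runnable producerEvent =
                    () -> log.add("t=" + time + " producer=" + productDemand);
            if (producerScheduledFirst) {
                queue.add(new Event(time, seq++, producerEvent));
                queue.add(new Event(time, seq++, mainEvent));
            } else {
                queue.add(new Event(time, seq++, mainEvent));
                queue.add(new Event(time, seq++, producerEvent));
            }
        }
        while (!queue.isEmpty()) {
            queue.poll().action.run();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

With the producer's event scheduled first, the trace at t=1 shows 0 for the producer and 10 for Main; with the order reversed, both show 10. Calling both steps from one event, in a fixed order, corresponds to the second case.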

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

I'm working with code that uses a tumbling window of one day, and would like to send early results to a different DataStream on an hourly basis.
I understand that triggers are a way to go here, but don't really see how it would work.
The current code is as follows:
myStream
    .keyBy(..)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .aggregate(new MyAggregateFunction(), new MyProcessWindowFunction());
In my understanding, I should register a trigger, and then in its onEventTime method get hold of a TriggerContext, from which I can send data to the labeled output. But how do I get the current state of MyAggregateFunction from there? Or would I need to do my own computation inside onEventTime()?
Also, the documentation states that "By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner.". Would my one-day window then still fire correctly, or do I need to trigger it differently somehow?
Another way of doing this is to create two different operators: one that windows by 1 hour, and another that windows by 1 day. Would triggers be preferable to that approach?
Rather than using a custom Trigger, it would be simpler to have two layers of windowing, where the hourly results are further aggregated into daily results. Something like this:
// First layer: hourly windows.
hourlyResults = myStream
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .aggregate(new MyAggregateFunction(), new MyProcessWindowFunction());
// Second layer: aggregate the hourly results into daily windows.
dailyResults = hourlyResults
    .keyBy(...)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .aggregate(new MyAggregateFunction(), new MyProcessWindowFunction());
hourlyResults.addSink(...);
dailyResults.addSink(...);
Note that the result of a window is not a KeyedStream, so you will need to use keyBy again, unless you can arrange to leverage reinterpretAsKeyedStream (docs).
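As a rough illustration of why the two layers compose, the following plain-Java sketch (no Flink classes) buckets timestamped values into hourly windows and then rolls those hourly sums up into daily windows. The class and method names are hypothetical, windows are assumed to be aligned to the epoch with no offset, and a simple sum stands in for MyAggregateFunction.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of two-layer tumbling windows; no Flink classes are used.
class TwoLayerWindowSketch {
    static final long HOUR_MS = 60L * 60 * 1000;
    static final long DAY_MS = 24 * HOUR_MS;

    // Start of the epoch-aligned tumbling window that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // First layer: sum values per hourly window. Each row is {timestampMs, value}.
    static Map<Long, Long> hourlySums(long[][] events) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] e : events) {
            sums.merge(windowStart(e[0], HOUR_MS), e[1], Long::sum);
        }
        return sums;
    }

    // Second layer: roll the hourly results up into daily windows.
    static Map<Long, Long> dailySums(Map<Long, Long> hourly) {
        Map<Long, Long> sums = new TreeMap<>();
        for (Map.Entry<Long, Long> e : hourly.entrySet()) {
            sums.merge(windowStart(e.getKey(), DAY_MS), e.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long[][] events = {
            {10 * 60 * 1000L, 1},        // day 0, hour 0
            {50 * 60 * 1000L, 2},        // day 0, hour 0
            {HOUR_MS + 5_000L, 3},       // day 0, hour 1
            {DAY_MS + 2 * HOUR_MS, 4}    // day 1, hour 2
        };
        Map<Long, Long> hourly = hourlySums(events);
        System.out.println(hourly);            // hourly results, keyed by window start
        System.out.println(dailySums(hourly)); // daily results built from the hourly ones
    }
}
```

Each daily total is just the sum of that day's hourly totals, which is why the hourly results can go to one sink and be re-aggregated for the other.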
Normally when I get to more complex behavior like this, I use a KeyedProcessFunction. You can aggregate (and save in state) hourly and daily results, set timers as needed, and use a side output for the hourly results versus the regular output for the daily results.
There are quite a few questions here; I will try to answer all of them. First of all, if you specify your own trigger using trigger(), you effectively override the default trigger, so the window may not work the way it would by default. For example, if you create a 1-day event-time tumbling window but override the trigger so that it fires on every 20th element, the window will never fire based on event time.
Now, after your custom trigger fires, the output from MyAggregateFunction will be passed to MyProcessWindowFunction, so it will work the same as with the default trigger; you don't need to access MyAggregateFunction from inside the trigger.
Finally, while it may be technically possible to implement a trigger that emits partial results every hour, my personal opinion is that you should probably go with the two separate windows. While this solution may create slightly more overhead and may result in larger state, it should be much clearer, easier to implement, and much less error-prone.

How to pause data factory tumbling window trigger if error occurs

I have a tumbling window trigger for a pipeline. If a window fails, I do not want any additional windows to run until the failed window is addressed. What is the best way to handle this?
There is no native way to achieve this with a tumbling window trigger, but ADF provides various control and transformation activities that you can combine into a workaround:
Store a flag in a table or file that indicates the pipeline run status for a window. After each window execution, update the flag, and check it before executing the next window run. You may need a Lookup activity to fetch the flag value, an If Condition activity to check it, and a Custom or Stored Procedure activity to update it. HTH.
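The control flow of that flag check can be sketched in plain Java. This is only a simulation of the logic, not ADF code: WindowGateSketch, runWindow, clearFlag, and the in-memory blocked field (standing in for the table or file) are all hypothetical names. How skipped windows get rerun after the flag is cleared is left out of the sketch.

```java
import java.util.function.IntPredicate;

// Hypothetical simulation of the flag pattern; none of these names are ADF APIs.
class WindowGateSketch {

    // Stand-in for the flag kept in a table or file:
    // true while a failed window has not been addressed yet.
    static boolean blocked = false;

    // Mirrors the activity chain: Lookup (read flag) -> If Condition (check flag)
    // -> the window's real work -> Stored Procedure/Custom activity (update flag).
    static String runWindow(int window, IntPredicate work) {
        if (blocked) {
            return "window " + window + ": skipped";
        }
        boolean succeeded = work.test(window);
        blocked = !succeeded;
        return "window " + window + (succeeded ? ": succeeded" : ": failed");
    }

    // Called once the failed window has been addressed.
    static void clearFlag() {
        blocked = false;
    }

    public static void main(String[] args) {
        IntPredicate failsOnWindowTwo = w -> w != 2;
        for (int w = 1; w <= 4; w++) {
            System.out.println(runWindow(w, failsOnWindowTwo));
        }
        clearFlag();
        System.out.println(runWindow(5, failsOnWindowTwo));
    }
}
```

In this run, window 2 fails, windows 3 and 4 are skipped, and window 5 runs again once the flag has been cleared.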

Cloud Dataflow: Once trigger not working

I have a Dataflow pipeline reading from an unbounded source. My window size is 10 hours, and I am trying to test my trigger using a TestStream. My trigger should emit an early result once the element count reaches at least 2 for the same key within a window. I have the following trigger to achieve this:
input.apply(Window.into(FixedWindows.of(Duration.standardHours(12)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(2))))
    .apply(Count.perElement())
We also tried:
Repeatedly.forever(AfterPane.elementCountAtLeast(2)).orFinally(AfterWatermark.pastEndOfWindow())
I expect early firings when asserting the result; however, I don't get all the results in
PAssert.that(pipeline).inWindow(..)..
What am I doing wrong? Also, running the same test repeatedly yields different results, meaning different values are returned from the trigger.
Triggering is non-deterministic. It will give you an early firing some time after the trigger condition is satisfied. It will then give you another early firing some time after the trigger condition is satisfied again.
The decision of when to actually emit after the trigger condition is satisfied is made by the runner. If you are using a batch runner, it may wait until all the data is available. How much input are you expecting for each key/window? Which runner are you using?
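A toy model in plain Java may help show why the early panes can differ between runs while the totals stay the same. It does not use Apache Beam, and it rests on a loud assumption: the trigger condition is only checked after each bundle of elements, with the bundle size chosen by the runner. EarlyFiringSketch and paneSizes are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of count-based early firing; it does not use Apache Beam.
class EarlyFiringSketch {

    // Returns the pane sizes for one key/window.
    // ASSUMPTION: the trigger is evaluated only after each bundle is processed.
    static List<Integer> paneSizes(int totalElements, int bundleSize, int threshold) {
        if (bundleSize < 1) {
            throw new IllegalArgumentException("bundleSize must be positive");
        }
        List<Integer> panes = new ArrayList<>();
        int buffered = 0;
        int seen = 0;
        while (seen < totalElements) {
            int inBundle = Math.min(bundleSize, totalElements - seen);
            seen += inBundle;
            buffered += inBundle;
            if (buffered >= threshold) { // early firing: at least `threshold` elements
                panes.add(buffered);
                buffered = 0;            // start a fresh pane after firing
            }
        }
        panes.add(buffered);             // final pane at the end of the window (may be empty)
        return panes;
    }

    public static void main(String[] args) {
        System.out.println(paneSizes(5, 1, 2));
        System.out.println(paneSizes(5, 3, 2));
        System.out.println(paneSizes(5, 5, 2));
    }
}
```

In this model, five elements with a threshold of 2 produce panes of sizes [2, 2, 1], [3, 2, 0], or [5, 0] depending on the bundle size, yet every run sums to 5. A test that asserts on individual early panes would therefore be fragile, whereas one that asserts on the combined result per window would not.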