Cloud Dataflow: Once trigger not working

Cloud Dataflow: Once trigger not working - triggers

I have a Dataflow pipeline reading from unbounded source. My window size is 10 hours, I am trying to test my trigger using a TestStream. My trigger will emit early result if element count reaches at least 2 for the same key within a Window. I have following trigger to achieve this:
input.apply(Window.into(FixedWindows.of(Duration.standardHours(12))) .triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(2)))
.apply(Count.perElement())
We also tried:
Repeatedly.forever(AfterPane.elementCountAtLeast(2)).orFinally(AfterWatermark.pastEndOfWindow())
I expect early firing when asserting the result, however I don't get all the result in
PAssert.that(pipeline).inWindow(..)..
What am I doing wrong? Also running same test repeatedly yields different result meaning different values are returned from the trigger.

Triggering is non-deterministic. It will give you an early firing some time after the trigger condition is satisfied. It will then give you another early firing some time after the trigger condition is satisfied again.
The actual choice to emit after the trigger is determined by the runner. If you are using a batch runner, it may wait until all the data is available. How much input are you expecting for each key/window? Which runner are you using?

Related

ADF fail error. Property 'output' cannot be selected

I have the following Fail task in my Until:
On my Until I have an expression that fails the pipeline if an error is encountered in my Until:
#or(greater(variables('RunDate') ,utcnow()), activity('Fail').output)
I've been getting this error for the last few days. I have 4 other pipelines that work the same way but only this one is giving me this error.
I'm not sure what this error means. It seems it can't select the Output text from the Fail activity but not sure why.

You want to stop the Until activity if any of your child activities fails.
I reproduced the similar scenario and gave the Fail activity output in until condition and got same error.
This is my scenario inside until which causing the error.
You will get this error when all of your inner activities inside Until succeeds. As every activity succeeds, the pipeline flow won't recognize the existence of Fail activity in that particular iteration. And when pipeline flow meets condition in the until, it will fail because it don't know about Fail activity.
Even though, in first iteration if any of your activity fails and Fail activity executed, you will get the error in the until condition because the activity('Fail').output gives the JSON object which will not fit in the or logic as it expects only a boolean value.
So, you can't give the Fail activity output like that in the Until condition.
The workaround for this can be achieved with two set variable activities inside Until.
The Until activity stops the execution of the inner activities when the condition of the Until becomes true. It will execute as long as the condition is false.
First create a pipeline variable of boolean type.
Set the variable to false if you want to continue the execution after success.
In your case, join the success of your last activity to this Set variable activity.
Set the variable to true, if you want to stop the execution after failure and along with your date condition.
Now, give the Until condition as per your requirement using or(your date condition, boolean variable) or you can use and as well.
But make sure that, it gives true if you want to stop and false if you want to continue.
NOTE: As per my repro, the Until is giving the error as Activity failed because an inner activity failed if any of the child activity fails only in the final iteration and not in every iteration.

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

I'm working with code that uses a tumbling window of one day, and would like to send early results to a different DataStream on an hourly basis.
I understand that triggers are a way to go here, but don't really see how it would work.
The current code is as follows:
myStream
.keyBy(..)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
In my understanding, I should register a trigger, and then on its onEventTime method get a hold of a TriggerContext and I can send data to the labeled output from there. But how do I get the current state of MyAggregateFunction from there? Or would I need to my own computation here inside of onEventTime()?
Also, the documentation states that "By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner.". Would my one day window then still fire correctly, or do I need to trigger it somehow differently?
Another way of doing this is creating two different operators - one that windows by 1 hour, and another that windows by 1 day. Would triggers be a preferred approach to that?

Rather than using a custom Trigger, it would be simpler to have two layers of windowing, where the hourly results are further aggregated into daily results. Something like this:
hourlyResults = myStream
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.hours(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
dailyResults = hourlyResults
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
hourlyResults.addSink(...)
dailyResults.addSink(...)
Note that the result of a window is not a KeyedStream, so you will need to use keyBy again, unless you can arrange to leverage reinterpretAsKeyedStream (docs).

Normally when I get to more complex behavior like this, I use a KeyedProcessFunction. You can aggregate (and save in state) hourly and daily results, set timers as needed, and use a side output for the hourly results versus the regular output for the daily results.

There are quite a few questions here. I will try to ask all of them. First of all, if You specify Your own trigger using trigger() this means You are going to effectively override the default trigger and thus the window may not work the way it would by default. So, if You for example if You create the 1 day event time tumbling window, but override a trigger so that it fires for every 20th element, it will never fire based on event time.
Now, after Your custom trigger fires, the output from MyAggregateFunction will be passed to MyProcessWindowFunction, so It will work the same as for the default trigger, you don't need to access the MyAggregateFunction from inside the trigger.
Finally, while it may be technically possible to implement trigger to trigger partial results every hour, my personal opinion is that You should probably go with the two separate windows. While this solution may create a slightly larger overhead and may result in a larger state, it should be much clearer, easier to implement, and finally much more error resistant.

Is it possible use composite triggers in conjunction with micro-batching with dataflow?

We have an unbounded PCollection PCollection<TableRow> source that we are inserting to BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(500000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this would yield a compile error An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified
Relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, which would essentially render the idea of inserting either every five minutes or every N messages completely void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?

FILE_LOADS requires triggering frequency.
If you want more realtime results then you can use STREAMING_INSERTS
Reference https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS

Getting actual trigger run start time for Tumbling Window trigger

I am interested in getting actual run start time for Tumbling Window trigger. I don't want Schedule Trigger. My scenario demands for Tumbling Window trigger specifically, but also some logic also requires knowing exactly at what time a triggered run started. As per the documentation I tried using #pipeline().TriggerTime , basically I passed it as a value to one of the pipeline parameters, but then it was not converted into a value -- then I realized the scope of this expression is within pipeline so I can't use it in a trigger. #trigger().outputs.windowStartTime can be used in a trigger but it doesn't serve my purpose -- I am not looking for a window start time , which is fixed no matter when a trigger is executed. I want actual run start time for Tumbling Window trigger. Is there any solution to this?

One solution I found is that we create Append Variable activity and call #pipeline().TriggerTime in the value section of the activity. Since this is part of the pipeline, it gets converted into a value there.
Another solution is to simply call utcnow() in the append variable activity.

GroupByKey not updating on very long PTransform with a Window

I'm working on a streaming Java Apache Beam (2.13.0) pipeline that is running in Google Cloud Dataflow. I have a long running PTransform (for a single input, it does a lot of work, outputs multiple outputs and can take >10 minutes).
I want to return early results from the processing to the user. I have a Window and Combine step afterwards. Early triggers do not seem to work with a long running PTransform. The Combine step outputs elements after the PTransform finishes processing the element (rather than returning early results).
I've tried many different early Window functions. E.g. I've tried doing forever element count triggers and it does not work. Ditto for forever processing time-based triggers (e.g. every 10 processing seconds). I've tried GlobalWindows, Fixed Windows, Session Windows, etc.
Here is rough pseudo code for what I'm doing.
p.apply(PubsubIO.readStrings().fromSubscription(options.getInput()));
.apply(FlatMapElements.via(new LongRunningCalculation()))
.apply(<I've tried a variety of window functions>)
.apply(Combine.perKey(new SumMetrics()))
.apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
For the Window functions, I've tried many different Window functions to see if I can get anything to return early. I can't get it to return early.
Here is a basic one.
Window.into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes());
Even for this one, Even if the Window has added >>10 elements, the GroupBy in the Combine step does not output any rows.
Expected: If I have a long running PTransform, I'd still expect early triggers to still fire.
Actual: I can't seem to get early triggers to work.
Any advice?

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse