GroupByKey not updating on very long PTransform with a Window

GroupByKey not updating on very long PTransform with a Window - apache-beam

I'm working on a streaming Java Apache Beam (2.13.0) pipeline that is running in Google Cloud Dataflow. I have a long running PTransform (for a single input, it does a lot of work, outputs multiple outputs and can take >10 minutes).
I want to return early results from the processing to the user. I have a Window and Combine step afterwards. Early triggers do not seem to work with a long running PTransform. The Combine step outputs elements after the PTransform finishes processing the element (rather than returning early results).
I've tried many different early Window functions. E.g. I've tried doing forever element count triggers and it does not work. Ditto for forever processing time-based triggers (e.g. every 10 processing seconds). I've tried GlobalWindows, Fixed Windows, Session Windows, etc.
Here is rough pseudo code for what I'm doing.
p.apply(PubsubIO.readStrings().fromSubscription(options.getInput()));
.apply(FlatMapElements.via(new LongRunningCalculation()))
.apply(<I've tried a variety of window functions>)
.apply(Combine.perKey(new SumMetrics()))
.apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
For the Window functions, I've tried many different Window functions to see if I can get anything to return early. I can't get it to return early.
Here is a basic one.
Window.into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes());
Even for this one, Even if the Window has added >>10 elements, the GroupBy in the Combine step does not output any rows.
Expected: If I have a long running PTransform, I'd still expect early triggers to still fire.
Actual: I can't seem to get early triggers to work.
Any advice?

Related

Parameters Variation not running model in AnyLogic

When I create a ParametersVariation simulation, the main model does not run. All I see is the default UI with iterations completed and replication. My end goal (as with most people) is to have a model go through a certain number of replications, but nothing is even running. There is limited documentation available on this. Please advise.

This is how Parameters Variation is intended to work. If you're running 1000 runs and multiple replications with parallel runs, how can you see what's happening in Main in each?
Typically, the best way to benefit from such an experiment is to track the results of each run using elements from the Analysis palette or even better to export results to Excel or similar.
To be able to collect data, you need to write your code in Java actions fields with root. to access elements in main (or top-level agent).
Check the example below, where after each run a variable from main is added to a dataset in the Parameters Variation experiment. At the end of 100 runs for example, the dataset will have 100 values of the main variable, with 1 value for each run.

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

I'm working with code that uses a tumbling window of one day, and would like to send early results to a different DataStream on an hourly basis.
I understand that triggers are a way to go here, but don't really see how it would work.
The current code is as follows:
myStream
.keyBy(..)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
In my understanding, I should register a trigger, and then on its onEventTime method get a hold of a TriggerContext and I can send data to the labeled output from there. But how do I get the current state of MyAggregateFunction from there? Or would I need to my own computation here inside of onEventTime()?
Also, the documentation states that "By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner.". Would my one day window then still fire correctly, or do I need to trigger it somehow differently?
Another way of doing this is creating two different operators - one that windows by 1 hour, and another that windows by 1 day. Would triggers be a preferred approach to that?

Rather than using a custom Trigger, it would be simpler to have two layers of windowing, where the hourly results are further aggregated into daily results. Something like this:
hourlyResults = myStream
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.hours(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
dailyResults = hourlyResults
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
hourlyResults.addSink(...)
dailyResults.addSink(...)
Note that the result of a window is not a KeyedStream, so you will need to use keyBy again, unless you can arrange to leverage reinterpretAsKeyedStream (docs).

Normally when I get to more complex behavior like this, I use a KeyedProcessFunction. You can aggregate (and save in state) hourly and daily results, set timers as needed, and use a side output for the hourly results versus the regular output for the daily results.

There are quite a few questions here. I will try to ask all of them. First of all, if You specify Your own trigger using trigger() this means You are going to effectively override the default trigger and thus the window may not work the way it would by default. So, if You for example if You create the 1 day event time tumbling window, but override a trigger so that it fires for every 20th element, it will never fire based on event time.
Now, after Your custom trigger fires, the output from MyAggregateFunction will be passed to MyProcessWindowFunction, so It will work the same as for the default trigger, you don't need to access the MyAggregateFunction from inside the trigger.
Finally, while it may be technically possible to implement trigger to trigger partial results every hour, my personal opinion is that You should probably go with the two separate windows. While this solution may create a slightly larger overhead and may result in a larger state, it should be much clearer, easier to implement, and finally much more error resistant.

Is it possible use composite triggers in conjunction with micro-batching with dataflow?

We have an unbounded PCollection PCollection<TableRow> source that we are inserting to BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(500000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this would yield a compile error An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified
Relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, which would essentially render the idea of inserting either every five minutes or every N messages completely void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?

FILE_LOADS requires triggering frequency.
If you want more realtime results then you can use STREAMING_INSERTS
Reference https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS

Cloud Dataflow: Once trigger not working

I have a Dataflow pipeline reading from unbounded source. My window size is 10 hours, I am trying to test my trigger using a TestStream. My trigger will emit early result if element count reaches at least 2 for the same key within a Window. I have following trigger to achieve this:
input.apply(Window.into(FixedWindows.of(Duration.standardHours(12))) .triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(2)))
.apply(Count.perElement())
We also tried:
Repeatedly.forever(AfterPane.elementCountAtLeast(2)).orFinally(AfterWatermark.pastEndOfWindow())
I expect early firing when asserting the result, however I don't get all the result in
PAssert.that(pipeline).inWindow(..)..
What am I doing wrong? Also running same test repeatedly yields different result meaning different values are returned from the trigger.

Triggering is non-deterministic. It will give you an early firing some time after the trigger condition is satisfied. It will then give you another early firing some time after the trigger condition is satisfied again.
The actual choice to emit after the trigger is determined by the runner. If you are using a batch runner, it may wait until all the data is available. How much input are you expecting for each key/window? Which runner are you using?

Matlab code taking a long time to run

I have a Matlab code (from a journal paper) and I'm trying to re-simulate their data.
I executed the code one week ago. I think the code is taking so long time to run. Matlab is still busy and taking 50% of my cpu.
I was wondering if the process has ended with some errors somewhere in the code. My question is:
When I see no errors, can I be sure that everything is fine with this running process? And I can wait until it is finished?
Is there any way to check which part of code is being run now ( without stopping the execution)?
Or I should stop the program and try something else?
Actually I don't want to loose this 1 week and if you think everything is fine, I would wait until the code stops.
(The authors of the paper didn't reply to my question and I don't know how long should it naturally take... They just mentioned it may take a long time to simulate the data).

Unfortunately, there is little we can do for you.
When I see no errors, can I be sure that everything is fine with this running process?
That's pretty much the definition of an error. If no error is raised, then it means that the program is still running.
Is there any way to check which part of code is being run now (without stopping the execution)?
Unfortunately no. For long-lasting execution times like that, a good developing practice is to display some information from time to time to inform the end user of the execution status.
However, if the programs produces files all along the way (like for instance at every step in an iterative simulation) you can check on your computer that the files are well-produced, and the production rate will more or less inform you on the total execution time.
For all your other questions, well, it's up to you to decide what to do (stop it or let it run). Be aware that the execution time can differ significantly from one machine to another, so the time it took on the author's machine may not be really informative to you.
In the future, I would advise you to react faster than within a week. When you launch a code that has a long execution time and see that there is no display within the first hour, you should stop it, modify it such that it regulatly displays information, and re-run it. It's better to loose one hour than one week.
Best,

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

GroupByKey not updating on very long PTransform with a Window - apache-beam

Related

Parameters Variation not running model in AnyLogic

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

Is it possible use composite triggers in conjunction with micro-batching with dataflow?

Cloud Dataflow: Once trigger not working

Matlab code taking a long time to run

Categories

Resources