Flink AssignerWithPeriodicWatermarks getCurrentWatermark is never called - scala

I'm trying to do a simple Sliding Window aggregation based on a Kafka source.
The events on Kafka all contain a timestamp element and arrive in ascending order. I've tried different periodic watermark assigners (ascending, bounded, and a custom one to make it easier to debug what is going on internally). I can tell that the extractTimestamp method is always being called, but the getCurrentWatermark method never is.
I've set the autoWatermarkInterval as low as 1 ms, and even then the watermark for each subtask is never updated. I've verified this in the Flink UI by looking at the available metric.
I've read quite a few similar questions on this topic on SO, and most were about the window never emitting, for various reasons. I haven't been able to identify why the watermark would never advance here.
I've also confirmed that no data is being emitted to the side output as late data.
The stream in its most basic form:
val rfq = kafkaDataStream
  .assignAscendingTimestamps(_.timestamp.toEpochMilli)
  .keyBy("id")
val lateTag = new OutputTag[RFQ]("late") {}
val predictions: DataStream[RFQPrediction] = rfq
  .window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(3))) // size and slide must be Time values (seconds assumed here)
  .sideOutputLateData(lateTag)
  .aggregate(new PricePredictionsAggregate)
  .name("windowed-predictions")
I've verified that it works fine with an AssignerWithPunctuatedWatermarks.
What could cause the getCurrentWatermark method never to be called, even though the interval is set to 1 ms?
The test data that I'm feeding through uses a limited list of ids for which events are continually being generated with an ever-increasing timestamp.
Thanks a lot!

Related

Filtering certain types of Requests logged by quarkus.http.access-log

I want to test a new REST client I am building, and I'd like to see the exact request it constructs, so I set the quarkus.http.access-log.enabled=true property. When starting Quarkus, however, I am bombarded with the logs of many scheduled requests that happen simultaneously. The worst of these are several Elasticsearch scroll requests, which return a lot of data that is fed directly into the log.
My idea is that everything returning a response containing _index should be filtered out, since I know by now that my ES client is working properly; however, it pumps out so much data that the request I actually want to log is pushed out of view almost instantly.
So my question: does somebody know a working (and convenient) way to effectively filter out unwanted HTTP access logs?
I tried setting
quarkus.http.access-log.exclude-pattern=(_index+)
in an attempt to filter out unwanted requests, but I'm not sure where to continue from there.
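For what it's worth, exclude-pattern is, as far as I understand it, a regular expression matched against the request path rather than the response body, so a path-based pattern along these lines (the path here is only a placeholder) may be closer to what the option supports:
quarkus.http.access-log.exclude-pattern=/elasticsearch/.*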

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

I'm working with code that uses a tumbling window of one day, and would like to send early results to a different DataStream on an hourly basis.
I understand that triggers are a way to go here, but don't really see how it would work.
The current code is as follows:
myStream
.keyBy(..)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
My understanding is that I should register a trigger and then, in its onEventTime method, get hold of the TriggerContext and send data to the labeled output from there. But how do I get the current state of MyAggregateFunction from there? Or would I need to do my own computation inside onEventTime()?
Also, the documentation states that "By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner." Would my one-day window then still fire correctly, or do I need to trigger it somehow differently?
Another way of doing this is creating two different operators - one that windows by 1 hour, and another that windows by 1 day. Would triggers be a preferred approach to that?
Rather than using a custom Trigger, it would be simpler to have two layers of windowing, where the hourly results are further aggregated into daily results. Something like this:
hourlyResults = myStream
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.hours(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
dailyResults = hourlyResults
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
hourlyResults.addSink(...)
dailyResults.addSink(...)
Note that the result of a window is not a KeyedStream, so you will need to use keyBy again, unless you can arrange to leverage reinterpretAsKeyedStream (see the docs).
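If the hourly results still carry the key, that could look roughly like this (sketch only; HourlyResult and getKey() are hypothetical stand-ins for whatever the hourly window actually emits):
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// Experimental API: tells Flink the stream is already partitioned by this key,
// avoiding a second shuffle before the daily window.
KeyedStream<HourlyResult, String> keyedHourly =
    DataStreamUtils.reinterpretAsKeyedStream(hourlyResults, r -> r.getKey());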
Normally when I get to more complex behavior like this, I use a KeyedProcessFunction. You can aggregate (and save in state) hourly and daily results, set timers as needed, and use a side output for the hourly results versus the regular output for the daily results.
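A rough sketch of that approach (not actual code from the question; Event, Result, the aggregation logic, and the timer arithmetic below are placeholders):
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class HourlyAndDailyAggregator extends KeyedProcessFunction<String, Event, Result> {

    // Side output for the early (hourly) results; daily results go to the main output.
    public static final OutputTag<Result> HOURLY_OUTPUT = new OutputTag<Result>("hourly") {};

    private static final long HOUR = 3_600_000L;
    private static final long DAY = 86_400_000L;

    private transient ValueState<Result> hourlyAgg;
    private transient ValueState<Result> dailyAgg;

    @Override
    public void open(Configuration parameters) {
        hourlyAgg = getRuntimeContext().getState(new ValueStateDescriptor<>("hourly", Result.class));
        dailyAgg = getRuntimeContext().getState(new ValueStateDescriptor<>("daily", Result.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Result> out) throws Exception {
        // Merge the event into both running aggregates here (aggregation logic omitted).
        long ts = ctx.timestamp(); // assumes timestamps/watermarks are assigned upstream
        ctx.timerService().registerEventTimeTimer(ts - (ts % HOUR) + HOUR - 1); // end of current hour
        ctx.timerService().registerEventTimeTimer(ts - (ts % DAY) + DAY - 1);   // end of current day
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) throws Exception {
        if ((timestamp + 1) % DAY == 0) {
            out.collect(dailyAgg.value());                // daily result to the regular output
            dailyAgg.clear();
            hourlyAgg.clear();
        } else {
            ctx.output(HOURLY_OUTPUT, hourlyAgg.value()); // early hourly result to the side output
            hourlyAgg.clear();
        }
    }
}
The hourly stream would then be obtained downstream via getSideOutput(HourlyAndDailyAggregator.HOURLY_OUTPUT).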
There are quite a few questions here; I will try to answer all of them. First of all, if you specify your own trigger using trigger(), you effectively override the default trigger, and the window may no longer behave the way it does by default. For example, if you create a one-day event-time tumbling window but override the trigger so that it fires on every 20th element, it will never fire based on event time.
Now, after your custom trigger fires, the output of MyAggregateFunction is passed to MyProcessWindowFunction, so it works the same as with the default trigger; you don't need to access MyAggregateFunction from inside the trigger.
Finally, while it may be technically possible to implement a trigger that emits partial results every hour, my personal opinion is that you should go with the two separate windows. That solution may create slightly more overhead and a somewhat larger state, but it should be much clearer, easier to implement, and much more robust.

Is it possible to use composite triggers in conjunction with micro-batching with Dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(500000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this fails with the error "An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified".
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that essentially renders the idea of inserting either every five minutes or every N messages void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency.
If you want more real-time results, you can use STREAMING_INSERTS instead.
Reference: https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS
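For illustration, the write from the question switched to streaming inserts might look roughly like this (a sketch; destination and the dispositions are carried over from the question, and withNumFileShards is dropped because it only applies to FILE_LOADS):
.apply("StreamingWriteToBigQuery", BigQueryIO.writeTableRows()
    .to(destination)
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS) // no triggering frequency required
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));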

GroupByKey not updating on very long PTransform with a Window

I'm working on a streaming Java Apache Beam (2.13.0) pipeline that runs in Google Cloud Dataflow. I have a long-running PTransform (for a single input, it does a lot of work, produces multiple outputs, and can take more than 10 minutes).
I want to return early results from the processing to the user. I have a Window and Combine step afterwards, but early triggers do not seem to work with a long-running PTransform: the Combine step only outputs elements after the PTransform finishes processing the element, rather than returning early results.
I've tried many different early-firing window configurations. E.g. I've tried repeatedly-forever element-count triggers and they do not work. Ditto for repeatedly-forever processing-time triggers (e.g. every 10 seconds of processing time). I've tried GlobalWindows, fixed windows, session windows, etc.
Here is rough pseudo code for what I'm doing.
p.apply(PubsubIO.readStrings().fromSubscription(options.getInput()))
    .apply(FlatMapElements.via(new LongRunningCalculation()))
    .apply(<I've tried a variety of window functions>)
    .apply(Combine.perKey(new SumMetrics()))
    .apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
As for the window functions, I've tried many different ones to see if I can get anything to return early, but nothing does.
Here is a basic one.
Window.into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes();
Even with this one, even when the window has received far more than 10 elements, the GroupByKey in the Combine step does not output any rows.
Expected: If I have a long running PTransform, I'd still expect early triggers to still fire.
Actual: I can't seem to get early triggers to work.
Any advice?

Is it necessary to use windows in Flink?

I'm attempting to transform a stream of data without using any window provided by Flink. My code looks something like this:
val stream1 = executionEnvironment.getStream
val stream2 = stream1.flatMap(someFunction)
stream2.addSink(s3_Sink)
executionEnvironment.execute()
However, upon submitting and running my job, I'm not getting any output on S3. The web UI shows 0 bytes received, 0 records received, 0 bytes sent, 0 records sent.
Another running Flink job is already using the same data source, so the source itself is fine. There are no errors anywhere, but still no output. Could this be because I'm not using any window or key operation? I also tried getting output after assigning ascending timestamps, but still didn't get any. Any idea what might be wrong?
I guess that has nothing to do with a missing window. Rule of thumb: Use windows when you want any kind of aggregation (folds, reduces, etc.).
Regarding your initial problem: from what you have shown so far, I can only imagine that the flatMap operator doesn't produce any output (in contrast to a map, which always has to emit a value, a flatMap might filter everything out). Maybe you can add more code so that we can have a closer look.
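To illustrate that point with the Java API (stream1 is assumed to be a DataStream<String> here, and the filter condition is purely hypothetical):
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.util.Collector;

// A flatMap may emit zero, one, or many records per input. If the condition below
// never matches, nothing is ever collected, and the sink reports 0 records received.
DataStream<String> stream2 = stream1.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) {
        if (value.contains("relevant")) { // hypothetical filter condition
            out.collect(value);
        }
    }
});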