Is it necessary to use windows in Flink? - scala

I'm attempting to transform a stream of data without using any window provided by Flink. My code looks something like this:
val stream1 = executionEnvironment.addSource(someSource) // stand-in for the actual source used by the job
val stream2 = stream1.flatMap(someFunction)
stream2.addSink(s3_Sink)
executionEnvironment.execute()
However, upon submitting and running my job, I'm not getting any output on S3. The web UI shows 0 bytes received, 0 records received, 0 bytes sent, 0 records sent.
Another running Flink job is already using the same data source, so the source itself is fine. There are no errors anywhere, but still no output. Could the issue be that I'm not using any window or keyed operation? I also tried assigning ascending timestamps, but still got no output. Any idea what could be going wrong?

I'd guess this has nothing to do with a missing window. Rule of thumb: use windows when you want any kind of aggregation (folds, reduces, etc.).
Regarding your initial problem: from what you have shown so far, I can only imagine that the flatMap operator doesn't produce any output (in contrast to a map, which always has to emit exactly one value, a flatMap may filter out everything). Maybe you can add more code so that we can have a closer look.
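To illustrate (a made-up sketch, not your code; the class and predicate are invented for the example): a flatMap only emits what it explicitly passes to its Collector, so if the emitting branch is never taken the sink receives zero records, which would match the symptoms you describe.
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.util.Collector

// Illustrative only: a flatMap that emits zero or one element per input.
class FilteringFlatMap extends FlatMapFunction[String, String] {
  override def flatMap(value: String, out: Collector[String]): Unit = {
    if (value.nonEmpty) {       // made-up predicate; it may reject every element
      out.collect(value.trim)   // only emitted when this branch is taken
    }
    // no collect() on the other branch -> the element is silently dropped
  }
}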

Related

Is it possible to use composite triggers in conjunction with micro-batching with Dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or every five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(500000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)))))
    .withAllowedLateness(Duration.standardMinutes(1440))
    .discardingFiredPanes());
You would think that applying the following to the fired window/pane would allow you to write the contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this fails with the error: An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified.
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that would essentially render the idea of inserting either every five minutes or every N messages void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency.
If you want more real-time results, you can use STREAMING_INSERTS instead.
Reference https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS
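For illustration, a minimal sketch of the write configured with STREAMING_INSERTS instead of FILE_LOADS (Beam's Java builder called here from Scala; destination and the dispositions are assumed to be the same as in the question). Streaming inserts don't need withTriggeringFrequency, though their cost and quota characteristics differ from file loads.
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO

// Rows are streamed to BigQuery as they arrive, so no triggering frequency is required.
val writeToBigQuery = BigQueryIO.writeTableRows()
  .to(destination) // same destination as in the question
  .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
  .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
  .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)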

Flink AssignerWithPeriodicWatermarks getCurrentWatermark is never called

I'm trying to do a simple Sliding Window aggregation based on a Kafka source.
The events on Kafka all contain a timestamp field and arrive in ascending order. I've tried different periodic watermark assigners (ascending, bounded, and a custom one to more easily debug what is going on internally). I can tell that the extractTimestamp method is always called, but the getCurrentWatermark method never is.
I've set the autoWatermarkInterval as low as 1 ms, and even then the watermark for each subtask is never updated. I've verified this in the Flink UI by looking at the available watermark metric.
I've read quite a few similar questions on this topic on SO, and most were about the window never emitting for various reasons. I haven't been able to identify a reason why the watermark would never advance.
I've also confirmed that no data is being side outputted as late data.
The stream in its most basic form:
val rfq = kafkaDataStream
  .assignAscendingTimestamps(_.timestamp.toEpochMilli)
  .keyBy("id")
val lateTag = new OutputTag[RFQ]("late") {}
val predictions: DataStream[RFQPrediction] = rfq
  .window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(3))) // units abbreviated as of(5, 3) in the original
  .sideOutputLateData(lateTag)
  .aggregate(new PricePredictionsAggregate)
  .name("windowed-predictions")
I've verified that it works fine with an AssignerWithPunctuatedWatermarks.
What could cause the getCurrentWatermark method to never be called, even though the interval is set to 1 ms?
The test data that I'm feeding through uses a limited list of ids for which events are continually being generated with an ever-increasing timestamp.
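For reference, here is a minimal sketch of the kind of custom debug assigner and setup described above (names are illustrative, not the actual job code; env is the StreamExecutionEnvironment and RFQ is the event type from the snippet):
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Logs every call so it is easy to see whether Flink ever asks for a watermark.
class DebugAscendingAssigner extends AssignerWithPeriodicWatermarks[RFQ] {
  private var maxTimestamp = Long.MinValue + 1

  override def extractTimestamp(element: RFQ, previousElementTimestamp: Long): Long = {
    val ts = element.timestamp.toEpochMilli
    maxTimestamp = math.max(maxTimestamp, ts)
    ts
  }

  override def getCurrentWatermark(): Watermark = {
    println(s"getCurrentWatermark called, emitting ${maxTimestamp - 1}") // never printed in the failing job
    new Watermark(maxTimestamp - 1)
  }
}

// Used in place of assignAscendingTimestamps:
// kafkaDataStream.assignTimestampsAndWatermarks(new DebugAscendingAssigner)

// Periodic assigners are only polled when event time and a non-zero interval are configured.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(1) // milliseconds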
Thanks a lot!

Can I test kafka-streams suppress logic?

My application uses Kafka Streams' suppress logic.
I want to test a Kafka Streams topology that uses suppress.
When I run a unit test, my topology does not emit a result.
Kafka Streams logic:
...
.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(5), Suppressed.BufferConfig.maxBytes(1_000_000_000L).emitEarlyWhenFull()))
...
My test case code:
After creating the input data, the test case can't read the output record of the suppress logic.
It just returns null:
testDriver.pipeInput(recordFactory.create("input", key, dummy, 0L));
System.out.println(testDriver.readOutput("streams-result", Serdes.String().deserializer(), serde.deserializer()));
Can I test my suppress logic?
The simple answer is yes.
Some good references are the Confluent example tests; this example in particular tests the suppression feature. The many other examples there are always a good place to check first. Here is another example of mine, written in Kotlin.
An explanation of the feature and of testing it can be found in post 3 of this blog.
Some key points:
The window will only emit the final result, as described in the documentation.
To flush the final results you will need to send an extra dummy event, as seen in the examples such as Confluent's.
You will need to manipulate the event time to test it, since suppression works off event time; this can be provided via the test input topic API or a custom TimestampExtractor.
For testing I recommend setting the following to disable the cache and reduce the commit interval:
props[StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG] = 0
props[StreamsConfig.COMMIT_INTERVAL_MS_CONFIG] = 5
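For illustration, a minimal sketch of such a test using the TestInputTopic API available in recent Kafka versions (topic names taken from the question; the topology, props, and String serdes are assumptions for the example):
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.TopologyTestDriver

// `topology` and `props` are assumed to be built elsewhere, as in the question.
val testDriver = new TopologyTestDriver(topology, props)
val input = testDriver.createInputTopic("input",
  Serdes.String().serializer(), Serdes.String().serializer())
val output = testDriver.createOutputTopic("streams-result",
  Serdes.String().deserializer(), Serdes.String().deserializer())

input.pipeInput("key", "value", 0L)     // buffered by suppress()
input.pipeInput("key", "dummy", 6000L)  // dummy event 6 s later advances stream time past the 5 s limit
println(output.readKeyValue())          // the previously suppressed result is now readable
testDriver.close()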
Hope this helps.

Duplicate jobs are being generated in DAG for the same action in Spark

I have a Spark Streaming job in which I receive data from a message queue and process a bunch of records. In the process, I call take() on a Dataset. Although the take action happens as expected, in the DAG visualization I see multiple job ids created, and all of them run the same take action. This happens only when the data is on the order of hundreds of thousands of records. I didn't observe redundant jobs while running with tens of records on my local machine. Can anyone help me understand the reasoning behind this behavior?
The job ids (91 to 95) are basically running the same action. The following is the code snippet corresponding to the action mentioned above.
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
if (!corruptedMessageArray.isEmpty) {
  val firstCorruptedMessage: String = corruptedMessageArray(0)
}
Your question seems to be whether duplicate jobs are created by Spark.
If you look at the screenshot you will see that the jobs have a different number of tasks, hence it is not a simple matter of duplication.
I am not sure of the exact details, but for large datasets take() runs several quick successive jobs: it first scans only a few partitions and, if that does not yield enough rows, retries with progressively more partitions until it has collected the requested count.
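That matches Spark's documented behavior for take()/limit() on a Dataset. A small sketch of the knob that controls how quickly the partition count grows between those follow-up jobs (assuming a SparkSession named spark; the value shown is arbitrary):
// Each successive take()/limit() job scans roughly this factor more partitions
// than the previous one (default is 4); a larger value means fewer follow-up
// jobs at the cost of scanning more data per attempt.
spark.conf.set("spark.sql.limit.scaleUpFactor", 16L)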

GroupByKey not updating on very long PTransform with a Window

I'm working on a streaming Java Apache Beam (2.13.0) pipeline that runs in Google Cloud Dataflow. I have a long-running PTransform (for a single input, it does a lot of work, produces multiple outputs, and can take >10 minutes).
I want to return early results from the processing to the user. I have a Window and a Combine step afterwards. Early triggers do not seem to work with a long-running PTransform: the Combine step only outputs elements after the PTransform finishes processing the element, rather than returning early results.
I've tried many different windowing configurations with early triggers. E.g. I've tried repeated ("forever") element-count triggers and they do not fire. Ditto for repeated processing-time triggers (e.g. every 10 seconds of processing time). I've tried GlobalWindows, fixed windows, session windows, etc.
Here is rough pseudo code for what I'm doing.
p.apply(PubsubIO.readStrings().fromSubscription(options.getInput()))
    .apply(FlatMapElements.via(new LongRunningCalculation()))
    .apply(<I've tried a variety of window functions>)
    .apply(Combine.perKey(new SumMetrics()))
    .apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
As for the window functions, I've tried many different ones to see if I can get anything to return early, but I can't.
Here is a basic one.
Window.into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes();
Even for this one, even if the window has received far more than 10 elements, the GroupByKey inside the Combine step does not output any rows.
Expected: If I have a long-running PTransform, I'd still expect early triggers to fire.
Actual: I can't seem to get early triggers to work.
Any advice?