Duplicate jobs are being generated in DAG for the same action in Spark - scala

I have a Spark Streaming job in which I receive data from a message queue and process a bunch of records. In the process, I call take() on a dataset. Although the take action behaves as expected, in the DAG visualization I see multiple job ids created, and all of them correspond to the same take action. This happens only when the data is on the order of hundreds of thousands of records; I didn't observe redundant jobs when running with tens of records on my local machine. Can anyone help me understand the reasoning behind this behavior?
The job ids (91 to 95) are all running the same action. The following code snippet corresponds to that action.
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
if (corruptedMessageArray.nonEmpty) {
  val firstCorruptedMessage: String = corruptedMessageArray(0)
}

Your question seems to be whether duplicate jobs are created by Spark.
If you look at the screenshot, you will see that the jobs have a different number of tasks, so this is not a simple matter of duplication.
I am not sure exactly what is happening, but it seems that for large datasets take() runs several quick successive jobs. Perhaps it first probes a small number of partitions and, if they don't yield enough rows, launches follow-up jobs over progressively more partitions until the requested number of records has been collected.
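For illustration, a minimal sketch of that behaviour and of the knob that, as far as I know, controls how aggressively the scan grows between attempts (it assumes a SparkSession named spark and the corruptedMessageDs dataset from the question):

// Illustrative sketch only. take(1) first scans a small number of partitions;
// if they yield no rows, Spark submits another, larger job with the same call
// site, which is why several short jobs can appear in the UI for one take().
val firstCorrupted: Array[String] = corruptedMessageDs.take(1)

// The growth factor between successive attempts can reportedly be tuned via
// this setting (default 4); the key name is my assumption of the relevant knob.
spark.conf.set("spark.sql.limit.scaleUpFactor", "2")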

Related

Flink AssignerWithPeriodicWatermarks getCurrentWatermark is never called

I'm trying to do a simple Sliding Window aggregation based on a Kafka source.
The events on Kafka all contain a timestamp element and are in ascending order. I've tried different periodic watermark assigners (ascending, bounded, and a custom one to more easily debug what was going on internally). I can tell that the extractTimestamp method is always called, but the getCurrentWatermark method is never called.
I've set the autoWatermarkInterval as low as 1 ms, and even then the watermark for each subtask is never updated. I've verified this in the Flink UI by looking at the available watermark metric.
I've read quite a few similar questions on this topic on SO, and most were about the window never emitting for various reasons. None of them explained why the watermark would never advance.
I've also confirmed that no data is being side outputted as late data.
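For reference, the event-time setup assumed here looks roughly like this (variable names are illustrative; as far as I know, periodic assigners are only polled when the job runs in event time with an auto-watermark interval configured):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Periodic watermark assigners are only polled in event-time mode.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Ask Flink to call getCurrentWatermark every 1 ms.
env.getConfig.setAutoWatermarkInterval(1L)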
The stream in its most basic form:
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val rfq = kafkaDataStream
  .assignAscendingTimestamps(_.timestamp.toEpochMilli)
  .keyBy("id")

val lateTag = new OutputTag[RFQ]("late") {}

val predictions: DataStream[RFQPrediction] = rfq
  // window sizes were given as (5, 3); units assumed here to be seconds
  .window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(3)))
  .sideOutputLateData(lateTag)
  .aggregate(new PricePredictionsAggregate)
  .name("windowed-predictions")
I've verified that it works fine with an AssignerWithPunctuatedWatermarks.
What could cause the getCurrentWatermark method to never get called, even though the interval is set to 1 ms?
The test data that I'm feeding through uses a limited list of ids for which events are continually being generated with an ever-increasing timestamp.
Thanks a lot!

GroupByKey not updating on very long PTransform with a Window

I'm working on a streaming Java Apache Beam (2.13.0) pipeline running in Google Cloud Dataflow. I have a long-running PTransform (for a single input it does a lot of work, emits multiple outputs, and can take >10 minutes).
I want to return early results from the processing to the user. There is a Window and Combine step afterwards, but early triggers do not seem to work with a long-running PTransform: the Combine step only outputs elements after the PTransform finishes processing an element, rather than returning early results.
I've tried many different windowing and trigger configurations. For example, repeated element-count triggers don't work, and neither do repeated processing-time triggers (e.g. every 10 seconds of processing time). I've tried GlobalWindows, fixed windows, session windows, etc.
Here is rough pseudo code for what I'm doing.
p.apply(PubsubIO.readStrings().fromSubscription(options.getInput()))
 .apply(FlatMapElements.via(new LongRunningCalculation()))
 .apply(<I've tried a variety of window functions>)
 .apply(Combine.perKey(new SumMetrics()))
 .apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
I've tried many different window functions to see whether I can get anything to return early, but none of them do.
Here is a basic one.
Window.into(new GlobalWindows())
      .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10)))
      .withAllowedLateness(Duration.ZERO)
      .discardingFiredPanes();
Even with this one, and even after the window has received far more than 10 elements, the GroupByKey inside the Combine step does not output any rows.
Expected: even with a long-running PTransform, early triggers should still fire.
Actual: I can't get early triggers to work.
Any advice?

Spark UI active jobs getting stuck when using scala parallel collection

I have a DataFrame of 1000 columns, and I am trying to get some statistics by doing some operations on each column. Since I need to sort each column, I basically can't do multi-column operations. I do all these column operations in a function called processColumn:
def processColumn(df: DataFrame): Double = {
  // sort the column
  // get some statistics
}
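For concreteness, a hypothetical body for such a function (the actual statistics are not shown in the question; this sketch just sorts the single column and returns its maximum, assuming it holds doubles):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical example only: sort the single-column DataFrame and return the
// largest value as the "statistic". The real logic is not shown above.
def processColumnExample(df: DataFrame): Double = {
  val columnName = df.columns.head
  val sorted = df.orderBy(col(columnName).desc)
  sorted.head().getDouble(0)
}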
To get this done, I persist the DataFrame in memory and process the columns with Scala multi-threading. The code looks something like this, where the initial DataFrame is df:
df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()

  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )

  parallelCol.foreach { columnName =>
    val result = processColumn(newDf.select(columnName))
    // I am storing the result in a synchronized list here
  }

  newDf.unpersist()
}
As you can see, I specify 4 threads to run at a time. But sometimes one of the threads gets stuck, I end up with more than 4 active jobs, and the stuck ones never finish.
My feeling is that the threads started by the Scala parallel collection have some kind of timeout and sometimes don't wait for all the Spark jobs to finish, after which unpersist() gets called and the active job is stuck forever. I've been digging through the source code to see whether Scala collection operations have a timeout, but haven't been able to confirm it.
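One way to sanity-check that hypothesis is to make the ordering explicit: submit the column jobs as Futures and block on all of them before calling unpersist(). A rough sketch, assuming the same df and processColumn as above:

import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Sketch only: an explicit 4-thread pool, and an explicit wait on every
// column job before the persisted DataFrame is released.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()

  val futures = columnGroups.map { columnName =>
    Future(processColumn(newDf.select(columnName)))
  }
  // Block until every column in this group has been processed.
  val results = futures.map(Await.result(_, Duration.Inf))
  // store results as needed

  newDf.unpersist()
}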
Any help will be highly appreciated. Also, please let me know if you have any questions. Thank you.

How to keep temporary output files in Spark

I'm writing a map-only Spark SQL job which looks like this:
val lines = sc.textFile(inputPath)
val df = lines.map { line => ... }.toDF("col0", "col1")
df.write.parquet(output)
As the job takes quite a long time to compute, I would like to save and keep the results of the tasks which successfully terminated, even if the overall job fails or gets killed.
I noticed that, during the computation, in the output directory some temporary files are created.
I inspected them and noticed that, since my job has only a mapper, what is saved there is the output of the successful tasks.
The problem is that the job failed, and I couldn't analyse what it had managed to compute because the temporary files were deleted.
Does anyone have some idea how to deal with this situation?
Cheers!
Change the output committer to DirectParquetOutputCommitter.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
Note that if you've turned on speculative execution, then you have to turn it off to use a direct output committer.
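A rough sketch of the two settings together, assuming a Spark 1.x setup where DirectParquetOutputCommitter is still available (spark.speculation is the standard switch for speculative execution; the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Speculation must be off for a direct committer, otherwise a killed
// speculative attempt could leave partial files in the final output directory.
val conf = new SparkConf()
  .setAppName("map-only-job")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)

sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")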

Get list of executions filtered by parameter value

I am using Spring Batch 3.0.4 (stable). When submitting a job, I add some specific parameters to its execution, say, a tag. Job information is persisted in the DB.
Later on I will need to retrieve all the executions marked with a particular tag.
Currently I see 2 options:
Get all job instances with org.springframework.batch.core.explore.JobExplorer#findJobInstancesByJobName. For each instance, get all available executions with org.springframework.batch.core.explore.JobExplorer#getJobExecutions. Filter the resulting collection of executions by checking their JobParameters.
Write my own JdbcTemplate-based DAO implementation to run the select query.
While the former option seems pretty inefficient, the latter means writing extra code against the Spring-specific database table structure.
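For concreteness, option 1 would look roughly like this, sketched in Scala against the JobExplorer API (jobExplorer, jobName and the "tag" parameter key are placeholders):

import scala.collection.JavaConverters._

import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.explore.JobExplorer

// Rough sketch of option 1: load every execution for the job and keep only
// those whose "tag" parameter matches.
def executionsWithTag(jobExplorer: JobExplorer, jobName: String, tag: String): Seq[JobExecution] =
  jobExplorer.findJobInstancesByJobName(jobName, 0, Int.MaxValue).asScala
    .flatMap(instance => jobExplorer.getJobExecutions(instance).asScala)
    .filter(execution => tag == execution.getJobParameters.getString("tag"))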
Is there any option I am missing here?