Why are there mostly 200 tasks per stage in Structured Streaming even after I change the default configuration by setting `spark.sql.shuffle.partitions`? - scheduled-tasks

I have a Spark Structured Streaming application running in YARN mode.
I am trying to reduce the number of tasks, and I notice that most stages have 200 tasks. I have set --conf "spark.sql.shuffle.partitions=40" --conf "spark.default.parallelism=40", but this does not take effect.
The code looks like:
df.withWatermark("ts", "5 minutes")
  .groupBy(window($"ts", "5 minutes"), $"user", ...)
  .agg(count($"A"), sum($"B"))
  .select("window.start", "window.end", ...)
  .writeStream
  .outputMode("update")
  .foreach(writer())
  .option("checkpointLocation", checkpointDir)
  .trigger()
  .start()

When you use option("checkpointLocation", checkpointDir) you need to delete the checkpointDir for the new value of spark.sql.shuffle.partitions to take effect. Spark reuses the existing checkpoint information when it restarts the stream, and that implies keeping the 200 (or whatever previous value) partitions. As of at least Spark 3.1.1, the stream cannot be "reshaped" in place...
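For illustration, a minimal sketch of restarting the aggregation with the new setting and a fresh checkpoint directory (df and writer() are taken from the question; the SparkSession builder, the new checkpoint path, and the simplified query body are assumptions, not code from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Set the shuffle partition count before the query starts; for a stateful
// streaming query this only takes effect with a fresh checkpoint directory.
val spark = SparkSession.builder()
  .appName("streaming-app")
  .config("spark.sql.shuffle.partitions", "40")
  .getOrCreate()
import spark.implicits._

val query = df.withWatermark("ts", "5 minutes")
  .groupBy(window($"ts", "5 minutes"), $"user")
  .agg(count($"A"), sum($"B"))
  .writeStream
  .outputMode("update")
  .foreach(writer())
  .option("checkpointLocation", "/checkpoints/agg_v2") // hypothetical new path, not the old checkpointDir
  .start()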

A task is a unit of work on a single partition, so in an ideal situation the number of tasks for a given stage is directly proportional to the number of partitions. So, please check the number of partitions in your DataFrame. Hope this helps.
syntax: df.rdd.getNumPartitions
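For example (df here stands for a batch DataFrame; a streaming DataFrame cannot be converted to an RDD before the query is started, so for a running stream check the task counts in the Spark UI instead):

// Inspect how many partitions (and therefore tasks per stage) the DataFrame has
println(df.rdd.getNumPartitions)

// Explicitly change the partition count if needed; 40 is just an example target
val reduced = df.repartition(40)
println(reduced.rdd.getNumPartitions)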

Related

In Spark Structured Streaming, is there a way to sleep the read operation during a maintenance window for the database?

I am developing a Spark Structured Streaming job that reads from a Kafka topic and writes to a JDBC database.
The database is supposed to have a maintenance window, and I am trying to figure out a way of handling that case without aborting the job.
My code:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

// read data from kafka and transform into the required DF
val transformDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", config.kafkaBootstrapServers)
  .option("startingOffsets", "latest")
  .option("subscribePattern", config.topics)
  .load()
  .transform(toRaw)

// write
val query = transformDF
  .writeStream
  .option("checkpointLocation", config.checkpointLocation)
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .format("jdbc")
      .option("url", config.url)
      .option("user", config.username)
      .option("password", config.password)
      .option(JDBCOptions.JDBC_TABLE_NAME, tableName.get)
      .option("stringtype", "unspecified")
      .mode(SaveMode.Append)
      .save()
  }
  .outputMode(OutputMode.Append())
  .start()

try {
  query.awaitTermination()
} catch {
  case e: Exception => logger.error("Error", e)
}
Right now, if the DB is not available, the code goes to the exception block and the job is aborted. I want to avoid that; instead, I want further reading of messages to be halted until the DB is back. I am trying to avoid the manual process of resubmitting the job.
Is this possible?
Spark: 2.4.5
No, there is no reliable way to do this. BTW, "no" is also an answer.
Exception-checking logic is generally done via try / catch running on the driver, as you have coded it. That is the accepted paradigm, and logical as well, one could argue. But that approach only covers driver-generated errors.
Unexpected situations at the executor level are already handled by the Spark framework itself for Structured Streaming: if the error is non-recoverable, the app / job simply crashes after signalling the error(s) back to the driver, unless you code try / catch within the various foreachXXX constructs. That said, it is not clear the micro-batch is recoverable with such an approach as far as I can see; some part of the micro-batch is quite likely lost. Hard to test, though.
Given that Spark handles these things in ways you cannot hook into, there is no supported place to insert a wait loop / try / catch into the source side of the stream. Broadcast variables are a similar issue, although some claim to have techniques around this. It is not in the spirit of the framework.
So, good question, as I have wondered about this in the past.
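As a best-effort workaround (not a reliable guarantee, and not something the above endorses as fully safe), you could retry inside foreachBatch so that a short DB outage blocks the current micro-batch instead of crashing the query; while it retries, no new micro-batches are started, so Kafka consumption is effectively paused. The retry limit and sleep interval below are illustrative assumptions; transformDF and config are from the question:

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.OutputMode

val query = transformDF
  .writeStream
  .option("checkpointLocation", config.checkpointLocation)
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val maxRetries = 30          // hypothetical limit
    var attempts = 0
    var done = false
    while (!done) {
      try {
        batchDF.write
          .format("jdbc")
          .option("url", config.url)
          .option("user", config.username)
          .option("password", config.password)
          .mode(SaveMode.Append)
          .save()
        done = true
      } catch {
        case e: Exception if attempts < maxRetries =>
          attempts += 1
          Thread.sleep(60 * 1000L)  // back off for a minute before retrying
      }
    }
  }
  .outputMode(OutputMode.Append())
  .start()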

Spark sliding window performance

I've set up a pipeline for incoming events from a stream in Apache Kafka.
Spark connects to Kafka, gets the stream from a topic, and processes some "simple" aggregation tasks.
As I'm trying to build a service with a low-latency refresh (below 1 second), I've built a simple Spark Streaming app in Scala.
val windowing = events.window(Seconds(30), Seconds(1))

val spark = SparkSession
  .builder()
  .appName("Main Processor")
  .getOrCreate()
import spark.implicits._

// Go into the RDDs of the DStream
windowing.foreachRDD(rdd => {
  // Convert RDD of JSON strings into a DataFrame
  val df = spark.read.json(rdd)
  // Process only if the received DataFrame is not empty
  if (!df.head(1).isEmpty) {
    // Create a view for Spark SQL
    val rdf = df.select("user_id", "page_url")
    rdf.createOrReplaceTempView("currentView")
    val countDF = spark.sql("select count(distinct user_id) as sessions from currentView")
    countDF.show()
  }
})
It works as expected. My concerns at this point are about performance. Spark is running on a 4-CPU Ubuntu server for testing purposes.
The CPU usage is about 35% all the time. I'm wondering, if the incoming data from the stream reaches, say, 500 msg/s, how the CPU usage will evolve. Will it grow exponentially or linearly?
If you can share your experience with Apache Spark in that kind of situation, I'd appreciate it.
The last open question is: if I set the sliding window interval to 500 ms (as I'd like), will this blow up? I mean, Spark Streaming's features are fairly fresh, and the micro-batch processing architecture may be a limitation for truly real-time data processing, isn't it?

SparkException: Could not execute broadcast in time

I am using Spark Structured Streaming to write some transformed dataframes using this function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

def parquetStreamWriter(dataPath: String, checkpointPath: String)(df: DataFrame): Unit = {
  df.writeStream
    .trigger(Trigger.Once)
    .format("parquet")
    .option("checkpointLocation", checkpointPath)
    .start(dataPath)
}
When I call this function a small number of times (1 or 2 dataframes written), it works fine, but when I call it many times (like writing 15 to 20 dataframes in a loop), I get the following exception and some of the jobs fail in Databricks:
Caused by: org.apache.spark.SparkException: Could not execute broadcast in
time. You can disable broadcast join by setting
spark.sql.autoBroadcastJoinThreshold to -1.
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:191)
My transformation has one broadcast join, but I tried removing the broadcast from the join in the code and got the same error.
I tried setting the Spark conf spark.sql.autoBroadcastJoinThreshold to -1, as mentioned in the error, but got the same exception again.
Can you suggest where I am going wrong?
It's difficult to judge without seeing the execution plan (in particular, the broadcasted volume is not clear), but increasing spark.sql.broadcastTimeout could help (see the Spark SQL configuration documentation for the full description).
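For example, a minimal sketch of raising the timeout before the writes run (the 1200-second value is just an illustrative choice):

// Raise the broadcast timeout (default 300 seconds) for the current session
spark.conf.set("spark.sql.broadcastTimeout", "1200")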
This can also be addressed by setting spark.sql.autoBroadcastJoinThreshold to a higher value.
If you have no idea how the execution behaves for that particular dataframe, you can set spark.sql.autoBroadcastJoinThreshold to -1 (i.e. spark.sql.autoBroadcastJoinThreshold = -1); this disables automatic broadcast joins entirely, so the query no longer has to complete a broadcast within the timeout.
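A short sketch of both settings mentioned above (the 100 MB threshold is an arbitrary example):

// Raise the size limit under which Spark will automatically broadcast a relation
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString) // ~100 MB, example value

// Or disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")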

Spark - Running a batch job with a 15-minute interval

I am using Scala.
I tried Spark Streaming, but if my streaming job happens to crash for more than 15 minutes, this results in data loss.
So I just want to know: how do I manually keep checkpoints in a batch job?
The directories of input data look like the following:
Data --> 20170818 --> (timestamp) --> (many .json files)
The data are uploaded every 5 minutes.
Thanks!
You may use the readStream feature of Structured Streaming to monitor a directory and pick up new files. Spark automatically handles checkpointing and progress tracking for you.
val ds = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)
  .load(logDirectory)
Here is a link to additional material on the topic: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
I personally used format("text"), but you should be able to change it to format("json"); here are more details on the json format: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
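For illustration, a minimal sketch of a JSON file-source query with checkpointing (the schema, paths, and 15-minute trigger are assumptions based on the question, not part of the original answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("file-stream-sketch").getOrCreate()

// A file source needs an explicit schema; this one is hypothetical
val schema = new StructType().add("id", "string").add("ts", "string")

val ds = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 10)   // limit how many new files each micro-batch picks up
  .json("/Data")                      // hypothetical input root

val query = ds.writeStream
  .format("parquet")
  .option("path", "/output")                            // hypothetical output path
  .option("checkpointLocation", "/checkpoints/files")   // progress is tracked here across restarts
  .trigger(Trigger.ProcessingTime("15 minutes"))        // matches the 15-minute cadence from the question
  .start()

query.awaitTermination()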

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming. Structured Streaming treats all arriving data as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

val csvSchema = new StructType().add("street", "string").add("city", "string")
  .add("zip", "string").add("state", "string").add("beds", "string")
  .add("baths", "string").add("sq__ft", "string").add("type", "string")
  .add("sale_date", "string").add("price", "string").add("latitude", "string")
  .add("longitude", "string")

val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")

val query = streamingDF.writeStream
  .format("console")
  .start()
What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example in the Structured Streaming programming guide.
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In the Structured Streaming model, this is how the execution of such a query is performed.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM simply from encountering the file, since the underlying RDD (DF/DS) is lazily evaluated. Of course, you need to repartition before processing to ensure an appropriate number of partitions and to spread the data uniformly across executors...
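Regarding the configurable batching: the file source lets you cap how many newly arrived files each micro-batch picks up (a single large file is still split into multiple partitions/tasks by the usual input-split mechanism). A minimal sketch reusing the question's names; the value 5 is arbitrary:

// Limit each micro-batch to at most 5 newly discovered files
val streamingDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 5)
  .csv("./csvFolder/")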