SparkException: Could not execute broadcast in time - scala

I am using spark structured streaming to write some transformed dataframes using function:
def parquetStreamWriter(dataPath: String, checkpointPath: String)(df: DataFrame): Unit = {
df.writeStream
.trigger(Trigger.Once)
.format("parquet")
.option("checkpointLocation", checkpointPath)
.start(dataPath)
}
When I am calling this function less number of time in code (1 or 2 dataframes written) it works fine but when I am calling it for more number of times (like writing 15 to 20 dataframes in a loop, I am getting following exception and some of the jobs are failing in databricks:-
Caused by: org.apache.spark.SparkException: Could not execute broadcast in
time. You can disable broadcast join by setting
spark.sql.autoBroadcastJoinThreshold to -1.
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:191)
My transformation has one broadcast join but i tried removing broadcast in join in code but got same error.
I tried setting spark conf spark.sql.autoBroadcastJoinThreshold to -1. as mentioned in error but got same exception again.
Can you suggest where am i going wrong ?

It's difficult to judge w/o seeing the execution plan (esp. not sure about broadcasted volume), but increasing the spark.sql.broadcastTimeout could help (please find full configuration description here).

this can be solved by setting spark.sql.autoBroadcastJoinThreshold to a higher value
if one has no idea about the execution time for that particular dataframe u can directly set spark.sql.autoBroadcastJoinThreshold to -1 i.e. (spark.sql.autoBroadcastJoinThreshold -1) this will disable the time limit bound over the execution of the dataframe

Related

insert sequence into dataframe returning NullPointerException [duplicate]

sessionIdList is of type :
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run below code :
val x = sc.parallelize(List(1,2,3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
I receive exception :
14/05/21 16:20:46 ERROR Executor: Exception in task ID 80
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:261)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:38)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
However if I use :
val l = sc.parallelize(List("1","2"))
val kDistanceNeighbourhood = l.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
Then no exception is displayed
The difference between the two code snippets is that in first snippet sessionIdList is of type :
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
and in second snippet "l" is of type
scala> l
res13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:12
Why is this error occuring ?
Do I need to convert sessionIdList to ParallelCollectionRDD in order to fix this ?
Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations.
In the first case, you're seeing a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on the driver and not the workers.
In the second case, my hunch is the job was run locally on the driver and worked purely by accident.
Its a reasonable question and I have heard it asked it enough times that. I'm going to try to take a stab at explaining why this is true, because it might help.
Nested RDDs will always throw an exception in production. Nested function calls as I think you are describing them here, if it means calling an RDD operation inside an RDD operation, will cause also cause failures since it is actually the same thing. (RDDs are immutable, so performing an RDD operation such as a "map" is equivalent to creating a new RDD.) The in ability to create nested RDDs is a necessary consequence of the way an RDD is defined and the way the Spark Application is set up.
An RDD is a distributed collection of objects (called partitions) that live on the Spark Executors. Spark executors cannot communicate with each other, only with the Spark driver. The RDD operations are all computed in pieces on these partitions.Because the RDD's executor environment isn't recursive (i.e. you can configure a Spark driver to be on a spark executor with sub executors) neither can an RDD.
In your program, you have created a distributed collection of partitions of integers. You are then performing a mapping operation. When the Spark driver sees a mapping operation, it sends the instructions to do the mapping to the executors, who perform the transformation on each partition in parallel. But your mapping cannot be done, because on each partition you are trying to call the "whole RDD" to perform another distributed operation. This can't not be done, because each partition does not have access to the information on the other partitions, if it did, the computation couldn't run in parallel.
What you can do instead, because the data you need in the map is probably small (since you are doing a filter, and the filter does not require any information about sessionIdList) is to first filter the session ID list. Then collect that list to the driver. Then broadcast it to the executors, where you can use it in the map. If the sessionID list is too large, you will probably need to do a join.

How to count number of rows per window in streaming queries?

Scenario: Working on Spark Streaming in Structured SQL. I have to implement a "info" dataset about how many rows I've processed in the last "window".
A little bit of code.
val invalidData: Dataset[String] =
parsedData.filter(record => !record.isValid).map(record => record.rawInput)
val validData: Dataset[FlatOutput] = parsedData
.filter(record => record.isValid)
I have two Dataset. But since I'm working on Streaming I cannot perform a .count (Error raised: Queries with streaming sources must be executed with writeStream.start())
val infoDataset = validData
.select(count("*") as "valid")
but a new error occurs: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark and I don't want to set outputMode as complete since I don't want total count from beginning, but just last "windowed" batch.
Unfortunately I don't have any column that I could register as watermark for these datasets.
Is there a way to know how many rows are processed in each iteration?
It seems like StreamingQueryStatus could be of some help.

Pyspark - WARN BisectingKMeans: The input RDD is not directly cached

I'm running bisecting kmeans as
bkm_test=BisectingKMeans().setK(5).setSeed(1)
rdf.cache()
assembled.cache()
model_test=bkm_test.fit(assembled)
I cached the two dataframes as I keep getting the error, but it doesn't make a difference, I found this question which is similar but with kmeans.
But I also get a WARN Executor error below. Is this only something inside the algorithm that I can't fix?
17/08/14 21:53:17 WARN BisectingKMeans: The input RDD 306 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
17/08/14 21:53:17 WARN Executor: 1 block locks were not released by TID = 132:
[rdd_302_0]
This comes from BisectingKMeans within MLlib, which Spark ML uses internally. MLlib uses RDDs of vectors, while Spark ML is DataFrame oriented, so the ML version of BisectingKMeans converts your DataFrame into an RDD of Vector values. The conversion isn't cached, so you end up getting the error.
Hopefully, this isn't a major slowdown. I haven't found an easy way to force the caching of the converted RDD.

Exceeding `spark.driver.maxResultSize` without bringing any data to the driver

I have a Spark application that performs a large join
val joined = uniqueDates.join(df, $"start_date" <= $"date" && $"date" <= $"end_date")
and then aggregates the resulting DataFrame down to one with maybe 13k rows. In the course of the join, the job fails with the following error message:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 78021 tasks is bigger than spark.driver.maxResultSize (2.0 GB)
This was happening before without setting spark.driver.maxResultSize, and so I set spark.driver.maxResultSize=2G. Then, I made a slight change to the join condition, and the error resurfaces.
Edit: In resizing the cluster, I also doubled the number of partitions the DataFrame assumes in a .coalesce(256) to a .coalesce(512), so I can't be sure it's not because of that.
My question is, since I am not collecting anything to the driver, why should spark.driver.maxResultSize matter at all here? Is the driver's memory being used for something in the join that I'm not aware of?
Just because you don't collect anything explicitly it doesn't mean that nothing is collected. Since the problem occurs during a join, the most likely explanation is that execution plan uses broadcast join. In that case Spark will collect data first, and then broadcast it.
Depending on the configuration and pipeline:
Make sure that spark.sql.autoBroadcastJoinThreshold is smaller than spark.driver.maxResultSize.
Make sure you don't force broadcast join on a data of unknown size.
While nothing indicates it is the problem here, be careful when using Spark ML utilities. Some of these (most notably indexers) can bring significant amounts of data to the driver.
To determine if broadcasting is indeed the problem please check the execution plan, and if needed, remove broadcast hints and disable automatic broadcasts:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
In theory, exception is not always related with customer data.
Technical information about tasks execution results send to Driver Node in serialized form, and this information can take more memory then threshold.
Prove:
Error message located in org.apache.spark.scheduler.TaskSetManager#canFetchMoreResults
val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
Method called in org.apache.spark.scheduler.TaskResultGetter#enqueueSuccessfulTask
val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
case directResult: DirectTaskResult[_] =>
if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
return
}
If tasks number is huge, mentioned exception can occurs.

Spark broadcast error: exceeds spark.akka.frameSize Consider using broadcast

I have a large data called "edges"
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52
When I was working in standalone mode, I was able to collect, count and save this file. Now, on a cluster, I'm getting this error
edges.count
...
Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.
Same with .saveAsTextFile("edges")
This is from the spark-shell. I have tried using the option
--driver-java-options "-Dspark.akka.frameSize=15"
But when I do that, it just hangs indefinitely. Any help would be appreciated.
** EDIT **
My standalone mode was on Spark 1.1.0 and my cluster is Spark 1.0.1.
Also, the hanging occurs when I go to count, collect or saveAs* the RDD, but defining it or doing filters on it work just fine.
The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like
val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()
which causes someBigObject to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which will cause only a reference to the object to be stored in the task itself, while the actual object data will be sent separately.
In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need to use it to avoid frame size errors.
Another option is to increase the Akka frame size. In any Spark version, you should be able to set the spark.akka.frameSize setting in SparkConf prior to creating your SparkContext. As you may have noticed, though, this is a little harder in spark-shell, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16 when launching spark-shell. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16" instead.