Pyspark - WARN BisectingKMeans: The input RDD is not directly cached - pyspark

I'm running bisecting kmeans as
bkm_test=BisectingKMeans().setK(5).setSeed(1)
rdf.cache()
assembled.cache()
model_test=bkm_test.fit(assembled)
I cached the two dataframes as I keep getting the error, but it doesn't make a difference, I found this question which is similar but with kmeans.
But I also get a WARN Executor error below. Is this only something inside the algorithm that I can't fix?
17/08/14 21:53:17 WARN BisectingKMeans: The input RDD 306 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
17/08/14 21:53:17 WARN Executor: 1 block locks were not released by TID = 132:
[rdd_302_0]

This comes from BisectingKMeans within MLlib, which Spark ML uses internally. MLlib uses RDDs of vectors, while Spark ML is DataFrame oriented, so the ML version of BisectingKMeans converts your DataFrame into an RDD of Vector values. The conversion isn't cached, so you end up getting the error.
Hopefully, this isn't a major slowdown. I haven't found an easy way to force the caching of the converted RDD.

Related

In Spark, how objects and variables are kept in memory and across different executors?

In Spark, how objects and variables are kept in memory and across different executors?
I am using:
Spark 3.0.0
Scala 2.12
I am working on writing a Spark Structured Streaming job with a custom Stream Source. Before the execution of the spark query, I create a bunch of metadata which is used by my Spark Streaming Job
I am trying to understand how this metadata is kept in memory across different executors?
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)
val jobConfigs:List[JobConfig] = build(); //build the job configs
val query = spark
.readStream
.format("custom-streaming")
.load
query
.writeStream
.trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
.foreachBatch { (batchDF: DataFrame, batchId: Long) => {
CustomJobExecutor.start(jobConfigs) //CustomJobExecutor does data frame transformations and save the data in PostgreSQL.
}
}.outputMode(OutputMode.Append()).start().awaitTermination()
Need help in understanding following:
In the sample code, how Spark will keep "jobConfigs" in memory across different executors?
Is there any added advantage of broadcasting?
What is the efficient way of keeping the variables which can't be deserialized?
Local variables are copied for each task meanwhile broadcasted variables are copied only per executor. From docs
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
It means that if your jobConfigs is large enough and the number of tasks and stages where the variable is used significantly larger than the number of executors, or deserialization is time-consuming, in that case, broadcast variables can make a difference. In other cases, they don't.

checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an rdd as detailed in 'learning spark' book

In learning Spark, I read the following:
In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this case and just begin computing based on the persisted RDD. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persist()ed. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
So, I decided to try to see this in action with a simple program (below):
val pairs = spark.sparkContext.parallelize(List((1,2)))
val x = pairs.groupByKey()
x.toDebugString // before collect
x.collect()
x.toDebugString // after collect
spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
x.collect()
x.toDebugString // after checkpoint
I did not see what I expected after reading the above paragraph from the Spark book. I saw the exact same output of toDebugString each time I invoked this method -- each time indicating two stages (where I would have expected only one stage after the checkpoint was supposed to have truncated the lineage.) like this:
scala> x.toDebugString // after collect
res5: String =
(8) ShuffledRDD[1] at groupByKey at <console>:25 []
+-(8) ParallelCollectionRDD[0] at parallelize at <console>:23 []
I am wondering if the key thing that I overlooked might be the word "may", as in the "schedule MAY truncate the lineage". Is this truncation something that might happen given the same program that I wrote above, under other circumstances ? Or is the little program that I wrote not doing the right thing to force the lineage truncation ? Thanks in advance for any insight you can provide !
I think that you should do persist/checkpoint before you do first collect.
From that code for me it looks correct what you get since when spark does first collect it does not know that it should persist or save anything.
Also probably you need to save result of x.persist and then use it...
I propose - try it:
val pairs = spark.sparkContext.parallelize(List((1,2)))
val x = pairs.groupByKey()
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
// **Also maybe do val xx = x.persist(...) and use xx later.**
x.toDebugString // before collect
x.collect()
x.toDebugString // after collect
spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.collect()
x.toDebugString // after checkpoint

SparkException: Could not execute broadcast in time

I am using spark structured streaming to write some transformed dataframes using function:
def parquetStreamWriter(dataPath: String, checkpointPath: String)(df: DataFrame): Unit = {
df.writeStream
.trigger(Trigger.Once)
.format("parquet")
.option("checkpointLocation", checkpointPath)
.start(dataPath)
}
When I am calling this function less number of time in code (1 or 2 dataframes written) it works fine but when I am calling it for more number of times (like writing 15 to 20 dataframes in a loop, I am getting following exception and some of the jobs are failing in databricks:-
Caused by: org.apache.spark.SparkException: Could not execute broadcast in
time. You can disable broadcast join by setting
spark.sql.autoBroadcastJoinThreshold to -1.
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:191)
My transformation has one broadcast join but i tried removing broadcast in join in code but got same error.
I tried setting spark conf spark.sql.autoBroadcastJoinThreshold to -1. as mentioned in error but got same exception again.
Can you suggest where am i going wrong ?
It's difficult to judge w/o seeing the execution plan (esp. not sure about broadcasted volume), but increasing the spark.sql.broadcastTimeout could help (please find full configuration description here).
this can be solved by setting spark.sql.autoBroadcastJoinThreshold to a higher value
if one has no idea about the execution time for that particular dataframe u can directly set spark.sql.autoBroadcastJoinThreshold to -1 i.e. (spark.sql.autoBroadcastJoinThreshold -1) this will disable the time limit bound over the execution of the dataframe

Spark + Scala transformations, immutability & memory consumption overheads

I have gone through some videos in Youtube regarding Spark architecture.
Even though Lazy evaluation, Resilience of data creation in case of failures, good functional programming concepts are reasons for success of Resilenace Distributed Datasets, one worrying factor is memory overhead due to multiple transformations resulting into memory overheads due data immutability.
If I understand the concept correctly, Every transformations is creating new data sets and hence the memory requirements will gone by those many times. If I use 10 transformations in my code, 10 sets of data sets will be created and my memory consumption will increase by 10 folds.
e.g.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Above example has three transformations : flatMap, map and reduceByKey. Does it implies I need 3X memory of data for X size of data?
Is my understanding correct? Is caching RDD is only solution to address this issue?
Once I start caching, it may spill over to disk due to large size and performance would be impacted due to disk IO operations. In that case, performance of Hadoop and Spark are comparable?
EDIT:
From the answer and comments, I have understood lazy initialization and pipeline process. My assumption of 3 X memory where X is initial RDD size is not accurate.
But is it possible to cache 1 X RDD in memory and update it over the pipleline? How does cache () works?
First off, the lazy execution means that functional composition can occur:
scala> val rdd = sc.makeRDD(List("This is a test", "This is another test",
"And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
| map(word => {println(word);(word,1)}).
| reduceByKey((x,y) => {println(s"$x+$y");x+y}).
| collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
First note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You see that it processes the line, then it processes the output of that line, followed by the reduction. So, there are not separate states stored for each transformation as you suggested. Instead, each piece of data is looped through the entire transformation up until a shuffle is needed, as can be seen by the DAG visualization from the UI:
That is the win from the laziness. As to Spark v Hadoop, there is already a lot out there (just google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then, there a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
So, overall, Spark beats MR hands down in just about every regard.
The memory requirements of Spark not 10 times if you have 10 transformations in your Spark job. When you specify the steps of transformations in a job Spark builds a DAG which will allow it to execute all the steps in the jobs. After that it breaks the job down into stages. A stage is a sequence of transformations which Spark can execute on dataset without shuffling.
When an action is triggered on the RDD, Spark evaluates the DAG. It just applies all the transformations in a stage together until it hits the end of the stage, so it is unlikely for the memory pressure to be 10 time unless each transformation leads to a shuffle (in which case it is probably a badly written job).
I would recommend watching this talk and going through the slides.

Spark broadcast error: exceeds spark.akka.frameSize Consider using broadcast

I have a large data called "edges"
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52
When I was working in standalone mode, I was able to collect, count and save this file. Now, on a cluster, I'm getting this error
edges.count
...
Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.
Same with .saveAsTextFile("edges")
This is from the spark-shell. I have tried using the option
--driver-java-options "-Dspark.akka.frameSize=15"
But when I do that, it just hangs indefinitely. Any help would be appreciated.
** EDIT **
My standalone mode was on Spark 1.1.0 and my cluster is Spark 1.0.1.
Also, the hanging occurs when I go to count, collect or saveAs* the RDD, but defining it or doing filters on it work just fine.
The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like
val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()
which causes someBigObject to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which will cause only a reference to the object to be stored in the task itself, while the actual object data will be sent separately.
In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need to use it to avoid frame size errors.
Another option is to increase the Akka frame size. In any Spark version, you should be able to set the spark.akka.frameSize setting in SparkConf prior to creating your SparkContext. As you may have noticed, though, this is a little harder in spark-shell, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16 when launching spark-shell. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16" instead.