Spark (v2) does not generate output if the size is more than 2 GB - scala

My Spark application writes outputs ranging from several KB to several GB. When the output appears to exceed 2 GB, nothing seems to happen and I see hardly any CPU usage. When the output is smaller than 2 GB, such as 1.3 GB, the same application works flawlessly. Note that writing the output is the last stage, and all computations on the data to be written complete correctly (as can be seen from the debug output), so the driver storing the data is not the issue. Executor memory is not the issue either: I increased it to as much as 90 GB, while 30 GB also seems adequate. The following is the code I use to write the output. Please suggest a way to fix it.
var output = scala.collection.mutable.ListBuffer[String]()
...
output.toDF().coalesce(1).toDF().write.mode("overwrite")
.option("parserLib","univocity").option("ignoreLeadingWhiteSpace","false")
.option("ignoreTrailingWhiteSpace","false").format("csv").save(outputPath)
Other related parameters passed by spark-submit are as follows:
--driver-memory 150g \
--executor-cores 4 \
--executor-memory 30g \
--conf spark.cores.max=252 \
--conf spark.local.dir=/tmp \
--conf spark.rpc.message.maxSize=2047 \
--conf spark.driver.maxResultSize=50g \
The issue was observed on two different systems: one standalone machine and one Spark cluster.

Based on Gabio's idea of repartitioning, I solved the problem by dropping coalesce(1), so that the output is written from multiple partitions:
val tDF = output.toDF()
println("|#tDF partitions = " + tDF.rdd.partitions.size.toString)
tDF.write.mode("overwrite")
.option("parserLib","univocity").option("ignoreLeadingWhiteSpace","false")
.option("ignoreTrailingWhiteSpace","false").format("csv").save(outputPath)
The output ranged between 2.3 GB and 14 GB, so the source of the problem is elsewhere and perhaps not in spark.driver.maxResultSize.
A big thank you to #Gabio!
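If an explicit partition count is wanted instead of whatever toDF() produces, it can be estimated from the expected output size. The following is only a minimal sketch: the helper name partitionsFor and the 256 MB target per part are assumptions chosen for illustration, not something taken from the fix above.

```scala
object PartitionEstimate {
  // Hypothetical helper: number of partitions needed so that each part
  // stays at or below targetBytesPerPartition (ceiling division, minimum 1).
  def partitionsFor(totalBytes: Long, targetBytesPerPartition: Long): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetBytesPerPartition > 0, "targetBytesPerPartition must be positive")
    val n = (totalBytes + targetBytesPerPartition - 1) / targetBytesPerPartition
    math.max(1L, n).toInt
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val gb = 1024L * mb
    // Assumed figures: a 14 GB output and an arbitrary 256 MB target per part.
    println(partitionsFor(14 * gb, 256 * mb))
  }
}
```

The result could then be passed to tDF.repartition(n) before the write, so that no single task has to produce the whole output on its own.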

Related

spark turn off dynamic allocation

I want to make sure my Spark job doesn't take more memory than I specify; let's say 400 GB is the maximum the job may use. From my understanding, turning off dynamic allocation (spark.dynamicAllocation.enabled = false) and passing --num-executors, --executor-memory and --driver-memory should do the job on the Cloudera stack. Correct me if I'm wrong.
Is there any other setting I have to set to make sure the Spark job doesn't exceed the limit?
I found a solution at work: the Cloudera cluster has a special YARN parameter that doesn't let a job go over a certain limit; it has to be turned off or reset.
https://community.cloudera.com/t5/Support-Questions/Yarn-memory-allocation-utilization/td-p/216290
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/determine-hdp-memory-config.html

Container killed by YARN for exceeding memory limits.14.8 GB of 6 GB physical memory used

I have a Spark job where I do the following:
1. Load the data from Parquet via Spark SQL and convert it to a pandas df. The data size is only 250 MB.
2. Run an rdd.foreach to iterate over a relatively small dataset (1000 rows), take the pandas df from step 1, and do some transformation.
I get a "Container killed by YARN for exceeding memory limits" error after some iterations:
Container killed by YARN for exceeding memory limits. 14.8 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I am unable to understand why the error says "14.8 GB of 6 GB physical memory used".
I have tried increasing spark.yarn.executor.memoryOverhead.
I used the following command:
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 --executor-memory 2G --conf spark.yarn.executor.memoryOverhead=4096 --py-files test.zip app_main.py
I am using spark 2.3
yarn.scheduler.minimum-allocation-mb = 512 MB
yarn.nodemanager.resource.memory-mb = 126 GB
This is a common error when the memoryOverhead option is used; it is better to use other options to tune jobs.
The post at http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html discusses this issue and how to deal with it.
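As for where the 6 GB in the error message comes from: on YARN, the container limit for an executor is the executor memory plus the memory overhead. The sketch below redoes that arithmetic for the command in the question. The object and function names are hypothetical, and the default-overhead rule (the larger of 384 MB and 10% of executor memory) is stated here as an assumption about Spark 2.x defaults.

```scala
object ContainerLimit {
  // Assumed Spark 2.x default when no overhead is configured:
  // the larger of 384 MB and 10% of the executor memory.
  def defaultOverheadMb(executorMemoryMb: Int): Int =
    math.max(384, (executorMemoryMb * 0.10).toInt)

  // Container limit = executor memory + memory overhead (explicit or default).
  def containerLimitMb(executorMemoryMb: Int, overheadMb: Option[Int] = None): Int =
    executorMemoryMb + overheadMb.getOrElse(defaultOverheadMb(executorMemoryMb))

  def main(args: Array[String]): Unit = {
    // Values from the question: --executor-memory 2G, memoryOverhead=4096
    val limit = containerLimitMb(2048, Some(4096))
    println(s"container limit: $limit MB")
  }
}
```

With 2048 MB of executor memory and 4096 MB of overhead this gives 6144 MB, i.e. the 6 GB limit in the message. The 14.8 GB is what the processes in the container actually used; memory held outside the JVM heap, such as pandas data in the Python workers, counts against the same limit.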

Executor is taking more memory than defined

spark-submit --num-executors 10 --executor-memory 5g --master yarn --executor-cores 3 --class com.octro.hbase.hbase_final /home/hadoop/testDir/nikunj/Hbase_data_maker/target/Hbase_data_maker-0.0.1-SNAPSHOT-jar-with-dependencies.jar main_user_profile
This is the command I use to run my Spark code on the cluster.
With this command, my YARN page shows the total memory allocated as 71 GB.
I searched the internet for the reason but didn't find a clear explanation.
Later I figured out that it uses the formula
Number of executors * (executor memory + 2) + 1
which here gives 10 * (5 + 2) + 1 = 71 GB. The plus 1 is for the main container. But why the extra 2 GB per executor by default?
It was because a 2 GB memory overhead was specified in Spark's configuration file.
That's why each executor was taking 2 GB more.
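The arithmetic behind the 71 GB figure can be checked with a few lines of Scala. This is a minimal sketch with hypothetical names. It takes the 2 GB overhead and the 1 GB main container from the explanation above, and adds the overhead to each executor's memory.

```scala
object YarnAllocation {
  // Total allocation = executors * (executor memory + overhead) + main container.
  def totalGb(numExecutors: Int, executorMemoryGb: Int,
              overheadGb: Int, mainContainerGb: Int): Int =
    numExecutors * (executorMemoryGb + overheadGb) + mainContainerGb

  def main(args: Array[String]): Unit = {
    // --num-executors 10 --executor-memory 5g, 2 GB overhead, 1 GB main container
    println(totalGb(10, 5, 2, 1)) // prints 71
  }
}
```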

Is it possible to read NetCDF file above 1 GB using SRdd?

We are using SciSpark to read NetCDF files using the concept of SRdd. We get an error when we try to read a file larger than 1 GB.
val data = sc.OpenPath("/home/Project/TestData",List("rhum"))
Is there any problem with this code?
We get the error: java.lang.OutOfMemoryError: Java heap space
If I understand it right, SciSpark is a Spark library and you run your code with spark-shell or spark-submit. If so, you just need to specify proper memory options, like this:
spark-shell --driver-memory 2g --executor-memory 8g

Spark repartition() function increases number of tasks per executor, how to increase number of executors

I'm working on an IBM server with 30 GB of RAM (12 cores). I have given all the cores to Spark, but it still uses only 1 core. When loading the file, I succeeded with the command
val name_db_rdd = sc.textFile("input_file.csv",12)
and was able to use all 12 cores for the initial jobs, but I want to split the intermediate operations across executors as well, so that they can use all 12 cores.
Image - description
val new_rdd = rdd.repartition(12)
As you can see in this image, only 1 executor is running, and the repartition function splits the data into many tasks on that one executor.
It depends on how you're launching the job, but you probably want to add --num-executors to your command line when you're launching your Spark job.
Something like
spark-submit \
--num-executors 10 \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
might work well for you.
Have a look at the Running Spark on YARN documentation for more details, though some of the switches it mentions are YARN-specific.