Is it possible to read NetCDF file above 1 GB using SRdd? - scala

We are using SciSpark to read NetCDF files using the concept of an SRdd. We get an error once we try to read a file above 1 GB.
val data = sc.OpenPath("/home/Project/TestData",List("rhum"))
Is there any problem in this code?
The error we get is: java.lang.OutOfMemoryError: Java heap space

If I understand it right, SciSpark is a Spark library and you run your code with spark-shell or spark-submit. If so, you just need to specify proper memory options, like this:
spark-shell --driver-memory 2g --executor-memory 8g
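If you build the context programmatically in an application instead of relying on the one spark-shell provides, the executor setting has a SparkConf equivalent. A minimal sketch with a plain SparkConf/SparkContext (adapt the context construction to whatever SciSpark expects); note that driver memory still has to be passed on the command line, because the driver JVM is already running by the time this code executes:
import org.apache.spark.{SparkConf, SparkContext}
// Hedged sketch: executor memory can be set on the SparkConf before the context
// is created; spark.driver.memory cannot be raised here once the JVM has started,
// so pass --driver-memory on the command line instead.
val conf = new SparkConf()
  .setAppName("netcdf-read")                 // illustrative name
  .set("spark.executor.memory", "8g")
val sparkContext = new SparkContext(conf)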

Related

Spark (v2) does not generate output if the size is more than 2 GB

My Spark application writes outputs that range from several KB to GB. I have been facing a problem generating output in certain cases where the size exceeds 2 GB: nothing seems to happen, and I hardly see any CPU usage. However, when the output size is less than 2 GB, such as 1.3 GB, the same application works flawlessly.
Please note that writing the output is the last stage, and all the computations on the data to be written complete correctly (as can be seen from the debug output), so the driver storing the data is not an issue. The executor memory size is also not an issue, as I increased it to as much as 90 GB while 30 GB already seems adequate. The following is the code I am using to write the output. Please suggest any way to fix it.
var output = scala.collection.mutable.ListBuffer[String]()
...
output.toDF().coalesce(1).toDF().write.mode("overwrite")
.option("parserLib","univocity").option("ignoreLeadingWhiteSpace","false")
.option("ignoreTrailingWhiteSpace","false").format("csv").save(outputPath)
Other related parameters passed by spark-submit are as follows:
--driver-memory 150g \
--executor-cores 4 \
--executor-memory 30g \
--conf spark.cores.max=252 \
--conf spark.local.dir=/tmp \
--conf spark.rpc.message.maxSize=2047 \
--conf spark.driver.maxResultSize=50g \
The issue was observed on two different systems, one standalone and the other which is a spark cluster.
Based on Gabio's idea of repartitioning, I solved the problem as follows:
val tDF = output.toDF()
println("|#tDF partitions = " + tDF.rdd.partitions.size.toString)
tDF.write.mode("overwrite")
.option("parserLib","univocity").option("ignoreLeadingWhiteSpace","false")
.option("ignoreTrailingWhiteSpace","false").format("csv").save(outputPath)
The output ranged between 2.3 GB and 14 GB, so the source of the problem lies elsewhere and is perhaps not spark.driver.maxResultSize.
A big thank you to @Gabio!
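For readers hitting the same wall, a hedged sketch of the repartitioning idea credited above: coalesce(1) forces the entire result through a single task, and single partitions larger than roughly 2 GB are known to be problematic in Spark before 2.4 (this explanation is my assumption, not stated in the original answer). Keeping several partitions and merging the part files downstream avoids that; numParts below is an illustrative value, not from the original post.
// Hedged sketch: write with an explicit, moderate partition count instead of
// coalescing to a single partition; merge the part files downstream if one
// file is really needed.
val numParts = 16                              // illustrative, tune to output size
tDF.repartition(numParts)
  .write.mode("overwrite")
  .format("csv")
  .save(outputPath)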

Running Scala module with databricks-connect

I've tried to follow the instructions here to set up databricks-connect with IntelliJ. My understanding is that I can run code from the IDE and it will run on the Databricks cluster.
I added the jar directory from the miniconda environment and moved it above all of the Maven dependencies in File -> Project Structure...
However, I think I did something wrong. When I tried to run my module, I got the following error:
21/07/17 22:44:24 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:221)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:201)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:413)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1016)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1010)
at com.*.sitecomStreaming.sitecomStreaming$.main(sitecomStreaming.scala:184)
at com.*.sitecomStreaming.sitecomStreaming.main(sitecomStreaming.scala)
The system memory being 259 GB makes me think it's trying to run locally on my laptop instead of on the dbx cluster? I'm not sure if this is correct or what I can do to get this up and running properly...
Any help is appreciated!
The driver in databricks-connect always runs locally; only the executors run in the cloud. Also, the reported memory is in bytes, so 259522560 is roughly 250 MB. You can increase it using the option that the error reports.
P.S. But if you're using Structured Streaming, then yes, that is a known limitation of databricks-connect.
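Since the databricks-connect driver is a local JVM launched by the IDE, its heap is fixed by the run configuration; per the Spark docs, spark.driver.memory cannot be set from code in client mode because the JVM has already started, so the practical fix here is a VM option in the IntelliJ run configuration (e.g. -Xmx1g) or --driver-memory when launching through spark-submit. A small sketch for sanity-checking the heap the driver actually got; the ~450 MB figure is the minimum from the error above:
// Hedged sketch: the check in the stack trace compares against the local JVM's
// max heap, so verify what the IDE run configuration actually granted.
val maxHeapMb = Runtime.getRuntime.maxMemory / (1024L * 1024L)
println(s"Driver JVM max heap: $maxHeapMb MB") // needs to exceed ~450 MB (471859200 bytes)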

Container killed by YARN for exceeding memory limits. 14.8 GB of 6 GB physical memory used

I have a Spark job where I do the following:
1. Load the data from Parquet via Spark SQL and convert it to a pandas df. The data size is only 250 MB.
2. Run an rdd.foreach to iterate over a relatively small dataset (1000 rows), take the pandas df from step 1, and do some transformations.
I get a "Container killed by YARN for exceeding memory limits" error after some iterations:
Container killed by YARN for exceeding memory limits. 14.8 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I am unable to understand why the error says 14.8 GB of 6 GB physical memory used.
I have tried increasing spark.yarn.executor.memoryOverhead, using the following command:
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 --executor-memory 2G --conf spark.yarn.executor.memoryOverhead=4096 --py-files test.zip app_main.py
I am using Spark 2.3.
yarn.scheduler.minimum-allocation-mb = 512 MB
yarn.nodemanager.resource.memory-mb = 126 GB
This is one of the common errors when the memoryOverhead option is used; it is better to use other options to tune the job.
This post talks about the issue and how to deal with it: http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html
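For what it's worth, the 6 GB limit in the error follows directly from the settings in the spark-submit command above: YARN caps each executor container at executor-memory plus memoryOverhead, and everything the process tree uses off-heap (here, most likely the Python workers holding the pandas objects, which is my assumption) counts against that cap. A quick worked check:
// Worked check of the container limit YARN enforces (values from the command above).
val executorMemoryMb = 2 * 1024       // --executor-memory 2G
val memoryOverheadMb = 4096           // spark.yarn.executor.memoryOverhead=4096
val containerLimitMb = executorMemoryMb + memoryOverheadMb
println(s"Container limit: $containerLimitMb MB") // 6144 MB, i.e. the "6 GB" in the error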

Executor is taking more memory than defined

spark-submit --num-executors 10 --executor-memory 5g --master yarn --executor-cores 3 --class com.octro.hbase.hbase_final /home/hadoop/testDir/nikunj/Hbase_data_maker/target/Hbase_data_maker-0.0.1-SNAPSHOT-jar-with-dependencies.jar main_user_profile
This is my command to execute my spark code on the cluster.
With this command, my YARN page shows the total memory allocated as 71 GB.
I tried searching the internet for the reason but didn't receive any clear explanation.
Later I figured out that it is using the formula:
No of Executors * (Memory + 2) + 1
The plus 1 is for the main container (the ApplicationMaster). But why those 2 GB by default?
It was because of the 2 GB memory overhead that was specified in Spark's configuration file.
That's why each executor was taking 2 GB more.
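Plugging the numbers from the spark-submit command into that formula reproduces the figure on the YARN page:
// Worked check of the 71 GB shown by YARN (values from the command above).
val numExecutors  = 10                // --num-executors 10
val executorMemGb = 5                 // --executor-memory 5g
val overheadGb    = 2                 // overhead set in the Spark configuration file
val amContainerGb = 1                 // the "plus 1": ApplicationMaster container
val totalGb = numExecutors * (executorMemGb + overheadGb) + amContainerGb
println(s"Total allocated: $totalGb GB") // 71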

Spark (Scala) Writing (and reading) to local file system from driver

1st Question:
I have a 2-node virtual cluster with Hadoop.
I have a jar that runs a Spark job.
This jar accepts as a CLI argument a path to a commands.txt file, which tells the jar which commands to run.
I run the job with spark-submit, and I have noticed that my slave node wasn't running the job because it couldn't find the commands.txt file, which was local on the master.
This is the command I used to run it:
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class univ.bigdata.course.MainRunner \
--master yarn \
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
final-project-1.0-SNAPSHOT.jar commands commands.txt
Do I need to upload commands.txt to HDFS and give the HDFS path instead, as follows?
hdfs://master:9000/user/vagrant/commands.txt
2nd Question:
How do I write to a file on the driver machine in the cwd?
I used a normal Scala file writer to write output to queries_out.txt, and it worked fine when running spark-submit with
--master local[*]
But when running with
--master yarn
I can't find the file. No exceptions are thrown, but I just can't locate the file; it doesn't exist, as if it was never written. Is there a way to write the results to a file on the driver machine locally, or should I only write the results to HDFS?
Thanks.
Question 1: Yes, uploading it to HDFS or any network-accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in an RDD, you could call collect(), which aggregates all the data in your driver process. You then have a standard collection in your hands, which you can simply write to disk. Note that you should give your driver process enough memory to hold all the results, and do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This has absolutely poor scaling behaviour in both communication complexity and memory (both grow with the size of the result RDD). It is the easiest way and perfectly fine for a toy project or when the data set is always small; in all other cases it will certainly blow up at some point.
The better way, as you mentioned, is to use the built-in "saveAs" methods to write to e.g. HDFS (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD because you are reusing it in several computations (like cache, but holding it on disk instead of in memory), there is also a persist method on RDDs.
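A minimal sketch of the collect-then-write approach described above, assuming the results live in an RDD[String] called results (a hypothetical name) that fits in driver memory; the last line shows the scalable saveAs alternative, reusing the HDFS path style from the question:
import java.io.PrintWriter
// Hedged sketch: pull everything to the driver, then write with plain Java IO.
// `results` is a hypothetical RDD[String]; it must fit in driver memory.
val lines = results.collect()
val writer = new PrintWriter("queries_out.txt")   // lands in the driver's cwd
try lines.foreach(line => writer.println(line)) finally writer.close()
// Scalable alternative: let the executors write the RDD to HDFS directly.
results.saveAsTextFile("hdfs://master:9000/user/vagrant/queries_out")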
The solution was very simple: I changed --deploy-mode cluster to --deploy-mode client, and then the file writes were done correctly on the machine where I ran the driver.
Answer to Question 1:
Submitting the Spark job with the --files flag followed by the path to a local file ships the file from the driver node to the cwd of all the worker nodes, so it can then be accessed just by using its name.
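A small sketch of reading such a file from the Scala side after it has been shipped with --files; SparkFiles resolves the local copy by its bare name:
import org.apache.spark.SparkFiles
import scala.io.Source
// Hedged sketch: after `spark-submit --files /path/to/commands.txt ...`,
// the shipped copy can be located by name on any node.
val commandsPath = SparkFiles.get("commands.txt")
val commands = Source.fromFile(commandsPath).getLines().toList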