Spark read JSON hangs when loading a file larger than 1 GB - pyspark

When I try to load a JSON file larger than 1 GB, the process runs forever without throwing any exception.
dump=spark.read.json("hdfs://ip-000-00-0-000.aws.foobar.com:8020/user/hadoop/mixpanel-event2017-12-11a2.txt")
I am using:
Spark - 2.0.2,
Master - m4.4xlarge
Core - 4 x m4.4xlarge
Run on Pyspark

It sounds like you need more memory. (Spark's default driver memory is 1 GB.)

Related

Scala code execution on master of spark cluster?

My Spark application makes some API calls that do not use the SparkSession. I believe that when a piece of code doesn't use Spark, it is executed on the master node.
Why do I want to know this?
I am getting a Java heap space error while trying to POST some files using API calls, and I believe it can be solved if I upgrade the master and increase the driver memory.
I want to understand how this type of application is executed on the Spark cluster.
Is my understanding right, or am I missing something?

It depends. Closures/functions passed to the built-in function transform, any code in UDFs you create, and code in foreachBatch (and maybe a few other places) will run on the workers. Other code runs on the driver.

Running Scala module with databricks-connect

I've tried to follow the instructions here to set up databricks-connect with IntelliJ. My understanding is that I can run code from the IDE and it will run on the databricks cluster.
I added the jar directory from the miniconda environment and moved it above all of the maven dependencies in File -> Project Structure...
However I think I did something wrong. When I tried to run my module I got the following error:
21/07/17 22:44:24 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:221)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:201)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:413)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1016)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1010)
at com.*.sitecomStreaming.sitecomStreaming$.main(sitecomStreaming.scala:184)
at com.*.sitecomStreaming.sitecomStreaming.main(sitecomStreaming.scala)
The system memory being 259 gb makes me think it's trying to run locally on my laptop instead of the dbx cluster? I'm not sure if this is correct and what I can do to get this up and running properly...
Any help is appreciated!

With databricks-connect, the driver always runs locally; only the executors run in the cloud. Also, the reported memory is in bytes, so 259522560 is about 247.5 MiB (roughly 260 MB), not 259 GB. You can increase it using the option named in the error message.
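The two figures in the error message can be checked with a little arithmetic. This is only a sketch of the unit conversion; the helper name bytes_to_mib is made up for illustration and is not part of Spark or databricks-connect.

```python
MIB = 1024 * 1024  # bytes per mebibyte

def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count (as printed in the Spark error) to MiB."""
    return n_bytes / MIB

available = 259522560  # "System memory" value from the error message
required = 471859200   # "must be at least" value from the error message

print(f"available: {bytes_to_mib(available):.1f} MiB")  # available: 247.5 MiB
print(f"required:  {bytes_to_mib(required):.1f} MiB")   # required:  450.0 MiB
```

So the local driver JVM started with roughly a quarter of a gigabyte of heap, while Spark wants at least 450 MiB, which as far as I can tell is 1.5 times the 300 MiB it reserves for itself.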
P.S. But if you're using structured streaming, then yes - it's a known limitation of databricks-connect.

How do I run "s3-dist-cp" command inside pyspark shell / pyspark script in EMR 5.x

I had some problems running an s3-dist-cp command in my pyspark script. I needed to move some data from S3 to HDFS to improve performance, so I am sharing what worked here.

import os
os.system("/usr/bin/s3-dist-cp --src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/ --dest=/de_pulse/ --groupBy='.*(additional).*' --targetSize=64 --outputCodec=none")
Note: make sure that you give the full path of s3-dist-cp (for example /usr/bin/s3-dist-cp).
I think subprocess could be used as well.
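A rough sketch of the subprocess variant, assuming the same flags as the os.system call above. The helper names build_s3_dist_cp_cmd and run_s3_dist_cp are made up for illustration. Because the command is passed as a list, no shell is involved, so the --groupBy pattern does not need the extra quotes.

```python
import subprocess

S3_DIST_CP = "/usr/bin/s3-dist-cp"  # full path, as recommended in the note above

def build_s3_dist_cp_cmd(src, dest, group_by=None, target_size=None, output_codec=None):
    """Build the argument list for s3-dist-cp (hypothetical helper)."""
    cmd = [S3_DIST_CP, f"--src={src}", f"--dest={dest}"]
    if group_by is not None:
        cmd.append(f"--groupBy={group_by}")
    if target_size is not None:
        cmd.append(f"--targetSize={target_size}")
    if output_codec is not None:
        cmd.append(f"--outputCodec={output_codec}")
    return cmd

def run_s3_dist_cp(cmd):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)

cmd = build_s3_dist_cp_cmd(
    src="s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/",
    dest="/de_pulse/",
    group_by=".*(additional).*",
    target_size=64,
    output_codec="none",
)
print(cmd)
# run_s3_dist_cp(cmd)  # uncomment on an EMR node where s3-dist-cp is installed
```

Compared with os.system, this surfaces a failed copy as an exception instead of a return code that is easy to ignore.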

If you're running a pyspark application, you'll have to stop the spark application first. The s3-dist-cp will hang because the pyspark application is blocking.
spark.stop() # spark context
os.system("/usr/bin/s3-dist-cp ...")

Spark (Scala) Writing (and reading) to local file system from driver

1st Question :
I have a 2 node virtual cluster with hadoop.
I have a jar that runs a spark job.
This jar accepts as a cli argument : a path to a commands.txt file which tells the jar which commands to run.
I run the job with spark-submit, and I noticed that my slave node wasn't running because it couldn't find the commands.txt file, which was local to the master.
This is the command I used to run it:
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--class univ.bigdata.course.MainRunner \
--master yarn \
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
final-project-1.0-SNAPSHOT.jar commands commands.txt
Do I need to upload commands.txt to HDFS and give the HDFS path instead, as follows?
hdfs://master:9000/user/vagrant/commands.txt
2nd Question :
How do I write to a file on the driver machine in the cwd?
I used a normal Scala FileWriter to write output to queries_out.txt, and it worked fine when using spark-submit with
--master local[]
But when running with
--master yarn
I can't find the file. No exceptions are thrown, but I just can't locate the file. It doesn't exist, as if it was never written. Is there a way to write the results to a file locally on the driver machine? Or should I only write results to HDFS?
Thanks.

Question 1: Yes, uploading it to HDFS or any network-accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in an RDD, you could call collect(), which will gather all the data in your driver process. You then have a standard collection in your hands, which you can simply write to disk. Note that you should give your driver process enough memory to hold all the results, and do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This approach scales very poorly in both communication and memory, since both grow with the size of the result RDD. It is the easiest way, and it is perfectly fine for a toy project or when the data set is always small. In all other cases it will blow up at some point.
The better way, as you mentioned, is to use the built-in "saveAs" methods (for example saveAsTextFile) to write to e.g. HDFS (or another storage system). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD because you are reusing it in several computations (like cache, but holding it on disk instead of in memory), there is also a persist method on RDDs.
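The collect-then-write approach for small results could look roughly like the sketch below. It is written in Python, like the PySpark snippets elsewhere on this page, rather than the asker's Scala. The helper write_results and the sample rows are made up for illustration; with Spark, the rows would come from something like rdd.collect() or rdd.toLocalIterator().

```python
import os
import tempfile
from typing import Iterable

def write_results(rows: Iterable, path: str) -> int:
    """Write one row per line to a local file on the driver; return the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(f"{row}\n")
            count += 1
    return count

# Stand-in for rdd.collect(); only safe when the result fits in driver memory.
sample_rows = [("alice", 3), ("bob", 5)]

# Written to a temp directory here; the question used queries_out.txt in the cwd.
out_path = os.path.join(tempfile.gettempdir(), "queries_out.txt")
written = write_results(sample_rows, out_path)
print(f"wrote {written} rows to {out_path}")
```

The file lands on whichever machine runs the driver, so with --deploy-mode cluster it ends up on a cluster node rather than on the machine where spark-submit was invoked.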

The solution was very simple: I changed --deploy-mode cluster to --deploy-mode client, and then the file writes were done correctly on the machine where I ran the driver.

Answer to Question 1:
Submitting the Spark job with the --files option followed by the path to a local file copies that file from the driver node to the cwd of all the worker nodes, so it can be accessed just by using its name.

Is it possible to read NetCDF file above 1 GB using SRdd?

We are using SciSpark to read NetCDF files using the concept of SRdd. We get an error once we try to read a file larger than 1 GB.
val data = sc.OpenPath("/home/Project/TestData",List("rhum"))
Is there any problem in this code?
We get this error: java.lang.OutOfMemoryError: Java heap space

If I understand it right, SciSpark is a Spark library and you run your code with spark-shell or spark-submit. If so, you just need to specify proper memory options, like this:
spark-shell --driver-memory 2g --executor-memory 8g
