Running Scala module with databricks-connect - scala

I've tried to follow the instructions here to set up databricks-connect with IntelliJ. My understanding is that I can run code from the IDE and it will run on the Databricks cluster.
I added the jar directory from the Miniconda environment and moved it above all of the Maven dependencies in File -> Project Structure.
However, I think I did something wrong. When I tried to run my module I got the following error:
21/07/17 22:44:24 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:221)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:201)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:413)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1016)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1010)
at com.*.sitecomStreaming.sitecomStreaming$.main(sitecomStreaming.scala:184)
at com.*.sitecomStreaming.sitecomStreaming.main(sitecomStreaming.scala)
The system memory being 259 GB makes me think it's trying to run locally on my laptop instead of on the Databricks cluster? I'm not sure if this is correct or what I can do to get this up and running properly...
Any help is appreciated!

The driver in databricks-connect always runs locally; only the executors run in the cloud. Also, the reported memory is in bytes, so 259522560 is roughly 250 MB - you can increase it using the option the error message reports.
P.S. If you're using Structured Streaming, then yes - that's a known limitation of databricks-connect.
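For anyone who wants to confirm that locally, here is a minimal Scala sketch (not the original code; the object name is made up) that prints the max heap of the IDE-launched JVM, which is the value Spark validates here because the databricks-connect driver is that local process:
object DriverHeapCheck {
  def main(args: Array[String]): Unit = {
    // Under databricks-connect the driver is this local JVM, so the value Spark
    // checks at startup is simply this process's max heap.
    val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"Local driver max heap: $maxHeapMb MB")
    // Spark refuses to start below roughly 450 MB (471859200 bytes, per the error above).
    // Raise the heap with e.g. -Xmx2g in the IntelliJ run configuration's VM options,
    // or via --driver-memory / spark.driver.memory when launching with spark-submit.
  }
}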

Related

Session isn't active Pyspark in an AWS EMR cluster

I have opened an AWS EMR cluster, and in a PySpark3 Jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
I had the same issue, and the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see that by executing the following from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory just do
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From this Stack Overflow question's answer, which worked for me.
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application is likely taking longer than the timeout defined for the Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. Edit the /etc/livy/conf/livy.conf file (on the cluster's master node).
2. Set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app) - see the snippet after this list.
3. Restart Livy to pick up the setting: sudo restart livy-server on the cluster's master node.
4. Test your code again.
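For reference, a sketch of what steps 1-3 might look like on the master node (the 8h value is only an example; adjust to your app):
# /etc/livy/conf/livy.conf
livy.server.session.timeout = 8h
# restart Livy so the new value takes effect
sudo restart livy-server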
An alternative way to edit this setting: https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
Just a restart helped solve this problem for me. In your Jupyter notebook, go to Kernel -> Restart.
Once done, if you run a cell with the spark command you will see that a new Spark session gets established.
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
The solution might be to increase spark.executor.heartbeatInterval; the default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
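For illustration, the same classification mechanism could be used for the heartbeat setting mentioned above (the 60s value is purely an example, not taken from the docs):
{"Classification": "spark-defaults", "Properties": {"spark.executor.heartbeatInterval": "60s"}}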
Insufficient reputation to comment: I tried increasing the heartbeat interval to a much higher value (100 seconds), with the same result. FWIW, the error shows up in under 9 seconds.
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.

Spark (Scala) Writing (and reading) to local file system from driver

1st Question:
I have a 2-node virtual cluster with Hadoop.
I have a jar that runs a Spark job.
This jar accepts as a CLI argument a path to a commands.txt file, which tells the jar which commands to run.
I run the job with spark-submit, and I noticed that my slave node wasn't working because it couldn't find the commands.txt file, which was local on the master.
This is the command I used to run it:
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
  --class univ.bigdata.course.MainRunner \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1g \
  --num-executors 4 \
  final-project-1.0-SNAPSHOT.jar commands commands.txt
Do I need to upload commands.txt to HDFS and give the HDFS path instead, as follows?
hdfs://master:9000/user/vagrant/commands.txt
2nd Question:
How do I write to a file on the driver machine in the cwd?
I used a normal Scala FileWriter to write output to queries_out.txt, and it worked fine when using spark-submit with
--master local[*]
But when running with
--master yarn
I can't find the file. No exceptions are thrown, but I just can't locate the file; it doesn't exist, as if it was never written. Is there a way to write the results to a file on the driver machine locally, or should I only write results to HDFS?
Thanks.
Question 1: Yes, uploading it to HDFS or any other network-accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in an RDD, you could call collect(), which aggregates all the data on your driver process. Then you have a standard collection in your hands, which you can simply write to disk. Note that you should give your driver process enough memory to hold all results, and don't forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This has absolutely poor scaling behaviour in both communication complexity and memory (both grow with the size of the result RDD). It is the easiest way and perfectly fine for a toy project or when the data set is always small, but in all other cases it will certainly blow up at some point.
The better way, as you mentioned, is to use the built-in "saveAs" methods to write to e.g. HDFS (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD because you are reusing it in several computations (like cache, but holding it on disk instead of in memory), there is also a persist method on RDDs.
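To make the two options concrete, here is a minimal Scala sketch (not the original poster's code; the object name, placeholder data, and output paths are illustrative):
import java.io.PrintWriter
import org.apache.spark.{SparkConf, SparkContext}

object WriteResultsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-results-sketch"))
    val results = sc.parallelize(Seq("row1", "row2", "row3")) // stand-in for the real result RDD

    // Option 1: collect() pulls everything onto the driver, then write with plain IO.
    // Only safe when the result fits in driver memory (see --driver-memory / maxResultSize above).
    val writer = new PrintWriter("queries_out.txt")
    try results.collect().foreach(line => writer.println(line)) finally writer.close()

    // Option 2: let the executors write directly to a distributed store such as HDFS.
    results.saveAsTextFile("hdfs://master:9000/user/vagrant/queries_out")

    sc.stop()
  }
}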
The solution was very simple: I changed --deploy-mode cluster to --deploy-mode client, and then the file writes were done correctly on the machine where I ran the driver.
Answer to Question 1:
Submitting the Spark job with the --files flag followed by the path to a local file distributes the file from the driver node to the cwd of all the worker nodes, so it can be accessed just by using its name.
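A minimal sketch of what that looks like from the job's side, assuming the job was submitted with --files commands.txt (the object name is illustrative):
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object CommandsRunnerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("commands-runner-sketch"))

    // Because spark-submit --files commands.txt ships the file into the working
    // directory of every container, it can be opened by its bare name rather than
    // by a path that only exists on the master.
    val commands = Source.fromFile("commands.txt").getLines().toList
    commands.foreach(println)

    sc.stop()
  }
}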

Has anyone been successful running Apache Spark & Shark on Cassandra

I am trying to configure a 5 node cassandra cluster to run Spark/Shark to test out some Hive queries.
I have installed Spark, Scala and Shark, and configured everything according to the Amplab guide "Running Shark on a Cluster" (https://github.com/amplab/shark/wiki/Running-Shark-on-a-Cluster).
I am able to get into the Shark CLI, but when I try to create an EXTERNAL TABLE out of one of my Cassandra ColumnFamily tables, I keep getting this error:
Failed with exception org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.cassandra.CassandraStorageHandler
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I have configured HIVE_HOME, HADOOP_HOME and SCALA_HOME. Perhaps I'm pointing HIVE_HOME and HADOOP_HOME to the wrong paths? HADOOP_HOME is set to my Cassandra Hadoop folder (/etc/dse/cassandra), HIVE_HOME is set to the unpacked Amplab download of Hadoop1/hive, and I have also set HIVE_CONF_DIR to my Cassandra Hive path (/etc/dse/hive).
Am I missing any steps? Or have I configured these locations wrongly? Any ideas please? Any help will be very much appreciated. Thanks
Yes, I have got it working.
Try https://github.com/2013Commons/hive-cassandra
which works with Cassandra 2.0.4, Hive 0.11 and Hadoop 2.0.

Heroku memory leak with Play2 scala

I was doing some stress (ab) testing against my single Heroku dyno and a dev database with a 20-connection limit.
During the calls (which access the database with Squeryl), the heap allocation keeps increasing, causing R14 errors (memory usage above 512 MB).
I cannot seem to reproduce the problem locally (at least not at those levels).
Is there any way to get a Heroku heap dump and analyze it to get some clues?
Are there any known memory-leak issues with Play 2, Scala, Squeryl and Heroku?
Update
If I call System.gc at the end of the controller, everything seems to be fine (and slower, of course). I create a lot of objects in that call, but shouldn't Heroku's JVM take care of GC? Scheduling a periodic GC call also doesn't free memory.
There's a great article for troubleshooting memory issues on Heroku:
https://devcenter.heroku.com/articles/java-memory-issues
In your case, you can add the GC flags to JAVA_OPTS to see memory details. I'd suggest the following flags:
heroku config:add JAVA_OPTS="-Xmx384m -Xss512k -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps"
There's also a simple java agent that you can add to your process if you want a little more info from JMX about your memory. You can also take a look at monitoring addons like New Relic if you want to go into more depth, but I think you should be fine with the flags and java agent.
I had this issue as well, and answered it here.
I had the same issue. Heroku is telling you the machine is running out of memory, not the Java VM. There is actually a bug in the Heroku Play 2.2 deployment: the startup script reads java_opts, not JAVA_OPTS.
I fixed it by setting both:
heroku config:add java_opts='-Xmx384m -Xms384m -Xss512k -XX:+UseCompressedOops'
heroku config:add JAVA_OPTS='-Xmx384m -Xms384m -Xss512k -XX:+UseCompressedOops'
I also had to set -Xms, otherwise I got an error saying the min and max were incompatible. I guess Play 2.2 was using a default higher than 384m.

Hadoop - Map and reduce stuck at 0% all the time

I'm trying to implement the PageRank algorithm on the Hadoop platform with Eclipse, but I'm facing some unusual problems. I tried this locally: installed Cygwin, set up Hadoop 0.19.2 (and 0.18.0), started the necessary daemons, and installed Eclipse 3.3.1. I uploaded a test .txt file and then tried to run the WordCount example (or even a simple .java) and I got this output (repeated about 100 times):
10/07/22 22:10:23 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/22 22:10:23 INFO mapred.JobClient: Running job: job_201007220415_0017
10/07/22 22:10:24 INFO mapred.JobClient: map 0% reduce 0%
Map and reduce stay at 0% the whole time. I tried with Hadoop on a virtual machine and ran into the same situation.
I followed all the instructions from the Hadoop page and other useful pages, but that didn't resolve my problem. Any suggestions?
That sounds like a problem more with your Hadoop setup than with Eclipse. Make sure you have all the pieces of your cluster running, i.e. DataNode(s), TaskTracker(s), JobTracker. If those are all running, it might be a problem with the way you're setting up the job.
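One quick way to check that (assuming the daemons run on the same box) is to list the running Java processes; for a single-node Hadoop 0.18/0.19 setup you would expect to see something like NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:
jps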
Are you set on doing this in Java? If not, you can use a Ruby gem called Wukong that has a PageRank example: http://github.com/mrflip/wukong/tree/master/examples/pagerank/