How to add a jar to Spark in PyCharm - pyspark

I want to debug Spark code in PyCharm because it is easier to debug there. But I need to add the spark-redis jar, otherwise the job fails with Failed to find data source: redis
The code to connect to redis is
spark = SparkSession \
.builder \
.appName("Streaming Image Consumer") \
.config("spark.redis.host", self.redis_host) \
.config("spark.redis.port", self.redis_port) \
.getOrCreate()
How do I fix this when running from PyCharm?
I have tried adding spark.driver.extraClassPath in $SPARK_HOME/conf/spark-defaults.conf but it does not work.
I also tried adding the environment variable PYSPARK_SUBMIT_ARGS --jars ... in the run configuration, but it raises another error.

Adding spark.driver.extraClassPath to spark-defaults.conf works for me with Spark 2.3.1
cat /Users/oleksiidiagiliev/Soft/spark-2.3.1-bin-hadoop2.7/conf/spark-defaults.conf
spark.driver.extraClassPath /Users/oleksiidiagiliev/.m2/repository/com/redislabs/spark-redis/2.3.1-SNAPSHOT/spark-redis-2.3.1-SNAPSHOT-jar-with-dependencies.jar
Please note, this is a jar with dependencies (you can build one from sources using mvn clean install -DskipTests).
Also, I added the pyspark libraries and the SPARK_HOME environment variable to the PyCharm project as described here: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f
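If you prefer to keep everything inside the PyCharm project instead of editing spark-defaults.conf, a minimal sketch is to point spark.jars at the jar when building the session. The jar path below is a placeholder (use the location of your spark-redis jar-with-dependencies), and this only takes effect if no SparkSession/SparkContext has been created yet in the Python process:
from pyspark.sql import SparkSession

# Placeholder path: point this at your spark-redis jar-with-dependencies
spark_redis_jar = "/path/to/spark-redis-2.3.1-SNAPSHOT-jar-with-dependencies.jar"

spark = SparkSession \
    .builder \
    .appName("Streaming Image Consumer") \
    .config("spark.jars", spark_redis_jar) \
    .config("spark.redis.host", "localhost") \
    .config("spark.redis.port", "6379") \
    .getOrCreate()
spark.jars accepts a comma-separated list, so additional jars can be appended the same way.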

Related

Why does the classpath used by spark-submit unexpectedly have jars from under the python installation?

I have a jar file that contains some Scala (and Java) code that I run using the following spark-submit command:
spark-submit \
--verbose \
--class mycompany.MyClass \
--conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf -verbose:class" \
--conf 'spark.driver.extraJavaOptions=-verbose:class' \
--master yarn \
--driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
--jars /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar \
--files /home/hadoop/mydir/dev-test.conf \
--queue default /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar \
<<args to MyClass>>
When I run it, I get an error message - “IAMInstanceCredentialsProvider not found”, which is caused by a version mismatch. It seems IAMInstanceCredentialsProvider was added to hadoop-aws in version 3.3.0 and we want to use 3.2.1. I've gone through our maven dependencies and feel confident that we are not trying to use 3.3.x anywhere.
I've attempted to debug the problem by adding some "verbose" arguments to the command, and I've also added some debug code to MyClass to print out the classpath in effect, following the instructions from here.
When I look at the output, the classpath in effect when we run the spark-submit command includes a lot of jars included with Python, including /usr/local/lib/python3.7/site-packages/pyspark/jars/hadoop-client-api-3.3.1.jar. Thus far, I've been unable to figure out why we are loading jars from /usr/local/lib/python3.7.
Can anybody explain to me where those dependencies are coming from, or suggest a way I could debug where they come from? I thought the Python path might be the result of some environment variable setting, but if so, it doesn't seem to be set at the top level:
set|grep -i python
doesn't return anything.
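Those jars typically come from a pip-installed pyspark package, which bundles its own Spark and Hadoop client jars under site-packages/pyspark/jars. As a diagnostic sketch (not part of the original post), you can check from Python which installation is being picked up:
import os
import pyspark

# Directory of the pip-installed pyspark package and the jars it bundles
pyspark_home = os.path.dirname(pyspark.__file__)
print(pyspark_home)
print(sorted(os.listdir(os.path.join(pyspark_home, "jars"))))
If this prints /usr/local/lib/python3.7/site-packages/pyspark, the spark-submit being invoked is most likely the one shipped with that pip package, and its bundled hadoop-client-api-3.3.1.jar ends up on the classpath alongside the jars you pass with --jars.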

Unable to import cosmosDB packages in spark-shell

I am trying to upload some data from a dataframe to Azure Cosmos DB.
I have downloaded the below jar files and added them to my local folder along with the eventHub_Jars.
azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
azure-cosmosdb-2.0.0.jar
azure-documentdb-1.16.4.jar
documentdb-bulkexecutor-2.4.1.jar
Below is the command I used to open the Spark shell, which works:
spark-shell --master local --jars eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
When I open the shell with the eventHub jars and other jars, as in
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar, azure-eventhubs-spark_2.11-2.3.2.jar, azure-eventhubs-1.0.2.jar, proton-j-0.25.0.jar, scala-java8-compat_2.11-0.9.0.jar, slf4j-api-1.7.25.jar, azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
the shell opens fine.
But when I try to import
import com.microsoft.azure.cosmosdb.spark.config.Config
it is throwing the below error
error: object cosmosdb is not a member of package com.microsoft.azure
import com.microsoft.azure.cosmosdb.spark.config.Config
What could be the reason for the above error? Is there a syntax issue?
It seems like only the first jar added is working; if we try to import any package from any of the other jars, it throws the above error.
When I tried this, I had an issue with the --jars option using relative paths to retrieve the jar files, unless I added "file:///" to the start of the path where I had stored the jar files.
For example, if a jar file was located in /usr/local/spark/jars_added/ (a folder I created), the required path for the --jars option is file:///usr/local/spark/jars_added/*.jar, where "*" represents your jar name.
The following won't be the same on your machine; however, it gives you the idea for specifying the jar files.
spark-shell \
--master local \
--packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 \
--jars file:///usr/local/spark/jars_added/eventHub_Jars/scala-library-2.11.12.jar,file:///usr/local/spark/jars_added/azure-eventhubs-spark_2.11-2.3.2.jar,file:///usr/local/spark/jars_added/azure-eventhubs-1.0.2.jar,file:///usr/local/spark/jars_added/proton-j-0.25.0.jar,file:///usr/local/spark/jars_added/scala-java8-compat_2.11-0.9.0.jar,file:///usr/local/spark/jars_added/slf4j-api-1.7.25.jar,file:///usr/local/spark/jars_added/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
Alternatively, you can copy the jar files to the default location from which jars are loaded for each Spark session. Note that if you have a jars folder in $SPARK_HOME, it overrides the default location (in case readers are unsure, $SPARK_HOME is most likely /usr/local/spark). On my machine, for example, jars are loaded from /usr/local/spark/assembly/target/scala-2.11/jars by default.
It works when I specify the full path for each jar after --jars, with no spaces after the commas, so the whole list is passed as a single --jars argument:
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar,eventHub_Jars/azure-eventhubs-spark_2.11-2.3.2.jar,eventHub_Jars/azure-eventhubs-1.0.2.jar,eventHub_Jars/proton-j-0.25.0.jar,eventHub_Jars/scala-java8-compat_2.11-0.9.0.jar,eventHub_Jars/slf4j-api-1.7.25.jar,eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar

spark-submit for a .scala file

I have been running some test Spark Scala code using (probably) a bad way of doing things with spark-shell:
spark-shell --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
This would execute my code on Spark and drop me back into the shell when done.
Now that I am trying to run this on a cluster, I think I need to use spark-submit, which I thought would be:
spark-submit --conf spark.neo4j.bolt.password=Stuffffit --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
But it does not like the .scala file; does it somehow have to be compiled into a class? The Scala code is a simple file with several helper classes defined in it and no real main class, so to speak. I don't see anything in the help files, but maybe I am missing it: can I just spark-submit a file, or do I have to somehow give it the class, and thus change my Scala code?
I did add this to my scala code too:
went from this
val conf = new SparkConf().setMaster("local").setAppName("neo4jspark")
val sc = new SparkContext(conf)
To this:
val sc = new SparkContext(new SparkConf().setMaster("spark://192.20.0.71:7077"))
There are 2 quick and dirty ways of doing this:
Without modifying the scala file
Simply use the spark shell with the -i flag:
$SPARK_HOME/bin/spark-shell -i neo4jsparkCluster.scala
Modifying the scala file to include a main method
a. Compile:
scalac -classpath <location of spark jars on your machine> neo4jsparkCluster.scala
b. Submit it to your cluster:
/usr/lib/spark/bin/spark-submit --class <qualified class name> --master <> .
You will want to package your scala application with sbt and include Spark as a dependency within your build.sbt file.
See the self-contained applications section of the quick start guide for full instructions: https://spark.apache.org/docs/latest/quick-start.html
You can take a look at the following Hello World example for Spark which packages your application as #zachdb86 already mentioned.
spark-hello-world

Spark Examples NoClassDefFoundError scopt/OptionParser

I've built the Spark 2.1 source code successfully.
However, when I run some of the examples (e.g., org.apache.spark.examples.mllib.BinaryClassification), I get the following error.
Exception in thread "main" java.lang.NoClassDefFoundError: scopt/OptionParser
I tried to run those examples using the Spark 2.1 pre-built version (examples/jars/spark-examples_2.11-2.1.0.jar), and I got the same error. The Spark 1.6 pre-built version works (lib/spark-examples-1.6.2-hadoop2.6.0.jar). There are posts related to this error, but they don't seem to be applicable because the Spark examples folder does not have any .sbt file.
I found the answer. To avoid the error, scopt_x.xx-x.x.x.jar should also be submitted using --jars. When you build the Spark examples, in addition to spark-examples_x.xx-x.x.x.jar, scopt_x.xx-x.x.x.jar is built too (in my case in the same target folder, examples/target/scala-2.11/jars).
Once you have the jar file, you can submit it along with your application:
./bin/spark-submit \
--jars examples/target/scala-2.11/jars/scopt_x.xx-x.x.x.jar \
--class org.apache.spark.examples.mllib.BinaryClassification \
--master ...

--files in SPARK_SUBMIT_OPTIONS not working in zeppelin

I have a Python package with many modules built into an .egg file, and I want to use this inside a Zeppelin notebook. According to the Zeppelin documentation, to pass this package to the Zeppelin Spark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh.
When I add the .egg through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook.
What's the correct way to pass an .egg file to the Zeppelin Spark interpreter?
The Spark version is 1.6.2 and the Zeppelin version is 0.6.0.
The zeppelin-env.sh file contains the following:
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg"
You also need to adjust the PYTHONPATH on the executor nodes:
export SPARK_SUBMIT_OPTIONS="... --conf 'spark.executorEnv.PYTHONPATH=fly_libs-1.1-py2.7.egg:pyspark.zip:py4j-0.10.3-src.zip' ..."
It does not seem to be possible to append to an existing PYTHONPATH; therefore, make sure you list all the required dependencies.
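To check whether the egg is actually visible to the Python workers, a small diagnostic sketch you can run in a %pyspark paragraph (sc is the SparkContext the interpreter provides) is to compare sys.path on the driver with sys.path on an executor:
import sys

# Python path on the driver (Zeppelin's pyspark interpreter)
print(sys.path)

def worker_sys_path(_):
    # Runs inside a Python worker on an executor
    import sys
    return sys.path

# Python path on an executor; the .egg from spark.executorEnv.PYTHONPATH should show up here
print(sc.parallelize([0], 1).map(worker_sys_path).collect())
If the egg appears on the driver path but not on the executor path (or vice versa), that narrows down which side of the configuration still needs adjusting.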