--files in SPARK_SUBMIT_OPTIONS not working in zeppelin - pyspark

I have a Python package with many modules built into an .egg file and I want to use it inside a Zeppelin notebook. According to the Zeppelin documentation, to pass this package to the Zeppelin Spark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh.
When I add the .egg through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook.
What's the correct way to pass an .egg file to the Zeppelin Spark interpreter?
The Spark version is 1.6.2 and the Zeppelin version is 0.6.0.
The zeppelin-env.sh file contains the following:
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg"

You also need to adjust the PYTHONPATH on the executor nodes:
export SPARK_SUBMIT_OPTIONS="... --conf 'spark.executorEnv.PYTHONPATH=fly_libs-1.1-py2.7.egg:pyspark.zip:py4j-0.10.3-src.zip' ..."
It does not seem to be possible to append to an existing PYTHONPATH, so make sure you list all the required dependencies.
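For reference, a combined zeppelin-env.sh along these lines is what I would expect to work; the paths are the ones from the question, and the pyspark/py4j zip names should match whatever ships with your Spark distribution, so treat this as a sketch rather than a drop-in config:
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg --conf 'spark.executorEnv.PYTHONPATH=fly_libs-1.1-py2.7.egg:pyspark.zip:py4j-0.10.3-src.zip'"
If the import still fails on the driver side (the notebook process itself), calling sc.addPyFile("/home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg") from a %pyspark paragraph is another option worth trying, since addPyFile ships the egg to the executors and adds it to the Python path at runtime.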

Related

PySpark - How to build a java jar to be used in spark-submit --packages

I need this jar for listening to metrics data for a streaming query. The jar works when passed from the file system using --jars, but gives an "unresolved dependency" error when passed as --packages and read from a remote repo (JFrog). All other dependencies, like ABRiS and Kafka, work fine.
EDIT: previously asked question with no conclusion: How to use custom jars in spark-submit --packages
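One thing worth checking here: --packages resolves coordinates through Ivy against the local ivy/maven caches and Maven Central by default, so an artifact hosted only in an internal JFrog repository also needs that repository passed explicitly via --repositories (or an Ivy settings file via spark.jars.ivySettings). A hedged sketch; the repository URL, coordinate, and application file below are placeholders, not values from the question:
spark-submit --repositories https://mycompany.jfrog.io/artifactory/libs-release --packages com.example:metrics-listener:1.0.0 your_app.py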

Unable to import cosmosDB packages in spark-shell

I am trying to upload some data from a dataframe to Azure Cosmos DB.
I have downloaded the below jar files and added them to my local folder along with the eventHub_Jars.
azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
azure-cosmosdb-2.0.0.jar
azure-documentdb-1.16.4.jar
documentdb-bulkexecutor-2.4.1.jar
Below is the command I use to open the shell, which works:
spark-shell --master local --jars eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
But when I open the shell along with the eventHub jars and the other jars, as in
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar, azure-eventhubs-spark_2.11-2.3.2.jar, azure-eventhubs-1.0.2.jar, proton-j-0.25.0.jar, scala-java8-compat_2.11-0.9.0.jar, slf4j-api-1.7.25.jar, azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
the shell opens.
But when I try to import
import com.microsoft.azure.cosmosdb.spark.config.Config
it throws the below error:
error: object cosmosdb is not a member of package com.microsoft.azure
import com.microsoft.azure.cosmosdb.spark.config.Config
What could be the reason for the above error? Is there any syntax issue?
It seems like only the first jar added is working; if I try to import any package from the other jars, it throws the above error.
When I tried this, I had an issue with the --jars option retrieving the jar files from a relative path: it did not work unless I added "file:///" to the start of the path where the jar files were stored.
For example, if a jar file was located in /usr/local/spark/jars_added/ (a folder I created), the required path for the --jars option is file:///usr/local/spark/jars_added/*.jar, where "*" represents your jar name.
The following paths won't be the same on your machine; however, you get the idea for specifying the jar files.
spark-shell
--master local
--packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13
--jars file:///usr/local/spark/jars_added/eventHub_Jars/scala-library-2.11.12.jar,
file:///usr/local/spark/jars_added/azure-eventhubs-spark_2.11-2.3.2.jar,
file:///usr/local/spark/jars_added/azure-eventhubs-1.0.2.jar,
file:///usr/local/spark/jars_added/proton-j-0.25.0.jar,
file:///usr/local/spark/jars_added/scala-java8-compat_2.11-0.9.0.jar,
file:///usr/local/spark/jars_added/slf4j-api-1.7.25.jar,
file:///usr/local/spark/jars_added/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
Alternatively, you can copy the jar files to the default location from which jars are loaded for each Spark session (note that if you have a jars folder in $SPARK_HOME, it will override the default location; in case readers are unsure, $SPARK_HOME is most likely /usr/local/spark). On my machine, for example, jars are loaded from /usr/local/spark/assembly/target/scala-2.11/jars by default.
It works when I specify the full path for each jar after --jars:
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar,eventHub_Jars/azure-eventhubs-spark_2.11-2.3.2.jar,eventHub_Jars/azure-eventhubs-1.0.2.jar,eventHub_Jars/proton-j-0.25.0.jar,eventHub_Jars/scala-java8-compat_2.11-0.9.0.jar,eventHub_Jars/slf4j-api-1.7.25.jar,eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
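Once the shell starts with every jar actually resolved (full paths or file:/// URIs, with no stray spaces in the comma-separated --jars list), the import from the question should compile. A quick smoke test is to run the import and build a Config object; this is only a sketch, and the endpoint, key, database, and collection values are placeholders rather than anything from the question:
import com.microsoft.azure.cosmosdb.spark.config.Config

// Placeholder connection details; replace with your Cosmos DB account values.
val readConfig = Config(Map(
  "Endpoint"   -> "https://your-account.documents.azure.com:443/",
  "Masterkey"  -> "your-key",
  "Database"   -> "your-database",
  "Collection" -> "your-collection"
))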

How to add the PostgreSQL JDBC driver jar to the classpath within the Spark shell?

I am trying to add the PostgreSQL jar to the Spark shell using
scala> :require <path_to_postgresql-42.0.0.jar>
but it shows the error
<console>:1: error: ';' expected but double literal found
Any suggestions?
There are a few ways to add your dependencies.
1st: add the package as a dependency.
Spark will pull the package from Maven Central for you:
spark-shell --packages org.postgresql:postgresql:42.0.0
2nd: add the jar to your Spark shell.
Change the path to match your setup:
spark-shell --jars /YOUR/PATH/postgresql-42.0.0.jar
Is there a way to verify whether setting up the classpath was successful or not?
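A quick way to verify, once the shell is up, is to load the driver class by name and optionally read a table over JDBC. A minimal sketch, assuming Spark 2.x where the shell predefines spark; the URL, table, and credentials are placeholders:
// Throws ClassNotFoundException if the driver jar is not on the classpath.
Class.forName("org.postgresql.Driver")

// Optional end-to-end check; replace the placeholder connection details.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "mytable")
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()
df.printSchema()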

EMR Notebook Scala kernel import graphframes library

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a Scala Jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
it gives the error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
From what I can tell, this means that it runs the bash command, but then still cannot find the retrieved package.
I am doing this in an EMR Notebook running a Spark Scala kernel.
Do I have to set some sort of Spark library path in the Jupyter environment?
That simply shouldn't work. What your code does is attempt to start a new, independent Spark shell. Furthermore, Spark packages have to be loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or set the equivalent in your SparkConf / SparkSession builder config before the SparkSession is initialized.
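As a sketch of the in-code equivalent (assuming nothing has created a SparkSession or SparkContext in the kernel yet, otherwise the setting is silently ignored):
import org.apache.spark.sql.SparkSession

// spark.jars.packages must be set before the first SparkSession/SparkContext
// is created; it has no effect on an already running session.
val spark = SparkSession.builder()
  .config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
  .getOrCreate()
On an EMR Notebook the session is typically created for you by the Sparkmagic/Livy kernel, so in practice the same spark.jars.packages setting usually has to go into the session-creation step (for example through the %%configure magic) rather than into notebook code that runs after the session already exists.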

MongoDB Spark Connector not working on Windows

I am using the MongoDB Spark Connector on Windows. I have Spark installed on the C drive in C:/Spark.
I have cloned the MongoDB Spark connector using the following command on the C drive:
git clone https://github.com/mongodb/mongo-spark.git
and a mongo-spark folder is created on the C drive.
When I run the following command in the Spark bin folder:
C:\spark\bin>spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.CoOrder?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.CoOrder1" --packages org.mongodb.spark:mongo-spark-connector_2.11:1.1.0
I get the following error:
'C:\spark\bin\spark-shell2.cmd" --conf "spark.mongodb.input.uri' is
not recognized as an internal or external command,operable program or
batch file.
How can I connect Spark with MongoDB?
Here my Spark is not connected to the mongo-spark folder. How can I link Spark with the mongo-spark folder?
Thanks
The error is generally related to an incomplete installation of Apache Spark on Windows. Ensure that you can first execute the spark-shell command on its own to get a Spark Scala shell.
Note that you do not need to clone the mongo-spark git repository to use the MongoDB Spark Connector; the spark-shell option --packages org.mongodb.spark:mongo-spark-connector_2.11:1.1.0 will download the necessary jars from Maven Central.
See also MongoDB Spark Connector
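Once spark-shell itself starts cleanly with that --packages coordinate, a minimal smoke test of the connector looks roughly like this (a sketch assuming a 1.x connector against Spark 1.6, where the shell predefines sc, and the spark.mongodb.input.uri from the question):
import com.mongodb.spark._

// Reads from the collection configured by spark.mongodb.input.uri
// (mongodb://127.0.0.1/test.CoOrder in the question).
val rdd = MongoSpark.load(sc)
println(rdd.count())
println(rdd.first.toJson)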