Unable to import cosmosDB packages in spark-shell - scala

I am trying to upload some data from a dataframe to Azure Cosmos DB.
I have downloaded the jar files below and added them to my local folder along with the eventHub jars:
azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
azure-cosmosdb-2.0.0.jar
azure-documentdb-1.16.4.jar
documentdb-bulkexecutor-2.4.1.jar
Below is the command I used to open the spark shell, which works:
spark-shell --master local --jars eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
When I open the spark shell with the eventHub jars and the other jars, as in
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar, azure-eventhubs-spark_2.11-2.3.2.jar, azure-eventhubs-1.0.2.jar, proton-j-0.25.0.jar, scala-java8-compat_2.11-0.9.0.jar, slf4j-api-1.7.25.jar, azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
the shell still opens.
But when I try to import
import com.microsoft.azure.cosmosdb.spark.config.Config
it throws the error below:
error: object cosmosdb is not a member of package com.microsoft.azure
import com.microsoft.azure.cosmosdb.spark.config.Config
What could be the reason for this error?
Is there a syntax issue? It seems like only the first jar added is picked up; importing a package from any of the other jars throws the same error.

When I tried this, I had an issue with the --jars option retrieving the jar files from a relative path: it only worked once I added "file:///" to the start of the path where I had stored the jar files.
For example, if a jar file was located in /usr/local/spark/jars_added/ (a folder I created), the required path for the --jars option is file:///usr/local/spark/jars_added/*.jar, where "*" represents your jar name.
The following won't be the same on your machine; however, it shows the idea for specifying the jar files.
spark-shell \
--master local \
--packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 \
--jars file:///usr/local/spark/jars_added/eventHub_Jars/scala-library-2.11.12.jar,\
file:///usr/local/spark/jars_added/azure-eventhubs-spark_2.11-2.3.2.jar,\
file:///usr/local/spark/jars_added/azure-eventhubs-1.0.2.jar,\
file:///usr/local/spark/jars_added/proton-j-0.25.0.jar,\
file:///usr/local/spark/jars_added/scala-java8-compat_2.11-0.9.0.jar,\
file:///usr/local/spark/jars_added/slf4j-api-1.7.25.jar,\
file:///usr/local/spark/jars_added/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
Alternatively, you can copy the jar files to the default location where jars are picked up for each Spark session (note that if you have a jars folder in $SPARK_HOME, it overrides the default location; in case readers are unsure, $SPARK_HOME is most likely /usr/local/spark). On my machine, for example, jars are picked up from /usr/local/spark/assembly/target/scala-2.11/jars by default.

It works when I specify the path for each jar after --jars, with no spaces after the commas:
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar,eventHub_Jars/azure-eventhubs-spark_2.11-2.3.2.jar,eventHub_Jars/azure-eventhubs-1.0.2.jar,eventHub_Jars/proton-j-0.25.0.jar,eventHub_Jars/scala-java8-compat_2.11-0.9.0.jar,eventHub_Jars/slf4j-api-1.7.25.jar,eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
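Once the connector jar is actually picked up, the import resolves and the dataframe can be written out. A minimal sketch, following the usual pattern from the azure-cosmosdb-spark examples, with placeholder values for the endpoint, key, database and collection (df stands for the dataframe holding the data to upload):
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Placeholder connection settings -- replace with your own account details.
val writeConfig = Config(Map(
  "Endpoint"   -> "https://<your-account>.documents.azure.com:443/",
  "Masterkey"  -> "<your-master-key>",
  "Database"   -> "<your-database>",
  "Collection" -> "<your-collection>",
  "Upsert"     -> "true"
))

// df is the dataframe holding the data to upload.
df.write.mode(SaveMode.Append).cosmosDB(writeConfig)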

Related

PySpark - How to build a java jar to be used in spark-submit --packages

I need this jar for listening to metrics data for a streaming query. The jar works when passed from the file system using --jars, but gives an "unresolved dependency" error when passed as --packages and read from a remote repo (jfrog). All other dependencies, such as abris and kafka, work fine.
EDIT- previously asked question with no conclusion: How to use custom jars in spark-submit --packages

Why does the classpath used by spark-submit unexpectedly have jars from under the python installation?

I have a jar file that contains some Scala (and Java) code that I run using the following spark-submit command:
spark-submit \
--verbose \
--class mycompany.MyClass \
--conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf -verbose:class" \
--conf 'spark.driver.extraJavaOptions=-verbose:class' \
--master yarn \
--driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
--jars /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar \
--files /home/hadoop/mydir/dev-test.conf \
--queue default /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar \
<<args to MyClass>>
When I run it, I get an error message - “IAMInstanceCredentialsProvider not found”, which is caused by a version mismatch. It seems IAMInstanceCredentialsProvider was added to hadoop-aws in version 3.3.0 and we want to use 3.2.1. I've gone through our maven dependencies and feel confident that we are not trying to use 3.3.x anywhere.
I've attempted to debug the problem by adding some "verbose" arguments to the command, and I've also added some debug code to MyClass to print out the classpath in effect, following the instructions from here.
When I look at the output, the classpath in effect when we run the spark-submit command includes a lot of jars included with Python, including /usr/local/lib/python3.7/site-packages/pyspark/jars/hadoop-client-api-3.3.1.jar. Thus far, I've been unable to figure out why we are loading jars from /usr/local/lib/python3.7.
Can anybody explain to me where those dependencies are coming from, or suggest a way I could debug where they come from? I thought the Python jars might be the result of some environment variable setting, but if so, it doesn't seem to be set at the top level:
set|grep -i python
doesn't return anything.
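For reference, this is a sketch of the kind of debug code that can dump the effective classpath from inside MyClass; it illustrates the approach and is not the exact code used in the question:
// Print the JVM classpath the driver was launched with.
println(sys.props("java.class.path").split(java.io.File.pathSeparator).mkString("\n"))

// Walk the classloader chain and print any jar URLs it exposes;
// jars added via --jars usually show up here.
var cl: ClassLoader = getClass.getClassLoader
while (cl != null) {
  cl match {
    case u: java.net.URLClassLoader => u.getURLs.foreach(url => println(url))
    case other                      => println(s"(non-URL classloader: ${other.getClass.getName})")
  }
  cl = cl.getParent
}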

How to add the PostgreSQL JDBC driver jar to the classpath within the Spark shell?

I am trying to add the PostgreSQL jar to the spark shell using
scala> :require <path_to_postgresql-42.0.0.jar>
but it shows the error
<console>:1: error: ';' expected but double literal found
Any suggestions?
There are a couple of ways to add your dependencies.
1st: add the package as a dependency.
Spark will pull the package from Maven Central for you:
spark-shell --packages org.postgresql:postgresql:42.0.0
2nd: add the jar to your spark shell directly.
Change the path to match your setup:
spark-shell --jars /YOUR/PATH/postgresql-42.0.0.jar
Is there a way to verify whether setting up the classpath was successful?
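One quick check, sketched here assuming the standard driver class name org.postgresql.Driver and placeholder connection details, is to load the driver class (or read a table through JDBC) from inside the shell; a missing jar shows up as a ClassNotFoundException:
// Throws ClassNotFoundException if the driver jar is not on the classpath.
Class.forName("org.postgresql.Driver")

// Or go one step further and read through JDBC -- the URL, table name and
// credentials below are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "my_table")
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()
df.printSchema()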

How to add jar files in pyspark with Anaconda?

from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g")
sc=SparkContext.getOrCreate(conf)
dfv = sc.textFile("./part-001*.gz")
I have installed pyspark through Anaconda and I can import pyspark in Anaconda Python, but I don't know how to add jar files to the conf.
I tried
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.packages','file:///XXX.jar')
but it doesn't work.
Is there a proper way to add a jar file here?
The docs say:
spark.jars.packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will be resolved according to the configuration in the file, otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
Instead, you should simply use spark.jars:
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
So:
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars','file:///XXX.jar')

--files in SPARK_SUBMIT_OPTIONS not working in zeppelin

I have a Python package with many modules built into an .egg file and I want to use it inside a Zeppelin notebook. According to the Zeppelin documentation, to pass this package to the Zeppelin Spark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh.
When I add the .egg through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook.
What's the correct way to pass an .egg file to the Zeppelin Spark interpreter?
Spark version is 1.6.2 and Zeppelin version is 0.6.0.
The zeppelin-env.sh file contains the following:
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg"
You also need to adjust the PYTHONPATH on the executor nodes:
export SPARK_SUBMIT_OPTIONS="... --conf 'spark.executorEnv.PYTHONPATH=fly_libs-1.1-py2.7.egg:pyspark.zip:py4j-0.10.3-src.zip' ..."
It does not seem to be possible to append to an existing python path, therefore make sure you list all the required dependencies.