MongoDB Spark Connector not working on Windows - mongodb

I am using MongoDB-Spark Connector on Windows.I have Spark installend in C drive in C:/Spark.
I have clone MongoDB Spark connector in using following command in c drive -
git clone https://github.com/mongodb/mongo-spark.git
and mongo-spark folder is created in C drive.
When I am running following command in Spark bin folder -
C:\spark\bin>spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.CoOrder?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.CoOrder1" --packages org.mongodb.spark:mongo-spark-connector_2.11:1.1.0
There is fllowing error -
'C:\spark\bin\spark-shell2.cmd" --conf "spark.mongodb.input.uri' is
not recognized as an internal or external command,operable program or
batch file.
How can I connect spark with MongoDB?
Here my spark is not connectoed to mongo-spark folder.How can I link spark with mongo-spark folder?
Thanks

The error is generally related to incomplete installation of Apache Spark on Windows. Ensure that you can first execute spark-shell command on it's own to get a Spark Scala shell.
Note that you do not need to clone mongo-spark git repository to use MongoDB Spark Connector, the spark-shell option --packages org.mongodb.spark:mongo-spark-connector_ will download the necessary jars from maven central.
See also MongoDB Spark Connector

Related

PySpark - How to build a java jar to be used in spark-submit --packages

Need this jar for listening to metrics data for streaming query. The jar works when passed from file system using --jars, but gives "unresolved dependency" error passed as --packages when read from remote repo (jfrog). But, all other dependencies works like abris, kafka etc.
EDIT- previously asked question with no conclusion: How to use custom jars in spark-submit --packages

Spark Shell not working after adding support for Iceberg

We are doing POC on Iceberg and evaluating it first time.
Spark Environment:
Spark Standalone Cluster Setup ( 1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add support for Iceberg in Spark shell, we are adding Iceberg dependency while launching the Spark shell as below,
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use the Spark shell at all. For all the commands (even non Iceberg) we are getting the same exception as below,
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Below simple command also throwing same exception.
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In Spark source code, BinaryCommand class belongs to Spark SQL module, so tried explicitly adding Spark SQL dependency while launching Spark shell as below, but still getting same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally i.e. without Iceberg dependency, then it is working properly.
Any pointer in the right direction for troubleshooting would be really helpful.
Thanks.
We are using the wrong Iceberg version, choose the spark 3.2 iceberg jar but running Spark 3.1. After using the correct dependency version (i.e. 3.1), we are able to launch the Spark shell with Iceberg. Also no need to specify org.apache.spark Spark jars using packages since all of that will be on the classpath anyway.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1

IllegalAccessError when running spark job in EMR

I am attempting to run a spark job that accesses dynamodb and the old way of instantiating a dynamoDb client has been deprecated and it is now recommended to use the client builder.
Well, this works fine locally, but when I deploy to EMR i'm getting this error:
Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientConfigurationFactory from class com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
My code that causes this is:
val dynamoDbClient = AmazonDynamoDBAsyncClientBuilder
.standard()
.withRegion(Regions.US_EAST_1)
.build()
my build.sbt contains:
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.114"
and my spark-submit command looks like this:
spark-submit --conf spark.eventLog.enabled=false --packages com.typesafe.play:play-json_2.11:2.5.9,com.github.traviscrawford:spark-dynamodb:0.0.6,com.amazonaws:aws-java-sdk:1.11.114 --master yarn --deploy-mode cluster --class Main application.jar
Does anyone have any ideas? Am I overlooking something basic?
Update
I noticed that EMR was running OpenJDK 1.8 and my local system was running Oracle Java 1.8. I changed the EMR cluster to match the java I was running, but there was still no change.
I dont have a perfect answer here but I'm struggling with a similar problem with a fat jar build Spark Driver running on EMR. So I drop my recent tour.
Try to run spark-submit with option -v and look into the logs about class paths and so forth. As I can see EMR is loading an aws-java-sdk as well. Its not clear to me which version of aws-java-sdk EMR is running? EMR release 4.7.0 states "Upgraded the AWS SDK for Java to 1.10.75" (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html).
Then add another argument --conf spark.driver.userClassPathFirst=true
to load the aws-java-sdk version your driver specifies.
Unfortunately the last step raises yarn errors like: Unable to load YARN support ... (some discussion on that: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-submit-fails-after-setting-userClassPathFirst-to-true/td-p/46778)
Some discussion from the aws-java-sdk github repos: https://github.com/aws/aws-sdk-java/issues/1094
Conclusion: For now use apis of aws-java-sdk version 1.10.75

Spark Examples NoClassDefFoundError scopt/OptionParser

I've build Spark 2.1 source code successfully.
However, when I run some of the examples (e.g., org.apache.spark.examples.mllib.BinaryClassification), I get the following error.
Exception in thread "main" java.lang.NoClassDefFoundError: scopt/OptionParser
I tried to run those examples using Spark 2.1 pre-built version (examples/jars/spark-examples_2.11-2.1.0.jar), and I got the same error. Spark 1.6 pre-built version works (lib/spark-examples-1.6.2-hadoop2.6.0.jar). There are posts related to this error, but they don't seem to be applicable because Spark examples folder does not have any .sbtfile.
I found the answer. To avoid the error, scopt_x.xx-x.x.x.jar should also be submitted using --jars. When you build Spark examples, in addition to spark-examples_x.xx-x.x.x.jar, scopt_x.xx-x.x.x.jar will be built too (in my case in the same target folder examples/target/scala-2.11/jars).
Once you have the jar file, you can submit it with your applications:
./bin/spark-submit \
--jars examples/target/scala-2.11/jars/scopt_x.xx-x.x.x.jar \
--class org.apache.spark.examples.mllib.BinaryClassification \
--master ...

--files in SPARK_SUBMIT_OPTIONS not working in zeppelin

I have a python package with many modules built into an .egg file and I want to use this inside zeppelin notebook. Acc to the zeppelin documentation, to pass this package to zeppelin spark interpreter, you can export it through --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh.
When I add the .egg through the --files option in SPARK_SUBMIT_OPTIONS , zeppelin notebook is not throwing error, but I am not able to import the module inside the zeppelin notebook.
What's the correct way to pass an .egg file zeppelin spark intrepreter?
Spark version is 1.6.2 and zeppelin version is 0.6.0
The zepplein-env.sh file contains the follwing:
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg"
You also need to adjust the PYTHONPATH on the executor nodes:
export SPARK_SUBMIT_OPTIONS="... --conf 'spark.executorEnv.PYTHONPATH=fly_libs-1.1-py2.7.egg:pyspark.zip:py4j-0.10.3-src.zip' ..."
It does not seem to be possible to append to an existing python path, therefore make sure you list all the required dependencies.