PySpark - How to build a Java jar to be used in spark-submit --packages

I need this jar to listen to metrics data for a streaming query. The jar works when passed from the file system using --jars, but gives an "unresolved dependency" error when passed via --packages and read from a remote repo (JFrog). All other dependencies, such as abris and kafka, resolve fine.
EDIT: a previously asked question with no conclusion: How to use custom jars in spark-submit --packages
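One thing worth checking, offered as a sketch rather than a confirmed diagnosis: --packages takes Maven coordinates (groupId:artifactId:version) and resolves them against Maven-layout repositories, including any passed with --repositories. A jar that sits under a path that does not match its coordinates, or in a repository that was not passed with --repositories, will not be found even though the same file works with --jars. The .pom next to the jar is what declares its transitive dependencies. The helper below computes the paths the standard Maven repository layout implies; the repository URL, the coordinates, and app.py are hypothetical.

```python
def maven_layout_paths(coordinate: str) -> dict:
    """Return the relative jar and pom paths that the standard Maven
    repository layout implies for 'groupId:artifactId:version'."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {coordinate!r}")
    group_id, artifact_id, version = parts
    # Dots in the groupId become directory separators.
    base = f"{group_id.replace('.', '/')}/{artifact_id}/{version}/{artifact_id}-{version}"
    return {"jar": base + ".jar", "pom": base + ".pom"}


if __name__ == "__main__":
    # Hypothetical repository URL and coordinates, for illustration only.
    repo = "https://example.jfrog.io/artifactory/libs-release"
    coordinate = "com.example:metrics-listener:1.0.0"
    for kind, path in maven_layout_paths(coordinate).items():
        print(kind, f"{repo}/{path}")
    # The matching submit command would pass both the repository and the coordinate.
    print(f"spark-submit --repositories {repo} --packages {coordinate} app.py")
```

If both printed URLs exist in the repository and the coordinate still fails to resolve, running spark-submit with --verbose shows which repositories were actually tried.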

Related

Why does the classpath used by spark-submit unexpectedly have jars from under the python installation?

I have a jar file that contains some Scala (and Java) code that I run using the following spark-submit command:
spark-submit \
--verbose \
--class mycompany.MyClass \
--conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf -verbose:class" \
--conf 'spark.driver.extraJavaOptions=-verbose:class' \
--master yarn \
--driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
--jars /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar \
--files /home/hadoop/mydir/dev-test.conf \
--queue default /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar \
<<args to MyClass>>
When I run it, I get an error message - “IAMInstanceCredentialsProvider not found”, which is caused by a version mismatch. It seems IAMInstanceCredentialsProvider was added to hadoop-aws in version 3.3.0 and we want to use 3.2.1. I've gone through our maven dependencies and feel confident that we are not trying to use 3.3.x anywhere.
I've attempted to debug the problem by adding some "verbose" arguments to the command, and I've also added some debug code to MyClass to print out the classpath in effect, following the instructions from here.
When I look at the output, the classpath in effect when we run the spark-submit command includes a lot of jars included with Python, including /usr/local/lib/python3.7/site-packages/pyspark/jars/hadoop-client-api-3.3.1.jar. Thus far, I've been unable to figure out why we are loading jars from /usr/local/lib/python3.7.
Can anybody explain to me where those dependencies are coming from, or suggest a way that I could debug where those dependencies come from? I thought the python might be a result of some environment variable setting, but if so, it doesn't seem to be set at the top level:
set|grep -i python
doesn't return anything.
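A quick way to see which installation each jar comes from is to group the classpath entries by directory. The sketch below does that for a classpath string such as the one MyClass prints; the sample entries are assembled from paths mentioned above. A pyspark/jars directory under site-packages is where a pip-installed PySpark keeps its bundled Spark jars, so if the spark-submit found on PATH (or SPARK_HOME) points at that installation rather than the cluster's Spark, those jars would end up on the classpath. That is an assumption to verify, for example with `which spark-submit` and `echo $SPARK_HOME`.

```python
import posixpath
from collections import defaultdict


def group_jars_by_dir(classpath: str, sep: str = ":") -> dict:
    """Map each directory on the classpath to the jar file names found under it."""
    grouped = defaultdict(list)
    for entry in classpath.split(sep):
        entry = entry.strip()
        if entry.endswith(".jar"):
            grouped[posixpath.dirname(entry)].append(posixpath.basename(entry))
    return dict(grouped)


if __name__ == "__main__":
    # Sample classpath built from paths quoted in the question.
    sample = ":".join([
        "/usr/local/lib/python3.7/site-packages/pyspark/jars/hadoop-client-api-3.3.1.jar",
        "/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar",
        "/usr/lib/hadoop/lib/commons-compress-1.18.jar",
    ])
    for directory, jars in sorted(group_jars_by_dir(sample).items()):
        print(directory)
        for jar in jars:
            print("   ", jar)
```

Any directory in the output that belongs to a Python installation is a hint that a different Spark distribution is being launched than the one expected.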

Unable to find akka Configurations with Spark Submit

I built a fat jar and I am trying to run it with spark-submit, either on EMR or locally. Here is the command:
spark-submit \
--deploy-mode client \
--class com.stash.data.omni.source.Runner myJar.jar \
<arguments>
I keep getting an error related to akka configurations:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
It seems like the jar cannot find the reference.confs for akka at all. Has anyone dealt with this? I am able to run it without spark-submit on my local machine.
I think the issue is bundling everything into a single jar with all its dependencies, which causes problems with Akka, as described in the documentation:
Akka’s configuration approach relies heavily on the notion of every
module/jar having its own reference.conf file. All of these will be
discovered by the configuration and loaded. Unfortunately this also
means that if you put/merge multiple jars into the same jar, you need
to merge all the reference.conf files as well: otherwise all defaults
will be lost.
You can follow this documentation to package your application so that the reference.conf resources are merged while bundling. It covers packaging with sbt, Maven, and Gradle.
Let me know if it helps!!
It was my merge strategy. I had a catch-all case _ => MergeStrategy.first. I changed it to case x => MergeStrategy.defaultMergeStrategy(x) and it worked.
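To see why a catch-all MergeStrategy.first breaks this, here is a toy simulation in Python (not sbt code). When several jars each contribute a reference.conf, keeping only the first copy drops every other module's defaults, while concatenating keeps them all, which is what the Akka documentation quoted above asks for. The config contents are invented for illustration.

```python
def merge_first(confs):
    """Keep only the first file with a given name, as a catch-all 'first' rule would."""
    return confs[0] if confs else ""


def merge_concat(confs):
    """Concatenate every file with that name so no module's defaults are lost."""
    return "\n".join(confs)


# Invented reference.conf contents from two different jars on the classpath.
reference_confs = [
    'some-lib { timeout = 5s }',
    'akka { version = "x.y.z" }',
]

if __name__ == "__main__":
    for name, strategy in [("first", merge_first), ("concat", merge_concat)]:
        merged = strategy(reference_confs)
        print(name, "keeps akka defaults:", "akka" in merged)
```

Which reference.conf ends up first depends on classpath order, so with the catch-all rule the Akka defaults can disappear silently and surface only at runtime as the missing 'akka.version' key.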

Unable to import cosmosDB packages in spark-shell

I am trying to upload some data from dataframe to azure cosmosDB.
I have downloaded the below jar files and added to my local folder along with eventHub_Jars.
azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
azure-cosmosdb-2.0.0.jar
azure-documentdb-1.16.4.jar
documentdb-bulkexecutor-2.4.1.jar
Below is the command I used to open the shell, which works:
spark-shell --master local --jars eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
When I open the shell with the eventHub jars or other jars as well, like this:
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar, azure-eventhubs-spark_2.11-2.3.2.jar, azure-eventhubs-1.0.2.jar, proton-j-0.25.0.jar, scala-java8-compat_2.11-0.9.0.jar, slf4j-api-1.7.25.jar, azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
Shell script is opening
But when I try to import
import com.microsoft.azure.cosmosdb.spark.config.Config
it is throwing the below error
error: object cosmosdb is not a member of package com.microsoft.azure
import com.microsoft.azure.cosmosdb.spark.config.Config
What could be the reason for the above error?
Is there any syntax issue? It seems like the only first jar added is working. If we try to import any package from any other jars, it will throw the above error!
When I tried this I had an issue with the --jars option using the relative path to retrieve the jar files unless I added "file:///" to the start of the path where I had stored the jar files.
For example if a jar file was located in /usr/local/spark/jars_added/ (a folder I created) the required path for the --jars option is file:///usr/local/spark/jars_added/*.jar where "*" represents your jar name.
The following won't be the same on your machine, however, you get the idea for specifying the jar files.
spark-shell \
--master local \
--packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 \
--jars file:///usr/local/spark/jars_added/eventHub_Jars/scala-library-2.11.12.jar,\
file:///usr/local/spark/jars_added/azure-eventhubs-spark_2.11-2.3.2.jar,\
file:///usr/local/spark/jars_added/azure-eventhubs-1.0.2.jar,\
file:///usr/local/spark/jars_added/proton-j-0.25.0.jar,\
file:///usr/local/spark/jars_added/scala-java8-compat_2.11-0.9.0.jar,\
file:///usr/local/spark/jars_added/slf4j-api-1.7.25.jar,\
file:///usr/local/spark/jars_added/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
Alternatively, you can copy the jar files to the default location from which jars are retrieved for each Spark session. Note that if you have a jars folder in $SPARK_HOME, it will override the default location. In case readers are unsure, $SPARK_HOME is most likely /usr/local/spark. On my machine, for example, jars are retrieved from /usr/local/spark/assembly/target/scala-2.11/jars by default.
It works when I specify the full path for each jar after --jars:
spark-shell --master local --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13 --jars eventHub_Jars/scala-library-2.11.12.jar,eventHub_Jars/azure-eventhubs-spark_2.11-2.3.2.jar,eventHub_Jars/azure-eventhubs-1.0.2.jar,eventHub_Jars/proton-j-0.25.0.jar,eventHub_Jars/scala-java8-compat_2.11-0.9.0.jar,eventHub_Jars/slf4j-api-1.7.25.jar,eventHub_Jars/azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar
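The difference from the failing command is that --jars takes a single comma-separated argument. With a space after each comma the shell splits the list into separate arguments, and only the first entry carried the eventHub_Jars/ prefix. A small sketch that builds the value programmatically, assuming all jars live in one directory (the function name is illustrative):

```python
import posixpath


def jars_argument(directory, jar_names):
    """Join jar paths into one comma-separated value with no whitespace,
    prefixing every name with the same directory."""
    cleaned = [name.strip() for name in jar_names if name.strip()]
    return ",".join(posixpath.join(directory, name) for name in cleaned)


if __name__ == "__main__":
    jars = [
        "scala-library-2.11.12.jar",
        "azure-eventhubs-spark_2.11-2.3.2.jar",
        "azure-eventhubs-1.0.2.jar",
        "proton-j-0.25.0.jar",
        "scala-java8-compat_2.11-0.9.0.jar",
        "slf4j-api-1.7.25.jar",
        "azure-cosmosdb-spark_2.3.0_2.11-1.3.3.jar",
    ]
    print("spark-shell --master local --jars " + jars_argument("eventHub_Jars", jars))
```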

IllegalAccessError when running spark job in EMR

I am attempting to run a Spark job that accesses DynamoDB. The old way of instantiating a DynamoDB client has been deprecated, and it is now recommended to use the client builder.
This works fine locally, but when I deploy to EMR I'm getting this error:
Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientConfigurationFactory from class com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
My code that causes this is:
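import com.amazonaws.regions.Regions
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder

// Build an async DynamoDB client for us-east-1 using the builder API
val dynamoDbClient = AmazonDynamoDBAsyncClientBuilder
  .standard()
  .withRegion(Regions.US_EAST_1)
  .build()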
My build.sbt contains:
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.114"
and my spark-submit command looks like this:
spark-submit --conf spark.eventLog.enabled=false --packages com.typesafe.play:play-json_2.11:2.5.9,com.github.traviscrawford:spark-dynamodb:0.0.6,com.amazonaws:aws-java-sdk:1.11.114 --master yarn --deploy-mode cluster --class Main application.jar
Does anyone have any ideas? Am I overlooking something basic?
Update
I noticed that EMR was running OpenJDK 1.8 and my local system was running Oracle Java 1.8. I changed the EMR cluster to match the Java version I was running locally, but nothing changed.
I don't have a perfect answer, but I'm struggling with a similar problem with a fat-jar Spark driver running on EMR, so here is what I have found so far.
Try running spark-submit with the -v option and look through the logs for the classpaths in use. As far as I can see, EMR loads its own aws-java-sdk as well. It's not clear to me which version of aws-java-sdk EMR is running; the EMR release 4.7.0 notes state "Upgraded the AWS SDK for Java to 1.10.75" (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html).
Then add another argument, --conf spark.driver.userClassPathFirst=true,
to load the aws-java-sdk version your driver specifies.
Unfortunately the last step raises yarn errors like: Unable to load YARN support ... (some discussion on that: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-submit-fails-after-setting-userClassPathFirst-to-true/td-p/46778)
Some discussion from the aws-java-sdk github repos: https://github.com/aws/aws-sdk-java/issues/1094
Conclusion: for now, use the APIs of aws-java-sdk version 1.10.75.

Why does running Spark job fail to find classes inside uberjar on EMR while it works locally fine?

I have a Spark job that relies on some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble the job into a jar file (I create an uber jar using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the jar file, so it should be available to the job. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of these worked for me.
Also, when running on EMR I can read the log that says that the JAR was added (not sure if it is loaded on the classpath, but it should because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and I see few tickets on the Spark JIRA board but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
I think when launching your job on EMR you need to provide the s3 location for your jar dependencies a la the manual e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when running spark.
It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class will throw a ClassNotFoundException even if the class is in the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. In the end, the issue was not related to Spark.