How to replace a jar's internal dependencies when submitting with spark-submit - scala

I want to replace the library containing the business logic used in the application at spark-submit time.
For example, I would like to replace the internal library test v1.1.1 that project "A" uses with test v2.2.2 when running spark-submit.
spark-submit
$SPARK_HOME/bin/spark-submit \
--jars "$TEST_JARS_V2.2.2" \
--name $APP_NAME \
--master yarn \
--deploy-mode cluster \
--verbose \
--class AAA.AAA.batch.EntryPoint \
/.../.../A.jar   # test dependency excluded from this fat jar
sbt
"?.?.?" %% "test" % "1.1.1" % Provided
With this first approach, I registered the internal library test v1.1.1 as "Provided" in project A, excluded it from the fat jar, and injected test library v2.2.2 through spark-submit --jars.
However, the test library does not seem to be injected: at runtime the classes from the test library are not found.
Please advise if I did something wrong.
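To make the setup concrete, this is roughly what the two pieces look like; com.example stands in for the real group id, and the jar paths are placeholders:
// build.sbt in project A - Provided keeps the internal library out of the fat jar
libraryDependencies += "com.example" %% "test" % "1.1.1" % Provided

# --jars takes a comma-separated list of jar paths; here it points at the v2.2.2 build of the library
$SPARK_HOME/bin/spark-submit \
--jars /path/to/test_2.12-2.2.2.jar \
--class AAA.AAA.batch.EntryPoint \
--master yarn \
--deploy-mode cluster \
/path/to/A.jar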

Related

Cloud Storage Client with Scala and Dataproc: missing libraries

I am trying to run a simple Spark script on a Dataproc cluster that needs to read/write to a GCS bucket using Scala and the Java Cloud Storage client libraries. The script is the following:
// build.sbt
name := "trialGCS"
version := "0.0.1"
scalaVersion := "2.12.10"

val sparkVersion = "3.0.1"

libraryDependencies ++= Seq(
  // Spark core libraries
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.google.cloud" % "google-cloud-storage" % "1.113.15"
)

resolvers += Resolver.mavenLocal
// TrialGCS.scala
package DEV

import com.google.cloud.storage.StorageOptions
import org.apache.spark.sql.SparkSession

object TrialGCS extends App {
  val spark = SparkSession.builder().getOrCreate() // needed for spark.implicits._
  import spark.implicits._
  val storage = StorageOptions.getDefaultInstance.getService
}
I launch the script via terminal with the shell command:
gcloud dataproc jobs submit spark --class DEV.TrialGCS --jars target/scala-2.12/trialgcs_2.12-0.0.1.jar --cluster <CLUSTERNAME> --region=<REGIONNAME>
However, this produces the error java.lang.NoClassDefFoundError: com/google/cloud/storage/Storage.
If I include the cloud storage jar manually, changing --jars in the previous command to
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar,google-cloud-storage-1.113.15.jar
the error is now java.lang.NoClassDefFoundError: com/google/cloud/Service.
So, apparently it's a matter of missing libraries.
On the other hand, if I use spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15" via SSH on the Dataproc driver's VM, everything works perfectly.
How to solve this issue?
If you are sure the dependent jars are present on the driver machine, you can add them to the classpath explicitly. You can try the following command:
gcloud dataproc jobs submit spark \
--class DEV.TrialGCS \
--properties spark.driver.extraClassPath=<colon-separated full paths of jars>,spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15 \
--cluster <CLUSTERNAME> --region=<REGIONNAME>
I've found the solution: to properly manage the package dependency, the google-cloud-storage library needs to be included via --properties=spark.jars.packages=<MAVEN_COORDINATES>, as shown in https://cloud.google.com/dataproc/docs/guides/manage-spark-dependencies. In my case this means
gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=spark.jars.packages="com.google.cloud:google-cloud-storage:1.113.15"
When multiple Maven coordinates for packages or multiple properties are needed, the string has to be escaped: https://cloud.google.com/sdk/gcloud/reference/topic/escaping
For instance, for google-cloud-storage and kafka:
--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,io#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark but sometimes I need easier access to Scala/Java libraries such as directly creating a org.apache.hadoop.fs.FileSystem object inside of map functions. Is there any easy way to submit such "spark-shell" equivalent jobs directly from a command-line using Dataproc Jobs APIs?
At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood, spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can just write snippets of Scala that you may have tested in a spark-shell or notebook session, and run that as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
Just like any other file, you can also specify any Hadoop-compatible path in the --files argument, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note that in all cases, the file becomes a local file in the working directory of the main driver job, so the argument to "-i" can just be the relative path to the filename.
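For reference, job.scala contains the same kind of snippet you would type into a spark-shell session, since the REPL driver already provides spark and sc. A minimal sketch (the final sys.exit(0) is an assumption, to make the non-interactive REPL terminate rather than wait for further input):
// job.scala - executed via "-i job.scala" under org.apache.spark.repl.Main
// spark (SparkSession) and sc (SparkContext) are already defined by the REPL
val df = spark.range(1000).toDF("id")
println(s"row count: ${df.count()}")
sys.exit(0) // assumption: exit explicitly so the job finishes instead of waiting for input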

scala spark to read file from hdfs cluster

I am learning to develop Spark applications using Scala, and I am at my very first steps.
I have my Scala IDE on Windows; it is configured and runs smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and a Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me please ?
You can use SBT to package your code into a .jar file (a minimal build.sbt is sketched below), scp the jar to one of your cluster nodes, and then submit it with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine in that way.
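For completeness, a minimal build.sbt for packaging such a job might look like this; the names and versions are placeholders, and Spark itself is marked Provided because the cluster's spark-submit supplies it at runtime:
// build.sbt - minimal sketch for packaging the job (names and versions are illustrative)
name := "discipline-reader"
version := "0.1.0"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  // Provided: spark-submit on the cluster puts Spark on the classpath
  "org.apache.spark" %% "spark-core" % "3.0.1" % Provided,
  "org.apache.spark" %% "spark-sql" % "3.0.1" % Provided
)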

Why is the application name defined in code not shown under RUNNING Applications in the YARN UI?

This is the relevant part of my Spark application where I set the application's name using appName.
import org.apache.spark.sql.SparkSession
object sample extends App {
  val spark = SparkSession.
    builder().
    appName("Cortex-Batch"). // <-- application name
    enableHiveSupport().
    getOrCreate()
  // ...
}
I check the name of the Spark application in the Hadoop YARN cluster under RUNNING Applications and don't see the name I defined in the code. Why?
I use spark-submit with a property file using --properties-file as follows:
/usr/hdp/current/spark2-client/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.jpmc.cortex.LoadCortexDataLake \
--verbose \
--properties-file /home/e707698/cortex-batch.properties \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar \
/home/e707698/cortex-data-lake-batch.jar "/tmp/clickfiles1" "cortex_dev.xpo_click1"
Instead, the app name given in the properties file is used. I tried removing that property from the properties file, but then the name becomes the full class name of the Spark application, i.e. com.jpmc.cortex.LoadCortexDataLake.
What could I be missing?
--name works. I can now see the value I pass to --name with spark-submit under YARN's RUNNING applications.
When we run Spark in cluster mode, the YARN application is created before the SparkContext is created, hence we need to pass the app name with --name in the spark-submit command.
In client mode we can set the app name in the program, e.g. .appName("Default App Name").
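For example, adding --name to the submit command from the question (the name value here is illustrative; --jars and --verbose from the question are omitted for brevity):
/usr/hdp/current/spark2-client/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name "Cortex-Batch" \
--class com.jpmc.cortex.LoadCortexDataLake \
--properties-file /home/e707698/cortex-batch.properties \
/home/e707698/cortex-data-lake-batch.jar "/tmp/clickfiles1" "cortex_dev.xpo_click1"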

How to use the spark-submit configuration options --jars and --packages in cluster mode?

When using spark-submit in cluster mode (yarn-cluster), the jars and packages configuration confuses me: for jars, I can put them in HDFS instead of a local directory, but for packages, because they are built with Maven, HDFS doesn't work. My approach is below:
spark-submit --jars hdfs:///mysql-connector-java-5.1.39-bin.jar --driver-class-path /home/liac/test/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar --conf "spark.mongodb.input.uri=mongodb://192.168.27.234/test.myCollection2?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://192.168.27.234/test.myCollection2" --packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0 --py-files /home/liac/code/diagnose_disease/tool.zip main_disease_tag_spark.py --master yarn-client
This error occurs:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
Can anyone tell me how to use jars and packages in cluster mode, and what's wrong with my approach?
Your use of the --packages argument is wrong:
--packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
It needs to be in the form of groupId:artifactId:version as the output suggests. You cannot use a URL with it.
An example of using MongoDB with Spark via the built-in repository support:
$SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0
If you insist on using your own jar, you can provide it via --repositories. The value of the argument is:
Comma-separated list of remote repositories to search for the Maven coordinates specified in packages.
For example, in your case, it could be
--repositories hdfs:///user/liac/package/jars/ --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0
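Applied to the command from the question, the corrected flags would look roughly like this in yarn cluster mode (other options from the question are omitted for brevity):
spark-submit \
--master yarn \
--deploy-mode cluster \
--jars hdfs:///mysql-connector-java-5.1.39-bin.jar \
--packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0 \
--py-files /home/liac/code/diagnose_disease/tool.zip \
main_disease_tag_spark.py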