Cloud Storage Client with Scala and Dataproc: missing libraries

I am trying to run a simple Spark script on a Dataproc cluster that needs to read/write to a GCS bucket using Scala and the Java Cloud Storage client library. The script is the following:
// build.sbt
name := "trialGCS"
version := "0.0.1"
scalaVersion := "2.12.10"
val sparkVersion = "3.0.1"
libraryDependencies ++= Seq(
  // Spark core libraries
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.google.cloud" % "google-cloud-storage" % "1.113.15"
)
resolvers += Resolver.mavenLocal
package DEV

import com.google.cloud.storage.StorageOptions
import org.apache.spark.sql.SparkSession

object TrialGCS extends App {
  val spark = SparkSession.builder().appName("trialGCS").getOrCreate()
  import spark.implicits._
  // Cloud Storage client built from the cluster's default credentials
  val storage = StorageOptions.getDefaultInstance.getService
}
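For context, once the client is created, reading and writing an object would look roughly like this. This is only a sketch: the bucket and object names below are placeholders, not taken from the original post.
import com.google.cloud.storage.{BlobId, BlobInfo}
import java.nio.charset.StandardCharsets

// hypothetical bucket and object names, for illustration only
val blobId = BlobId.of("my-bucket", "path/to/file.txt")
storage.create(BlobInfo.newBuilder(blobId).build(), "hello".getBytes(StandardCharsets.UTF_8))
val content = new String(storage.readAllBytes(blobId), StandardCharsets.UTF_8)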
I launch the script via terminal with the shell command:
gcloud dataproc jobs submit spark --class DEV.TrialGCS --jars target/scala-2.12/trialgcs_2.12-0.0.1.jar --cluster <CLUSTERNAME> --region=<REGIONNAME>
However, this produces the error java.lang.NoClassDefFoundError: com/google/cloud/storage/Storage.
If I include the cloud storage jar manually, changing --jars in the previous command to
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar,google-cloud-storage-1.113.15.jar
the error becomes java.lang.NoClassDefFoundError: com/google/cloud/Service.
So, apparently, it's a matter of missing libraries.
On the other hand, if I run spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15" via SSH on the Dataproc driver VM, everything works perfectly.
How can I solve this issue?

If the dependent jars are already present on the driver machine, you can add them to the classpath explicitly. You can try the following command:
gcloud dataproc jobs submit spark \
--class DEV.TrialGCS \
--properties spark.driver.extraClassPath=<colon-separated full paths of jars>,spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15 \
--cluster <CLUSTERNAME> --region=<REGIONNAME>

I've found the solution: to handle the package dependency properly, the google-cloud-storage library needs to be included via --properties=spark.jars.packages=<MAVEN_COORDINATES>, as shown in https://cloud.google.com/dataproc/docs/guides/manage-spark-dependencies . In my case this means:
gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=spark.jars.packages="com.google.cloud:google-cloud-storage:1.113.15"
When multiple Maven coordinates or multiple properties are needed, the string has to be escaped, as described in https://cloud.google.com/sdk/gcloud/reference/topic/escaping
For instance, for google-cloud-storage and Kafka:
--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,io#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar

Related

How to replace a jar's internal dependencies when submitting with spark-submit

I want to replace the library containing the business logic used in the application when submitting with spark-submit.
For example, I would like to replace the internal library test v1.1.1 that is being used in project "A" with test v2.2.2 upon spark-submit.
spark-submit
$SPARK_HOME/bin/spark-submit \
--jars "$TEST_JARS_V2.2.2" \
--name $APP_NAME \
--master yarn \
--deploy-mode cluster \
--verbose \
--class AAA.AAA.batch.EntryPoint \
/.../.../A.jar(exclude test dependency)
sbt
"?.?.?" %% "test" % "1.1.1" % Provided
With this first approach, I registered the internal library test v1.1.1 as "Provided" in project A, excluded it from the fat jar, and injected test library v2.2.2 through spark-submit --jars.
But it seems that the test library is not being injected: the classes in the test library were not found.
Please advise if I did something wrong.
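For reference, a minimal sbt sketch of the "Provided" setup described above, assuming sbt-assembly builds the fat jar (which leaves Provided dependencies out by default); the group id is a placeholder, since the real coordinates are not shown in the question:
// build.sbt (sketch): the internal library is on the compile classpath only, not packaged
libraryDependencies += "com.example" %% "test" % "1.1.1" % Provided

// at submit time the newer version is supplied explicitly, e.g.:
//   spark-submit --jars /path/to/test_2.12-2.2.2.jar --class AAA.AAA.batch.EntryPoint ... A.jar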

How to read files uploaded by spark-submit on Kubernetes

I have Spark jobs running on YARN. These days I'm moving to Spark on Kubernetes.
On Kubernetes I'm having an issue: files uploaded via --files can't be read by the Spark driver.
On YARN, as described in many answers, I can read those files using Source.fromFile(filename).
But I can't read files in Spark on Kubernetes.
Spark version: 3.0.1
Scala version: 2.12.6
deploy-mode: cluster
submit commands
$ spark-submit --class <className> \
--name=<jobName> \
--master=k8s://https://api-hostname:6443 \
...
--deploy-mode=cluster \
--files app.conf \
--conf spark.kubernetes.file.upload.path=hdfs://<nameservice>/path/to/sparkUploads/ \
app.jar
After executing the above command, app.conf is uploaded to hdfs://<nameservice>/path/to/sparkUploads/spark-upload-xxxxxxx/,
and in the driver's pod I found app.conf in the /tmp/spark-******/ directory, along with app.jar.
But the driver can't read app.conf: Source.fromFile(filename) returns null, and there were no permission problems.
Update 1
In the Spark Web UI -> "Environment" tab, spark://<pod-name>-svc.ni.svc:7078/files/app.conf appears under "Classpath Entries". Does this mean app.conf is available on the classpath?
On the other hand, in Spark on YARN the user.dir property was included in the system classpath.
I found SPARK-31726: Make spark.files available in driver with cluster deploy mode on kubernetes
Update 2
I found that the driver pod's /opt/spark/work-dir/ directory was included in the classpath,
but /opt/spark/work-dir/ is empty on the driver pod, whereas on the executor pod it contains app.conf and app.jar.
I think that is the problem, and SPARK-31726 describes this.
Update 3
After reading Jacek's answer, I tested org.apache.spark.SparkFiles.getRootDirectory().
It returns /var/data/spark-357eb33e-1c17-4ad4-b1e8-6f878b1d8253/spark-e07d7e84-0fa7-410e-b0da-7219c412afa3/userFiles-59084588-f7f6-4ba2-a3a3-9997a780af24
Update 4 - workaround
First, I make ConfigMaps to hold the files that I want the driver/executors to read.
Next, the ConfigMaps are mounted on the driver/executors. To mount a ConfigMap, use a Pod Template or the Spark Operator.
Files passed via --files should be accessed using the SparkFiles.get utility:
get(filename: String): String
Get the absolute path of a file added through SparkContext.addFile().
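A minimal sketch of how this could look in the driver; app.conf is the file name from the question, the variable names are illustrative:
import scala.io.Source
import org.apache.spark.SparkFiles

// resolve the local path of the file shipped with --files app.conf
val confPath = SparkFiles.get("app.conf")

// then read it as usual
val confText = Source.fromFile(confPath).mkString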
I found another temporary solution in Spark 3.3.0:
we can use the --archives flag. Files that are not tar, tar.gz, or zip skip the unpacking step and are then placed in the working directory of the driver and executors.
Although the docs for --archives don't mention the executors, I tested it and it works.
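Based on the behaviour described above, reading such a file then reduces to opening it relative to the working directory; a sketch, assuming the job was submitted with --archives app.conf:
import scala.io.Source

// app.conf is expected in the driver/executor working directory, per the --archives behaviour above
val confText = Source.fromFile("app.conf").mkString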

Scala Spark: how to read a file from an HDFS cluster

I am learning to develop Spark applications using Scala, and I am in my very first steps.
I have my Scala IDE on Windows, configured and running smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and a Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me, please?
You can use SBT to package your code into a .jar file, scp the jar onto your node, and then submit it with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine in that way.
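For illustration, the packaged application could be as small as the sketch below; the HDFS path is the one from the question, while the object and app names are placeholders:
import org.apache.spark.sql.SparkSession

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    // master and deploy-mode are supplied by spark-submit on the cluster
    val spark = SparkSession.builder().appName("ReadFromHdfs").getOrCreate()

    val rdd = spark.sparkContext
      .textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")

    println(s"line count: ${rdd.count()}")
    spark.stop()
  }
}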

How to use the spark-submit configuration options --jars and --packages in cluster mode?

When using spark-submit in cluster mode (yarn-cluster), the jars and packages configuration confuses me: for jars, I can put them in HDFS instead of a local directory. But for packages, since they are resolved with Maven, it doesn't work with HDFS. My command looks like this:
spark-submit --jars hdfs:///mysql-connector-java-5.1.39-bin.jar --driver-class-path /home/liac/test/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar --conf "spark.mongodb.input.uri=mongodb://192.168.27.234/test.myCollection2?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://192.168.27.234/test.myCollection2" --packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0 --py-files /home/liac/code/diagnose_disease/tool.zip main_disease_tag_spark.py --master yarn-client
This error occurs:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
Can anyone tell me how to use jars and packages in cluster mode, and what's wrong with my approach?
Your use of the --packages argument is wrong:
--packages com.mongodb.spark:hdfs:///user/liac/package/jars/mongo-spark-connector_2.11-1.0.0-assembly.jar:1.0.0
It needs to be in the form of groupId:artifactId:version as the output suggests. You cannot use a URL with it.
An example of using MongoDB with Spark, with the built-in repository support:
$SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0
If you insist on using your own jar, you can provide it via --repositories. The value of that argument is:
a comma-separated list of remote repositories to search for the Maven coordinates specified in --packages.
For example, in your case, it could be
--repositories hdfs:///user/liac/package/jars/ --packages org.mongodb.spark:mongo-spark-connector_2.11:1.0.0

No suitable driver found for jdbc:postgresql when trying to connect spark-shell (inside a docker container) to my PostgreSQL database

I'm running Spark on a docker container (sequenceiq/spark).
I launched it like this:
docker run --link dbHost:dbHost -v my/path/to/postgres/jar:postgres/ -it -h sandbox sequenceiq/spark:1.6.0 bash
I'm sure that the PostgreSQL database is accessible through the address postgresql://user:password@localhost:5432/ticketapp.
I start the spark-shell with spark-shell --jars postgres/postgresql-9.4-1205.jdbc42.jar, and since I can connect from my Play! application that has "org.postgresql" % "postgresql" % "9.4-1205-jdbc42" as a dependency, it seems that I have the correct jar. (I also don't get any warning saying that the local jar does not exist.)
But when I try to connect to my database with:
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql://dbHost:5432/ticketapp?user=user&password=password",
    "dbtable" -> "events"
  )
).load()
(I also tried the URL jdbc:postgresql://user:root@dbHost:5432/ticketapp)
as explained in the Spark documentation, I get this error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://dbHost:5432/ticketapp?user=simon&password=root
What am I doing wrong?
As far as I know, you need to include the JDBC driver for your particular database on the Spark classpath. According to the documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) it should be done like this:
SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar bin/spark-shell
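Once the driver jar is on the classpath, the read from the question should work. If the driver is still not picked up, it can also help to name the driver class explicitly through the JDBC data source's driver option; a sketch, reusing the URL from the question:
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql://dbHost:5432/ticketapp?user=user&password=password",
    "dbtable" -> "events",
    // name the JDBC driver class explicitly instead of relying on auto-detection
    "driver" -> "org.postgresql.Driver"
  )
).load()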