use an external library in pyspark job in a Spark cluster from google-dataproc - import

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:
I started an SSH session with the master node of my cluster, then I input:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then it launched a pyspark shell in which I input:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()
And it worked.
My next step is to launch this job from my main machine using the command:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py
But here it does not work and I get an error. I think it is because I did not give --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried 10 different ways to pass it and did not manage.
My questions are:
Was the Databricks CSV library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0?
Can I write a line in my job.py in order to import it?
Or what params should I give to my gcloud command to import it or install it?

Short Answer
There are quirks in the ordering of arguments where --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
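The job script itself then needs no package-specific setup; a minimal my_job.py along the lines of your shell test might look like this sketch (the gs://xxxx/foo.csv path is a placeholder):
# my_job.py - minimal sketch; the GCS path is a placeholder
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# The com.databricks.spark.csv format resolves because the package was
# supplied via spark.jars.packages at submit time.
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('gs://xxxx/foo.csv')
df.show()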
Long Answer
So, this is actually a different issue than the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark. Because Dataproc does not explicitly recognize --packages as a special spark-submit-level flag, it tries to pass it after the application arguments, so spark-submit lets --packages fall through as an application argument rather than properly parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:
# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
But switching the order of the arguments does work, even though in the pyspark case both orderings work:
# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up with on the Spark side.
Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Note that the --properties must come before the my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
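As for whether you can pull the package in from a line inside my_job.py itself: one pattern that is sometimes used is setting PYSPARK_SUBMIT_ARGS before the SparkContext is created. Note that this is only a sketch and only takes effect when the script is launched with a plain python interpreter; spark-submit and gcloud set that variable themselves, so it is ignored there.
import os

# Must be set before the SparkContext (and its JVM) is launched; ignored
# when the script is run through spark-submit or gcloud.
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell'

from pyspark import SparkContext
sc = SparkContext()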

In addition to @Dennis's answer:
Note that if you need to load multiple external packages, you need to specify a custom escape character like so:
--properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1
Note the ^#^ right before the package list.
See gcloud topic escaping for more details.

Related

Using Spark-Submit to write to S3 in "local" mode using S3A Directory Committer

I'm currently running PySpark via local mode. I want to be able to efficiently output parquet files to S3 via the S3 Directory Committer. This PySpark instance is using the local disk, not HDFS, as it is being submitted via spark-submit --master local[*].
I can successfully write to my S3 Instance without enabling the directory committer. However, this involves writing staging files to S3 and renaming them, which is slow and unreliable. I would like for Spark to write to my local filesystem as a temporary store, and then copy to S3.
I have the following configuration in my PySpark conf:
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
self.spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
My spark-submit command looks like this:
spark-submit --master local[*] --py-files files.zip --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark.internal.io.cloud.PathOutputCommitProtocol --driver-memory 4G --name clean-raw-recording_data main.py
spark-submit gives me the following error, due to the requisite JAR not being in place:
java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
My questions are:
Which JAR (specifically, the maven coordinates) do I need to include in spark-submit --packages in order to be able to reference PathOutputCommitProtocol?
Once I have (1) working, will I be able to use PySpark's local mode to stage temporary files on the local filesystem? Or is HDFS a strict requirement?
I need this to be running in local mode, not cluster mode.
EDIT:
I got this to work with the following configuration:
Using pyspark version 3.1.2 and the package
org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253.
I needed to add the cloudera repository using the --repositories option for spark-submit:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
You need the spark-hadoop-cloud module for the release of Spark you are using.
The committer is happy using the local FS (that's how the public integration test suites work: https://github.com/hortonworks-spark/cloud-integration). All that's needed is a "real" filesystem shared across all workers and the Spark driver, so the driver gets the manifests of each pending commit.
Print the _SUCCESS file after a job to see what the committer did: a 0-byte file means the old committer, JSON with diagnostics means the new one.
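A quick way to do that check from PySpark, as a sketch (the output path is a placeholder and spark is assumed to be an existing SparkSession):
# An empty _SUCCESS marker means the classic FileOutputCommitter ran;
# JSON content means the new committer wrote its summary there.
lines = spark.sparkContext.textFile("s3a://my-bucket/output/_SUCCESS").collect()
if not lines:
    print("0-byte _SUCCESS: old committer")
else:
    print("JSON _SUCCESS with diagnostics: new committer")
    print("\n".join(lines))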

H2o Package not found Scala Sparkling Water

I am trying to run Sparkling Water on my Local instance of Spark 2.1.0.
I followed the documentation on H2O for Sparkling Water. But when I try to execute
sparkling-shell.cmd
I am getting the following error:
The filename, directory name, or volume label syntax is incorrect.
I looked into the batch file and found that I get this error when the following command is executed:
C:\Users\Mansoor\libs\spark\spark-2.1.0/bin/spark-shell.cmd --jars C:\Users\Mansoor\libs\H2o\sparkling\bin\../assembly/build/libs/sparkling-water-assembly_2.11-2.1.0-all.jar --driver-memory 3G --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
When I remove --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m", Spark starts but I am unable to import the packages of H2o.
import org.apache.spark.h2o._
error: object h2o is not a member of package org.apache.spark
I tried everything I could but was unable to solve this issue. Could someone help me with this? Thanks.
Please try to correct your path:
C:\Users\Mansoor\libs\spark\spark-2.1.0/bin/spark-shell.cmd --jars C:\Users\Mansoor\libs\H2o\sparkling\bin\..\assembly\build\libs\sparkling-water-assembly_2.11-2.1.0-all.jar --driver-memory 3G --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
There is also doc page about RSparkling at Windows, which can contain different troubleshooting tips...
https://github.com/h2oai/sales-engineering/tree/master/megan/RSparklingAndWindows
The problem is with the spark-shell command while submitting jars. The workaround is to modify spark-defaults.conf.
Adding spark.driver.extraClassPath and spark.executor.extraClassPath parameters to spark-defaults.conf file as follows:
spark.driver.extraClassPath \path\to\jar\sparkling-water-assembly_<version>-all.jar
spark.executor.extraClassPath \path\to\jar\sparkling-water-assembly_<version>-all.jar
And remove --jars \path\to\jar\sparkling-water-assembly_<version>-all.jar from sparkling-shell2.cmd

Dataproc: Jupyter pyspark notebook unable to import graphframes package

In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook.
Pyspark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
Following is the command used to initialize the cluster:
gcloud dataproc clusters create my-dataproc-cluster --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh --num-workers 2 --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m --worker-machine-type=n1-standard-4 --master-machine-type=n1-standard-4
This is an old bug with Spark Shells and YARN, that I thought was fixed in SPARK-15782, but apparently this case was missed.
The suggested workaround is adding
import os
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
before your import.
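In a notebook cell that would look roughly like this (the jar filename under ~/.ivy2/jars can vary with the package version):
import os

# Make the Python bindings bundled in the ivy-cached jar importable,
# then import graphframes as usual.
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
from graphframes import GraphFrame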
I found another way to add packages which works in a Jupyter notebook:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("Python Spark SQL") \
.config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
.getOrCreate()
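Once the session is up, you can check that the package resolved by building a tiny graph; here is a sketch with made-up vertex and edge data:
from graphframes import GraphFrame

# Minimal vertex and edge DataFrames just to confirm the import works.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()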
If you can use EMR notebooks, then you can install additional Python libraries/dependencies using the install_pypi_package() API within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
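For example, inside an EMR notebook cell (the package and version here are arbitrary):
# Installs the library for the current Spark session, including on executors.
sc.install_pypi_package("networkx==2.5")
sc.list_packages()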
The simplest way to start Jupyter with pyspark and graphframes is to launch Jupyter from pyspark with the additional package attached.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
An added advantage of this is that if you later want to run your code via spark-submit, you can use the same start command.

Spark cannot find the postgres jdbc driver

EDIT: See the edit at the end
First of all, I am using Spark 1.5.2 on Amazon EMR and using Amazon RDS for my postgres database. Second is that I am a complete newbie in this world of Spark and Hadoop and MapReduce.
Essentially my problem is the same as for this guy:
java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL
So the dataframe is loaded, but when I try to evaluate it (by doing df.show(), where df is the dataframe) I get the error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://mypostgres.cvglvlp29krt.eu-west-1.rds.amazonaws.com:5432/mydb
I should note that I start spark like this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar
The solutions suggest delivering the jar onto the worker nodes and setting the classpath on them somehow, which I don't really understand how to do. But then they say that apparently the issue was fixed in Spark 1.4, yet I'm using 1.5.2 and still having this issue, so what is going on?
EDIT: Looks like I resolved the issue; however, I still don't quite understand why this works and the thing above doesn't, so I guess my question is now: why does doing this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar --conf spark.driver.extraClassPath=/home/hadoop/postgresql-9.4.1207.jre7.jar --jars /home/hadoop/postgresql-9.4.1207.jre7.jar
solve the problem? It seems I just passed the same path to a few more flags.
spark-shell --driver-class-path .... --jars ... works because all jar files listed in --jars are automatically distributed over the cluster.
Alternatively you could use
spark-shell --packages org.postgresql:postgresql:9.4.1207.jre7
and specify the driver class as an option for DataFrameReader / DataFrameWriter:
val df = sqlContext.read.format("jdbc").options(Map(
"url" -> url, "dbtable" -> table, "driver" -> "org.postgresql.Driver"
)).load()
or even manually copy the required jars to the workers and place them somewhere on the CLASSPATH.
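For completeness, a PySpark equivalent of that reader call might look like this sketch (url and table are placeholders):
df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://host:5432/mydb",
    dbtable="mytable",
    driver="org.postgresql.Driver"
).load()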

Exception after Setting property 'spark.sql.hive.metastore.jars' in 'spark-defaults.conf'

Given below are the versions of Spark and Hive I have installed on my system:
Spark : spark-1.4.0-bin-hadoop2.6
Hive : apache-hive-1.0.0-bin
I have configured the Hive installation to use MySQL as the metastore. The goal is to access the MySQL metastore and execute HiveQL queries inside spark-shell (using HiveContext).
So far I am able to execute HiveQL queries by accessing the Derby metastore (as described here; I believe Spark 1.4 comes bundled with Hive 0.13.1, which in turn uses the internal Derby database as the metastore).
Then I tried to point spark-shell to my external metastore (MySQL in this case) by setting the property given below (as suggested here) in $SPARK_HOME/conf/spark-defaults.conf:
spark.sql.hive.metastore.jars /home/mountain/hv/lib:/home/mountain/hp/lib
I have also copied $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf. But I am getting the following exception when I start the spark-shell
mountain#mountain:~/del$ spark-shell
Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars.
Am I missing something (or) not setting the property spark.sql.hive.metastore.jars correctly?
Note: Verified on Linux Mint.
If you are setting properties in spark-defaults.conf, Spark will pick up those settings only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job, say wordcount.py:
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, then you should use the config() method. Here we will set the Kafka jar packages:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
A corrupted hive-site.xml will cause this... please copy the correct hive-site.xml.