I'm a beginner with Docker and Spark with Python, and I'm trying out some Spark examples, extracting data from a local PostgreSQL database. I'm experimenting locally on a Windows 10 machine running Ubuntu 20.04 LTS. My docker-compose version is 1.28.
However, I keep running into the same issue: how do I add such-and-such a driver to my Docker images? In this case, it's the PostgreSQL JDBC driver. My question is very similar to this question, but I'm using docker-compose instead of plain docker.
Here is the docker-compose section for the all-spark-notebook image:
services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 --jars /usr/share/java/postgresql.jar pyspark-shell
The --packages entry is necessary to get my Kafka integration to work in Jupyter (and it does). The --jars entry is my attempt to reference the PostgreSQL JDBC driver installed from the Ubuntu LTS terminal using:
sudo apt-get install libpostgresql-jdbc-java libpostgresql-jdbc-java-doc
In Python, I've tried this:
import findspark
findspark.init()

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.jars", "/usr/share/java/postgresql.jar")

spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("My App") \
    .getOrCreate()

dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
    database='postgres',
    dbtable='cloud.some-table'
).load()
dataframe.show()
But, I get the following error message:
java.sql.SQLException: No suitable driver
just like the referenced previous poster.
Any ideas? This should be easy, but I'm struggling.
OK, since nobody has come back with an answer, I'll post what worked for me (in the end). I'm not claiming this is the correct way to do it, and I'm happy for someone to post a better answer, but it may get someone out of trouble.
Since different configurations (and versions!) require different solutions, I'll define my setup first. I'm using Docker Desktop for Windows 10 with Docker Engine v20.10.5, and I'm managing my containers with docker-compose version 1.29.0. I'm using the latest all-spark-notebook image (whatever version that is) and the postgresql-42.2.19 JDBC driver.
I'll also say that this is running on my local Windows machine with Ubuntu 20.04 LTS installed and is for experimentation only.
The trick that worked for me was:
a) Use a package for the JDBC driver with Spark. This way, Spark installs the package from Maven at runtime (when you create the Spark session within Jupyter) and...
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.postgresql:postgresql:42.2.19 --driver-class-path /home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar pyspark-shell
b) Understand where the package jars are unpacked and use that directory to tell Spark where to find the associated jars. In my case, I used this code to start Spark within the Jupyter notebook:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar") \
    .appName("My App") \
    .getOrCreate()
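With that session in place, a JDBC read along the lines of the one in the question worked. This is only a sketch: the host, credentials, and table name are placeholders for your own setup, and the explicit driver option is optional once the jar is on the class path.

# sketch only: host, credentials and table name are placeholders
dataframe = spark.read.format("jdbc").options(
    url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
    driver="org.postgresql.Driver",  # explicit driver class; optional when the jar is already on the class path
    dbtable="cloud.some-table"
).load()
dataframe.show()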
One other thing to note: this can be a bit flaky. If Spark decides it needs to re-pull the files from Maven (it'll do this the first time around, obviously), the library isn't picked up and the connection fails. However, running docker-compose stop and docker-compose up -d to recycle the containers and re-running the Python script makes the connection happy. I don't pretend to know why, but my suspicion is that, the way I have things set up, there's some ordering dependency there.
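As a quick sanity check (just a sketch, assuming the default Ivy cache location inside the notebook container), you can list the downloaded jars from a notebook cell before creating the session:

# assumes the default Ivy cache path used inside the jupyter image
import os
print(os.listdir("/home/jovyan/.ivy2/jars"))  # the postgresql jar should appear here after the first --packages download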
Related
I am aware of Change Apache Livy's Python Version and How do i setup Pyspark in Python 3 with spark-env.sh.template.
I have also seen the Livy documentation.
However, none of that works. Livy keeps using Python 2.7 no matter what.
This is running Livy 0.6.0 on an EMR cluster.
I have changed the PYSPARK_PYTHON environment variable to /usr/bin/python3 for the hadoop user, my user, root, and ec2-user. Logging into the EMR master node via ssh and running pyspark starts python3 as expected. But Livy keeps using python2.7.
I added export PYSPARK_PYTHON=/usr/bin/python3 to the /etc/spark/conf/spark-env.sh file. Livy keeps using python2.7.
I added "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/bin/python3" and "spark.executorEnv.PYSPARK_PYTHON":"/usr/bin/python3" to the items listed below and in every case . Livy keeps using python2.7.
The sparkmagic config.json and config_other_settings.json files, before starting a PySpark kernel in Jupyter.
The session properties in the sparkmagic %manage_spark Jupyter widget.
The %%spark config cell-magic before the line-magic %spark add --session test --url http://X.X.X.X:8998 --auth None --language python.
Note: This works without any issues in another EMR cluster running Livy 0.7.0. I have gone over all of the settings on the other cluster and cannot find what is different. I did not have to do any of this on the other cluster; Livy just used python3 by default.
How exactly do I get Livy to use python3 instead of python2?
I finally found an answer, just after posting.
I ran the following in a Jupyter PySpark-kernel cell, before running any code that would start the PySpark session on the remote EMR cluster via Livy:
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3"
    }
}
Simply adding "spark.pyspark.python": "python3" to the .sparkmagic config.json or config_other_settings.json also worked.
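A quick way to confirm which interpreter the Livy-backed session actually picked up (just a sanity-check sketch) is to run a cell through the same PySpark kernel:

# run in a PySpark kernel cell once the Livy session has started
import sys
print(sys.version)
print(sys.executable)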
Confusingly, this does not match the official Livy documentation.
I have Spark jobs running on YARN. These days I'm moving to Spark on Kubernetes.
On Kubernetes I'm having an issue: files uploaded via --files can't be read by the Spark driver.
On YARN, as described in many answers, I can read those files using Source.fromFile(filename).
But I can't read files in Spark on Kubernetes.
Spark version: 3.0.1
Scala version: 2.12.6
deploy-mode: cluster
Submit command:
$ spark-submit --class <className> \
--name=<jobName> \
--master=k8s://https://api-hostname:6443 \
...
--deploy-mode=cluster \
--files app.conf \
--conf spark.kubernetes.file.upload.path=hdfs://<nameservice>/path/to/sparkUploads/ \
app.jar
After executing the above command, app.conf is uploaded to hdfs://<nameservice>/path/to/sparkUploads/spark-upload-xxxxxxx/, and in the driver's pod I found app.conf in the /tmp/spark-******/ directory, along with app.jar.
But the driver can't read app.conf: Source.fromFile(filename) returns null, and there were no permission problems.
Update 1
In the Spark Web UI -> "Environment" tab, spark://<pod-name>-svc.ni.svc:7078/files/app.conf appears under "Classpath Entries". Does this mean app.conf is available on the classpath?
On the other hand, in Spark on YARN the user.dir property was included in the system classpath.
I found SPARK-31726: Make spark.files available in driver with cluster deploy mode on kubernetes
Update 2
I found that the driver pod's /opt/spark/work-dir/ directory was included in the classpath,
but /opt/spark/work-dir/ is empty on the driver pod, whereas on the executor pod it contains app.conf and app.jar.
I think that is the problem, and SPARK-31726 describes it.
Update 3
After reading Jacek's answer, I tested org.apache.spark.SparkFiles.getRootDirectory().
It returns /var/data/spark-357eb33e-1c17-4ad4-b1e8-6f878b1d8253/spark-e07d7e84-0fa7-410e-b0da-7219c412afa3/userFiles-59084588-f7f6-4ba2-a3a3-9997a780af24
Update 4 - workaround
First, I create ConfigMaps to hold the files that I want to read in the driver/executors.
Next, the ConfigMaps are mounted on the driver/executors. To mount a ConfigMap, use a Pod Template or the Spark Operator.
Files distributed with --files should be accessed using the SparkFiles.get utility:
get(filename: String): String
Get the absolute path of a file added through SparkContext.addFile().
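A minimal sketch of that (shown with the PySpark API for brevity; the Scala call org.apache.spark.SparkFiles.get("app.conf") is the direct analogue):

# sketch: resolve and read a file shipped with --files
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

conf_path = SparkFiles.get("app.conf")  # absolute path under SparkFiles.getRootDirectory()
with open(conf_path) as f:
    print(f.read())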
I found another temporary solution in Spark 3.3.0.
We can use the --archives flag. Files that are not tar, tar.gz, or zip archives skip the unpacking step and are then placed in the working directory of the driver and executors.
Although the docs for --archives don't mention executors, I tested it and it works.
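If that's the case, reading the file is just a relative open from the working directory; a PySpark-flavoured sketch (app.conf is the file name passed on the command line):

# sketch: a plain (non-archive) file passed via --archives should land in the working directory
import os
print(os.getcwd())
with open("app.conf") as f:
    print(f.read())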
I am learning to develop Spark applications using Scala, and I am taking my very first steps.
I have my Scala IDE on Windows, configured and running smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and a Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me, please?
You can use SBT to package your code into a .jar file. scp the file to your node, then try to submit it with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine in that way.
When I'm running on a vanilla Spark cluster and want to run a PySpark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and then, so long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python" inside script.py, the code will run against the virtual environment, and Spark will handle provisioning the virtual environment across the cluster.
When I do this on Dataproc, first, the Hadoop-style hash aliasing (venv.zip#VENV) doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python" will produce:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly seeing my Python executable and trying to run against it, but there appears to be some sort of parsing error. What gives? Is there any way to tell Dataproc which Python executable to use, the way you can against a vanilla Spark cluster?
It turns out I was distributing Python binaries built on one OS to machines running another, and was boneheaded enough not to notice that I was doing so; the incompatibility was causing the crash.
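In hindsight, a quick check like the following sketch would have surfaced the mismatch, by comparing the platform seen by the driver with the one seen by an executor:

# sketch: compare the OS/platform reported by the driver and by an executor
import platform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print("driver  :", platform.platform())
print("executor:", sc.parallelize([0], 1).map(lambda _: platform.platform()).collect()[0])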
I'm running Spark in a Docker container (sequenceiq/spark).
I launched it like this:
docker run --link dbHost:dbHost -v my/path/to/postgres/jar:postgres/ -it -h sandbox sequenceiq/spark:1.6.0 bash
I'm sure that the PostgreSQL database is accessible through the address postgresql://user:password@localhost:5432/ticketapp.
I start the spark-shell with spark-shell --jars postgres/postgresql-9.4-1205.jdbc42.jar, and since I can connect from my Play! application, which has "org.postgresql" % "postgresql" % "9.4-1205-jdbc42" as a dependency, it seems that I have the correct jar. (I also don't get any warning saying that the local jar does not exist.)
But when I try to connect to my database with:
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql://dbHost:5432/ticketapp?user=user&password=password",
    "dbtable" -> "events"
  )
).load()
(I also tried the URL jdbc:postgresql://user:root@dbHost:5432/ticketapp.)
as explained in the Spark documentation, I get this error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://dbHost:5432/ticketapp?user=simon&password=root
What am I doing wrong?
As far as I know, you need to include the JDBC driver for your particular database on the Spark classpath. According to the documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases), it should be done like this:
SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar bin/spark-shell