How to read files uploaded by spark-submit on Kubernetes - scala

I have Spark jobs running on YARN, and I'm now moving them to Spark on Kubernetes.
On Kubernetes I've run into an issue: files uploaded via --files can't be read by the Spark driver.
On YARN, as described in many answers, I can read those files using Source.fromFile(filename).
The same call does not work for me on Kubernetes.
Spark version: 3.0.1
Scala version: 2.12.6
deploy-mode: cluster
Submit command:
$ spark-submit --class <className> \
    --name=<jobName> \
    --master=k8s://https://api-hostname:6443 \
    ...
    --deploy-mode=cluster \
    --files app.conf \
    --conf spark.kubernetes.file.upload.path=hdfs://<nameservice>/path/to/sparkUploads/ \
    app.jar
After running the command above, app.conf is uploaded to hdfs://<nameservice>/path/to/sparkUploads/spark-upload-xxxxxxx/.
In the driver pod, I found app.conf in the /tmp/spark-******/ directory, along with app.jar.
But the driver can't read app.conf: Source.fromFile(filename) returns null, and there were no permission problems.
Update 1
In the Spark Web UI's "Environment" tab, spark://<pod-name>-svc.ni.svc:7078/files/app.conf is listed under "Classpath Entries". Does this mean app.conf is available on the classpath?
By contrast, with Spark on YARN the user.dir property was included in the system classpath.
I found SPARK-31726: Make spark.files available in driver with cluster deploy mode on kubernetes
Update 2
I found that the driver pod's /opt/spark/work-dir/ directory is included in the classpath,
but /opt/spark/work-dir/ is empty on the driver pod, whereas on the executor pod it contains app.conf and app.jar.
I think that is the problem, and it is what SPARK-31726 describes.
Update 3
After reading Jacek's answer, I tested org.apache.spark.SparkFiles.getRootDirectory().
It returns /var/data/spark-357eb33e-1c17-4ad4-b1e8-6f878b1d8253/spark-e07d7e84-0fa7-410e-b0da-7219c412afa3/userFiles-59084588-f7f6-4ba2-a3a3-9997a780af24
Update 4 - workaround
First, I create ConfigMaps that hold the files the driver/executors need to read.
Next, the ConfigMaps are mounted on the driver/executors. To mount a ConfigMap, use a Pod Template or the Spark Operator.

Files passed with --files should be accessed using the SparkFiles.get utility:
get(filename: String): String
Get the absolute path of a file added through SparkContext.addFile().

I found another temporary solution in Spark 3.3.0.
We can use the --archives flag. Files that are not tar, tar.gz, or zip archives skip the unpacking step and are then placed in the working directory of the driver and executors.
Although the docs for --archives don't mention executors, I tested it and it works.

Related

Adding PostgreSQL JDBC Driver to all-spark-notebook using docker-compose

I'm a beginner with Docker and Spark with Python, and I'm trying out some Spark examples that extract data from a local PostgreSQL database. I've been experimenting locally on a Windows 10 machine running Ubuntu 20.04 LTS. My docker-compose version is 1.28.
I keep running into the same issue, however: how do I add such-and-such a driver to my Docker images? In this case, it's the PostgreSQL JDBC driver. My question is very similar to this question, but I'm using docker-compose instead of plain Docker.
Here is the docker-compose section for the all-spark-notebook image:
services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 --jars /usr/share/java/postgresql.jar pyspark-shell
The --packages entry is necessary to get my Kafka integration to work in Jupyter (and it does). The --jars entry is my attempt to reference the PostgreSQL JDBC driver installed from the Ubuntu LTS terminal using:
sudo apt-get install libpostgresql-jdbc-java libpostgresql-jdbc-java-doc
In python, I've tried this:
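import findspark
findspark.init()  # locate the Spark installation before importing pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
# Point Spark at the JDBC jar installed via apt
conf = SparkConf()
conf.set("spark.jars", "/usr/share/java/postgresql.jar")
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("My App") \
    .getOrCreate()
# Read one table over JDBC
dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
    database='postgres',
    dbtable='cloud.some-table'
).load()
dataframe.show()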
But, I get the following error message:
java.sql.SQLException: No suitable driver
just like the referenced previous poster.
Any ideas? This should be easy, but I'm struggling.
OK, since nobody has come back with an answer, I'll post what worked for me (in the end). I'm not claiming this is the correct way to do this, and I'm happy for someone to post a better answer, but it may get someone out of trouble.
Since different configurations (and versions!) require different solutions, I'll define my setup first. I'm using Docker Desktop for Windows 10 with Docker Engine v20.10.5. I'm managing my Docker containers using docker-compose version 1.29.0. I'm using the latest all-spark-notebook (whatever version that is) and the postgresql-42.2.19 JDBC driver.
I'll also say that this is running on my local Windows machine with Ubuntu LTS installed and is for experimentation only.
The trick that worked for me was:
a) Use a package for the JDBC driver with Spark. This way, Spark installs the package from Maven at runtime (when you create the Spark instance within Jupyter) and...
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.postgresql:postgresql:42.2.19 --driver-class-path /home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar pyspark-shell
b) Understand where the package jars are unpacked and use that directory to tell Spark where to find the associated jars. In my case, I used this command to start Spark within the Jupyter notebook:
from pyspark.sql import SparkSession
# Put the Maven-downloaded JDBC jar on the driver classpath
spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar") \
    .appName("My App") \
    .getOrCreate()
One other thing to note: this can be a bit flaky. If Spark decides it needs to re-pull the files from Maven (which it does the first time around, obviously), the library isn't picked up and the connection fails. However, running docker-compose stop and docker-compose up -d to recycle the containers and then re-running the Python script fixes the connection. I don't pretend to know why, but my suspicion is that, the way I have things set up, there's some dependency there.
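As a complement to the classpath settings above, the JDBC reader can be told the driver class explicitly. The sketch below, in Python, only builds the options dictionary and does not start Spark. The helper name jdbc_options and the sample host, table, and credentials are illustrative assumptions; org.postgresql.Driver is the driver class shipped in the PostgreSQL JDBC jar. Whether naming the driver class removes the flakiness described above is not something this thread confirms.

```python
def jdbc_options(host, port, database, table, user, password):
    """Build keyword options for spark.read.format("jdbc") (hypothetical helper)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Naming the class means Spark does not have to discover it from the URL.
        "driver": "org.postgresql.Driver",
    }


# Placeholder values mirroring the question; replace them with real ones.
opts = jdbc_options(
    "host.docker.internal", 5432, "postgres", "cloud.some-table", "user", "secret"
)
print(opts["url"])
```

With a running session, these would be passed as spark.read.format("jdbc").options(**opts).load().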

Spark connector mongodb issue

I'm trying to establish a connection between Apache Spark and MongoDB. I have Spark version 3.0.0 and MongoDB 4.2.8 installed on my PC. I am following the official documentation to connect, but I'm unable to.
When I include the --conf option while starting the shell, I get an error. If I include only --packages, the shell starts, but I then need the configuration when creating the dataset, so that step throws an error.
I don't think I have understood how the connector is installed. I also couldn't find a release for my version, although the GitHub site says it supports 3.0.
I am attaching the error message.
C:\WINDOWS\system32>C:\Spark\spark-3.0.0-bin-hadoop2.7\bin\pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
Error: pyspark does not support any application options.
Usage: bin\pyspark.cmd [options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf, -c PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
Spark standalone, Mesos or K8s with cluster deploy mode only:
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone, Mesos and Kubernetes only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone, YARN and Kubernetes only:
--executor-cores NUM Number of cores used by each executor. (Default: 1 in
YARN and K8S modes, or all available cores on the worker
in standalone mode).
Spark on YARN and Kubernetes only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--principal PRINCIPAL Principal to be used to login to KDC.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above.
Spark on YARN only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
This is what happens when I don't include --conf while starting the shell:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession\
...     .builder\
...     .master('local')\
...     .config('spark.mongodb.input.uri', 'mongodb://user:password#ip.x.x.x:27017/database01.data.coll')\
...     .config('spark.mongodb.output.uri', 'mongodb://user:password#ip.x.x.x:27017/database01.data.coll')\
...     .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.1')\
...     .getOrCreate()
>>> df01 = spark.read\
...     .format("com.mongodb.spark.sql.DefaultSource")\
...     .option("database","database01")\
...     .option("collection", "collection01")\
...     .load()
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Spark\spark-3.0.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "C:\Spark\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1305, in __call__
  File "C:\Spark\spark-3.0.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.IllegalArgumentException: requirement failed: Missing 'uri' property from options
>>>

scala spark to read file from hdfs cluster

I am learning to develop Spark applications using Scala, and I am at the very first steps.
I have my Scala IDE on Windows. It is configured and runs smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me, please?
You can use SBT to package your code into a .jar file. Copy the file to your node with scp, then submit it with spark-submit.
spark-submit \
    --class <main-class> \
    --master <master-url> \
    --deploy-mode <deploy-mode> \
    --conf <key>=<value> \
    ... # other options
    <application-jar> \
    [application-arguments]
You can't access your cluster from your Windows machine that way.
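The spark-submit template above can also be assembled programmatically, which avoids shell-quoting mistakes when the job is launched from a script on the cluster node. This is a minimal Python sketch; the helper name build_spark_submit, the class name com.example.ReadDiscipline, and the memory setting are hypothetical placeholders, and only flags shown in the template are used.

```python
import shlex


def build_spark_submit(main_class, master, deploy_mode, app_jar, conf=None, app_args=()):
    """Return the spark-submit argument list for the template above (hypothetical helper)."""
    argv = [
        "spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,
    ]
    # Each configuration entry becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]
    # The application jar comes after all options, followed by its own arguments.
    argv.append(app_jar)
    argv.extend(app_args)
    return argv


argv = build_spark_submit(
    "com.example.ReadDiscipline",  # placeholder main class
    "yarn",
    "cluster",
    "app.jar",
    conf={"spark.executor.memory": "2g"},
    app_args=["hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline"],
)
# shlex.join (Python 3.8+) renders the list as a copy-pastable shell command.
print(shlex.join(argv))
```

On the node itself, the list could be handed to subprocess.run(argv, check=True) instead of being printed.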

How to pass external configuration file to pyspark(Spark 2.x) program?

When I run my PySpark program in the interactive shell, the script is able to fetch the configuration file (config.ini).
But when I run the same script with spark-submit, with master yarn and deploy mode cluster, it fails with an error saying the config file does not exist. I have checked the YARN log and see the same error there. Below is the command I use to run the PySpark job.
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
The spark2-submit command provides a --properties-file parameter; you can use it to make the properties file available to the job.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file in the spark.files parameter:
.config('spark.files', 'config/local/config.ini') \
Read it in PySpark:
from pyspark import SparkFiles
with open(SparkFiles.get('config.ini')) as config_file:
    print(config_file.read())
It works for me.
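Once the file has been shipped, parsing it is ordinary Python. Below is a minimal sketch using the standard configparser module. The helper name load_ini and the [database] section with its host and port keys are made up for illustration, and a temporary file stands in for the path that SparkFiles.get('config.ini') would return inside a running job.

```python
import configparser
import os
import tempfile


def load_ini(path):
    """Parse an ini file and fail loudly if it is missing (hypothetical helper)."""
    parser = configparser.ConfigParser()
    # open() raises FileNotFoundError for a missing file,
    # whereas parser.read(path) would silently skip it.
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return parser


# Stand-in for the file that Spark would distribute to the job.
tmp_dir = tempfile.mkdtemp()
ini_path = os.path.join(tmp_dir, "config.ini")
with open(ini_path, "w", encoding="utf-8") as handle:
    handle.write("[database]\nhost = db.example.internal\nport = 5432\n")

config = load_ini(ini_path)
print(config["database"]["host"], config.getint("database", "port"))
```

Inside the job, the call would become load_ini(SparkFiles.get('config.ini')), assuming the file was passed through spark.files or --files.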

Run specific virtualenv on dataproc cluster at spark-submit like in vanilla Spark

When I'm running on a vanilla Spark cluster and want to run a PySpark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, let's say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and then, so long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python" inside script.py, the code will run against the virtual environment, and Spark will handle provisioning the virtual environment to all of my clusters.
When I do this on Dataproc, first, the Hadoop-style hash aliasing doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python" will produce:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly seeing my Python executable and trying to run against it, but there appears to be some sort of parsing error. What gives? Is there any way to tell Dataproc which Python executable to use, the way you can with a vanilla Spark cluster?
Turns out I was distributing python binaries across OSes, and was boneheaded enough to not notice that I was doing so, and the incompatibility was causing the crash.
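A mismatch like this can be caught before the environment is zipped by looking at the first bytes of the interpreter. The Python sketch below is a rough check under stated assumptions: the helper name binary_format and the sample file names are hypothetical, only a few common formats are recognised, and it says nothing about CPU architecture or libc version. The demo writes fake headers to temporary files rather than inspecting a real interpreter.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of a few common executable formats.
MAGIC_PREFIXES = [
    (b"\x7fELF", "ELF"),                # Linux and most other Unix systems
    (b"MZ", "PE"),                      # Windows
    (b"\xcf\xfa\xed\xfe", "Mach-O"),    # macOS, 64-bit little-endian
    (b"\xce\xfa\xed\xfe", "Mach-O"),    # macOS, 32-bit little-endian
    (b"#!", "script"),                  # text file with a shebang line
]


def binary_format(path):
    """Guess the executable format of a file from its first four bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for prefix, label in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return label
    return "unknown"


# Demo with fake headers standing in for venv/bin/python built on two OSes.
tmp_dir = tempfile.mkdtemp()
fake_headers = {
    "linux_python": b"\x7fELF\x02\x01\x01\x00",
    "mac_python": b"\xcf\xfa\xed\xfe\x07\x00\x00\x01",
}
detected = {}
for name, header in fake_headers.items():
    sample_path = os.path.join(tmp_dir, name)
    with open(sample_path, "wb") as handle:
        handle.write(header)
    detected[name] = binary_format(sample_path)
print(detected)
```

Running binary_format on the bin/python of the environment and comparing the result with what the cluster nodes expect (ELF on a Linux-based cluster) would flag an environment built on a different OS.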