Dataproc on GKE: python packages listed in properties not installed - pyspark

I created a Dataproc cluster on a GKE cluster. The required packages are already listed in the cluster properties, as in the examples here. But when I submitted a job, it failed with a ModuleNotFoundError:
...
Waiting for job output...
PYSPARK_PYTHON=/opt/conda/bin/python
JAVA_HOME=/usr/lib/jvm/temurin-8-jdk-amd64
SPARK_EXTRA_CLASSPATH=
Merging Spark configs
Skipping merging /opt/spark/conf/spark-defaults.conf, file does not exist.
Skipping merging /opt/spark/conf/log4j.properties, file does not exist.
Skipping merging /opt/spark/conf/spark-env.sh, file does not exist.
Skipping custom init script, file does not exist.
Running heartbeat loop
Traceback (most recent call last):
File "/tmp/spark-d6516b57-0924-4ce2-9de8-a5c1116667b4/pkg.py", line 1, in <module>
from google.cloud import secretmanager
ModuleNotFoundError: No module named 'google'
This is the gcloud command I used:
gcloud dataproc clusters gke create gke-dp --region=asia-southeast1 --spark-engine-version=3.1 \
--gke-cluster=gke-spark --gke-cluster-location=asia-southeast1-b --namespace=dataproc \
--pools='name=dp-default,roles=default,machineType=n2-standard-2,min=1,max=1' \
--pools='name=dp-workers,roles=spark-driver;spark-executor,machineType=n2-standard-4,min=1,max=4' \
--properties='^#^dataproc:pip.packages=google-cloud-secret-manager==2.15.0,numpy==1.24.1#spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.5.1.jar' \
--properties="dataproc:dataproc.gke.agent.google-service-account=dataproc@de-project.iam.gserviceaccount.com" \
--properties="dataproc:dataproc.gke.spark.driver.google-service-account=dataproc@de-project.iam.gserviceaccount.com" \
--properties="dataproc:dataproc.gke.spark.executor.google-service-account=dataproc@de-project.iam.gserviceaccount.com"

This functionality is not supported by Dataproc on GKE.
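Since the packages are not installed for you, it can help to fail fast with a clear message instead of a traceback in the middle of the job. The following is only a diagnostic sketch, not a way to install anything: `missing_modules` is a hypothetical helper name, and the module names passed to it are whatever your job imports.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the names from module_names that cannot be imported in this interpreter."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises ModuleNotFoundError when a parent package
            # (for example "google") is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Calling `missing_modules(["google.cloud.secretmanager", "numpy"])` at the top of the driver script and raising if the result is non-empty makes the missing dependency visible in the first lines of the job output.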

Related

How to read files uploaded by spark-submit on Kubernetes

I have Spark Jobs running on Yarn. These days I'm moving to Spark on Kubernetes.
On Kubernetes I'm having an issue: files uploaded via --files can't be read by Spark Driver.
On Yarn, as described in many answers I can read those files using Source.fromFile(filename).
But I can't read files in Spark on Kubernetes.
Spark version: 3.0.1
Scala version: 2.12.6
deploy-mode: cluster
submit commands
$ spark-submit --class <className> \
--name=<jobName> \
--master=k8s://https://api-hostname:6443 \
...
--deploy-mode=cluster \
--files app.conf \
--conf spark.kubernetes.file.upload.path=hdfs://<nameservice>/path/to/sparkUploads/ \
app.jar
After executing the above command, app.conf is uploaded to hdfs://<nameservice>/path/to/sparkUploads/spark-upload-xxxxxxx/.
In the driver's pod, I found app.conf in the /tmp/spark-******/ directory, along with app.jar.
But the driver can't read app.conf: Source.fromFile(filename) returns null, and there were no permission problems.
Update 1
In Spark Web UI->"Environment" Tab, spark://<pod-name>-svc.ni.svc:7078/files/app.conf in "Classpath Entries" menu. Does this mean app.conf is available in classpath?
On the other hand, in Spark on Yarn user.dir property was included in System classpath.
I found SPARK-31726: Make spark.files available in driver with cluster deploy mode on kubernetes
Update 2
I found that the driver pod's /opt/spark/work-dir/ directory was included in the classpath.
But /opt/spark/work-dir/ is empty on the driver pod, whereas on the executor pod it contains app.conf and app.jar.
I think that is the problem and SPARK-31726 describes this.
Update 3
After reading Jacek's answer, I tested org.apache.spark.SparkFiles.getRootDirectory().
It returns /var/data/spark-357eb33e-1c17-4ad4-b1e8-6f878b1d8253/spark-e07d7e84-0fa7-410e-b0da-7219c412afa3/userFiles-59084588-f7f6-4ba2-a3a3-9997a780af24
Update 4 - work around
First, I create ConfigMaps that hold the files I want the driver/executors to read.
Next, the ConfigMaps are mounted on the driver/executors. To mount a ConfigMap, use a Pod Template or the Spark Operator.
Files passed with --files should be accessed using the SparkFiles.get utility:
get(filename: String): String
Get the absolute path of a file added through SparkContext.addFile().
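For PySpark jobs the same utility is exposed as `pyspark.SparkFiles.get`. A minimal sketch follows, assuming a SparkContext is already active when it is called; `read_distributed_file` and its `resolve` parameter are hypothetical names introduced here so the helper can be tried without a Spark installation.

```python
from pathlib import Path
from typing import Callable, Optional


def read_distributed_file(name: str,
                          resolve: Optional[Callable[[str], str]] = None) -> str:
    """Read a file shipped with --files after resolving its absolute path."""
    if resolve is None:
        # Imported lazily: SparkFiles.get(name) returns the absolute path of a
        # file added through SparkContext.addFile() or --files.
        from pyspark import SparkFiles
        resolve = SparkFiles.get
    return Path(resolve(name)).read_text(encoding="utf-8")
```

Inside a job, `read_distributed_file("app.conf")` would resolve the path through SparkFiles rather than relying on the working directory.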
I found another temporary solution in Spark 3.3.0.
We can use the --archives flag. Files that are not tar, tar.gz, or zip archives skip the unpacking step and are then placed in the working directory of the driver and executors.
Although the docs for --archives don't mention the executor, I tested it and it works.

Spark connector mongodb issue

I'm trying to establish a connection between apache spark and mongodb. I have spark version 3.0.0 installed and mongodb 4.2.8 installed on my pc. I am following official documentation to connect but I'm unable to.
When I include the --conf option while starting the shell, it fails with an error. If I only include --packages, the connection is established, but I then need the configuration when creating the dataset, so it throws an error at that point.
I don't think I have understood how the connector is installed. I also couldn't find anything for my version, although the GitHub site said it supports 3.0.
I am attaching error msg.
C:\WINDOWS\system32>C:\Spark\spark-3.0.0-bin-hadoop2.7\bin\pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
Error: pyspark does not support any application options.
Usage: bin\pyspark.cmd [options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf, -c PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
Spark standalone, Mesos or K8s with cluster deploy mode only:
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone, Mesos and Kubernetes only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone, YARN and Kubernetes only:
--executor-cores NUM Number of cores used by each executor. (Default: 1 in
YARN and K8S modes, or all available cores on the worker
in standalone mode).
Spark on YARN and Kubernetes only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--principal PRINCIPAL Principal to be used to login to KDC.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above.
Spark on YARN only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
This is what happens when I don't include --conf while starting the shell:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession\
... .builder\
... .master('local')\
... .config('spark.mongodb.input.uri', 'mongodb://user:password@ip.x.x.x:27017/database01.data.coll')\
... .config('spark.mongodb.output.uri', 'mongodb://user:password@ip.x.x.x:27017/database01.data.coll')\
... .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.1')\
... .getOrCreate()
>>> df01 = spark.read\
... .format("com.mongodb.spark.sql.DefaultSource")\
... .option("database","database01")\
... .option("collection", "collection01")\
... .load()
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "C:\Spark\spark-3.0.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 184, in load
return self._df(self._jreader.load())
File "C:\Spark\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1305, in __call__
File "C:\Spark\spark-3.0.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.IllegalArgumentException: requirement failed: Missing 'uri' property from options
>>>

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark but sometimes I need easier access to Scala/Java libraries such as directly creating a org.apache.hadoop.fs.FileSystem object inside of map functions. Is there any easy way to submit such "spark-shell" equivalent jobs directly from a command-line using Dataproc Jobs APIs?
At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood, spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can just write snippets of Scala that you may have tested in a spark-shell or notebook session, and run that as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
Just like any other file, you can also specify any Hadoop-compatible path in the --files argument as well, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note that in all cases, the file becomes a local file in the working directory of the main driver job, so the argument to "-i" can just be a relative path to the filename.
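If you submit such jobs from a script, the pattern above can be captured in a small helper. This is a sketch only: `repl_job_command` is a hypothetical name, and it assembles exactly the flags shown in the commands above, deriving the "-i" argument from the base name of whatever path is passed to --files.

```python
import posixpath
from typing import List


def repl_job_command(cluster: str, script_uri: str) -> List[str]:
    """Build the gcloud argv that runs a Scala script through org.apache.spark.repl.Main."""
    # The file is staged into the driver's working directory, so "-i" only
    # needs the base name, whatever scheme (local, gs://, hdfs://, file://) was used.
    script_name = posixpath.basename(script_uri)
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--cluster", cluster,
        "--class", "org.apache.spark.repl.Main",
        "--files", script_uri,
        "--", "-i", script_name,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where gcloud is installed and authenticated.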

How to know when dataproc initialization actions are done

I need to run a Dataproc cluster with both BigQuery and Cloud Storage connectors installed.
I use a variant of this script (because I have no access to the bucket used in the general one). Everything works fine, but when I run a job once the cluster is up and running, it always results in a Task was not acquired error.
I can fix this by simply restarting the Dataproc agent on every node, but I really need this to work properly so that I can run a job right after my cluster is created. It seems that this part of the script is not working properly:
# Restarts Dataproc Agent after successful initialization
# WARNING: this function relies on undocumented and not officially supported Dataproc Agent
# "sentinel" files to determine successful Agent initialization and not guaranteed
# to work in the future. Use at your own risk!
restart_dataproc_agent() {
  # Because Dataproc Agent should be restarted after initialization, we need to wait until
  # it will create a sentinel file that signals initialization completion (success or failure)
  while [[ ! -f /var/lib/google/dataproc/has_run_before ]]; do
    sleep 1
  done
  # If Dataproc Agent didn't create a sentinel file that signals initialization
  # failure then it means that initialization succeeded and it should be restarted
  if [[ ! -f /var/lib/google/dataproc/has_failed_before ]]; then
    service google-dataproc-agent restart
  fi
}
export -f restart_dataproc_agent
# Schedule asynchronous Dataproc Agent restart so it will use updated connectors.
# It could not be restarted synchronously because Dataproc Agent should be restarted
# after its initialization, including init actions execution, has been completed.
bash -c restart_dataproc_agent & disown
My questions here are:
How can I know that the initialization actions are done?
Do I have to restart the Dataproc agent on my newly created cluster's nodes, and if so, how do I do it properly?
EDIT:
Here is the command I use to create a cluster (using the 1.3 image version):
gcloud dataproc --region europe-west1 \
clusters create my-cluster \
--bucket my-bucket \
--subnet default \
--zone europe-west1-b \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 50 \
--num-workers 2 \
--worker-machine-type n1-standard-2 \
--worker-boot-disk-size 100 \
--image-version 1.3 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project my-project \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'gcs-connector-version=1.9.6' \
--metadata 'bigquery-connector-version=0.13.6'
Also, please note that the connectors initialization script has since been fixed and works fine, so I am using it now, but I still have to manually restart the Dataproc agent to be able to run a job.
The Dataproc agent logs a "Custom initialization actions finished." message in the /var/log/google-dataproc-agent.0.log file after initialization actions succeed.
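A check for that message can be scripted on the node itself. The sketch below assumes the log path and message text quoted above are unchanged on your image version; `init_actions_finished` is a hypothetical helper name.

```python
from pathlib import Path

AGENT_LOG = "/var/log/google-dataproc-agent.0.log"
FINISHED_MARKER = "Custom initialization actions finished."


def init_actions_finished(log_path: str = AGENT_LOG) -> bool:
    """Return True once the agent log contains the 'finished' message."""
    path = Path(log_path)
    if not path.exists():
        # The agent has not written its log yet.
        return False
    with path.open("r", encoding="utf-8", errors="replace") as log:
        return any(FINISHED_MARKER in line for line in log)
```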
No, you don't need to restart the Dataproc agent manually.
This issue is caused by Dataproc agent service restart in the connectors initialization action and should be resolved by this PR.
As for knowing when the initialization actions are finished, you can check the cluster's status.state: if it is CREATING, the initialization actions are still being executed; if it is RUNNING, they are done.
Check here
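The status.state check can be turned into a polling loop. This is a rough sketch under a few assumptions: gcloud is installed and authenticated where the script runs, any state other than CREATING or RUNNING is treated as a failure, and the function names, poll interval, and timeout are illustrative choices rather than anything prescribed by Dataproc.

```python
import subprocess
import time
from typing import Callable


def gcloud_cluster_state(cluster: str, region: str) -> str:
    """Return status.state of a cluster as reported by gcloud."""
    result = subprocess.run(
        ["gcloud", "dataproc", "clusters", "describe", cluster,
         "--region", region, "--format", "value(status.state)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_until_running(get_state: Callable[[], str],
                       poll_seconds: float = 10.0,
                       timeout_seconds: float = 1800.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until get_state() returns RUNNING; raise on other states or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state == "RUNNING":
            return  # initialization actions are done
        if state != "CREATING":
            raise RuntimeError(f"unexpected cluster state: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError("cluster did not reach RUNNING in time")
        sleep(poll_seconds)
```

A caller would typically write `wait_until_running(lambda: gcloud_cluster_state("my-cluster", "europe-west1"))` before submitting the first job.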

Google Dataproc Agent reports failure when using initialization script

I am trying to set up a cluster with an initialization script, but I get the following error:
[BAD JSON: JSON Parse error: Unexpected identifier "Google"]
In the log folder the init script output log is absent.
This seems rather strange, as it worked last week, and the error message does not seem related to the init script but rather to the input arguments for the cluster creation. I used the following command:
gcloud beta dataproc clusters create <clustername> --bucket <bucket> --zone <zone> --master-machine-type n1-standard-1 --master-boot-disk-size 10 --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 10 --project <projectname> --initialization-actions <gcs-uri of script>
Apparently changing
#!/bin/sh
to
#!/bin/bash
and removing all "sudo" occurrences did the trick.
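The two changes above can be checked before uploading a script. The following is a naive sketch: `init_script_warnings` is a hypothetical helper, and its comment stripping simply cuts each line at the first "#", so it will miss or misreport cases where "#" appears inside a quoted string.

```python
import re
from typing import List


def init_script_warnings(script_text: str) -> List[str]:
    """Flag a non-bash shebang and any sudo usage in an initialization script."""
    warnings = []
    lines = script_text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line != "#!/bin/bash":
        warnings.append("first line is not #!/bin/bash")
    for number, line in enumerate(lines, start=1):
        # Skip the shebang line; elsewhere, ignore everything after the first "#".
        code = line.split("#", 1)[0] if number > 1 else ""
        if re.search(r"\bsudo\b", code):
            warnings.append(f"line {number} uses sudo")
    return warnings
```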
This particular error occurs most often when the initialization script is in a Cloud Storage (GCS) bucket to which the project running the cluster does not have access.
I would recommend double-checking that the project used for the cluster has read access to the bucket.