I am getting an error while trying to submit a PySpark job to a Dataproc cluster.
gcloud submission command:
gcloud dataproc jobs submit pyspark --cluster test-cluster migrate_db_table.py
But I am getting the error below:
('Exception occurs !!!', Py4JJavaError(u'An error occurred while
calling o31311.jdbc.\n', JavaObject id=o31313))
Any idea about this error?
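For context, the o31311.jdbc in the message refers to a DataFrame jdbc() call on the JVM side. The script presumably does something along these lines (the URL, table, credentials, and output path below are hypothetical placeholders, not the actual code):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migrate_db_table").getOrCreate()

# Hypothetical JDBC source; the real URL, table, and credentials will differ.
df = spark.read.jdbc(
    url="jdbc:mysql://10.0.0.5:3306/mydb",
    table="source_table",
    properties={
        "user": "dbuser",
        "password": "dbpass",
        # The driver class/jar must be available to the driver and executors,
        # e.g. passed with --jars or spark.jars.packages at submit time.
        "driver": "com.mysql.cj.jdbc.Driver",
    },
)

# Hypothetical destination; let any Py4JJavaError propagate so the full
# Java stack trace (and the real cause) appears in the job output.
df.write.parquet("gs://my-bucket/migrated/source_table/")
Note that printing the exception as a tuple, as the output above suggests, hides the Java stack trace; letting the Py4JJavaError propagate (or printing str(e)) usually reveals the underlying cause, commonly a missing JDBC driver jar or an unreachable database.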
I am submitting a Spark job to Dataproc this way:
gcloud dataproc jobs submit spark --cluster=$CLUSTER --region=$REGION --properties spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1, spark.submit.deployMode=cluster --class path.to.my.main.class --jars=path.to.jars -- "-p" "some_arg" "-z" "some_other_arg"
But I am getting this error:
ERROR: (gcloud.dataproc.jobs.submit.spark) unrecognized arguments:
spark.submit.deployMode=cluster
Any idea why? Thank you in advance for your help.
It works fine this way (without the deploy-mode property):
gcloud dataproc jobs submit spark --cluster=$CLUSTER --region=$REGION --properties spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1 --class path.to.my.main.class --jars=path.to.jars -- "-p" "some_arg" "-z" "some_other_arg"
It seems you have a space between the first property and the second. Either remove it or surround both of them with quotes.
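For example, a hypothetical rewrite of the failing command with the space removed and the whole properties value quoted:
gcloud dataproc jobs submit spark --cluster=$CLUSTER --region=$REGION \
--properties="spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1,spark.submit.deployMode=cluster" \
--class path.to.my.main.class --jars=path.to.jars -- "-p" "some_arg" "-z" "some_other_arg"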
Another option is to replace the --properties flag with:
--packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.19.1 --properties spark.submit.deployMode=cluster
Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark, but sometimes I need easier access to Scala/Java libraries, such as directly creating an org.apache.hadoop.fs.FileSystem object inside map functions. Is there any easy way to submit such "spark-shell"-equivalent jobs directly from the command line using the Dataproc Jobs API?
At the moment there isn't a specialized top-level Dataproc job type for uncompiled Spark Scala, but under the hood spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can write the snippets of Scala you may have tested in a spark-shell or notebook session and run them as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
Just like any other file, the --files argument also accepts any Hadoop-compatible path, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note that in all cases the file becomes a local file in the working directory of the main driver, so the argument to "-i" can just be a relative path to the filename.
I am using Dataproc to submit Spark jobs. However, on spark-submit, non-Spark arguments are being read as Spark arguments!
I am receiving the error/warning below when running a particular job.
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know what's wrong with my current format. Thanks in advance.
spark-submit silently ignores conf options that do not start with spark., which is why it says this property was ignored:
--properties dataproc:dataproc.conscrypt.provider.enable=false
Any property you pass at job submission should be named spark.<propertyname>; this is just a warning.
Why this property is required:
The Conscrypt security provider has been temporarily changed from the
default to an optional security provider. This change was made due to
incompatibilities with some workloads. The Conscrypt provider will be
re-enabled as the default with the release of Cloud Dataproc 1.2 in
the future. In the meantime, you can re-enable the Conscrypt provider
when creating a cluster by specifying this Cloud Dataproc property:
--properties dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating the cluster, since it is a cluster property, not a Spark property (meaning the Spark framework cannot understand it and simply ignores it).
Example usage:
gcloud beta dataproc clusters create my-test \
--project my-project \
--subnet prod-sub-1 \
--zone southamerica-east1-a \
--region=southamerica-east1 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 40 \
--num-workers 5 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--image-version 1.2 \
--tags internal,ssh,http-server,https-server \
--properties dataproc:dataproc.conscrypt.provider.enable=false \
--format=json \
--max-idle=10m
and then start the job like this:
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py \
--cluster=my-test \
--project=my-project \
--region=southamerica-east1 \
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar \
--format=json -- [JOB_ARGS]
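Applied to the Spark command from the question, the job submission itself would then keep only the spark.* properties and drop the dataproc: one, for example:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000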
We have an Airflow DAG that involves running a PySpark job on Dataproc. We need a JDBC driver during the job, which I'd normally pass to the dataproc jobs submit command:
gcloud dataproc jobs submit pyspark \
--cluster my-cluster \
--properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
--py-files ...
But how can I do it with Airflow's DataProcPySparkOperator?
For now we're adding this library to the cluster itself:
gcloud dataproc clusters create my-cluster \
--region global \
--zone europe-west1-d \
...
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
...
This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?
I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.
See:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py
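As a minimal sketch (the DAG wiring, bucket paths, and cluster name below are hypothetical; dataproc_pyspark_properties is the parameter referred to above, so double-check the remaining parameter names against the operator source linked here):
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Hypothetical DAG definition; adjust IDs, dates, and scheduling to your setup.
dag = DAG(
    dag_id="dataproc_pyspark_example",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
)

run_job = DataProcPySparkOperator(
    task_id="run_pyspark_job",
    main="gs://my-bucket/jobs/my_job.py",  # hypothetical location of the PySpark script
    cluster_name="my-cluster",
    # Equivalent of --properties on the gcloud command line:
    dataproc_pyspark_properties={
        "spark.jars.packages": "mysql:mysql-connector-java:6.0.6",
    },
    dag=dag,
)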
I am trying to set up a cluster with an initialization script, but I get the following error:
[BAD JSON: JSON Parse error: Unexpected identifier "Google"]
In the log folder the init script output log is absent.
This seems rather strange, as it seemed to work last week, and the error message does not seem related to the init script but rather to the input arguments for cluster creation. I used the following command:
gcloud beta dataproc clusters create <clustername> --bucket <bucket> --zone <zone> --master-machine-type n1-standard-1 --master-boot-disk-size 10 --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 10 --project <projectname> --initialization-actions <gcs-uri of script>
Apparently changing
#!/bin/sh
to
#!/bin/bash
and removing all "sudo" occurrences did the trick.
This particular error occurs most often when the initialization script is in a Cloud Storage (GCS) bucket to which the project running the cluster does not have access.
I would recommend double-checking that the project being used for the cluster has read access to the bucket.
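One way to check (assuming you have permission to view the bucket's IAM policy) is to inspect it with gsutil and confirm that the cluster project's service account, typically that project's Compute Engine default service account, has read access (e.g. roles/storage.objectViewer or a legacy reader role):
gsutil iam get gs://<bucket-containing-init-script>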