How to get gcloud dataproc create flags in a Spark job? - scala

I want to get the flags that were used when creating a Dataproc cluster from within a Spark job.
For example, I created my cluster using this command line:
gcloud dataproc clusters create cluster-name \
--region=region \
--bucket=bucket-name \
--temp-bucket=bucket-name \
other args ...
In my Scala Spark job I want to get the bucket name and the other arguments. How can I do that? I know that if I want to get the arguments of my job, I can do this:
val sc = sparkSession.sparkContext
val conf_context = sc.getConf.getAll
conf_context.foreach(println)
Any help, please?
Thanks

Dataproc also publishes some attributes, including the bucket name, to GCE instance metadata. You can also specify your own metadata. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/metadata.
These will be available to you through the metadata server. For example, if you want to read the bucket name, you can run:
curl -s -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-bucket
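If you want to do that read directly from the Scala job rather than shelling out to curl, a minimal sketch (my own illustration, using only java.net and scala.io, and assuming the code runs on a cluster VM where metadata.google.internal resolves) could look like this:
import java.net.URL
import scala.io.Source

// Read a GCE/Dataproc metadata attribute from inside the job.
// Only works on the cluster VMs, where metadata.google.internal is resolvable.
def metadataAttribute(name: String): String = {
  val conn = new URL(
    s"http://metadata.google.internal/computeMetadata/v1/instance/attributes/$name"
  ).openConnection()
  conn.setRequestProperty("Metadata-Flavor", "Google") // required header
  Source.fromInputStream(conn.getInputStream).mkString.trim
}

val stagingBucket = metadataAttribute("dataproc-bucket")
println(s"Cluster staging bucket: $stagingBucket")
The same helper should work for the other attributes documented on that page, e.g. dataproc-cluster-name.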

You can use the gcloud dataproc clusters describe shell command to get details about the cluster:
gcloud dataproc clusters describe $clusterName --region $clusterRegion
To get the bucket name from this command's output, you can use grep and sed:
BUCKET_NAME=$(gcloud dataproc clusters describe $clusterName \
--region $clusterRegion \
| grep 'configBucket:' \
| sed 's/.* //')
You should be able to execute this from Scala; see this post for how to do it.
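As an illustration, a rough Scala equivalent of that shell pipeline using scala.sys.process (a sketch of my own; it assumes gcloud is installed and authenticated wherever the driver runs, and the cluster name and region are placeholders to fill in):
import scala.sys.process._

val clusterName   = "cluster-name" // replace with your cluster
val clusterRegion = "region"       // replace with your region

// Shell out to gcloud and capture its YAML output
val describeOutput: String =
  Seq("gcloud", "dataproc", "clusters", "describe", clusterName, "--region", clusterRegion).!!

// Mirror the grep/sed pipeline: find the configBucket line and take what follows the colon
val bucketName: Option[String] = describeOutput
  .split("\n")
  .find(_.trim.startsWith("configBucket:"))
  .map(_.split(":", 2)(1).trim)

println(s"configBucket = ${bucketName.getOrElse("<not found>")}")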

Related

Get the Dataproc cluster name from within the PySpark code

From within PySpark code running on a Dataproc cluster, is it possible to get the name of the Dataproc cluster where it is running?
The Dataproc cluster name is available as the VM metadata attribute attributes/dataproc-cluster-name. You can get it through:
CLI
/usr/share/google/get_metadata_value attributes/dataproc-cluster-name
HTTP
curl -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-cluster-name"
For regular clusters (non personal-auth), you can also infer the cluster name from the VM host name by stripping the trailing -m or -w suffix.
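For illustration, a small sketch of that hostname approach (it assumes the usual <cluster-name>-m / <cluster-name>-w-<N> VM naming convention, which is not a guaranteed API):
import java.net.InetAddress

// Dataproc VM names typically look like <cluster-name>-m, <cluster-name>-m-<N>, or <cluster-name>-w-<N>
val hostName    = InetAddress.getLocalHost.getHostName.split('.').head
val clusterName = hostName.replaceAll("-(m|w)(-\\d+)?$", "")
println(s"Inferred cluster name: $clusterName")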

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark but sometimes I need easier access to Scala/Java libraries such as directly creating a org.apache.hadoop.fs.FileSystem object inside of map functions. Is there any easy way to submit such "spark-shell" equivalent jobs directly from a command-line using Dataproc Jobs APIs?
At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Combining this with the --files flag available in gcloud dataproc jobs submit spark, you can write the snippets of Scala you may have tested in a spark-shell or notebook session and run them as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
Just like any other file, you can also specify any Hadoop-compatible path in the --files argument, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note that in all cases, the file becomes a local file in the working directory of the main driver job, so the argument to -i can just be the relative path to the filename.
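For reference, a job.scala along these lines might look like the sketch below. The REPL driver pre-defines spark and sc just as an interactive spark-shell does; the trailing sys.exit(0) is my own precaution so the driver terminates once the script finishes instead of waiting for interactive input:
// job.scala -- `spark` and `sc` are already initialized by the REPL driver
val rdd = sc.parallelize(1 to 1000)
println(s"sum = ${rdd.sum()}")

// Direct use of the Hadoop FileSystem API, as mentioned above
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
println(s"/tmp exists: ${fs.exists(new Path("/tmp"))}")

// Exit explicitly so the REPL driver shuts down when the script is done
sys.exit(0)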

Dataproc arguments not being read on spark submit

I am using Dataproc to submit Spark jobs. However, on spark-submit, non-Spark arguments are being read as Spark arguments!
I am receiving the error/warning below when running a particular job:
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know what's wrong with my current format. Thanks in advance.
spark-submit just silently ignores conf properties that do not start with spark.
That is why it says this property was ignored:
--properties dataproc:dataproc.conscrypt.provider.enable=false
Any property you pass to Spark should be of the form spark.<propertyname>.
This is just a warning.
Why this property is required:
The Conscrypt security provider has been temporarily changed from the
default to an optional security provider. This change was made due to
incompatibilities with some workloads. The Conscrypt provider will be
re-enabled as the default with the release of Cloud Dataproc 1.2 in
the future. In the meantime, you can re-enable the Conscrypt provider
when creating a cluster by specifying this Cloud Dataproc property:
--properties dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating the cluster, since it is a cluster property, not a Spark property (meaning the Spark framework can't understand it and simply ignores it).
Example usage:
gcloud beta dataproc clusters create my-test \
--project my-project \
--subnet prod-sub-1 \
--zone southamerica-east1-a \
--region=southamerica-east1 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 40 \
--num-workers 5 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--image-version 1.2 \
--tags internal,ssh,http-server,https-server \
--properties dataproc:dataproc.conscrypt.provider.enable=false \
--format=json \
--max-idle=10m
and then start the job like this:
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py \
--cluster=my-test \
--project=my-project \
--region=southamerica-east1 \
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar \
--format=json -- [JOB_ARGS]

Pass packages to pyspark running on dataproc from airflow?

We have an Airflow DAG that involves running a PySpark job on Dataproc. We need a JDBC driver during the job, which I'd normally pass to the dataproc submit command:
gcloud dataproc jobs submit pyspark \
--cluster my-cluster \
--properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
--py-files ...
But how can I do it with Airflow's DataProcPySparkOperator?
For now we're adding this library to the cluster itself:
gcloud dataproc clusters create my-cluster \
--region global \
--zone europe-west1-d \
...
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
...
This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?
I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.
See:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py

How to run cluster initialization script on GCP after creation of cluster

I have created a Google Dataproc cluster, but now need to install Presto on it for a new requirement. Presto is provided as an initialization action on Dataproc here; how can I run this initialization action after the cluster has been created?
Most init actions would probably run even after the cluster is created (though I haven't tried the Presto init action).
I like to run clusters describe to get the instance names, then run something like gcloud compute ssh <NODE> -- -T sudo bash -s < presto.sh for each node. Reference: How to use SSH to run a shell script on a remote machine?
Notes:
Everything after the -- is passed as arguments to the normal ssh command.
The -T means don't try to create an interactive session (otherwise you'll get a warning like "Pseudo-terminal will not be allocated because stdin is not a terminal.")
I use "sudo bash" because init actions scripts assume they're being run as root.
presto.sh must be a copy of the script on your local machine. You could alternatively ssh and gsutil cp gs://dataproc-initialization-actions/presto/presto.sh . && sudo bash presto.sh.
But @Kanji Hara is correct in general. Spinning up a new cluster is pretty fast/painless, so we advocate using initialization actions when creating a cluster.
You could use the --initialization-actions parameter.
Ex:
gcloud dataproc clusters create $CLUSTERNAME \
--project $PROJECT \
--num-workers $WORKERS \
--bucket $BUCKET \
--master-machine-type $VMMASTER \
--worker-machine-type $VMWORKER \
--initialization-actions \
gs://dataproc-initialization-actions/presto/presto.sh \
--scopes cloud-platform
Maybe this script can help you: https://github.com/kanjih-ciandt/script-dataproc-datalab