Get the Dataproc cluster name from within the PySpark code - pyspark

From within a pyspark code running on a dataproc cluster, is it possible to get the dataproc cluster name where it is running?

Dataproc cluster name is available as VM metadata attributes/dataproc-cluster-name. You can get it through
CLI
/usr/share/google/get_metadata_value attributes/dataproc-cluster-name
HTTP
curl -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-cluster-name"
For regular clusters (non personal-auth), you can also infer the cluster name from the VM host name, just remove the part after -m or -w.

Related

How to get gcloud dataproc create flag in a spark job?

I want to get flags used when creating a dataproc cluster in a spark job.
for example, I created my cluster using this command line:
gcloud dataproc clusters create cluster-name \
--region=region \
--bucket=bucket-name \
--temp-bucket=bucket-name \
other args ...
In my scala spark job I want to get the bucket name and other arguments how to do that, I know if I want to get the arguments of my job I must do that:
val sc = sparkSession.sparkContext
val conf_context=sc.getConf.getAll
conf_context.foreach(println)
Any help, please?
Thanks
Dataproc also publishes some attributes, including the bucket name, to GCE instance Metadata. You can also specify your own Metadata. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/metadata.
These will be available to you through the Metadata server. For example, if you want to read the bucket name, you can run
curl -s -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-bucket
You can use gcloud dataproc clusters describe shell command to get details about the cluster:
gcloud dataproc clusters describe $clusterName --region $clusterRegion
To get the bucket name from this command, you can use grep:
BUCKET_NAME=$(gcloud dataproc clusters describe $clusterName \
--region $clusterRegion \
| grep 'configBucket:' \
| sed 's/.* //')
You should be able to execute this from Scala, see this post for how to do.

Dataproc arguments not being read on spark submit

I am using dataproc to submit jobs on spark. However on spark-submit, non-spark arguments are being read as spark arguments!
I am receiving the error/warning below when running a particular job.
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know whats wrong with my current format. Thanks in advance.
spark-submit just silently ignored conf options that did not start with spark.
thats the reason for this property it was saying it was ignored.
--properties dataproc:dataproc.conscrypt.provider.enable=false
any property you should pass as spark.propertyname
this is just warning.
Why this property required :
The Conscrypt security provider has been temporarily changed from the
default to an optional security provider. This change was made due to
incompatibilities with some workloads. The Conscrypt provider will be
re-enabled as the default with the release of Cloud Dataproc 1.2 in
the future. In the meantime, you can re-enable the Conscrypt provider
when creating a cluster by specifying this Cloud Dataproc property:
--properties dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating cluster since this is cluster property, not property of spark. (means spark framework cant able to understand this and simply ignored.)
Example usage :
gcloud beta dataproc clusters create my-test
--project my-project
--subnet prod-sub-1
--zone southamerica-east1-a
--region=southamerica-east1
--master-machine-type n1-standard-4
--master-boot-disk-size 40
--num-workers 5
--worker-machine-type n1-standard-4
--worker-boot-disk-size 20
--image-version 1.2
--tags internal,ssh,http-server,https-server
--properties dataproc:dataproc.conscrypt.provider.enable=false
--format=json
--max-idle=10m
and then start job like this...
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py
--cluster=my-test
--project=my-project
--region=southamerica-east1
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar
--format=json -- [JOB_ARGS]

Using Cloud Shell to Access a Private Kubernetes Cluster in GCP

The following link https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters talks about the setting up of a private GKE cluster in a separate custom VPC. The Terraform code that creates the cluster and VPCs are available from https://github.com/rajtmana/gcp-terraform/blob/master/k8s-cluster/main.tf Cluster creation completed and I wanted to use some kubectl commands from the Google Cloud Shell. I used the following commands
$ gcloud container clusters get-credentials mservice-dev-cluster --region europe-west2
$ gcloud container clusters update mservice-dev-cluster \
> --region europe-west2 \
> --enable-master-authorized-networks \
> --master-authorized-networks "35.241.216.229/32"
Updating mservice-dev-cluster...done.
ERROR: (gcloud.container.clusters.update) Operation [<Operation
clusterConditions: []
detail: u'Patch failed'
$ gcloud container clusters update mservice-dev-cluster \
> --region europe-west2 \
> --enable-master-authorized-networks \
> --master-authorized-networks "172.17.0.2/32"
Updating mservice-dev-cluster...done.
Updated [https://container.googleapis.com/v1/projects/protean-
XXXX/zones/europe-west2/clusters/mservice-dev-cluster].
To inspect the contents of your cluster, go to:
https://console.cloud.google.com/kubernetes/workload_/gcloud/europe-
west2/mservice-dev-cluster?project=protean-XXXX
$ kubectl config current-context
gke_protean-XXXX_europe-west2_mservice-dev-cluster
$ kubectl get services
Unable to connect to the server: dial tcp 172.16.0.2:443: i/o timeout
When I give the public IP of the Cloud Shell, it says that public IP is not allowed with error message given above. If I give the internal IP of Cloud Shell starting with 172, the connection is timing out as well. Any thoughts? Appreciate the help.
Google suggest creating a VM within the same network as the cluster and then accessing that via SSH in the cloud shell and running kubectl commands from there:
https://cloud.google.com/solutions/creating-kubernetes-engine-private-clusters-with-net-proxies
Try to perform the following
gcloud container clusters get-credentials [CLUSTER_NAME]
And confirm that kubectl is using the right credentials:
gcloud auth application-default login

How to run cluster initialization script on GCP after creation of cluster

I have created a Google Dataproc cluster, but need to install presto as I now have a requirement. Presto is provided as an initialization action on Dataproc here, how can I run this initialization action after creation of the cluster.
Most init actions would probably run even after the cluster is created (though I haven't tried the Presto init action).
I like to run clusters describe to get the instance names, then run something like gcloud compute ssh <NODE> -- -T sudo bash -s < presto.sh for each node. Reference: How to use SSH to run a shell script on a remote machine?.
Notes:
Everything after the -- are args to the normal ssh command
The -T means don't try to create an interactive session (otherwise you'll get a warning like "Pseudo-terminal will not be allocated because stdin is not a terminal.")
I use "sudo bash" because init actions scripts assume they're being run as root.
presto.sh must be a copy of the script on your local machine. You could alternatively ssh and gsutil cp gs://dataproc-initialization-actions/presto/presto.sh . && sudo bash presto.sh.
But #Kanji Hara is correct in general. Spinning up a new cluster is pretty fast/painless, so we advocate using initialization actions when creating a cluster.
You could use initialization-actions parameter
Ex:
gcloud dataproc clusters create $CLUSTERNAME \
--project $PROJECT \
--num-workers $WORKERS \
--bucket $BUCKET \
--master-machine-type $VMMASTER \
--worker-machine-type $VMWORKER \
--initialization-actions \
gs://dataproc-initialization-actions/presto/presto.sh \
--scopes cloud-platform
Maybe this script can help you: https://github.com/kanjih-ciandt/script-dataproc-datalab

How to access Cloud SQL from dataproc?

I have a dataproc cluster and I'd like to have the cluster access a Cloud SQL instance. When I created the cluster I assigned scope --scopes sql-admin but after reading the Cloud SQL documentation it looks like I need to connect through a proxy. How can I configure this for access from dataproc?
UPDATE:
Until integration comes out of the box (#vadim's answer) I can get this working by using cloud proxy in my dataproc initialization script:
wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64
mv cloud_sql_proxy.linux.amd64 cloud_sql_proxy
chmod +x cloud_sql_proxy
nohup ./cloud_sql_proxy -dir=/cloudsql --instances=my-project:us-central1:mysql-instance=tcp:3307 > cloud_proxy_nohup.log &
(note: port 3306 is already in use so Im using 3307 here)
Using a VPC with a private IP between Cloud SQL and Dataproc seems to be a good option. A proxy should no longer be needed.
There's a pending pull request for a dataproc initialization action that will install the Cloud SQL Proxy on all the nodes in the cluster:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/pull/47/commits/ade93cc25d72c33e176840ddaa50671e5ed8ed4a