How to know when dataproc initialization actions are done - google-cloud-dataproc

I need to run a Dataproc cluster with both BigQuery and Cloud Storage connectors installed.
I use a variant of this script (because I have no access to the bucket used in the general one). Everything works fine, but when I run a job once the cluster is up and running, it always results in a Task was not acquired error.
I can fix this by simply restarting the Dataproc agent on every node, but I really need this to work properly so that I can run a job right after my cluster is created. It seems that this part of the script is not working properly:
# Restarts Dataproc Agent after successful initialization
# WARNING: this function relies on undocumented and not officially supported Dataproc Agent
# "sentinel" files to determine successful Agent initialization and is not guaranteed
# to work in the future. Use at your own risk!
restart_dataproc_agent() {
  # Because the Dataproc Agent should be restarted after initialization, we need to wait until
  # it creates a sentinel file that signals initialization completion (success or failure)
  while [[ ! -f /var/lib/google/dataproc/has_run_before ]]; do
    sleep 1
  done
  # If the Dataproc Agent didn't create a sentinel file that signals initialization
  # failure, then initialization succeeded and the Agent should be restarted
  if [[ ! -f /var/lib/google/dataproc/has_failed_before ]]; then
    service google-dataproc-agent restart
  fi
}
export -f restart_dataproc_agent
# Schedule asynchronous Dataproc Agent restart so it will use updated connectors.
# It cannot be restarted synchronously because the Dataproc Agent should be restarted
# only after its initialization, including init actions execution, has completed.
bash -c restart_dataproc_agent & disown
My questions here are:
How do I know that the initialization actions are done?
Do I have to restart the Dataproc agent on my newly created cluster's nodes, and if so, how do I do that properly?
EDIT:
Here is the command I use to create a cluster (using the 1.3 image version):
gcloud dataproc --region europe-west1 \
clusters create my-cluster \
--bucket my-bucket \
--subnet default \
--zone europe-west1-b \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 50 \
--num-workers 2 \
--worker-machine-type n1-standard-2 \
--worker-boot-disk-size 100 \
--image-version 1.3 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project my-project \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'gcs-connector-version=1.9.6' \
--metadata 'bigquery-connector-version=0.13.6'
Also, please note that the connectors initialization script has since been fixed and works fine now, so I am using it, but I still have to manually restart the Dataproc agent to be able to run a job.

The Dataproc agent logs a Custom initialization actions finished. message in the /var/log/google-dataproc-agent.0.log file after initialization actions succeed.
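For example, on a cluster node you could wait for that line with something like this (a minimal sketch, run over SSH on the node):
# Blocks until the agent reports that init actions have finished.
tail -F /var/log/google-dataproc-agent.0.log | grep -m 1 'Custom initialization actions finished.'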
No, you don't need to restart the Dataproc agent manually.
This issue is caused by the Dataproc agent service restart in the connectors initialization action and should be resolved by this PR.

As for knowing when the initialization actions are finished, you can check the cluster's status.state: if it is CREATING, the initialization actions are still being executed; once it is RUNNING, they are done.
Check here
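For example (a minimal sketch using the cluster name and region from the question), you can poll the cluster state from the command line:
# Prints the cluster's lifecycle state, e.g. CREATING, RUNNING or ERROR.
gcloud dataproc clusters describe my-cluster \
--region europe-west1 \
--format='value(status.state)'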

Related

Dataproc arguments not being read on spark submit

I am using Dataproc to submit jobs on Spark. However, on spark-submit, non-Spark arguments are being read as Spark arguments!
I am receiving the error/warning below when running a particular job.
Warning: Ignoring non-spark config property: dataproc:dataproc.conscrypt.provider.enable=false
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000
I would like to know what's wrong with my current format. Thanks in advance.
spark-submit silently ignores conf options that do not start with the spark. prefix, which is why it reports this property as ignored:
--properties dataproc:dataproc.conscrypt.provider.enable=false
Any property you pass to the job should be of the form spark.<propertyname>.
This is just a warning.
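In other words, drop the dataproc: property from the job submission; it belongs at cluster creation time, as explained below. A sketch of the corrected command, reusing the variables from the question:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-east1 \
--properties spark.executor.extraJavaOptions=$SPARK_CONF,spark.executor.memory=${MEMORY}G,spark.executor.cores=$total_cores \
--class com.sample.run \
--jars gs://jars/jobs.jar \
-- 1000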
Why this property is required:
The Conscrypt security provider has been temporarily changed from the
default to an optional security provider. This change was made due to
incompatibilities with some workloads. The Conscrypt provider will be
re-enabled as the default with the release of Cloud Dataproc 1.2 in
the future. In the meantime, you can re-enable the Conscrypt provider
when creating a cluster by specifying this Cloud Dataproc property:
--properties dataproc:dataproc.conscrypt.provider.enable=true
This has to be specified when creating the cluster, since it is a cluster property, not a Spark property (the Spark framework cannot understand it and simply ignores it).
Example usage:
gcloud beta dataproc clusters create my-test \
--project my-project \
--subnet prod-sub-1 \
--zone southamerica-east1-a \
--region=southamerica-east1 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 40 \
--num-workers 5 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 20 \
--image-version 1.2 \
--tags internal,ssh,http-server,https-server \
--properties dataproc:dataproc.conscrypt.provider.enable=false \
--format=json \
--max-idle=10m
and then start the job like this:
gcloud dataproc jobs submit pyspark gs://path-to-script/spark_full_job.py \
--cluster=my-test \
--project=my-project \
--region=southamerica-east1 \
--jars=gs://path-to-driver/mssql-jdbc-6.4.0.jre8.jar \
--format=json -- [JOB_ARGS]

How to create an SSH tunnel in gcloud, but keep getting API error

I am trying to set up Datalab from my Chromebook using the following tutorial: https://cloud.google.com/dataproc/docs/tutorials/dataproc-datalab. However, when trying to set up an SSH tunnel using the following guidelines, https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces#create_an_ssh_tunnel, I keep receiving the following error.
ERROR: (gcloud.compute.ssh) Could not fetch resource:
- Project 57800607318 is not found and cannot be used for API calls. If it is recently created, enable Compute Engine API by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=57800607318 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
The error message would lead me to believe my "Compute Engine API" is not enabled. However, I have double checked and "Compute Engine API" is enabled.
Here is what I am entering into the cloud shell
gcloud compute ssh ${test-cluster-m} \
--project=${datalab-test-229519} --zone=${us-west1-b} -- \
-4 -N -L ${8080}:${test-cluster-m}:${8080}
The ${} syntax is for accessing local environment variables. You set them in the previous step with:
export PROJECT=project;export HOSTNAME=hostname;export ZONE=zone;PORT=number
In this case that would be:
export PROJECT=datalab-test-229519;export HOSTNAME=test-cluster-m;export ZONE=us-west1-b;PORT=8080
Either try this:
gcloud compute ssh test-cluster-m \
--project datalab-test-229519 --zone us-west1-b -- \
-D 8080 -N
Or access the environment variables with:
gcloud compute ssh ${HOSTNAME} \
--project=${PROJECT} --zone=${ZONE} -- \
-D ${PORT} -N
Also check that the VM you are trying to access is running.
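For example (a sketch using the instance name, project and zone from the question), you can confirm the VM is up with:
# Prints the instance status; RUNNING means the VM is up.
gcloud compute instances describe test-cluster-m \
--project datalab-test-229519 \
--zone us-west1-b \
--format='value(status)'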

Gitlab Kubernetes executor runner execution job logs not visible

I configured a GitLab runner (version 11.4.2) to use the Kubernetes executor.
Here is my non-interactive register command:
gitlab-runner register \
--non-interactive \
--registration-token **** \
--url https://mygitlab.net/ \
--tls-ca-file /etc/gitlab-runner/certs/ca.crt \
--executor "kubernetes" \
--kubernetes-image-pull-secrets pull-internal \
--kubernetes-image-pull-secrets pull-external \
--name "kube-docker-runner" \
--tag-list "docker" \
--config "/etc/gitlab-runner/config.toml" \
--kubernetes-image "docker:latest" \
--kubernetes-helper-image "gitlab/gitlab-runner-helper:x86_64-latest" \
--output-limit 32768
It works fine and I can see the execution log in the GitLab UI.
In Kubernetes, I see the runner pod composed of 2 containers: helper and build. I expected to see the job execution logs by watching the build container's logs, but that is not the case. I would like to centralize these job execution logs with a tool like Fluent Bit by reading the container's stdout output.
If I start docker:latest alone (without the runner execution) in a pod deployed in the same Kubernetes cluster, I can see the logs on stdout. Any idea how to configure the stdout of the build container properly?
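For reference, this is the kind of check described above (a sketch; the pod name is a placeholder):
# Tail the build container of the pod the runner created for the job;
# in this setup it shows nothing even though the job log appears in the GitLab UI.
kubectl logs -f runner-xxxxx-project-1-concurrent-0 -c build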

How to run cluster initialization script on GCP after creation of cluster

I have created a Google Dataproc cluster, but now have a requirement to install Presto. Presto is provided as an initialization action on Dataproc here; how can I run this initialization action after the cluster has been created?
Most init actions would probably run even after the cluster is created (though I haven't tried the Presto init action).
I like to run clusters describe to get the instance names, then run something like gcloud compute ssh <NODE> -- -T sudo bash -s < presto.sh for each node (see the sketch after the notes below). Reference: How to use SSH to run a shell script on a remote machine?
Notes:
Everything after the -- is passed as arguments to the normal ssh command.
The -T means don't try to create an interactive session (otherwise you'll get a warning like "Pseudo-terminal will not be allocated because stdin is not a terminal.").
I use "sudo bash" because init action scripts assume they're being run as root.
presto.sh must be a copy of the script on your local machine. You could alternatively ssh in and run gsutil cp gs://dataproc-initialization-actions/presto/presto.sh . && sudo bash presto.sh.
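Putting those notes together, here is a minimal sketch (the cluster name, zone and number of workers are illustrative; by default Dataproc names instances <cluster>-m for the master and <cluster>-w-N for the workers):
CLUSTER=my-cluster
ZONE=us-central1-b
# Adjust the node list (or read it from clusters describe) to match your cluster.
for NODE in "${CLUSTER}-m" "${CLUSTER}-w-0" "${CLUSTER}-w-1"; do
  # -T: no pseudo-terminal; presto.sh is piped from the local machine and run as root.
  gcloud compute ssh "$NODE" --zone "$ZONE" -- -T sudo bash -s < presto.sh
done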
But @Kanji Hara is correct in general. Spinning up a new cluster is pretty fast/painless, so we advocate using initialization actions when creating a cluster.
You could use the --initialization-actions parameter.
Ex:
gcloud dataproc clusters create $CLUSTERNAME \
--project $PROJECT \
--num-workers $WORKERS \
--bucket $BUCKET \
--master-machine-type $VMMASTER \
--worker-machine-type $VMWORKER \
--initialization-actions \
gs://dataproc-initialization-actions/presto/presto.sh \
--scopes cloud-platform
Maybe this script can help you: https://github.com/kanjih-ciandt/script-dataproc-datalab

Google Dataproc Agent reports failure when using initialization script

I am trying to set up a cluster with an initialization script, but I get the following error:
[BAD JSON: JSON Parse error: Unexpected identifier "Google"]
In the log folder the init script output log is absent.
This seems rather strange, as it worked last week, and the error message does not seem related to the init script but rather to the input arguments for the cluster creation. I used the following command:
gcloud beta dataproc clusters create <clustername> --bucket <bucket> --zone <zone> --master-machine-type n1-standard-1 --master-boot-disk-size 10 --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 10 --project <projectname> --initialization-actions <gcs-uri of script>
Apparently changing
#!/bin/sh
to
#!/bin/bash
and removing all "sudo" occurrences did the trick.
This particular error occurs most often when the initialization script is in a Cloud Storage (GCS) bucket to which the project running the cluster does not have access.
I would recommend double-checking that the project being used for the cluster has read access to the bucket.
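One quick sanity check (a sketch; the bucket and script names below are placeholders) is to confirm the script can actually be read from GCS:
# If this fails with an AccessDenied error, fix the bucket/object permissions first.
# Note this checks your own credentials; the cluster's service account needs read access too.
gsutil cat gs://your-bucket/your-init-script.sh | head -n 1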